Correlation Matrix

Aryan
3 min read · Nov 11, 2021

Introduction

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data and often serves as an input to more advanced analyses.

A correlation matrix consists of rows and columns that represent the variables. Each cell in the table contains the correlation coefficient for the corresponding pair of variables.

A correlation matrix is useful for showing the correlation coefficients between variables at a glance. The correlation matrix is symmetric, because the correlation between a variable V1 and a variable V2 is the same as the correlation between V2 and V1. Also, the values on the diagonal are always equal to one, because a variable is always perfectly correlated with itself.
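A minimal sketch in R illustrates both properties; the toy data frame below, with arbitrary variable names V1 to V3 and random values, is made up purely for illustration:

# Toy data frame with three made-up numeric variables
set.seed(42)
toy <- data.frame(V1 = rnorm(10), V2 = rnorm(10), V3 = rnorm(10))

m <- cor(toy)

isSymmetric(m)  # TRUE: cor(V1, V2) equals cor(V2, V1)
diag(m)         # 1 1 1: each variable is perfectly correlated with itself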

Description

A correlation is a number between -1 and +1 that measures the degree of association between two attributes (call them X and Y). A positive value for the correlation implies a positive association: large values of X tend to be associated with large values of Y, and small values of X tend to be associated with small values of Y. A negative value for the correlation implies a negative or inverse association: large values of X tend to be associated with small values of Y, and vice versa.

Correlation coefficients are used to measure how strong a relationship is between two variables. There are several types of correlation coefficient, but the most popular is Pearson’s. Pearson’s correlation (also called Pearson’s R) is a correlation coefficient commonly used in linear regression. If you’re starting out in statistics, you’ll probably learn about Pearson’s R first. In fact, when anyone refers to the correlation coefficient, they are usually talking about Pearson’s.

· +1 indicates a perfect positive linear relationship.

· -1 indicates a perfect negative linear relationship.

· A result of zero indicates no linear relationship at all.
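To make the definition concrete, the following sketch computes Pearson's R directly from its formula on two short made-up vectors and compares the result with R's built-in cor function:

x <- c(1, 2, 3, 4, 5)   # made-up example data
y <- c(2, 4, 5, 4, 6)

# Pearson's R: sum of the products of the deviations from the means,
# divided by the square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual    # about 0.85
cor(x, y)   # the built-in function returns the same value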

Applications of Correlation Matrix

1. To summarize a large amount of data where the goal is to see patterns. For example, if all the variables in a dataset are highly correlated with each other, that pattern is immediately visible in the matrix.

2. To use as an input into other analyses. For example, people commonly use correlation matrices as inputs for exploratory factor analysis, confirmatory factor analysis, structural equation models, and linear regression when excluding missing values pairwise.

3. As a diagnostic when checking other analyses. For example, with linear regression, high correlations among the predictors suggest that the linear regression estimates will be unreliable, as sketched below.
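As a rough illustration of the diagnostic use, the snippet below builds a small made-up dataset in which one predictor nearly duplicates another, then flags every pair of variables whose absolute correlation exceeds an arbitrary 0.8 cutoff:

set.seed(1)                      # made-up data for illustration
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # x2 nearly duplicates x1
x3 <- rnorm(100)
cor_matrix <- cor(data.frame(x1, x2, x3))

# Report off-diagonal pairs whose absolute correlation exceeds the cutoff
which(abs(cor_matrix) > 0.8 & row(cor_matrix) < col(cor_matrix), arr.ind = TRUE)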

Implementation of Correlation Matrix in R

As an example, let's analyze data from a technology survey in which respondents were asked which devices they owned. We want to check whether there is a relationship between any of the devices owned by running a correlation matrix on the device ownership variables.

To do this in R, we first load the data into our session using the read.csv function:

mydata <- read.csv("https://wiki.qresearchsoftware.com/Owner.csv", header = TRUE)
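Since cor only accepts numeric input, it is worth a quick look at the loaded data first (this assumes each device column is coded numerically, for example 0/1 for ownership):

str(mydata)    # confirm every device column is numeric
head(mydata)   # peek at the first few rows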

The simplest and most straightforward way to run a correlation in R is with the cor function:

mydata.cor <- cor(mydata)
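If the survey contains missing responses, cor returns NA for the affected cells by default. One way to handle this, matching the pairwise exclusion of missing values mentioned above, is to pass the use argument and round the result for readability:

mydata.cor <- cor(mydata, use = "pairwise.complete.obs", method = "pearson")
round(mydata.cor, 2)   # round the coefficients for a more readable display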

Visualizing the correlation matrix

There are many packages available for visualizing a correlation matrix in R. One of the most common is the corrplot package. We first need to install it and load the library.

install.packages("corrplot")

library(corrplot)

Next, we'll run the corrplot function, providing the correlation matrix we just computed as the data input to the function:

corrplot(mydata.cor)

A default correlation matrix plot (called a correlogram) is generated. Positive correlations are displayed on a blue scale, while negative correlations are displayed on a red scale.
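corrplot also accepts several display arguments; for example, method = "number" prints the coefficients themselves, and type = "upper" with order = "hclust" shows only the upper triangle with similar variables grouped together:

corrplot(mydata.cor, method = "number")                  # show the coefficients instead of circles
corrplot(mydata.cor, type = "upper", order = "hclust")   # upper triangle, hierarchically clustered ordering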

Conclusion

Correlation and covariance are very closely related to each other, and yet there are important differences between them.

When choosing between covariance and correlation, correlation tends to be the first choice because it is unaffected by changes in location and scale, which makes it possible to compare pairs of variables measured in different units. Since it is limited to a range of -1 to +1, it is useful for drawing comparisons between variables across domains. An important limitation, however, is that both concepts measure only linear relationships.
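A short sketch with made-up data illustrates the scale point: rescaling a variable changes the covariance but leaves the correlation untouched.

set.seed(7)          # made-up data for illustration
x <- rnorm(50)
y <- 2 * x + rnorm(50)

cov(x, y)            # depends on the scale of x and y
cov(x * 100, y)      # 100 times larger: covariance is scale-dependent
cor(x, y)            # always between -1 and +1
cor(x * 100, y)      # unchanged: correlation is scale-invariant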
