In many real-life data science projects, we find complex and unknown relationships between the variables in our data. The situation becomes more difficult when domain knowledge is limited. Knowing these relationships helps us understand the data better and build better-performing machine learning models.
In one of our earlier articles, we saw that the performance of algorithms such as linear regression degrades when interdependencies exist between variables. In this article, we will learn about correlation and how we can calculate its strength. The article follows the table of contents given below.
Table of contents
- What is Correlation?
- Covariance
- Pearson’s Correlation Coefficient
- Spearman’s Correlation Coefficient
Let’s start with understanding correlation.
What is Correlation?
Variables can be related for many reasons: one variable may generate the values of another, the two may simply be associated, or both may depend on some third variable.
Finding and understanding relationships between variables is one of the most important factors in good data analysis and modelling, and correlation can be thought of as the statistical relationship between two variables.
There can be three types of correlation:
- Positive correlation: When variables are changing in the same direction due to their statistical relationship.
- Negative correlation: When variables are changing in the opposite direction due to their statistical relationship.
- Neutral or zero correlation: When variables are not affected by each other.
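The three types above can be illustrated with a small sketch. This is a hypothetical example of our own (the variable names and seed are assumptions, not part of the dataset used later in the article): we build one variable that moves with a base variable, one that moves against it, and one that is independent of it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

pos = x + 0.1 * rng.normal(size=1000)   # moves with x: positive correlation
neg = -x + 0.1 * rng.normal(size=1000)  # moves against x: negative correlation
zero = rng.normal(size=1000)            # independent of x: roughly zero correlation

print('positive: %.2f' % np.corrcoef(x, pos)[0, 1])
print('negative: %.2f' % np.corrcoef(x, neg)[0, 1])
print('neutral:  %.2f' % np.corrcoef(x, zero)[0, 1])
```

The first coefficient comes out close to +1, the second close to -1, and the third close to 0.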
Many machine learning algorithms show degradation in their performance because two or more variables are highly correlated. This situation is called multicollinearity. As discussed above, when modelling data using linear regression algorithms, we are required to remove one of the offending correlated variables to improve the algorithm’s performance.
There are various situations in which we may know something about a relationship, and others in which we have no idea whether one exists. In such situations, we look for the following three things:
- Type of relationship
- Distribution of data and variables.
- Different correlation scores
In the following sections, we will learn about two scores: one that we use when the data has a Gaussian distribution and a linear relationship, and a second that works when the data has a monotonic (increasing or decreasing) relationship.
The dataset
Here we need a dataset that holds relationships between its variables. For that purpose, we have created two synthetic variables using the NumPy library and the lines of code below.
import numpy as np
var_1 = 20 * np.random.randn(1000) + 100
var_2 = var_1 + (10 * np.random.randn(1000) + 50)
In the above code, we generated two variables. In the first variable, random numbers are generated with a standard deviation of roughly 20 and a mean of approximately 100. The second variable is the first variable plus added noise. Let’s plot the data to see how these two variables correlate.
import matplotlib.pyplot as plt

print('variable_1: mean=%.3f stdv=%.3f' % (np.mean(var_1), np.std(var_1)))
print('variable_2: mean=%.3f stdv=%.3f' % (np.mean(var_2), np.std(var_2)))

plt.scatter(var_1, var_2)
plt.show()
Output:
Here we can see that our generated variables are positively correlated. Now let’s discuss a critical concept of this section called covariance.
Covariance
Covariance is also a measure of the relationship between two variables: it measures how much the variables change together. One should not confuse covariance with correlation; covariance measures the joint variability of two variables and depends on their scales, whereas correlation is a normalized measure of the strength of their relationship.
Using the below formula, we can calculate the covariance:
covariance(X, Y) = sum((x - mean(X)) * (y - mean(Y))) / (n - 1)
In the above formula, the means of the variables are used, which suggests that the measure is most meaningful when each variable has a roughly Gaussian-like distribution.
If the calculated covariance is negative, there is a negative correlation; if it is positive, there is a positive correlation, meaning the variables change in the same direction.
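The formula above can be checked directly against NumPy. This is a minimal sketch: it regenerates two variables the same way as earlier in the article (the fixed seed is an assumption of ours, so the exact numbers will differ from the plots above), computes the sample covariance by hand, and compares it with the off-diagonal entry of `np.cov`.

```python
import numpy as np

# assumed seed for reproducibility; data generated as in the article
rng = np.random.default_rng(42)
var_1 = 20 * rng.standard_normal(1000) + 100
var_2 = var_1 + (10 * rng.standard_normal(1000) + 50)

# sample covariance: sum((x - mean(X)) * (y - mean(Y))) / (n - 1)
n = len(var_1)
manual_cov = np.sum((var_1 - var_1.mean()) * (var_2 - var_2.mean())) / (n - 1)

print(manual_cov)
print(np.cov(var_1, var_2)[0, 1])  # NumPy's off-diagonal entry matches
```

Both printed values agree, because `np.cov` uses the same `n - 1` (sample) denominator by default.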
Using NumPy, we can calculate the covariance in Python.
import numpy as np
covariance = np.cov(var_1, var_2)
print(covariance)
Output:
Note: np.cov returns the covariance matrix below, in which the off-diagonal entries hold the covariance between the two variables.
Cov(X, X)  Cov(X, Y)
Cov(Y, X)  Cov(Y, Y)
In the above output, we can see that we got a covariance matrix in which all the values are positive. This shows that our variables have a positive correlation with each other. Covariance is a reasonable way to describe the relationship between variables if the following conditions are satisfied:
- Data points follow the Gaussian distribution.
- Variables are linearly correlated.
Relying on a single tool to describe the relationship between variables is not always trustworthy, and this leads us to Pearson’s correlation.
Pearson’s correlation
This method is named after Karl Pearson and summarizes the strength of the linear relationship between two variables.
Mathematically, it is calculated by finding the covariance between the two variables and dividing by the product of their standard deviations, as given below:
Pearson’s correlation coefficient = covariance(X, Y) / (std(X) * std(Y))
We can think of it as a normalized form of covariance, and like covariance it assumes a Gaussian distribution. The calculation returns a value between -1 and 1, which can be interpreted to understand the relationship.
Negative values represent a negative correlation, and positive values represent a positive correlation. Values of this coefficient below -0.5 or above +0.5 are often interpreted as a high or notable correlation.
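The normalization can be verified by computing the coefficient from its definition. This is a sketch under the same assumed seed as before (our own choice, not from the original dataset); note that `ddof=1` keeps the standard deviations consistent with `np.cov`, although the factor cancels when both numerator and denominator use the same convention.

```python
import numpy as np

# assumed seed for reproducibility; data generated as in the article
rng = np.random.default_rng(42)
var_1 = 20 * rng.standard_normal(1000) + 100
var_2 = var_1 + (10 * rng.standard_normal(1000) + 50)

# Pearson's r = covariance(X, Y) / (std(X) * std(Y))
cov_xy = np.cov(var_1, var_2)[0, 1]
r = cov_xy / (np.std(var_1, ddof=1) * np.std(var_2, ddof=1))

print(r)
print(np.corrcoef(var_1, var_2)[0, 1])  # matches NumPy's built-in
```

Both lines print the same value, confirming that `np.corrcoef` is just the covariance divided through by the standard deviations.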
Using NumPy, we can calculate Pearson’s correlation.
import numpy as np
PCC = np.corrcoef(var_1, var_2)
print(PCC)
Output:
Here we can see that, as discussed above, these two variables are strongly correlated.
Spearman’s Correlation
Above, we discussed two methods of measuring the correlation between variables, both of which assume a linear relationship. Unlike those two methods, Spearman’s correlation does not assume that the data has a Gaussian distribution.
However, we can also use this method to summarize the strength of the relationship between two Gaussian-distributed or linearly related variables, although it will have less statistical power in that case.
To calculate the relationship, this method uses the relative rank of the values in each sample. It is often used in non-parametric statistics. The formula below is used to calculate Spearman’s correlation.
Spearman’s correlation coefficient = covariance(rank(X), rank(Y)) / (std(rank(X)) * std(rank(Y)))
We can use this method when we are not sure about the underlying distribution and relationship.
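The formula above says that Spearman’s coefficient is simply Pearson’s coefficient computed on the ranks. A minimal sketch of that equivalence, assuming SciPy is available and using the same assumed seed as earlier (our own choice):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# assumed seed for reproducibility; data generated as in the article
rng = np.random.default_rng(42)
var_1 = 20 * rng.standard_normal(1000) + 100
var_2 = var_1 + (10 * rng.standard_normal(1000) + 50)

# Spearman's rho is Pearson's r computed on the ranks of the values
rho_manual = np.corrcoef(rankdata(var_1), rankdata(var_2))[0, 1]
rho_scipy, _ = spearmanr(var_1, var_2)

print(rho_manual)
print(rho_scipy)
```

Both values agree, which is exactly what the rank-based formula predicts.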
We can use pandas to calculate Spearman’s correlation coefficient.
# merging the arrays into a pandas DataFrame
import pandas as pd
data = pd.DataFrame({'var_1': var_1, 'var_2': var_2})

# calculating Spearman's correlation coefficient
SCC = data.corr(method='spearman')
print(SCC)
Output:
Here we can see that the values of this coefficient are 1 or slightly below 1. The value of this coefficient also varies between -1 and 1, and it is interpreted similarly to Pearson’s correlation coefficient.
Final words
In this article, we have discussed what correlation between data variables means and the coefficients we can use to interpret the strength of the relationship between variables in our data. As discussed above, knowing the correlation between variables is crucial before modelling the data. It is also a crucial part of data analysis, where we need to quantify the strength of relationships in order to make predictions.