How to Test Variables Correlation in Data Science?

Eqibuana
7 min read · Jul 7, 2021


INTRODUCTION

First of all, you have to understand the difference between categorical and continuous variables/features/values in data science. A categorical variable (also called a qualitative variable) takes its values from a fixed set of groups, e.g. gender (male or female), smoking status (smoking or non-smoking customers), and so on, while a continuous variable is a numerical value that can take any value between its minimum and maximum (such as age, salary, etc.).

Once the difference between these two variable types is clear, we will go through the approaches discussed below for the categorical-categorical, categorical-continuous, continuous-categorical, and continuous-continuous variable relationships.

picture 1. Correlation Approach for Categorical and Continuous Variables

PEARSON’S CORRELATION COEFFICIENT

Pearson’s correlation coefficient (PCC) is a measure of linear correlation between two data sets. The formula of PCC is the covariance of the two variables divided by the product of their standard deviations, as shown below, so it is essentially a normalized measure of covariance, and the result lies in the range of -1 (perfect negative correlation) to +1 (perfect positive correlation).

picture 2. Pearson’s Correlation Coefficient Formula

Pearson’s correlation coefficient is calculated easily in Python (a minimal sketch is given after picture 3). How the r value represents the shape of the curve is shown below:

picture 3. Pearson Correlation Value (r) Represents The Shape of The Curve
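As a minimal sketch, assuming two illustrative numeric arrays (not the article's data), r can be computed with scipy and verified by hand against the covariance formula above:

```python
# Minimal sketch of computing Pearson's r on illustrative data
import numpy as np
from scipy import stats

x = np.array([23, 25, 31, 35, 40, 44, 52, 60])  # e.g. age
y = np.array([28, 31, 35, 40, 46, 49, 58, 64])  # e.g. salary (in $1000s)

# scipy returns the coefficient and its two-sided p-value
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")

# The same r by hand: covariance divided by the product of the
# standard deviations (population form, ddof=0 on both sides)
r_manual = np.cov(x, y, ddof=0)[0, 1] / (np.std(x) * np.std(y))
print(f"manual r = {r_manual:.3f}")
```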

We can classify the strength of linear correlation (r) as follows (a small helper encoding this scale is sketched after the list):
- very strong: |r| > 0.8
- strong: 0.6 < |r| ≤ 0.8
- moderate: 0.4 < |r| ≤ 0.6
- weak: 0.2 < |r| ≤ 0.4
- very weak: 0 < |r| ≤ 0.2
- no relationship: r = 0
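To make the scale concrete, here is a small helper that encodes it (the thresholds follow the list above; other sources draw the bands slightly differently):

```python
def correlation_strength(r: float) -> str:
    """Map |r| to the strength labels listed above (thresholds vary by source)."""
    a = abs(r)
    if a == 0:
        return "no relationship"
    if a > 0.8:
        return "very strong"
    if a > 0.6:
        return "strong"
    if a > 0.4:
        return "moderate"
    if a > 0.2:
        return "weak"
    return "very weak"

print(correlation_strength(0.75))   # strong
print(correlation_strength(-0.15))  # very weak
```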

Some requirements need to be met before calculating Pearson’s correlation coefficient:
1. The scale of measurement should be interval or ratio
2. The variables should be approximately normally distributed
3. The relationship between the variables should be linear
4. There should be no outliers in the data set

ANALYSIS OF VARIANCE (ANOVA)

Analysis of Variance (ANOVA) is a statistical test used to compare the means of two or more groups. ANOVA is suitable for situations in which one variable is categorical and the other is quantitative. If both variables are continuous, we can transform one of them into a categorical variable by binning its values into categories. The independent variable should be the categorical one, while the dependent variable stays continuous.
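As a minimal sketch of this binning step, assuming an illustrative age column (the column name and bin edges are assumptions), pandas’ cut function does the transformation:

```python
# Bin a continuous variable into categories with pandas (illustrative data)
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 47, 51, 63, 70]})
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])
print(df)
```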

After the variables have been transformed and are ready to be analyzed, we can use the statsmodels library to carry out an ANOVA test on the selected features (a sketch follows picture 4).

picture 4. ANOVA test
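The terms interpreted below (partner_status and fcategory) match the Moore dataset example from the statsmodels documentation, so a sketch along those lines is shown here; the exact code in picture 4 may differ:

```python
# Two-way ANOVA with statsmodels, assuming the Moore dataset from
# carData (downloaded on first use); this mirrors the statsmodels
# docs example and the terms interpreted below
import statsmodels.api as sm
from statsmodels.formula.api import ols

moore = sm.datasets.get_rdataset("Moore", "carData", cache=True).data
moore = moore.rename(columns={"partner.status": "partner_status"})

# Fit an OLS model with sum-coded factors, then run a type-II ANOVA
model = ols("conformity ~ C(fcategory, Sum) * C(partner_status, Sum)",
            data=moore).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
```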

ANOVA hypotheses:
- Null hypothesis: the group means are equal (no variation in means across groups), i.e. no significant difference amongst the groups
H0: μ1=μ2=…=μp
- Alternative hypothesis: at least one group mean is different from the others, i.e. there is a significant difference amongst the groups
H1: not all μ are equal

How ANOVA works (a worked sketch follows this list):
- Check sample sizes: ideally an equal number of observations in each group
- Calculate the Mean Square for the groups (MS): SS of groups / (levels - 1), where levels - 1 is the degrees of freedom (df) for the groups
- Calculate the Mean Square Error (MSE): SS error / df of residuals
- Calculate the F value: MS of groups divided by MSE
- Calculate the p-value based on the F value and the degrees of freedom (df)
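Here is a worked sketch of these steps on made-up group data (the numbers are purely illustrative), with scipy’s f_oneway as a cross-check:

```python
# One-way ANOVA by hand on illustrative groups, following the steps above
import numpy as np
from scipy import stats

groups = [np.array([23., 25., 21., 27.]),
          np.array([30., 33., 29., 35.]),
          np.array([26., 24., 28., 25.])]

k = len(groups)                    # number of levels
n = sum(len(g) for g in groups)    # total observations
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)  # MS of groups, df = levels - 1
mse = ss_within / (n - k)          # mean square error, df = n - k
f_value = ms_between / mse
p_value = stats.f.sf(f_value, k - 1, n - k)
print(f"F = {f_value:.3f}, p = {p_value:.4f}")

# Cross-check with scipy
print(stats.f_oneway(*groups))
```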

Here is how to interpret the ANOVA. If the p-value obtained from the ANOVA analysis is significant (p < 0.05), we conclude that there are significant differences among treatments. In the ANOVA above, C(partner_status, Sum) and C(fcategory, Sum):C(partner_status, Sum) have p-values < 0.05, so we can conclude that there is a significant difference for these two terms; in other words, the null hypothesis is rejected and the alternative hypothesis is supported. Note that the F value is inversely related to the p-value: a higher F value (greater than the F critical value) indicates a significant p-value.

There are certain assumptions we need to make before performing ANOVA:
1. The observations are obtained independently and randomly from the population defined by the factor levels
2. The data for each factor level is normally distributed
3. Independence of cases: the sample cases should be independent of each other
4. Homogeneity of variance: Homogeneity means that the variance among the groups should be approximately equal

The assumption of homogeneity of variance can be tested using tests such as Levene’s test or the Brown-Forsythe test, while the normality of the distribution can be tested using a histogram, the values of skewness and kurtosis, or a test such as Shapiro-Wilk (a sketch follows below). The assumption of independence can be determined from the design of the study.
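A minimal sketch of these checks, reusing the illustrative groups from the sketch above:

```python
# Assumption checks: Levene's test for equal variances,
# Shapiro-Wilk for normality (illustrative data)
import numpy as np
from scipy import stats

g1 = np.array([23., 25., 21., 27.])
g2 = np.array([30., 33., 29., 35.])
g3 = np.array([26., 24., 28., 25.])

# scipy's default center='median' is the Brown-Forsythe variant;
# pass center='mean' for the classic Levene test
stat, p = stats.levene(g1, g2, g3)
print(f"Levene: stat = {stat:.3f}, p = {p:.4f}")   # H0: equal variances

for i, g in enumerate([g1, g2, g3], start=1):
    stat, p = stats.shapiro(g)                     # H0: data are normal
    print(f"Shapiro-Wilk group {i}: stat = {stat:.3f}, p = {p:.4f}")
```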

ANOVA is classified into 3 types:
1. One-way ANOVA: has just one independent variable
e.g. differences in corona cases can be assessed by country (2 or more categories to compare)
2. Two-way ANOVA (also called factorial ANOVA): uses 2 independent variables
e.g. examine differences in corona cases (dependent variable/y) by age group (independent variable 1) and gender (independent variable 2)
3. N-way ANOVA: uses more than two independent variables (not to be confused with MANOVA, which involves more than one dependent variable)
e.g. potential differences in corona cases can be examined by country, gender, age group, and ethnicity at the same time.

CHI-SQUARED TEST

The chi-squared test is a statistical test of the difference between observed and expected values, used for classification problems where the input variables (independent variables) are also categorical. While conducting a chi-squared test we need to state 2 hypotheses:
1. H0 (null hypothesis): the 2 compared variables are independent
2. H1 (alternative hypothesis): the 2 variables are dependent

If the p-value obtained is less than 0.05, we reject H0 and accept H1; if the p-value is higher than 0.05, we fail to reject H0.

picture 5. Chi-squared Test

As seen in picture 5, we use 2 categorical variables, gender (male or female) and approve_loan (yes or no); the p-value is higher than 0.05, so we can conclude that there is no correlation between these two categorical variables. A minimal sketch of the test is given below.
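The sketch assumes an illustrative gender x approve_loan contingency table (the counts are made up, chosen so the p-value comes out non-significant as in the article’s example):

```python
# Chi-squared test of independence on an illustrative contingency table
import pandas as pd
from scipy import stats

table = pd.DataFrame({"yes": [42, 38], "no": [18, 22]},
                     index=["male", "female"])  # rows: gender

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
# For these counts p > 0.05, so we fail to reject H0 (independence)
```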

There are two types of chi-square test:
1. The chi-square test for goodness of fit, which compares the expected and observed values to determine how well the predictions fit the data
2. The chi-square test for independence, which compares two sets of categories to determine whether the two groups are distributed differently among the categories

When we are researching the chi-square test, we often run into the phrase “statistical significance”. What does this phrase actually mean? Statistical significance simply refers to the probability of being wrong when stating that a relationship exists when in fact it does not; in other words, how often we would incorrectly reject the null hypothesis when it is in fact true.

All statistical tests follow the same 5 steps of hypothesis testing:
1. State the null hypothesis
2. Choose a statistical test
3. Calculate the test statistic (t.s) and evaluate test assumptions
4. Look up the critical value (c.v) of the test
5. Draw the conclusion (a sketch of steps 4 and 5 follows this list):
If |t.s.| < c.v., do not reject the null hypothesis
If |t.s.| ≥ c.v., reject the null hypothesis
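A short sketch of steps 4 and 5, assuming an illustrative chi-squared statistic with one degree of freedom:

```python
# Compare a test statistic against the critical value at alpha = 0.05
from scipy import stats

test_statistic = 0.34  # illustrative value
critical_value = stats.chi2.ppf(1 - 0.05, df=1)  # ~3.841
print(f"critical value = {critical_value:.3f}")

if abs(test_statistic) < critical_value:
    print("do not reject the null hypothesis")
else:
    print("reject the null hypothesis")
```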

Three chi-square assumptions:
1. The variables must be categorical
2. The observations are independent
3. All cells must have a minimum of five expected observations. When this condition is not met, it is usually because the contingency table contains a large number of rows and columns relative to the number of observations. To counter this problem, we can simply redefine the data categories (combine adjacent rows or columns) to create a smaller number of cells

LOGISTIC REGRESSION

Logistic regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. In binary classification, the dependent variable contains 1 (yes, success, etc.) and 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.

We have several logistic regression assumptions as follows:
- Binary logistic regression requires the dependent variable (or y) to be binary
- For a binary regression, factor level 1 of the dependent variable should represent the desired outcome
- Only the meaningful variables should be included
- The independent variables should be independent of each other. The model should have little or no multicollinearity
- The independent variables are linearly related to the log odds
- Logistic regression requires quite large sample sizes

picture 6. Diabetes Dataset
picture 7. Logistic Regression Multivariate Model
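A sketch of fitting such a multivariate model with statsmodels is given below; the file name diabetes.csv and the exact column list are assumptions consistent with the Pima Indians diabetes dataset referenced in picture 6:

```python
# Multivariate logistic regression with statsmodels, assuming the
# Pima Indians diabetes data saved as 'diabetes.csv' (file name and
# column names are assumptions)
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("diabetes.csv")
features = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

X = sm.add_constant(df[features])  # add the intercept term
y = df["Outcome"]                  # 1 = diabetic, 0 = not diabetic

model = sm.Logit(y, X).fit()
print(model.summary())             # check each predictor's p-value
```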

As seen above in the model summary, SkinThickness, Insulin, and Age have p-values higher than 0.05, so we can conclude that these variables seem to be insignificant predictors, whereas the other variables appear significant.

Logistic regression is basically divided into 3 groups:
1. Binary logistic regression: the target variable has only 2 possible outcomes, e.g. yes or no, spam or not spam, etc.
2. Multinomial logistic regression: the target variable has three or more nominal categories, such as predicting the type of virus
3. Ordinal logistic regression: the target variable has three or more ordinal outcomes, such as a product rating from 1 to 5

CONCLUSION
In this guide, we have learned how to test the correlation between categorical-categorical, categorical-continuous, continuous-categorical, and continuous-continuous variables in machine learning models, both univariate and multivariate, using hypothesis tests and Pearson correlation with the statsmodels and scipy libraries. You will get familiar with these approaches by practicing more with your own data, and you will be able to test for significant relationships between dependent and independent variables by yourself.

Eqi Buana is on LinkedIn.
