“Politicians use statistics in the same way that a drunk uses lamp-posts for support rather than illumination” -A. Lang

Statistics is a very complex subject and because of its complexity statistics and its terms are frequently misunderstood and misused. I came across an excellent article on The datapine Blog about how statistics can be misleading.  One of the statistical terms “correlation” is a term that may be used improperly.

Many people use the term “correlation” when they mean the more general term “association”.

What is the difference?

Association is a vague, general term that describes the relationship between the two data sets or two variables.

Correlation on the other hand measures the strength and direction of the linear relationship between the data sets or a pair of variables.

Three important conditions should be checked before calculating the correlation:

  1. The association between the two data sets has to be linear or in other words, the association between the data sets looks like a straight line.
  2. Data sets have to be quantitative.
  3. The outliers (unusual data) in the data set need to be checked and eliminated if possible because they can skew the results.

The best way to look at the association between the two data sets is to make a Scatter plot.

Let us look at the two Scatter diagrams which were used to graph the following data set:

Figure 1: Scatter Plot in Excel

Keeping in mind the 3 conditions that should be checked, we can see that all the variables are quantitative and that there is an association between the cost of Big Mac and Movie Tickets in the counties listed in the table.

We also need to check for the third condition; the presence of the outliers.

Interestingly, Russia has the lowest price of the Big Mac and slightly below the average price of the Movie Ticket.  A discrepancy like that is unique to this data set wherein the all-other countries the higher price of the Big Mac corresponds to higher price of the Movie Tickets and vice versa. Because of this discrepancy we should treat the data from Russia as an outlier (marked in red in Figure 1).  It is common practice to eliminate the outlier(s) when we understand its genesis and make another Scatter diagram without it (Figure 2). Now let’s see how our Scatter diagram looks if we exclude Russia from our data set.

Figure 2: Scatter plot with the linear relationship and squared r value in Excel.

The pattern in the second graph (Figure 2) looks more consistent after we removed the outlier. Let’s look at the direction of the association. As we said before, countries with a higher price of BigMac (first data set) also have a higher price of  Movie Tickets (second data set).  We can see on the Scatter plot that the association between both variables has a pattern running from the lower left-hand corner to the upper right corner of the plot.  This pattern has a positive slope.  Thanks to Excel’s capability we can also plot a straight line relationship with the squared value of Pearson’s correlation coefficient.

The other data sets that we are going to look at have different behaviour. Let’s look at the patterns in the two scatter diagrams below: Figure 3 and Figure 4.

Figure 3: Scatter plot with the linear relationship and squared r value in Excel.

The pattern in Figure 3 runs from the upper left to the lower right corner of the Scatter plot.  This pattern has a negative slope. In this case with the increase of temperature (first variable) the average heating bill (second variable) decreases.

It is also very important to look for the form in the scatter plot.  Is it straight, curved or doesn’t have any pattern at all?  Both examples; Big Mac Cost vs Movie Ticket Cost and Edmonton’s Average Heating Bill vs Average Temperature have definitely a straight-line relationship.  So far so good.  Let’s look at the Scatter diagram of the Total $ Value of all the homes sold in Alberta every month between January 2016 and December 2020 versus Bank of Canada (BoC) Interest Rates (Figure 4).

Figure 4: Scatter plot with the linear relationship and squared r value in Excel.

You would assume that the lower interest rates for the mortgage, the more houses would be purchased.  Apparently, this is not the case in Alberta.  Looking at Figure 4 we can barely see any trend or pattern.

Since we know now what to look for in Scatter diagrams it would be interesting to find out more about the linear relationships between the data sets.

A measure of the strength of the relationship between the two data sets is called a sample correlation coefficient or Pearson’s correlation coefficient.  It also measures the direction of this relationship.

Mathematically, Pearson’s correlation coefficient is computed by dividing the sample covariance by the product of both data sets’ standard deviations. By doing it we standardize the correlation values whereas the covariance depends on the units of measurement for both data sets.

The correlation coefficient measures the strength and direction of the linear relationship between the two data sets and its values are scaled down to a limited range of -1 to +1.

A positive correlation (+) means that when the values in one data set increase, the values in the other data set increase too (Figure 2).

A negative correlation (–) means that when the values in one data set increase, the values in the other data set decrease (Figure 3).

If all the values in the Scatter diagram fall on the positively sloped straight line, the value of the correlation coefficient is +1 which is a perfect positive relationship between the two data sets.

On the other hand, if all the values in the Scatter diagram fall on the negatively sloped straight line, the value of the correlation coefficient is -1 which is a perfect negative relationship between the two data sets.

Let’s go back to our examples and calculate Pearson’s correlation coefficient using Excel’s Analysis ToolPack.

Figure 5: Correlation coefficient r calculated using Analysis ToolPack in Excel

We established earlier that there is a linear relationship between the Big Mac and Movie Ticket costs.  Let’s use the Correlation Command in Excel’s Analysis ToolPack to see what is the correlation coefficient in our example.  Here are the links to How to Enable Analysis ToolPack and How to use the Correlation Command in Analysis ToolPack if you need to refresh Excel’s Data Analysis ToolPack. The Pearson’s correlation coefficient (r) in our example is equal to 0.85 (Figure 5). We can also read the value of r from the Scatter plots by taking the square root from R^{2} calculated by Excel. It means that there is a strong positive relationship (r= +0.85 is relatively close to the perfect positive relationship which is equal to +1) between the cost of Big Mac and the cost of Movie Ticket.

The Scatter diagram for the second data set (Figure 3) where we compared Edmonton’s average heating bill to the average monthly temperatures has a negative slope and a straight-line relationship.  Let’s check the correlation coefficient for those data sets.

Figure 6: Correlation coefficient r calculated using Analysis ToolPack in Excel

The correlation coefficient in Figure 6 shows us an almost perfect negative relationship between the average heating bill and the temperature or in other words the higher temperature the lower the heating bill. That makes lots of sense, doesn’t it? The last Scatter diagram in Figure 4 shows BoC interest rates versus the value of houses sold in Alberta. We already established that there is no trend or pattern for this data set. Our correlation coefficient value in Figure 7 seems to confirm what we saw on the Scatter diagram.  The coefficient’s value of -0.297 is close to 0 (zero) which indicates that there is almost no linear relationship however the negative sign indicates that if the BoC interest rates go up, the value of all homes sold in Alberta goes down.  Overall, we can say that between January 2016 and December 2020 Albertans have been buying houses without being very concerned about the BoC interest rates.

Figure 7: Correlation coefficient r calculated using Analysis ToolPack in Excel

Statisticians like to calculate the Pearson’s correlation coefficient for the data sets because the sign and value of r help them to predict how the change in one data set will affect the change in the other data set.