Correlations in Data


Undoubtedly the biggest arguments in the social and natural sciences are about correlations. The rift between social scientists and physical scientists is largely driven by disagreements over how rigorously correlations have been derived.

Yet correlation analysis is arguably the single most important thing that one does with a data set. Such an analysis can help define trends, make predictions, and uncover root causes of certain phenomena. But to do this properly, one must examine and test how good the correlation actually is.

For our purposes, the most critical parameter is the scatter, or dispersion, around the fit. We will deal with this issue explicitly below.

An Example of a "False Correlation":

An Example of a correlation that has a lot of "scatter" around it:

An Example of outliers at the data extremes:

An Example of uneven data sampling in X:


An Example of uneven scatter as a function of X:

However, one has to be extremely careful about the form of the data and whether or not a linear function is the best approximation. As an example, consider the following data set: the time evolution of the world record in the 100 meter dash:
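The danger of assuming a straight line can be sketched numerically. The (year, time) pairs below are illustrative values that roughly mimic the world-record trend, not the exact historical records; the point is what happens when the fitted line is extrapolated.

```python
# Illustrative (year, time) pairs, roughly mimicking the 100 m
# world-record trend.  These are made-up values for demonstration.
years = [1912, 1930, 1950, 1968, 1983, 1999, 2009]
times = [10.6, 10.3, 10.2, 9.95, 9.93, 9.79, 9.58]

n = len(years)
xbar = sum(years) / n
ybar = sum(times) / n

# Ordinary least-squares slope and intercept for a straight-line fit
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(years, times))
         / sum((x - xbar) ** 2 for x in years))
intercept = ybar - slope * xbar

# The line describes the sampled range tolerably well, but extrapolated
# far enough it predicts a 0-second (and then negative) record, which is
# physically absurd.  The trend must flatten, so a straight line is the
# wrong functional form for this data.
zero_year = -intercept / slope
print("fitted slope (s/yr):", round(slope, 4))
print("year the line predicts a 0 s record:", int(zero_year))
```

A decaying curve that levels off toward some asymptotic limit is a far more sensible model here than a line.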

While there are standard tools for performing correlation analyses (these will be provided to you later), such analysis is often done poorly. As a result, much erroneous analysis gets published, in virtually all fields.

Oftentimes, there is simply not enough data to adequately define a correlation. This allows one to make ridiculous predictions which, although they can be supported by the data, make no sense.

A favorite example:

Here is a prediction that I made in the year 1839 (that was in the pre-internet era):

All presidents that are elected in a year that ends with a zero will die in office:
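The small-sample problem above can be illustrated with a quick simulation: when only a handful of points are available, a "strong" correlation arises by pure chance surprisingly often. The simulation below is a sketch using uniform random noise with zero true correlation.

```python
# Sketch: how often does pure noise look strongly correlated when
# we have only 3 data points?
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
trials = 10_000
# Count trials where 3 points of independent noise give |r| > 0.9
strong = sum(
    abs(pearson_r([random.random() for _ in range(3)],
                  [random.random() for _ in range(3)])) > 0.9
    for _ in range(trials)
)
print("fraction of noise trials with |r| > 0.9:", strong / trials)
```

A sizeable fraction of pure-noise samples pass the "looks correlated" test, even though there is no association at all, which is exactly why tiny data sets support ridiculous predictions.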

Basics of Correlation:

Correlation can be used to summarise the amount of association between two continuous variables. Making a "scatter plot" yields a "cloud" of points:

A positive association between the x and y variables means that knowing the value of one variable helps you predict the value of the other. If there is little or no association, the "cloud" is more spread out, and information about one variable cannot really be discerned from the other.

These "clouds" have the same centre, defined by the mean x and y values. Furthermore, the dispersion in the x variable is the same as that in the y variable. But (A) is tightly clustered while (B) is loosely clustered. The amount of clustering, i.e. the strength of the association, is summarised by the correlation coefficient.

In general, we measure correlation by a parameter known as the correlation coefficient, r .

r lies between -1 and +1

Mathematically, r is defined as

    r = sum[(x_i - xbar)(y_i - ybar)] / sqrt( sum[(x_i - xbar)^2] * sum[(y_i - ybar)^2] )

where xbar and ybar are the mean x and y values.

But we don't really care about the formula itself; we only care about using the value of r as a rough guide to how well two variables are correlated. Usually your eye is a good estimator of r.
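To connect the definition to the tight and loose "clouds" discussed above, here is a small sketch that computes r for two made-up data sets sharing the same x values, one tightly clustered about a line and one loosely clustered:

```python
# Sketch: r for a tight cloud vs. a loose cloud (illustrative values).
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x     = [1, 2, 3, 4, 5, 6]
tight = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]   # y tracks x with little scatter
loose = [2.5, 0.8, 4.9, 1.5, 6.2, 3.0]   # same x, much more scatter

print("tight cloud r:", round(pearson_r(x, tight), 3))  # near +1
print("loose cloud r:", round(pearson_r(x, loose), 3))  # much smaller
```

The tightly clustered cloud gives r near +1; the loose cloud gives a much smaller r, matching what the eye sees in the scatter plots.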

Regression is now built into the tool

What we care about most is the amount of dispersion (scatter) that exists around the fitted linear relation.
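That dispersion is simply the root-mean-square (RMS) of the residuals, the differences between the data and the fitted line. A minimal sketch, using illustrative values:

```python
# Sketch: RMS scatter about a least-squares line (illustrative data).
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Ordinary least-squares fit y = intercept + slope * x
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar

# Residuals: data minus the fitted line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rms = math.sqrt(sum(r * r for r in residuals) / n)
print("RMS scatter about the fit:", round(rms, 3))
```

A small RMS relative to the spread of the data means a tight, trustworthy relation; a large RMS means the "correlation" carries little predictive power.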

Final points about regression:

JAVA Applet
