More Correlations in Data

Correlation Leads to Discovery
About 75 years ago, astronomers used the simple technique of
correlation to discover that the Universe was expanding. For
nearby galaxies they measured a redshift and plotted it
against the distance to the galaxy. Here is the data:
The line through the data is a "best fit" linear relationship
between the velocity at which a galaxy moves away from us and its
distance. This linear relationship is consistent with a model
of uniform expansion for the Universe.
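A "best fit" line of the kind described above is commonly obtained by ordinary least squares. The sketch below is only an illustration of that technique: the function name `fit_line` and the numbers are invented for this example and are not the galaxy measurements from the plot.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations in x and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative numbers (NOT real galaxy measurements):
# distance in arbitrary units, velocity exactly proportional to distance.
distances = [1.0, 2.0, 3.0, 4.0]
velocities = [70.0, 140.0, 210.0, 280.0]
slope, intercept = fit_line(distances, velocities)
```

For data that lie exactly on a line through the origin, as in this toy case, the fitted intercept is zero and the slope is the constant of proportionality.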
Returning to Salmon:
For the Bonneville Dam data:
- Is there a correlation between
Chinook salmon counts and year (i.e., have the counts been steadily
decreasing)?
- Is there a correlation between the declines of the various species?
Here is the data. We want to mine this data statistically and
visually in order to see what we might be able to infer. In
other words, we don't want to rely on a machine alone; we also want
to use our brains to help interpret the data. This is the only way to understand
and appreciate the true complexity of a particular environmental
problem.
There is no correlation in this case over the whole time period.
Instead, there appears to be:
- quasi-cyclical behavior with a period of 15--20 years between
peaks and valleys (note that there is not enough data to really say this)
- a rapid decline since the last peak (1984). Could it mean that
those Salmon counts were anomalously high?
- recent Salmon levels that are consistent with those in 1980, 1960
and 1940
What about Steelhead vs Chinook at Bonneville Dam?
Formally there is very little correlation. The correlation
coefficient, r, is 0.31. But look at the data more closely and notice
that it's kind of odd.
There are 9 distinct occurrences where the
Steelhead count is significantly above average (this corresponds
to counts above 250,000). If we ignore those 9 points (years) out
of the total of 57 years' worth of data, the average Steelhead count
is
143,000 +/- 32,000 (N=48)
The mean count for those 9 higher years is
306,000 +/- 35,000 (N=9)
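Is the difference in these means significant?
- E1 = 32,000/7 = about 4,500 (the error in the mean; 7 is roughly the square root of N=48)
- E2 = 35,000/3 = about 12,000 (3 is the square root of N=9)
- E2 > 2*E1
- so use (M2-M1)/E2, working in thousands:
(306-143)/12 = more than 13 !!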
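The arithmetic above can be checked with a short sketch. The function names are invented for this example. The rule "if one error is more than twice the other, divide by the larger one" is taken from the text; the fallback of combining the two errors in quadrature when they are comparable is an assumption, since the text does not say what to do in that case.

```python
import math


def standard_error(std_dev, n):
    """Error in the mean: dispersion divided by the square root of the sample size."""
    return std_dev / math.sqrt(n)


def mean_difference_in_errors(mean_a, err_a, mean_b, err_b):
    """Difference of two means expressed in units of the relevant error."""
    small, large = sorted((err_a, err_b))
    if large > 2 * small:
        # Rule used in the text: the larger error dominates.
        denom = large
    else:
        # ASSUMPTION (not stated in the text): combine comparable errors in quadrature.
        denom = math.hypot(err_a, err_b)
    return abs(mean_a - mean_b) / denom


# Numbers quoted in the text for the Steelhead counts.
e1 = standard_error(32_000, 48)   # the 48 ordinary years
e2 = standard_error(35_000, 9)    # the 9 high years
significance = mean_difference_in_errors(143_000, e1, 306_000, e2)
```

Without the rounding used in the hand calculation, the result comes out close to 14 rather than 13; either way the two means differ by far more than the errors allow by chance.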
One can therefore conclude that something produces very high
Steelhead counts. Examining the data in time shows that the
high Steelhead counts occurred in 1952--1953 and again in 1984--1989
and 1991--1992. A high Steelhead count, however, does not mean
a high chinook count (nor does it correlate with any other species).
For the whole data set, the weak correlation (r = 0.31) is shown below:
While a social scientist might argue that a correlation exists, you
should be able to do better than that.
- The formal correlation is Y = 0.25*X + 76 (counts in thousands), but the scatter
around that correlation is 66,000 Steelhead. Since the average Steelhead
count is 143,000, using Chinook as the tracer of the Steelhead
population only predicts Steelhead to an accuracy of 66,000/143,000 or
around 45% (pretty lousy).
- But look at the data and notice that the weak correlation is almost
entirely driven by the two points with the highest X-values (highest
Chinook counts). If we eliminate those two points, then r drops
from 0.31 to 0.10, which is no correlation at all.
- Removing the 9 periods of high Steelhead counts from the data shows
no correlation at all. In fact, the average Steelhead count is the
same over a range of Chinook counts from 200--500 (thousand).
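The check described above, recomputing r after removing the points with the highest X-values, can be sketched as follows. The function names and the toy data are invented for illustration only; they are not the Bonneville Dam counts. The toy data are built so that a patternless cloud plus two extreme points gives a high r that vanishes once those two points are dropped.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def drop_highest_x(xs, ys, k):
    """Return copies of xs and ys with the k points of largest x removed."""
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])
    kept = pairs[:len(pairs) - k]
    return [p[0] for p in kept], [p[1] for p in kept]


# Hypothetical toy data: eight uncorrelated points plus two extreme ones.
toy_x = [1, 2, 3, 4, 1, 2, 3, 4, 10, 11]
toy_y = [1, 2, 2, 1, 2, 1, 1, 2, 5, 6]

r_all = pearson_r(toy_x, toy_y)
trimmed_x, trimmed_y = drop_highest_x(toy_x, toy_y, 2)
r_trimmed = pearson_r(trimmed_x, trimmed_y)
```

Comparing `r_all` with `r_trimmed` shows how much of a formal correlation rests on a handful of points, which is exactly the question to ask before trusting r.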
Okay, what about using just the chinook counts as a tracer of the
entire salmon population? How well does that work? Here is
the data:
Your eye sees a correlation and indeed r = 0.79 for this data set.
Of course, some trend is expected since roughly 30--40% of the total
Salmon Population is chinook; the question is, what is the dispersion
in total salmon counts that results from using chinook as the
tracer?
The formal fit is:
Y = 1.50X + 106 with a dispersion of 97
This means that chinook counts can be used to predict the total
Salmon counts to an accuracy of 97,000. Since the Salmon count
ranges from 500,000 to 1 million, that means an accuracy of 10--20%.
This suggests that, if you are only interested in total Salmon,
you can use chinook as a reliable tracer, provided that you don't
require accuracy better than 20%.
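As a rough sketch of how the fit above would be used, the snippet below applies Y = 1.50X + 106 (in thousands) and turns the dispersion of 97 into a fractional accuracy. The function names and the example chinook count of 300 thousand are hypothetical and chosen only for illustration.

```python
# Fit quoted in the text, with all counts in thousands of fish.
SLOPE = 1.50
INTERCEPT = 106.0
DISPERSION = 97.0


def predict_total(chinook_thousands):
    """Predicted total salmon count (thousands) from the chinook count (thousands)."""
    return SLOPE * chinook_thousands + INTERCEPT


def fractional_accuracy(total_thousands):
    """Dispersion of the fit as a fraction of a given total count."""
    return DISPERSION / total_thousands


# Hypothetical example: a year with 300 thousand chinook.
example_total = predict_total(300.0)

# Fractional accuracy at the two ends of the quoted range of totals.
accuracy_at_one_million = fractional_accuracy(1000.0)
accuracy_at_half_million = fractional_accuracy(500.0)
```

The two fractions bracket the 10--20% quoted in the text: the same absolute scatter is a larger fraction of a small total than of a large one.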
The fit as applied to the data is shown here. In this case, r = 0.79
and the fit is a good fit. There are no strongly aberrant data
points.
Some Final Remarks
And so after this evolution we arrive at a crossroads, strongly
driven by non-equilibrium growth, and we look for solutions about
how to better manage the planet.
Much of the current dialogue in environmental studies or management
needs to shift away from belief to a position of knowledge.
The acquisition of knowledge requires gathering good data, analyzing
it correctly, and then forming new questions on the basis of the
data.
The Data Commandments: (Apply them often)
- Always, always ALWAYS plot your
data.
- Never, never NEVER put data through
some blackbox reduction routine without examining the data themselves.
- The average of some distribution is not very meaningful unless
you also know the dispersion. Always calculate the dispersion.
- Always examine correlation data for points that could be rejected.
Never reject them just because they are "too far from the line" but rather
examine if poor measurement or some other error is responsible for
these peculiar data values.
- Always present and plot data without any compression in the axis
so that you don't distort the data by fostering an unfair visual
impression.
- Always compute the level of significance when comparing two
distributions. Just because they might have different mean values
doesn't necessarily mean they are significantly different.
- Always know your measuring errors.
- Always require someone to back up their "belief statements"
with data.
- Always calculate the dispersion in any correlative analysis and
always look to see if the residuals correlate with another parameter.
- Always remember that unambiguous data resolves conflict.
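The commandment about dispersion in a correlative analysis can be sketched as below. This is one possible definition, not necessarily the one used for the numbers in this lecture: the dispersion is taken here as the root-mean-square of the residuals about the fitted line. The function names and the toy data are invented for illustration.

```python
import math


def residuals(xs, ys, slope, intercept):
    """Observed y minus the value predicted by the line y = slope*x + intercept."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical toy data scattered by +/-1 around the line y = 2x + 1.
toy_x = [0, 1, 2, 3]
toy_y = [2, 2, 6, 6]

toy_residuals = residuals(toy_x, toy_y, slope=2.0, intercept=1.0)
toy_dispersion = rms(toy_residuals)
```

The list of residuals is also what you would plot against any other parameter to see whether the leftover scatter correlates with it.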
GOOD LUCK