More Correlations in Data

Correlation Leads to Discovery

About 75 years ago, astronomers used the simple technique of correlation to discover that the Universe was expanding. For nearby galaxies they measured a redshift and plotted it against the distance to the galaxy. Here is the data:

The line through the data is a "best fit" line, which shows that there is a linear relationship between the velocity at which a galaxy moves away from us and its distance. This linear relationship is consistent with a model of uniform expansion for the Universe.
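To make "best fit" concrete, here is a minimal sketch of an ordinary least-squares line fit in plain Python. The function name linear_fit and the five data points are made up for illustration; they are not the galaxy measurements, and the original fit may have been done differently.

```python
import math

def linear_fit(xs, ys):
    """Least-squares line y = slope*x + intercept, plus the RMS scatter about it."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    dispersion = math.sqrt(sum(r * r for r in residuals) / n)
    return slope, intercept, dispersion

# Hypothetical illustrative points (NOT measured galaxy data).
distance = [1.0, 2.0, 3.0, 4.0, 5.0]
velocity = [60.0, 150.0, 200.0, 290.0, 350.0]

slope, intercept, dispersion = linear_fit(distance, velocity)
print(slope, intercept, dispersion)
```

The slope plays the role of the expansion rate: a larger slope means galaxies at a given distance recede faster.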

Returning to Salmon:

For the Bonneville Dam data:

Here is the data. We want to mine this data statistically and visually in order to see what we might be able to infer. In other words, we don't just want to use a machine but also our brains to help interpret the data. This is the only way to understand and appreciate the true complexity of a particular environmental problem.

There is no correlation in this case over the whole time period. In fact there appears to be:

What about Steelhead vs Chinook at Bonneville Dam:

Formally there is very little correlation. The correlation coefficient, r, is 0.31. But look at the data more closely and notice that it's kind of odd.
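For reference, the correlation coefficient r can be computed as in the following sketch, which assumes r here means the standard Pearson coefficient. The function name pearson_r and the toy inputs are illustrative only; they are not the Bonneville Dam counts.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy inputs (hypothetical): a perfect line, and a symmetric hump.
r_line = pearson_r([1, 2, 3], [2, 4, 6])
r_hump = pearson_r([1, 2, 3, 4], [1, 3, 3, 1])
print(r_line, r_hump)
```

The second toy case is a reminder that r only measures linear association: data with obvious structure can still give r near zero, which is why the plot has to be inspected as well.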

There are 9 distinct occurrences where the Steelhead count is significantly above average (this corresponds to counts above 250,000). If we ignore those 9 points (years) out of the total of 57 years' worth of data, the average Steelhead count is

143,000 +/- 32,000 (N=48)

The mean count for those 9 higher years is

306,000 +/- 35,000 (N=9)

Is the difference in these means significant?
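One way to put a number on this question is a two-sample (Welch) t statistic. The sketch below assumes that the quoted +/- values are sample standard deviations of the yearly counts, which the text does not state; welch_t is a hypothetical helper name.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference of two means divided by its standard error (Welch form)."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_b - mean_a) / standard_error

# Values quoted above; treating +/- as a standard deviation is an assumption.
t = welch_t(143_000, 32_000, 48, 306_000, 35_000, 9)
print(round(t, 1))
```

Under that assumption the statistic comes out near 13, far larger than the values of 2 to 3 usually taken as the threshold for a significant difference.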

The two means differ by about 163,000, roughly five times the dispersion of either group, so one can conclude that something produces very high Steelhead counts. Examining the data in time shows that the high Steelhead counts occurred in 1952-1953 and again in 1984-1989 and 1991-1992. A high Steelhead count, however, does not mean a high chinook count (nor does it correlate with any other species).

For the whole data set, the weak correlation (r = 0.31) is shown below:

While a social scientist might argue that a correlation exists, you should be able to do better than that.

Okay, what about using just the chinook counts as a tracer of the entire salmon population? How well does that work? Here is the data:

Your eye sees a correlation and indeed r = 0.79 for this data set. Of course, some trend is expected since roughly 30--40% of the total Salmon Population is chinook; the question is, what is the dispersion in total salmon counts that results from using chinook as the tracer?

The formal fit is:

Y = 1.50X + 106 with a dispersion of 97 (X = chinook count, Y = total salmon count, both in thousands)

This means that chinook counts can be used to predict the total Salmon counts to an accuracy of 97,000. Since the Salmon count ranges from 500,000 to 1 million, that means an accuracy of 10-20%. This suggests that, if you are only interested in total Salmon, you can use chinook as a reliable tracer, provided that you don't require accuracy better than 20%.
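The arithmetic behind the 10-20% figure can be sketched as follows, using the fit coefficients quoted above with counts in thousands. The function names and the example chinook count of 300 (thousand) are hypothetical.

```python
SLOPE = 1.50        # from the quoted fit
INTERCEPT = 106.0   # thousands of fish
DISPERSION = 97.0   # thousands of fish

def predict_total(chinook_thousands):
    """Predicted total salmon count (thousands) from a chinook count (thousands)."""
    return SLOPE * chinook_thousands + INTERCEPT

def fractional_accuracy(total_thousands):
    """Dispersion of the fit as a fraction of the total count."""
    return DISPERSION / total_thousands

example_total = predict_total(300.0)  # hypothetical chinook count
print(example_total)

# Fractional accuracy at the two ends of the observed range of totals.
for total in (500.0, 1000.0):
    print(total, round(100 * fractional_accuracy(total), 1), "%")
```

The fixed dispersion of 97,000 is a larger fraction of a small run than of a large one, which is where the 10-20% range comes from.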

The fit as applied to the data is shown here. In this case, r = 0.79 and the fit is a good one. There are no strongly aberrant data points.

The Federal Budget

Historical Federal Outlay by Agency

Defense Spending Since 1976: Characterized by very rapid increases 1980-1990 followed by a slow decrease:

NASA Budget Since 1976: Slow steady growth, then an explosive 4-year period followed by a levelling off --> this is a devastating funding pattern for an agency

Education Spending Since 1976: Rapid growth separated by stagnant periods

Correlation of education spending vs defense spending:

Correlation of NASA spending vs defense spending:
