Let's look at some examples using correlation and regression analysis.
An example data set:
The goal here is to find the best relation between Y, the dependent variable, and X, the independent variable.
X is the variable that we would measure, because Y is more difficult, and in some cases might be impossible, to measure.
Since we are measuring X, the role of measurement error will become important. More on that later.
X Y
10.0 12.5
8.5 11.1
16.8 22.3
11.2 15.4
17.8 25.3
5.4 8.4
21.6 32.6
9.6 18.5
14.0 15.3
13.5 16.8
The correlation between X and Y is shown here:
Ypred = 1.39X + 0.03 ; dispersion = 2.53 ; r = 0.94
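As a cross-check, here is a minimal sketch of how a fit like this can be computed, assuming the relation above is an ordinary least-squares fit of Y on X (the notes do not say which method was used). The function name fit_line is made up for illustration. Rounded to two decimals, the slope, intercept, and r agree with the values quoted above; the dispersion is not recomputed because its exact definition is not given here.

```python
import math

# Data from the example table above.
X = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
Y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]


def fit_line(xs, ys):
    """Least-squares fit of y = slope*x + intercept, plus Pearson's r.

    Assumption: ordinary least squares of Y on X (illustrative helper).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(X, Y)
print(f"Ypred = {slope:.2f}X + {intercept:.2f} ; r = {r:.2f}")
```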
Let's now calculate the residual (Y-pred minus Y) for each data point, along with its significance (the residual divided by the dispersion of 2.53).
X Y Y-pred Residual Significance
10.0 12.5 13.93 1.43 0.56
8.5 11.1 11.85 0.75 0.29
16.8 22.3 23.38 1.08 0.43
11.2 15.4 15.60 0.20 0.08
17.8 25.3 24.77 -0.52 -0.21
5.4 8.4 7.53 -0.86 -0.34
21.6 32.6 30.05 -2.54 -1.00
9.6 18.5 13.37 -5.12 -2.02
14.0 15.3 19.49 4.19 1.65
13.5 16.8 18.80 2.00 0.79
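The table can be regenerated with a few lines of code. This is only a sketch: it plugs in the rounded fit and dispersion quoted above, and it assumes the sign convention implied by the table (residual = Y-pred minus Y, significance = residual / dispersion). Because the coefficients are rounded, the last digit can differ slightly from the table.

```python
# Fit and dispersion as quoted in the text (rounded values).
SLOPE, INTERCEPT, DISPERSION = 1.39, 0.03, 2.53

X = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
Y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]

rows = []
for x, y in zip(X, Y):
    y_pred = SLOPE * x + INTERCEPT
    residual = y_pred - y                 # predicted minus observed, as in the table
    significance = residual / DISPERSION  # residual in units of the dispersion
    rows.append((x, y, y_pred, residual, significance))

print("    X     Y Y-pred Residual Significance")
for x, y, y_pred, residual, significance in rows:
    print(f"{x:5.1f} {y:5.1f} {y_pred:6.2f} {residual:8.2f} {significance:12.2f}")

# The most deviant point is the one with the largest absolute significance.
worst = max(rows, key=lambda row: abs(row[4]))
print("most deviant point: X =", worst[0], ", Y =", worst[1])
```

The point flagged at the end is the natural candidate for rejection.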
Try rejection analysis to improve the fit (mainly to lower the scatter). Reject the most deviant point in the table above, which is (X, Y) = (9.6, 18.5) with a significance of -2.02, and refit.
That new relation is plotted here:
Ypred = 1.48X -1.76 ; dispersion = 1.96
This representation of the data is more reliable and robust, and it allows Y to be estimated more accurately.
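One pass of rejection can be sketched as follows, again assuming an ordinary least-squares fit (an assumption, since the fitting method is not stated). The helper names fit_line and reject_most_deviant are made up for illustration. The sketch drops the single point with the largest absolute residual and refits; to two decimals the refit matches the slope and intercept of the new relation quoted above.

```python
X = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
Y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def reject_most_deviant(xs, ys):
    """Return the point with the largest |residual| and the data without it."""
    slope, intercept = fit_line(xs, ys)
    residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
    worst = max(range(len(xs)), key=lambda i: abs(residuals[i]))
    kept_x = xs[:worst] + xs[worst + 1:]
    kept_y = ys[:worst] + ys[worst + 1:]
    return (xs[worst], ys[worst]), kept_x, kept_y


rejected, X2, Y2 = reject_most_deviant(X, Y)
slope2, intercept2 = fit_line(X2, Y2)
print("rejected point:", rejected)
print(f"Ypred = {slope2:.2f}X {intercept2:+.2f}")
```

Only one point is rejected here. Repeating the step until the scatter stops improving is possible, but every rejection throws away data, so it should be done sparingly.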
About measurement errors in X.
Suppose I have two relations involving different quantities but both use the same independent variable X.
Relation 1:
Y1 = 1.5X + 1.5 ; with a dispersion of 0.5 units
Relation 2:
Y2 = 6.0X + 2.5 ; with a dispersion of 0.3 units
Suppose that I can only make measurements of X which are accurate to 10%. This means that, despite a lower dispersion, Y2 is less well determined than Y1!
Example: X = 10 +/- 1
For X = 10: Y1 = 1.5*10 + 1.5 = 16.5
For X = 11: Y1 = 1.5*11 + 1.5 = 18.0
So a 10% uncertainty in X translates into a +/- 1.5 unit uncertainty in Y1.
For X = 10: Y2 = 6.0*10 + 2.5 = 62.5
For X = 11: Y2 = 6.0*11 + 2.5 = 68.5
So a 10% uncertainty in X translates into a +/- 6.0 unit uncertainty in Y2.
So relations which have steep slopes require that X be measured very accurately.
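The arithmetic above is just the slope times the error in X. A small sketch makes the comparison with the intrinsic dispersion explicit; the function name y_uncertainty and the dictionary layout are illustrative choices, not part of the original notes.

```python
def y_uncertainty(slope, x_uncertainty):
    """For Y = slope*X + intercept, an error dX in X shifts Y by |slope| * dX."""
    return abs(slope) * x_uncertainty


x = 10.0
dx = 0.10 * x  # X measured to 10%, i.e. X = 10 +/- 1

# name: (slope, intercept, dispersion), values taken from the two relations above
relations = {"Y1": (1.5, 1.5, 0.5), "Y2": (6.0, 2.5, 0.3)}

propagated = {}
for name, (slope, intercept, dispersion) in relations.items():
    propagated[name] = y_uncertainty(slope, dx)
    print(f"{name}: value at X=10 is {slope * x + intercept:.1f}, "
          f"error from X = +/- {propagated[name]:.1f}, "
          f"dispersion of the relation = {dispersion}")
```

In both cases the error carried over from X is larger than the dispersion of the relation itself, and for Y2 it is larger by a factor of 20.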
About 75 years ago, astronomers used the simple technique of correlation to discover that the Universe was expanding. For nearby galaxies they measured a redshift and plotted that against the distance to the galaxy. Here is the data:
The line through the data is a "best fit" linear relationship, which shows that the velocity at which a galaxy moves away from us increases linearly with its distance. This linear relationship is consistent with a model of uniform expansion for the Universe.
Binning this noisy data:
There are 12 points between x = 0 and 1 and 9 points between x = 1 and 2.
The means, dispersions, and errors in the mean for that binned data set are as follows:
X = 0.58 +/- 0.32 +/- 0.09 ; Y = 373 +/- 265 +/- 76
X = 1.74 +/- 0.30 +/- 0.10 ; Y = 744 +/- 206 +/- 72

Is there a significant difference between the means?
X: (1.74 - 0.58) / (1.5*0.10) = 7.7
Y: (744 - 373) / (1.5*70) = 3.5
So yes, at higher values of X, the mean value of Y is significantly larger so a correlation exists.
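The comparison of the two bins can be written as a tiny function. This is a sketch of the arithmetic used in this example only: the factor of 1.5 and the error values (0.10 for X, 70 for Y) are taken as given from the lines above, and the function name is made up for illustration.

```python
def mean_difference_significance(mean_low, mean_high, error_in_mean, factor=1.5):
    """Difference between two bin means in units of factor * error in the mean.

    The default factor of 1.5 is copied from the worked example in the notes;
    it is not derived here.
    """
    return (mean_high - mean_low) / (factor * error_in_mean)


sig_x = mean_difference_significance(0.58, 1.74, 0.10)
sig_y = mean_difference_significance(373, 744, 70)
print(f"X: {sig_x:.1f}")
print(f"Y: {sig_y:.1f}")
```

Both values are well above 1, which is what the text means by a significant difference between the means.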
So if you have enough data and you can bin in X, often the binned correlation will be obvious even in noisy data.
The PNI is a good example of this averaging process.
For the Bonneville Dam data:



What about Steelhead vs Chinook at Bonneville Dam? This is an exercise in data inspection and thinking. A superficial glance or machine processing of the data will not reveal what may actually be in there.

Formally there is very little correlation. The correlation coefficient, r, is 0.31. But look at the data more closely and notice that it's kind of odd.
There are 9 distinct occurrences where the Steelhead count is significantly above average (this corresponds to counts above 250,000). If we ignore those 9 points (years) out of the total of 57 years' worth of data, the average Steelhead count is
The mean count for those 9 higher years is
Is the difference in these means significant?
(306-143)/12 = 13 !!
One can therefore conclude that something produces very high Steelhead counts. Examining the data in time shows that the high Steelhead counts occurred in 1952-1953 and again in 1984-1989 and 1991-1992. A high Steelhead count, however, does not mean a high chinook count (nor does it correlate with any other species).
For the whole data set, the weak correlation (r = 0.31) is shown below:

While a social scientist might argue that a correlation exists, you should be able to do better than that.
Okay, what about using just the chinook counts as a tracer of the entire salmon population? How well does that work? Here is the data:

Your eye sees a correlation and indeed r = 0.79 for this data set. Of course, some trend is expected since roughly 30--40% of the total Salmon Population is chinook; the question is, what is the dispersion in total salmon counts that results from using chinook as the tracer?
The formal fit is:
This means that chinook counts can be used to predict the total Salmon counts to an accuracy of 97,000. Since the Salmon count ranges from 500,000 to 1 million, that means an accuracy of 10-20%. This suggests that, if you are only interested in total Salmon, you can use chinook as a reliable tracer, provided that you don't require accuracy better than 20%.
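The 10-20% figure is simply the dispersion of the fit divided by the total count at the two ends of the quoted range. A quick check, using only the numbers given above (the variable names are illustrative):

```python
dispersion = 97_000                      # accuracy of the chinook-based prediction
total_counts = (500_000, 1_000_000)      # quoted range of total Salmon counts

fractions = [dispersion / total for total in total_counts]
for total, fraction in zip(total_counts, fractions):
    print(f"total = {total:>9,d} : fractional accuracy = {fraction:.1%}")
```

The fractional error is largest in low-count years, so the 20% end of the range applies when the run is small.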
The fit as applied to the data is shown here. In this case, r = 0.79 and the fit is a good fit. There are no strongly aberrant data points.
