Let's look at some examples using correlation and regression analysis.

An example data set:

The goal here is to find the best relation between Y, the dependent variable, and X, the independent variable.

X is the variable that we would measure, because Y is more difficult, and in some cases might be impossible, to measure.

Since we are measuring X, the role of measurement error will become important. More on that later.

  X     Y
 10.0  12.5
  8.5  11.1
 16.8  22.3
 11.2  15.4
 17.8  25.3
  5.4   8.4
 21.6  32.6
  9.6  18.5
 14.0  15.3
 13.5  16.8

The correlation between X and Y is shown here:

Y_{pred} = 1.39X + 0.03 ; dispersion = 2.53 ; r = 0.94
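This fit can be reproduced with a short Python sketch (an illustration, not part of the original analysis) using the data table above. Only the slope, intercept, and r are shown, because the exact dispersion value depends on whether the residual scatter is divided by N or N - 2.

```python
import math

# Data from the table above
x = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]

def linear_fit(x, y):
    """Ordinary least-squares fit y = m*x + b, plus the correlation coefficient r."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    m = sxy / sxx                      # slope
    b = ybar - m * xbar                # intercept
    r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
    return m, b, r

m, b, r = linear_fit(x, y)
print(f"Y_pred = {m:.2f}X + {b:.2f} ; r = {r:.2f}")   # Y_pred = 1.39X + 0.03 ; r = 0.94
```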

Let's calculate the residuals for each data point now.

(Residual = Y_pred - Y; Significance = Residual / dispersion, with dispersion = 2.53.)

  X     Y    Y_pred  Residual  Significance
 10.0  12.5  13.93     1.43       0.56
  8.5  11.1  11.85     0.75       0.29
 16.8  22.3  23.38     1.08       0.43
 11.2  15.4  15.60     0.20       0.08
 17.8  25.3  24.77    -0.52      -0.21
  5.4   8.4   7.53    -0.86      -0.34
 21.6  32.6  30.05    -2.54      -1.00
  9.6  18.5  13.37    -5.12      -2.02
 14.0  15.3  19.49     4.19       1.65
 13.5  16.8  18.80     2.00       0.79

Now try rejection analysis to improve the fit (mainly to lower the scatter): reject the most deviant point in the table above (the point at X = 9.6, which lies 2.02 dispersions from the fit).

That new relation is plotted here:

Y_{pred} = 1.48X -1.76 ; dispersion = 1.96
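The rejection step above can be sketched in Python: fit, compute each point's residual, drop the single most deviant point, and refit. This is a minimal illustration; the least-squares fit is reimplemented here so the block stands alone.

```python
x = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]

def linear_fit(x, y):
    """Ordinary least-squares fit y = m*x + b."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, ybar - m * xbar

# First fit, then find the point with the largest absolute residual
m, b = linear_fit(x, y)
residuals = [(m * xi + b) - yi for xi, yi in zip(x, y)]
worst = max(range(len(x)), key=lambda i: abs(residuals[i]))  # index of (9.6, 18.5)

# Refit with the most deviant point rejected
x2 = [xi for i, xi in enumerate(x) if i != worst]
y2 = [yi for i, yi in enumerate(y) if i != worst]
m2, b2 = linear_fit(x2, y2)
print(f"Y_pred = {m2:.2f}X {b2:+.2f}")   # Y_pred = 1.48X -1.76
```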

This representation of the data is more reliable and robust, and it allows Y to be more accurately estimated.

About measurement errors in X.

Suppose I have two relations involving different quantities but both use the same independent variable X.

Relation 1:

Y_{1} = 1.5X + 1.5 ; with a dispersion of 0.5 units

Relation 2:

Y_{2} = 6.0X + 2.5 ; with a dispersion of 0.3 units

Suppose that I can only make measurements of X which are accurate
to 10%. This means that, despite a lower dispersion,
Y_{2} is less well determined than Y_{1}!

Example: X = 10 +/- 1

Y_{1} = 1.5*10 +1.5 = 16.5

Y_{1} = 1.5*11 +1.5 = 18.0

So 10% uncertainty in X translates into +/- 1.5 unit uncertainty
in Y.

For Y_{2} = 6.0*10 +2.5 = 62.5

For Y_{2} = 6.0*11 +2.5 = 68.5

So 10% uncertainty in X translates into +/- 6.0 unit uncertainty in Y.

So relations which have steep slopes require that X be measured very accurately.
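For a linear relation Y = mX + b, an uncertainty dX in X propagates to an uncertainty of m*dX in Y, which is exactly the arithmetic worked above. A tiny sketch:

```python
def y_uncertainty(slope, dx):
    """Uncertainty in Y arising from an uncertainty dx in X, for Y = slope*X + b."""
    return abs(slope) * dx

dx = 1.0  # 10% of X = 10
print(y_uncertainty(1.5, dx))  # relation 1: 1.5 units
print(y_uncertainty(6.0, dx))  # relation 2: 6.0 units
```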

About 75 years ago, astronomers used the simple technique of correlation to discover that the Universe was expanding. For nearby galaxies they measured a redshift and plotted that against the distance to the galaxy. Here is the data:

The line through the data is a "best fit" linear relationship which shows that there is a linear relationship between the velocity at which a galaxy moves away from us and its distance. This linear relationship is consistent with a model of uniform expansion for the Universe.

Binning this noisy data:

There are 12 points between x = 0 and 1 and 9 points between x = 1 and 2.

The means, dispersions, and errors in the mean for that binned data set are as follows:

Bin 0 < x < 1:  X = 0.58 +/- 0.32 +/- 0.09 ;  Y = 373 +/- 265 +/- 76

Bin 1 < x < 2:  X = 1.74 +/- 0.30 +/- 0.10 ;  Y = 744 +/- 206 +/- 72

Is there a significant difference between the means?

For X: (1.74 - 0.58)/(1.5*0.10) = 7.7

For Y: (744 - 373)/(1.5*70) = 3.5

So yes, at higher values of X, the mean value of Y is significantly larger, so a correlation exists.

So if you have enough data and you can bin in X, often the binned correlation will be obvious even in noisy data.
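The significance comparison above can be reproduced directly from the binned statistics. The factor of 1.5 applied to the error in the mean is the comparison rule used in the text when the two errors are comparable, and the adopted error of ~70 for the Y bins follows the text's own rounding.

```python
def significance(m1, m2, err):
    """Difference of two means in units of 1.5 times the adopted error in the mean
    (the comparison rule used in the text for comparable errors)."""
    return abs(m2 - m1) / (1.5 * err)

sig_x = significance(0.58, 1.74, 0.10)  # error in the mean for the X bins
sig_y = significance(373, 744, 70)      # text adopts ~70 for the Y bins
print(round(sig_x, 1), round(sig_y, 1))  # 7.7 3.5
```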

The PNI is a good example of this averaging process.

For the Bonneville Dam data:

- Is there a correlation between Chinook salmon counts and year (i.e., have the counts been steadily decreasing)?
- Is there a correlation between the declines of the various species?

- quasi-cyclical behavior with a period of 15--20 years between
peaks and valleys (note there is not enough data to really say this).
This behavior becomes more apparent if we average the data on 3
and 5 year intervals:
- There appear to be low periods in Salmon Levels in 1995, 1980, 1960
and 1940.
- Salmon counts in 2001 and 2002 are historically high.

Formally, there is very little correlation. The correlation coefficient, r, is 0.31. But look at the data more closely and notice that it is kind of odd.

There are 9 distinct occurrences where the Steelhead count is significantly above average (this corresponds to counts above 250,000). If we ignore those 9 points (years) out of the total of 57 years' worth of data, the average Steelhead count is 143,000.

The mean count for those 9 higher years is 306,000.

Is the difference in these means significant?

- E1 = 32,000/sqrt(48) ~ 32,000/7 ~ 4,500 (error in the mean for the 48 remaining years)
- E2 = 35,000/sqrt(9) = 35,000/3 ~ 12,000 (error in the mean for the 9 high years)
- E2 > 2*E1
- so use (M1-M2)/E2: (306 - 143)/12 ~ 13.6 (in thousands) !!
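The same comparison in code, assuming (as the numbers imply) that the errors in the mean are dispersion/sqrt(N), with 48 normal years and 9 high years. The text rounds E2 up to 12,000, which gives a slightly smaller significance of ~13.6; either way the difference is wildly significant.

```python
import math

# Dispersions (fish) and sample sizes for the two groups
e1 = 32_000 / math.sqrt(48)   # normal years: ~4,600
e2 = 35_000 / math.sqrt(9)    # high years:  ~11,700
m1, m2 = 306_000, 143_000     # mean counts: high years, normal years

# Since E2 > 2*E1, the text's rule compares the means against E2 alone
assert e2 > 2 * e1
sig = (m1 - m2) / e2
print(round(sig, 1))  # 14.0
```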

One can therefore conclude that something produces very high Steelhead counts. Examining the data in time shows that the high Steelhead counts occurred in 1952--1953 and again in 1984--1989 and 1991--1992. A high Steelhead count, however, does not mean a high Chinook count (nor does it correlate with any other species).

For the whole data set, the weak correlation (r = 0.31) is shown below:

While a social scientist might argue that a correlation exists, you should be able to do better than that.

- The formal correlation is Y = 0.25*X + 76 but the scatter
around that correlation is 66,000 Steelhead. Since the average Steelhead
count is 143,000, using Chinook as the tracer of the Steelhead
population only predicts Steelhead to an accuracy of 66,000/143,000, or
around 45% (pretty lousy).
- But look at the data and notice that the weak correlation is almost
entirely driven by the two points with the highest X-values (highest
Chinook counts). If we eliminate those two points, then r drops
from 0.31 to 0.10, which is no correlation at all.
- Removing the 9 periods of high Steelhead counts from the data shows no correlation at all. In fact, the average Steelhead count is the same over a range of Chinook counts from 200--500 (thousand).

Okay, what about using just the Chinook counts as a tracer of the entire salmon population? How well does that work? Here is the data:

Your eye sees a correlation and indeed r = 0.79 for this data set. Of course, some trend is expected since roughly 30--40% of the total Salmon Population is chinook; the question is, what is the dispersion in total salmon counts that results from using chinook as the tracer?

The formal fit is:

This means that chinook counts can be used to predict the total Salmon counts to an accuracy of 97,000. Since the Salmon count ranges from 500,000 to 1 million, that means an accuracy of 10-20%. This suggests that, if you are only interested in total Salmon, you can use chinook as a reliable tracer, provided that you don't require accuracy better than 20%.
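The accuracy claim is just the ratio of the fit's scatter to the total count, evaluated at both ends of the observed range:

```python
scatter = 97_000  # dispersion of the fit, from the text
for total in (500_000, 1_000_000):
    # relative accuracy of the prediction at this total salmon count
    print(f"{scatter / total:.0%}")   # 19% and 10%
```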

The fit as applied to the data is shown here. In this case, r = 0.79 and the fit is a good one. There are no strongly aberrant data points.