Eugene | Seattle |
---|---|
mean = 51.5 inches | mean = 39.5 inches |
dispersion = 8.5 | dispersion = 7.0 |

On average, does it rain significantly more in Eugene than Seattle?

Here is the __wrong way__ to do this problem:

- If you follow the earlier procedure, you would note that the difference
in mean rainfall between Seattle and Eugene is 12 inches.
- 12 inches is 12/8 = 1.5 dispersion units and therefore not significant.

But this is not the correct procedure to use when comparing two
separate distributions.
*It is only the correct procedure to use
when comparing one data point to the rest of the same distribution.*

Again, here is an example of that:

The figure below shows the histogram of rainfall in Eugene from 1900-2000. The bin width in this case is 2 inches of rainfall.

For this case, the 100 year data set gives a mean of 42 inches and a dispersion of 9 inches. This means that 2/3 of the time, the annual rainfall in Eugene can be expected to be between 33 and 51 inches. (Note: the mean rainfall using just the last 30 years' worth of data is higher; we will show this below.)

The official rainfall in Eugene in 1996 was 77 inches. Is this an expected 1 in a hundred year rainfall amount? Note, a 1 in 100 chance corresponds to 2.5 dispersion units. A 1 in 1000 chance is 3 dispersion units.

- 77 - 42 = 35 inches above the mean
- 35/9 = 3.9 dispersion units above the mean
- 3.9 dispersion units is about 1 chance in 10,000; therefore 1996 was not a statistical fluctuation in normal weather patterns; it was a systematic departure.
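As a rough sketch of the single-data-point test, the 1996 rainfall check can be written in Python. The function names are illustrative only. The conversion to a chance is an assumption: it treats the rainfall as a bell curve and counts departures on either side of the mean, whereas the notes quote rule-of-thumb odds.

```python
import math

def dispersion_units(value, mean, dispersion):
    """Distance of one data point from the mean, in dispersion units."""
    return (value - mean) / dispersion

def two_sided_chance(units):
    """Chance of a departure at least this large in either direction.

    Assumption: the data follow a normal (bell curve) distribution.
    """
    return math.erfc(abs(units) / math.sqrt(2))

# Eugene, 1900-2000: mean 42 inches, dispersion 9 inches; 1996 had 77 inches.
z = dispersion_units(77, 42, 9)
p = two_sided_chance(z)
print(f"{z:.1f} dispersion units, roughly 1 chance in {1 / p:,.0f}")
```

Under these assumptions the result comes out at the same order of magnitude as the 1 in 10,000 figure quoted for 3.9 dispersion units.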

The error in the mean can be thought of as a measure of how reliably a mean value has been determined. It is the dispersion divided by the square root of the number of samples N. The more samples you have, the more reliable the mean is. But the improvement goes only as the square root of the number of samples! So if you want to improve the reliability of the mean value by a factor of 10, you would have to get 100 times more samples. This can be difficult, and often you're stuck with what you've got. You then have to make use of it.
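A minimal sketch of the square-root scaling, assuming the error in the mean is the dispersion divided by the square root of N (this is how the 8.1/5 entry for N = 25 is obtained in these notes). The function name is hypothetical.

```python
import math

def error_in_mean(dispersion, n_samples):
    """Error in the mean = dispersion / sqrt(number of samples)."""
    return dispersion / math.sqrt(n_samples)

# Same dispersion (8.1 inches), increasing sample sizes.
for n in (25, 100, 2500):
    print(n, round(error_in_mean(8.1, n), 2))
```

Going from 25 to 2500 samples (100 times more data) shrinks the error in the mean by only a factor of 10.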

Back to the Eugene/Seattle comparison based on the last 25 years worth of data (so N = number of samples = 25).

Eugene | Seattle |
---|---|
mean = 51.5 inches | mean = 39.5 inches |
dispersion = 8.1 | dispersion = 7.0 |
N = 25 | N = 25 |
error in mean = 8.1/5 | error in mean = 7.0/5 |
error in mean = 1.6 | error in mean = 1.4 |

The difference in mean rainfall between Seattle and Eugene is (51.5 - 39.5) = 12 inches, which is 12/1.6 = 7.5 dispersion units of the mean value (here measured against Eugene's error in the mean).

Thus there is a highly significant difference in the mean annual rainfall between Eugene and Seattle.

Note this method is only an approximation. A more exact and proper way to compare two sample means will be given later.

Comparing Two Sample Means - Find the difference of the two sample means in units of sample mean errors. This works as follows:

- Sample 1 has mean M1 and error in the mean E1
- Sample 2 has mean M2 and error in the mean E2

Difference in terms of significance is:

$$\frac{M_1 - M_2}{\sqrt{E_1^2 + E_2^2}}$$
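The comparison can be sketched in Python, assuming the difference is divided by the two errors in the mean added in quadrature, which is the form the rainfall worked example in these notes uses. The function name `significance` is illustrative.

```python
import math

def significance(m1, e1, m2, e2):
    """Difference of two sample means in units of their combined error.

    Assumption: the combined error is sqrt(e1**2 + e2**2).
    """
    return abs(m1 - m2) / math.sqrt(e1**2 + e2**2)

# Eugene vs. Seattle, last 25 years: error in mean = dispersion / sqrt(25).
eugene_error = 8.1 / math.sqrt(25)
seattle_error = 7.0 / math.sqrt(25)
units = significance(51.5, eugene_error, 39.5, seattle_error)
print(round(units, 1))
```

With both errors included, the Eugene/Seattle difference comes out near 5.6 units rather than 7.5, still well above 3.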

In general, in more qualitative terms:

- If the difference in means between two samples is less than
2.0 dispersion units, the two samples are the same.
- If the difference in means between two samples is between 2.0 and
2.5 dispersion units, the two samples are
__marginally different__
- If the difference in means between two samples is between 2.5 and
3.0 dispersion units, the two samples are
__significantly different__
- If the difference in means between the two samples is more than
3.0 dispersion units, the two samples are
__highly significantly different__
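These qualitative bands can be sketched as a small lookup. The function name is hypothetical, and how values exactly on a boundary (2.0, 2.5, 3.0) are assigned is an assumption, since the notes do not say.

```python
def classify(units):
    """Map a difference in dispersion units to the qualitative labels above."""
    units = abs(units)
    if units < 2.0:
        return "the same"
    if units < 2.5:
        return "marginally different"
    if units <= 3.0:  # assumption: exactly 3.0 still counts as "significantly"
        return "significantly different"
    return "highly significantly different"

for value in (1.5, 2.2, 2.8, 5.9):
    print(value, "->", classify(value))
```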

Simple Approximation:

- If E1 and E2 are similar then use
(M1-M2)/(1.5*E1)
- If E1 > 2*E2 then use (M1-M2)/E1
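One way to see where these shortcuts come from, assuming the full comparison divides by the two errors added in quadrature (as in the rainfall worked example in these notes):

$$E_1 = E_2:\quad \sqrt{E_1^2 + E_2^2} = \sqrt{2}\,E_1 \approx 1.4\,E_1$$

$$E_1 > 2E_2:\quad E_1 < \sqrt{E_1^2 + E_2^2} < \sqrt{1 + \tfrac{1}{4}}\,E_1 \approx 1.12\,E_1$$

So dividing by 1.5 E1 slightly overstates the combined error when the two errors are equal, which makes the test a little conservative, while dividing by E1 alone understates it by no more than about 12 percent when E1 is more than twice E2.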

Let's now apply this principle to some real data.

First, back to the rainfall data. I stated earlier that the mean annual precipitation in Eugene was higher over the last 30 years than it has been over the last 100 years. Let's see if that difference is significant. To do this, we break the 100 year data set into two.

- Data set 1 runs from 1900-1969
- Data set 2 runs from 1970-2000 (with 1996 thrown out)

1900 - 1970 | 1970 - 2000 |
---|---|
mean = 39.6 inches | mean = 49.9 inches |
dispersion = 7.7 | dispersion = 8.4 |
N = 70 | N = 30 |
error in mean = 0.9 | error in mean = 1.5 |

Is the difference in means significant?

- 49.9 - 39.6 = 10.3
- $\sqrt{0.9^2 + 1.5^2} = 1.75$
- 10.3/1.75 = 5.9! Damn right something is significant.
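The same arithmetic can be reproduced from the dispersion and N of each period. This is only a sketch with illustrative variable names, and it carries the unrounded errors in the mean through the calculation, so the result lands slightly below the 5.9 obtained from the rounded values 0.9 and 1.5.

```python
import math

# Values from the two Eugene rainfall periods quoted in the notes.
early = {"mean": 39.6, "dispersion": 7.7, "n": 70}
late = {"mean": 49.9, "dispersion": 8.4, "n": 30}

def error_in_mean(sample):
    """Error in the mean = dispersion / sqrt(N)."""
    return sample["dispersion"] / math.sqrt(sample["n"])

e_early = error_in_mean(early)
e_late = error_in_mean(late)
combined = math.hypot(e_early, e_late)  # sqrt(e_early**2 + e_late**2)
units = (late["mean"] - early["mean"]) / combined

print(round(e_early, 1), round(e_late, 1), round(units, 1))
```

Either way the difference is far above 3 units, so the conclusion does not depend on the rounding.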

In fact, one can also note that the dispersion of the two data sets is similar (about 8 inches), which indicates similar year-to-year variations; it's just that the mean level has gone way up. This is weird, since Eugene is the only site in the PNW that shows this kind of trend.

Now let's focus on another example, based on Salmon Count Data at Bonneville Dam.

The actual salmon count data:

This distribution, defined by 44 points, has a mean of 358,000 salmon with a dispersion of 82,000 salmon. The error in the mean is 12,000 (82,000 divided by the square root of 44).

Points to note about the distribution:

- The dispersion is fairly large. Is this intrinsic to the population,
or a reflection of measuring errors because salmon counting is difficult
and unreliable?
- There seems to be a hard lower limit in the data of around 225,000
salmon.
- There is a tail towards very high salmon counts (> 500,000 salmon).
Tails like this have a significant impact on the mean value and might
represent some kind of anomaly in the data.
- Overall, the distribution is not very well fit by a bell curve, but
the median value of 340,000 is similar to the mean, so we can use our
principles of dispersion to calculate significant differences.


Here is the distribution of the data with the last 5 years subtracted out, so there are 39 years worth of data:

This distribution, defined by 39 points, has a mean of 368,000 salmon with a dispersion of 81,000 salmon and a mean error of 13,000.

Note: The dispersions for the 39 year sample and the 44 year sample are similar. This indicates that we have enough data to accurately determine the dispersion.

Over the last 5 years, the data are defined by an average of 278,000 salmon with a dispersion of 33,000 and a mean error of 15,000 = (33,000/(square root of 5)). Do these data show a significant decline of salmon?

Since the mean errors are similar, we can use (M1-M2)/(1.5*E1) for an approximation:

- M1-M2 = 368,000 - 278,000 = 90,000
- 1.5*E1 = 1.5*13,000 = 19,500, or about 20,000
- difference is 90,000/20,000 = 4.5 dispersion units: HIGHLY SIGNIFICANT!
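As a cross-check, here is a short sketch comparing the 1.5*E1 shortcut with the quadrature sum of the two mean errors for the salmon counts. The variable names are illustrative, and the inputs are the rounded values quoted in the notes.

```python
import math

# 39-year sample vs. the last 5 years (means and errors in the mean).
m1, e1 = 368_000, 13_000
m2, e2 = 278_000, 15_000

approx = (m1 - m2) / (1.5 * e1)          # simple approximation
exact = (m1 - m2) / math.hypot(e1, e2)   # errors added in quadrature

print(round(approx, 1), round(exact, 1))
```

With the unrounded denominator of 19,500 the shortcut gives about 4.6, and the quadrature form gives about 4.5; both agree that the decline is highly significant.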