Comparing Means and Deviations

ENVS 202 On-line Access


Now that we have an understanding of means and dispersions, we have a simple way to determine whether two distributions are fundamentally different. Again, let's use the example of rain.

              Eugene          Seattle
mean          51.5 inches     39.5 inches
dispersion    8.5             7.0

On average, does it rain significantly more in Eugene than Seattle?

Here is the wrong way to do this problem:

To compare two different distributions, as opposed to a single point against one distribution, one makes use of a tenet of statistical theory which states that

    The error in the mean is calculated by dividing the dispersion by the square root of the number of data points.

The error in the mean can be thought of as a measure of how reliably the mean value has been determined. The more samples you have, the more reliable the mean is. But the error shrinks only as the square root of the number of samples! So if you want to improve the reliability of the mean value by a factor of 10, you would have to get 100 times more samples. This can be difficult, and often you're stuck with what you've got. You then have to make use of it.
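The square-root behavior is easy to see numerically. The following is a minimal sketch, assuming a made-up dispersion of 8.0; the function name `error_in_mean` is just an illustrative label, not something from the course materials.

```python
import math

def error_in_mean(dispersion, n):
    """Error in the mean: the dispersion divided by the square root of
    the number of data points."""
    return dispersion / math.sqrt(n)

# Hypothetical dispersion of 8.0, evaluated for growing sample sizes.
for n in (25, 100, 2500):
    print(n, round(error_in_mean(8.0, n), 2))
```

Going from 25 to 2500 samples (100 times more data) cuts the error in the mean from 1.6 to 0.16, which is only a factor of 10.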

Back to the Eugene/Seattle comparison, based on the last 25 years' worth of data (so N = number of samples = 25).

                 Eugene          Seattle
mean             51.5 inches     39.5 inches
dispersion       8.1             7.0
N                25              25
error in mean    8.1/5 = 1.6     7.0/5 = 1.4

The difference in mean rainfall between Eugene and Seattle is (51.5 - 39.5) = 12 inches, which is 12/1.6 = 7.5 dispersion units of difference in the mean value (the unit here being the error in the mean).

Thus there is a highly significant difference in the mean annual rainfall between Eugene and Seattle.

Note this method is only an approximation. A more exact and proper way to compare two sample means will be given later.
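The Eugene/Seattle arithmetic can be reproduced in a few lines. This is only a sketch of the approximate method just described, using the 25-year numbers quoted above; the variable names are illustrative assumptions.

```python
import math

def error_in_mean(dispersion, n):
    """Dispersion divided by the square root of the number of samples."""
    return dispersion / math.sqrt(n)

# Values quoted in the text for the 25-year comparison.
eugene_mean, eugene_disp = 51.5, 8.1
seattle_mean, seattle_disp = 39.5, 7.0
n_years = 25

e_eugene = error_in_mean(eugene_disp, n_years)    # 8.1 / 5
e_seattle = error_in_mean(seattle_disp, n_years)  # 7.0 / 5

difference = eugene_mean - seattle_mean           # 12 inches
units = difference / e_eugene                     # difference in units of Eugene's error

print(round(e_eugene, 2), round(e_seattle, 2), round(units, 1))
```

Without rounding the error to 1.6 first, the ratio comes out near 7.4 rather than 7.5; the conclusion is the same either way.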

Comparing Two Sample Means - Find the difference of the two sample means in units of sample mean errors. This works as follows:

  • Sample 1 has mean M1 and error in the mean E1
  • Sample 2 has mean M2 and error in the mean E2

    Difference in terms of significance is:

    (M1 - M2) / sqrt(E1^2 + E2^2)

    In general, in more qualitative terms:

    • If the difference in means between two samples is less than 2.0 dispersion units, the two samples are not significantly different.

    • If the difference in means between two samples is between 2.0 and 2.5 dispersion units, the two samples are marginally different.

    • If the difference in means between two samples is between 2.5 and 3.0 dispersion units, the two samples are significantly different.

    • If the difference in means between the two samples is more than 3.0 dispersion units, the two samples are highly significantly different.

    Simple Approximation:

    • If E1 and E2 are similar then use (M1-M2)/(1.5*E1)

    • If E1 > 2*E2 then use (M1-M2)/E1
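The rules above can be sketched as a pair of small functions. This is a hedged illustration, not the course's own code: it assumes the two errors in the mean are combined as the square root of the sum of their squares, and the names `significance`, `classify`, and `approx_similar_errors` are made up for this example.

```python
import math

def significance(m1, e1, m2, e2):
    """Difference of two sample means in units of the combined error.
    Assumption: the errors combine as sqrt(E1^2 + E2^2)."""
    return abs(m1 - m2) / math.sqrt(e1**2 + e2**2)

def approx_similar_errors(m1, m2, e1):
    """Simple approximation for the case where E1 and E2 are similar."""
    return abs(m1 - m2) / (1.5 * e1)

def classify(units):
    """Map a difference in dispersion units onto the qualitative scale."""
    if units < 2.0:
        return "not significantly different"
    if units < 2.5:
        return "marginally different"
    if units <= 3.0:
        return "significantly different"
    return "highly significantly different"

# Eugene vs. Seattle, using the means and errors in the mean quoted earlier.
units = significance(51.5, 1.6, 39.5, 1.4)
print(round(units, 1), classify(units))
print(approx_similar_errors(51.5, 39.5, 1.6))
```

For the Eugene/Seattle numbers the combined-error version gives about 5.6 and the simple approximation gives 5.0; both land well above the 3.0 threshold.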

    Let's now apply this principle to some real data.

    First, back to the rainfall data. I stated earlier that the mean annual precipitation in Eugene was higher over the last 30 years than it has been over the last 100 years. Let's see if that difference is significant. To do this, we break the 100-year data set into two.

    • Data set 1 runs from 1900-1969
    • Data set 2 runs from 1970-2000 (with 1996 thrown out)

                     1900 - 1969     1970 - 2000
    mean             39.6 inches     49.9 inches
    dispersion       7.7             8.4
    N                70              30
    error in mean    0.9             1.5

    Is the difference in means significant?

    • 49.9 - 39.6 = 10.3
    • square root of (0.9^2 + 1.5^2) = 1.75
    • 10.3/1.75 = 5.9! Damn right something is significant.
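As a check on the arithmetic, here is a short sketch that recomputes the comparison from the means, dispersions, and sample sizes quoted above. The variable names are illustrative, and the errors are combined as the square root of the sum of their squares.

```python
import math

# Values quoted in the text for the two halves of the Eugene record.
mean_early, disp_early, n_early = 39.6, 7.7, 70   # 1900-1969
mean_late, disp_late, n_late = 49.9, 8.4, 30      # 1970-2000, 1996 excluded

e_early = disp_early / math.sqrt(n_early)   # error in the mean, early period
e_late = disp_late / math.sqrt(n_late)      # error in the mean, late period

combined = math.sqrt(e_early**2 + e_late**2)
units = (mean_late - mean_early) / combined

print(round(e_early, 2), round(e_late, 2), round(combined, 2), round(units, 1))
```

Carrying the unrounded errors through gives roughly 5.8 rather than 5.9, which does not change the verdict.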

    In fact, one can also note that the dispersion of the two data sets is similar (about 8 inches), which indicates similar year-to-year variations; it's just that the mean level has gone way up. This is weird, since Eugene is the only site in the PNW that shows this kind of trend.

    Now let's focus on another example, based on Salmon Count Data at Bonneville Dam.

    The actual salmon count data:

    This distribution, defined by 44 points, has a mean of 358,000 salmon with a dispersion of 82,000 salmon. The error in the mean is 12,000 (82,000 / square root of 44).

    Points to note about the distribution:

    1. The dispersion is fairly large. Is this intrinsic to the population, or a reflection of measuring errors because salmon counting is difficult and unreliable?

    2. There seems to be a hard lower limit in the data of around 225,000 salmon.

    3. There is a tail towards very high salmon counts (> 500,000 salmon). Tails like this have a significant impact on the mean value and might represent some kind of anomaly in the data.

    4. Overall, the distribution is not especially well fit by a bell curve, but the median value of 340,000 is similar to the mean, so we can use our principles of dispersion to calculate significant differences.

There has been some speculation, and some data, suggesting that there has been a recent decline of salmon in the Columbia River System. What do these data say?

Here is the distribution of the data with the last 5 years removed, so there are 39 years' worth of data:

This distribution, defined by 39 points, has a mean of 368,000 salmon with a dispersion of 81,000 salmon and a mean error of 13,000.

Note: The dispersions for the 39-year sample and the 44-year sample are similar. This indicates that we have enough data to accurately determine the dispersion.

Over the last 5 years, the data are defined by an average of 278,000 salmon with a dispersion of 33,000 and a mean error of 15,000 (33,000 / square root of 5). Do these data show a significant decline of salmon?

Since the mean errors are similar we can use (M1-M2)/1.5E1 for an approximation:

  • M1-M2 = 368,000 - 278,000 = 90,000
  • 1.5E1 = 1.5*13,000 = 19,500, or about 20,000

  • difference is 90,000/20,000 = 4.5 dispersion units: HIGHLY SIGNIFICANT!
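The salmon comparison can also be run both ways, with the 1.5*E1 shortcut and with the two errors combined as the square root of the sum of their squares. This is a rough sketch using the means, dispersions, and sample sizes quoted above; the variable names are illustrative assumptions.

```python
import math

# Values quoted in the text for the Bonneville Dam counts.
mean_39, disp_39, n_39 = 368_000, 81_000, 39   # all but the last 5 years
mean_5, disp_5, n_5 = 278_000, 33_000, 5       # last 5 years

e_39 = disp_39 / math.sqrt(n_39)   # error in the mean, 39-year sample
e_5 = disp_5 / math.sqrt(n_5)      # error in the mean, 5-year sample

diff = mean_39 - mean_5

approx = diff / (1.5 * e_39)                 # shortcut for similar errors
full = diff / math.sqrt(e_39**2 + e_5**2)    # errors combined in quadrature

print(round(e_39), round(e_5), round(approx, 1), round(full, 1))
```

Both versions come out near 4.6 when the errors are not rounded first, so the shortcut and the fuller calculation agree that the decline is highly significant.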
