Comparing Means and Deviations

ENVS 202 On-line Access

In-class exercise. Everyone should do the following:

  1. Draw a number from the Sacred Box of Sampling in the front of the Lecture room
  2. Return to your seat with that number and note your seat number
  3. Be prepared to tell the instructor your number if your seat number is randomly called
  4. That is all

In the Sacred Box of Sampling there are 170 numbers, which define this intrinsic distribution.

The point of the in-class exercise is to demonstrate that a random sampling process applied to an intrinsic distribution which is normally distributed (i.e. a bell curve) will provide a robust estimate of the mean and dispersion after just a small number of samples. While this can be proved with calculus (and in statistics is known as the Central Limit Theorem), the in-class example is probably the best means of demonstrating it.
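The exercise can also be imitated on a computer. The following is only a sketch: the 170 numbers in the real box are not listed here, so the population below is a hypothetical stand-in drawn from a bell curve with an assumed mean of 50 and an assumed dispersion of 10. The function name sample_estimates is likewise made up for illustration.

```python
import random
import statistics

def sample_estimates(population, sizes, seed=0):
    """Return {n: (sample mean, sample dispersion)} for one random draw of each size n."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        # Draw without replacement, like pulling slips of paper from the box.
        draw = rng.sample(population, n)
        results[n] = (statistics.mean(draw), statistics.stdev(draw))
    return results

# Hypothetical stand-in for the box: 170 numbers from a bell curve.
# The mean (50) and dispersion (10) are assumptions, not the real box values.
box_rng = random.Random(42)
population = [box_rng.gauss(50, 10) for _ in range(170)]
pop_mean = statistics.mean(population)
pop_disp = statistics.pstdev(population)

estimates = sample_estimates(population, sizes=[5, 10, 20, 40, 170])
for n, (m, s) in estimates.items():
    print(f"N = {n:3d}   mean = {m:6.2f}   dispersion = {s:6.2f}")
print(f"population   mean = {pop_mean:6.2f}   dispersion = {pop_disp:6.2f}")
```

Changing the seed gives a different set of draws, which is the computer equivalent of calling on different seats.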

For this sample as a whole:

The point of the demo in the class is to see how close we can come to recovering this population mean and dispersion from the sample mean and dispersion.

Comparing Sample Means - What is Significance?

Summary from last time:

In general, we map dispersion units on to probabilities

For instance:

The calculation of dispersion in a distribution is very important because it represents a uniform way to determine probabilities and therefore to determine if some event in the data is expected (i.e. probable) or is significantly different than expected (i.e. improbable).
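For a bell curve, the mapping from dispersion units to probabilities can be computed directly. This is a minimal sketch that assumes the data are normally distributed; the function names prob_within and prob_above are chosen here for illustration.

```python
import math

def prob_within(k):
    """Probability that a normally distributed value lies within k dispersion units of the mean."""
    return math.erf(k / math.sqrt(2))

def prob_above(k):
    """Probability that a value lies more than k dispersion units above the mean (one side only)."""
    return 0.5 * math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} units: {prob_within(k):.4f}   more than {k} units above: {prob_above(k):.5f}")
```

An event several dispersion units from the mean has a small probability under these assumptions, which is what makes it "significant".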

Now that we have an understanding of means and dispersions we have a simple way for determining if two distributions are fundamentally different. Again let's use the example of rain.

                  Eugene         Seattle
mean              51.5 inches    39.5 inches
dispersion        8.5 inches     7.0 inches

On average, does it rain significantly more in Eugene than Seattle?

Here is the wrong way to do this problem:

A proper comparison makes use of a tenet of statistical theory which states that

    The error in the mean is calculated by dividing the dispersion by the square root of the number of data points.

                  Eugene         Seattle
mean              51.5 inches    39.5 inches
dispersion        8.1 inches     7.0 inches
N                 25             25
error in mean     8.1/5 = 1.6    7.0/5 = 1.4

The difference in mean rainfall between Eugene and Seattle is (51.5 - 39.5) = 12 inches, which is 12/1.6 = 7.5 dispersion units, where the unit here is the error in the mean (1.6 inches) rather than the dispersion of the yearly values.

Thus there is a highly significant difference in the mean annual rainfall between Eugene and Seattle.
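The arithmetic in this comparison can be sketched in a few lines. The function name error_in_mean is chosen here for illustration; the numbers are the ones quoted in the table.

```python
import math

def error_in_mean(dispersion, n):
    """Error in the mean: the dispersion divided by the square root of the number of data points."""
    if n <= 0:
        raise ValueError("n must be a positive number of data points")
    return dispersion / math.sqrt(n)

# The two rainfall samples, each with N = 25 years of data.
error_a = error_in_mean(8.1, 25)  # sample with mean 51.5 inches
error_b = error_in_mean(7.0, 25)  # sample with mean 39.5 inches

# Difference of the means in units of the larger error in the mean.
difference = 51.5 - 39.5
units = difference / max(error_a, error_b)

print(f"errors in the mean: {error_a:.2f} and {error_b:.2f} inches")
print(f"difference: {difference:.1f} inches = {units:.1f} units of the error in the mean")
```

Without rounding the error to 1.6, the result comes out slightly below 7.5, which does not change the conclusion.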

Note this method is only an approximation. A more exact and proper way to compare two sample means will be given later.

Another way to look at this rainfall comparison is as follows:

We have already determined that 65 inches is not a significant amount of rainfall in Eugene compared to the normal value of 51.5 inches. Would 65 inches be a significant amount of rain in Seattle?

For the case of Seattle, 65 inches is 65 - 39.5 = 25.5 inches above normal. The dispersion in the Seattle data is 7 inches, so 25.5 inches is 25.5/7 = 3.6 dispersion units above the mean. This is highly significant, which again reinforces the notion that there is a significant difference in mean rainfall between Eugene and Seattle (note also this difference in community web pages).
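The same question can be asked of any single rainfall total. The sketch below expresses 65 inches in dispersion units for each city, using the means and dispersions quoted earlier; the function name dispersion_units is chosen here for illustration.

```python
def dispersion_units(value, mean, dispersion):
    """How many dispersion units a single value lies above (+) or below (-) the mean."""
    return (value - mean) / dispersion

# 65 inches of rain in one year, compared with each city's normal value.
eugene = dispersion_units(65, mean=51.5, dispersion=8.1)
seattle = dispersion_units(65, mean=39.5, dispersion=7.0)

print(f"Eugene:  {eugene:.1f} dispersion units above the mean")
print(f"Seattle: {seattle:.1f} dispersion units above the mean")
```

Note that this divides by the dispersion, not by the error in the mean, because a single year is being compared with the distribution of yearly values.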

Comparing Two Sample Means - Find the difference of the two sample means in units of sample mean errors. This works as follows:

  • Sample 1 has mean M1 and error in the mean E1
  • Sample 2 has mean M2 and error in the mean E2

    Difference in terms of significance is:

    Simple Approximation:

    • If E1 and E2 are similar then use (M1-M2)/(1.5*E1)

    • If E1 > 2*E2 then use (M1-M2)/E1
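The two rules of thumb above can be compared with the standard result for two independent samples, in which the errors are combined in quadrature: (M1-M2)/sqrt(E1^2 + E2^2). That this is the formula the approximations stand in for is an assumption made here, although it matches both rules (sqrt(2) is close to 1.5, and the smaller error contributes little when one error is more than twice the other). The function names are hypothetical, and the mirrored case E2 > 2*E1 is added for symmetry.

```python
import math

def significance(m1, e1, m2, e2):
    """Difference of two sample means in units of the combined error (errors added in quadrature)."""
    return (m1 - m2) / math.sqrt(e1**2 + e2**2)

def significance_approx(m1, e1, m2, e2):
    """The simple approximation from the two rules above."""
    if e1 > 2 * e2:
        return (m1 - m2) / e1        # E1 dominates
    if e2 > 2 * e1:
        return (m1 - m2) / e2        # mirrored case, added here for symmetry
    return (m1 - m2) / (1.5 * e1)    # E1 and E2 are similar

# Made-up numbers, purely to compare the two versions.
for e1, e2 in [(1.0, 1.0), (3.0, 1.0)]:
    exact = significance(10.0, e1, 4.0, e2)
    approx = significance_approx(10.0, e1, 4.0, e2)
    print(f"E1 = {e1}, E2 = {e2}: quadrature = {exact:.2f}, approximation = {approx:.2f}")
```

The approximation slightly understates the significance when the errors are equal and slightly overstates it when one error dominates, so it is adequate for a first estimate.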

    Let's now apply this principle to some real data. The actual salmon count data:

    This distribution, defined by 44 points, has a mean of 358,000 salmon with a dispersion of 82,000 salmon. The error in the mean is 12,000 (82,000 divided by the square root of 44).

    Points to note about the distribution:

    1. The dispersion is fairly large. Is this intrinsic to the population, or a reflection of measuring errors because salmon counting is difficult and unreliable?

    2. There seems to be a hard lower limit in the data of around 225,000 salmon

    3. There is a tail towards very high salmon counts (> 500,000 salmon). Tails like this have a significant impact on the mean value and might represent some kind of anomaly in the data.

    4. Overall, the distribution is not very well fit by a bell curve, but the median value of 340,000 is similar to the mean, so we can use our principles of dispersion to calculate significant differences.

There has been some speculation, and some data, suggesting that there has been a recent decline of salmon in the Columbia River System. What do these data say?

Here is the distribution of the data with the last 5 years subtracted out, so there are 39 years worth of data:

This distribution, defined by 39 points, has a mean of 368,000 salmon with a dispersion of 81,000 salmon and a mean error of 13,000.

Note: The dispersions for the 39-year sample and the 44-year sample are similar. This indicates that we have enough data to accurately determine the dispersion.

Over the last 5 years, the data are defined by an average of 278,000 salmon with a dispersion of 33,000 and a mean error of 15,000 (33,000 divided by the square root of 5). Do these data show a significant decline of salmon?

Since the mean errors are similar we can use (M1-M2)/1.5E1 for an approximation:

  • M1-M2 = 368,000 - 278,000 = 90,000
  • 1.5*E1 = 1.5*13,000 = 19,500, or about 20,000

  • difference is 90,000/20,000 = 4.5 dispersion units: HIGHLY SIGNIFICANT!
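The salmon comparison can be reproduced from the quoted means, dispersions and numbers of years. This is a sketch of the same arithmetic without the intermediate rounding; the quadrature line is an added cross-check using the standard combination of two independent errors, not part of the calculation above.

```python
import math

def error_in_mean(dispersion, n):
    """Error in the mean: the dispersion divided by the square root of the number of data points."""
    return dispersion / math.sqrt(n)

# (mean, dispersion, number of years) as quoted in the text.
earlier_mean, earlier_disp, earlier_n = 368_000, 81_000, 39
recent_mean, recent_disp, recent_n = 278_000, 33_000, 5

e1 = error_in_mean(earlier_disp, earlier_n)  # about 13,000
e2 = error_in_mean(recent_disp, recent_n)    # about 15,000
difference = earlier_mean - recent_mean      # 90,000

approx = difference / (1.5 * e1)                    # simple approximation
quadrature = difference / math.sqrt(e1**2 + e2**2)  # cross-check, errors added in quadrature

print(f"E1 = {e1:,.0f}   E2 = {e2:,.0f}")
print(f"approximation: {approx:.1f}   quadrature: {quadrature:.1f}")
```

Both versions land close to the rounded hand calculation, so the conclusion does not depend on which one is used.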
