## Introduction to Sampling

Sampling is the problem of accurately acquiring the data necessary to form a representative view of the population being studied. This is much more difficult than is generally realized.

Overall Methodology:

• State the objectives of the survey
• Define the target population
• Define the data to be collected
• Define the variables to be determined
• Define the required precision and accuracy
• Define the measurement "instrument"

The Concept of Randomness:

Distributions:

When you form a sample, you can represent it by a plotted distribution known as a histogram. A histogram shows the frequency of occurrence of a certain variable within specified ranges (bins). Example histograms are shown below.

[Figure: a histogram with broad bins]

[Figure: a histogram with narrow bins]

How to construct a histogram:

Here is an example of how, exactly, you do this.

Step 1: Define the target population and the problem: in this case we are interested in the distribution of tree diameters in a pine forest.

Step 2: Define the measurement: we define a standard tree diameter as one measured at a height of 1.0 meter above the ground.

Step 3: Define the sampling procedure: there are thousands of trees in the forest. We decide (for reasons that will become clear later) to randomly sample 50 trees. How do we do this? We go to the middle of the forest with a deck of cards. The four suits represent North, South, East, and West, and the rank of the card represents a distance in meters.

We shuffle the deck and draw the 10 of hearts - this means we go 10 meters west and measure the first tree we encounter at that position. Then we shuffle and draw again, and so on. This procedure, of course, assumes that the middle of the forest is just like any other place in the forest.
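The card-drawing procedure above can be sketched in code. This is a toy sketch, not the original method: the text fixes hearts → west, but the other three suit-to-direction assignments and the ace-through-king = 1-13 meters rule are assumptions made here for illustration.

```python
import random

# The text fixes hearts -> west; the other three assignments are assumed.
SUIT_TO_DIRECTION = {"hearts": "west", "diamonds": "east",
                     "spades": "north", "clubs": "south"}
RANKS = list(range(1, 14))  # ace=1 ... king=13, taken as distance in meters

def draw_sampling_instruction(rng=random):
    """Draw one card from a shuffled deck: a direction and a distance."""
    suit = rng.choice(list(SUIT_TO_DIRECTION))
    rank = rng.choice(RANKS)
    return SUIT_TO_DIRECTION[suit], rank

# Generate instructions for randomly sampling 50 trees.
instructions = [draw_sampling_instruction() for _ in range(50)]
print(instructions[0])  # each entry is a (direction, distance-in-meters) pair
```

Each drawn pair tells you which way to walk and how far before measuring the first tree you encounter.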

After we tell our statistics boss about this, they say it's a bad assumption (because they never took ENVS 202) and make us go to 24 separate locations in the forest to measure 50 tree diameters at each location. So we have a total of 1200 measurements.

Step 4: Collect the data: in this case we are very thorough and use a measuring device that records tree diameters to an accuracy of 0.001 centimeters. Here are the data in tabular form for the first site, where 50 trees were measured.

```
62.653 63.375 63.241 63.574 62.061
61.010 49.314 56.207 61.152 56.125
57.055 56.162 63.174 59.219 60.983
56.327 61.399 64.470 56.693 56.905
66.167 67.443 66.595 55.845 65.250
62.309 64.621 56.444 53.981 57.540
49.154 58.910 59.146 68.144 59.853
58.584 61.382 60.999 51.388 58.044
58.041 65.309 56.949 62.992 54.460
59.850 56.871 56.909 60.206 58.425
```

As you can see, just looking at the numbers this way doesn't tell you a lot.

Step 5: Bin the data: since we are interested in the distribution of tree diameters, we count the measurements in bins of fixed size (say 5 cm). Sorting all 1200 measurements into bins of width 5 cm gives:

```
Bin Limits      |  Frequency     |   Proportion
------------------------------------------------
30.00 to 34.99  |       0        |     0.000
35.00 to 39.99  |       0        |     0.000
40.00 to 44.99  |       0        |     0.000
45.00 to 49.99  |      22        |     0.018
50.00 to 54.99  |     147        |     0.123
55.00 to 59.99  |     402        |     0.335
60.00 to 64.99  |     428        |     0.357
65.00 to 69.99  |     185        |     0.154
70.00 to 74.99  |      15        |     0.012
75.00 to 79.99  |       0        |     0.000
80.00 to 84.99  |       1        |     0.001
85.00 +         |       0        |     0.000
-------------------------------------------------
Totals          |    1200        |     1.000
```

Step 6: We construct the histogram by plotting the frequency in each bin versus the bin limits. This is also known as a bar graph, and it is shown here:
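Steps 5 and 6 can be reproduced in a few lines of code. This sketch bins the 50 site-1 diameters from Step 4 into the same 5 cm bins and prints a text version of the histogram:

```python
# The 50 site-1 diameters from Step 4 (cm), binned into 5 cm bins.
diameters = [
    62.653, 63.375, 63.241, 63.574, 62.061, 61.010, 49.314, 56.207, 61.152, 56.125,
    57.055, 56.162, 63.174, 59.219, 60.983, 56.327, 61.399, 64.470, 56.693, 56.905,
    66.167, 67.443, 66.595, 55.845, 65.250, 62.309, 64.621, 56.444, 53.981, 57.540,
    49.154, 58.910, 59.146, 68.144, 59.853, 58.584, 61.382, 60.999, 51.388, 58.044,
    58.041, 65.309, 56.949, 62.992, 54.460, 59.850, 56.871, 56.909, 60.206, 58.425,
]

BIN_WIDTH = 5.0
counts = {}
for d in diameters:
    lo = BIN_WIDTH * int(d // BIN_WIDTH)   # lower edge of the bin containing d
    counts[lo] = counts.get(lo, 0) + 1

# Print the frequency, the proportion, and a text "bar" for each bin.
for lo in sorted(counts):
    n = counts[lo]
    print(f"{lo:5.2f} to {lo + BIN_WIDTH - 0.01:5.2f} | {n:3d} | "
          f"{n / len(diameters):.3f} | {'#' * n}")
```

For this single site most trees land in the 55-60 and 60-65 cm bins, echoing the shape of the full 1200-measurement table above.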

A critical question to ask now is: what is the minimum sample size required to accurately represent a distribution? That depends on the intrinsic shape of the distribution! As we will learn later, for distributions that are intrinsically bell shaped (called Normal or Gaussian distributions), 25-30 random measurements are usually good enough.
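This rule of thumb can be checked with a small simulation. The sketch below (illustrative only, with an assumed population of mean 60 cm and standard deviation 4 cm, loosely modeled on the tree data) draws many random samples of size 30 and looks at how far the sample mean typically lands from the true mean:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable
TRUE_MEAN, TRUE_SD = 60.0, 4.0   # assumed bell-shaped population (cm)

errors = []
for _ in range(2000):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(30)]
    errors.append(abs(statistics.mean(sample) - TRUE_MEAN))

# The typical error of a 30-measurement mean is around SD/sqrt(30) ~ 0.7 cm,
# i.e. about 1% of the mean -- usually good enough.
print(f"median |error| of a 30-tree sample mean: {statistics.median(errors):.2f} cm")
```

With only 30 measurements the estimated mean is already within about one centimeter of the truth, which is why 25-30 random measurements suffice for bell-shaped distributions.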

This curve has the following characteristics:

• A well defined peak or most probable value - this is the sample mean.

• A width, known as the sample dispersion or the standard deviation (listed as SD in the above). This dispersion may also be denoted by the term sigma or the Greek letter σ.

• A tail in which there are not very many events.

The reason that we will come to simultaneously love and hate this curve is that a great many quantities in nature are accurately described by this kind of frequency distribution.

Note: student exam scores usually fall on a bell curve, and your grade is then determined by your position on that curve relative to the average. We will put this into practice in this class.

[Figure: a bell curve with a large sigma (dispersion)]

[Figure: bell curves with smaller dispersions]

Simulations:

Some Typical Problems Associated with Sampling:

• Sample is of insufficient size: means that you weren't very clever when you defined the sample

• The sample is biased: often biases can be subtle and can take time to find and correct. Control samples are usually not an adequate substitute as they could be biased as well

• The wrong variables were measured: the collected data are measuring secondary effects not primary effects. This happens a lot!

• The sample is censored: there exists a population which is below the threshold of your measuring technique or apparatus. This is very difficult to deal with.

• The data precision is low: you have only low signal-to-noise results

• The intrinsic measuring accuracy of the instrument is unknown, which leads to an artificial spread in the true distribution. Test scores are plagued by this problem.

Examples:

• Decline in Salmon Runs: What constitutes a representative measure?
• How to best characterize global warming?
• How to measure deforestation/clear cutting? From the ground? From the Air? From satellites?
• How to measure how much old growth is needed for spotted owl environment? 500 Acres? What about predators?

Summary:

• The Sample Mean - numerical measure of the average or most probable value in some distribution. It can be measured for any distribution, but knowing the mean value alone for some sample is not very meaningful.

• The Sample Distribution - Plot of the frequency of occurrence of ranges of data values in the sample. The distribution needs to be represented by a reasonable number of data intervals (counting in bins).

• The Sample Dispersion - Numerical measure of the range of the data about the mean value. Defined such that +/- 1 dispersion unit contains 68% of the sample, +/- 2 dispersion units contains 95% and +/- 3 dispersion units contains 99.7%. This is schematically shown below:

In general, we map dispersion units on to probabilities

For instance:

• The Probability that some event will be greater than 0 dispersion units above the mean is 50%
• The Probability that some event will be greater than 1 dispersion unit above the mean is about 16%
• The Probability that some event will be greater than 2 dispersion units above the mean is about 2%
• The Probability that some event will be greater than 3 dispersion units above the mean is 0.1% (1 in 1000)

The calculation of dispersion in a distribution is very important because it represents a uniform way to determine probabilities and therefore to determine whether some event in the data is expected (i.e. probable) or significantly different than expected (i.e. improbable). This is how your exam will be graded.
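The mapping from dispersion units to probabilities above can be computed directly for a normal distribution. This sketch uses the standard upper-tail formula for a Gaussian, built from the complementary error function in Python's standard library:

```python
import math

def prob_above(k):
    """Probability that a normally distributed value lies more than k
    dispersion units (sigma) above the mean: the upper-tail area."""
    return 0.5 * math.erfc(k / math.sqrt(2))

for k in (0, 1, 2, 3):
    print(f"P(> {k} sigma above the mean) = {prob_above(k):.4f}")
# -> 0.5000, 0.1587, 0.0228, 0.0013: the 50%, ~16%, ~2%, ~0.1% values above
```

The same function gives the two-sided 68%, 95%, and 99.7% containment figures: for example, 1 − 2·prob_above(1) ≈ 0.68.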

So now let's apply this to the issue of global warming:

From that data, the following frequency distribution can be constructed.

The fitted normal distribution (which the tool will do for you automatically) has the following characteristics:

• mean = -0.1 degree
• dispersion = 0.25 degrees

Given that information, how discrepant is the year 1998 in which the temperature was +0.8?

1. +0.8 is (0.8 - (-0.1)) = +0.9 degrees above the mean value (which is all a newspaper story would say).

2. The number of dispersion units above the mean is therefore 0.9/0.25 = 3.6.

3. Well, we already know that the probability of a 3.6 sigma event is lower than 1 in 1000 (remember, there are only 150 years worth of data). We can use a probability calculator to figure out the probability that corresponds to 3.6 sigma: it is about 0.00016, or slightly more than 1 part in 10,000.

4. This means that the global temperature of +0.8 degrees in 1998 was not a random, statistical event. Given only 150 years worth of data, we would not expect this event to have occurred by chance.
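The whole 1998 calculation fits in a few lines, using the fitted mean and dispersion quoted above and the same Gaussian upper-tail formula:

```python
import math

# Fitted normal distribution from the text: mean and dispersion in degrees.
mean, sigma = -0.1, 0.25
temp_1998 = 0.8

z = (temp_1998 - mean) / sigma          # dispersion units above the mean
p = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail probability of a z-sigma event

print(f"z = {z:.1f} sigma, P = {p:.5f}")  # z = 3.6 sigma, P = 0.00016
```

With 150 years of data, an event this improbable (about 1.6 in 10,000 per year) is not expected to occur by chance even once.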