3.1. Samples & Populations
In statistics, we generally want to study a population. You can think of a population as the entire collection of persons, things, or objects under study. For example, a population could be all current MBA students at accredited U.S. universities. To study the larger population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population. Data are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical technique. If you wished to compute the overall grade point average at your school, it would make sense to select a sample of students who attend the school. The data collected from the sample would be the students’ grade point averages. In presidential elections, opinion poll samples of 1,000 to 2,000 people are taken. The opinion poll is supposed to represent the views of the people in the entire country. Manufacturers of canned carbonated drinks take samples to determine if the manufactured 16-ounce containers do indeed contain 16 ounces of the drink.
From the sample data, we can calculate a statistic. A statistic is a number that is a property of the sample. Common sample statistics include sample means, sample proportions, and sample variances. For example, if we consider one math class to be a sample of the population of all math classes, then the average number of points earned by students in that one math class at the end of the term is an example of a statistic. The statistic can also be used as an estimate of a population parameter. A parameter is a number that is a property of the population. Since we considered all math classes to be the population, the average number of points earned per student over all the math classes is an example of a parameter. One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. The accuracy depends on how well the sample represents the population. To be representative, a sample must reflect the characteristics of the population.
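The distinction between a statistic and a parameter can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population of end-of-term points is simulated (normally distributed around 75 with a spread of 10, clipped to the range 0 to 100), and the function names `simulate_population` and `sample_mean` are hypothetical helpers, not part of any library.

```python
import random
import statistics


def simulate_population(size, seed=0):
    """Simulate end-of-term points for every student (made-up numbers)."""
    rng = random.Random(seed)
    # Assumption: points are roughly normal around 75, clipped to 0..100.
    return [min(100.0, max(0.0, rng.gauss(75, 10))) for _ in range(size)]


def sample_mean(population, n, seed=1):
    """Draw a simple random sample of size n and return its mean."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)  # sampling without replacement
    return statistics.mean(sample)


population = simulate_population(10_000)

# Parameter: a property of the whole population (usually unknown in practice).
parameter = statistics.mean(population)

# Statistic: a property of the sample, used to estimate the parameter.
statistic = sample_mean(population, 200)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

In a real study only the statistic would be available. Here the parameter can be computed because the whole population is simulated, which makes it possible to see how close the estimate comes.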
As an example, suppose that we wanted to investigate whether smoking during pregnancy leads to lower birth weight of babies. To do so, we would compare a random sample of weights of newborn babies whose mothers smoke with a random sample of weights of newborn babies of non-smoking mothers. By analyzing the sample data, we would hope to be able to draw conclusions about the effect of smoking during pregnancy on birth weight for all babies (i.e. the population). The process of using a random sample to draw conclusions about a population is called statistical inference.
If we do not have a random sample, then sampling bias can invalidate our statistical results. For example, the birth weights of twins are generally lower than the weights of babies born singly. If all the non-smoking mothers in the sample were giving birth to twins, and all the smoking mothers were giving birth to single babies, then any difference in birth weight between the two groups could be due to twin births rather than to smoking. The conclusions we draw about the effects of smoking in pregnancy would therefore not necessarily be correct, because they are affected by sampling bias.
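The twin example can be sketched as a simulation. Everything below is assumed for illustration and is not real data: singletons average 3400 g and twins 2400 g with a spread of 500 g, 3% of births in the unbiased groups are twins, and smoking is given no effect at all on weight. The helper names `birth_weight`, `make_group`, and `mean_difference` are hypothetical.

```python
import random
import statistics

rng = random.Random(42)


def birth_weight(is_twin):
    """Return a simulated birth weight in grams (invented numbers)."""
    mean = 2400 if is_twin else 3400
    return rng.gauss(mean, 500)


def make_group(n, twin_rate):
    """Simulate n births in which each baby is a twin with probability twin_rate."""
    return [birth_weight(rng.random() < twin_rate) for _ in range(n)]


def mean_difference(a, b):
    """Difference between the mean weights of two groups."""
    return statistics.mean(a) - statistics.mean(b)


# Unbiased comparison: both groups have the same (assumed) twin rate.
smokers_random = make_group(500, 0.03)
nonsmokers_random = make_group(500, 0.03)

# Biased comparison: smokers all have singletons, non-smokers all have twins.
smokers_biased = make_group(500, 0.0)
nonsmokers_biased = make_group(500, 1.0)

diff_random = mean_difference(smokers_random, nonsmokers_random)
diff_biased = mean_difference(smokers_biased, nonsmokers_biased)

print(f"smokers minus non-smokers, comparable groups: {diff_random:.0f} g")
print(f"smokers minus non-smokers, biased groups:     {diff_biased:.0f} g")
```

Because smoking has no effect in this simulation, the difference between the comparable groups stays close to zero. The biased groups show a difference of roughly 1000 g that comes entirely from how the samples were composed, not from smoking.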