Descriptive Statistics
Let’s begin by testing your knowledge of descriptive statistics. Below is an example of a discrete random variable: family size. The attributes (values) of the variable are the numbers of persons in the family.
Family Size Values
1
2
3
4
5
6
7
8 or more
You draw a sample of 50 families from Alexandria, LA, and observe family size. What are the attributes (possible values) of family size?
When you collect data on your variables, you want to find ways to present your findings. Descriptive, or summary, statistics help you present a snapshot of how the values of the variable are distributed in your sample. This is different from inferential statistics, in which you express your degree of confidence in how well the data you collected from a sample represent the whole population from which the sample was taken.
For now, let’s review some of the ways you can describe your data.
You ask family 1 how many members are in the family and record the answer as x1, the first data point in the data set; here, x1 = 2.
You then ask family 2 and record the answer as x2; for the second family, x2 = 9. You continue in this way through family 50, whose value is x50.
What can you tell by looking at the raw data alone? Actually, very little since it is hard to detect any patterns in a list of 50 raw data entries.
A frequency distribution table makes the patterns easier to see. You would list the possible values of the variable “family size” in the first column. In the second column, you would list the number of families of that size. In the third column, you would report the percentage of families of that size. A percentage is calculated by dividing the number of families of that size by the sample size of 50 and multiplying the result by 100.
Here are the frequency and percentage distributions for the data on family size. Note how the table is constructed so that all the information is clear to your audience. The numbers in your frequency column should add up to the sample size. The percentages in your percentage column should add up to 100%.
Family Size   Frequency   Percentage
1                     5         10.0
2                    21         42.0
3                    12         24.0
4                     4          8.0
5                     3          6.0
6                     2          4.0
7                     1          2.0
8+                    2          4.0
Total                50        100.0
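The frequency and percentage columns can be reproduced with a short Python sketch. It is an illustration, not part of the original exercise: the raw list of 50 family sizes is not given in the text, so the hypothetical `sizes` list is rebuilt from the published counts, with the open-ended "8 or more" category coded as 8.

```python
from collections import Counter

# Assumption: the raw data are not listed in the text, so this list is
# rebuilt from the published counts; "8 or more" is coded as 8.
published_counts = {1: 5, 2: 21, 3: 12, 4: 4, 5: 3, 6: 2, 7: 1, 8: 2}
sizes = [size for size, count in published_counts.items() for _ in range(count)]

n = len(sizes)                 # sample size
frequency = Counter(sizes)     # number of families observed at each size

# Percentage = (frequency / sample size) * 100
percentage = {size: 100 * count / n for size, count in frequency.items()}

for size in sorted(frequency):
    print(f"{size:>5}  {frequency[size]:>3}  {percentage[size]:5.1f}")
print(f"Total  {sum(frequency.values()):>3}  {sum(percentage.values()):5.1f}")
```

The last line is a quick check that the frequencies sum to the sample size and the percentages sum to 100.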
With the table, you get a much better sense of how family size varies in the sample and which values are most common. Most families in the sample (33 of 50, or 66%) have 2 or 3 members.
Summary Measures With Single Statistics
How can we summarize the data with single statistics?
Averages, or measures of central tendency, include mode, median, and mean.
Mode is the value category that appears most often (i.e., is the most frequently observed).
Median is the middle value in the distribution, such that 50% of the data points are above and 50% are below. To find it:
Rank-order the data from lowest to highest, or vice versa.
Find the median location with the formula (n + 1)/2.
Count to the median location; if it falls between two values, take their average.
Mean is the arithmetic average: the sum of all the observations divided by the number of observations.
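The three averages can be worked through in Python as a rough sketch. As an assumption, the data are rebuilt from the frequency table with "8 or more" coded as 8, because the raw values are not listed. With that coding the mean comes out to 149/50 = 2.98, which differs from the 2.88 quoted for the raw data; that raw-data figure cannot be rechecked here.

```python
import math
from statistics import mean, median, mode

# Assumption: data rebuilt from the frequency table, "8 or more" coded as 8.
published_counts = {1: 5, 2: 21, 3: 12, 4: 4, 5: 3, 6: 2, 7: 1, 8: 2}
sizes = sorted(s for s, c in published_counts.items() for _ in range(c))

n = len(sizes)
location = (n + 1) / 2          # median location: (50 + 1) / 2 = 25.5

# A location of 25.5 falls between the 25th and 26th ranked values, so the
# median is their average (Python lists are 0-indexed, hence the "- 1").
lower = sizes[math.floor(location) - 1]
upper = sizes[math.ceil(location) - 1]
median_by_hand = (lower + upper) / 2

print("mode:  ", mode(sizes))                       # most frequent value
print("median:", median_by_hand, median(sizes))     # by hand vs. library
print("mean:  ", mean(sizes))                       # sum / n, with 8+ coded as 8
```

The hand calculation and `statistics.median` agree, which is a useful check that the location formula was applied correctly.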
Note: When using the frequency distribution, we made the judgment call to use “8” as the final category. If you used all 50 data points from the raw data, you would note that there is a family size of 9. Using the raw data would give us a more accurate mean, and it would be slightly higher, or 2.88.
The mean uses all the data in the data set, whereas the mode and median only use the most frequently observed value, or the middle value. Therefore, the mean is influenced by extreme high or low values. In this case, the few families of size 8 and 9 pulled the mean higher, actually closer to 3 as the “average” family size.
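A small sketch can make this sensitivity concrete. The two lists below are hypothetical values chosen only for illustration; they are not the Alexandria sample.

```python
from statistics import mean, median

# Hypothetical family sizes, chosen only for illustration.
typical = [1, 2, 2, 2, 3, 3, 4]
with_outlier = typical[:-1] + [9]   # replace the largest value (4) with 9

# The median depends only on the middle ranked value, so it does not move.
print("median:", median(typical), "->", median(with_outlier))

# The mean uses every value, so one large family pulls it upward.
print("mean:  ", round(mean(typical), 2), "->", round(mean(with_outlier), 2))
```

Only one observation changed, yet the mean shifts while the median stays put.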
In summary, mode = 2; median = 2; mean = 2.88. The distribution of family size is skewed to the right because of some high values of the variable. In such cases, the median would be a better measure of central tendency to report to your audience.
Measures of Variation
We can also compute measures of variation. Without doing the actual computations, can you define each of the following measures and interpret what it would tell you about the variable “family size”?
Range: The range would tell you how the observed values are spread out by subtracting the lowest value observed from the highest.
Interquartile range: Because the range is influenced by outliers (i.e., very high or very low values), the interquartile range is often used instead. This statistic is the spread of the middle 50% of the data values (the third quartile minus the first quartile), so it is not affected by outliers. It gives us a better sense of how the values cluster around the median.
Variance: The variance tells us how tightly the values are clustered about the mean; it is based on the squared deviations of the observations from the mean. This is not always an easy statistic to interpret, but it is valuable in many advanced statistical computations.
Standard deviation: The standard deviation is the square root of the variance, and a bit more intuitive to interpret. Put simply, you might think of the standard deviation as roughly the average distance of the observations from the computed mean of the distribution. Like the mean, it is influenced by very high and very low outliers.
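For readers who do want to see the computations, here is a hedged sketch using Python's `statistics` module. Two assumptions apply: the data are rebuilt from the frequency table with "8 or more" coded as 8, and the sample versions of variance and standard deviation (dividing by n − 1) are used, since the text does not say which version it intends.

```python
from statistics import quantiles, stdev, variance

# Assumption: data rebuilt from the frequency table, "8 or more" coded as 8.
published_counts = {1: 5, 2: 21, 3: 12, 4: 4, 5: 3, 6: 2, 7: 1, 8: 2}
sizes = sorted(s for s, c in published_counts.items() for _ in range(c))

# Range: highest observed value minus lowest observed value.
data_range = max(sizes) - min(sizes)

# Interquartile range: third quartile minus first quartile.
# Quartile rules differ slightly between textbooks and software,
# so the exact figure depends on the method used.
q1, _, q3 = quantiles(sizes, n=4)
iqr = q3 - q1

# Sample variance and its square root, the sample standard deviation.
sample_variance = variance(sizes)
sample_sd = stdev(sizes)

print("range:", data_range)
print("IQR:  ", iqr)
print("variance:", round(sample_variance, 2))
print("std dev: ", round(sample_sd, 2))
```

Because of the coding of the top category, the range printed here understates the true range if any family has more than 8 members.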
Practice Question: Using Mean and Standard Deviations
Suppose you are the director of an agency and you want to promote one of your frontline staff to supervisor. As a basis for your decision, you look at the mean number of days each staff person takes to get a client the treatment the client needs, along with the standard deviation (s).
Worker A: Mean = 22.4, s = 15.9
Worker B: Mean = 18.7, s = 36.5
Worker C: Mean = 24.6, s = 19.7
Whom would you select on the basis of this observation, and why?
How are summary statistics used in decision making? We often use means and standard deviations in progress reports. For example, at the end of this semester, you will all fill out a student evaluation of teaching for the instructor. The evaluative items are represented as an interval scale so that means and standard deviations of all the scores can be computed and given to the instructor as feedback. The means on each item tell the instructor how students, on average, rated him/her on that item. The standard deviations tell the instructor how much variability there was among students on that item.
In this example, the director has summarized some important productivity data. Looking only at the means, you might say Worker B gets to his/her clients much more quickly than the others, so he/she is the logical person to promote. However, the standard deviations are also revealing. Worker B’s standard deviation is very high compared to those of the other two workers. This might mean there were one or two cases that he/she got to very, very quickly and those pulled his/her average lower. In other words, a couple of outlier cases make Worker B’s performance, on average, look better. But that variation is captured in the higher standard deviation. Worker A’s and C’s cases seem to cluster closer to their averages. Worker A has the second-lowest average and also the lowest variation. So he/she would be the better choice to promote if you are considering only these quantitative data.
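The comparison above can be laid out in a few lines of Python. This is only a sketch of the reasoning, using the three means and standard deviations from the practice question; the dictionary layout and variable names are illustrative, and the ranking is not meant as a decision rule.

```python
# Figures from the practice question: (mean days to treatment, standard deviation s)
workers = {"A": (22.4, 15.9), "B": (18.7, 36.5), "C": (24.6, 19.7)}

# Rank by mean: the worker who is fastest on average comes first.
by_mean = sorted(workers, key=lambda w: workers[w][0])

# Rank by standard deviation: the most consistent worker comes first.
by_sd = sorted(workers, key=lambda w: workers[w][1])

print("Ranked by mean:", by_mean)
print("Ranked by s:   ", by_sd)
```

Worker B leads on the mean but comes last on consistency, while Worker A is second on the mean and first on consistency, which is the pattern the discussion relies on.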