The goal of collecting data is to find out something about the characteristics of a sample or a population.
To analyze the data we collect, we always follow the same 3-step strategy:
Step 1: Choose the best graph based on the level of measurement.
Step 2: Look at the shape of the distribution in the graph.
Step 3: Use a few numbers to describe the center (mean, median, mode) and the variation (standard deviation, five-number summary).
Looking at a graph gives us a quick idea of the distribution of values in the set—a graph “summarizes” or “describes” the data.
In this section we will look at Step 3: using a few numbers to describe the data.
Tip: Before choosing the best numerical measures of center and variation, we first graph the data and look at its shape (Steps 1 and 2).
Graphs help us see the patterns in data, but we often prefer to use one number, or a numerical summary, to describe a data set.
There are two primary numerical measures of a data set: center and spread (also called variation).
Some of the most commonly used measures of center and spread are outlined in this video.
The center of a data set, or the “average,” is where most values cluster. A number that describes it is called a measure of center, or a measure of central tendency. You can also think of it as representing the “typical” value of a data set.
There are three common measures of central tendency: Mean, Median, Mode.
Measures of central tendency
| Measure | Description | How to calculate | When to use it |
| --- | --- | --- | --- |
| Mode | Most frequently occurring value | | Nominal data |
| Median | Midpoint of an ordered data set | | Ordinal data; quantitative data that is skewed or has outliers |
| Mean (most common) | Arithmetic average | | Quantitative data with a symmetric distribution |
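As a rough illustration of the three measures in the table, the sketch below uses Python's standard `statistics` module. The data sets (`pets`, `colors`) are made-up values chosen only for this example; they do not come from the course material.

```python
import statistics

# Hypothetical quantitative sample: number of pets owned by nine students.
pets = [0, 1, 1, 1, 2, 2, 3, 4, 4]

# Hypothetical nominal sample: favorite colors. Only the mode makes sense here.
colors = ["red", "blue", "red", "green"]

mode = statistics.mode(pets)        # most frequently occurring value
median = statistics.median(pets)    # middle value once the data are sorted
mean = statistics.mean(pets)        # sum of the values divided by their count
color_mode = statistics.mode(colors)

print("mode:", mode)
print("median:", median)
print("mean:", mean)
print("most common color:", color_mode)
```

Note that the mode is the only one of the three that can be computed for the color data, which matches the "Nominal data" entry in the table.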
Choosing the best measure of center
To choose the best measure of center for a distribution, follow Steps 1 and 2: graph your data and look at its shape.
This brief video shows the differences between the three measures of center.
Variation tells us how far the data values are spread from the center.
There are a few measures of spread for quantitative data: range, the five-number summary (minimum, quartiles, and maximum), and standard deviation.
Measures of spread or variation
| Measure | Description | How to calculate | When to use it |
| --- | --- | --- | --- |
| Range | Difference between highest and lowest value | | |
| Percentiles | Values that divide the data into 100 equal groups | | |
| Quartiles | Values that divide the distribution into 4 equal groups. The interquartile range (IQR) describes the middle 50% of the data | | When the distribution is skewed or has outliers (use with the median) |
| Standard deviation | The average distance that observations are spread from the mean | | Only when the distribution is symmetric (use with the mean) |
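The measures in the table can be sketched with Python's standard `statistics` module. The `scores` values are invented for illustration. Quartiles can be computed by several slightly different conventions; this sketch assumes the module's default ("exclusive") method, so other tools or textbooks may give slightly different quartile values for the same data.

```python
import statistics

# Hypothetical sample of seven quiz scores.
scores = [2, 4, 4, 5, 6, 7, 9]

# Range: difference between the highest and lowest value.
data_range = max(scores) - min(scores)

# Quartiles: three cut points that split the sorted data into 4 groups.
# Assumes the default method="exclusive" of statistics.quantiles.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Interquartile range: the width of the middle 50% of the data.
iqr = q3 - q1

# Sample standard deviation (the "s" that is reported alongside the mean).
s = statistics.stdev(scores)

print("range:", data_range)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("standard deviation:", round(s, 2))
```

Percentiles follow the same idea with `n=100` instead of `n=4`, which returns 99 cut points.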
Choosing the best measure of spread
To choose the best measure of spread for a distribution, follow Steps 1 and 2: graph your data and look at its shape.
This video describes three common measures of spread.
Outliers have a large impact on the mean, by pulling the mean in the direction of the outlier. This results in a number that may be far from where most of the data points cluster.
When there are outliers, always use the median as the measure of central tendency. The median is less influenced by an outlier.
Outliers also affect measures of spread. An outlier makes the range and the standard deviation larger, even when most of the data points sit close together.
When there are outliers, use the five-number summary as the measure of spread.
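To see the effect of a single outlier, the sketch below compares the same small data set with and without one extreme value. The numbers are hypothetical (think of incomes in thousands) and are only meant to show the direction of the effect.

```python
import statistics

# Hypothetical values, e.g. incomes in thousands.
incomes = [30, 32, 35, 36, 40]
with_outlier = incomes + [400]   # same data plus one extreme value

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(incomes)
median_after = statistics.median(with_outlier)
sd_before = statistics.stdev(incomes)
sd_after = statistics.stdev(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("sd:    ", round(sd_before, 1), "->", round(sd_after, 1))
```

The mean is pulled far toward the outlier and the standard deviation grows sharply, while the median barely moves. This is why the median and the five-number summary are preferred when outliers are present.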
In this video we will be talking about the effects of outliers on spread and center.
When a data distribution has outliers or is skewed, the mean (measure of center) and standard deviation (measure of spread) are not accurate summaries of the data set.
The five-number summary is used to numerically summarize a data set when there are outliers or a skewed distribution.
The five-number summary uses five numbers to summarize a data set. The numbers are listed from smallest to largest:
Minimum value (Min)
First quartile (Q1)
Median (Med)
Third quartile (Q3)
Maximum value (Max)
A boxplot is the visual display that is used to show the five-number summary.
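A minimal sketch of computing the five-number summary in Python follows. The function name `five_number_summary` and the sample values are made up for this example. The quartiles assume the default ("exclusive") method of `statistics.quantiles`; hand calculations or other software may use a different quartile convention and give slightly different values for Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Hypothetical helper for illustration; needs at least two values.
    """
    data = sorted(values)
    # Three cut points that split the data into four groups.
    q1, med, q3 = statistics.quantiles(data, n=4)
    return data[0], q1, med, q3, data[-1]

# Made-up sample, deliberately unsorted.
summary = five_number_summary([7, 2, 9, 4, 5, 4, 6])
print(summary)  # (2, 4.0, 5.0, 7.0, 9)
```

These five values are exactly what a boxplot draws: the ends of the whiskers, the edges of the box, and the line inside the box.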
How to create and interpret the five-number summary (watch until 3:31).
Here is a brief example of calculating the five-number summary and drawing a boxplot.
| Shape of the distribution | Best numerical descriptor | Numbers used | Best graph |
| --- | --- | --- | --- |
| Skewed or outliers? | Five-number summary | Min, Q1, Med, Q3, Max | Boxplot |
| Relatively symmetric? | Mean and standard deviation | x̄, s | Histogram |
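The decision in the table can be sketched in code, with one important caveat: the text asks you to judge shape by looking at a graph, and this sketch replaces that judgment with an automatic check. It assumes the common 1.5 × IQR rule for flagging outliers, which is not stated in this section, and it does not test for skewness at all. The function name `describe` and the sample data are hypothetical.

```python
import statistics

def describe(values):
    """Pick a numerical summary based on a simple outlier check.

    Assumption: a value is an outlier if it lies more than 1.5 * IQR
    below Q1 or above Q3. This stands in for looking at the graph.
    """
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    has_outliers = any(x < low_fence or x > high_fence for x in data)

    if has_outliers:
        # Outliers present: report Min, Q1, Med, Q3, Max.
        return "five-number summary", (data[0], q1, med, q3, data[-1])
    # No outliers flagged: report the mean and sample standard deviation.
    return "mean and standard deviation", (statistics.mean(data), statistics.stdev(data))

# Made-up samples: one roughly symmetric, one with an extreme value.
symmetric_choice, _ = describe([2, 4, 4, 5, 6, 7, 9])
outlier_choice, _ = describe([30, 32, 35, 36, 40, 41, 43, 45, 400])
print(symmetric_choice)
print(outlier_choice)
```

In practice the graph should still come first; a check like this is only a rough aid for spotting extreme values.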
This video shows the reasoning behind choosing either the mean and standard deviation or the five-number summary.