We gather data to tell us something about a population, but a spreadsheet full of raw data doesn’t tell us much.
To analyze the data we collect, we always follow the same 3-step strategy:
1. Choose the best graph based on the level of measurement.
2. Look at the shape, center, and variation of the distribution.
3. Use a few numbers to describe the center (mean, median, mode) and the variation (standard deviation, five-number summary).
In this section, we look at the first two steps for distributions of single variables.
1. We choose the best table or graph to display the data.
2. We identify patterns and deviations in the data. (This helps us choose the best numerical summaries in Step 3.)
Frequency Tables
A frequency distribution is one way to organize raw data.
It shows two things: the categories or values of a variable, and how many times each one occurs (its frequency).
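A minimal sketch of building a frequency table in Python, assuming a small made-up list of categorical responses. The data and the helper name `frequency_table` are illustrative and do not come from the course materials. Each row pairs a category with its count and its relative frequency.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, frequency, relative frequency) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(category, count, count / total)
            for category, count in counts.most_common()]

# Hypothetical raw data: favorite color reported by 10 people
colors = ["blue", "red", "blue", "green", "blue",
          "red", "green", "blue", "red", "blue"]

rows = frequency_table(colors)
for category, count, rel in rows:
    # Left-align the category, right-align the count and the percentage
    print(f"{category:<8}{count:>4}{rel:>8.0%}")
```

Sorting by count puts the most common category first, which is also the order used in a Pareto chart.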
Types of graphs and their uses
The most common graphs for categorical variables are:
• pie charts
• bar graphs
The most common graphs for quantitative variables are:
• histograms
• stemplots
It’s important to choose a graph that is appropriate for your data set.
Before you create a graph, identify the type of variable: categorical or quantitative.
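The choice of graph can be sketched as a simple lookup. The snippet below is only an illustration: the dictionary repeats the most common graphs listed in this section, and the function name `suggest_graphs` is hypothetical.

```python
# Most common graphs for each variable type, as listed in this section
COMMON_GRAPHS = {
    "categorical": ["pie chart", "bar graph"],
    "quantitative": ["histogram", "stemplot"],
}

def suggest_graphs(variable_type):
    """Return the common graph types for a categorical or quantitative variable."""
    key = variable_type.strip().lower()
    if key not in COMMON_GRAPHS:
        raise ValueError(f"unknown variable type: {variable_type!r}")
    return COMMON_GRAPHS[key]

print(suggest_graphs("categorical"))
print(suggest_graphs("quantitative"))
```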
Graphs for categorical variables
• pie charts [Figure source: https://ec.europa.eu/eurostat/web/products-eurostat-news/-/DDN-20180920-1]
• bar charts [Figure source: Moore, Statistics: Concepts and Controversies, 9e, 2017, W. H. Freeman and Co.]
• Pareto charts [Figure source: Moore, Statistics: Concepts and Controversies, 9e, 2017, W. H. Freeman and Co.]
• dot plots [Figure source: https://www.pinterest.co.uk/pin/406168460117709867/]
Graphs for quantitative variables
• histograms [Figure source: Moore, Statistics: Concepts and Controversies, 9e, 2017, W. H. Freeman and Co.]
• frequency polygons [Figure source: https://courses.lumenlearning.com/introstats1/chapter/histograms-frequency-polygons-and-time-series-graphs/]
• stem-and-leaf plots [Figure source: Moore, Statistics: Concepts and Controversies, 9e, 2017, W. H. Freeman and Co.]
• boxplots [Figure source: https://www.onlinemath4all.com/analyzing-box-plots-worksheet.html]
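As a rough sketch of how a stem-and-leaf plot organizes quantitative data, the snippet below splits each value into a stem (tens) and a leaf (ones digit). It assumes non-negative whole numbers. The scores and the function name `stemplot` are made up for illustration.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stem-and-leaf plot for non-negative whole numbers."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)   # stem = tens, leaf = ones digit
    lines = []
    # Include empty stems so gaps in the data stay visible
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Hypothetical exam scores
scores = [62, 75, 78, 81, 84, 84, 89, 93, 97, 58]
plot_lines = stemplot(scores)
print("\n".join(plot_lines))
```

Turned on its side, the rows of leaves give the same picture as a histogram with bins of width 10, while still showing every individual value.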
Graphs to show change over time
• time-series graphs [Figure source: Moore, Statistics: Concepts and Controversies, 9e, 2017, W. H. Freeman and Co.]
In Step 1, we choose the best graph to display the data.
Now, in Step 2, we identify patterns and deviations in the graph.
[Figure source: https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/dotplots-2-of-2/]
To find patterns and deviations, we look at:
• shape: whether the data distribution is relatively symmetric or not
• center: where most of the data values cluster in the data distribution
• variation: how far the values spread from the center of the data distribution (and whether there are outliers)
Shape of a distribution
To describe the shape of a distribution, look at:
• the number of modes
• whether the distribution is symmetric or skewed
[Figure source: http://www.lynnschools.org/classrooms/english/faculty/documents/tim_serino/Printable_Assignments/24_notes__describing_quantitative_data.pdf]
Symmetric or skewed distribution?
• Symmetric: data values are evenly distributed around the center of a unimodal distribution; the left and right sides of the distribution are mirror images; the mode, mean, and median are the same.
• Skewed left (negatively): data values are more spread out on the left side; the tail goes to the left; outliers pull the mean to the left.
• Skewed right (positively): data values are more spread out on the right side; the tail goes to the right; outliers pull the mean to the right.
[Figures: all images from Statistical Reasoning for Everyday Life, 5e]
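The link between skew and the mean can be checked numerically. The sketch below compares the mean and the median of three made-up data sets. The helper name `skew_hint` is hypothetical, and the comparison is only a rule of thumb, not a replacement for looking at a graph.

```python
from statistics import mean, median

def skew_hint(values):
    """Compare mean and median as a rough hint about the direction of skew."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"   # mean pulled toward a longer right tail
    if m < med:
        return "left"    # mean pulled toward a longer left tail
    return "none"

# Hypothetical data sets
symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]
skewed_right = [1, 2, 2, 3, 3, 3, 4, 5, 13]
skewed_left = [1, 9, 10, 11, 11, 11, 12, 12, 13]

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    print(name, mean(data), median(data), skew_hint(data))
```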
Center
The center is the location where most of the data values cluster in a distribution. Think of it as a “typical” value of the data set.
Spread (Variation)
Variation, or spread, describes how far the values are spread out from the center of the data distribution (and whether there are outliers).
In the picture below, you can see increasing variation in each image as you move from left to right. The center of the data stays the same, but the values get more spread out.
[Figure: three distributions labeled small variation, moderate variation, and large variation. Source: https://www.spss-tutorials.com/standard-deviation/]
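The same idea can be seen in numbers. In this sketch, three made-up data sets share the same mean, but their standard deviations grow as the values spread out. The data and variable names are assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical data sets with the same center (mean of 10)
small = [9, 10, 10, 10, 11]
moderate = [6, 8, 10, 12, 14]
large = [0, 5, 10, 15, 20]

for name, data in [("small", small), ("moderate", moderate), ("large", large)]:
    # stdev is the sample standard deviation
    print(f"{name:<9} mean={mean(data)} sd={stdev(data):.2f}")
```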
Outliers
An outlier is a value in a data set that is either very high or very low when compared to the other values.
An outlier increases variation in a data set.
To find an outlier, we must first create a graph.
Tip: An outlier strongly affects the mean of a data set but has little or no effect on the median.
[Figure sources: https://statisticsbyjim.com/basics/histograms/ and https://online.stat.psu.edu/stat462/node/170/]
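A short sketch of how a single outlier changes the mean and the median. The income values are invented for illustration. Adding one very high value moves the mean a long way while the median barely changes.

```python
from statistics import mean, median

# Hypothetical incomes, in thousands
incomes = [30, 32, 35, 36, 40]
with_outlier = incomes + [400]   # add one very high value

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```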
Sometimes, graphs may not present an accurate display of the data. This may be accidental or intentional.