We use descriptive statistics when we summarize raw data using a graph or a few numbers.
Analysing the distribution of a single variable (categorical or quantitative) is called univariate analysis.
Correlation
Correlation looks at the relationship between two quantitative variables: this is called bivariate analysis.
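To analyze a relationship between two quantitative variables, we use the same 3-step strategy we use for single variables:
Step 1: Make a graph. Use a scatterplot.
Step 2: Identify the patterns and deviations (including outliers) in the graph.
Step 3: Use a single number, the correlation coefficient r, to describe the direction and strength of the relationship.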
This video illustrates the main ideas of scatterplots, correlation, and the importance of explanatory and response variables when plotting a scatterplot.
Explanatory variables are sometimes called independent variables.
Response variables are sometimes called dependent variables.
https://keydifferences.com/difference-between-independent-and-dependent-variable.html
When we speculate about a relationship between variables, we can further categorize them:
| Explanatory variable (independent variable) | Response variable (dependent variable) |
|---|---|
| If we change the… | Is there an effect on… |
| amount of fat in diet | weight loss |
| ounces of coffee | hours of sleeplessness |
| hourly pay rate | employee performance |
| age of child | height |
Make a graph: Scatterplots
Scatterplots are used to visually display the relationship between two quantitative variables.
In the graph below, the average daily temperature is considered to be the explanatory variable. In other words, it seems likely that the change in temperature results in a change in the number of visitors to the beach.
https://www.cqeacademy.com/cqe-body-of-knowledge/continuous-improvement/quality-control-tools/the-scatter-plot-linear-regression/
Identify the patterns and deviations (and outliers): scatterplots
In Step 1, we choose the best graph to display the data.
Now, in Step 2, we identify patterns and deviations in the graph.
TIP: Pay special attention to outliers.
https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/scatterplots-2-of-5/
The patterns we look at are form, direction, strength, and outliers.
We use a single number, the correlation coefficient r, to numerically summarize the direction and strength of a linear relationship between two quantitative variables.
TIP: If there are outliers, we usually don’t use r.
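As a rough illustration of what r measures, the sketch below computes the correlation coefficient from its usual textbook definition (the sum of the products of standardized values, divided by n - 1) and compares the result with NumPy's built-in `np.corrcoef`. The temperature and visitor numbers are invented for illustration; they are not read from the beach graph.

```python
import numpy as np

# Hypothetical data (invented for illustration only).
temperature = np.array([20.0, 22.0, 25.0, 27.0, 30.0, 32.0])     # explanatory variable
visitors = np.array([110.0, 150.0, 210.0, 260.0, 330.0, 350.0])  # response variable


def correlation(x, y):
    """Pearson's r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)  # ddof=1 gives the sample standard deviation
    zy = (y - y.mean()) / y.std(ddof=1)
    return float(np.sum(zx * zy) / (n - 1))


r = correlation(temperature, visitors)
r_numpy = float(np.corrcoef(temperature, visitors)[0, 1])

print(f"r from the definition: {r:.3f}")
print(f"r from np.corrcoef:    {r_numpy:.3f}")
```

Both values should agree, and for data this close to a straight line r should be close to +1.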
Form: linear or not?
Tip: If the form is not linear, you cannot use correlation as a numerical summary.
Examples of linear and nonlinear form:
https://mat117.wisconsin.edu/2-a-scatterplot/
Direction: positive (+) or negative (-)?
The sign of r tells us whether the relationship is positive or negative.
Positive (+): Do the dots go up from left to right? Then the direction of the relationship is positive. As values of one variable increase, values of the other variable also increase. A perfect positive correlation is r = +1.
Negative (-): Do the dots go down from left to right? Then the direction of the relationship is negative. As values of one variable increase, values of the other variable decrease. A perfect negative correlation is r = -1.
Strength: weak or strong?
Are the dots close to making a straight line? Then the relationship is strong.
Are the dots not close to making a straight line? Then the relationship is weak.
The value of r tells us the strength of the relationship: the closer r is to +1 or -1, the stronger the linear relationship; the closer r is to 0, the weaker it is.
https://www.youtube.com/watch?v=DAH8DyLXdjM&t=28s
Numerical and verbal descriptions of correlation coefficient r
https://www.researchgate.net/figure/nterpretation-of-the-Pearsons-and-Spearmans-correlation-coefficients_tbl1_326885374
Outliers?
Outliers fall outside the linear pattern. You can see the outliers circled in the two scatterplots below. They greatly reduce the strength of the relationship between the two variables.
Tip: If there are outliers, you should not use correlation as a numerical summary.
https://intl.siyavula.com/read/maths/grade-11/statistics/11-statistics-06
https://web.njit.edu/~dhar/math661/IPS7e_LecturePPT_ch02.pdf
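To see why outliers matter, the sketch below computes r for a small data set with a clear linear pattern, then again after adding a single point that falls far outside that pattern. All values are invented for illustration.

```python
import numpy as np

# Hypothetical data: ten points that follow a clear linear pattern.
x = np.arange(1.0, 11.0)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1])
y = 2.0 * x + noise

r_without = float(np.corrcoef(x, y)[0, 1])

# Add one invented point that falls far outside the linear pattern.
x_out = np.append(x, 10.0)
y_out = np.append(y, 1.0)

r_with = float(np.corrcoef(x_out, y_out)[0, 1])

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

In this toy example, one extra point is enough to pull r well below its original value, even though the other ten points are unchanged.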
Regression (Best-fit) Lines
If we have established a linear relationship, we draw a regression line (or best-fit line) as a model to represent the pattern of data shown in a scatterplot.
The most common way to create a regression line is the least-squares method, which makes the sum of the squared vertical distances between the data points and the line as small as possible. We rarely calculate the regression line by hand because software can do it for us.
The formula for a regression line is Y = a + bX, where:
Y = the response (dependent) variable
X = the explanatory (independent) variable
b = the slope of the line (the predicted change in Y for each one-unit increase in X)
a = the intercept (the value of Y when X = 0)
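For reference, the least-squares slope and intercept can be written using the correlation coefficient r. This is the standard textbook result, not something stated in the original notes. The symbols for the sample means (\bar{X}, \bar{Y}) and sample standard deviations (s_X, s_Y) are notation assumed here.

```latex
b = r \, \frac{s_Y}{s_X}, \qquad a = \bar{Y} - b \, \bar{X}
```

One consequence is that the regression line always passes through the point (\bar{X}, \bar{Y}).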
If we have established that the linear relationship is moderately strong, we can use linear regression to make predictions about the response variable.
This video highlights how the formula for the regression line can be used to make predictions. (Watch until 4:38)
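A minimal sketch of the same idea in software: fit a least-squares line with NumPy's `np.polyfit`, then plug a new X value into Y = a + bX. The data and the helper name `predict` are invented for illustration.

```python
import numpy as np

# Hypothetical data (invented for illustration only).
temperature = np.array([20.0, 22.0, 25.0, 27.0, 30.0, 32.0])     # X, explanatory
visitors = np.array([110.0, 150.0, 210.0, 260.0, 330.0, 350.0])  # Y, response

# Least-squares fit of a straight line; polyfit returns [slope, intercept].
b, a = np.polyfit(temperature, visitors, 1)


def predict(x):
    """Predicted response from the regression line Y = a + bX."""
    return a + b * x


print(f"Regression line: Y = {a:.1f} + {b:.2f}X")
print(f"Predicted visitors at 28 degrees: {predict(28.0):.0f}")
```

Note that 28 degrees lies inside the range of observed temperatures, so this prediction stays within the data.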
We can measure the strength of the best-fit line by calculating r² (the square of the correlation coefficient r).
r² tells us the proportion of the variation in the response variable that is explained by the explanatory variable.
The closer the value of r² is to 100%, the better the model is at making accurate predictions. This video offers some examples.
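The sketch below shows, under assumed data, that r² can be read as "explained variation": it compares the variation left over around the regression line with the total variation in the response, then checks that the result matches the square of r. The coffee and sleeplessness values are invented.

```python
import numpy as np

# Hypothetical data (invented for illustration only).
coffee_oz = np.array([0.0, 4.0, 8.0, 12.0, 16.0, 20.0, 24.0])  # explanatory
sleepless_hours = np.array([0.5, 1.0, 1.2, 2.1, 2.0, 3.2, 3.1])  # response

b, a = np.polyfit(coffee_oz, sleepless_hours, 1)
predicted = a + b * coffee_oz

# Variation not explained by the line, and total variation in the response.
ss_residual = float(np.sum((sleepless_hours - predicted) ** 2))
ss_total = float(np.sum((sleepless_hours - sleepless_hours.mean()) ** 2))
r_squared = 1.0 - ss_residual / ss_total

r = float(np.corrcoef(coffee_oz, sleepless_hours)[0, 1])

print(f"r^2 from explained variation: {r_squared:.3f}")
print(f"r^2 from squaring r:          {r ** 2:.3f}")
```

For a least-squares line with an intercept, the two numbers agree.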
Limitations of using linear regression for prediction
We established that linear regression can be used to make predictions about the response variable based on changes in the explanatory variable.
Here we will add two more conditions that must be met before we can use regression for prediction.
This video explains how the reliability of prediction goes down when we use a linear equation to make predictions outside the range of data or when there are outliers.
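One way to guard against predictions outside the range of the data is to have the prediction function refuse them. The sketch below is only an illustration: the ages and heights are invented, and `predict_height` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical data (invented for illustration only).
age = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])              # age of child, years
height = np.array([86.0, 102.0, 115.0, 128.0, 138.0, 149.0])  # height, cm

b, a = np.polyfit(age, height, 1)


def predict_height(x):
    """Predict only inside the observed range of ages; refuse to extrapolate."""
    if not (age.min() <= x <= age.max()):
        raise ValueError(
            f"age {x} is outside the observed range "
            f"{age.min()} to {age.max()}"
        )
    return a + b * x


print(f"Predicted height at age 7: {predict_height(7.0):.1f} cm")

# Without the range check, the line still returns a number for age 40,
# but that number is not a believable height.
print(f"Unchecked line at age 40: {a + b * 40.0:.1f} cm")
```

The line describes the pattern only for the ages that were observed; it has no information about what happens once children stop growing.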
Cause or correlation?
Does low self-esteem cause depression? Does depression cause low self-esteem?
Confounding variables, such as distressing events or biological disposition, might explain both depression and low self-esteem.
https://sites.google.com/site/kaylauthepsychologyobjectives/grade-level/objective-5
Does an increase in sunburns cause an increase in ice cream sales? Does an increase in ice cream sales cause an increase in sunburns?
The confounding variable, summer weather, might explain both ice cream consumption and sunburns.
https://www.clearsalesmessage.com/correlation-causation/
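The ice cream example can be imitated with a small simulation. The sketch below assumes a toy model in which temperature drives both ice cream sales and sunburns, and neither of those two affects the other. Every coefficient and noise level is invented. The two outcomes are strongly correlated across all days, but the correlation largely disappears when we look only at days with similar temperature.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed toy model: temperature drives BOTH outcomes;
# ice cream sales and sunburns do not influence each other.
n_days = 1000
temperature = rng.uniform(5.0, 35.0, size=n_days)
ice_cream_sales = 10.0 * temperature + rng.normal(0.0, 40.0, size=n_days)
sunburns = 0.5 * temperature + rng.normal(0.0, 2.0, size=n_days)

r_all = float(np.corrcoef(ice_cream_sales, sunburns)[0, 1])

# Hold the confounder roughly constant: keep only days of similar temperature.
similar = (temperature >= 24.0) & (temperature <= 26.0)
r_similar = float(np.corrcoef(ice_cream_sales[similar], sunburns[similar])[0, 1])

print(f"r across all simulated days:      {r_all:.2f}")
print(f"r on days between 24 and 26 deg:  {r_similar:.2f}")
```

A strong r across all days therefore says nothing, on its own, about whether one outcome causes the other.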
This video offers another way to think about correlation and causation. (Watch until 1:07)
How to establish causality
The best way to establish causality is through a randomized, controlled experiment. However, it is also possible to establish a cause-and-effect relationship using observational studies.
Experiments are the best way to establish causality because the researcher can isolate the explanatory variable and control for extraneous variables. This makes us confident that it is the explanatory variable that is causing the changes in the response variable.
This video explains why experiments are so useful to establish a cause-and-effect relationship between two variables.
When it is unfeasible or unethical to conduct an experiment, researchers use another method to establish causality. They look at the data available from observational studies and ask the questions below:
A fascinating example of using observational studies to establish causation is the relationship between smoking and lung cancer.
Smoking and Lung Cancer: From Association to Causation