We use descriptive statistics when we summarize raw data using a graph or a few numbers.
Analysing the distribution of a single variable (categorical or quantitative) is called univariate analysis.
Correlation
Correlation looks at the relationship between two quantitative variables: this is called bivariate analysis.
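To analyze a relationship between two quantitative variables, we use the same 3-step strategy we use for single variables:
Step 1: Make a graph. Use a scatterplot.
Step 2: Identify the patterns and deviations (and outliers).
Step 3: Use a single number to describe the direction and strength of the relationship: the correlation coefficient r.
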
This video illustrates the main ideas of scatterplots, correlation, and the importance of explanatory and response variables when plotting a scatterplot.
Explanatory variables are sometimes called independent variables.
Response variables are sometimes called dependent variables.
https://keydifferences.com/differencebetweenindependentanddependentvariable.html
When we speculate about a relationship between variables, we can further categorize them:
Explanatory (independent) variable     Response (dependent) variable
If we change the…                      Is there an effect on…
amount of fat in diet                  weight loss
ounces of coffee                       hours of sleeplessness
hourly pay rate                        employee performance
age of child                           height
Make a graph: Scatterplots
Scatterplots are used to visually display the relationship between two quantitative variables.
In the graph below, the average daily temperature is considered to be the explanatory variable. In other words, it seems likely that the change in temperature results in a change in the number of visitors to the beach.
https://www.cqeacademy.com/cqebodyofknowledge/continuousimprovement/qualitycontroltools/thescatterplotlinearregression/
Identify the patterns and deviations (and outliers): scatterplots
In Step 1, we choose the best graph to display the data.
Now, in Step 2, we identify patterns and deviations in the graph.
TIP: Pay special attention to outliers.
https://courses.lumenlearning.com/wmopenconceptsstatistics/chapter/scatterplots2of5/
The patterns we look at are form, direction, strength, and outliers.
We use a single number, the correlation coefficient r, to numerically summarize the direction and strength of a linear relationship between two quantitative variables.
TIP: If there are outliers, we usually don’t use r.
Form: linear or not?
Tip: If the form is not linear, you cannot use correlation as a numerical summary.
[Two example scatterplots: Linear and Not linear. Source: https://mat117.wisconsin.edu/2ascatterplot/]

Direction: positive (+) or negative (-)?
The sign of r tells us whether the relationship is positive or negative.

Positive (+)
Dots go up from left to right? The direction of the relationship is positive.
As values of one variable increase, values of the other variable also increase.
Perfect positive correlation is +1.

Negative (-)
Dots go down from left to right? The direction of the relationship is negative.
As values of one variable increase, values of the other variable decrease.
Perfect negative correlation is -1.

Strength: weak or strong?
Dots close to making a straight line? The relationship is strong.
Dots not close to making a straight line? The relationship is weak.
The value of r tells us the strength of the relationship: the closer r is to +1 or -1, the stronger the relationship; the closer r is to 0, the weaker it is.

https://www.youtube.com/watch?v=DAH8DyLXdjM&t=28s
https://www.researchgate.net/figure/nterpretationofthePearsonsandSpearmanscorrelationcoefficients_tbl1_326885374 
Numerical and verbal descriptions of correlation coefficient r

Outliers?
Outliers fall outside the linear pattern. You can see the outliers circled in the two scatterplots below. They greatly reduce the strength of the relationship between the two variables.
Tip: If there are outliers, you should not use correlation as a numerical summary.

https://intl.siyavula.com/read/maths/grade11/statistics/11statistics06 
https://web.njit.edu/~dhar/math661/IPS7e_LecturePPT_ch02.pdf 
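The correlation coefficient r described above can be computed directly from paired data. The sketch below is only an illustration: the temperature and visitor numbers are made up, the helper name pearson_r is our own, and the single outlier is invented to show how much one unusual point can change r.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products and squares of the deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data (not from the course): average daily temperature
# in degrees Celsius and number of beach visitors.
temps = [20, 22, 24, 26, 28, 30, 32]
visitors = [110, 150, 180, 230, 260, 310, 340]

r_clean = pearson_r(temps, visitors)

# Add one invented outlier: a very hot day with almost no visitors.
r_with_outlier = pearson_r(temps + [34], visitors + [20])

print(f"r without the outlier: {r_clean:.3f}")
print(f"r with the outlier:    {r_with_outlier:.3f}")
```

With these made-up numbers, r is very close to +1 without the outlier and drops to roughly 0.2 once the outlier is included, which is why r is usually not reported when outliers are present.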
Regression (Best-fit) Lines
If we have established a linear relationship, we draw a regression line (or best-fit line) as a model to represent the pattern of data shown in a scatterplot.
The simplest way to create a regression line is the least-squares method, which makes the sum of the squared vertical distances between the data points and the line as small as possible. We rarely calculate the regression line by hand because software can do it for us.
The formula for a regression line is Y = a + bX, where:
Y = the response/dependent variable
X = the explanatory/independent variable
b = the slope of the line (the predicted change in Y for each one-unit increase in X)
a = the intercept (the value of Y when X = 0)
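As a rough illustration of what software does when it fits Y = a + bX, here is a minimal least-squares sketch. The data are hypothetical, and fit_line is a name chosen for this example rather than a function from any particular statistics package.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b


# Hypothetical data: temperature (explanatory) and beach visitors (response).
temps = [20, 22, 24, 26, 28, 30, 32]
visitors = [110, 150, 180, 230, 260, 310, 340]

a, b = fit_line(temps, visitors)

# Use the fitted line to predict visitors on a 27-degree day.
predicted = a + b * 27

print(f"Y = {a:.1f} + {b:.2f}X")
print(f"Predicted visitors at 27 degrees: {predicted:.0f}")
```

For these invented numbers the slope is roughly 19.5, meaning the line predicts about 19 or 20 more visitors for each additional degree, and the prediction for 27 degrees is about 245 visitors.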
If we have established that the linear relationship is moderately strong, we can use linear regression to make predictions about the response variable.
This video highlights how the formula for the regression line can be used to make predictions. (Watch until 4:38)
We can measure the strength of the best-fit line by calculating r^{2}.
r^{2} tells us the proportion of the variation in the response variable that is explained by the explanatory variable.
The closer the value of r^{2} is to 100%, the better the model is at making accurate predictions. This video offers some examples.
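One way to see what r^{2} measures is to compare the variation left over around the regression line with the total variation in the response variable. The sketch below assumes hypothetical data and uses helper names (fit_line, r_squared) invented for this example.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(xs, ys):
    """Proportion of the variation in ys explained by the fitted line."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variation left unexplained: squared distances from the line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared distances from the mean of ys.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical data: temperature (explanatory) and beach visitors (response).
temps = [20, 22, 24, 26, 28, 30, 32]
visitors = [110, 150, 180, 230, 260, 310, 340]

r2 = r_squared(temps, visitors)
print(f"r^2 = {r2:.1%}")
```

With these made-up numbers r^{2} comes out at roughly 99.7%, so almost all of the variation in visitors is accounted for by temperature in this toy data set.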
Limitations of using linear regression for prediction
We established that linear regression can be used to make predictions about the response variable based on changes in the explanatory variable.
Here we will add two more conditions that must be met before we can use regression for prediction: the prediction must fall within the range of the data, and there must be no outliers.
This video explains how the reliability of prediction goes down when we use a linear equation to make predictions outside the range of data or when there are outliers.
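The range condition can be made concrete with a small guard that refuses to predict outside the observed values of the explanatory variable. This is only a sketch: the function name predict_within_range, the data, and the coefficients a and b are all hypothetical.

```python
def predict_within_range(x_new, xs, a, b):
    """Predict Y = a + bX only if x_new lies inside the observed X range."""
    lowest, highest = min(xs), max(xs)
    if not lowest <= x_new <= highest:
        raise ValueError(
            f"{x_new} is outside the observed range {lowest} to {highest}; "
            "a prediction here would be unreliable"
        )
    return a + b * x_new


# Hypothetical observed temperatures and hypothetical fitted coefficients.
temps = [20, 22, 24, 26, 28, 30, 32]
a, b = -280.0, 19.5

print(predict_within_range(25, temps, a, b))  # inside the range: allowed

try:
    predict_within_range(45, temps, a, b)     # outside the range: refused
except ValueError as error:
    print("Refused:", error)
```

Raising an error is one design choice among several; a gentler version could return the prediction together with a warning instead.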
Cause or correlation?

Does low self-esteem cause depression? Does depression cause low self-esteem?
Confounding variables, such as distressing events or biological disposition, might explain both depression and low self-esteem.
https://sites.google.com/site/kaylauthepsychologyobjectives/gradelevel/objective5

Does an increase in sunburns cause an increase in ice cream sales? Does an increase in ice cream sales cause an increase in sunburns?
The confounding variable, summer weather, might explain both ice cream consumption and sunburns.
https://www.clearsalesmessage.com/correlationcausation/
This video offers another way to think about correlation and causation. (Watch until 1:07)
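The ice cream and sunburn example can be imitated with a small simulation. Everything below is simulated, not real data: we assume temperature drives both ice cream sales and sunburns, with no direct link between the two, and the coefficients and noise levels are arbitrary choices for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for paired data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)  # fixed seed so the simulation is repeatable

# Simulated days: temperature is the confounding variable.
temps = [random.uniform(10, 35) for _ in range(2000)]
# Each outcome depends on temperature plus its own random noise,
# and neither outcome depends on the other.
ice_cream = [5 * t + random.gauss(0, 10) for t in temps]
sunburns = [2 * t + random.gauss(0, 5) for t in temps]

r_all = pearson_r(ice_cream, sunburns)

# Compare only days with nearly the same temperature (20 to 22 degrees).
band = [(i, s) for t, i, s in zip(temps, ice_cream, sunburns) if 20 <= t <= 22]
r_band = pearson_r([i for i, _ in band], [s for _, s in band])

print(f"r across all days:           {r_all:.2f}")
print(f"r at nearly the same temp.:  {r_band:.2f}")
```

Across all simulated days the two outcomes should be strongly correlated, while among days with nearly the same temperature the correlation should be close to zero, which is what we would expect if the weather, not ice cream, explains the sunburns.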
How to establish causality
The best way to establish causality is through a randomized, controlled experiment. However, it is also possible to establish a cause-and-effect relationship using observational studies.
Experiments are the best way to establish causality because the researcher can isolate the explanatory variable and control for extraneous variables. This makes us confident that it is the explanatory variable that is causing the changes in the response variable.
This video explains why experiments are so useful to establish a cause-and-effect relationship between two variables.
When it is unfeasible or unethical to conduct an experiment, researchers use another method to establish causality. They look at the data available from observational studies and ask the questions below:
A fascinating example of using observational studies to establish causation is the relationship between smoking and lung cancer.
Smoking and Lung Cancer: From Association to Causation