George W. Wallace Library: QM Course Guide: Correlation and Causality

Correlation, Regression Lines, and Causality

We use descriptive statistics when we summarize raw data using a graph or a few numbers.

Graphs and tables visually describe the data.
Measures of center and variation numerically describe the data.

Analyzing the distribution of a single variable (categorical or quantitative) is called univariate analysis.

Correlation

Correlation looks at the relationship between two quantitative variables: this is called bivariate analysis.

To analyze a relationship between two quantitative variables, we use the same 3-step strategy we use for single variables.
Step 1: Make a graph	Use a scatterplot
Step 2: Identify patterns and deviations	Look at: form, direction, strength (and outliers)
Step 3: Choose a numerical summary	Use a single number to describe the direction and strength of the relationship (correlation coefficient r)

This video illustrates the main ideas of scatterplots, correlation, and the importance of explanatory and response variables when plotting a scatterplot.

Explanatory variables are sometimes called independent variables.

Response variables are sometimes called dependent variables.

dependent vs independent variable

https://keydifferences.com/difference-between-independent-and-dependent-variable.html

When we speculate about a relationship between variables, we can further categorize them:

Explanatory (independent) variable	Response (dependent) variable
If we change the…	Is there an effect on…
amount of fat in diet	weight loss
ounces of coffee	hours of sleeplessness
hourly pay rate	employee performance
age of child	height

Making a Scatterplot

Make a graph: Scatterplots

Scatterplots are used to visually display the relationship between two quantitative variables.

In the graph below, the average daily temperature is considered to be the explanatory variable. In other words, it seems likely that the change in temperature results in a change in the number of visitors to the beach.

Scatter Plot

https://www.cqeacademy.com/cqe-body-of-knowledge/continuous-improvement/quality-control-tools/the-scatter-plot-linear-regression/

Constructing a Scatterplot

Identifying patterns and deviations in a scatterplot


Identify the patterns and deviations (and outliers): scatterplots

In Step 1, we choose the best graph to display the data.

Now, in Step 2, we identify patterns and deviations in the graph.

TIP: Pay special attention to outliers.

Flowchart of graphing the distribution of 2 quantitative variables in a scatterplot, which includes overall patterns and deviations from the patterns

https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/scatterplots-2-of-5/

The patterns we look at are:

form
direction
strength (Outliers have an impact on the strength of a relationship.)

Choosing a numerical summary: correlation coefficient r

We use a single number, the correlation coefficient r, to numerically summarize the direction and strength of a linear relationship between two quantitative variables.

TIP: If there are outliers, we usually don’t use r.
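As a rough illustration of how r is obtained, the sketch below computes the Pearson correlation coefficient from its standard definition using plain Python. The function name `pearson_r` and the temperature and visitor numbers are hypothetical and were made up for this example; in practice, statistical software does this calculation for us.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data (not from the guide): average daily temperature
# as the explanatory variable, beach visitors as the response variable.
temperature = [20, 22, 25, 27, 30, 32]
visitors = [110, 135, 180, 190, 260, 275]

r = pearson_r(temperature, visitors)
print(round(r, 3))  # a value close to +1 indicates a strong positive relationship
```

The sign of the result gives the direction of the relationship, and its distance from 0 gives the strength.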

Form: linear or not?

Tip: If the form is not linear, you cannot use correlation as a numerical summary.

Linear

Not linear

https://mat117.wisconsin.edu/2-a-scatterplot/

Direction: positive + or negative - ?

The sign of r tells us if the relationship is positive or negative.

Positive +

Dots go up from left to right? The direction of the relationship is positive.

As values of one variable increase, values of the other variable also increase.

Perfect positive correlation is +1.

Negative -

Dots go down from left to right? The direction of the relationship is negative.

As values of one variable increase, values of the other variable decrease.

Perfect negative correlation is -1.

Strength: weak or strong?

Dots close to making a straight line? The relationship is strong.

Dots not close to making a straight line? The relationship is weak.

The value of r tells us the strength of the relationship.

Is r close to 0 (weak), or close to +1 or -1 (strong)?

https://www.youtube.com/watch?v=DAH8DyLXdjM&t=28s

https://www.researchgate.net/figure/nterpretation-of-the-Pearsons-and-Spearmans-correlation-coefficients_tbl1_326885374

Numerical and verbal descriptions of correlation coefficient r


Outliers?

Outliers fall outside the linear pattern. You can see the outliers circled in the two scatterplots below.

They can greatly reduce the strength of the relationship between the two variables.

Tip: If there are outliers, you should not use correlation as a numerical summary.
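To see how much a single outlier can matter, the sketch below computes r for a small, nearly linear data set and then again after one point far from the linear pattern is added. The data values and the helper `pearson_r` are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data that follow a nearly perfect straight line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.1]
r_without = pearson_r(xs, ys)

# Add one point that falls far outside the linear pattern.
xs_outlier = xs + [9]
ys_outlier = ys + [1.0]
r_with = pearson_r(xs_outlier, ys_outlier)

print("r without outlier:", round(r_without, 2))
print("r with outlier:   ", round(r_with, 2))
```

With these made-up numbers, one extra point is enough to move r from very close to +1 to a much weaker value, which is why r is a poor summary when outliers are present.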

https://intl.siyavula.com/read/maths/grade-11/statistics/11-statistics-06

https://web.njit.edu/~dhar/math661/IPS7e_LecturePPT_ch02.pdf

Drawing a regression line

Regression (Best-fit) Lines

If we have established a linear relationship, we draw a regression line (or best-fit line) as a model to represent the pattern of data shown in a scatterplot.

The most common way to create a regression line is the least-squares method, which makes the sum of the squared vertical distances between the data points and the line as small as possible. We rarely calculate the regression line by hand because software can do it for us.

The formula for a regression line is Y = a + bX

Y = response/dependent variable

X = explanatory/independent variable

b = the slope of the line (the change in Y for each one-unit increase in X)

a = the intercept (the value of Y when X = 0)
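The sketch below shows one way software might find a and b in Y = a + bX with the least-squares method, using the standard closed-form formulas for a single explanatory variable. The function name `least_squares_line` and the data are hypothetical and used only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical data (not from the guide): temperature and beach visitors.
temperature = [20, 22, 25, 27, 30, 32]
visitors = [110, 135, 180, 190, 260, 275]

a, b = least_squares_line(temperature, visitors)
print(f"Y = {a:.2f} + {b:.2f}X")
```

Here b is read as the estimated change in visitors for each additional degree, and a is the value the line gives when X = 0, which may not be meaningful if 0 is far outside the observed data.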

Using a regression line to make predictions

If we have established that the linear relationship is moderately strong, we can use linear regression to make predictions about the response variable.

This video highlights how the formula for the regression line can be used to make predictions. (Watch until 4:38)

We can measure how well the best-fit line fits the data by calculating r².

r² tells us the proportion of the variation in the response variable that is explained by the explanatory variable.

The closer the value of r² is to 100%, the better the model is at making accurate predictions. This video offers some examples.
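As a minimal sketch of these two ideas, the code below fits a least-squares line, uses it to predict the response for a new value of the explanatory variable, and computes r² as the share of the variation in the response that the line accounts for. The function names and data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(xs, ys):
    """Return r^2 = 1 - (unexplained variation / total variation)."""
    a, b = least_squares_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variation left over after using the line (squared residuals).
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of the response around its mean.
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total

# Hypothetical data (not from the guide): temperature and beach visitors.
temperature = [20, 22, 25, 27, 30, 32]
visitors = [110, 135, 180, 190, 260, 275]

a, b = least_squares_line(temperature, visitors)
predicted = a + b * 28   # prediction for a 28-degree day, inside the data range
r2 = r_squared(temperature, visitors)

print("predicted visitors at 28 degrees:", round(predicted))
print("r squared:", f"{r2:.1%}")
```

For a line with one explanatory variable, this value equals the square of the correlation coefficient r.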

Limitations of using linear regression for prediction

We established that linear regression can be used to make predictions about the response variable based on a change in the explanatory variable

if there is a linear relationship between the two variables
if the linear relationship is moderately strong.

Here we add two more conditions that must be met before we can use regression for prediction.

there must be no outliers
we must make predictions within the range of the data available

This video explains how the reliability of prediction goes down when we use a linear equation to make predictions outside the range of data or when there are outliers.
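One simple way to respect the range condition in practice is to have the prediction step refuse values outside the observed data. The sketch below is one possible approach, not a standard routine; the function names and data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def predict_within_range(x_new, xs, ys):
    """Predict Y for x_new, but only inside the observed range of X."""
    low, high = min(xs), max(xs)
    if not low <= x_new <= high:
        raise ValueError(
            f"{x_new} is outside the observed range {low} to {high}; "
            "the prediction would be an extrapolation"
        )
    a, b = least_squares_line(xs, ys)
    return a + b * x_new

# Hypothetical data (not from the guide): temperature and beach visitors.
temperature = [20, 22, 25, 27, 30, 32]
visitors = [110, 135, 180, 190, 260, 275]

print(round(predict_within_range(28, temperature, visitors)))  # inside the range

try:
    predict_within_range(45, temperature, visitors)  # far above the hottest observed day
except ValueError as err:
    print("Refused:", err)
```

The check does not make predictions inside the range correct; it only blocks the case where the line is stretched beyond the data that produced it.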

Correlation is not causation

Cause or correlation?

Does low self-esteem cause depression?

Does depression cause low self-esteem?

Does an increase in sunburns cause an increase in ice cream sales?

Does an increase in ice cream sales cause an increase in sunburns?

https://sites.google.com/site/kaylauthepsychologyobjectives/grade-level/objective-5

https://www.clearsalesmessage.com/correlation-causation/

Confounding variables, such as distressing events or biological disposition, might explain both depression and low self-esteem.

The confounding variable, summer weather, might explain both ice cream consumption and sunburns.
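The ice cream and sunburn example can be imitated with a small simulation. In the sketch below, temperature drives both ice cream sales and sunburns, and neither one affects the other. All numbers, including the coefficients and noise levels, are invented assumptions used only to illustrate the idea.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the simulation is repeatable

# Confounding variable: daily temperature (made-up range of 15 to 35 degrees).
temps = [random.uniform(15, 35) for _ in range(2000)]

# Both outcomes depend on temperature plus random noise, not on each other.
ice_cream = [10 * t + random.gauss(0, 20) for t in temps]
sunburns = [2 * t + random.gauss(0, 5) for t in temps]

r_all = pearson_r(ice_cream, sunburns)

# Hold the confounder roughly constant: keep only days between 24 and 26 degrees.
band = [(i, s) for t, i, s in zip(temps, ice_cream, sunburns) if 24 <= t <= 26]
r_band = pearson_r([i for i, _ in band], [s for _, s in band])

print("r across all days:         ", round(r_all, 2))
print("r on days of similar temp.:", round(r_band, 2))
```

Across all days the two variables are strongly correlated, but once temperature is held nearly constant the relationship largely disappears, which is what we would expect if temperature, not ice cream, is behind the sunburns.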

This video offers another way to think about correlation and causation. (Watch until 1:07)

Establishing causality in social research

How to establish causality

The best way to establish causality is through a randomized, controlled experiment. However, it is also possible to establish a cause-and-effect relationship using observational studies.

Experiments are the best way to establish causality because the researcher can isolate the explanatory variable and control for extraneous variables. This makes us confident that it is the explanatory variable that is causing the changes in the response variable.

This video explains why experiments are so useful to establish a cause-and-effect relationship between two variables.

When it is unfeasible or unethical to conduct an experiment, researchers use another method to establish causality. They look at the data available from observational studies and ask the questions below:

A fascinating example of using observational studies to establish causation is the relationship between smoking and lung cancer.

Smoking and Lung Cancer: From Association to Causation