Linear Regression

When working with bivariate data, not only can we get numerical and graphical summaries for the individual numerical variables, but we can get statistical summaries of them together.

Correlation

To calculate the correlation between two numerical variables, we will use the cor() function.
We will use the mtcars dataset in R (first used on the Numerical Data Summary page), along with the iris dataset.

    #Correlation between weight of car and miles per gallon.
    cor(mtcars$mpg, mtcars$wt)
    #Correlation between Iris petal length and petal width.
    cor(iris$Petal.Length, iris$Petal.Width)
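
As a rough check on what cor() computes, the sketch below rebuilds the (default) Pearson correlation by hand as the covariance divided by the product of the two standard deviations. The variable names x, y, r_manual, and r_builtin are introduced here for illustration only and are not part of the original example.

```r
# Pearson correlation computed by hand, compared with cor().
x <- mtcars$wt   # weight (1000 lbs)
y <- mtcars$mpg  # miles per gallon

# Covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)

# cor() is symmetric: the order of the two variables does not matter
all.equal(cor(x, y), cor(y, x))
```

Because the correlation is symmetric, cor(mtcars$mpg, mtcars$wt) and cor(mtcars$wt, mtcars$mpg) give the same value.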

You can even get correlations for multiple pairs of numerical variables arranged in a matrix.
cor() can only accept numerical input. Since the iris dataset has four numerical variables and one categorical variable, we need to subset the data (using [,] notation) before putting it into cor().

    #This will only include the numerical variable columns
    # iris[, 1:4]
    #Finding the correlation matrix
    cor(iris[, 1:4])
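
If you would rather not hard-code the column positions, one possible alternative is to let R find the numerical columns for you. This is only a sketch: the names is_num, iris_num, and cor_mat are made up for this example.

```r
# Identify the numerical columns instead of hard-coding 1:4
is_num <- sapply(iris, is.numeric)   # logical vector, one entry per column
iris_num <- iris[, is_num]           # keeps only the numerical columns

# Correlation matrix of the numerical columns, rounded for readability
cor_mat <- cor(iris_num)
round(cor_mat, 2)
```

For iris this selects the same four columns as iris[, 1:4], so the resulting matrix is the same.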

Scatterplots

To create a scatterplot between two numerical variables, we will use the plot() function. The first two arguments for plot() are the variables for the $x$ and $y$ axes, respectively. A few of the additional arguments for plot(), all used in the customized example below, are main (plot title), xlab and ylab (axis labels), xlim and ylim (axis ranges), pch (point symbol), and col (point color).
We will use the mtcars dataset in R.

    #Basic Scatterplot
    plot(mtcars$wt, mtcars$mpg)
    #Customized Scatterplot
    plot(mtcars$wt, mtcars$mpg,
         main = "Cars from 1974",
         xlab = "Weight (1000 lbs)",
         ylab = "Miles per gallon (mpg)",
         xlim = c(0, 6), ylim = c(0, 40),
         pch = 19, col = "chocolate")

You can even get scatterplots for multiple pairs of numerical variables arranged in a matrix using the pairs() function.
pairs() can only accept numerical input. Since the iris dataset has four numerical variables and one categorical variable, we need to subset the data (using [,] notation) before putting it into pairs().

    #This will only include the numerical variable columns
    # iris[, 1:4]
    #Creating the scatterplot matrix
    pairs(iris[, 1:4])
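
Although the categorical column cannot be plotted by pairs() directly, it can still be used to color the points. The sketch below is one way to do that; the name species_cols is hypothetical, and the colors come from R's default palette.

```r
# One integer code per row (1, 2, 3), taken from the Species factor levels
species_cols <- as.integer(iris$Species)

# Scatterplot matrix of the four numerical columns, colored by species
pairs(iris[, 1:4], col = species_cols, pch = 19)
```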

Least Squares Line

To calculate the least squares regression line for two numerical variables, we will use the lm() function, which stands for "linear model". Using '~' notation, we put the dependent variable, $y$, before the '~' and the explanatory variable, $x$, after it.
We will use the mtcars dataset in R. Miles per gallon (mpg) will be our dependent variable, $y$, and weight (wt) will be our explanatory variable, $x$.

    lm(mtcars$mpg ~ mtcars$wt)
The lm() function will return the intercept and slope coefficient for the least squares regression line. For this example, the $y$-intercept is 37.285 and the slope is -5.344. The least squares line would look like this: $$y = 37.285 - 5.344x$$ where $x$ is the weight (wt) variable and $y$ is the miles per gallon (mpg) variable.
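
To work with the intercept and slope in code rather than reading them off the printed output, you can store the model and use coef(). The sketch below also uses the fitted line to predict mpg for a hypothetical car weighing 3,000 lbs. It assumes the model is written with the data argument (lm(mpg ~ wt, data = mtcars)), which fits the same line as the call above but lets predict() accept new data. The names fit, b, and new_car are made up for this example.

```r
# Same model as above, written with the data argument
fit <- lm(mpg ~ wt, data = mtcars)
b <- coef(fit)   # named vector: "(Intercept)" and "wt"
b

# Predicted mpg for a hypothetical car with wt = 3 (3,000 lbs)
new_car <- data.frame(wt = 3)
predict(fit, newdata = new_car)

# The same prediction from the equation y = intercept + slope * x
unname(b["(Intercept)"] + b["wt"] * 3)
```

Both approaches should give the same predicted value, since predict() simply evaluates the fitted line at the new weight.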
For more detailed information on the linear model, you can use the summary() function.
    summary(lm(mtcars$mpg ~ mtcars$wt))
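
The object returned by summary() can also be stored and inspected piece by piece. The sketch below pulls out a few commonly used components; the names fit and s are chosen for this example, and the model is written with the data argument as an assumption of convenience.

```r
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

s$coefficients   # estimates, standard errors, t values, and p-values
s$r.squared      # proportion of the variance in mpg explained by wt
s$sigma          # residual standard error

# With a single explanatory variable, R-squared equals the squared correlation
all.equal(s$r.squared, cor(mtcars$mpg, mtcars$wt)^2)
```

The last line ties this section back to the correlation computed earlier: for a one-variable linear model, squaring the correlation gives the model's R-squared.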

Plotting the Least Squares Line

After creating your scatterplot, you can add the least squares regression line to your plot using the abline() function. You can either specify the intercept and slope values explicitly or just nest lm() inside of abline().
    plot(mtcars$wt, mtcars$mpg,
         main = "Cars from 1974",
         xlab = "Weight (1000 lbs)",
         ylab = "Miles per gallon (mpg)",
         xlim = c(0, 6), ylim = c(0, 40),
         pch = 19, col = "chocolate")
    #Explicitly, 'a' is intercept, 'b' is slope
    abline(a = 37.285, b = -5.344)
    #Using lm()
    abline(lm(mtcars$mpg ~ mtcars$wt))
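
A third option, sketched below, is to fit the model once, store it, and pass the stored object to abline(). This avoids typing rounded coefficients by hand and keeps the model available for later use. The name fit and the lwd setting are choices made for this example.

```r
# Fit the model once and keep it
fit <- lm(mpg ~ wt, data = mtcars)

plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon (mpg)",
     pch = 19, col = "chocolate")

# abline() reads the intercept and slope from the fitted model
abline(fit, lwd = 2)
```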