Numerical Data Summary
Here are some numerical data summaries that you can find using R:
- sum() is used to take the sum of a set of numbers.
- length() can be used to find the number of elements in the vector provided. In other words, this will output the number of observations in a vector of data.
- mean() is used to find the mean of a set of numbers.
- median() is used to find the median of a set of numbers.
- cor() is used to find the correlation between two sets of numbers.
- var() is used to find the variance of a set of numbers.
- sd() is used to find the standard deviation of a set of numbers.
- IQR() is used to find the interquartile range of a set of numbers.
- quantile() is used to find specific quantiles in a set of numbers.
  - By default, R outputs the 0th, 25th, 50th, 75th and 100th percentiles. If you want different quantiles, specify them in the probs argument.
- The summary() function can be used for both numerical and categorical data, but it provides different results for each.
  - To get the relative frequency of each category of a categorical variable, divide the counts returned by summary() by the number of observations.
  - You can even use summary() on an entire dataset to get an overall summary of all of the variables.
If you have categorical data that has been recorded numerically (for instance, "yes" and "no" recorded as "1" and "0" respectively), you will need to tell R to treat the numbers as category names rather than numerical data.
The as.factor() function tells R to treat the numbers as factors (categories) rather than as numeric values.
sum(iris$Petal.Length)
length(iris$Petal.Length)
# There are 150 observations in the iris dataset.
mean(iris$Petal.Length)
median(iris$Petal.Length)
cor(iris$Petal.Length, iris$Sepal.Length)
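As a quick check on how these functions relate, the mean can be rebuilt from sum() and length(). This is only an illustrative sketch; the variable names x and manual_mean are made up for the example.

```r
x <- iris$Petal.Length

# The mean is the sum of the values divided by the number of values.
manual_mean <- sum(x) / length(x)
manual_mean
mean(x)

# all.equal() compares numbers while allowing for tiny rounding differences.
all.equal(manual_mean, mean(x))
```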
var(iris$Petal.Length)
sd(iris$Petal.Length)
IQR(iris$Petal.Length)
quantile(iris$Petal.Length)
# Specifying the 40th and 95th percentiles in the quantile() function.
quantile(iris$Petal.Length, probs = c(0.40, 0.95))
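The measures of spread above are connected to each other: the standard deviation is the square root of the variance, and the interquartile range is the distance between the 25th and 75th percentiles. The sketch below checks both relationships, assuming the default settings of quantile() and IQR(); the names x, q and iqr_by_hand are made up for the example.

```r
x <- iris$Petal.Length

# Standard deviation is the square root of the variance.
all.equal(sd(x), sqrt(var(x)))

# IQR is the 75th percentile minus the 25th percentile.
q <- quantile(x, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])  # unname() drops the "75%" label
all.equal(iqr_by_hand, IQR(x))
```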
Petal.Length is a numerical variable while Species is a categorical variable.
# Numerical Variable
summary(iris$Petal.Length)
# Categorical Variable
summary(iris$Species)
For a numerical variable, summary() provides the five-number summary along with the mean.
For a categorical variable, summary() tallies up how many observations are in each category.
# Relative frequencies
summary(iris$Species) / length(iris$Species)
# Summary of every variable in the dataset
summary(iris)
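Another way to get counts and relative frequencies, not covered in the list above, is to combine the base R functions table() and prop.table(). The sketch below is one possible approach; the names counts and props are made up for the example.

```r
# Count how many observations fall in each category.
counts <- table(iris$Species)
counts

# Convert the counts to proportions of the total.
props <- prop.table(counts)
props

# The proportions should add up to 1.
sum(props)
```

Dividing summary(iris$Species) by length(iris$Species) gives the same proportions, so either approach can be used.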
In the mtcars dataset in R, the type of transmission was recorded for 32 cars. The am variable was recorded as a "0" for automatic and "1" for manual transmission.
summary(mtcars$am)
summary(as.factor(mtcars$am))
The first summary() call treated the zeroes and ones as numeric data. In the second line, as.factor() told R to treat the data as factors, so we get a table counting the automatic and manual transmission cars.
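If the "0" and "1" headings are hard to read, one option is to build the factor with factor() and give each level a descriptive label. This is a sketch of one possible approach rather than a required step; the name am_labeled is made up for the example, and the labels follow the coding described above (0 for automatic, 1 for manual).

```r
# Create a labeled factor without changing the mtcars dataset itself.
am_labeled <- factor(mtcars$am,
                     levels = c(0, 1),
                     labels = c("automatic", "manual"))

# The summary now shows counts under readable category names.
summary(am_labeled)
```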