Descriptive statistics provide a summary that quantitatively describes a sample of data.
Population refers to the entire group of individuals that we want to draw conclusions about.
Sample refers to the (usually smaller) group of individuals for which we have collected data.
For the examples later, let’s create a population of data in R and draw a sample from it.
What do the values look like?
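The steps above (create a population, draw a sample, look at the values) might be sketched as follows. The names population and sample_data, the normal distribution, and the sizes are assumptions chosen for illustration, not the original example.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical population: 10,000 values from a normal distribution
population <- rnorm(10000, mean = 100, sd = 15)

# Draw a simple random sample of 50 values (without replacement)
sample_data <- sample(population, size = 50)

# Look at the first few values and a quick numeric overview
head(sample_data)
summary(sample_data)
```

Because of set.seed(), rerunning the block gives the same sample each time; dropping that line gives a different sample on every run.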
The mean, often simply called the average, is defined as the sum of all values divided by the number of values. It’s a measure of central tendency that tells us what’s happening near the middle of the data.
\(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\)
In R, we use the mean() function.
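A minimal sketch, using a small made-up vector x (an assumption, not the sample from the original example) so the arithmetic can be checked by hand:

```r
# Hypothetical data: 7 values that sum to 42
x <- c(2, 4, 4, 5, 7, 9, 11)

# By the definition: sum of all values divided by the number of values
sum(x) / length(x)  # 6

# The built-in function gives the same result
mean(x)  # 6
```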
The median of a dataset is the middle value when the data is arranged in ascending order, or the average of the two middle values if the dataset has an even number of observations.
In R, we use the median() function.
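A short sketch covering both cases from the definition, using made-up vectors (x_odd and x_even are hypothetical names). The input does not need to be sorted first.

```r
# Odd number of observations: the middle value after sorting
x_odd <- c(9, 2, 7, 4, 11, 5, 4)
sort(x_odd)    # 2 4 4 5 7 9 11
median(x_odd)  # 5

# Even number of observations: the average of the two middle values
x_even <- c(7, 2, 5, 4)
sort(x_even)    # 2 4 5 7
median(x_even)  # 4.5
```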
The mode statistic represents the value that appears most frequently in a dataset.
In R, there is no built-in function for the statistical mode (the existing mode() function returns an object's storage mode instead). Instead, we count how many of each value there are and choose the one with the highest count.
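One way to do the counting is with table() and which.max(). This is a sketch on a made-up vector x, not necessarily the approach used in the original example.

```r
# Hypothetical data: the value 4 appears twice, every other value once
x <- c(2, 4, 4, 5, 7, 9, 11)

# Count how often each distinct value occurs
counts <- table(x)

# Pick the name of the entry with the highest count
mode_value <- names(counts)[which.max(counts)]
mode_value              # "4" (a character string, taken from the table names)
as.numeric(mode_value)  # 4
```

Note that which.max() returns only the first maximum, so if several values are tied for most frequent, this sketch reports just one of them.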
The range is the difference between the maximum and minimum values in a dataset.
In R, we can use the max() and min() functions and subtract the values.
Note that the range() function returns the minimum and maximum, not a single value.
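Both points can be sketched on a made-up vector x (an assumption for illustration). Wrapping range() in diff() is one way to turn its two values into a single number.

```r
# Hypothetical data
x <- c(2, 4, 4, 5, 7, 9, 11)

# Range as a single number: maximum minus minimum
max(x) - min(x)  # 9

# range() returns both endpoints instead
range(x)  # 2 11

# diff() of the two endpoints gives the same single number
diff(range(x))  # 9
```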
The sample variance tells us about how spread out the data is. A lower variance indicates that values tend to be close to the mean, and a higher variance indicates that the values are spread out over a wider range.
\(s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\)
In R, we use the var() function.
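To connect the formula to the function, the sketch below computes the sample variance by hand on a made-up vector x and compares it with var(). The name manual_var is hypothetical.

```r
# Hypothetical data with mean 6
x <- c(2, 4, 4, 5, 7, 9, 11)

# By the formula: squared deviations from the mean, summed, divided by n - 1
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_var  # 10

# The built-in function uses the same n - 1 denominator
var(x)  # 10
```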
The sample standard deviation is the square root of the variance. It also tells us about how spread out the data is.
\(s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\)
In R, we use the sd() function.
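A brief sketch on a made-up vector x, showing that sd() matches the square root of var(). Unlike the variance, the standard deviation is in the same units as the data.

```r
# Hypothetical data with sample variance 10
x <- c(2, 4, 4, 5, 7, 9, 11)

# Square root of the sample variance
sqrt(var(x))  # 3.162278

# The built-in function gives the same value
sd(x)  # 3.162278
```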
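In R:
Load the ames housing data set using data(ames, package = "modeldata").
Calculate the mean, median, mode, range, variance, and standard deviation of the sale price (the Sale_Price column). Remember: you can extract a column in R using dataset$column_name.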
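One possible way to approach this is to bundle the six statistics into a small helper and apply it to the column. The helper name describe() is hypothetical, and the sketch assumes the modeldata package is installed; if it is not, the last step is simply skipped.

```r
# Hypothetical helper: returns the six descriptive statistics for a numeric vector
describe <- function(x) {
  counts <- table(x)  # frequency of each distinct value
  list(
    mean     = mean(x),
    median   = median(x),
    mode     = names(counts)[which.max(counts)],  # most frequent value, as a string
    range    = max(x) - min(x),
    variance = var(x),
    sd       = sd(x)
  )
}

# Apply it to the sale prices, if the modeldata package is available
if (requireNamespace("modeldata", quietly = TRUE)) {
  data(ames, package = "modeldata")
  print(describe(ames$Sale_Price))
}
```

Because the mode here is read from the names of the table() result, it comes back as a character string rather than a number.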
[1] 180796.1
[1] 160000
[1] "135000"
[1] 742211
[1] 6381883616
[1] 79886.69