Introduction to Statistics

Nicola Rennie

Descriptive statistics

Descriptive statistics provide a summary that quantitatively describes a sample of data.

Population

Population refers to the entire group of individuals that we want to draw conclusions about.

Sample

Sample refers to the (usually smaller) group of individuals for which we have collected data.

Generate sample data

For the examples later, let’s create a population of data in R…:
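A minimal sketch of this step, assuming a normally distributed population. The name `population`, the seed, the population size, and the mean and standard deviation are illustrative choices, not values taken from the original slides.

```r
# Assumption: a hypothetical population of 10,000 normally distributed values
set.seed(123)                                   # make the random numbers reproducible
population <- rnorm(n = 10000, mean = 50, sd = 10)

# Check how many values the population contains
length(population)
```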

… and draw a sample from it:
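One way to sketch this, assuming the same hypothetical population as before (recreated here so the block runs on its own). The name `sample_data` and the sample size of 100 are assumptions made for illustration.

```r
# Assumption: recreate a hypothetical population of 10,000 values
set.seed(123)
population <- rnorm(n = 10000, mean = 50, sd = 10)

# Draw 100 values at random; sample() draws without replacement by default
sample_data <- sample(population, size = 100)

length(sample_data)
```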

What do the values look like?
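A quick way to inspect the values is to print the first few and draw a histogram. This sketch assumes a hypothetical sample called `sample_data`, recreated here so the block is self-contained; the plot title and axis label are illustrative.

```r
# Assumption: hypothetical population and sample, as in the earlier setup
set.seed(123)
population <- rnorm(n = 10000, mean = 50, sd = 10)
sample_data <- sample(population, size = 100)

# Print the first six values
head(sample_data)

# Plot the distribution of the sampled values
hist(sample_data, main = "Sample values", xlab = "Value")
```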

Mean

The mean, often simply called the average, is defined as the sum of all values divided by the number of values. It’s a measure of central tendency that tells us what’s happening near the middle of the data.

\(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\)

In R, we use the mean() function:
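As a small illustration, assume a hypothetical vector `x` that is short enough to check by hand:

```r
# Assumption: a small illustrative vector, not data from the slides
x <- c(2, 4, 4, 5, 7, 9, 11)

# Built-in function
mean(x)
# [1] 6

# The same calculation written out from the formula: sum divided by count
sum(x) / length(x)
# [1] 6
```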

Median

The median of a dataset is the middle value when the data is arranged in ascending order, or the average of the two middle values if the dataset has an even number of observations.

In R, we use the median() function:
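A short sketch using a hypothetical vector `x`, showing both the odd and the even case:

```r
# Assumption: a small illustrative vector, not data from the slides
x <- c(2, 4, 4, 5, 7, 9, 11)

# Odd number of values: the middle value of the sorted data
median(x)
# [1] 5

# Even number of values: the average of the two middle values (4 and 5)
median(c(2, 4, 4, 5, 7, 9))
# [1] 4.5
```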

Mode

The mode statistic represents the value that appears most frequently in a dataset.

In R, there is no built-in function for the statistical mode (the base mode() function returns the storage type of an object instead). Instead, we count how many of each value there are and choose the one with the highest count:
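The counting approach can be sketched as follows, assuming a hypothetical vector `x`. Note that `table()` stores the values as names, so the result is text until it is converted back to a number, and that `which.max()` returns only the first value if several are tied.

```r
# Assumption: a small illustrative vector, not data from the slides
x <- c(2, 4, 4, 5, 7, 9, 11)

# Count how many times each distinct value appears
counts <- table(x)

# Pick the value with the highest count and convert it back to a number
as.numeric(names(counts)[which.max(counts)])
# [1] 4
```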

Range

The range is the difference between the maximum and minimum values in a dataset.

In R, we can use the max() and min() functions and subtract the values:
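For example, with a hypothetical vector `x`:

```r
# Assumption: a small illustrative vector, not data from the slides
x <- c(2, 4, 4, 5, 7, 9, 11)

# Largest value minus smallest value
max(x) - min(x)
# [1] 9
```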

Note that the range() function returns the minimum and maximum, not a single value:
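A brief sketch of the difference, again assuming a hypothetical vector `x`. Wrapping the result in `diff()` is one way to turn the pair into a single number.

```r
# Assumption: a small illustrative vector, not data from the slides
x <- c(2, 4, 4, 5, 7, 9, 11)

# range() returns two values: the minimum (2) and the maximum (11)
range(x)

# diff() subtracts the first from the second, giving the range as one value
diff(range(x))
# [1] 9
```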

Sample variance

The sample variance tells us about how spread out the data is. A lower variance indicates that values tend to be close to the mean, and a higher variance indicates that the values are spread out over a wider range.

\(s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\)

In R, we use the var() function:
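As an illustration, assume a hypothetical vector `x` with mean 6, so the squared differences sum to 60 and dividing by n - 1 = 6 gives 10:

```r
# Assumption: a small illustrative vector, not data from the slides
x <- c(2, 4, 4, 5, 7, 9, 11)

# Built-in function (uses the n - 1 denominator)
var(x)
# [1] 10

# The same calculation written out from the formula
sum((x - mean(x))^2) / (length(x) - 1)
# [1] 10
```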

Sample standard deviation

The sample standard deviation is the square root of the sample variance. It also tells us how spread out the data is, but in the same units as the data itself.

\(s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\)

In R, we use the sd() function:
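A small sketch with a hypothetical vector `x` whose sample variance is 10:

```r
# Assumption: a small illustrative vector, not data from the slides
x <- c(2, 4, 4, 5, 7, 9, 11)

# Built-in function
sd(x)
# [1] 3.162278

# Equivalent to taking the square root of the sample variance
sqrt(var(x))
# [1] 3.162278
```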

Descriptive statistics

Descriptive statistics provide a summary that quantitatively describes a sample of data.

  • Mean: The sum of the values divided by the number of values.
  • Median: The middle value of the data when it’s sorted.
  • Mode: The value that appears most frequently.
  • Range: The difference between the maximum and minimum values.
  • Variance: The sum of the squared differences from the mean, divided by n − 1 (roughly their average).
  • Standard deviation: The square root of the variance.

Exercise

In R:

  • Load the ames housing data set using data(ames, package = "modeldata")
  • Calculate the mean, median, mode, range, variance, and standard deviation of house prices (the Sale_Price column).

Remember: you can extract a column in R using dataset$column_name.

Exercise solutions

# load data
data(ames, package = "modeldata")

# summary statistics (printed output shown as #> comments)
mean(ames$Sale_Price)
#> [1] 180796.1
median(ames$Sale_Price)
#> [1] 160000
# mode: the most frequent value (returned as text)
names(sort(table(ames$Sale_Price), decreasing = TRUE)[1])
#> [1] "135000"
# range: maximum minus minimum
max(ames$Sale_Price) - min(ames$Sale_Price)
#> [1] 742211
var(ames$Sale_Price)
#> [1] 6381883616
sd(ames$Sale_Price)
#> [1] 79886.69

Questions?