R - Mean, Median and Mode

Hello, aspiring R programmers! Today, we're going to dive into the world of descriptive statistics using R. As your friendly neighborhood computer science teacher, I'm here to guide you through the concepts of mean, median, and mode. Don't worry if you've never written a line of code before – we'll start from the very beginning and work our way up together.


Mean

Let's begin with the mean, which is probably the most common measure of central tendency. In simple terms, it's what we often call the "average."

Basic Mean Calculation

To calculate the mean in R, we use the mean() function. Here's a simple example:

numbers <- c(10, 20, 30, 40, 50)
result <- mean(numbers)
print(result)

This will output: 30

Let's break this down:

We create a vector called numbers using the c() function.
We use the mean() function to calculate the average of these numbers.
We store the result in a variable called result.
Finally, we print the result.
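To see what mean() is doing, the short sketch below reproduces the same result by hand: add the values with sum() and divide by how many there are with length(). The variable name manual_mean is just an illustrative choice.

```r
# The same vector used above
numbers <- c(10, 20, 30, 40, 50)

# Mean by hand: total of the values divided by the count of values
manual_mean <- sum(numbers) / length(numbers)

print(manual_mean)                                        # [1] 30
print(isTRUE(all.equal(manual_mean, mean(numbers))))      # [1] TRUE
```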

Mean with NA Values

Now, what happens if we have missing data, represented by NA in R? Let's see:

numbers_with_na <- c(10, 20, NA, 40, 50)
result_with_na <- mean(numbers_with_na)
print(result_with_na)

This will output: NA

Oops! R returns NA because the average of a set that contains an unknown value is itself unknown, and by default mean() does not skip missing values. But don't worry, we have a solution!

Applying NA Option

We can tell R to ignore NA values using the na.rm option:

numbers_with_na <- c(10, 20, NA, 40, 50)
result_na_removed <- mean(numbers_with_na, na.rm = TRUE)
print(result_na_removed)

This will output: 30

Much better! By setting na.rm = TRUE, we're telling R to remove NA values before calculating the mean.
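Before reaching for na.rm, it can help to check how many values are actually missing. The sketch below uses base R's anyNA() and is.na() to inspect the vector, then shows that dropping the NA entries by hand gives the same answer as na.rm = TRUE. The names na_count and cleaned are illustrative.

```r
numbers_with_na <- c(10, 20, NA, 40, 50)

# Does the vector contain any missing values, and how many?
print(anyNA(numbers_with_na))            # [1] TRUE
na_count <- sum(is.na(numbers_with_na))
print(na_count)                          # [1] 1

# Dropping the NA entries by hand is equivalent to na.rm = TRUE
cleaned <- numbers_with_na[!is.na(numbers_with_na)]
print(mean(cleaned))                     # [1] 30
```

Note that the mean is taken over the four remaining values, not the original five.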

Applying Trim Option

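Sometimes, we want to exclude extreme values from our mean calculation. That's where the trim option comes in handy. It takes a fraction between 0 and 0.5 and drops that share of the values from each end of the sorted data before calculating the mean.

numbers <- c(1, 2, 3, 4, 5, 100)  # Note the outlier 100
result_trimmed <- mean(numbers, trim = 0.2)
print(result_trimmed)

This will output: 3.5

By setting trim = 0.2, we're asking R to remove 20% of the data from each end, which helps reduce the impact of outliers. R rounds the number of values to drop down to a whole number, so with six values it drops one from each end (1 and 100) and averages the remaining 2, 3, 4, and 5. Note that trim = 0.1 would drop nothing here, because 10% of six values rounds down to zero, and the result would be the ordinary mean of about 19.17.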
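To make the trimming step concrete, the sketch below mirrors what mean() does with trim: sort the values, drop floor(n * trim) of them from each end, and average what is left. The vector scores and the names k, sorted, and kept are made up for illustration.

```r
# Ten illustrative values with one obvious outlier (200)
scores <- c(2, 4, 5, 6, 7, 8, 9, 10, 11, 200)

trim <- 0.1
n <- length(scores)
k <- floor(n * trim)                 # values to drop from EACH end: 1

sorted <- sort(scores)
kept <- sorted[(k + 1):(n - k)]      # drops the smallest (2) and largest (200)

print(mean(kept))                    # [1] 7.5
print(mean(scores, trim = 0.1))      # [1] 7.5
print(mean(scores))                  # [1] 26.2
```

Because the count is rounded down, a small trim value on a short vector may drop nothing at all.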

Median

The median is the middle value when a dataset is ordered from least to greatest. It's less affected by outliers than the mean.

numbers <- c(1, 3, 5, 7, 9, 11, 13)
result_median <- median(numbers)
print(result_median)

This will output: 7
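A quick way to see why the median is less sensitive to outliers is to replace one value with something extreme and compare both measures. The vector with_outlier below is a made-up variation on the example above.

```r
numbers <- c(1, 3, 5, 7, 9, 11, 13)
with_outlier <- c(1, 3, 5, 7, 9, 11, 1300)   # last value replaced by an outlier

print(mean(numbers))          # [1] 7
print(median(numbers))        # [1] 7

print(mean(with_outlier))     # [1] 190.8571
print(median(with_outlier))   # [1] 7
```

The single outlier pulls the mean far upward, while the middle value stays where it was.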

The median() function works similarly to mean(). It also has an na.rm option for handling NA values:

numbers_with_na <- c(1, 3, NA, 7, 9, 11, 13)
result_median_na <- median(numbers_with_na, na.rm = TRUE)
print(result_median_na)

This will output: 8
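The result 8 does not appear in the data because, once the NA is removed, an even number of values remains and the median is the average of the two middle ones. The sketch below works that out by hand; the names sorted, n, and manual_median are illustrative.

```r
numbers_with_na <- c(1, 3, NA, 7, 9, 11, 13)

# sort() drops NA values by default, leaving 1 3 7 9 11 13
sorted <- sort(numbers_with_na)
n <- length(sorted)                  # 6 values remain

# For an even count, average the two middle values (positions n/2 and n/2 + 1)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

print(manual_median)                 # [1] 8
```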

Mode

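Interestingly, R doesn't have a built-in function for the statistical mode (the most frequent value). Base R does have a function called mode(), but it reports an object's storage type (such as "numeric"), not its most common value. But don't worry! We can create our own function: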

get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

numbers <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
result_mode <- get_mode(numbers)
print(result_mode)

This will output: 3

Let's break down this custom function:

unique(v) gets the unique values in the vector.
match(v, uniqv) finds, for each element of v, its position in uniqv.
tabulate() counts how many times each position occurs.
which.max() finds the position of the maximum count.
We return the value at that position.
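Printing each intermediate value makes the steps above easier to follow. The sketch below repeats them one at a time on the same vector; the variable names positions and counts are illustrative.

```r
numbers <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

uniqv <- unique(numbers)
print(uniqv)                     # [1] 1 2 3 4 5

# For each element of numbers, its index within uniqv
positions <- match(numbers, uniqv)
print(positions)                 # [1] 1 2 2 3 3 3 4 4 5

# How many times each index occurs
counts <- tabulate(positions)
print(counts)                    # [1] 1 2 3 2 1

# Index of the largest count, then the value at that index
print(which.max(counts))         # [1] 3
print(uniqv[which.max(counts)])  # [1] 3
```

One thing to keep in mind: which.max() returns the first maximum, so when several values tie for most frequent, this approach reports only the one that appears first in the vector.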

Summary of Functions

Here's a handy table summarizing the functions we've learned:

Measure	Function	Options
Mean	mean()	na.rm, trim
Median	median()	na.rm
Mode	Custom function	N/A

Remember, practice makes perfect! Try these functions with different datasets and explore how changing the options affects the results.
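As a starting point for that kind of experimentation, the sketch below applies all three measures to one small made-up vector that contains both a missing value and an outlier. The vector ages is invented for illustration, and get_mode() is the custom function from the Mode section, repeated here so the block runs on its own.

```r
# Custom mode function from the Mode section
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Made-up data: one missing value (NA) and one outlier (95)
ages <- c(21, 22, 22, 23, 24, NA, 25, 22, 26, 95)

print(mean(ages, na.rm = TRUE))     # [1] 31.11111
print(median(ages, na.rm = TRUE))   # [1] 23
print(get_mode(ages))               # [1] 22
```

The outlier drags the mean well above most of the values, while the median and mode stay close to the bulk of the data. Try changing 95 to 27 and rerunning to see how much the mean moves.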

As we wrap up, I'm reminded of a story from my early days of learning R. I once spent hours trying to calculate the mean of a dataset, only to realize I had forgotten to remove the NA values. Don't be like me – always check your data and use na.rm = TRUE when needed!

Happy coding, and may your statistical adventures in R be filled with insights and aha moments!

