R - Mean, Median and Mode
Hello, aspiring R programmers! Today, we're going to dive into the world of descriptive statistics using R. As your friendly neighborhood computer science teacher, I'm here to guide you through the concepts of mean, median, and mode. Don't worry if you've never written a line of code before – we'll start from the very beginning and work our way up together.
Mean
Let's begin with the mean, which is probably the most common measure of central tendency. In simple terms, it's what we often call the "average."
Basic Mean Calculation
To calculate the mean in R, we use the mean() function. Here's a simple example:
numbers <- c(10, 20, 30, 40, 50)
result <- mean(numbers)
print(result)
This will output: 30
Let's break this down:
- We create a vector called numbers using the c() function.
- We use the mean() function to calculate the average of these numbers.
- We store the result in a variable called result.
- Finally, we print the result.
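To see what mean() is doing behind the scenes, here is a minimal sketch that computes the same average by hand. The variable name manual_mean is only an illustrative choice, not something built into R.

```r
numbers <- c(10, 20, 30, 40, 50)

# The mean is the sum of the values divided by how many values there are
manual_mean <- sum(numbers) / length(numbers)

print(manual_mean)                              # 30
print(all.equal(manual_mean, mean(numbers)))    # TRUE
```

The comparison uses all.equal() rather than ==, since that is the usual way to compare decimal numbers in R.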
Mean with NA Values
Now, what happens if we have missing data, represented by NA in R? Let's see:
numbers_with_na <- c(10, 20, NA, 40, 50)
result_with_na <- mean(numbers_with_na)
print(result_with_na)
This will output: NA
Oops! R returns NA because a missing value could be anything, so any calculation that includes it is unknown too. But don't worry, we have a solution!
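Before fixing anything, it can help to confirm whether a vector contains missing values and how many. The sketch below uses base R's is.na() and anyNA(); the variable names missing_flags and na_count are made up for this illustration.

```r
numbers_with_na <- c(10, 20, NA, 40, 50)

# is.na() returns TRUE for each missing element
missing_flags <- is.na(numbers_with_na)
print(missing_flags)             # FALSE FALSE TRUE FALSE FALSE

# anyNA() answers the question "is there at least one NA?"
print(anyNA(numbers_with_na))    # TRUE

# Summing a logical vector counts the TRUE values, i.e. the number of NAs
na_count <- sum(missing_flags)
print(na_count)                  # 1

# Arithmetic involving NA is itself NA, which is why mean() returned NA
print(NA + 1)                    # NA
```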
Applying NA Option
We can tell R to ignore NA values using the na.rm option:
numbers_with_na <- c(10, 20, NA, 40, 50)
result_na_removed <- mean(numbers_with_na, na.rm = TRUE)
print(result_na_removed)
This will output: 30
Much better! By setting na.rm = TRUE, we're telling R to remove NA values before calculating the mean.
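One way to picture what na.rm = TRUE does is to filter out the missing values yourself and then take the mean of what is left. This is a sketch of the idea, not a description of R's internals; clean_numbers and manual_result are illustrative names.

```r
numbers_with_na <- c(10, 20, NA, 40, 50)

# Keep only the elements that are not NA
clean_numbers <- numbers_with_na[!is.na(numbers_with_na)]
print(clean_numbers)     # 10 20 40 50

# The mean now divides by 4 (the non-missing count), not 5
manual_result <- mean(clean_numbers)
print(manual_result)     # 30
```

Notice that the missing value is dropped entirely rather than treated as zero, so it does not pull the average down.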
Applying Trim Option
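Sometimes, we want to exclude extreme values from our mean calculation. That's where the trim option comes in handy. It tells R to drop a fraction of the values from each end of the sorted data before calculating the mean.
numbers <- c(1, 2, 3, 4, 5, 100) # Note the outlier 100
result_trimmed <- mean(numbers, trim = 0.2)
print(result_trimmed)
This will output: 3.5
By setting trim = 0.2, we're asking R to remove 20% of the data from each end. R rounds the number of values to drop down to a whole number, so with six values it drops one from each end (the 1 and the 100) and averages 2, 3, 4 and 5. A smaller setting such as trim = 0.1 would drop nothing from this short vector, and the result would be the ordinary mean of about 19.17. Trimming helps reduce the impact of outliers.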
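To make the trimming step concrete, here is a rough hand-rolled version. The helper manual_trimmed_mean() and the scores data are hypothetical, written only for this illustration, and the sketch assumes trim is below 0.5 and the vector has no NA values. It drops floor(n * trim) values from each end of the sorted data, which is how mean() decides how many values to discard.

```r
# Hypothetical helper for illustration: a hand-rolled trimmed mean
# (assumes 0 <= trim < 0.5 and no NA values in x)
manual_trimmed_mean <- function(x, trim) {
  n <- length(x)
  k <- floor(n * trim)               # number of values dropped from EACH end
  sorted <- sort(x)
  kept <- sorted[(k + 1):(n - k)]    # the middle values that survive trimming
  mean(kept)
}

scores <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 500)  # made-up data; 500 is an outlier

print(mean(scores))                       # 59
print(manual_trimmed_mean(scores, 0.1))   # 11
print(mean(scores, trim = 0.1))           # 11
```

With ten values and trim = 0.1, exactly one value is removed from each end (the 2 and the 500), so the outlier no longer dominates the result.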
Median
The median is the middle value when a dataset is ordered from least to greatest. It's less affected by outliers than the mean.
numbers <- c(1, 3, 5, 7, 9, 11, 13)
result_median <- median(numbers)
print(result_median)
This will output: 7
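A quick way to see why the median is less affected by outliers is to add one extreme value to a small dataset and compare both measures before and after. The incomes values below are made up for this sketch.

```r
incomes <- c(30, 32, 35, 38, 40)            # made-up values
incomes_with_outlier <- c(incomes, 1000)    # add one extreme value

print(mean(incomes))                   # 35
print(median(incomes))                 # 35

print(mean(incomes_with_outlier))      # jumps to about 195.83
print(median(incomes_with_outlier))    # 36.5
```

A single extreme value drags the mean far away from the typical values, while the median barely moves.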
The median() function works similarly to mean(). It also has an na.rm option for handling NA values:
numbers_with_na <- c(1, 3, NA, 7, 9, 11, 13)
result_median_na <- median(numbers_with_na, na.rm = TRUE)
print(result_median_na)
This will output: 8
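You may notice that 8 is not one of the values in the vector. Once the NA is removed, six values remain, and with an even count the median is the average of the two middle values. Here is a minimal sketch of that rule; the names values, sorted and manual_median are illustrative.

```r
values <- c(1, 3, 7, 9, 11, 13)   # the data left after dropping the NA

sorted <- sort(values)
n <- length(sorted)

if (n %% 2 == 1) {
  # Odd count: take the single middle element
  manual_median <- sorted[(n + 1) / 2]
} else {
  # Even count: average the two middle elements
  manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
}

print(manual_median)   # 8, the average of 7 and 9
```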
Mode
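Interestingly, R doesn't have a built-in function for the statistical mode (the most frequent value). There is a function called mode(), but it reports how an object is stored (for example, "numeric"), not its most common value. But don't worry! We can create our own function: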
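get_mode <- function(v) {
  uniqv <- unique(v)  # distinct values, in order of first appearance
  # Count how often each distinct value occurs and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}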
numbers <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
result_mode <- get_mode(numbers)
print(result_mode)
This will output: 3
Let's break down this custom function:
- unique(v) gets the unique values in the vector.
- match(v, uniqv) gives, for each element of v, the position of its value in uniqv.
- tabulate() counts how many times each position occurs.
- which.max() finds the position of the highest count (the first one, if there is a tie).
- We return the value at that position.
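One thing to keep in mind: when two values are tied for most frequent, which.max() keeps only the first of them. If you would rather see every tied value, a small variation is sketched below. The helper name get_all_modes() is hypothetical, and the sketch assumes a non-empty vector.

```r
# Hypothetical helper for illustration: return every value tied for most frequent
# (assumes v is not empty)
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  uniqv[counts == max(counts)]
}

tied <- c(1, 2, 2, 3, 3, 4)

# Step by step, to show what which.max() does with a tie
uniqv <- unique(tied)
counts <- tabulate(match(tied, uniqv))
print(counts)                      # 1 2 2 1
print(uniqv[which.max(counts)])    # 2, only the first maximum is kept

print(get_all_modes(tied))         # 2 3
```

Whether to report one mode or all of them is a design choice; returning a single value is simpler, but it can hide the fact that the data has more than one peak.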
Summary of Functions
Here's a handy table summarizing the functions we've learned:
| Measure | Function | Options |
|---|---|---|
| Mean | mean() | na.rm, trim |
| Median | median() | na.rm |
| Mode | Custom function | N/A |
Remember, practice makes perfect! Try these functions with different datasets and explore how changing the options affects the results.
As we wrap up, I'm reminded of a story from my early days of learning R. I once spent hours trying to calculate the mean of a dataset, only to realize I had forgotten to remove the NA values. Don't be like me – always check your data and use na.rm = TRUE when needed!
Happy coding, and may your statistical adventures in R be filled with insights and aha moments!