R - Decision Tree: A Beginner's Guide

Hello there, future data scientists! Today, we're going to embark on an exciting journey into the world of decision trees using R. Don't worry if you've never coded before – I'll be your friendly guide every step of the way. By the end of this tutorial, you'll be creating your own decision trees and feeling like a real data wizard!

What is a Decision Tree?

Before we dive into the code, let's understand what a decision tree is. Imagine you're trying to decide whether to go for a run or not. You might ask yourself:

Is it raining?
Do I have enough time?
Am I feeling energetic?

Based on your answers, you make a decision. That's essentially what a decision tree does – it makes decisions based on a series of questions!
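If it helps to see that idea as code, here is a minimal sketch of the running example written as a chain of questions. The function name should_run() and the order of the questions are made up for illustration; a real decision tree learns its questions from data instead of having them typed in by hand.

```r
# Hypothetical helper: the "go for a run?" decision as a series of questions.
should_run <- function(raining, has_time, energetic) {
  if (raining) {
    return("stay home")   # question 1: is it raining?
  }
  if (!has_time) {
    return("stay home")   # question 2: do I have enough time?
  }
  # question 3: am I feeling energetic?
  if (energetic) "go for a run" else "stay home"
}

print(should_run(raining = FALSE, has_time = TRUE, energetic = TRUE))
# [1] "go for a run"
print(should_run(raining = TRUE, has_time = TRUE, energetic = TRUE))
# [1] "stay home"
```

Each if is one question, and each returned value is a final decision, which is what the leaves of a tree hold.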

Installing the Necessary R Packages

First things first, we need to equip ourselves with the right tools. In R, these tools are called packages. For our decision tree adventure, we'll need two main packages: rpart and rpart.plot.

Let's install them:

install.packages("rpart")
install.packages("rpart.plot")

Now, let's load these packages:

library(rpart)
library(rpart.plot)

Great job! You've just taken your first steps in R programming. Pat yourself on the back!

Creating a Simple Dataset

Now that we have our tools ready, let's create a simple dataset to work with. Imagine we're trying to predict whether someone will buy ice cream based on the temperature and whether it's a weekend.

# Create a data frame
ice_cream_data <- data.frame(
  temperature = c(68, 85, 72, 90, 60, 78, 82, 75, 68, 71),
  is_weekend = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  buy_icecream = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# View the data
print(ice_cream_data)

In this dataset:

temperature is in Fahrenheit
is_weekend is 1 for weekend, 0 for weekday
buy_icecream is 1 if they bought ice cream, 0 if they didn't
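Before modelling, it is usually worth taking a quick look at how the columns relate to each other. The sketch below uses only base R functions (str(), table() and tapply()) on the same data frame; the variable names weekend_vs_purchase and avg_temp are just illustrative.

```r
# Same dataset as above, repeated so this snippet runs on its own
ice_cream_data <- data.frame(
  temperature = c(68, 85, 72, 90, 60, 78, 82, 75, 68, 71),
  is_weekend = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  buy_icecream = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# Show column names, types and the first few values
str(ice_cream_data)

# Count how often each weekend/purchase combination occurs
weekend_vs_purchase <- table(
  is_weekend = ice_cream_data$is_weekend,
  buy_icecream = ice_cream_data$buy_icecream
)
print(weekend_vs_purchase)

# Average temperature for non-buyers (0) and buyers (1)
avg_temp <- tapply(ice_cream_data$temperature, ice_cream_data$buy_icecream, mean)
print(avg_temp)
#    0    1
# 69.2 80.6
```

In this toy dataset every weekend row is a purchase and every weekday row is not, so is_weekend alone already separates the two groups. Real data is rarely this tidy.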

Building Our First Decision Tree

Now for the exciting part – let's build our decision tree!

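# Create the decision tree model.
# By default rpart() only splits nodes with at least 20 rows (minsplit = 20),
# so with just 10 rows we lower the limits to let the tree grow.
ice_cream_tree <- rpart(buy_icecream ~ temperature + is_weekend,
                        data = ice_cream_data,
                        method = "class",
                        control = rpart.control(minsplit = 2, minbucket = 1))

# Plot the tree
rpart.plot(ice_cream_tree, extra = 106)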

Let's break down what's happening here:

rpart() is the function we use to create the decision tree.
buy_icecream ~ temperature + is_weekend tells R that we want to predict buy_icecream based on temperature and is_weekend.
data = ice_cream_data specifies our dataset.
method = "class" tells R we're doing a classification task (predicting a category).
rpart.plot() creates a visual representation of our tree.
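One thing worth checking with a dataset this small: by default rpart() will not try to split a node that has fewer than 20 rows (minsplit = 20), so ten rows need a control = rpart.control(...) argument before any split can happen. The sketch below compares a default fit with one that lowers those limits. The values 2 and 1 and the helper count_leaves() are illustrative choices, not recommendations for real data.

```r
library(rpart)

ice_cream_data <- data.frame(
  temperature = c(68, 85, 72, 90, 60, 78, 82, 75, 68, 71),
  is_weekend = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  buy_icecream = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# Fit with the default settings (minsplit = 20)
default_tree <- rpart(buy_icecream ~ temperature + is_weekend,
                      data = ice_cream_data,
                      method = "class")

# Fit again, allowing splits on very small nodes
small_data_tree <- rpart(buy_icecream ~ temperature + is_weekend,
                         data = ice_cream_data,
                         method = "class",
                         control = rpart.control(minsplit = 2, minbucket = 1))

# Hypothetical helper: tree$frame has one row per node,
# and leaf nodes are marked with var == "<leaf>"
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

print(count_leaves(default_tree))     # 1: the root was never split
print(count_leaves(small_data_tree))  # 2: one split, two leaves

# Text version of the fitted tree
print(small_data_tree)
```

Settings this permissive let a tree memorize its training data, so on larger datasets the defaults are usually a safer starting point.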

When you run this code, you'll see a beautiful tree diagram. Each node shows a decision rule, and the leaves show the predictions. It's like a flowchart of ice cream decisions!

Understanding the Tree

Let's interpret our ice cream decision tree:

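The top node (root) shows the first split. In this dataset is_weekend matches buy_icecream in every row, so that is the column a split will use. On messier data you might see a rule such as "temperature < 76" instead.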
If true (yes), it goes to the left branch; if false (no), it goes to the right.
This process continues until it reaches a leaf node, which gives the final prediction.

With extra = 106, the numbers in each node represent:

The predicted class (0 or 1)
The probability of class 1 (buying ice cream) in that node
The percentage of all observations that fall in that node
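To get a feel for why a tree prefers one question over another, you can score candidate splits by hand. For classification, rpart's default measure is Gini impurity, which is 0 when a group contains only one class. The sketch below is a simplified illustration, not rpart's actual code, and the helper names gini() and split_impurity() are made up for this example.

```r
ice_cream_data <- data.frame(
  temperature = c(68, 85, 72, 90, 60, 78, 82, 75, 68, 71),
  is_weekend = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  buy_icecream = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# Gini impurity of a set of labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted average impurity of the two groups produced by a yes/no question
split_impurity <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

y <- ice_cream_data$buy_icecream

root_impurity <- gini(y)
weekend_split <- split_impurity(y, ice_cream_data$is_weekend < 0.5)
temperature_split <- split_impurity(y, ice_cream_data$temperature < 76)

print(root_impurity)      # 0.5: half buyers, half non-buyers
print(weekend_split)      # 0: both groups are pure
print(temperature_split)  # about 0.167: one buyer ends up on the cooler side
```

The lower the impurity after a split, the better the question, which is why is_weekend wins here.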

Making Predictions

Now that we have our tree, let's use it to make some predictions!

# Create new data
new_data <- data.frame(
  temperature = c(70, 95),
  is_weekend = c(1, 0)
)

# Make predictions
predictions <- predict(ice_cream_tree, new_data, type = "class")

# View predictions
print(predictions)

This code predicts whether someone will buy ice cream on a 70°F weekend and a 95°F weekday.
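If you would like to see how sure the tree is and not just the final label, predict() for rpart models also accepts type = "prob". The sketch below refits the tree so that it runs on its own. The control settings are an assumption made so that ten rows are enough for a split.

```r
library(rpart)

ice_cream_data <- data.frame(
  temperature = c(68, 85, 72, 90, 60, 78, 82, 75, 68, 71),
  is_weekend = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  buy_icecream = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# Assumption: relaxed limits so the tiny dataset can be split
ice_cream_tree <- rpart(buy_icecream ~ temperature + is_weekend,
                        data = ice_cream_data,
                        method = "class",
                        control = rpart.control(minsplit = 2, minbucket = 1))

new_data <- data.frame(
  temperature = c(70, 95),
  is_weekend = c(1, 0)
)

# One row per new observation, one column per class ("0" and "1")
class_probabilities <- predict(ice_cream_tree, new_data, type = "prob")
print(class_probabilities)

# The predicted label is the class with the highest probability
class_predictions <- predict(ice_cream_tree, new_data, type = "class")
print(class_predictions)  # 1 for the weekend row, 0 for the weekday row
```

Notice that the 95°F weekday is predicted as "no purchase": the tree learned only the weekend rule from this toy data, so it ignores temperature entirely.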

Evaluating the Model

To see how well our model performs, we can use a confusion matrix:

# Make predictions on our original data
predictions <- predict(ice_cream_tree, ice_cream_data, type = "class")

# Create confusion matrix
confusion_matrix <- table(Actual = ice_cream_data$buy_icecream, Predicted = predictions)

# View confusion matrix
print(confusion_matrix)

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))

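This gives us a quick view of how many predictions were correct and incorrect. Keep in mind that these are the same rows the tree was trained on, so the accuracy will look better than it would on new data.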
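With only ten rows there is not much data to hold back for testing. A leave-one-out loop is one possible workaround: train on nine rows, predict the remaining one, and repeat for every row. The sketch below is a minimal version of that idea. The control settings, including xval = 0 to switch off rpart's own internal cross-validation, are assumptions chosen for this tiny dataset.

```r
library(rpart)

ice_cream_data <- data.frame(
  temperature = c(68, 85, 72, 90, 60, 78, 82, 75, 68, 71),
  is_weekend = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  buy_icecream = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

n <- nrow(ice_cream_data)
loo_predictions <- character(n)

for (i in seq_len(n)) {
  # Train on every row except row i
  fit <- rpart(buy_icecream ~ temperature + is_weekend,
               data = ice_cream_data[-i, ],
               method = "class",
               control = rpart.control(minsplit = 2, minbucket = 1, xval = 0))

  # Predict the single row that was left out
  loo_predictions[i] <- as.character(
    predict(fit, ice_cream_data[i, ], type = "class")
  )
}

# Share of left-out rows that were predicted correctly
loo_accuracy <- mean(loo_predictions == as.character(ice_cream_data$buy_icecream))
print(paste("Leave-one-out accuracy:", loo_accuracy))
```

Every prediction here is made for a row the model has not seen, so the result is a fairer estimate than the training accuracy, although ten rows is still far too few to trust either number.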

Conclusion

Congratulations! You've just built your first decision tree in R. From installing packages to making predictions, you've covered a lot of ground. Remember, practice makes perfect, so don't be afraid to experiment with different datasets and parameters.

Here's a quick recap of the functions we've used:

Function	Description
install.packages()	Installs R packages
library()	Loads installed packages
data.frame()	Creates a data frame
rpart()	Builds a decision tree
rpart.plot()	Visualizes the decision tree
predict()	Makes predictions using the tree
table()	Creates a confusion matrix

Keep exploring, keep learning, and most importantly, have fun with data science!

Credits: Image by storyset
