R - Decision Tree: A Beginner's Guide

Hello there, future data scientists! Today, we're going to embark on an exciting journey into the world of decision trees using R. Don't worry if you've never coded before – I'll be your friendly guide every step of the way. By the end of this tutorial, you'll be creating your own decision trees and feeling like a real data wizard!


What is a Decision Tree?

Before we dive into the code, let's understand what a decision tree is. Imagine you're trying to decide whether to go for a run or not. You might ask yourself:

  1. Is it raining?
  2. Do I have enough time?
  3. Am I feeling energetic?

Based on your answers, you make a decision. That's essentially what a decision tree does – it makes decisions based on a series of questions!
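The run-or-not example above can be sketched as a small hand-written function of nested conditions. Every variable name and threshold here is made up for illustration; this is what a learned tree does automatically:

```r
# A hand-written "decision tree" for the running example.
# All variables and thresholds are illustrative, not learned from data.
decide_to_run <- function(is_raining, free_minutes, energy_level) {
  if (is_raining) {
    return("Stay home")        # question 1: is it raining?
  }
  if (free_minutes < 30) {
    return("Stay home")        # question 2: enough time?
  }
  if (energy_level >= 5) {
    return("Go for a run")     # question 3: feeling energetic?
  }
  "Stay home"
}

decide_to_run(is_raining = FALSE, free_minutes = 45, energy_level = 7)
```

A real decision tree learns which questions to ask, and in what order, from the data instead of having them written by hand.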

Installing the Necessary R Packages

First things first, we need to equip ourselves with the right tools. In R, these tools are called packages. For our decision tree adventure, we'll need two main packages: rpart and rpart.plot.

Let's install them:

install.packages("rpart")
install.packages("rpart.plot")

Now, let's load these packages:

library(rpart)
library(rpart.plot)

Great job! You've just taken your first steps in R programming. Pat yourself on the back!

Creating a Simple Dataset

Now that we have our tools ready, let's create a simple dataset to work with. Imagine we're trying to predict whether someone will buy ice cream based on the temperature and whether it's a weekend.

# Create a data frame
ice_cream_data <- data.frame(
  temperature = c(68, 85, 72, 90, 60, 78, 82, 75, 68, 71),
  is_weekend = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  buy_icecream = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0)
)

# View the data
print(ice_cream_data)

In this dataset:

  • temperature is in Fahrenheit
  • is_weekend is 1 for weekend, 0 for weekday
  • buy_icecream is 1 if they bought ice cream, 0 if they didn't

Building Our First Decision Tree

Now for the exciting part – let's build our decision tree!

# Create the decision tree model
ice_cream_tree <- rpart(buy_icecream ~ temperature + is_weekend,
                        data = ice_cream_data,
                        method = "class",
                        control = rpart.control(minsplit = 2))

# Plot the tree
rpart.plot(ice_cream_tree, extra = 106)

Let's break down what's happening here:

  1. rpart() is the function we use to create the decision tree.
  2. buy_icecream ~ temperature + is_weekend tells R that we want to predict buy_icecream based on temperature and is_weekend.
  3. data = ice_cream_data specifies our dataset.
  4. method = "class" tells R we're doing a classification task (predicting a category).
  5. control = rpart.control(minsplit = 2) lowers the minimum number of observations a node needs before rpart will try to split it. The default is 20, so without this setting our 10-row dataset would produce a tree with no splits at all.
  6. rpart.plot() creates a visual representation of our tree; extra = 106 adds the class probability and the percentage of observations to each node.

When you run this code, you'll see a beautiful tree diagram. Each node shows a decision rule, and the leaves show the predictions. It's like a flowchart of ice cream decisions!

Understanding the Tree

Let's interpret our ice cream decision tree:

  1. The top node (root) shows the first split. With our toy data this will most likely be on is_weekend, because in every row buy_icecream matches is_weekend exactly.
  2. If true (yes), it goes to the left branch; if false (no), it goes to the right.
  3. This process continues until it reaches a leaf node, which gives the final prediction.

The numbers in each node (with extra = 106) represent:

  • The predicted class (0 or 1)
  • The probability of the second class (buying ice cream)
  • The percentage of observations that fall into that node
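If the diagram is ever hard to read, the same tree can be inspected as text. This sketch assumes the ice_cream_tree model built above is still in your session; print() comes with rpart, and rpart.rules() is provided by the rpart.plot package:

```r
# Assumes the ice_cream_tree model from the previous section.

# The splits as indented text: each line shows the split rule,
# the number of observations, and the predicted class at that node
print(ice_cream_tree)

# rpart.plot also ships rpart.rules(), which rewrites the tree
# as one plain-English rule per leaf
rpart.rules(ice_cream_tree)
```

Reading the rules side by side with the plot is a good way to check that you are interpreting the diagram correctly.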

Making Predictions

Now that we have our tree, let's use it to make some predictions!

# Create new data
new_data <- data.frame(
  temperature = c(70, 95),
  is_weekend = c(1, 0)
)

# Make predictions
predictions <- predict(ice_cream_tree, new_data, type = "class")

# View predictions
print(predictions)

This code predicts whether someone will buy ice cream on a 70°F weekend and a 95°F weekday.
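predict() is not limited to hard labels. With type = "prob" it returns the probability of each class instead, which tells you how confident the tree is. This reuses the ice_cream_tree model and the new_data frame from above:

```r
# Probability of each class (columns "0" and "1") for the new rows
prob_predictions <- predict(ice_cream_tree, new_data, type = "prob")
print(prob_predictions)
```

With such a tiny training set these probabilities will be extreme (close to 0 or 1); on larger, noisier data they are much more informative.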

Evaluating the Model

To see how well our model performs, we can use a confusion matrix:

# Make predictions on our original data
predictions <- predict(ice_cream_tree, ice_cream_data, type = "class")

# Create confusion matrix
confusion_matrix <- table(Actual = ice_cream_data$buy_icecream, Predicted = predictions)

# View confusion matrix
print(confusion_matrix)

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))

This gives us a quick view of how many predictions were correct and incorrect. Keep in mind that we're scoring the model on the same data it was trained on, so this accuracy is optimistic; with real data you would hold out a separate test set.
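On bigger datasets you'll also want to guard against overfitting. rpart cross-validates internally while it grows the tree, and printcp() shows the resulting complexity-parameter (cp) table; prune() then cuts the tree back at a cp value you choose. A quick sketch, assuming the ice_cream_tree model from above (the cp value 0.05 here is just an illustration, not a recommendation):

```r
# Cross-validated error for each candidate tree size
printcp(ice_cream_tree)

# Cut the tree back at a chosen complexity parameter
# (in practice, pick the cp with the lowest cross-validated
# error, shown in the xerror column)
pruned_tree <- prune(ice_cream_tree, cp = 0.05)
rpart.plot(pruned_tree)
```

Our 10-row example is too small for pruning to matter, but on real data this step often makes the tree both simpler and more accurate on new observations.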

Conclusion

Congratulations! You've just built your first decision tree in R. From installing packages to making predictions, you've covered a lot of ground. Remember, practice makes perfect, so don't be afraid to experiment with different datasets and parameters.

Here's a quick recap of the methods we've used:

Method              Description
------              -----------
install.packages()  Installs R packages
library()           Loads installed packages
data.frame()        Creates a data frame
rpart()             Builds a decision tree
rpart.plot()        Visualizes the decision tree
predict()           Makes predictions using the tree
table()             Creates a confusion matrix

Keep exploring, keep learning, and most importantly, have fun with data science!
