R - Logistic Regression

Create Regression Model

Welcome to the world of logistic regression in R! This tutorial walks you through the basics of building a logistic regression model, starting with the core concepts and then moving on to the code. It is written for beginners with little or no programming experience, so don't worry if some steps feel unfamiliar at first. Let's get started!

What is Logistic Regression?

Logistic regression is a statistical method that models the probability of a categorical outcome as a function of one or more predictor variables. It's most often used for binary classification problems, where the outcome is either "yes" (1) or "no" (0). The key difference from linear regression is that linear regression predicts a continuous value directly, while logistic regression passes a linear combination of the predictors through the logistic (sigmoid) function, so its output is always a probability between 0 and 1.
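
To make that concrete, here is a minimal sketch of how a probability is produced: the model computes a linear score (the log-odds) and maps it through the logistic function. The helper name logistic and the example scores are illustrative only; plogis() is the equivalent function in base R.

```r
# Logistic (sigmoid) function: maps any real-valued log-odds to a value in (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# A few example log-odds scores (illustrative values)
log_odds <- c(-4, -1, 0, 1, 4)
probs <- logistic(log_odds)
print(round(probs, 3))
# [1] 0.018 0.269 0.500 0.731 0.982

# Base R's plogis() computes the same transformation
print(all.equal(probs, plogis(log_odds)))
```

A score of 0 corresponds to a probability of 0.5, and large positive or negative scores approach 1 or 0 without ever reaching them.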

Why Use Logistic Regression?

Logistic regression is widely used in various fields, including healthcare, finance, marketing, and social sciences. It's particularly useful when you want to understand the relationship between a binary outcome and one or more predictor variables. For example, you might use logistic regression to predict whether a customer will buy a product based on their age, income, and past purchase history.

Creating a Logistic Regression Model in R

To create a logistic regression model in R, we'll use the glm() function (short for generalized linear model), which is part of the stats package that ships with R. Here's a step-by-step guide:

Step 1: Install and Load the Necessary Libraries

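First, install and load the packages used in this tutorial. The tidyverse package provides the data manipulation tools (such as mutate() and the %>% pipe), and caret provides createDataPartition() for splitting the data. The model itself is fitted with glm(), which comes with base R. You only need to run install.packages() once per machine, but library() must be called in every new R session.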

install.packages("tidyverse")
install.packages("caret")

library(tidyverse)
library(caret)

Step 2: Load the Data

Next, let's load a dataset. For this example, we'll use the built-in mtcars dataset, which contains measurements such as weight, horsepower, and fuel consumption for 32 car models. The dataset has no "sports car" label, so we'll create one in the next step and then predict it from weight (wt) and horsepower (hp).

data(mtcars)
head(mtcars)

Step 3: Preprocess the Data

Before building a logistic regression model, the data usually needs some preprocessing, such as encoding categorical variables, handling missing values, and sometimes scaling features. The mtcars columns are all numeric and contain no missing values, so none of that is needed here. We do need an outcome variable, though, so we'll create a binary variable called is_sports_car, using horsepower above 150 as a simple stand-in for "sports car" (1 if hp > 150, 0 otherwise).

mtcars <- mtcars %>%
  mutate(is_sports_car = ifelse(hp > 150, 1, 0))
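
Before splitting the data, it can help to check how many cars fall into each class, since a heavily imbalanced outcome makes accuracy harder to interpret. The sketch below uses base R only and works on a copy named cars (an illustrative name) so that it runs on its own.

```r
# Rebuild the outcome on a copy of mtcars so this snippet is self-contained
data(mtcars)
cars <- mtcars
cars$is_sports_car <- ifelse(cars$hp > 150, 1, 0)

# Count the cars in each class and show the class proportions
class_counts <- table(cars$is_sports_car)
print(class_counts)
print(round(prop.table(class_counts), 2))
```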

Step 4: Split the Data into Training and Test Sets

It's important to split the data into training and test sets to evaluate the performance of our model. We'll use the createDataPartition() function from the caret package to create a partition.

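set.seed(123)  # make the random split reproducible
# Pass the outcome as a factor so the split is stratified by class
trainIndex <- createDataPartition(factor(mtcars$is_sports_car), p = 0.8, list = FALSE)
trainSet <- mtcars[trainIndex, ]
testSet <- mtcars[-trainIndex, ]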
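
If caret is not available, a similar class-stratified split can be sketched in base R. This is a swapped-in alternative for illustration, not a reproduction of what createDataPartition() does internally, and the names cars, train_idx, train_set, and test_set are assumptions of this sketch.

```r
# Rebuild the outcome on a copy of mtcars so this snippet is self-contained
data(mtcars)
cars <- mtcars
cars$is_sports_car <- ifelse(cars$hp > 150, 1, 0)

set.seed(123)

# Group row indices by class, then sample about 80% of the rows within each class
rows_by_class <- split(seq_len(nrow(cars)), cars$is_sports_car)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

train_set <- cars[train_idx, ]
test_set <- cars[-train_idx, ]

# Sanity checks: sizes of each set and class counts in the training set
cat("Training rows:", nrow(train_set), " Test rows:", nrow(test_set), "\n")
print(table(train_set$is_sports_car))
```

Sampling within each class keeps the proportion of 0s and 1s roughly the same in both sets, which matters with a dataset this small.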

Step 5: Build the Logistic Regression Model

Now we're ready to build our logistic regression model. We'll use the glm() function with the family argument set to binomial to specify that we want to perform logistic regression.

model <- glm(is_sports_car ~ wt + hp, data = trainSet, family = binomial)
summary(model)

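The summary() function reports the estimated coefficients (on the log-odds scale) along with their standard errors, z-values, and p-values. These help us judge the direction of each predictor's association with the outcome and whether it is statistically distinguishable from zero. One caveat for this particular example: because is_sports_car was defined directly from hp, hp separates the two classes perfectly. As a result, glm() is likely to warn that fitted probabilities are numerically 0 or 1, and the coefficients and standard errors should not be taken at face value.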
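
Coefficients on the log-odds scale are easier to read once exponentiated into odds ratios. The sketch below is a hedged illustration: because is_sports_car is defined from hp, it fits the model on weight alone (an assumption of this example, not the tutorial's model) so that the estimates are finite. The names cars and wt_model are illustrative.

```r
# Rebuild the outcome on a copy of mtcars so this snippet is self-contained
data(mtcars)
cars <- mtcars
cars$is_sports_car <- ifelse(cars$hp > 150, 1, 0)

# Fit on weight only so the two classes are not perfectly separated
wt_model <- glm(is_sports_car ~ wt, data = cars, family = binomial)

# Coefficients are log-odds; exponentiating gives odds ratios
odds_ratios <- exp(coef(wt_model))
print(odds_ratios)

# Approximate 95% Wald confidence intervals, also on the odds-ratio scale
print(exp(confint.default(wt_model)))
```

An odds ratio above 1 for wt means that heavier cars have higher odds of being labelled 1 under this definition; since wt is measured in units of 1000 lbs, the ratio describes the change in odds per additional 1000 lbs.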

Step 6: Make Predictions and Evaluate the Model

Once we have our model, we can use it to make predictions on the test set and evaluate its performance. We'll use the predict() function to generate predicted probabilities and then convert them to binary outcomes using a threshold of 0.5.

predictions <- predict(model, newdata = testSet, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

Now, let's calculate the accuracy of our model by comparing the predicted classes to the actual classes in the test set.

accuracy <- mean(predicted_classes == testSet$is_sports_car) * 100
cat("Accuracy:", accuracy, "%")
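
Accuracy alone does not show which kind of mistakes a model makes, so a confusion matrix is a useful companion. The following is a minimal base R sketch under stated assumptions: it uses a simple random 80/20 split and a weight-only model so that it runs without caret, and the names cars, fit, and confusion are illustrative.

```r
# Rebuild the outcome on a copy of mtcars so this snippet is self-contained
data(mtcars)
cars <- mtcars
cars$is_sports_car <- ifelse(cars$hp > 150, 1, 0)

# Simple random 80/20 split (not stratified; an assumption of this sketch)
set.seed(123)
train_idx <- sample(nrow(cars), size = round(0.8 * nrow(cars)))
train_set <- cars[train_idx, ]
test_set <- cars[-train_idx, ]

# Fit on weight only and predict class labels with a 0.5 threshold
fit <- glm(is_sports_car ~ wt, data = train_set, family = binomial)
probs <- predict(fit, newdata = test_set, type = "response")
pred <- ifelse(probs > 0.5, 1, 0)

# Cross-tabulate predictions against the true labels; fixed levels keep it 2 x 2
confusion <- table(
  Predicted = factor(pred, levels = c(0, 1)),
  Actual = factor(test_set$is_sports_car, levels = c(0, 1))
)
print(confusion)

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")
```

The off-diagonal cells separate false positives from false negatives, which a single accuracy figure cannot do.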

And there you have it! You've created a logistic regression model in R using the glm() function. Keep in mind that this is a basic example: the test set contains only a handful of cars, and the outcome was derived from one of the predictors, so the accuracy figure says little about real-world performance. There are also many other factors to consider when building and evaluating a logistic regression model, such as feature selection, regularization, and model tuning. Still, this should give you a good starting point for your journey into the world of logistic regression in R.

