Linear and Logistic Regression
Linear Regression and Logistic Regression are two fundamental supervised learning algorithms in machine learning. Despite the shared name, they serve different purposes: linear regression predicts a continuous value, while logistic regression predicts the probability of a class and is used for classification. This guide covers both, with examples.
What is Linear Regression?
Linear Regression is a supervised learning algorithm used to predict a continuous dependent variable (target) based on one or more independent variables (features). The model assumes a linear relationship between the dependent variable and the independent variables.
Equation of Linear Regression
The equation for simple linear regression is:
y = β₀ + β₁x
Where:
- y is the predicted value
- x is the independent variable
- β₀ is the y-intercept (bias term)
- β₁ is the slope of the line (coefficient)
For multiple linear regression with multiple features, the equation becomes:
y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ
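To make the multiple-regression equation concrete, here is a minimal NumPy sketch that recovers the coefficients by least squares. The data and the "true" coefficients are made up for illustration, and the targets are noise-free, so the estimate should match them closely.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 2.0]])
true_beta = np.array([0.5, 2.0, -1.0])  # [β₀, β₁, β₂], chosen for illustration

# Prepend a column of ones so β₀ is estimated like any other coefficient
X_design = np.column_stack([np.ones(len(X)), X])

# Noise-free targets generated from y = β₀ + β₁x₁ + β₂x₂
y = X_design @ true_beta

# Solve the least-squares problem: minimize ||X_design @ β - y||²
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated coefficients:", beta_hat)
```

With real, noisy data the estimates would only approximate the underlying coefficients.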
Key Assumptions in Linear Regression:
- Linearity: The relationship between the independent variables and the dependent variable is linear
- Independence: Observations are independent of each other
- Homoscedasticity: Constant variance of errors (residuals)
- Normality of Errors: Errors follow a normal distribution
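Several of these assumptions concern the residuals, so a first informal check is simply to compute them. The sketch below fits a line to the five points from this guide's linear regression example using `np.polyfit` (a stand-in for a full model) and prints the residuals. Five points are far too few to judge homoscedasticity or normality; this only shows the mechanics.

```python
import numpy as np

# The five points from this guide's linear regression example
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 2.8, 4.2, 5.1])

# Fit a straight line by least squares; for degree 1, polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Residuals are the observed values minus the fitted values
residuals = y - (intercept + slope * x)

print("Slope:", slope, "Intercept:", intercept)
print("Residuals:", residuals)
# With an intercept in the model, least-squares residuals average to (numerically) zero
print("Mean residual:", residuals.mean())
```

On a realistic dataset, one would typically plot the residuals against the fitted values and look for funnels or curves rather than inspect the numbers directly.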
Example: Linear Regression in Python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Example DataFrame
data = {'Feature': [1, 2, 3, 4, 5], 'Target': [1, 2, 2.8, 4.2, 5.1]}
df = pd.DataFrame(data)
# Split the data into features and target
X = df[['Feature']]
y = df['Target']
# Split into training and test sets (with only 5 rows, test_size=0.2 holds out a single sample)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
The fitted model predicts a continuous target. In a real application, this might be a house price or a stock price.
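To connect the fitted model back to the equation y = β₀ + β₁x, the sketch below reads the learned slope and intercept from scikit-learn's `coef_` and `intercept_` attributes. It fits on all five rows rather than a train/test split, purely to keep the numbers easy to follow, and the input value 6 is a hypothetical unseen point.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same toy data as the example above, fitted on all rows for illustration
df = pd.DataFrame({'Feature': [1, 2, 3, 4, 5], 'Target': [1, 2, 2.8, 4.2, 5.1]})
model = LinearRegression()
model.fit(df[['Feature']], df['Target'])

slope = model.coef_[0]        # β₁
intercept = model.intercept_  # β₀
print("Slope:", slope, "Intercept:", intercept)

# Predict for a hypothetical unseen input; this equals intercept + slope * 6
new_point = pd.DataFrame({'Feature': [6]})
prediction = model.predict(new_point)[0]
print("Prediction for Feature=6:", prediction)
```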
What is Logistic Regression?
Logistic Regression is a statistical method for predicting a binary outcome (dependent variable) based on one or more independent variables. Despite its name, it is used for classification rather than regression: the target variable is categorical (usually binary).
Logistic Regression applies the logistic function (also known as the sigmoid function) to a linear combination of the features, the same form used in linear regression, transforming it into a probability between 0 and 1.
Sigmoid Function in Logistic Regression
The sigmoid function used in logistic regression is:
sigmoid(z) = 1 / (1 + e⁻ᶻ)
Where:
- z is the output of the linear equation: z = β₀ + β₁x₁ + ⋯ + βₙxₙ
The result is a value between 0 and 1, representing the probability that the input belongs to the positive class (class 1).
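The sigmoid is short enough to write directly. The sketch below implements it with NumPy and applies it to a one-feature linear equation. The coefficients `beta_0` and `beta_1` are hypothetical values picked so that z takes the values -2, 0, and 2.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model, chosen for illustration
beta_0, beta_1 = -3.0, 1.0
x = np.array([1.0, 3.0, 5.0])

z = beta_0 + beta_1 * x      # linear part: [-2, 0, 2]
probabilities = sigmoid(z)   # squashed into (0, 1)
print(probabilities)
```

Note that z = 0 maps to exactly 0.5, which is why the sign of z decides the predicted class under the usual 0.5 threshold.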
Example: Logistic Regression in Python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Example DataFrame
data = {'Feature': [1, 2, 3, 4, 5], 'Target': [0, 0, 1, 1, 1]} # Binary classification
df = pd.DataFrame(data)
# Split the data into features and target
X = df[['Feature']]
y = df['Target']
# Split into training and test sets (with only 5 rows, test_size=0.2 holds out a single sample)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Logistic Regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
The model estimates the probability of class membership. Calling predict() returns the class label (0 or 1, spam or not spam, etc.) by thresholding that probability at 0.5, while predict_proba() returns the probabilities themselves.
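To see the probabilities behind the labels, the sketch below calls scikit-learn's `predict_proba` on the same toy data. It fits on all five rows for simplicity, which is an assumption made for illustration and not a recommended evaluation practice.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same toy data as the example above, fitted on all rows for illustration
df = pd.DataFrame({'Feature': [1, 2, 3, 4, 5], 'Target': [0, 0, 1, 1, 1]})
X = df[['Feature']]
y = df['Target']

model = LogisticRegression()
model.fit(X, y)

# One row per sample; columns follow model.classes_, here [P(class 0), P(class 1)]
proba = model.predict_proba(X)
labels = model.predict(X)

for feature, p1, label in zip(df['Feature'], proba[:, 1], labels):
    print(f"Feature={feature}  P(class 1)={p1:.3f}  predicted label={label}")
```

Each row of `proba` sums to 1, and the predicted label is 1 exactly where P(class 1) exceeds 0.5.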
Key Differences Between Linear and Logistic Regression
Aspect | Linear Regression | Logistic Regression |
---|---|---|
Purpose | Predicts continuous values | Predicts categorical values (binary outcomes) |
Output | Continuous (e.g., house price, salary) | Probability (transformed to binary class) |
Model Function | Linear equation (y = β₀ + β₁x) | Logistic function (sigmoid of linear equation) |
Loss Function | Mean Squared Error (MSE) | Log-Loss or Binary Cross-Entropy |
Use Cases | Regression tasks (e.g., predicting prices, growth) | Classification tasks (e.g., spam detection) |
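The two loss functions in the table can each be computed in a line of NumPy. The sketch below uses small hypothetical arrays of true values and predictions; the variable names are illustrative.

```python
import numpy as np

# Mean Squared Error on hypothetical continuous targets
y_true_cont = np.array([3.0, 5.0, 7.0])
y_pred_cont = np.array([2.5, 5.0, 8.0])
mse = np.mean((y_true_cont - y_pred_cont) ** 2)

# Binary cross-entropy (log-loss) on hypothetical labels and predicted P(class 1)
y_true_bin = np.array([1, 0, 1])
p_pred = np.array([0.9, 0.2, 0.6])
bce = -np.mean(y_true_bin * np.log(p_pred)
               + (1 - y_true_bin) * np.log(1 - p_pred))

print("MSE:", mse)
print("Binary cross-entropy:", bce)
```

Log-loss penalizes confident wrong probabilities heavily, which is why it suits a model whose output is a probability.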
Model Evaluation
Linear Regression Evaluation Metrics:
- Mean Squared Error (MSE): Measures the average squared difference between actual and predicted values
- R-squared (R²): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables
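# R² is undefined for fewer than two test samples, so use a larger test set than the 5-row example above
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)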
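As a sanity check on what R² measures, the sketch below computes it by hand as 1 − SS_res / SS_tot and compares the result with `r2_score`. The arrays are hypothetical values, not output from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print("Manual R-squared:", r2_manual)
print("sklearn R-squared:", r2_score(y_true, y_hat))
```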
Logistic Regression Evaluation Metrics:
- Accuracy: Percentage of correct predictions
- Precision: The proportion of positive predictions that are actually positive
- Recall: The proportion of actual positives correctly identified
- F1-Score: The harmonic mean of precision and recall
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
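The metrics in the report all derive from four counts: true positives, false positives, false negatives, and true negatives. The sketch below computes them by hand on hypothetical labels, so the definitions above can be checked term by term.

```python
import numpy as np

# Hypothetical true labels and predictions for 8 samples
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 1, 1, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_hat == 1))  # true positives
fp = np.sum((y_true == 0) & (y_hat == 1))  # false positives
fn = np.sum((y_true == 1) & (y_hat == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_hat == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
```

Here precision and recall differ because the hypothetical predictions contain more false positives than false negatives.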
When to Use Linear vs. Logistic Regression?
Use Linear Regression when:
- The output variable is continuous
- You are predicting values such as price, salary, or height
Use Logistic Regression when:
- The output variable is categorical, particularly for binary classification tasks
- You need to predict whether an email is spam, whether a patient has a disease, or classify any other binary outcomes
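As a rough aid for this decision, scikit-learn's `type_of_target` utility reports how it would interpret a target array. The sketch below runs it on two hypothetical targets. Treat it as a heuristic only: whole-number targets such as prices rounded to the nearest unit are reported as multiclass, so the final choice still depends on what the values mean.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets: salaries (continuous) and spam flags (binary)
salaries = [52000.5, 61000.0, 48000.25]
spam_flags = [0, 1, 1, 0]

salary_kind = type_of_target(salaries)  # suggests a regression problem
spam_kind = type_of_target(spam_flags)  # suggests a binary classification problem

print("Salaries:", salary_kind)
print("Spam flags:", spam_kind)
```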
Conclusion
Linear Regression is well suited to predicting continuous outcomes, while Logistic Regression is suited to binary classification problems. Knowing which kind of output you need to predict, and which metrics fit that output, is key to building effective machine learning models.