Linear and Logistic Regression

Linear Regression and Logistic Regression are two fundamental algorithms in machine learning. Both are built on a linear combination of the input features, but they serve different purposes: linear regression predicts a continuous value, while logistic regression, despite its name, is used for classification. Here's an in-depth guide to both, with examples.

What is Linear Regression?

Linear Regression is a supervised learning algorithm used to predict a continuous dependent variable (target) based on one or more independent variables (features). The model assumes a linear relationship between the dependent variable and the independent variables.

Equation of Linear Regression

The equation for simple linear regression is:

y = β₀ + β₁x

Where:

y is the predicted value
x is the independent variable
β₀ is the y-intercept (bias term)
β₁ is the slope of the line (coefficient)
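
To make the equation concrete, the following is a minimal sketch of how β₀ and β₁ can be estimated by ordinary least squares using the closed-form formulas. The toy values match the example DataFrame used in this guide; the variable names (beta_0, beta_1, y_hat) are chosen here for illustration only.

```python
import numpy as np

# Toy data (same values as the example DataFrame in this guide)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 2.8, 4.2, 5.1])

# Closed-form least-squares estimates for simple linear regression:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Predicted values from y = beta_0 + beta_1 * x
y_hat = beta_0 + beta_1 * x

print("Intercept (beta_0):", beta_0)
print("Slope (beta_1):", beta_1)
```

For this data the slope comes out close to 1.04 and the intercept close to -0.10.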

For multiple linear regression with multiple features, the equation becomes:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ
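
As a rough illustration of the multiple-feature case, the sketch below solves for the coefficients with NumPy's least-squares solver. The data is made up and noise-free (generated from y = 1 + 2x₁ + 3x₂), so the recovered coefficients should match those values; the array names are assumptions for this example.

```python
import numpy as np

# Hypothetical data with two features, generated from y = 1 + 2*x1 + 3*x2
features = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 1 + 2 * features[:, 0] + 3 * features[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept beta_0
X = np.column_stack([np.ones(len(features)), features])

# Solve the least-squares problem X @ beta ~= y
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print("beta_0, beta_1, beta_2:", beta)
```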

Key Assumptions in Linear Regression:

Linearity: The relationship between the independent variables and the dependent variable is linear
Independence: Observations are independent of each other
Homoscedasticity: Constant variance of errors (residuals)
Normality of Errors: Errors follow a normal distribution
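
These assumptions are usually checked by looking at the residuals of a fitted model. The sketch below is one rough way to start, assuming the toy data from this guide; in practice a residuals-versus-fitted plot and a Q-Q plot are the more common tools, and five points are far too few to draw conclusions.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 2.8, 4.2, 5.1])

# Fit a straight line; np.polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# With an intercept in the model, least-squares residuals average to zero
print("Mean residual:", residuals.mean())

# Crude homoscedasticity check: compare the residual spread for the
# smallest and largest fitted values
order = np.argsort(fitted)
low, high = residuals[order[:2]], residuals[order[-2:]]
print("Spread (low fitted):", np.abs(low).mean())
print("Spread (high fitted):", np.abs(high).mean())
```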

Example: Linear Regression in Python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Example DataFrame
data = {'Feature': [1, 2, 3, 4, 5], 'Target': [1, 2, 2.8, 4.2, 5.1]}
df = pd.DataFrame(data)

# Split the data into features and target
X = df[['Feature']]
y = df['Target']

# Split into training and test sets (with 5 rows, test_size=0.2 holds out a single row)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

The model predicts a continuous target variable. The same approach applies to real-world targets such as a house price or a stock price.
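
To connect a fitted scikit-learn model back to the equation y = β₀ + β₁x, the sketch below fits on all five toy rows (no train/test split, which is an assumption made here to keep the numbers easy to follow) and reads the learned intercept and slope from the model's intercept_ and coef_ attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data from this guide, fitted on all rows for illustration
X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([1, 2, 2.8, 4.2, 5.1])

model = LinearRegression()
model.fit(X, y)

# intercept_ corresponds to beta_0, coef_[0] to beta_1
beta_0 = model.intercept_
beta_1 = model.coef_[0]
print("Intercept:", beta_0)
print("Slope:", beta_1)

# Predict for an unseen feature value; equals beta_0 + beta_1 * 6
prediction = model.predict(np.array([[6.0]]))[0]
print("Prediction for x = 6:", prediction)
```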

What is Logistic Regression?

Logistic Regression is a statistical method for predicting a categorical outcome, most often a binary one, from one or more independent variables. Despite its name, it is used for classification tasks rather than regression tasks.

Logistic Regression applies the logistic function (also known as the sigmoid function) to a linear combination of the features, the same form as the linear regression equation, transforming it into a probability between 0 and 1.

Sigmoid Function in Logistic Regression

The sigmoid function used in logistic regression is:

sigmoid(z) = 1 / (1 + e⁻ᶻ)

Where:

z is the output of the linear equation: z = β₀ + β₁x₁ + ⋯ + βₙxₙ

The result is a value between 0 and 1, interpreted as the probability that the observation belongs to the positive class (the class labeled 1).
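
The sigmoid formula can be sketched in a few lines of NumPy. The coefficients below (beta_0 = -3, beta_1 = 1) are hypothetical values picked so the outputs are easy to read; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, for illustration only
beta_0, beta_1 = -3.0, 1.0

x = np.array([1.0, 3.0, 5.0])
z = beta_0 + beta_1 * x   # linear part: [-2, 0, 2]
p = sigmoid(z)            # probabilities of the positive class

print("z:", z)
print("p:", p)
```

A value of z = 0 maps to a probability of exactly 0.5, and values of z that are equal in size but opposite in sign map to probabilities that sum to 1.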

Example: Logistic Regression in Python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Example DataFrame
data = {'Feature': [1, 2, 3, 4, 5], 'Target': [0, 0, 1, 1, 1]}  # Binary classification
df = pd.DataFrame(data)

# Split the data into features and target
X = df[['Feature']]
y = df['Target']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

The model estimates the probability of class membership (available through predict_proba), and predict converts that probability into a class label, such as 0 or 1, or spam or not spam, using a 0.5 threshold in the binary case.
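
The difference between probabilities and class labels can be seen by calling both predict_proba and predict on the same fitted model. This is a minimal sketch that fits on all five toy rows (an assumption made here for simplicity, not a recommended evaluation setup).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data from this guide, fitted on all rows for illustration
X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns one column per class: [P(class 0), P(class 1)]
proba = model.predict_proba(X)

# predict returns hard class labels
labels = model.predict(X)

# In the binary case the labels match thresholding P(class 1) at 0.5
thresholded = (proba[:, 1] > 0.5).astype(int)

print("P(class 1):", proba[:, 1])
print("Labels:", labels)
```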

Key Differences Between Linear and Logistic Regression

Aspect	Linear Regression	Logistic Regression
Purpose	Predicts continuous values	Predicts categorical values (binary outcomes)
Output	Continuous (e.g., house price, salary)	Probability (transformed to binary class)
Model Function	Linear equation (y = β₀ + β₁x)	Logistic function (sigmoid of linear equation)
Loss Function	Mean Squared Error (MSE)	Log-Loss or Binary Cross-Entropy
Use Cases	Regression tasks (e.g., predicting prices, growth)	Classification tasks (e.g., spam detection)
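
The two loss functions in the table can be computed by hand to see what each one penalizes. The sketch below uses small made-up arrays of true values and predictions; none of these numbers come from a fitted model.

```python
import numpy as np

# Mean Squared Error: average squared gap between actual and predicted values
y_true_reg = np.array([3.0, 5.0])
y_pred_reg = np.array([2.5, 5.5])
mse = np.mean((y_true_reg - y_pred_reg) ** 2)

# Binary cross-entropy (log-loss): penalizes confident wrong probabilities
y_true_clf = np.array([1, 0])
p_pred = np.array([0.9, 0.2])   # predicted probability of class 1
bce = -np.mean(
    y_true_clf * np.log(p_pred) + (1 - y_true_clf) * np.log(1 - p_pred)
)

print("MSE:", mse)
print("Binary cross-entropy:", bce)
```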

Model Evaluation

Linear Regression Evaluation Metrics:

Mean Squared Error (MSE): Measures the average squared difference between actual and predicted values
R-squared (R²): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables

from sklearn.metrics import r2_score

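# y_test and y_pred must come from the linear regression example;
# R-squared needs at least two test samples to be defined
r2 = r2_score(y_test, y_pred)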
print("R-squared:", r2)
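
For intuition, R² can also be computed directly as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes the toy data from this guide and a line with intercept -0.10 and slope 1.04 (the least-squares fit to all five points), then compares the manual result with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

x = np.array([1, 2, 3, 4, 5], dtype=float)
y_true = np.array([1, 2, 2.8, 4.2, 5.1])

# Predictions from the assumed line y = -0.10 + 1.04 * x
y_hat = -0.10 + 1.04 * x

# R-squared = 1 - (unexplained variation / total variation)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("Manual R-squared:", r2_manual)
print("sklearn R-squared:", r2_score(y_true, y_hat))
```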

Logistic Regression Evaluation Metrics:

Accuracy: Percentage of correct predictions
Precision: The proportion of positive predictions that are actually positive
Recall: The proportion of actual positives correctly identified
F1-Score: The harmonic mean of precision and recall

from sklearn.metrics import classification_report

# y_test and y_pred must come from the logistic regression example
print(classification_report(y_test, y_pred))
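
The metrics listed above all derive from counts of true and false positives and negatives. The following sketch computes them by hand on made-up labels and checks the result against scikit-learn; the arrays are hypothetical and chosen so the counts are easy to verify.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision, "sklearn:", precision_score(y_true, y_pred))
print("Recall:", recall, "sklearn:", recall_score(y_true, y_pred))
print("F1:", f1, "sklearn:", f1_score(y_true, y_pred))
```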

When to Use Linear vs. Logistic Regression?

Use Linear Regression when:

The output variable is continuous
You are predicting values such as price, salary, or height

Use Logistic Regression when:

The output variable is categorical, particularly for binary classification tasks
You need to predict whether an email is spam, whether a patient has a disease, or classify any other binary outcomes

Conclusion

Linear Regression is well suited to predicting continuous outcomes, while Logistic Regression is suited to classification problems, most commonly binary ones. Understanding when to use each algorithm, and which metrics to evaluate it with, is key to building effective machine learning models.