Scikit-learn Setup and Workflow

What is Scikit-learn?

Scikit-learn is an open-source Python library that provides simple and efficient tools for data mining and machine learning. It is built on NumPy, SciPy, and matplotlib, and is one of the most widely used libraries for building ML models.

It supports supervised and unsupervised learning algorithms, and it also includes tools for model evaluation and hyperparameter tuning.

Step 1: Install Scikit-learn

Before you can use Scikit-learn in your Python environment, you need to install it. The usual way is with pip:

Installation Command

pip install scikit-learn
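
To confirm the installation worked, one quick check is to import the package and print its version. This is a minimal sketch; the variable name installed_version is only for illustration, and the version string you see depends on your environment.

```python
import sklearn

# The package is installed as "scikit-learn" but imported as "sklearn"
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```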

Step 2: Import Libraries and Load Dataset

With Scikit-learn installed, start by importing the necessary libraries and loading a dataset.

Example: Load the Iris Dataset

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset
data = load_iris()

# Convert to DataFrame for easier exploration
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(df.head())
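
As an alternative to building the DataFrame by hand, load_iris accepts as_frame=True and returns pandas objects directly. The sketch below assumes pandas is installed and uses its own names (iris, iris_df) so it does not clash with the variables above; it also prints the class names and how many samples each class has.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)
iris_df = iris.frame  # the four feature columns plus a 'target' column

print(iris_df.shape)
print(list(iris.target_names))

# Number of samples per class label
print(iris_df['target'].value_counts().sort_index())
```

Checking the class counts early is useful because it tells you whether the dataset is balanced before you pick an evaluation metric.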

Step 3: Preprocessing Data

Before feeding data into a machine learning model, you usually need to preprocess it. Typical steps include:

Splitting the dataset into training and testing subsets.
Normalizing or scaling the features (fit the scaler on the training subset only, so no information leaks from the test subset).
Handling missing values.

Train/Test Split

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80/20 split)
X = df.drop('target', axis=1)  # Features
y = df['target']  # Labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
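
The split above covers the first item in the preprocessing list. The other two items, scaling and handling missing values, might look like the following sketch. It is self-contained and uses its own names (X_tr, X_te, preprocess), none of which come from the original. Iris has no missing values, so the imputer is included only to show where it would go, and the median strategy is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# stratify keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Impute first, then scale; Iris has no gaps, so the imputer changes nothing here
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Learn the medians, means, and standard deviations from the training data only
X_tr_scaled = preprocess.fit_transform(X_tr)

# Reuse the training statistics on the test data (no refitting)
X_te_scaled = preprocess.transform(X_te)

print(np.round(X_tr_scaled.mean(axis=0), 6))
print(np.round(X_tr_scaled.std(axis=0), 6))
```

Calling fit_transform on the training subset and transform on the test subset is what keeps test-set statistics out of the model.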

Step 4: Choosing a Machine Learning Model

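Scikit-learn offers many models for both classification and regression tasks. Here, we'll use Logistic Regression for classification and Linear Regression for regression. Note that Iris is a classification dataset, so the Linear Regression example below only illustrates the API; a real regression task needs a continuous target.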

Example: Logistic Regression for Classification

from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression(max_iter=200)

# Train the model on the training data
model.fit(X_train, y_train)
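
Once fitted, the classifier exposes what it learned. The following sketch refits the same kind of model under its own names (clf, X_tr, X_te are illustrative) and inspects the class labels, the coefficient matrix, and the predicted class probabilities for a few test samples.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

# Class labels seen during fit
print(clf.classes_)

# One row of coefficients per class, one column per feature
print(clf.coef_.shape)

# Probability of each class for the first three test samples; each row sums to 1
proba = clf.predict_proba(X_te[:3])
print(proba.round(3))
```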

Example: Linear Regression for Regression

from sklearn.linear_model import LinearRegression

# Initialize the model
regressor = LinearRegression()

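# Train the model
# Note: y_train holds Iris class labels (0, 1, 2), so this fit only demonstrates
# the API; a real regression task needs a continuous target.
regressor.fit(X_train, y_train)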
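
For a regression example with a genuinely continuous target, one option is to predict one Iris measurement from the others. The sketch below makes that assumption explicit: choosing petal width as the target is an illustrative choice, not something from the original, and the names X_reg, y_reg, and reg are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

features = load_iris(as_frame=True).data

# Assumption for illustration: predict petal width from the other three measurements
X_reg = features.drop(columns="petal width (cm)")
y_reg = features["petal width (cm)"]

Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(Xr_tr, yr_tr)
yr_pred = reg.predict(Xr_te)

reg_mse = mean_squared_error(yr_te, yr_pred)
reg_r2 = r2_score(yr_te, yr_pred)
print("MSE:", reg_mse)
print("R^2:", reg_r2)
```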

Step 5: Model Evaluation

After training a model, evaluate its performance on the held-out test set using metrics suited to the task.

Accuracy (Classification)

from sklearn.metrics import accuracy_score

# Predict using the trained model
y_pred = model.predict(X_test)

# Evaluate the model accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
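
Accuracy is a single number and does not show which classes are confused with each other. A confusion matrix and a per-class report give that breakdown. This is a self-contained sketch with its own variable names (clf, pred, cm), which are not from the original.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1 for each class
print(classification_report(
    y_te, pred, labels=[0, 1, 2], target_names=iris.target_names
))
```

The diagonal of the matrix counts correct predictions, so the off-diagonal cells show exactly where the model goes wrong.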

Mean Squared Error (Regression)

from sklearn.metrics import mean_squared_error

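# Predict using the trained regressor (a separate name avoids overwriting
# the classifier's y_pred from the previous example)
y_pred_reg = regressor.predict(X_test)

# Evaluate the model performance
# The target here is the Iris class label, so this value only illustrates the API
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_reg))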
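
To make the metric concrete, here is a small worked example on hand-picked numbers (the arrays are made up for illustration). The errors are 0.5, -0.5, 0, and -1, so the squared errors sum to 1.5 and the mean squared error is 1.5 / 4 = 0.375. The sketch also computes two related metrics, mean absolute error and R², plus the root of the MSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true_demo, y_pred_demo)   # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
rmse = np.sqrt(mse)                                  # same units as the target
mae = mean_absolute_error(y_true_demo, y_pred_demo)  # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
r2 = r2_score(y_true_demo, y_pred_demo)              # 1 - SS_res / SS_tot

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
print("R^2:", r2)
```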

Step 6: Hyperparameter Tuning

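Hyperparameter tuning is the process of choosing the settings that are fixed before training (such as the regularization strength C) to improve performance. These differ from the parameters the model learns from the data, such as coefficients. Scikit-learn provides GridSearchCV and RandomizedSearchCV for this.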

Example: Hyperparameter Tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV

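# Define the parameter grid (both solvers support multiclass problems such as Iris)
param_grid = {'C': [0.1, 1, 10], 'solver': ['lbfgs', 'newton-cg']}

# Initialize GridSearchCV with 5-fold cross-validation
# max_iter is raised from the default so the solvers have room to converge
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)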

# Fit the model
grid_search.fit(X_train, y_train)

# Best Parameters
print("Best Parameters:", grid_search.best_params_)
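
RandomizedSearchCV, mentioned above, samples a fixed number of candidates from distributions instead of trying every combination. The sketch below is one possible setup: the log-uniform range for C, the number of iterations, and the variable names (search, param_distributions) are assumptions chosen for illustration. It also scores the refitted best model on the held-out test set, which the search itself never sees.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Sample C on a log scale between 0.01 and 100 (illustrative range)
param_distributions = {"C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,        # number of sampled candidates
    cv=5,             # 5-fold cross-validation on the training data
    random_state=0,   # makes the sampled candidates reproducible
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)

# By default the best model is refit on all training data, so it can be scored directly
test_accuracy = search.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

A random search is often a reasonable starting point when the grid would be large, since its cost is set by n_iter and not by the number of combinations.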

Step 7: Model Deployment

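Once you have a well-tuned model, you can save it to disk and load it later to make predictions in another process or environment. Only load model files you trust, because joblib relies on pickle, which can execute arbitrary code when a file is loaded.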

Saving and Loading the Model

import joblib

# Save the model
joblib.dump(model, 'logistic_model.pkl')

# Load the model
loaded_model = joblib.load('logistic_model.pkl')

# Make predictions with the loaded model
predictions = loaded_model.predict(X_test)
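
If the model depends on preprocessing such as scaling, it is usually safer to save the preprocessing and the model together so they cannot drift apart. The sketch below assumes that design: it wraps a scaler and a classifier in a Pipeline, writes it to a temporary directory, and checks that the restored object gives the same predictions. The file name iris_pipeline.joblib is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Scaler and classifier travel together as one object
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions
same_predictions = np.array_equal(
    pipeline.predict(X_demo), restored.predict(X_demo)
)
print("Predictions match after reload:", same_predictions)
```

In practice, load the file with the same Scikit-learn version that saved it, since saved models are not guaranteed to work across versions.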

Scikit-learn Workflow Summary

Step	Task
1. Install Scikit-learn	Install using pip install scikit-learn
2. Import Libraries and Load Data	Import pandas, NumPy, and the Scikit-learn modules you need, then load the dataset
3. Data Preprocessing	Split data into training and test sets, handle missing values, scale data
4. Train a Model	Choose and train a model (e.g., Logistic Regression)
5. Evaluate Model	Use evaluation metrics (e.g., accuracy, mean squared error)
6. Hyperparameter Tuning	Optimize the model using GridSearchCV or RandomizedSearchCV
7. Model Deployment	Save and load the model for predictions in production