Scikit-learn Setup and Workflow
What is Scikit-learn?
Scikit-learn is a Python library that provides simple and efficient tools for data mining and machine learning. It's built on top of NumPy, SciPy, and matplotlib and is one of the most popular libraries for building ML models.
It supports supervised and unsupervised learning algorithms and tools for model evaluation and hyperparameter tuning.
Step 1: Install Scikit-learn
Before you can use Scikit-learn in your Python environment, you need to install it. The following command installs it with pip:
Installation Command
pip install scikit-learn
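To confirm that the installation worked, a quick check along these lines can help. Note that the package is installed under the name scikit-learn but imported under the name sklearn.

```python
import sklearn

# The installed package is named "scikit-learn", but the import name is "sklearn"
print("scikit-learn version:", sklearn.__version__)
```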
Step 2: Import Libraries and Load Dataset
After installing Scikit-learn, start by importing the necessary libraries and loading a dataset.
Example: Load the Iris Dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
# Load the Iris dataset
data = load_iris()
# Convert to DataFrame for easier exploration
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(df.head())
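Beyond the first few rows, it is often useful to check the size of the dataset, the class balance, and whether any values are missing. The following is a minimal sketch of such a check; the species column is a name introduced here for illustration and is not part of the original example.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Map the integer labels (0, 1, 2) to their species names
# ("species" is an illustrative column name)
df['species'] = df['target'].map(dict(enumerate(data.target_names)))

print(df.shape)                      # (rows, columns)
print(df['species'].value_counts())  # samples per class
print(df.isna().sum())               # missing values per column
```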
Step 3: Preprocessing Data
Before feeding the data into a machine learning model, it's important to preprocess it. This includes steps like:
- Splitting the dataset into training and testing subsets.
- Normalizing or scaling the data.
- Handling missing values.
Train/Test Split
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets (80/20 split)
X = df.drop('target', axis=1) # Features
y = df['target'] # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
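The other two preprocessing steps listed above, scaling and handling missing values, can be sketched as follows. This is one possible approach, assuming median imputation and standardization are appropriate for the data; the Iris dataset has no missing values, so the imputer is included only to show where it would go. The stratify argument is an optional addition that keeps class proportions equal across the two subsets.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Impute missing values with the column median, then standardize each feature
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Fit on the training data only, then apply the same transform to the test data
X_train_scaled = preprocess.fit_transform(X_train)
X_test_scaled = preprocess.transform(X_test)

print(X_train_scaled.mean(axis=0).round(2))  # close to 0 for each feature
print(X_train_scaled.std(axis=0).round(2))   # close to 1 for each feature
```

Fitting the scaler on the training subset alone avoids leaking information from the test set into the model.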
Step 4: Choosing a Machine Learning Model
Scikit-learn offers many models for both classification and regression tasks. Here, we'll use Logistic Regression (for classification) and Linear Regression (for regression tasks).
Example: Logistic Regression for Classification
from sklearn.linear_model import LogisticRegression
# Initialize the model
model = LogisticRegression(max_iter=200)
# Train the model on the training data
model.fit(X_train, y_train)
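Once fitted, the classifier can label new observations. The sketch below trains on the full Iris dataset for brevity and predicts the species of a single flower; the measurements in new_flower are hypothetical values chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=200).fit(iris.data, iris.target)

# Hypothetical measurements in cm:
# sepal length, sepal width, petal length, petal width
new_flower = [[5.1, 3.5, 1.4, 0.2]]

label = clf.predict(new_flower)[0]     # predicted class index
proba = clf.predict_proba(new_flower)  # one probability per class

print("Predicted species:", iris.target_names[label])
print("Class probabilities:", proba.round(3))
```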
Example: Linear Regression for Regression
from sklearn.linear_model import LinearRegression
# Initialize the model
regressor = LinearRegression()
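# Train the model
# Note: the Iris target holds class labels (0, 1, 2), so this fit only illustrates the API;
# a real regression task needs a continuous target.
regressor.fit(X_train, y_train)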
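Linear regression is meant for continuous targets, whereas the Iris target consists of class labels. As a sketch of a more typical regression setup, the following uses load_diabetes, a dataset bundled with Scikit-learn that has a continuous target. The dataset choice and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# load_diabetes has a continuous target (a disease progression measure)
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg = LinearRegression()
reg.fit(Xr_train, yr_train)

print("Coefficients:", reg.coef_)
print("Intercept:", reg.intercept_)
print("R^2 on test data:", reg.score(Xr_test, yr_test))
```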
Step 5: Model Evaluation
After training a model, it's crucial to evaluate its performance using various metrics.
Accuracy (Classification)
from sklearn.metrics import accuracy_score
# Predict using the trained model
y_pred = model.predict(X_test)
# Evaluate the model accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
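Accuracy is a single number and can hide which classes are being confused with each other. A confusion matrix and a per-class report give a fuller picture. The following self-contained sketch repeats the split and training steps from above so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 score per class
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```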
Mean Squared Error (Regression)
from sklearn.metrics import mean_squared_error
# Predict using the trained regressor
# (a separate name keeps the classification predictions in y_pred intact)
y_pred_reg = regressor.predict(X_test)
# Evaluate the model performance
print("Mean Squared Error:", mean_squared_error(y_test, y_pred_reg))
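To make the metric concrete, the sketch below computes the mean squared error on a small made-up set of values, checks it against a manual calculation, and derives two related metrics: the root mean squared error (in the same units as the target) and the R² score. The numbers are illustrative, not results from the models above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small illustrative arrays (not model output)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_hat)      # mean of the squared errors
manual_mse = np.mean((y_true - y_hat) ** 2)  # same quantity computed by hand
rmse = np.sqrt(mse)                          # same units as the target
r2 = r2_score(y_true, y_hat)                 # 1.0 would be a perfect fit

print("MSE:", mse)  # 0.375
print("Manual MSE:", manual_mse)
print("RMSE:", rmse)
print("R^2:", r2)
```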
Step 6: Hyperparameter Tuning
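Hyperparameter tuning is the process of choosing the settings that are fixed before training (such as the regularization strength C), as opposed to the parameters the model learns from the data, in order to improve performance. This can be done using GridSearchCV or RandomizedSearchCV.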
Example: Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
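# Define the parameter grid
# (liblinear is omitted: it does not support the multinomial loss and only
# handles multiclass targets such as Iris through one-vs-rest)
param_grid = {'C': [0.1, 1, 10], 'solver': ['lbfgs', 'newton-cg']}
# Initialize GridSearchCV (max_iter is raised so the solvers converge on the unscaled features)
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)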
# Fit the model
grid_search.fit(X_train, y_train)
# Best Parameters
print("Best Parameters:", grid_search.best_params_)
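RandomizedSearchCV, mentioned above, samples a fixed number of candidates from distributions instead of trying every grid combination. The following is a minimal sketch, assuming a log-uniform range for C between 0.01 and 100; that range and the n_iter value are illustrative choices, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Sample C from a log-uniform distribution instead of a fixed grid
param_distributions = {"C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,        # number of sampled candidates
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)
# best_estimator_ is refit on the full training set (refit=True by default)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```

Reporting the score on the held-out test set, rather than only the cross-validation score, gives an estimate that was not used to pick the hyperparameters.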
Step 7: Model Deployment
Once you have a well-tuned model, you can save it to disk so that a separate application or service can load it and make predictions.
Saving and Loading the Model
import joblib
# Save the model
joblib.dump(model, 'logistic_model.pkl')
# Load the model
loaded_model = joblib.load('logistic_model.pkl')
# Make predictions with the loaded model
predictions = loaded_model.predict(X_test)
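A simple sanity check after saving is to confirm that the reloaded model reproduces the original predictions. The sketch below does this in a temporary directory so that it leaves no files behind; training on the full dataset is a simplification for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logistic_model.pkl")
    joblib.dump(clf, path)       # serialize the fitted model
    restored = joblib.load(path)  # read it back from disk

# The restored model should reproduce the original predictions exactly
same = (restored.predict(X) == clf.predict(X)).all()
print("Predictions match:", same)
```

Because joblib files are pickle-based, load only files from sources you trust, and load them with the same Scikit-learn version that was used to save them where possible.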
Scikit-learn Workflow Summary
Step | Task |
---|---|
1. Install Scikit-learn | Install using pip install scikit-learn |
2. Import Libraries and Load Dataset | Import pandas and numpy, then load the Iris dataset with load_iris |
3. Data Preprocessing | Split data into training and test sets, handle missing values, scale data |
4. Train a Model | Choose and train a model (e.g., Logistic Regression) |
5. Evaluate Model | Use evaluation metrics (e.g., accuracy, mean squared error) |
6. Hyperparameter Tuning | Optimize the model using GridSearchCV |
7. Model Deployment | Save and load the model for predictions in production |