Exploratory Data Analysis (EDA) in Python
What is EDA?
Exploratory Data Analysis (EDA) is the process of:
- Analyzing datasets to summarize their key characteristics.
- Identifying data patterns, outliers, missing values, and relationships.
- Preparing data for machine learning and modeling.
Python offers powerful EDA tools through libraries like:
- pandas
- matplotlib
- seaborn
- missingno
Prerequisites (Install Libraries)
pip install pandas matplotlib seaborn missingno
1. Load the Dataset
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
print(df.head())
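For experimenting without a network connection, the same `read_csv` call can be pointed at an in-memory string. The sketch below uses a tiny hypothetical CSV snippet (not the real Titanic file) and assumes you want `pclass` treated as a category rather than an integer.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the last row has an empty age field.
csv_text = """survived,pclass,sex,age
0,3,male,22
1,1,female,38
1,3,female,
"""

# read_csv accepts any file-like object, not only paths and URLs.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"pclass": "category"})
print(sample)
print(sample.dtypes)
```

Empty fields are read as missing values (NaN), which is why `age` ends up as a float column.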
2. Basic Information and Summary
Check Dataset Shape and Info
print(df.shape) # Rows, Columns
df.info() # Data types & non-null count (prints directly, returns None)
print(df.describe()) # Statistical summary
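By default `describe()` reports on numeric columns only. A minimal sketch of how text columns could be included, using a hypothetical toy frame in place of the Titanic data:

```python
import pandas as pd

# Hypothetical toy data standing in for a few Titanic-style columns.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# include="all" adds count/unique/top/freq rows for non-numeric columns.
summary = toy.describe(include="all")
print(summary)

# Distinct values per column (missing values are not counted);
# low counts often point to categorical variables.
distinct = toy.nunique()
print(distinct)
```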
3. Check Missing Values
Using Pandas
print(df.isnull().sum())
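Raw counts are easier to judge next to the share of rows they represent. One possible way to build such a table, shown on a hypothetical toy frame (the column names mirror the Titanic data, the values are made up):

```python
import pandas as pd

# Hypothetical toy data with gaps in two columns.
toy = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull().mean() gives the fraction of missing rows per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)
print(missing)
```

Columns with a very high missing share are candidates for dropping, while lightly affected ones may be worth imputing.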
Visualize Missing Values with missingno
import missingno as msno
import matplotlib.pyplot as plt
msno.matrix(df) # One row per record; gaps mark missing values
plt.show()
4. Univariate Analysis (Single Variable)
Value Counts for Categorical Columns
print(df['sex'].value_counts())
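`value_counts()` can also report proportions and, if asked, count missing entries. A small sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical values; the last entry is missing.
sex = pd.Series(["male", "female", "male", "male", None], name="sex")

# normalize=True returns shares of the non-missing values.
shares = sex.value_counts(normalize=True)
print(shares)

# dropna=False keeps missing values as their own category.
with_missing = sex.value_counts(dropna=False)
print(with_missing)
```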
Histogram for Numerical Data
import matplotlib.pyplot as plt
df['age'].hist(bins=20)
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
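The bin counts behind a histogram can be inspected directly with NumPy, which helps when you want numbers rather than a picture. This sketch uses hypothetical ages and four bins for readability; missing values are dropped first, as the plot does.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
age = pd.Series([2, 18, 22, 24, 29, 30, 35, 42, 58, 80, None], name="age")

# np.histogram returns the count per bin and the bin edges.
counts, edges = np.histogram(age.dropna(), bins=4)
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```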
Countplot for Categorical Data (Seaborn)
import seaborn as sns
sns.countplot(x='class', data=df)
plt.title("Passenger Class Distribution")
plt.show()
5. Bivariate Analysis (Two Variables)
Correlation Matrix
# numeric_only=True skips text columns; without it pandas >= 2.0 raises an error
correlation = df.corr(numeric_only=True)
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
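A heatmap shows every pair at once; when one column is the target, a ranked list can be quicker to read. A minimal sketch on hypothetical toy data, assuming `survived` is the column of interest and a pandas version that supports `numeric_only` (1.5 or later):

```python
import pandas as pd

# Hypothetical toy data; "sex" is text and cannot enter a Pearson correlation.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "pclass": [3, 1, 1, 3, 2, 3],
    "fare": [7.25, 71.28, 53.10, 8.05, 13.00, 8.46],
    "sex": ["male", "female", "female", "male", "female", "male"],
})

# Restrict the correlation matrix to numeric columns.
corr = toy.corr(numeric_only=True)

# Correlation of each feature with the target, strongest first.
target_corr = (
    corr["survived"]
    .drop("survived")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative relationships near the top instead of at the bottom.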
Boxplot: Age vs Survived
sns.boxplot(x='survived', y='age', data=df)
plt.title("Age Distribution by Survival")
plt.show()
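The quartiles drawn by a boxplot can be read off numerically with a grouped `describe()`. The frame below is a hypothetical stand-in with made-up ages:

```python
import pandas as pd

# Hypothetical toy data; one age is missing and is ignored by describe().
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "age": [22.0, 40.0, None, 4.0, 28.0, 36.0],
})

# One row per group with count, mean, std, min, quartiles, and max.
stats = toy.groupby("survived")["age"].describe()
print(stats[["count", "25%", "50%", "75%"]])
```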
Bar Plot: Survival by Gender
sns.barplot(x='sex', y='survived', data=df)
plt.title("Survival Rate by Gender")
plt.show()
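By default `sns.barplot` draws the mean of `y` for each category, so the bar heights can be reproduced with a `groupby`. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate; count shows the group size.
rates = toy.groupby("sex")["survived"].agg(["mean", "count"])
print(rates)
```

Keeping the group size next to the rate makes it easier to spot rates that rest on very few rows.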
6. Outlier Detection
sns.boxplot(x=df['fare'])
plt.title("Outlier Detection in Fare")
plt.show()
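Points drawn beyond the boxplot whiskers follow, by default, the 1.5 × IQR rule. The same rule can be applied in code to list the flagged values. The fares below are hypothetical, and 1.5 is the conventional multiplier rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical fares with one extreme value.
fare = pd.Series([7.25, 7.92, 8.05, 8.46, 13.0, 26.0, 30.0, 512.33], name="fare")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = fare[(fare < lower) | (fare > upper)]
print(f"fences: [{lower:.2f}, {upper:.2f}]")
print(outliers)
```

A flagged value is not necessarily an error; it is a prompt to check whether the record is plausible before removing or capping it.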
7. Feature Relationships
sns.pairplot(df[['age', 'fare', 'pclass', 'survived']], hue='survived')
plt.show()
Common EDA Questions to Ask
Question | Purpose |
---|---|
What is the shape of the dataset? | Understand size |
What are the types of variables? | Identify numeric/categorical |
Are there any missing values? | Handle data quality |
Are there outliers? | Clean anomalies |
How are features related? | Choose features for modeling |
What is the target class balance? | Detect imbalance |
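
Several of the questions above can be answered in one pass. The helper below is a hypothetical convenience function (`quick_overview` is not part of pandas), shown on made-up toy data and assuming the target column is categorical:

```python
import pandas as pd

def quick_overview(df: pd.DataFrame, target: str) -> dict:
    """Collect shape, dtypes, missing counts, and target balance (hypothetical helper)."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "class_balance": df[target].value_counts(normalize=True).to_dict(),
    }

# Hypothetical toy data with an imbalanced target.
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1],
    "age": [22.0, None, 35.0, 4.0],
    "sex": ["male", "male", "female", "female"],
})

overview = quick_overview(toy, target="survived")
for key, value in overview.items():
    print(key, value)
```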