Exploratory Data Analysis (EDA) in Python


What is EDA?

Exploratory Data Analysis (EDA) is the process of:

  • Analyzing datasets to summarize their key characteristics.
  • Identifying data patterns, outliers, missing values, and relationships.
  • Preparing data for machine learning and modeling.

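Python supports each of these steps through libraries such as:

  • pandas (loading, cleaning, and summarizing tabular data)
  • matplotlib (general-purpose plotting)
  • seaborn (statistical plots built on matplotlib)
  • missingno (visualizing missing values)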

Prerequisites (Install Libraries)

pip install pandas matplotlib seaborn missingno


1. Load the Dataset

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
print(df.head())


2. Basic Information and Summary

Check Dataset Shape and Info

print(df.shape)        # (rows, columns)
df.info()              # Data types and non-null counts; prints directly and returns None
print(df.describe())   # Summary statistics for the numeric columns


3. Check Missing Values

Using Pandas

print(df.isnull().sum())
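
Raw counts are easier to judge as a share of the rows. The sketch below is one possible way to report both; the helper name missing_summary and the small sample frame are hypothetical stand-ins, not part of the Titanic dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical four-row frame used only for illustration
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})
report = missing_summary(sample)
print(report)
```

Calling missing_summary(df) on the loaded dataset should work the same way, assuming df is the DataFrame from step 1.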

Visualize Missing Values with missingno

import matplotlib.pyplot as plt
import missingno as msno
msno.matrix(df)        # One row per record; white gaps mark missing values
plt.show()


4. Univariate Analysis (Single Variable)

Value Counts for Categorical Columns

print(df['sex'].value_counts())
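
Proportions are often more informative than raw counts, and missing categories are dropped by default. A minimal sketch using the normalize and dropna parameters of value_counts follows; the sample column is hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical column with one missing entry
sample = pd.DataFrame(
    {"embark_town": ["Southampton", "Cherbourg", "Southampton", None]}
)

# normalize=True returns shares instead of counts;
# dropna=False keeps missing values as their own category
shares = sample["embark_town"].value_counts(normalize=True, dropna=False)
print(shares)
```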

Histogram for Numerical Data

import matplotlib.pyplot as plt
df['age'].hist(bins=20)
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

Countplot for Categorical Data (Seaborn)

import seaborn as sns
sns.countplot(x='class', data=df)
plt.title("Passenger Class Distribution")
plt.show()


5. Bivariate Analysis (Two Variables)

Correlation Matrix

correlation = df.corr(numeric_only=True)  # Skip text columns, which raise an error in pandas 2.0+
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
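
When one column is the modeling target, it can help to read a single column of the matrix rather than the whole heatmap. The sketch below ranks numeric features by the absolute value of their correlation with a target; the sample values are hypothetical, and treating survived as the target is an assumption.

```python
import pandas as pd

# Hypothetical six-row frame used only for illustration
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "fare": [7.0, 71.0, 53.0, 8.0, 80.0, 9.0],
    "pclass": [3, 1, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "male", "female", "male"],
})

# Restrict to numeric columns before computing Pearson correlations
numeric = sample.select_dtypes(include="number")
target_corr = (
    numeric.corr()["survived"]
    .drop("survived")                       # a column's correlation with itself is always 1
    .sort_values(key=abs, ascending=False)  # strongest relationship first, sign preserved
)
print(target_corr)
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom.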

Boxplot: Age vs Survived

sns.boxplot(x='survived', y='age', data=df)
plt.title("Age Distribution by Survival")
plt.show()

Bar Plot: Survival by Gender

sns.barplot(x='sex', y='survived', data=df)
plt.title("Survival Rate by Gender")
plt.show()
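
The bar heights in such a plot are group means, so the same figures can be obtained as numbers with a groupby. This is a small sketch on hypothetical sample data; the column names match the Titanic dataset, but the values do not come from it.

```python
import pandas as pd

# Hypothetical sample: three passengers of each sex
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "survived": [0, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate; count shows the group size behind it
rates = sample.groupby("sex")["survived"].agg(["mean", "count"])
print(rates)
```

Reporting the group size next to the rate makes it easier to spot rates that rest on very few rows.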


6. Outlier Detection

sns.boxplot(x=df['fare'])
plt.title("Outlier Detection in Fare")
plt.show()
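
A box plot marks outliers visually; to list them, one common convention is the 1.5 × IQR rule, which matches the default whisker length in matplotlib and seaborn box plots. The sketch below assumes that rule; the helper name iqr_outliers and the sample fares are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged
    return values[(values < lower) | (values > upper)]


# Hypothetical fares: six ordinary values and one extreme one
fares = pd.Series([7.25, 8.05, 7.90, 13.00, 26.00, 10.50, 512.33], name="fare")
flagged = iqr_outliers(fares)
print(flagged)
```

On the loaded dataset, iqr_outliers(df['fare']) would apply the same rule. Whether flagged rows should be removed, capped, or kept is a separate decision that depends on the modeling goal.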


7. Feature Relationships

sns.pairplot(df[['age', 'fare', 'pclass', 'survived']], hue='survived')
plt.show()


Common EDA Questions to Ask

Question                               Purpose
What is the shape of the dataset?      Understand size
What are the types of variables?       Identify numeric/categorical
Are there any missing values?          Handle data quality
Are there outliers?                    Clean anomalies
How are features related?              Choose features for modeling
What is the target class balance?      Detect imbalance
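
Several of these questions can be answered in one pass. The following is a rough sketch, not a standard API: the function quick_overview, its return keys, and the sample frame are all hypothetical, and outliers and feature relationships are left to the earlier steps.

```python
import numpy as np
import pandas as pd


def quick_overview(frame: pd.DataFrame, target: str) -> dict:
    """Collect shape, variable types, missing columns, and target balance."""
    numeric = frame.select_dtypes(include="number").columns.tolist()
    other = [col for col in frame.columns if col not in numeric]
    return {
        "shape": frame.shape,
        "numeric_columns": numeric,
        "other_columns": other,
        "columns_with_missing": frame.columns[frame.isnull().any()].tolist(),
        # Share of rows in each target class; a lopsided split signals imbalance
        "target_balance": frame[target].value_counts(normalize=True).to_dict(),
    }


# Hypothetical four-row frame used only for illustration
sample = pd.DataFrame({
    "survived": [0, 0, 0, 1],
    "age": [22.0, np.nan, 30.0, 40.0],
    "sex": ["male", "male", "female", "female"],
})
overview = quick_overview(sample, target="survived")
for key, value in overview.items():
    print(f"{key}: {value}")
```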