Welcome to the third post in our data science series! If Python is the engine of data science, Exploratory Data Analysis (EDA) is the compass: it guides you through the wilderness of raw data toward actionable insights. Today, we’ll demystify EDA using the iconic Titanic dataset, a staple for learning data analysis.
What is EDA?
EDA is the process of analyzing datasets to summarize their main characteristics, often using visual methods. It answers questions like:
- What patterns or anomalies exist in the data?
- How are variables distributed or correlated?
- What hypotheses can we test further?
Think of it as a detective’s first walkthrough of a crime scene: observing, noting clues, and forming initial theories.
Key EDA Techniques
1. Summary Statistics
Get a quick sense of how each variable is distributed with metrics like:
- Mean, median, standard deviation
- Min/max values, quartiles
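import pandas as pd
# Load the dataset (assumes titanic.csv is in the working directory)
data = pd.read_csv("titanic.csv")
# Count, mean, std, min, quartiles, and max for every numeric column
print(data.describe())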
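To see where those numbers come from, here is a minimal sketch on a handful of made-up rows. The `toy` DataFrame and its values are hypothetical stand-ins for `titanic.csv`, so the results are illustrative only. It also shows `value_counts()`, a handy companion for text columns, which `describe()` leaves out by default.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic.csv; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, None],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Quartiles and median for the numeric columns; missing values are skipped.
quartiles = toy[["Age", "Fare"]].quantile([0.25, 0.5, 0.75])
print(quartiles)

# Text columns are better summarized by counting each category.
sex_counts = toy["Sex"].value_counts()
print(sex_counts)
```

Comparing the median with the mean is a quick skew check: in this toy data the median fare (8.05) sits well below the mean, which hints at a few very expensive tickets.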
2. Data Cleaning
Handle missing values and outliers:
# Check for missing values
print(data.isnull().sum())
# Drop rows with missing 'Age'
data_clean = data.dropna(subset=['Age'])
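Dropping rows is only one option, and the snippet above does not touch outliers yet. The sketch below shows two common alternatives under stated assumptions: filling missing ages with the median, and flagging outliers with the 1.5 × IQR rule of thumb. The `ages` and `fares` values are hypothetical, and both the median fill and the 1.5 multiplier are conventions rather than requirements.

```python
import pandas as pd

# Hypothetical values; in practice these would be data['Age'] and data['Fare'].
ages = pd.Series([22.0, None, 26.0, 35.0], name="Age")
fares = pd.Series([7.25, 8.05, 7.93, 13.0, 26.0, 512.33], name="Fare")

# Option 1: impute missing ages with the median instead of dropping the rows.
ages_filled = ages.fillna(ages.median())

# Option 2: flag outliers that fall outside 1.5 * IQR from the quartiles.
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(ages_filled.tolist())
print(outliers.tolist())
```

Flagging is not the same as deleting: an extreme fare may be a genuine first-class ticket, so it is worth inspecting flagged rows before deciding what to do with them.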
3. Visualization
Spot trends with plots:
import seaborn as sns
import matplotlib.pyplot as plt
# Survival rate by gender: barplot draws the mean of 'Survived' per group, with error bars
sns.barplot(x='Sex', y='Survived', data=data)
plt.title("Survival Rate by Gender")
plt.show()
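A plot is easier to trust when you can reproduce its numbers. Because `Survived` is coded 0/1, its mean per group is the survival rate, which is what the bar heights represent. The sketch below computes it with `groupby` on a hypothetical toy DataFrame, so the rates shown are not the real Titanic figures.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset would give different rates.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group equals the share of survivors in that group.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```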
4. Correlation Analysis
Identify relationships between variables:
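# numeric_only=True skips text columns such as 'Sex' (requires pandas 1.5+)
correlation = data.corr(numeric_only=True)
# annot=True prints each coefficient inside its cell
sns.heatmap(correlation, annot=True)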
plt.title("Correlation Matrix")
plt.show()
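A heatmap gets crowded quickly, so it can help to pull out a single column of the matrix and rank it. The sketch below sorts features by the strength of their correlation with `Survived`. The toy rows are hypothetical and chosen only to make the pattern visible, so the coefficients say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; coefficients here are illustrative only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Fare": [7.25, 71.28, 13.0, 8.05, 53.1, 7.93],
    "Sex": ["male", "female", "female", "male", "female", "male"],
})

# Pearson correlation over numeric columns only; 'Sex' is excluded.
corr = toy.corr(numeric_only=True)

# Rank the other features by absolute correlation with the target.
ranked = corr["Survived"].drop("Survived").sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top. Keep in mind that correlation only measures linear association and does not establish cause.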
