A Comprehensive Guide to Exploratory Data Analysis: Univariate and Multivariate Analysis
Introduction:
Exploratory Data Analysis (EDA) is a crucial phase in any data science project, paving the way for insights and informed decision-making. In this guide, we work through univariate and multivariate analysis of two real-world datasets, ‘USA Housing’ and ‘Titanic’, using a variety of visualization techniques to uncover patterns and relationships in the data.
Part 1: Univariate Analysis
Section 1.1: Understanding the Data
In our first exploration, we’ll analyze the ‘USA Housing’ dataset, focusing on key aspects such as data size, data types, missing values, and statistical summaries.
- Data Overview:
- The dataset comprises 5000 rows and 6 columns, providing information about various housing parameters in the USA.
# df is assumed to hold the 'USA Housing' data, already loaded as a pandas DataFrame
print("Data Size:", df.shape)
Data Sample:
- To gain a sense of the data, we inspect a random sample of 5 rows.
print("Random Sample:")
print(df.sample(5))
Data Types:
- All columns are of float64 data type, indicating numerical values.
print("Data Types:")
df.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()
Missing Values:
- The dataset is clean, with no missing values in any column.
print("Missing Values:")
print(df.isnull().sum())
Descriptive Statistics:
- Descriptive statistics provide a summary of numerical columns, including mean, standard deviation, and quartiles.
print("Descriptive Statistics:")
print(df.describe())
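To make the rows of the summary concrete, here is a minimal sketch on a hypothetical five-value column (the name `toy` and its values are illustrative assumptions, not part of the housing data). It computes the mean, sample standard deviation, and quartiles one at a time so they can be compared with what `describe()` reports.

```python
import pandas as pd

# Hypothetical toy column, used only to make the numbers easy to check by hand.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = toy.describe()

# The same figures computed one at a time.
mean_x = toy["x"].mean()   # arithmetic mean
std_x = toy["x"].std()     # sample standard deviation (divides by n - 1)
q1, median, q3 = toy["x"].quantile([0.25, 0.5, 0.75])

print(summary)
print(mean_x, std_x, q1, median, q3)
```

Note that the standard deviation reported here is the sample version, so it will differ slightly from a population standard deviation computed with `ddof=0`.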
Duplicate Values:
- No duplicate rows were found in the dataset.
print("Duplicate Values:")
print(df.duplicated().sum())
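Because the housing data is reported as clean, the two checks above return only zeros. As a rough illustration of what they report when there is something to find, the sketch below uses a hypothetical toy frame (the column names and values are assumptions) with one missing value and one repeated row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the USA Housing file):
# one missing value and one row that repeats an earlier row.
toy = pd.DataFrame({
    "income": [50000.0, 62000.0, np.nan, 62000.0],
    "rooms": [5.0, 6.0, 7.0, 6.0],
})

missing_per_column = toy.isnull().sum()  # number of NaN values in each column
duplicate_rows = toy.duplicated().sum()  # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

`duplicated()` compares whole rows, so a value that repeats within a single column does not count unless every other column matches too.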
Correlation Between Columns:
- The correlation matrix shows the pairwise Pearson correlation between numerical variables.
print("Correlation Between Columns:")
print(df.corr())
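Each entry of the correlation matrix is a Pearson coefficient. The following sketch, on a hypothetical two-column frame (names and values are assumptions chosen for easy arithmetic), computes one coefficient from its definition and compares it with the value pandas returns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
})

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = toy.corr().loc["x", "y"]  # default method is Pearson

print(manual_r, pandas_r)
```

A coefficient near 1 or -1 indicates a strong linear relationship; it says nothing about non-linear dependence.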
Section 1.2: Univariate Analysis Techniques
Next, we move on to exploring individual variables using univariate analysis techniques.
- Categorical Data Analysis:
- Focusing on the ‘Titanic’ dataset, we employ countplots and pie charts to visualize categorical data, specifically the passenger class (‘Pclass’).
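import matplotlib.pyplot as plt
import seaborn as sns
# db is assumed to hold the 'Titanic' data, already loaded as a pandas DataFrame
# Countplot
sns.set(style="darkgrid")
sns.countplot(x="Pclass", data=db)
plt.show()
# Pie Chart
db["Pclass"].value_counts().plot(kind="pie", autopct="%1.1f%%")
plt.show()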
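The countplot and pie chart are two views of the same frequency table. As a small sketch, assuming a hypothetical eight-passenger sample rather than the real Titanic file, the counts and the percentages that the pie chart labels can be computed directly:

```python
import pandas as pd

# Hypothetical sample of passenger classes (illustrative values only).
pclass = pd.Series([3, 1, 3, 2, 3, 1, 3, 2], name="Pclass")

counts = pclass.value_counts()                                      # bar heights in the countplot
percentages = (pclass.value_counts(normalize=True) * 100).round(1)  # pie slice labels

print(counts)
print(percentages)
```

Printing the table alongside the charts makes it easier to read off exact values, which plots alone only approximate.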
Numerical Data Analysis:
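- Utilizing the ‘Titanic’ dataset again, we employ a histogram, a histogram with a kernel density estimate (seaborn’s histplot, which replaces the deprecated distplot), and a boxplot to analyze numerical data, specifically passenger age (‘Age’).
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram (missing ages are dropped first, since plt.hist cannot bin NaN values)
plt.hist(db["Age"].dropna(), bins=20, color="skyblue", edgecolor="black")
plt.show()
# Histogram with KDE (histplot replaces the deprecated distplot)
sns.histplot(db["Age"], kde=True, bins=20, color="skyblue", edgecolor="black")
plt.show()
# Boxplot of Age on its own (univariate)
sns.boxplot(x=db["Age"])
plt.show()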
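A boxplot marks points beyond the whiskers as outliers. With seaborn’s default settings the whiskers extend to 1.5 times the interquartile range. The sketch below applies that rule by hand to a hypothetical list of ages (the values are assumptions, not taken from the Titanic data).

```python
import pandas as pd

# Hypothetical ages; 80 is deliberately far from the rest.
ages = pd.Series([22, 25, 28, 30, 32, 35, 38, 80], name="Age")

q1 = ages.quantile(0.25)  # lower edge of the box
q3 = ages.quantile(0.75)  # upper edge of the box
iqr = q3 - q1             # interquartile range (box height)

# Points outside these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, q3, iqr, lower_fence, upper_fence)
print(outliers.tolist())
```

The 1.5 multiplier is a convention rather than a statistical test, so flagged points deserve a closer look before being treated as errors.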
Part 2: Multivariate Analysis
Section 2.1: Numerical to Numerical and Numerical-Categorical Analysis
In this section, we explore relationships between pairs of numerical variables, and between numerical and categorical variables, using scatterplots, barplots, and boxplots.
- Scatterplot (Numerical to Numerical):
- Using the ‘USA Housing’ dataset, we create a scatterplot to visualize the relationship between ‘Avg. Area Income’ and ‘Avg. Area House Age’, with ‘Price’ represented by color.
sns.scatterplot(data=df, x="Avg. Area Income", y="Avg. Area House Age", hue="Price")
Barplot and Boxplot (Numerical-Categorical):
- In the ‘Titanic’ dataset, we use a barplot to compare the mean age and a boxplot to compare the age distribution across passenger classes (‘Pclass’), with survival status differentiated by color.
# Barplot (bar height is the mean Age of each group)
sns.barplot(data=db, x="Pclass", y="Age", hue="Survived")
# Boxplot (shows median, quartiles, and outliers of Age for each group)
sns.boxplot(data=db, x="Pclass", y="Age", hue="Survived")
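By default, seaborn’s barplot draws the mean of the y variable for each group. A minimal sketch of the equivalent table, assuming a hypothetical eight-row frame with the same column names as the Titanic data, is:

```python
import pandas as pd

# Hypothetical rows using the Titanic column names (values are illustrative).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 1],
    "Survived": [1, 1, 0, 0, 0, 1, 1, 0],
    "Age":      [30, 40, 50, 20, 30, 10, 20, 60],
})

# Mean age for every (class, survival) combination: the bar heights.
mean_age = toy.groupby(["Pclass", "Survived"])["Age"].mean()

print(mean_age)
```

Checking the grouped table alongside the plot also shows how many groups exist and makes small differences between bars easier to judge.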
Section 2.2: Categorical-Categorical Analysis
Now, we explore relationships between two categorical variables using a heatmap and a clustermap, and close with a pairplot, which compares numerical variables pairwise.
- Heatmap:
- Using the ‘Titanic’ dataset, a heatmap visualizes the count of passengers in each combination of passenger class and survival status.
import pandas as pd
sns.heatmap(pd.crosstab(db["Pclass"], db["Survived"]), annot=True, fmt="d")
ClusterMap:
- A clustermap organizes and visualizes relationships between categories, revealing potential patterns in passenger survival across different classes.
sns.clustermap(pd.crosstab(db["Pclass"], db["Survived"]), annot=True, fmt="d")
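Raw counts can be hard to compare when the classes differ in size. One option, sketched here on a hypothetical eight-row frame (the values are assumptions), is to normalize the crosstab by row so that each cell becomes the share of that class with a given survival status.

```python
import pandas as pd

# Hypothetical rows using the Titanic column names (values are illustrative).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

counts = pd.crosstab(toy["Pclass"], toy["Survived"])                       # raw counts
row_share = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")  # each row sums to 1

print(counts)
print(row_share)
```

If the normalized table is passed to a heatmap, a float format such as `fmt=".2f"` is needed in place of `fmt="d"`.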
Pairplot:
- For datasets with multiple numerical variables, a pairplot provides a grid of scatterplots for pairwise comparisons.
sns.pairplot(df)
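By default a pairplot uses the numeric columns of the frame and draws one panel per ordered pair, so the grid grows with the square of the column count. The sketch below, on a hypothetical mixed-type frame (column names and values are assumptions), shows how to list those columns and estimate the grid size before plotting.

```python
import pandas as pd

# Hypothetical frame with three numeric columns and one text column.
frame = pd.DataFrame({
    "income": [50.0, 62.0, 71.0],
    "rooms": [5.0, 6.0, 7.0],
    "price": [300.0, 410.0, 520.0],
    "city": ["A", "B", "C"],
})

# Columns a pairplot would include by default.
numeric_cols = frame.select_dtypes(include="number").columns.tolist()

# One panel per ordered pair of numeric columns, diagonal included.
n_panels = len(numeric_cols) ** 2

print(numeric_cols, n_panels)
```

For wide datasets it is usually more readable to pass a chosen subset of columns than to plot every pair.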
Conclusion:
In this guide, we’ve worked through univariate and multivariate analysis. From understanding the dataset’s characteristics to visualizing relationships between variables, each step contributes to a fuller picture of the data. Univariate techniques let us examine individual variables, while multivariate analysis reveals patterns and dependencies between multiple variables. Together they provide the foundation for further statistical modeling and machine learning applications.