Photo by Samantha Hurley from Burst

Pandas EDA Libraries you need in 2020 (Part 1)

Life is short, let Python automate your EDA

EDA (Exploratory Data Analysis) is one of the first steps performed on a given dataset. It helps us understand our data and gives us an idea of the manipulation and cleaning we might have to do. EDA can take anywhere from a few lines of code to a few hundred. In this tutorial, we will look at libraries that help us perform EDA in just a few lines.

Dataset

We will use the Titanic dataset provided by Kaggle. Calling pandas' describe() method on it gives the output below.

Screenshot by Author

As you can see, the Age column has missing values: its count is lower than that of the other columns. The libraries below are basically describe() on steroids.
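To make the describe() check concrete, here is a minimal sketch. It uses a tiny hand-made frame with a few Titanic-style column names; the values are made up for illustration and are not the Kaggle data. With the real file you would load it with pd.read_csv('train.csv') instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few Titanic columns; the values are made up for illustration.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

summary = df.describe()

# describe() counts only non-null entries, so a count below len(df)
# signals missing values in that column.
age_count = int(summary.loc["count", "Age"])
missing_age = len(df) - age_count

print(summary)
print(f"Age: {age_count} of {len(df)} rows present, {missing_age} missing")
```

The count row is the only hint describe() gives about missing data, which is part of why the dedicated EDA libraries are convenient.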

1. Pandas-Profiling

Screencast of EDA Report Generated by Pandas Profiling

Install and Usage

First, we will install the library

pip install pandas-profiling

Next, we will import the library and generate the report

import pandas as pd
import pandas_profiling
df = pd.read_csv('train.csv')  # Titanic training data from Kaggle
prof_report = pandas_profiling.ProfileReport(df, title='Titanic Report')

To display it inside the notebook

prof_report.to_widgets()

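To save it as an HTML file (to_html() on its own only returns the HTML as a string)

prof_report.to_file('titanic_report.html')  # any output path works here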

Key Features in the Report

Panda Profile Report screenshot by Author

A brief overview of your data, including the number of missing cells, the number of duplicate rows, and how many variables are categorical, numerical, etc.
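For intuition, the kind of numbers shown in this overview can be approximated by hand with plain pandas. This is only a sketch on a made-up toy frame, not how Pandas-Profiling computes its report; in particular, treating every non-numeric column as categorical is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the last row duplicates the first.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 22.0],
    "Fare": [7.25, 71.28, 7.92, 7.25],
})

overview = {
    "rows": len(df),
    "columns": df.shape[1],
    # Total number of empty cells across the whole frame.
    "missing_cells": int(df.isna().sum().sum()),
    # Rows that exactly repeat an earlier row.
    "duplicate_rows": int(df.duplicated().sum()),
    "numeric_columns": df.select_dtypes(include="number").shape[1],
    # Assumption: anything non-numeric is counted as categorical.
    "categorical_columns": df.select_dtypes(exclude="number").shape[1],
}

print(overview)
```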

Panda Profile Report screenshot by Author

Warnings based on the distribution of data, number of missing values, zero values etc

Panda Profile Report screenshot by Author
Panda Profile Report screenshot by Author

The distribution of each column, along with its counts of distinct and missing values

Panda Profile Report screenshot by Author

Interactions and Correlation between the various features

Panda Profile Report screenshot by Author
Panda Profile Report screenshot by Author

A count of the missing values for each Feature

2. SweetViz

Screencast of EDA Report Generated by Sweetviz

Install and Usage

First, we will install the library

pip install sweetviz

Next, we will import the library and generate the report

import sweetviz
import pandas as pd
df = pd.read_csv('train.csv')
report = sweetviz.analyze(df)
report.show_html()

You can also pass a file name to show_html()

report.show_html("Titanic.html")

By default, it’s named ‘SWEETVIZ_REPORT.html’

Key Features

Sweetviz Report screenshot by Author

An overview of the data frame is provided. It displays the number of duplicate rows and the number of features of each type.

Sweetviz Report screenshot by Author

The associations between the different features, shown as a really intuitive heatmap. As you can see, the box relating Fare and Pclass is very prominent, which makes sense since a first-class passenger would pay more than a third-class passenger.
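The Fare and Pclass relationship can be sanity-checked with plain pandas. The sketch below uses made-up toy values rather than the Kaggle data, and it uses a simple Pearson correlation plus a group mean, which is not necessarily the association measure Sweetviz draws in its heatmap.

```python
import pandas as pd

# Toy values chosen for illustration only: class 1 pays the most, class 3 the least.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 70.0, 30.0, 25.0, 8.0, 7.0],
})

# Pearson correlation: negative here, because a larger class number means a cheaper ticket.
fare_pclass_corr = df["Fare"].corr(df["Pclass"])

# Average fare per passenger class.
mean_fare_by_class = df.groupby("Pclass")["Fare"].mean()

print(f"corr(Fare, Pclass) = {fare_pclass_corr:.2f}")
print(mean_fare_by_class)
```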

Sweetviz Report screenshot by Author

For each categorical feature, the following relevant information is shown:

  • Data Distribution
  • Features which give information on it
  • Features it can give information about
  • Its correlation with other features

Sweetviz Report screenshot by Author

For numerical features, it shows the numerical and categorical associations and distributions

Sweetviz Report screenshot by Author

It also highlights features with missing values, and the highlighting depends on the percentage of values that are missing.
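The same idea can be sketched in plain pandas: compute the percentage of missing values per column and flag the columns above a cutoff. The toy values and the 40% threshold are assumptions made for illustration; they are not Sweetviz's own rules.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() gives True/False per cell; the column mean of that is the missing fraction.
missing_pct = df.isna().mean() * 100

# Hypothetical cutoff for "worth highlighting"; pick whatever suits your data.
HIGH_MISSING_PCT = 40.0
flagged = missing_pct[missing_pct > HIGH_MISSING_PCT].index.tolist()

print(missing_pct)
print("Columns to look at first:", flagged)
```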

3. Autoviz

Screencast of EDA Report Generated by Autoviz

Install and Usage

First, we will install the library

pip install autoviz

Next, we will import the library and generate the report

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df = AV.AutoViz('train.csv')

Key Features

Autoviz Report screenshot by Author

Scatter plots between pairs of continuous variables

Autoviz Report screenshot by Author

Distribution of the data for the various features
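If you want the numbers behind such distribution plots, a rough equivalent in plain pandas is to bin a continuous column and count the categories of a categorical one. The toy values and the bin edges below are assumptions for illustration and have nothing to do with how Autoviz chooses its bins.

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "Age": [4.0, 22.0, 25.0, 38.0, 54.0, 61.0],
    "Sex": ["male", "female", "male", "female", "male", "male"],
})

# Continuous feature: bucket the ages into hypothetical bins and count each bucket.
age_bins = pd.cut(df["Age"], bins=[0, 18, 40, 80])
age_distribution = age_bins.value_counts().sort_index()

# Categorical feature: count how often each category occurs.
sex_distribution = df["Sex"].value_counts()

print(age_distribution)
print(sex_distribution)
```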

Autoviz Report screenshot by Author

A heatmap and bar plot to show the relationship between continuous features.

Other EDA Libraries

In a future article, I will discuss some of the libraries listed below. In the meantime, I recommend checking them out.

Pandas GUI

Dataprep

D-tale

Dora

Bamboolib