Exploratory Data Analysis (Part 2)

Welcome back,

In Part 1 we established that you are a commander with a finite amount of resources (i.e. time and data). Exploratory analysis is akin to dispatching scouts to determine the best location to deploy your soldiers.


Doing this will make your project so much easier in 3 ways:

  1. You’ll pick up useful hints for data cleaning (which can make or break your models).
  2. You’ll come up with ideas for feature engineering (which can take your models from good to great).
  3. You’ll gain a “feel” for the dataset, which will make it easier to discuss results and deliver greater impact.

However, this process can drag on, and time is a factor to consider. Don’t skip this stage, but don’t get stuck on it either.

To begin, you’ll want to answer a few basic questions about the dataset:

  • How many observations do I have?
  • How many features are there?
  • What data types do my features have? Are they numeric? Categorical?
  • Do I have a target variable?
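A minimal sketch of how these questions could be answered with pandas. The toy DataFrame, its column names, and the choice of "price" as the target are all hypothetical placeholders; in practice you would load your own data instead.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "sqft": [850, 1200, 1500, 2100],
    "exterior": ["brick", "wood", "brick", "siding"],
    "price": [150000, 210000, 265000, 340000],
})

# How many observations (rows) and columns are there?
# Note: the column count includes the target column, if one exists.
n_observations, n_columns = df.shape
print(f"{n_observations} observations, {n_columns} columns")

# What data type does each column have?
print(df.dtypes)

# Split the columns into numeric and non-numeric (treated here as categorical).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)

# Is there a target variable? "price" is an assumed name for illustration.
target = "price"
has_target = target in df.columns
print("target present:", has_target)
```

Treating every non-numeric column as categorical is a simplification; dates or free-text columns would need separate handling.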

After that, you’ll want to display a few example observations from the dataset to get a “feel” for the values of each feature and to make sure everything is in order.

The goal of displaying example observations isn’t to perform a rigorous analysis. Rather, it’s to get a qualitative “feel” for the dataset.

  • Do you understand the columns?
  • Do those column values make sense?
  • Is the format of the numbers correct?
  • Based on a quick eyeball test, is missing data going to be a major issue?
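One way to run this eyeball test is sketched below with pandas. The DataFrame and its columns are again hypothetical, with a couple of missing values planted on purpose.

```python
import pandas as pd

# Hypothetical toy data with a few missing values planted on purpose.
df = pd.DataFrame({
    "sqft": [850, 1200, None, 2100, 1750],
    "exterior": ["brick", "wood", "brick", None, "siding"],
    "price": [150000, 210000, 265000, 340000, 298000],
})

# Look at the first few rows, then at a random handful,
# since the top of a file is not always representative.
first_rows = df.head(3)
random_rows = df.sample(2, random_state=0)
print(first_rows)
print(random_rows)

# Share of missing values per column, as a rough guide to
# whether missing data is likely to be a major issue.
missing_share = df.isna().mean()
print(missing_share)
```

This is only a quick qualitative check; it does not replace a proper treatment of missing data later on.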

At this point, you should start taking notes about potential fixes you’d like to make. If something looks out of place, such as a possible outlier in one of your features, now is a good time to ask the client/key stakeholder or to dig a little deeper.

Next, plot the distributions of your categorical features.
Histograms are not suited to categorical features, so use bar plots instead.
You’ll want to keep an eye out for sparse classes, which are classes with a small number of observations. A “class,” by the way, is merely a unique value for a categorical feature.
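A short sketch of counting classes, flagging sparse ones, and drawing the bar plot. The "exterior" feature, its values, and the MIN_COUNT threshold are hypothetical; what counts as "sparse" depends on the size of your dataset. The plotting helper assumes matplotlib is installed.

```python
import pandas as pd

# Hypothetical categorical feature with two rare classes.
exterior = pd.Series(
    ["brick"] * 6 + ["siding"] * 5 + ["wood"] * 1 + ["stucco"] * 1,
    name="exterior",
)

# Number of observations per class (unique value).
class_counts = exterior.value_counts()
print(class_counts)

# Flag sparse classes. The threshold is an arbitrary choice for illustration.
MIN_COUNT = 2
sparse_classes = sorted(class_counts[class_counts < MIN_COUNT].index)
print("sparse classes:", sparse_classes)


def plot_class_counts(counts):
    """Draw a bar plot of class counts (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_xlabel("class")
    ax.set_ylabel("number of observations")
    return ax
```

Calling `plot_class_counts(class_counts)` followed by `plt.show()` (after importing `matplotlib.pyplot as plt`) would display the chart; short bars are the sparse classes to watch.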

In conclusion, if you want to get the most out of your data, you must first explore it properly. Not every dataset needs to be treated in the same way.

As always, keep learning.
