Preface:

A few essential steps should be followed when exploring a data set, especially if the data are numeric and the associated problem is a regression problem. One of them is computing the five-point summary, which comes from descriptive statistics. The five-point summary shows what the data look like and how they are distributed across the quartiles. Its graphical counterpart is the boxplot. We can generate a boxplot for each feature to get an overall picture of that feature, which is handy for a quick overview of the data set. Python's matplotlib and seaborn can both generate a box-and-whisker plot, and R has a built-in function for the same purpose.
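As a minimal sketch, the five-point summary can be computed with Python's standard library alone. The helper name `five_number_summary` and the sample values are hypothetical, chosen only for illustration. The quartiles here use linear interpolation between data points (`method="inclusive"`); other quartile conventions give slightly different values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    data = sorted(values)
    # method="inclusive" treats the minimum and maximum as the 0th and
    # 100th percentiles and interpolates linearly in between.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

# Hypothetical feature values, used only for illustration.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]
print(five_number_summary(sample))  # (2, 4.5, 7.0, 9.5, 30)
```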

The figure below is an example of a boxplot of normally distributed data:

What can we get from a boxplot?

It shows the distribution of the data (a feature/column) with the help of a box and whiskers (lines extending outward from both ends of the box).

  1. It is another way of graphing a distribution.
  2. A boxplot displays the five-point summary, which helps users get critical information quickly.
  3. The five-point summary consists of the minimum value, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum value.
  4. Whiskers are the lines drawn outward from both ends of the box, perpendicular to its edges.
  5. It can be used to detect outliers.
  6. Any data point outside the whiskers is treated as an outlier.
  7. Each whisker extends at most 1.5 * inter-quartile range (IQR) from the box and ends at the most extreme data point within that range.
  8. IQR = Q3 - Q1, which is also the length of the box.
  9. Data points marked in red are the outliers, as they lie beyond the ends of the whiskers.
  10. A data point more than 3 * IQR below Q1 or above Q3 is usually called an extreme outlier.
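The whisker and outlier rules listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the function name `iqr_outliers`, the parameter `k`, and the sample values are hypothetical, and the quartiles use linear interpolation.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                  # length of the box
    lower_fence = q1 - k * iqr     # furthest a lower whisker may reach
    upper_fence = q3 + k * iqr     # furthest an upper whisker may reach
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical feature values, used only for illustration.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]
lower, upper, outliers = iqr_outliers(sample)
print(lower, upper, outliers)  # -3.0 17.0 [30]
```

Calling `iqr_outliers(sample, k=3.0)` applies the stricter 3 * IQR threshold instead.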

 

Enough theory; let's move on to the code.

The following section shows how a boxplot is generated in Python:

Data set:

Sample boxplot:

Boxplot without outliers

Boxplot with outliers

Boxplot showing outliers
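A minimal sketch of how boxplots with and without outliers could be drawn with matplotlib. The data values, variable names, and output file name are hypothetical and are not the data set used above. The `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove to use an interactive window
import matplotlib.pyplot as plt

# Hypothetical feature values, chosen only for illustration.
clean = list(range(10, 31))          # evenly spread values, no outliers
with_outliers = clean + [-20, 60]    # two values far from the bulk of the data

fig, (ax_clean, ax_out) = plt.subplots(1, 2, figsize=(8, 4))

# By default the whiskers follow the 1.5 * IQR rule (whis=1.5).
result_clean = ax_clean.boxplot(clean)
ax_clean.set_title("Boxplot without outliers")

# Draw the outlier markers ("fliers") in red.
red_markers = dict(marker="o", markerfacecolor="red", markeredgecolor="red")
result_out = ax_out.boxplot(with_outliers, flierprops=red_markers)
ax_out.set_title("Boxplot with outliers")

# The returned dict exposes the plotted outliers as Line2D objects.
print(list(result_out["fliers"][0].get_ydata()))

fig.savefig("boxplots.png")
```

The dictionary returned by `boxplot` also holds the whisker, cap, and median lines, which is useful when the plotted values need to be inspected and not only viewed.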