A few essential steps should be followed when the data set is numeric and the problem is a regression problem. One of them is the five-point summary, which comes from descriptive statistics. It shows what the data look like and how they are distributed across the quartiles. The near-identical visual counterpart of the five-point summary is the boxplot. We can generate a boxplot for each feature and get an overall picture of that feature, which is handy for a quick overview of the data set. Python’s matplotlib and seaborn can both generate a box-and-whisker plot, and R has a built-in function for the same.
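As a quick illustration, the five-point summary can be computed directly with NumPy. This is only a minimal sketch: the function name `five_point_summary` and the `sample` values are made up for this example, and the quartiles use NumPy's default linear interpolation, so other quartile conventions may give slightly different numbers.

```python
import numpy as np


def five_point_summary(values):
    """Return min, Q1, median, Q3 and max of a 1-D numeric sequence.

    Hypothetical helper for illustration; not part of any library.
    """
    arr = np.asarray(values, dtype=float)
    # np.percentile uses linear interpolation between data points by default.
    q1, median, q3 = np.percentile(arr, [25, 50, 75])
    return {
        "min": float(arr.min()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(arr.max()),
    }


# Made-up sample data with one unusually large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]
summary = five_point_summary(sample)
print(summary)
```

For a pandas DataFrame, `df.describe()` reports the same five values per numeric column, along with the count, mean, and standard deviation.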
The figure below is an example of a boxplot of normally distributed data:
What can we get from the boxplot?
It shows the distribution of the data (a feature/column) with the help of a box and whiskers (lines extending perpendicularly from both ends of the box).
- We can say it is another form of distribution graph.
- A boxplot provides a five-point summary, which helps users get critical information quickly.
- The five-point summary includes the minimum value, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum value.
- Whiskers are the lines drawn perpendicularly outward from both sides of the box.
- It can be used to detect outliers.
- Any data point outside the whiskers is treated as an outlier.
- By the usual convention, each whisker can extend at most 1.5 * inter-quartile range (IQR) from the edge of the box, and it ends at the most extreme data point within that distance.
- The IQR is defined as IQR = Q3 - Q1, which is also the length of the box.
- Data points that are marked in red are the outliers, as they lie beyond the ends of the whiskers.
- A data point that lies more than 3 * IQR below Q1 or above Q3 is usually regarded as an extreme outlier.
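Before plotting anything, the whisker rule from the list above can be checked numerically. The sketch below is one possible way to do it, not a definitive implementation: `iqr_outliers`, its parameter `k`, and the `sample` values are hypothetical names and data chosen for this example, and the quartiles again rely on NumPy's default linear interpolation.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (outliers, lower_fence, upper_fence) using the k * IQR rule.

    Hypothetical helper for illustration. k=1.5 matches the usual whisker
    limit; k=3 flags only the more extreme points.
    """
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1                      # length of the box
    lower_fence = q1 - k * iqr         # furthest a lower whisker may reach
    upper_fence = q3 + k * iqr         # furthest an upper whisker may reach
    mask = (arr < lower_fence) | (arr > upper_fence)
    return arr[mask], float(lower_fence), float(upper_fence)


# Made-up sample data with one unusually large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

outliers, low, high = iqr_outliers(sample)            # 1.5 * IQR rule
extreme, low3, high3 = iqr_outliers(sample, k=3.0)    # 3 * IQR rule

print("fences (1.5 * IQR):", low, high)
print("outliers:", outliers)
print("extreme outliers (3 * IQR):", extreme)
```

Here Q1 is 4.5 and Q3 is 9.5, so the IQR is 5 and the fences sit at -3 and 17. The value 30 falls outside both the 1.5 * IQR and the 3 * IQR fences, so it is flagged in both cases.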
Enough theory, let’s move on to the code.
In the following section, we will see how a boxplot is generated in Python:
Boxplot with Outlier