Recipes for the Visualizations of Data Distributions (2024)


Histograms, KDE plots, box(en) plots and violin plots and more…


As a budding data scientist, I realized that the first piece of code in a new project is almost always written to understand the distribution of one or several variables in the data set. Visualizing the distribution of a variable makes it easy to grasp properties such as frequencies, peaks, skewness, center and modality, and to see how values and outliers behave across the data range.

With the excitement of sharing knowledge, I wrote this blog post to summarize what I learned about single-variable (univariate) distributions from several articles and pieces of documentation. I will give the steps to draw each distribution plot without going deep into theory, and keep my post simple.

I will start by explaining the functions to visualize data distributions with Python using Matplotlib and Seaborn libraries. Code behind the visualizations can be found in this notebook.

For illustrations, I used the Gapminder life expectancy data; the cleaned version can be found in this GitHub repository.

The data set shows 142 countries’ life expectancy at birth, population and GDP per capita between the years 1952 and 2007. I will plot the life expectancy at birth using:

  1. Histogram
  2. Kernel Density Estimation and Distribution Plot
  3. Box Plot
  4. Boxen Plot
  5. Violin Plot

Histograms are the simplest way to show how data is spread. Here is the recipe for making a histogram:

  • Create buckets (bins) by dividing your data range into equal-sized intervals; the number of intervals is the number of bins you have.
  • Record the count of the data points that fall into each bin.
  • Visualize each bucket side by side on the x-axis.
  • Show the counts on the y-axis, so that the height of each bar tells how many items are in that bin.

And you have a brand-new histogram!
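The recipe above can be sketched in plain Python, without any plotting library. This is only an illustration of the counting step: the helper name `histogram_counts` and the sample values are made up for this example and are not taken from the Gapminder data.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    counts = [0] * n_bins
    for v in values:
        # The maximum value would compute index n_bins, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Made-up life expectancy values, for illustration only.
sample = [38.5, 43.1, 47.9, 52.4, 55.0, 59.3, 61.8, 66.2,
          68.9, 70.3, 71.5, 72.8, 74.1, 76.4, 78.5]

edges, counts = histogram_counts(sample, n_bins=4)

# Print a text version of the histogram, one row per bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

Each row of `#` characters plays the role of one bar; a plotting function does the same counting and then draws the bars.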

A histogram is the easiest and most intuitive way to show a distribution. However, one drawback is that you have to decide on the number of bins.

In this graph, I chose 25 bins, which seemed to work best after playing around with the bins parameter in the Matplotlib hist function.

import matplotlib.pyplot as plt

# df is the Gapminder data frame loaded earlier in the notebook
# set the histogram
plt.hist(df.life_expectancy,
         range=(df.life_expectancy.min(),
                df.life_expectancy.max() + 1),
         bins=25,
         alpha=0.5)
# set title and labels
plt.xlabel("Life Expectancy")
plt.ylabel("Count")
plt.title("Histogram of Life Expectancy between 1952 and 2007 in the World")
plt.show()

A different number of bins can significantly change how your data distribution looks. Here is the same data distribution with 5 bins. It looks like a totally different data set, right?

[Figure: the same life expectancy histogram drawn with 5 bins]
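To see the effect of the bin count without a plot, here is a small sketch that counts the same made-up values with 2 bins and with 5 bins. The helper `bin_counts` and the numbers are hypothetical; the point is only that coarse bins can hide a gap that finer bins reveal.

```python
def bin_counts(values, n_bins):
    """Return the number of values in each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


# Two clearly separated groups of made-up values.
two_groups = [40, 41, 42, 43, 44, 70, 71, 72, 73, 74]

coarse = bin_counts(two_groups, 2)
fine = bin_counts(two_groups, 5)

print(coarse)  # [5, 5]          -> looks evenly spread
print(fine)    # [5, 0, 0, 0, 5] -> the empty middle bins expose the gap
```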

If you don’t want to be bothered with choosing the number of bins, then let’s jump to kernel density estimation functions and distribution plots.

Kernel Density Estimation (KDE) plots save you from the hassle of deciding on the bin size by smoothing the histogram. Follow the logic below to create a KDE plot:

  • Plot a Gaussian (normal) curve around each data point.
  • Sum the curves to create a density at each point.
  • Normalize the final curve so that the area under it equals 1, resulting in a probability density function. The original post illustrates these 3 steps with a figure:
[Figure: visual example of the three steps above]
  • You will find the range of the data on the x-axis and the probability density function of the random variable on the y-axis. The probability density function is defined in this article by Will Koehrsen as follows:

You may think of the y-axis on a density plot as a value only for relative comparisons between different categories.
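The three KDE steps can be sketched by hand as follows. This is a minimal illustration, assuming a Gaussian kernel with a fixed bandwidth of 3.0 picked arbitrarily for this example; the helper `make_kde` and the sample values are hypothetical, and a plotting library would normally choose the bandwidth for you.

```python
import math


def make_kde(data, bandwidth):
    """Return a density function: the normalized sum of one Gaussian per data point."""
    n = len(data)
    # Dividing by n * bandwidth * sqrt(2*pi) makes the total area equal to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


sample = [50, 55, 60, 70, 72, 75]  # made-up values
bandwidth = 3.0                    # assumed smoothing width
density = make_kde(sample, bandwidth)

# Check the normalization with a trapezoid-rule integral over a wide grid.
lo = min(sample) - 5 * bandwidth
hi = max(sample) + 5 * bandwidth
step = 0.1
n_steps = round((hi - lo) / step)
grid = [lo + i * step for i in range(n_steps + 1)]
heights = [density(x) for x in grid]
area = sum((heights[i] + heights[i + 1]) / 2 * step for i in range(n_steps))

print(f"area under the curve: {area:.3f}")
```

The density is higher near 72, where three sample points cluster, than near 65, where there are none, which is the kind of relative comparison the y-axis is meant for.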

Luckily, you don’t have to remember and apply all these steps manually. Seaborn’s KDE plot function completes all these steps for you. Just pass the column of your data frame or NumPy array to it!

import seaborn as sns

# set KDE plot, title and labels
# (fill=True replaces shade=True, which newer Seaborn releases deprecate)
ax = sns.kdeplot(df.life_expectancy, fill=True, color="b")
plt.title("KDE Plot of Life Expectancy between 1952 and 2007 in the World")
plt.ylabel("Density")

If you want to combine a histogram and a KDE plot, Seaborn has another cool way to show both in one graph: the distribution plot. It draws a KDE plot and lets you turn the histogram on and off by changing the hist parameter in the function.

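# set distribution plot, title and labels
# (distplot is deprecated in recent Seaborn releases;
#  sns.histplot(df.life_expectancy, kde=True) is the suggested replacement)
ax = sns.distplot(df.life_expectancy, hist=True, color="b")
plt.title("Distribution Plot of Life Expectancy between 1952 and 2007 in the World")
plt.ylabel("Density")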

KDE plots are also capable of showing distributions among different categories:

# create list of continents
continents = df["continent"].value_counts().index.tolist()
# set kde plot for each continent
for c in continents:
    subset = df[df["continent"] == c]
    sns.kdeplot(subset["life_expectancy"], label=c, linewidth=2)
# set title, x and y labels
plt.title("KDE Plot of Life Expectancy Among Continents Between 1952 and 2007")
plt.ylabel("Density")
plt.xlabel("Life Expectancy")
# show the continent labels
plt.legend()

Although KDE plots and distribution plots involve more computation and mathematics than histograms, it is easier to understand the modality, symmetry, skewness and center of a distribution by looking at a continuous line. One disadvantage is that they lack information about summary statistics.

If you wish to provide summary statistics of your distribution visually, then let’s move to the box plots.

Box plots show data distributions with the five-number summary statistics (minimum, first quartile Q1, median or second quartile Q2, third quartile Q3, maximum). Here are the steps to draw them:

  • Sort your data to determine the minimum, the quartiles (first, second and third) and the maximum.
  • Draw a box between the first and third quartiles, then draw a line inside the box at the median.
  • Draw a line outward from the middle of each side of the box, one reaching toward the minimum and one toward the maximum. These lines will be your whiskers.
  • In this basic recipe, the whiskers end at the minimum and maximum of the data. If the plotting function limits the whisker length, any points beyond the whiskers are drawn as little diamonds and interpreted as “outliers”.
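The five numbers behind the box can be computed with the Python standard library. This is a sketch on made-up values; the helper `five_number_summary` is hypothetical, and the "inclusive" quartile method is an assumption chosen here because it interpolates linearly between data points. Other quartile conventions exist and can give slightly different values.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles() sorts the data internally and returns the three cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


sample = [10, 12, 13, 14, 15, 16, 17, 18, 40]  # made-up values
print(five_number_summary(sample))  # (10, 13.0, 15.0, 17.0, 40)
```

The box would span 13 to 17 with the median line at 15, and the basic whiskers would reach 10 and 40.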

The steps to create a box plot manually are straightforward, but I prefer to get some support from Seaborn’s box plot function.

# set the box plot and title
sns.boxplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Boxplot of Life Expectancy between 1952 and 2007 in the World")

There are several ways to calculate the length of the whiskers. By default, Seaborn’s box plot function extends each whisker up to 1.5 times the interquartile range (IQR) beyond the first and third quartiles, stopping at the most extreme data point inside that limit. Thus, any data point bigger than Q3+(1.5*IQR) or smaller than Q1-(1.5*IQR) will be visualized as an outlier. You can change the calculation of the whiskers by adjusting the whis parameter.
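The 1.5 * IQR rule can be worked through by hand. The sketch below is an illustration on made-up values, not Seaborn's own code; the helper `whiskers_and_outliers` is hypothetical, its `whis` argument only mirrors the role of the parameter named above, and the "inclusive" quartile method is an assumption.

```python
import statistics


def whiskers_and_outliers(values, whis=1.5):
    """Return (lower whisker end, upper whisker end, outliers) using the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme data points that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return min(inside), max(inside), outliers


sample = [10, 12, 13, 14, 15, 16, 17, 18, 40]  # made-up values
# Q1 = 13 and Q3 = 17, so IQR = 4 and the fences are 7 and 23.
print(whiskers_and_outliers(sample))  # (10, 18, [40])
```

Here the upper whisker stops at 18 rather than at the maximum, and 40 is flagged as an outlier.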

Like KDE plots, box plots are also suitable for visualizing the distributions among categories:

# set the box plot with the ordered continents and title
sns.boxplot(x="continent", y="life_expectancy", data=df,
            palette="Set3",
            order=["Africa", "Asia", "Americas", "Europe",
                   "Oceania"])
plt.title("Boxplot of Life Expectancy Among Continents Between 1952 and 2007")

Box plots tell the story of the summary statistics: the box shows where the middle half of the data lies, and the whiskers show the overall range. On the other hand, they reveal little about how the data behaves outside the box. That is the reason why some scientists published a paper about boxen plots, also known as extended box plots.

Boxen plots, or letter value plots or extended box plots, might be the least used method for data distribution visualizations, yet they convey more information on large data sets.

To create a boxen plot, let’s first understand what a letter value summary is. A letter value summary is built by repeatedly determining the middle value of sorted data.

First, determine the middle value of all the data, which splits it into two slices. Then determine the median of each of those two slices, and repeat this process on the outer slices until a stopping criterion is reached or no more data is left to be separated.

The first middle value is the median. The middle values determined in the second iteration are called fourths, and the middle values determined in the third iteration are called eighths.
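A letter value summary can be sketched with the usual depth rule: the median sits at depth (n + 1) / 2, and each further depth is (floor(previous depth) + 1) / 2, counted inward from both ends of the sorted data. The helpers `value_at_depth` and `letter_values` are hypothetical names for this illustration, and stopping after a fixed number of levels is a simplification of the stopping criteria used in practice.

```python
import math


def value_at_depth(ordered, depth):
    """Value at a (possibly half-integer) depth, counted from the start of ordered."""
    k = math.floor(depth)
    if depth == k:
        return float(ordered[k - 1])
    # A half-integer depth falls between two neighbours, so average them.
    return (ordered[k - 1] + ordered[k]) / 2


def letter_values(data, levels=3):
    """Return (depth, lower value, upper value) for the median, fourths, eighths, ..."""
    ascending = sorted(data)
    descending = ascending[::-1]
    depth = (len(ascending) + 1) / 2
    summary = []
    for _ in range(levels):
        lower = value_at_depth(ascending, depth)
        upper = value_at_depth(descending, depth)
        summary.append((depth, lower, upper))
        if depth <= 1:  # nothing left to split
            break
        depth = (math.floor(depth) + 1) / 2
    return summary


sample = list(range(1, 16))  # the numbers 1 to 15, made up for illustration
for depth, lower, upper in letter_values(sample):
    print(depth, lower, upper)
# 8.0 8.0 8.0     median
# 4.5 4.5 11.5    fourths
# 2.5 2.5 13.5    eighths
```

Each pair of lower and upper values marks the edges of one of the nested boxes in a boxen plot.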

Now let’s draw a box plot and visualize the letter value summaries outside the box instead of whiskers. In other words, plot a box plot with extended box edges corresponding to the middle values of the slices (eighths, sixteenths and so on).

# set boxen plot and title
sns.boxenplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Boxenplot of Life Expectancy between 1952 and 2007 in the World")

They are also effective in telling the data story for different categories:

# set boxen plot with ordered continents and title
sns.boxenplot(x="continent", y="life_expectancy", data=df,
              palette="Set3",
              order=["Africa", "Asia", "Americas", "Europe",
                     "Oceania"])
plt.title("Boxenplot of Life Expectancy Among Continents Between 1952 and 2007")

Boxen plots emerged to visualize larger data sets more effectively. They show how the data is spread outside of the main box and put more emphasis on the outliers, because outliers and data outside the IQR matter more in larger data sets.

There are two perspectives that give clues about a data distribution: the shape of the distribution and the summary statistics. To explain a distribution from both perspectives at the same time, let’s learn to cook some violin plots.

Violin plots are the perfect combination of the box plots and KDE plots. They deliver the summary statistics with the box plot inside and shape of distribution with the KDE plot on the sides.

It is my favorite plot because the data is expressed with all the details it has. Do you remember the life expectancy distribution shape and summary statistics we plotted earlier? Seaborn’s violin plot function will blend them for us now.

Et voilà !

# set violin plot and title
sns.violinplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Violinplot of Life Expectancy between 1952 and 2007 in the World")

You can observe the peak of the data around 70 by looking at the distribution on the sides, and you can see from the slim box inside that half of the data points lie between roughly 50 and 70.

These beautiful violins can be used to visualize data with categories, and you can express summary statistics with dots, dashed lines or lines if you wish, by changing the inner parameter.


The advantage is obvious: Visualize the shape of the distribution and summary statistics simultaneously!

Bonus points with violin plots: by setting the scale parameter to count, you can also show how many data points you have in each category, thus emphasizing the importance of each category. When I changed scale, Africa and Asia expanded and Oceania shrank, which tells us there are fewer data points in Oceania and more in Africa and Asia.

# set the violin plot with different scale, inner parameter and title
# (Seaborn 0.13 and later rename scale to density_norm)
sns.violinplot(x="continent", y="life_expectancy", data=df,
               palette="Set3",
               order=["Africa", "Asia", "Americas", "Europe",
                      "Oceania"],
               inner=None, scale="count")
plt.title("Violinplot of Life Expectancy Among Continents Between 1952 and 2007")
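As a rough sketch of what scaling by count means, the snippet below computes how wide each violin would be relative to the largest category, assuming the width is simply proportional to the number of observations. The labels are a made-up sample rather than the real Gapminder counts, the helper `relative_widths` is hypothetical, and Seaborn's own drawing code may differ in details.

```python
from collections import Counter


def relative_widths(labels):
    """Map each category to its observation count divided by the largest count."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {category: n / largest for category, n in counts.items()}


# Made-up category labels, one per observation.
labels = ["Africa"] * 6 + ["Asia"] * 3 + ["Oceania"] * 1

widths = relative_widths(labels)
for category, width in widths.items():
    print(f"{category:8s} {width:.2f}")
```

A category with few observations, like Oceania in this sample, ends up with a much thinner violin than the largest category.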

So, these recipes explained the core idea behind each plot for visualizing distributions. There are plenty of options to show single-variable, or univariate, distributions.

Histograms, KDE plots and distribution plots explain the shape of the data very well. Additionally, distribution plots can combine histograms and KDE plots.

Box plots and boxen plots are best for communicating summary statistics, boxen plots work better on large data sets, and violin plots do it all.

They are all effective communicators, and each of them can be built quickly with the Seaborn library in Python. Your visualization choice depends on your project (data set) and on what information you want to convey to your audience. If you are thrilled by this post and want to learn more, you can check the Seaborn and Matplotlib documentation.

Last but not least, this is my first contribution to Towards Data Science, and I hope you enjoyed reading it! I appreciate your constructive feedback and would like to hear your opinions about this blog post in the responses or on Twitter.


FAQs

How do you visualise the distribution of data?

The following visualization methods display frequency, that is, how data is spread out over an interval or grouped:
  1. Box & Whisker Plot.
  2. Bubble Chart.
  3. Density Plot.
  4. Dot Matrix Chart.
  5. Histogram.
  6. Multi-set Bar Chart.
  7. Parallel Sets.
  8. Pictogram Chart.

What are best used to visualize how data is distributed?

Scatter charts: distribution and relationships

Scatter charts present categories of data by circle color and the volume of the data by circle size; they're used to visualize the distribution of, and relationship between, two variables.

What is the best way to view distribution of data?

Box plots show distribution based on a statistical summary, while column histograms are great for finding the frequency of an occurrence. Scatter plots are best for showing distribution in large data sets.

What is the best chart to show the distribution of data?

Scatter Plot Chart

This is useful when looking for outliers or understanding your data's distribution.

Which diagrams will you use to visualize the distribution of data?

A histogram is the most commonly used plot type for visualizing distribution. It shows the frequency of values in data by grouping it into equal-sized intervals or classes (so-called bins). In such a way, it gives you an idea about the approximate probability distribution of your quantitative data.

What is the most appropriate option for visualizing distributions?

Boxplots, or box-and-whisker plots, provide a skeletal representation of a distribution. They are very well suited for showing distributions for multiple groups. There are many variations of boxplots: Most start with a box from the first to the third quartiles and divided by the median.

How to visualize large amounts of data?

When you bin data on both axes of a graph, you make it easier to visualize the big data. Binning can also be used with box plots, which can be especially useful with data so large that even your outliers include millions of data points.

How to display data visually?

Pictogram charts, or pictograph charts, are particularly useful for presenting simple data in a more visual and engaging way. These charts use icons to visualize data, with each icon representing a different value or category. For example, data about time might be represented by icons of clocks or watches.

How do you analyze data distribution?

Find out the average (mean), the middle value (median), and the most common value (mode). Also, look at how spread out your data is by calculating the range, interquartile range, standard deviation, and variance. These calculations can tell you a lot about the shape and characteristics of your data.

How to chart distribution in Excel?

On the Insert menu, click Chart. Under Chart type, click XY (Scatter). Under Chart sub-type, in the middle row, click the chart on the right. Note: Just below these 5 sub-types, the description will say "Scatter with data points connected by smoothed lines without markers."

How do you compare the distribution of data?

One of the simplest and most effective ways to compare data distributions is to visualize them using graphs or charts. Visualizing the data can help you identify the shape, center, spread, and variability of each distribution, as well as any gaps, clusters, or outliers.

What are the 3 C's of data visualization?

Clarity, consistency, and context.

I think if you can provide these 3 things to your dashboard, you're 95% on your way to a great story with data.

What are the 5 C's of data visualization?

However, there are five characteristics of data that will apply across all of your data: clean, consistent, conformed, current, and comprehensive. The five Cs of data apply to all forms of data, big or small.

What are the 3 rules of data visualization?

Conclusion. To recap, here are the three most effective data visualization techniques you can use to deliver presentations that people understand and remember: compare to a real object, include a visual, and give context to your numbers. Try using one or more of these techniques in your next presentation.

How do you visualize data distribution in Excel?

To make a normal distribution graph, go to the “Insert” tab, and in “Charts,” select a “Scatter” chart with smoothed lines and markers. When we insert the chart, we see that our bell curve or normal distribution graph is created.

How do you visualize a continuous distribution of data?

Histograms are usually used to visualize the distribution of a continuous variable. The range of values of a continuous variable is divided into discrete bins, and the number of data points (or values) in each bin is visualized with bars.

What type of graph is used to show the distribution of data?

Histograms are used to show distributions of variables while bar charts are used to compare variables. Histograms plot quantitative data with ranges of the data grouped into bins or intervals while bar charts plot categorical data.

How do you display the distribution of a variable?

Categorical variables should be displayed using pie charts or bar graphs. Quantitative variables are usually displayed using histograms or stemplots. Variables that change over time should be displayed using time plots. The distribution of a variable shows what values it takes and how often it takes these values.

Article information

Author: Saturnina Altenwerth DVM
