For many data visualizations in Python, Seaborn provides the best combination of a high-level API and nice looking plots. As of version
0.11.0, they have a great function for plotting histograms called
histplot(). Let’s take a look.
Once you have your data in a DataFrame, plotting a basic, high quality histogram is a simple one-liner:
import pandas as pd import seaborn as sns df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales-large.csv") # Plot the histogram sns.histplot(df, x="revenue")
We just pass the DataFrame as the first positional argument and a column name for
One of the most important parameters for determining how well a histogram describes your data is the number of bins. Too many bins and you will see too much noise that looks like peaks, too few bins and you won’t be able to see the shape of your distribution. You can change this number with the
As a rule of thumb, I usually try
sns.histplot(df, x="revenue", bins=50)
Then, I back off a bit if it looks too noisy. Let’s try
Much better! Another thing you might want to do with your histogram is convert the y-axis to the empirical probability so that it better represents the underlying distribution. You can make this change by passing in
ax = sns.histplot(df, x="revenue", bins=30, stat="probability") ax.set(ylabel="$p(x)$")
Notice we can also change the y-label. We do this by calling the
set() method on the Matplotlib
AxesSubplot object that Seaborn returns.
Finally, you sometimes want to see multiple distributions in the same plot. If your Pandas DataFrame is in long format, you can do this by passing in a categorical column to the
ax = sns.histplot(df, x="revenue", bins=30, stat="probability", hue="region") ax.set(ylabel="$p(x)$")
You can see how I generated this dataset and the rest of the code for this post in my Jupyter notebook!