Here we explore a number of Python-based visualization tools, which we will run in Google Colab. The key libraries we look at include:
Plotly
The most prominent and earliest of these is Matplotlib, a multi-platform data visualization library built on NumPy arrays. John Hunter conceived it in 2002, originally as a patch to IPython for enabling interactive MATLAB-style plotting. An early adopter was the Space Telescope Science Institute (the Hubble Telescope crew), which also financially supported Matplotlib’s development.
One of Matplotlib’s most important features is its ability to work across platforms, which has helped spawn an active developer base. An active developer community coupled with a large user base is usually the ecosystem a library needs to thrive.
More recently, ggplot2 and ggvis in the R language have shown the appeal of a grammar of graphics. In response, newer packages such as Seaborn (built on Matplotlib) provide expressive features with more structured semantics. Plotnine, for instance, mimics a good deal of the grammar associated with ggplot2. Pandas' plotting methods also wrap Matplotlib's API, whereas Altair is a declarative library built on Vega-Lite rather than on Matplotlib. Plotly provides a range of interactive graphs that are attractive to users and popular on Kaggle.
We present here some data visualization techniques using Python's Seaborn, Matplotlib and Altair libraries. We will use the palmerpenguins dataset.
1. Pie Charts for Categorical Data:
We begin by using pie charts to visualize categorical data. Pie charts are excellent for showing the distribution of various categories within a dataset. We're examining three aspects: sex, species, and island of the penguins. The slices of the pie represent the proportions of each category, giving us an immediate sense of the data's composition.
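As a minimal sketch of the pie charts described above, the following counts the rows in each category and draws one pie per column. The small DataFrame is a made-up stand-in (an assumption, not the real data) so the example runs without a download; in Colab, seaborn.load_dataset("penguins") would provide the actual dataset with the same column names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the penguins data so the sketch runs offline.
# In Colab, seaborn.load_dataset("penguins") would fetch the real dataset.
penguins = pd.DataFrame({
    "sex": np.tile(["Male", "Female"], 30),
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], [26, 12, 22]),
    "island": np.repeat(["Biscoe", "Dream", "Torgersen"], [28, 22, 10]),
})

columns = ["sex", "species", "island"]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, column in zip(axes, columns):
    counts = penguins[column].value_counts()  # number of rows per category
    ax.pie(counts.values, labels=counts.index.tolist(), autopct="%1.1f%%")
    ax.set_title(column.capitalize())
plt.show()
```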
2. Enhancing Aesthetics and Themes:
We import the necessary libraries and adjust the graphic theme using sns.set_theme(). We also customize the figure size and dpi so that our visualizations are clear and impactful.
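One way this setup might look is sketched below. The style name and the particular size and dpi values are illustrative assumptions; the text does not state which values were used.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Apply a seaborn theme, then override figure size and resolution.
# The values (8 x 5 inches, 120 dpi) are assumed for illustration.
sns.set_theme(style="whitegrid", rc={"figure.figsize": (8, 5), "figure.dpi": 120})

# Figures created from now on pick up these defaults.
fig, ax = plt.subplots()
```

Passing the overrides through the rc argument keeps the theme and the figure defaults in a single call; setting plt.rcParams directly afterwards would work as well.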
3. Ignoring Warning Messages:
Sometimes, warning messages can clutter our workspace. To focus solely on the visuals, we use Python's warnings module to temporarily suppress these messages.
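A short sketch of temporary suppression with the standard library is shown here. The function noisy() is a hypothetical helper that stands in for any plotting call that emits a warning.

```python
import warnings


def noisy():
    """Hypothetical helper: emits a warning, like some plotting calls do."""
    warnings.warn("this option is deprecated", FutureWarning)
    return 42


# Suppress warnings only inside this block; earlier filters are restored after it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

# To silence warnings for the rest of the session instead:
# warnings.filterwarnings("ignore")
```

The context manager is the more conservative choice, since a session-wide filter can also hide warnings that matter.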
4. Scatter Plots for Continuous Variables:
These two-dimensional plots are perfect for visualizing relationships between continuous variables. We're exploring the connection between bill length and body mass of penguins. Each dot on the plot represents an individual penguin, and the position along the x and y axes corresponds to its bill length and body mass. We're also adding colors, styles, and sizes to differentiate between species, sexes, and islands.
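The scatter plot described above could be sketched roughly as follows. Which variable is mapped to color, marker style, and size is an assumption here, and the data is a made-up stand-in using the column names of seaborn's penguins dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in data (not real measurements) so the sketch runs offline.
rng = np.random.default_rng(0)
n = 60
bill = rng.normal(44, 5, size=n)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 20),
    "sex": np.tile(["Male", "Female"], 30),
    "island": np.tile(["Biscoe", "Dream", "Torgersen"], 20),
    "bill_length_mm": bill,
    "body_mass_g": 90 * bill + rng.normal(0, 400, size=n),
})

ax = sns.scatterplot(
    data=penguins,
    x="bill_length_mm",
    y="body_mass_g",
    hue="species",   # color encodes species
    style="sex",     # marker shape encodes sex
    size="island",   # marker size encodes island
)
ax.set(xlabel="Bill length (mm)", ylabel="Body mass (g)")
plt.show()
```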
5. Adding Trend Lines:
Next we add trend lines to visually capture overall trends in the data. We introduce trend lines for all data points combined and also for each unique species. This brings a dynamic aspect to our scatter plots, enhancing our understanding of potential patterns.
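Both kinds of trend line can be sketched with seaborn's regression helpers, assuming a linear fit (the text does not say which kind of fit was used). The data is again a made-up stand-in with a built-in linear relationship.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in data with a rough linear relationship plus noise.
rng = np.random.default_rng(0)
n = 60
bill = rng.normal(44, 5, size=n)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 20),
    "bill_length_mm": bill,
    "body_mass_g": 90 * bill + rng.normal(0, 400, size=n),
})

# One linear trend line fitted through all points.
ax = sns.regplot(data=penguins, x="bill_length_mm", y="body_mass_g")
ax.set_title("All penguins")

# One trend line per species; lmplot is figure-level and returns a FacetGrid.
grid = sns.lmplot(data=penguins, x="bill_length_mm", y="body_mass_g", hue="species")
plt.show()
```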
6. Histograms for Data Distribution:
These are fantastic for showcasing the distribution of numerical data. We focus on the bill length of penguins. In a histogram, data is divided into intervals or "bins," and the height of each bar represents the frequency of data falling into that bin. This helps us understand how often certain bill lengths occur and whether there's any skewness or multimodality.
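To make the idea of bins concrete, this small sketch counts how many values fall into each interval using NumPy. The ten bill lengths are made-up numbers chosen for illustration only.

```python
import numpy as np

# Ten made-up bill lengths in millimetres (illustrative, not real data).
bill_lengths = np.array([36.5, 38.1, 39.2, 41.0, 41.7, 44.3, 45.9, 46.2, 49.8, 51.3])

# Split the range 35-55 mm into four equal bins and count values per bin.
counts, edges = np.histogram(bill_lengths, bins=4, range=(35, 55))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f} mm: {count}")
# 35-40 mm: 3
# 40-45 mm: 3
# 45-50 mm: 3
# 50-55 mm: 1
```

A histogram plot draws exactly these counts as bar heights, one bar per bin.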
7. Histograms Unveiled:
Histograms are a great way to understand the distribution of a single variable. In the first example, we use Seaborn's sns.histplot() function to visualize the distribution of penguins' bill lengths. The plot presents us with a clear view of how bill lengths are distributed, and the title "Bill Length" adds context.
8. Controlling Histogram Width:
We can control the width of the bins in a histogram. In the next example, we use the binwidth parameter to create a histogram of flipper lengths, showcasing how we can adjust the granularity of the distribution.
9. Visualizing Categorical Data:
What if we want to compare distributions across categories? Enter sns.histplot() with the hue parameter. In this example, we explore body masses of penguins, colored by species. This adds a layer of depth, enabling us to understand how body masses differ across species.
10. The Power of Bar Plots:
Now let's shift to bar plots. These are excellent for comparing categorical data. In the first example, we use sns.countplot() to visualize the count of penguins in each species. The title "Penguins: Species Count" says it all!
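The count plot can be sketched as follows, using a made-up species column (the counts are assumptions, not the real ones).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in data: species labels with assumed counts of 26, 12 and 22.
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], [26, 12, 22]),
})

# countplot tallies the rows per category and draws one bar for each.
ax = sns.countplot(data=penguins, x="species")
ax.set_title("Penguins: Species Count")
plt.show()
```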
11. Diving into Box Plots:
Box plots are fantastic for understanding the distribution of numerical data across categories. In one instance, we depict how flipper lengths vary among penguin species. The box shows the interquartile range, the line inside represents the median, and whiskers indicate data ranges. The result? "Flipper Length for 3 Penguin Species" visualized!
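As a sketch of the box plot, the example below also computes the quartiles that the boxes summarize, so the picture can be checked against numbers. The flipper lengths are made-up stand-ins.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in data: 20 flipper lengths per species, with assumed means.
rng = np.random.default_rng(0)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 20),
    "flipper_length_mm": np.concatenate([
        rng.normal(190, 6, size=20),
        rng.normal(196, 7, size=20),
        rng.normal(217, 6, size=20),
    ]),
})

# The box edges are the 25th and 75th percentiles; the inner line is the median.
quartiles = (
    penguins.groupby("species")["flipper_length_mm"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)

ax = sns.boxplot(data=penguins, x="species", y="flipper_length_mm")
ax.set_title("Flipper Length for 3 Penguin Species")
plt.show()
```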
12. Exploring Facet Plots:
Facet plots are the real deal when it comes to visualizing multiple subsets in your dataset. We show how to create a grid of histograms to compare flipper lengths based on the island and sex of penguins. Each cell in the grid tells a unique story, enhancing our insights.
13. Unveiling Pair Plots:
Pair plots provide a sneak peek into relationships between multiple numerical variables. We create a matrix of scatter plots and histograms, highlighting the pair relations among numerical attributes. The hue parameter paints the plot with species colors, adding another layer of richness.
14. Cracking Correlations with Heatmaps:
Finally, we dive into heatmaps - a fantastic way to see correlations between numerical variables.
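A correlation heatmap can be sketched by computing the pairwise correlations with pandas and handing the resulting matrix to seaborn. The colormap and the annotation setting are assumed choices, and the data is a made-up stand-in.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in data: three numerical columns with built-in relationships.
rng = np.random.default_rng(0)
n = 60
bill = rng.normal(44, 5, size=n)
penguins = pd.DataFrame({
    "bill_length_mm": bill,
    "flipper_length_mm": 4 * bill + rng.normal(25, 8, size=n),
    "body_mass_g": 90 * bill + rng.normal(0, 400, size=n),
})

# Pearson correlation between every pair of columns (values in [-1, 1]).
corr = penguins.corr()

# annot=True prints each coefficient inside its cell.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation Between Numerical Variables")
plt.show()
```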