The Data Science Pipeline: Visualization
Data visualization is a way to leverage your visual cortex to gain insight into data. Because vision is such a rich and well-developed interface between the human mind and the external world, visualization is a critical tool for understanding and communicating data ideas.
The standard graphics library in Python is Matplotlib, but here we will use a newer package called Plotly. Plotly offers a number of material advantages relative to Matplotlib: (1) figures support interactions like mouseovers and animations, (2) there is support for
If you use Plotly in a Jupyter notebook, the figures will automatically display in an interactive form. Therefore, it is recommended that you follow along using a separate tab with a Jupyter notebook. However, we will use the function show, defined in the cell below, to display the figures as static images so they can be viewed on this page.
from datagymnasia import show
print("Success!")
Scatter plot
We can visualize the relationship between two columns of numerical data by associating them with the horizontal and vertical axes of the Cartesian plane and drawing a point in the figure for each observation. This is called a scatter plot. In Plotly Express, scatter plots are created using the px.scatter function. The columns to associate with the two axes are identified by name using the keyword arguments x and y.
import plotly.express as px
import pydataset
iris = pydataset.data('iris')
show(px.scatter(iris, x='Sepal.Width', y='Sepal.Length'))
An aesthetic is any visual property of a plot object. For example, horizontal position is an aesthetic, since we can visually distinguish objects based on their horizontal position in a graph. We call horizontal position the x aesthetic. Similarly, the y aesthetic represents vertical position.
We say that the x='Sepal.Width' argument maps the 'Sepal.Width' variable to the x aesthetic. We can map other variables to other aesthetics with further keyword arguments, like color and symbol:
show(px.scatter(iris, x='Sepal.Width', y='Sepal.Length', color='Species', symbol='Species'))
Note that we mapped the same categorical variable ('Species') to both the color and symbol aesthetics.
Exercise
Create a new data frame by appending a new column called "area", computed as the product of petal length and petal width. Map this new column to the size aesthetic (keeping x, y, and color the same as above). Which species of flowers has the smallest petal area?
Solution. We use the assign method to add the suggested column, and we include an additional keyword argument to map the new column to the size aesthetic.
show(px.scatter(iris.assign(area = iris["Petal.Length"] * iris['Petal.Width']), x='Sepal.Width', y='Sepal.Length', color='Species', size='area'))
Faceting
Rather than distinguishing species by color, we could also show them on three separate plots. This is called faceting. In Plotly Express, variables can be faceted using the facet_row and facet_col arguments.
show(px.scatter(iris, x = 'Sepal.Width', y = 'Sepal.Length', facet_col = 'Species'))
Line plots
A point is not the only geometric object we can use to represent data. A line might be more suitable if we want to help guide the eye from one data point to the next. Points and lines are examples of plot geometries. Geometries are tied to Plotly Express functions: px.scatter uses the point geometry, and px.line uses the line geometry.
Let's make a line plot using the Gapminder data set, which records life expectancy and per-capita GDP for 142 countries.
import plotly.express as px
gapminder = px.data.gapminder()
usa = gapminder.query('country == "United States"')
show(px.line(usa, x="year", y="lifeExp"))
The line_group argument allows us to group the data by country so we can plot multiple lines. Let's also map the 'continent' variable to the color aesthetic.
show(px.line(gapminder, x="year", y="lifeExp", line_group="country", color="continent"))
Exercise
Although Plotly Express is designed primarily for data analysis, it can be used for mathematical graphs as well. Use px.line to graph the function $f(x) = e^x$ over the interval $[0, 5]$.
Hint: begin by making a new data frame with appropriate columns. You might find np.linspace useful.
Solution. We use np.linspace to define an array of $x$-values, and we exponentiate it to make an array of $y$-values. We package these together into a data frame and plot it with px.line as usual:
import numpy as np
import pandas as pd
x = np.linspace(0, 5, 100)
y = np.exp(x)
df = pd.DataFrame({'x': x, 'exp(x)': y})
show(px.line(df, x = 'x', y = 'exp(x)'))
Bar plots
Another common plot geometry is the bar. Suppose we want to know the average petal width for flowers with a given petal length. We can group by petal length and aggregate with the mean function to obtain the desired data, and then visualize it with a bar graph:
# Average only the numeric columns; 'Species' holds text and cannot be averaged.
show(px.bar(iris.groupby('Petal.Length').mean(numeric_only=True).reset_index(), x = 'Petal.Length', y = 'Petal.Width'))
We use reset_index because we want to be able to access the index column of the data frame (which contains the petal lengths), and the index is not directly accessible from Plotly Express. Resetting makes the index a normal column and replaces it with consecutive integers starting from 0.
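The effect of reset_index is easy to see on a tiny example. The following sketch uses a made-up three-row data frame (the column names length and width are illustrative, not the iris columns).

```python
import pandas as pd

# Toy data: two rows share the same length.
df = pd.DataFrame({"length": [1.0, 1.0, 2.0], "width": [0.2, 0.4, 0.6]})

# After grouping, 'length' is the index rather than an ordinary column.
grouped = df.groupby("length").mean()
print(list(grouped.columns), grouped.index.name)

# reset_index turns the index back into a column and installs a fresh
# integer index starting from 0.
flat = grouped.reset_index()
print(list(flat.columns), list(flat.index))
```

Only the flattened version exposes length as a column name that can be passed to the x argument.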
Perhaps the most common use of the bar geometry is to make histograms. A histogram is a bar plot obtained by binning observations into intervals based on the values of a particular variable and plotting the intervals on the horizontal axis and the bin counts on the vertical axis.
Here's an example of a histogram in Plotly Express.
show(px.histogram(iris, x = 'Sepal.Width', nbins = 30))
We can control the number of bins with the nbins argument.
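The binning step behind a histogram can be reproduced by hand. The sketch below uses NumPy's np.histogram on a small made-up sample; Plotly chooses its own bin edges, so its counts need not match NumPy's exactly, but the idea is the same.

```python
import numpy as np

# Made-up sample of nine measurements.
values = np.array([2.0, 2.2, 2.9, 3.0, 3.0, 3.1, 3.4, 3.8, 4.4])

# Split the range [min, max] into 3 equal-width intervals and count
# how many observations fall in each one.
counts, edges = np.histogram(values, bins=3)

print(edges)   # 4 edges delimit 3 bins
print(counts)  # bar heights of the corresponding histogram
```

The bin edges go on the horizontal axis and the counts become the bar heights; every observation lands in exactly one bin, so the counts sum to the sample size.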
Exercise
Does it make sense to map a categorical variable to the color aesthetic for a histogram? Try changing the command below to map the species column to color.
show(px.histogram(iris, x = 'Sepal.Width', nbins = 30))
Solution. Yes, we can split each bar into multiple colors to visualize the contribution to each bar from each category. This works in Plotly Express:
show(px.histogram(iris, x = 'Sepal.Width', nbins = 30, color = 'Species'))
Density plots
Closely related to the histogram is a one-dimensional density plot. A density plot approximates the distribution of a variable in a smooth way, rather than using the
Unfortunately, Plotly Express doesn't have direct support for one-dimensional density plots, so we'll use a Plotly module called the figure factory:
import plotly.figure_factory as ff
show(ff.create_distplot([iris['Sepal.Width']], ['Sepal.Width']))
The figure factory takes two lists as arguments: one contains the values to use to estimate the density, and the other represents the names of the groups (in this case, we're just using one group). You'll see that the plot produced by this function contains three
If a categorical variable is mapped to the x aesthetic, the point geometry fails to make good use of plot space because all of the points will lie on a limited number of
show(px.box(iris, x = 'Species', y = 'Petal.Width'))
show(px.violin(iris, x = 'Species', y = 'Petal.Width'))
The box plot represents the distribution of the y variable using five numbers: the min, first quartile, median, third quartile, and max. Alternatively, the min and max are sometimes replaced with upper and lower fences, and observations which lie outside are considered outliers and depicted with points. The plot creator has discretion regarding how to calculate fence cutoffs, but one common choice for the upper fence formula is $Q_3 + 1.5\,\mathrm{IQR}$, where $Q_3$ is the third quartile and $\mathrm{IQR}$ is the interquartile range.
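The numbers behind a box plot can be computed directly. The sketch below uses a small made-up sample and one widely used fence convention: the upper fence sits 1.5 interquartile ranges above the third quartile, and, by symmetry (an assumption here), the lower fence sits 1.5 interquartile ranges below the first quartile. Quartile conventions differ between tools, so a plotting library may report slightly different values.

```python
import numpy as np

# Made-up sample with one unusually large observation.
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Quartiles, using NumPy's default (linear interpolation) convention.
q1, median, q3 = np.percentile(y, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Fences under the 1.5 * IQR convention.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Observations beyond the fences would be drawn as individual points.
outliers = y[(y > upper_fence) | (y < lower_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers)
```

Here the value 100 lies far above the upper fence, so a box plot following this convention would draw it as a separate point rather than stretching the whisker to reach it.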
A violin plot is similar to a boxplot, except that rather than a box, a small
In this section we introduced several of the main tools in a data scientist's visualization toolkit, but you will learn many others. Check out the cheatsheet for ggplot2 to see a much longer list of geometries, aesthetics, and statistical transformations.