Embedding Interactive Charts on an IPython Notebook - Part 1

Plotting census data using Pandas, D3.js, Chart.js and HighCharts in an IPython Notebook.

Introduction

In this three-part post we’ll show you how easy it is to integrate D3.js, Chart.js and HighCharts charts into a notebook and how to make them interactive using HTML widgets.

IPython Notebook

This post is also available as an IPython Notebook on github.com.

Requirements

The only requirement to run the examples is IPython Notebook version 2.0 or greater. All the modules that we reference are either in the standard Python distribution, or are dependencies of IPython.

About Pandas

Although Pandas is not strictly necessary for what we do in the examples, it is such a popular data analysis tool that we wanted to use it anyway. We recommend that you read the 10 Minutes to Pandas tutorial to get an idea of what it can do, or buy Python for Data Analysis for an in-depth guide to data analysis using Python, Pandas and NumPy.

About the Data

All the data that we use in the examples are taken from the United States Census Bureau site. We’re going to use 2012 population estimates and we’re going to plot the sex and age groups by the state, region and division.

Population by State

We’re going to build a Pandas DataFrame from the dataset of Incorporated Places and Minor Civil Divisions. We could have just grabbed the estimates for the states, but we also wanted to show you how easy it is to work with data using Pandas. First, we fetch the data using urlopen and parse the response as CSV using Pandas’ read_csv function:

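try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
import pandas as pd

sub_est_2012_df = pd.read_csv(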
    urlopen('http://www.census.gov/popest/data/cities/totals/2012/files/SUB-EST2012.csv'),
    encoding='latin-1',
    dtype={'STATE': 'str', 'COUNTY': 'str', 'PLACE': 'str'}
)
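The dtype argument matters here: codes such as STATE and COUNTY carry leading zeros that would be lost if Pandas inferred them as integers. A minimal offline sketch of the difference, using a made-up three-row sample instead of the real Census file (the rows and numbers are illustrative only):

```python
import io
import pandas as pd

# Made-up sample in the spirit of SUB-EST2012.csv (not real Census figures).
sample_csv = io.StringIO(
    "SUMLEV,STATE,COUNTY,NAME,POPESTIMATE2012\n"
    "040,01,000,Alabama,100\n"
    "050,01,001,Autauga County,60\n"
    "050,01,003,Baldwin County,40\n"
)

# Let Pandas infer the types: the codes become integers.
inferred = pd.read_csv(sample_csv)

# Rewind and read again, forcing the code columns to stay as text.
sample_csv.seek(0)
as_text = pd.read_csv(sample_csv, dtype={'STATE': 'str', 'COUNTY': 'str'})

print(inferred['STATE'].tolist())  # integers, leading zero lost
print(as_text['STATE'].tolist())   # strings, leading zero preserved
```

Keeping the codes as strings also makes them directly comparable with the string codes in other Census lookup tables.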

The resulting data frame has a lot of information that we don’t need and that can be discarded. According to the file layout description, the data is summarized at the nation, state, county and place levels, as indicated by the SUMLEV column. Since we’re only interested in the population of each state, we could just keep the rows with SUMLEV 40, but we wanted to show you how to use the aggregation features of Pandas’ DataFrames. So we’ll take the data summarized at the county level (SUMLEV 50), group it by state, and sum the population estimates.

sub_est_2012_df_by_county = sub_est_2012_df[sub_est_2012_df.SUMLEV == 50]
sub_est_2012_df_by_state = sub_est_2012_df_by_county.groupby(['STATE']).sum()

# Alternatively we could have just taken the summary rows for the states

# sub_est_2012_df_by_state = sub_est_2012_df[sub_est_2012_df.SUMLEV == 40]
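To see why the two approaches are interchangeable, here is a toy check with made-up numbers in which the county rows add up to the state summary row by construction. The column names follow the ones used above; the values are not real Census data.

```python
import pandas as pd

# Made-up rows: state summaries (SUMLEV 40) and their counties (SUMLEV 50).
df = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 40, 50],
    'STATE': ['01', '01', '01', '02', '02'],
    'POPESTIMATE2012': [100, 60, 40, 70, 70],
})

# Approach 1: aggregate the county rows per state.
by_county = df[df.SUMLEV == 50]
aggregated = by_county.groupby('STATE')['POPESTIMATE2012'].sum()

# Approach 2: take the pre-computed state summary rows.
summary = df[df.SUMLEV == 40].set_index('STATE')['POPESTIMATE2012']

print(aggregated.to_dict())
print(summary.to_dict())
```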

If you look at the table, the states are referenced by their ANSI codes. We can augment the table with the state names and abbreviations by merging it with another resource from the Geography section of the US Census Bureau site. We use Pandas’ read_csv function again, making sure that we use the pipe character (|) as the separator.

# Taken from http://www.census.gov/geo/reference/ansi_statetables.html

state = pd.read_csv(urlopen('http://www.census.gov/geo/reference/docs/state.txt'), sep='|', dtype={'STATE': 'str'})
state.drop(
    ['STATENS'],
    inplace=True, axis=1
)
sub_est_2012_df_by_state = pd.merge(sub_est_2012_df_by_state, state, left_index=True, right_on='STATE')
sub_est_2012_df_by_state.drop(
    ['SUMLEV', 'COUSUB', 'CONCIT', 'ESTIMATESBASE2010', 'POPESTIMATE2010', 'POPESTIMATE2011'],
    inplace=True, axis=1
)
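The merge above joins the index of the aggregated frame (the state code left behind by groupby) with the STATE column of the lookup table. A self-contained sketch of that pattern, using an illustrative two-row sample in the pipe-delimited layout rather than the downloaded file:

```python
import io
import pandas as pd

# Illustrative sample in the pipe-delimited layout of state.txt.
state_txt = io.StringIO(
    "STATE|STUSAB|STATE_NAME|STATENS\n"
    "01|AL|Alabama|01779775\n"
    "02|AK|Alaska|01785533\n"
)
state = pd.read_csv(state_txt, sep='|', dtype={'STATE': 'str'})
state = state.drop(['STATENS'], axis=1)

# Stand-in for the aggregated frame: indexed by state code, made-up totals.
totals = pd.DataFrame({'POPESTIMATE2012': [100, 70]},
                      index=pd.Index(['01', '02'], name='STATE'))

# Join the left frame's index against the right frame's STATE column.
merged = pd.merge(totals, state, left_index=True, right_on='STATE')
print(merged[['STATE', 'STUSAB', 'POPESTIMATE2012']])
```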

We’re also interested in plotting the information about the age and sex of the people, and for that we can use the Annual Estimates of the Civilian Population by Single Year of Age and Sex.

# Taken from http://www.census.gov/popest/data/state/asrh/2012/SC-EST2012-AGESEX-CIV.html

sc_est2012_agesex_civ_df = pd.read_csv(
    urlopen('http://www.census.gov/popest/data/state/asrh/2012/files/SC-EST2012-AGESEX-CIV.csv'),
    encoding='latin-1',
    dtype={'SUMLEV': 'str'}
)

Once again, the table is summarized at many levels, but we’re only interested in the information at the state level, so we filter out the unnecessary rows. We also do a little bit of processing to the STATE column so it can be used to merge with the state DataFrame.

sc_est2012_agesex_civ_df_sumlev040 = sc_est2012_agesex_civ_df[
    (sc_est2012_agesex_civ_df.SUMLEV == '040') &
    (sc_est2012_agesex_civ_df.SEX != 0) &
    (sc_est2012_agesex_civ_df.AGE != 999)
].copy()  # work on a copy so the in-place changes below don't act on a slice of the original frame
sc_est2012_agesex_civ_df_sumlev040.drop(
    ['SUMLEV', 'NAME', 'ESTBASE2010_CIV', 'POPEST2010_CIV', 'POPEST2011_CIV'],
    inplace=True, axis=1
)
sc_est2012_agesex_civ_df_sumlev040['STATE'] = sc_est2012_agesex_civ_df_sumlev040['STATE'].apply(lambda x: '%02d' % (x,))
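The last line is needed because this file stores the state code as an integer, while the state lookup table keeps it as a two-character string, and merge keys have to match exactly. A tiny sketch of the formatting step on its own (pad_state_code is a hypothetical helper name used only for this illustration):

```python
def pad_state_code(code):
    """Format an integer state code as a two-character, zero-padded string."""
    return '%02d' % (code,)

# Single-digit codes gain a leading zero, two-digit codes are unchanged.
padded = [pad_state_code(code) for code in (1, 6, 48)]
print(padded)
```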

What we need to do is group the rows by state, region, division and sex, and sum across all ages. Afterwards, we augment the result with the names and abbreviations of the states.

sc_est2012_sex = sc_est2012_agesex_civ_df_sumlev040.groupby(['STATE', 'REGION', 'DIVISION', 'SEX'], as_index=False)[['POPEST2012_CIV']].sum()
sc_est2012_sex = pd.merge(sc_est2012_sex, state, left_on='STATE', right_on='STATE')

For the age information, we group by state, region, division and age, and we sum across both sexes. If you look at the result, you’ll notice that there’s a row for each single year of age. This is pretty useful for analysis, but it can be problematic to plot, so we’re going to group the rows into age buckets of 20 years. Once again, we add the state information at the end.

sc_est2012_age = sc_est2012_agesex_civ_df_sumlev040.groupby(['STATE', 'REGION', 'DIVISION', 'AGE'], as_index=False)[['POPEST2012_CIV']].sum()
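# right=False makes the buckets [0, 20), [20, 40), ... [80, 100), so age 0 and ages over 80 are kept
age_buckets = pd.cut(sc_est2012_age.AGE, range(0, 101, 20), right=False)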
sc_est2012_age = sc_est2012_age.groupby(['STATE', 'REGION', 'DIVISION', age_buckets], as_index=False)['POPEST2012_CIV'].sum()
sc_est2012_age = pd.merge(sc_est2012_age, state, left_on='STATE', right_on='STATE')
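How pd.cut assigns ages to buckets depends on the bin edges and on which side of each interval is closed; values that fall outside every bin become NaN and silently drop out of a later groupby. A small sketch with made-up ages, comparing right-closed bins that stop at 80 with left-closed bins that run to 100:

```python
import pandas as pd

ages = pd.Series([0, 19, 20, 59, 80, 85])

# Default: intervals are closed on the right, (0, 20], (20, 40], ... (60, 80].
right_closed = pd.cut(ages, list(range(0, 100, 20)))

# right=False closes them on the left instead: [0, 20), [20, 40), ... [80, 100).
left_closed = pd.cut(ages, list(range(0, 101, 20)), right=False)

print(pd.DataFrame({'age': ages,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

In the first column ages 0 and 85 have no bucket, while in the second every age is assigned to one.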

We also need information about regions and divisions, but since the dataset is small, we’ll build the dictionaries by hand.

region_codes = {
    0: 'United States Total',
    1: 'Northeast',
    2: 'Midwest',
    3: 'South',
    4: 'West'
}
division_codes = {
    0: 'United States Total',
    1: 'New England',
    2: 'Middle Atlantic',
    3: 'East North Central',
    4: 'West North Central',
    5: 'South Atlantic',
    6: 'East South Central',
    7: 'West South Central',
    8: 'Mountain',
    9: 'Pacific'
}
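One way these dictionaries could be used (an assumption on our part, since they are not applied yet at this point) is to translate the numeric codes into readable labels with Series.map. A short sketch with two illustrative rows:

```python
import pandas as pd

region_codes = {
    0: 'United States Total',
    1: 'Northeast',
    2: 'Midwest',
    3: 'South',
    4: 'West'
}

# Two illustrative rows standing in for the aggregated frames built above.
df = pd.DataFrame({'STUSAB': ['CA', 'NY'], 'REGION': [4, 1]})

# Look up each numeric code in the dictionary to get a label column.
df['REGION_NAME'] = df['REGION'].map(region_codes)
print(df)
```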

Part 1 - Embedding D3.js

D3.js is an incredibly flexible JavaScript chart library. Although it is primarily used to plot data, it can be used to draw arbitrary graphics and animations.

Let’s build a column chart of the five most populated states in the USA. IPython Notebooks are regular web pages, so in order to use a JavaScript library in one we first need to load it. IPython Notebook uses RequireJS to load its own requirements, so we can make use of it with the %%javascript cell magic to load external dependencies.

In all the examples of this notebook we’ll load the libraries from cdnjs.com, so to declare the D3.js requirement we do:

%%javascript
require.config({
    paths: {
        d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min'
    }
});
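The same configuration can also be generated from Python, which may be convenient once several libraries have to be registered. A minimal sketch, assuming a hypothetical helper named require_config_js that is not part of IPython:

```python
import json

def require_config_js(paths):
    """Return a JavaScript snippet that registers library paths with RequireJS."""
    # A JSON object is also a valid JavaScript object literal.
    return 'require.config({0});'.format(json.dumps({'paths': paths}, indent=4))

js = require_config_js({
    'd3': '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min'
})
print(js)
```

In a notebook, the resulting string could then be executed by passing it to IPython.display.Javascript and display.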

Now we’ll make use of the display function and HTML from the IPython Notebook API to render HTML content within the notebook itself. We’re declaring styles to change the look and feel of the plots, and we define a new div with id "chart_d3" that the library is going to use as the target of the plot.

from IPython.display import display, HTML, Javascript

display(HTML("""
<style>
.bar {
 fill: steelblue;
}
.bar:hover {
 fill: brown;
}
.axis {
 font: 10px sans-serif;
}
.axis path,
.axis line {
 fill: none;
 stroke: #000;
}
.x.axis path {
 display: none;
}
</style>
<div id="chart_d3"></div>
"""))

Next, we define the sub_est_2012_df_by_state_template template with the JavaScript code that is going to render the chart. Notice that we iterate over the “data” parameter to populate the “data” variable in JavaScript. Afterwards, we use the display function once again to force the execution of the JavaScript code, which renders the chart on the target div.

display(Javascript(sub_est_2012_df_by_state_template.render(
    data=sub_est_2012_df_by_state.sort(['POPESTIMATE2012'], ascending=False)[:5].itertuples()))
)
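The body of the template is not listed in this post. As a rough sketch of the idea only, assuming a Jinja2 template (Jinja2 is a dependency of IPython Notebook) and rows that expose STUSAB and POPESTIMATE2012 as attributes, the loop over “data” could look like this. The template text, the Row type and the numbers are all hypothetical:

```python
from collections import namedtuple
from jinja2 import Template

# Hypothetical template: the loop emits one JavaScript object per row.
chart_template = Template("""
require(["d3"], function(d3) {
    var data = [];
    {% for row in data %}
    data.push({'state': '{{ row.STUSAB }}', 'population': {{ row.POPESTIMATE2012 }}});
    {% endfor %}
    // ... D3 code that draws `data` into #chart_d3 would go here ...
});
""")

# Stand-in for the rows coming from the DataFrame; the numbers are made up.
Row = namedtuple('Row', ['STUSAB', 'POPESTIMATE2012'])
rows = [Row('CA', 300), Row('TX', 200)]

# Rendering produces plain JavaScript source as a Python string.
javascript_source = chart_template.render(data=rows)
print(javascript_source)
```

The rendered string is what would be wrapped in Javascript(...) and handed to display.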
[Figure: column chart of the five most populated states (javascript_charts_0.png)]

The chart shows that California, Texas, New York, Florida and Illinois are the most populated states. What about the other states? Let’s build an interactive chart that allows us to show whichever states we choose. IPython Notebook provides widgets that allow us to get information from the user in an intuitive manner. Sadly, at the time of this writing, there’s no widget to select multiple items from a list, but IPython is easily extensible, so we built our own and named it MultipleSelectWidget.

We’re going to use IPython’s interact function to display the widgets and execute the callback function display_chart_d3, which is responsible for drawing the chart. As we mentioned before, D3 requires a target element to draw the chart, so we use an HTMLWidget to make sure the div is properly rendered before the callback is executed.

values = {
    record['STUSAB']: "{0} - {1}".format(record['STUSAB'], record['STATE_NAME'])
    for record in state[['STUSAB', 'STATE_NAME']].sort('STUSAB').to_dict(outtype='records')
}
i = interact(
    display_chart_d3,
    data=widgets.fixed(sub_est_2012_df_by_state),
    show_javascript=widgets.CheckboxWidget(value=False),
    states=MultipleSelectWidget(
        value=['CA', 'NY'],
        values=values,
        values_order=sorted(values.keys())
    ),
    div=widgets.HTMLWidget(value='<div id="chart_d3_interactive"></div>')
)
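The display_chart_d3 callback is not listed in this post. One step it presumably has to perform is narrowing the data down to the selected states before rendering. A hedged sketch of that step alone, using a hypothetical helper named select_states and made-up numbers; it calls sort_values, the name this method has in current Pandas releases, rather than the older sort used elsewhere in this post:

```python
import pandas as pd

def select_states(data, states):
    """Return the rows whose STUSAB is in `states`, largest population first."""
    selected = data[data['STUSAB'].isin(states)]
    return selected.sort_values('POPESTIMATE2012', ascending=False)

# Made-up figures, only to exercise the helper.
data = pd.DataFrame({
    'STUSAB': ['CA', 'NY', 'TX'],
    'POPESTIMATE2012': [300, 150, 200],
})

chosen = select_states(data, ['NY', 'CA'])
print(chosen)
```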

We’ve also added a show_javascript checkbox to display the generated code in a pop-up.

[Figures: javascript_charts_1.png, javascript_charts_2.png]

Although D3 is capable of creating incredible charts, it has a steep learning curve and can be overkill if all you want are simple charts.

In parts 2 and 3 we’ll explore alternatives that are simpler but still good-looking.

Want to read more? Follow us on Twitter @machinalis
