Integrating Pandas, Django REST Framework and Bokeh

Ajax based Bokeh charts using Pandas and Django REST Framework

Posted by Agustín Bartó 9 months, 4 weeks ago Comments

It’s no secret that we love Django REST Framework. We’ve written quite a few blog posts about it and it is our default framework for projects that require a web API.

Another package that we use a lot is Pandas (and NumPy by extension). It is fast, flexible, well documented and it has a very active and friendly community. It has made our lives a lot easier when working on data-driven projects.

We regularly deal with projects that require complex Web APIs and we also work on projects that involve intensive data processing but only rarely we combine those two. In the last few months, one of these projects came up and we were pressed for time so we had to chose between the convenience of DRF and the performance and expressiveness of Pandas. Eventually we leaned mostly on the Django/DRF side and took a hit on performance, but this raised an interesting question: What would have been the ideal solution for a project such as ours? The answer is always the same: We would have used (and maybe adapted) available packages that fit our requirements.

When we started thinking about the problem we realized that it wasn’t such a big deal. One of the things we love the most about DRF is its extensibility. It is quite simple to insert custom behavior at every layer of the framework. With this in mind, we started working on a way to integrate Pandas with Django REST Framework.

pandas-drf-tools

pandas-drf-tools is a set of serializers, viewsets and mixins that allows you to expose a Pandas DataFrame through a web API the same way DRF does it with Django querysets.

We tried to follow DRF’s architecture whenever possible, so pandas-drf-tools offers much of the same flexibility, which means that the user can chose the level of integration. You can just use the Serializers and provide a simple read-only view to a DataFrame, or you can use a DataFrameViewSet and provide RESTful read-write access to it. Let us explore the simpler use-case first.

The package provides several Serializers that render DataFrames using methods provided by Pandas. Let’s take a look at DataFrameIndexSerializer:

class DataFrameIndexSerializer(Serializer):
    def to_internal_value(self, data):
        try:
            data_frame = pd.DataFrame.from_dict(data, orient='index').rename(index=int)
            return data_frame
        except ValueError as e:
            raise ValidationError({api_settings.NON_FIELD_ERRORS_KEY: [str(e)]})

    def to_representation(self, instance):
        instance = instance.rename(index=str)
        return instance.to_dict(orient='index')

We’re using to_dict to convert a DataFrame to a dictionary which DRF can then turn it into the appropriate format (usually JSON). We use from_dict to convert a request payload into a DataFrame. Here’s how you could use this Serializer:

class DataFrameIndexSerializerTestView(views.APIView):
    def get_serializer_class(self):
        return DataFrameIndexSerializer

    def get(self, request, *args, **kwargs):
        sample = get_some_dataframe().sample(20)
        serializer = self.get_serializer_class()(sample)
        return response.Response(serializer.data)

    def post(self, request, *args, **kwargs):
        serializer = self.get_serializer_class()(data=request.data)
        serializer.is_valid(raise_exception=True)
        data_frame = serializer.validated_data
        data = {
            'columns': list(data_frame.columns),
            'len': len(data_frame)
        }
        return response.Response(data)

This is a simple DRF APIView that provides a sample of a DataFrame upon receiving a GET request, and parses a POST request payload. It is pretty much the same thing you do when using any kind of custom Serializer. By default, only a very basic level of validation is provided, but you can always overwrite the is_valid method.

Besides DataFrameIndexSerializer, two more Serializers are provided: DataFrameListSerializer and DataFrameRecordsSerializer. The difference is in the methods use to serialize and de-serialize DataFrames, which in turn changes the way the data is rendered.

Besides serializers, pandas-drf-tools also provides a GenericDataFrameAPIView to expose a DataFrame using a view, the same way DRF’s GenericAPIView does it with Django’s querysets. This class will rarely be used directly. A GenericDataFrameViewSet class combined with custom list, retrieve, create, and update mixins turns into DataFrameViewSet (and ReadOnlyDataFrameViewSet) which mimics the behaviour of ModelViewSet.

Instead of setting a queryset field or overriding get_queryset, users of DataFrameViewSet need to set a dataframe field or override the get_dataframe method.

Let’s say we wanted to provide a read-only view of a DataFrame stored in a CSV file. This is quite easy to do:

import pandas as pd

class TestDataFrameViewSet(ReadOnlyDataFrameViewSet):
    serializer_class = DataFrameRecordsSerializer

    def get_dataframe(self):
        return pd.read_csv('dataframe.csv')

Simple. Now, what happens if you want to allow users of the web API to update the dataframe? You’ll need to use DataFrameViewSet instead. By default, this class implements all verbs (list, retrieve, create, update and destroy) but the methods that modify the underlying dataframe are read-only. In order to give the developers a chance to make the change permanent, an update_dataframe callback is provided. For this example, we’ll use pickle to read and write the DataFrame:

import pandas as pd

class TestDataFrameViewSet(DataFrameViewSet):
    serializer_class = DataFrameRecordsSerializer

    def get_dataframe(self):
        return pd.read_pickle('test.pkl')

    def update_dataframe(self, dataframe):
        dataframe.to_pickle('test.pkl')
        return dataframe

These viewsets can then be used the same way as any regular DRF viewset. To actually make them available, we can register them with a router:

from rest_framework.routers import DefaultRouter

router = DefaultRouter()
router.register(r'test', TestDataFrameViewSet, base_name='test')

The only caveat here is that, since there is no queryset associated with the viewset, DRF cannot guess the base name, so it has to be set explicitly.

An example

A complete example that shows most features of pandas-drf-tools is available on GitHub. It is a project that shows you how to use pandas-drf-tools with a live Django REST Framework site that generates Bokeh charts from information stored in Pandas data frames. The data is taken from the US Census Bureau site.

A Vagrantfile is provided if you want to test the live project by yourself.

States Population Estimates

The first part of the example shows what we think is going to be the most common use case for the pandas-drf-tools package, and that is taking an existing DataFrame and exposing it so a front-end application can make use of the data.

@lru_cache()
def get_cc_est2015_alldata_df():
    try:
        data = path.join(path.abspath(path.dirname(__file__)), 'data')
        cc_est2015_alldata_df = pd.read_pickle(path.join(data, 'CC-EST2015-ALLDATA.pkl'))
        state_df = get_state_df()[['STATE', 'STUSAB']]
        cc_est2015_alldata_df = cc_est2015_alldata_df.merge(state_df, on=('STATE',))
    except FileNotFoundError as e:
        raise ImproperlyConfigured(
            'Missing data file. Please run the "download_census_data" management command.') from e

    return cc_est2015_alldata_df


@lru_cache()
def get_state_df():
    try:
        data = path.join(path.abspath(path.dirname(__file__)), 'data')
        state_df = pd.read_pickle(path.join(data, 'state.pkl'))
    except FileNotFoundError as e:
        raise ImproperlyConfigured(
            'Missing data file. Please run the "download_census_data" management command.') from e

    return state_df

The data is presented to the user using Bokeh charts. The charts are generated by two views. The first one shows the population of each state:

def get_state_abbreviations():
    alldata_df = get_cc_est2015_alldata_df()
    return alldata_df['STUSAB'].drop_duplicates().tolist()


def get_states_plot():
    source = AjaxDataSource(
        data={'STATE': [], 'STNAME': [], 'STUSAB': [], 'TOT_POP': [], 'TOT_MALE': [], 'TOT_FEMALE': []},
        data_url='/api/states/', mode='replace', method='GET')

    hover = HoverTool(
        tooltips=[
            ("State", "@STNAME"),
            ("Population", "@TOT_POP"),
            ("Female Population", "@TOT_FEMALE"),
            ("Male Population", "@TOT_MALE"),
        ]
    )

    plot = figure(title='Population by State', plot_width=1200, plot_height=500,
                  x_range=FactorRange(factors=get_state_abbreviations()), y_range=(0, 40000000),
                  tools=[hover, 'tap','box_zoom','wheel_zoom','save','reset'])
    plot.toolbar.active_tap = 'auto'
    plot.xaxis.axis_label = 'State'
    plot.yaxis.axis_label = 'Population'
    plot.yaxis.formatter = NumeralTickFormatter(format="0a")
    plot.sizing_mode = 'scale_width'
    plot.vbar(bottom=0, top='TOT_POP', x='STUSAB', legend=None, width=0.5, source=source)

    url = "/counties/@STATE/"
    taptool = plot.select(type=TapTool)
    taptool.callback = OpenURL(url=url)

    return plot


class StatesView(TemplateView):
    template_name = 'chart.html'

    def get_context_data(self, **kwargs):
        context_data = super().get_context_data(**kwargs)

        plot = get_states_plot()
        bokeh_script, bokeh_div = components(plot, CDN)

        context_data['title'] = 'Population by State'
        context_data['bokeh_script'] = bokeh_script
        context_data['bokeh_div'] = bokeh_div

        return context_data

This view illustrates the guidelines given for embedding Bokeh charts. The interesting parts lies in the usage of AjaxDataSource. When the chart is rendered in the front-end, it’ll make a GET request to the supplied URL to fetch the data. Here’s where pandas-drf-tools comes into play. The request is handled by a ReadOnlyDataFrameViewSet that exposes the dataframe constructed above.

class StateEstimatesViewSet(ReadOnlyDataFrameViewSet):
    serializer_class = DataFrameListSerializer
    pagination_class = LimitOffsetPagination

    def get_dataframe(self):
        alldata_df = get_cc_est2015_alldata_df()

        state_names_df = alldata_df[['STATE', 'STNAME', 'STUSAB']].set_index('STATE')\
            .drop_duplicates()
        latest_total_population = alldata_df[(alldata_df.YEAR == 8) & (alldata_df.AGEGRP == 0)]
        population_by_state = latest_total_population.groupby(['STATE']).sum().join(state_names_df)\
            .reset_index()

        return population_by_state[['STATE', 'STNAME', 'STUSAB', 'TOT_POP', 'TOT_MALE', 'TOT_FEMALE']]

The DataFrameListSerializer used in this view is compatible with the format that Bokeh uses, so there’s nothing else to do.

The only thing left to do is hook everything up with a template:

<!DOCTYPE html>
{% load static %}
<html lang="en">
    <head>
        ...
        <link href="http://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.css" rel="stylesheet" type="text/css">
        <link href="http://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.css" rel="stylesheet" type="text/css">
    </head>
    <body>
        {{ bokeh_div|safe }}
        ...
        <script src="http://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.js"></script>
        <script src="http://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.js"></script>
        {{ bokeh_script|safe }}
    </body>
</html>

This is how it looks:

Screenshot

If you click on state column, a chart is going to be show with the top ten counties (by population) on said state. This view is handled with a regular Django view with an embedded Bokeh chart:

def get_counties_data_frame(state_fips_code):
    alldata_df = get_cc_est2015_alldata_df()
    county_names_df = alldata_df[['STATE', 'COUNTY', 'CTYNAME']].set_index('COUNTY') \
        .drop_duplicates()
    latest_total_population = alldata_df[(alldata_df.YEAR == 8) & (alldata_df.AGEGRP == 0)]
    population_by_county = latest_total_population.groupby(['COUNTY']).sum() \
        .join(county_names_df).reset_index()

    population_by_county = population_by_county[['STATE', 'COUNTY', 'CTYNAME', 'TOT_POP', 'TOT_MALE', 'TOT_FEMALE']]
    population_by_county = population_by_county[population_by_county.STATE == state_fips_code]
    population_by_county = population_by_county.sort_values('TOT_POP', ascending=False)[:10]

    return population_by_county

def get_counties_plot(data_frame):
    plot = Bar(data_frame, label='CTYNAME', values='TOT_POP', agg='max', plot_width=1200, plot_height=500,
               title='Population by County', legend=False)
    plot.xaxis.axis_label = 'County'
    plot.yaxis.axis_label = 'Population'
    plot.yaxis.formatter = NumeralTickFormatter(format="0a")
    plot.sizing_mode = 'scale_width'
    return plot

class CountiesView(TemplateView):
    template_name = 'chart.html'

    def get_context_data(self, **kwargs):
        context_data = super().get_context_data(**kwargs)

        data_frame = get_counties_data_frame(kwargs['state_fips_code'])
        plot = get_counties_plot(data_frame)
        bokeh_script, bokeh_div = components(plot, CDN)

        context_data['title'] = 'Population by County'
        context_data['bokeh_script'] = bokeh_script
        context_data['bokeh_div'] = bokeh_div

        return context_data

Read-write example

The second part shows you how to manipulate a dataframe as if it were a queryset, allowing you not only to list rows of the dataset, but also creating new rows, and updating and deleting existing ones. This time we’re going to use a different data set that only contains state population estimates:

def get_nst_est2015_alldata_df():
    df = cache.get('nst_est2015_alldata_df')

    if df is None:
        try:
            data = path.join(path.abspath(path.dirname(__file__)), 'data')
            df = pd.read_pickle(path.join(data, 'NST-EST2015-alldata.pkl'))
            df = df[df.SUMLEV == '040'][['STATE', 'NAME', 'POPESTIMATE2015']].reset_index(drop=True)
            cache.set('nst_est2015_alldata_df', df)
        except FileNotFoundError as e:
            raise ImproperlyConfigured(
                'Missing data file. Please run the "download_census_data" management command.') from e

    return df

The dataframe is then exposed through a DataFrameViewSet that illustrates how to make the changes stick by implementing the update_dataframe method. The index_row was overridden so we can reference the states based on their FIPS code instead of their position within the dataframe.

class TestDataFrameViewSet(DataFrameViewSet):
    serializer_class = DataFrameRecordsSerializer

    def index_row(self, dataframe):
        return dataframe[dataframe.STATE == self.kwargs[self.lookup_url_kwarg]]

    def get_dataframe(self):
        return get_nst_est2015_alldata_df()

    def update_dataframe(self, dataframe):
        cache.set('nst_est2015_alldata_df', dataframe)
        return dataframe

This set-up allows us to list rows:

$ curl --silent http://localhost:8000/api/test/ | python -mjson.tool
{
    "columns": [
        "index",
        "STATE",
        "NAME",
        "POPESTIMATE2015"
    ],
    "data": [
        [
            0,
            "01",
            "Alabama",
            4858979
        ],
        ...
        [
            51,
            "72",
            "Puerto Rico",
            3474182
        ]
    ]
}

...,get the details of a specific row...

$ curl --silent http://localhost:8000/api/test/72/ | python -mjson.tool
{
    "columns": [
        "index",
        "STATE",
        "NAME",
        "POPESTIMATE2015"
    ],
    "data": [
        [
            51,
            "72",
            "Puerto Rico",
            3474182
        ]
    ]
}

...add new rows...

$ curl --silent -X POST -H "Content-Type: application/json" --data '{"columns":["index","STATE","NAME","POPESTIMATE2015"],"data":[[52,"YY","Mars",1]]}' http://localhost:8000/api/test/
{"columns":["index","STATE","NAME","POPESTIMATE2015"],"data":[[52,"YY","Mars",1]]}

...update existing rows...

$ curl --silent -X PUT -H "Content-Type: application/json" --data '{"columns":["index","STATE","NAME","POPESTIMATE2015"],"data":[[52,"YY","Mars",0]]}' http://localhost:8000/api/test/YY/

...and delete rows...

$ curl --silent -X DELETE http://localhost:8000/api/test/YY/

It provides pretty much the same functionality as regular DRM ModelViewSets.

Conclusions

If you analyze the code that we’ve presented, you’ll notice that there’s not much to it. This is a testament of how well designed Pandas, Django REST Framework and Bokeh are. In the end it was just a matter of connecting the dots and a little bit of elbow grease.

Feedback

As always, comments, tickets and pull requests are welcomed. You can reach me at abarto (at) machinalis.com or @m4rgin4l on Twitter; and you can check our Services page to see other cool things we do here at Machinalis.


Previous / Next posts


Comments