Pandas for GrimoireLab indexes

Pandas is one of the most common libraries used in data analytics with Python. It can be very useful when dealing with GrimoireLab indexes. In this section, we will explore how to create dataframes (one of the most interesting data structures provided by Pandas) from GrimoireLab indexes, and how to work with them.

Building a dataframe from an index

To start exploring how to use Pandas with GrimoireLab, we will write a simple program that creates a couple of CSV files from the information in an index (pandas_1.py).

First things first: let’s import the modules we’re going to need. In addition to datetime, we will use elasticsearch and elasticsearch_dsl for accessing the ElasticSearch instance where our index lives, and Pandas.

from datetime import datetime

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import pandas as pd

Then we create an object for accessing the ElasticSearch instance. In this case, we’ll assume that it is running on our computer, with its REST interface available on port 9200 (the default port used by ElasticSearch). The verify_certs parameter is not strictly necessary, but you may need it if you’re connecting to an ElasticSearch instance over TLS (https) with a bad certificate. In any case, if you don’t need it, it is better not to use it (just remove it from the call to the Elasticsearch constructor).

es = Elasticsearch('http://localhost:9200', verify_certs=False)

Now we can build the query, using the facilities provided by elasticsearch_dsl. The query will be on the index named git (which should be a GrimoireLab enriched git index). These indexes store one document per commit, with some fields such as author_name (name of the author of the commit) and author_date (date of authorship of the commit).
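Before building the aggregation, it can help to peek at one raw document in the index to confirm that those fields are present. This is just an optional exploratory sketch (the name probe is only illustrative, and field names may vary slightly between GrimoireLab versions):

# Retrieve a single document from the git index and print its fields
probe = Search(using=es, index='git')[0:1]
for hit in probe.execute():
    print(hit.to_dict())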

The query builds buckets of commits, grouped by author name (field author_name), and aggregates them to find the first commit for each of these authors (the minimum of the author_date field over all the documents in each bucket).

s = Search(using=es, index='git')
s.aggs.bucket('by_authors', 'terms', field='author_name', size=10000) \
    .metric('first_commit', 'min', field='author_date')
s = s.sort("author_date")

Now, we can execute the query:

result = s.execute()

Note that we have specified a size of 10,000 for the buckets, which should be enough for most situations. But feel free to make it larger if you’re working with a really large index.
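If you want to be sure that no author was left out, the terms aggregation also reports how many documents fell outside the returned buckets (in its sum_other_doc_count field). A small check like the following can warn about that (just a sketch; the name left_out is illustrative):

# Number of commits whose author did not fit in the returned buckets
left_out = result['aggregations']['by_authors']['sum_other_doc_count']
if left_out > 0:
    print('Warning: %d commits belong to authors beyond the bucket size' % left_out)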

And the moment for creating a Pandas dataframe has arrived! Dataframes are somewhat like tables. In this case, we’re going to have one row in that table per bucket (that is, per author), with author and first_commit as columns.

We will first create a list with all the buckets received (buckets). It is just a matter of extracting the relevant parts from the results of executing the query:

buckets_result = result['aggregations']['by_authors']['buckets']
buckets = []
for bucket in buckets_result:
    # author_date comes as epoch milliseconds; convert to seconds
    first_commit = bucket['first_commit']['value'] / 1000
    buckets.append(
        {'first_commit': datetime.utcfromtimestamp(first_commit),
         'author': bucket['key']}
        )

If this code is not clear, you can insert the following two lines right after the execution of the query. They will print the JSON document received from ElasticSearch, in all its glory.

from pprint import pprint
pprint(result.to_dict())

It will print something akin to:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'aggregations': {'by_authors': {'buckets': [{'doc_count': 1345,
                                              'first_commit': {'value': 1443636916000.0,
                                                               'value_as_string': '2015-09-30T18:15:16.000Z'},
                                              'key': 'Alvaro del Castillo'},
                                             {'doc_count': 557,
                                              'first_commit': {'value': 1439921307000.0,
                                                               'value_as_string': '2015-08-18T18:08:27.000Z'},
                                              'key': 'Santiago Dueñas'},

Once we have the list ready, we can create a Pandas dataframe out of it:

authors = pd.DataFrame.from_records(buckets)

There are better and more efficient ways of creating a dataframe out of the results of a query, but this one is probably the clearest. So, let’s stick to it for now. We got our first dataframe: authors.
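Just as an illustration of a more compact alternative (not used in the rest of this section), the same dataframe could be built with a list comprehension, letting Pandas convert the epoch milliseconds directly (authors_alt is only an illustrative name):

# Equivalent dataframe, converting timestamps with pd.to_datetime
authors_alt = pd.DataFrame.from_records(
    [{'first_commit': pd.to_datetime(b['first_commit']['value'], unit='ms'),
      'author': b['key']}
     for b in result['aggregations']['by_authors']['buckets']])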

Now, we can start using the magic of Pandas. For example, we can sort the dataframe (that is, the rows in the dataframe, each corresponding to one author) as follows:

authors.sort_values(by='first_commit', ascending=False, inplace=True)

If you want to see the nice dataframe you have, you can use the pprint trick again:

pprint(authors)

This will produce something like:

                            author        first_commit
...
7       Jesus M. Gonzalez-Barahona 2015-12-31 19:16:25
0              Alvaro del Castillo 2015-09-30 18:15:16
1                  Santiago Dueñas 2015-08-18 18:08:27

Each line in this output corresponds to a row in the dataframe. The first column is the index (which is not in ascending order because we reordered the dataframe by first commit).
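If you prefer a clean, ascending index after sorting, reset_index can produce one (a small optional step, not needed for the rest of this section; authors_reset is just an illustrative name):

# Produce a copy of the dataframe with a fresh 0..N index
authors_reset = authors.reset_index(drop=True)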

And some more Pandas magic: let’s produce a new dataframe with the number of new authors per month. In this case, each row in the dataframe will correspond to a month.

by_month = authors['first_commit'] \
    .groupby([authors.first_commit.dt.year,
            authors.first_commit.dt.month]) \
    .agg('count')

We first select the first_commit column. The groupby method will produce groups by year / month, and the agg method will then aggregate them, counting the rows in each group. We can use the good old pprint trick once again (pprint(by_month)) to see the result, which, strictly speaking, is a Pandas Series rather than a dataframe:

first_commit  first_commit
2015          8               1
              9               1
              12              1
2016          2               1
              3               4
              4               1
              7               3
              11              2
2017          1               1
Name: first_commit, dtype: int64
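If you would rather have an explicit dataframe with separate columns for year, month, and count, the series can be converted by renaming its index levels and resetting the index (a hedged sketch; the names year, month, new_authors, and by_month_df are just illustrative choices):

# Turn the grouped series into a dataframe with named columns
by_month_df = by_month.rename_axis(['year', 'month']) \
    .reset_index(name='new_authors')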

And we’re ready for the final fireworks: producing CSV files for both dataframes:

by_month.to_csv('authors_per_month.csv')
authors.to_csv('authors_first.csv',
                columns=['first_commit', 'author'],
                index=False)

The to_csv method of dataframes just dumps them into a file, using the CSV conventions. We can check the files created (authors_per_month.csv and authors_first.csv):

2015,8,1
2015,9,1
2015,12,1
2016,2,1
2016,3,4
2016,4,1
2016,7,3
2016,11,2
2017,1,1
...
2015-12-31 19:16:25,Jesus M. Gonzalez-Barahona
2015-09-30 18:15:16,Alvaro del Castillo
2015-08-18 18:08:27,Santiago Dueñas
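As a quick sanity check, the files can also be read back with Pandas. This is only a sketch: depending on your Pandas version, the files may or may not include a header row, so adjust the header handling accordingly (the names check, year, month, and new_authors are illustrative):

# Read the per-month file back, assuming it has no header row
check = pd.read_csv('authors_per_month.csv', header=None,
                    names=['year', 'month', 'new_authors'])
print(check.head())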