dplyr for Python

dplyr enlightenment with pandas

I tried to reach dplyr enlightenment in the pandas ecosystem, but what I found was different from what I expected. Be warned: opinions ahead!

According to the dplyr homepage

"dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows."

Some people, myself included, think that you can write some concise and easy-to-read code with just these few verbs.
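As a rough orientation (my own mapping, not from the dplyr docs), the five verbs correspond to these pandas operations:

```python
import pandas as pd

df = pd.DataFrame({"carat": [0.23, 0.21, 0.23],
                   "cut": ["Ideal", "Premium", "Good"],
                   "price": [326, 326, 327]})

mutated  = df.assign(ppc=df.price / df.carat)       # mutate: new column from existing ones
selected = df[["carat", "price"]]                   # select: pick columns by name
filtered = df[df.price > 326]                       # filter: pick rows by value
summary  = df.price.mean()                          # summarise: reduce to a single value
ordered  = df.sort_values("price", ascending=False) # arrange: reorder the rows
```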

Here is an example of dplyr in action taken from the pandas-ply page:

In [ ]:
%%R
flights %>%
  group_by(year, month, day) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 & dep > 30)

Compare this to what they call "the most common way to express this in pandas":

In [ ]:
grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]

The dplyr version is more readable.
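For a fairer comparison, modern pandas can also express this as one method chain (my own sketch; the flights data below are made-up toy values standing in for the real table):

```python
import pandas as pd

# toy stand-in for the flights data (hypothetical values, for illustration only)
flights = pd.DataFrame({
    "year":      [2013, 2013, 2013, 2013],
    "month":     [1, 1, 2, 2],
    "day":       [1, 1, 1, 1],
    "arr_delay": [40.0, 50.0, 10.0, None],
    "dep_delay": [35.0, 45.0, 5.0, 20.0],
})

out = (
    flights
    .groupby(["year", "month", "day"])                  # group_by
    .agg({"arr_delay": "mean", "dep_delay": "mean"})    # summarise (mean skips NaN, like na.rm = TRUE)
    .rename(columns={"arr_delay": "arr", "dep_delay": "dep"})
    .query("arr > 30 and dep > 30")                     # filter
)
```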

Let's see whether there is a dplyr-like way to write the above with Python and pandas. Luckily, there are some libraries that implement some of dplyr's features in Python.

  • pandas-ply has 124 stars and 12 forks on GitHub. The last commit was in August 2015. pandas-ply monkey-patches pandas, adding .ply_select and .ply_where methods and a special X variable for lazy evaluation.
  • dplython has 576 stars and 38 forks on GitHub. The last commit was in December 2016. dplython uses a more verb-like approach with functions and also adds a special X for delayed execution.
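Monkey-patching here just means attaching new methods to pandas' classes at import time, roughly like this toy sketch (hypothetical method name, not pandas-ply's actual code):

```python
import pandas as pd

def my_select(self, *cols):
    """Hypothetical method attached to DataFrame from the outside."""
    return self[list(cols)]

# attach it to the class: every DataFrame instance now has .my_select
pd.DataFrame.my_select = my_select

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
picked = df.my_select("a")
```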

Case study with the diamonds dataset

I'm going to show some simple data munging with the diamonds dataset. dplython has some data munging examples that demonstrate how to use dplython with this dataset. I'm going to do the same with pandas-ply and vanilla pandas.

In [1]:
# The imports...
import pandas as pd

# pandas-ply
from pandas_ply import install_ply
# normally just X
from pandas_ply import X as X_ply
install_ply(pd)

# dplython
from dplython import (
    select,
    sift,
    sample_n,
    head,
    arrange,
    mutate,
    group_by,
    summarize,
)
# normally just X
from dplython import X as X_dpl

dplython already ships with the diamonds dataset. Note that dplython uses DplyFrame as an extension of pandas' DataFrame, and dyf is the diamonds data as a DplyFrame.

In [2]:
from dplython import diamonds

# Note that dplython needs the DplyFrame
# and dyf is the diamond data as DplyFrame
dyf = diamonds 
df = pd.DataFrame(diamonds)
print(type(df))
print(type(dyf))
<class 'pandas.core.frame.DataFrame'>
<class 'dplython.dplython.DplyFrame'>

Just to get an idea of what the data look like:

In [3]:
df.sample(3)
Out[3]:
Unnamed: 0 carat cut color clarity depth table price x y z
30681 30682 0.40 Ideal E SI1 61.2 57.0 737 4.70 4.74 2.89
52442 52443 0.72 Ideal D SI2 61.1 58.0 2514 5.76 5.79 3.53
52905 52906 0.70 Premium G VS2 58.7 53.0 2586 5.86 5.83 3.43
In [4]:
n = 3  # number of samples we'll draw

Simple select

In [5]:
# dplython
(
    dyf
    >> select(X_dpl.carat, X_dpl.cut, X_dpl.price)
    >> head(n)
)
Out[5]:
carat cut price
0 0.23 Ideal 326
1 0.21 Premium 326
2 0.23 Good 327

I like dplython's >> operator for piping the data (though I'm spoiled by Elixir, and I'm aware that operator overloading can be used for evil; hello, C++). The special X gives us access to the columns of the data. It's nice, but again not a huge difference.
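For the curious, both tricks rest on plain Python: >> piping can work via the reflected-operator hook __rrshift__, and X can build a deferred expression via __getattr__. A toy sketch of both ideas (hypothetical classes, not dplython's actual implementation):

```python
class Verb:
    """Hypothetical mini pipe verb: `data >> verb` applies a function."""
    def __init__(self, func):
        self.func = func
    def __rrshift__(self, data):
        # Python falls back to this when `data` has no matching __rshift__
        return self.func(data)

take_two = Verb(lambda xs: xs[:2])
result = [3, 1, 2] >> take_two  # → [3, 1]

class Later:
    """Hypothetical deferred accessor, loosely like dplython's X."""
    def __init__(self, path=()):
        self.path = path
    def __getattr__(self, name):
        # record the attribute access instead of performing it
        return Later(self.path + (name,))
    def evaluate(self, data):
        out = data
        for name in self.path:
            out = out[name]  # dict-style lookup keeps the sketch simple
        return out

X = Later()
expr = X.price                          # builds a description, touches no data yet
value = expr.evaluate({"price": 326})   # → 326
```

dplython itself defines __rshift__ on its DplyFrame subclass instead, which is presumably why the data has to be a DplyFrame rather than a plain DataFrame.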

In [6]:
# pandas-ply
(
    df
    .ply_select('carat', 'cut', 'price')
    .head(n)
);

The ply_select method is the equivalent of dplython's select function.

In [7]:
# pandas
(
    df
    [['carat', 'cut', 'price']]
    .head(n)
);

With pandas we use the good old bracket notation. Also quite readable.

Filter and select

In [8]:
# dplython
(
    dyf
    >> sift(X_dpl.carat > 4)
    >> select(X_dpl.carat, X_dpl.cut, X_dpl.depth, X_dpl.price)
)
Out[8]:
carat cut depth price
25998 4.01 Premium 61.0 15223
25999 4.01 Premium 62.5 15223
27130 4.13 Fair 64.8 17329
27415 5.01 Fair 65.5 18018
27630 4.50 Fair 65.8 18531

dplython's sift is like a filter and quite readable.

In [9]:
# pandas-ply
(
    df
    .ply_where(X_ply.carat > 4)
    .ply_select('carat', 'cut', 'depth', 'price')
);

.ply_where in combination with X is just as nice as dplython.

In [10]:
# pandas
(
    df
    [df.carat > 4]
    [['carat', 'cut', 'depth', 'price']]
);

I guess this is the most common way to write the same in vanilla pandas.

In [11]:
# pandas
(
    df
    .query('carat > 4')
    [['carat', 'cut', 'depth', 'price']]
);

But it's also possible to use the query method of pandas.
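query has a small extra trick: the @ prefix pulls in local variables, so thresholds don't have to be hard-coded into the string. A quick sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"carat": [0.23, 4.13, 5.01],
                   "price": [326, 17329, 18018]})

threshold = 4
big = df.query("carat > @threshold")  # @threshold refers to the local variable
```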

Sample and sort

In [12]:
# dplython
(
    dyf
    >> sample_n(n)
    >> arrange(X_dpl.carat)
    >> select(X_dpl.carat, X_dpl.cut, X_dpl.depth, X_dpl.price)
)
Out[12]:
carat cut depth price
4718 0.24 Ideal 62.7 572
31325 0.32 Ideal 61.8 758
26905 2.14 Very Good 59.7 16914

arrange might seem strange in the beginning. Just think of it as sort and you're fine.

In [13]:
# pandas-ply
(
    df
    .sample(n)
    .sort_values('carat')
    .ply_select('carat', 'cut', 'depth', 'price')
);

No surprises with pandas-ply.

In [14]:
# pandas
(
    df
    .sample(n)
    .sort_values('carat')
    [['carat', 'cut', 'depth', 'price']]
);

And no surprises with vanilla pandas, but I still think it's quite readable.

Assign, group by and summarize

In [15]:
# dplython
(
    dyf
    >> mutate(carat_bin=X_dpl.carat.round())
    >> group_by(X_dpl.cut, X_dpl.carat_bin)
    >> summarize(avg_price=X_dpl.price.mean())
    >> head(10)
)
Out[15]:
cut carat_bin avg_price
0 Fair 0.0 1027.979275
1 Fair 1.0 3305.754579
2 Fair 2.0 9588.269737
3 Fair 3.0 13466.823529
4 Fair 4.0 15842.666667
5 Fair 5.0 18018.000000
6 Good 0.0 786.054191
7 Good 1.0 3815.307879
8 Good 2.0 11096.950321
9 Good 3.0 15924.176471

This is an example where dplython shines. mutate makes it clear that something is changed (though technically it returns a copy, so nothing is changed in place), and summarize is similar to pandas' agg but with a better interface, more functionality, and a better name.

In [16]:
# pandas-ply and pandas
(
    df
    .assign(carat_bin=df.carat.round())
    .groupby(['cut', 'carat_bin'])
    .agg({'price': 'mean'})
    .rename(columns={"price": "avg_price"})
    .reset_index()
    .head(10)
)
Out[16]:
cut carat_bin avg_price
0 Fair 0.0 1027.979275
1 Fair 1.0 3305.754579
2 Fair 2.0 9588.269737
3 Fair 3.0 13466.823529
4 Fair 4.0 15842.666667
5 Fair 5.0 18018.000000
6 Good 0.0 786.054191
7 Good 1.0 3815.307879
8 Good 2.0 11096.950321
9 Good 3.0 15924.176471

We were not really able to use any pandas-ply features here. ply_select(carat=X_ply.carat.round()) only leaves the selected 'carat' column.

pandas' assign method is quite clear. agg does not feel as clean as dplython's summarize. With summarize we can directly assign the name for the new column, which we can't do with agg (you could use .agg({'price': {'avg_price': 'mean'}}), but that nested-dict form is deprecated). Therefore we rename the column manually.
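The renaming pain point is worth isolating: the nested-dict spelling was indeed deprecated (in pandas 0.20) and later removed, and newer pandas releases (0.25+, i.e. after these libraries were written) added named aggregation. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"cut": ["Fair", "Fair", "Good"],
                   "price": [1000.0, 2000.0, 600.0]})

# the workaround used above: aggregate, then rename the column manually
a = (df.groupby("cut")
       .agg({"price": "mean"})
       .rename(columns={"price": "avg_price"})
       .reset_index())

# pandas >= 0.25 "named aggregation": name the output column directly
b = (df.groupby("cut")
       .agg(avg_price=("price", "mean"))
       .reset_index())
```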

Also note that agg() returns a DataFrame with a MultiIndex, whereas dplython's summarize returns a flat DataFrame. We flatten the MultiIndex by resetting the index.

Conclusion

So which one is better: dplython, pandas-ply, or even vanilla pandas?

I think the method-chaining version of pandas is quite readable, but the pandas API is still... the pandas API: powerful but not necessarily intuitive. Even though I know pandas quite well, I found myself looking up the docs more often than for pandas-ply and dplython (which I hadn't used before):

  • does the rename method require a keyword argument?
  • wasn't there a select method in pandas that I can use?
  • can agg rename at the same time?

pandas-ply does not really offer a big advantage over good old pandas and is also not actively maintained.

dplython feels good! Piping, concise verbs, more popular. Will I use it in the future? I think I'll give it a try in a bigger project and see how it goes.

Nevertheless, check out the modern pandas post on method chaining if you don't use this feature yet!

PS

Just as a side note, I have a little Python helper library pelper which contains a pipe function. pipe passes the output of one function as the first argument to the next function. This allows you to think of programming as a series of transformations. So you can rewrite this:

In [17]:
sorted(list(range(10)), reverse=True)
Out[17]:
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

as something more functional like this:

In [18]:
from pelper import pipe
pipe(10, range, list, (sorted, {"reverse": True}))
Out[18]:
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
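A minimal version of such a pipe function (my own sketch, not pelper's actual implementation) could look like this:

```python
def pipe(data, *steps):
    """Hypothetical mini-pipe: each step is either a callable or a tuple of
    (callable, extra positional args..., optional trailing kwargs dict)."""
    for step in steps:
        if callable(step):
            data = step(data)
        else:
            func, *rest = step
            # a trailing dict is treated as keyword arguments
            kwargs = rest.pop() if rest and isinstance(rest[-1], dict) else {}
            data = func(data, *rest, **kwargs)
    return data

result = pipe(10, range, list, (sorted, {"reverse": True}))
```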

Unsurprisingly, one can use the pipe function with dplython (not using the delayed execution functionality of dplython):

In [19]:
from pelper import pipe
pipe(
    diamonds,
    (mutate, {"carat_bin": X_dpl.carat.round()}),
    (group_by, X_dpl.cut, X_dpl.carat_bin),
    (summarize, {"avg_price": X_dpl.price.mean()}),
);

Or using the delayed execution functionality:

In [20]:
pipe(
    diamonds,
    mutate(carat_bin=X_dpl.carat.round()),
    group_by(X_dpl.cut, X_dpl.carat_bin),
    summarize(avg_price=X_dpl.price.mean()),
);