Handy Functions

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Average per day when you have monthly sums

Just occasionally I end up dealing with data representing monthly sums (like sales in £ per month, or similar). To smooth out the fact that different months have different numbers of days, it can be handy to break this down into averages per day.

Assuming your dataframe has a DatetimeIndex, this little function, in combination with the pandas apply method, gets the job done:

In [2]:
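def avg_per_day_in_month(x):
    # x is one row of the dataframe; x.name is its index label (a Timestamp).
    # days_in_month knows the length of every month and handles leap years,
    # including century years such as 1900 and 2100, which a plain
    # `year % 4 == 0` check would wrongly treat as leap years.
    return x / x.name.days_in_month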

Usage would be something like this:

In [ ]:
grouped_by_month['average_per_day'] = grouped_by_month.apply(avg_per_day_in_month, axis = 1)
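
As a quick sanity check, here is a minimal sketch on made-up numbers (the `monthly` frame and its `sales` column are invented for illustration). It assumes the index is a DatetimeIndex and divides by the index's `days_in_month` attribute directly, which is a vectorised alternative to the row-by-row apply rather than the same approach:

```python
import pandas as pd

# Hypothetical monthly totals, indexed by the last day of each month.
monthly = pd.DataFrame(
    {"sales": [310.0, 280.0, 290.0, 280.0]},
    index=pd.to_datetime(["2023-01-31", "2023-02-28", "2024-02-29", "1900-02-28"]),
)

# days_in_month gives 31, 28, 29 (leap year) and 28 (1900 is not a leap year).
days = monthly.index.days_in_month.to_numpy()
monthly["average_per_day"] = monthly["sales"] / days

print(monthly)
```

Every row should come out at 10.0 per day, including February 1900, which is divisible by 4 but was not a leap year.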

How to feed variables into a function within agg/apply/transform etc

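Say you want to divide the minimum value in each group by a number, but you want flexibility about what that number should be... how do you do it?

The answer is to wrap one function inside another (a closure). The outer function takes the parameter and returns the inner function, which is the one pandas actually calls on each group: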

In [3]:
def divide(a):
    def divide_(x):
        return x.min()/a
    return divide_

You would then use it as follows:

In [ ]:
df.groupby('group_key').agg({'column_name': divide(3)})
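
To make that concrete, here is a small self-contained sketch. The dataframe and its values are made up for illustration; `group_key` and `column_name` are just the placeholder names from the usage line above:

```python
import pandas as pd

def divide(a):
    # The inner function remembers `a` from the enclosing call.
    def divide_(x):
        return x.min() / a
    return divide_

# Made-up data: the minimum is 6 in group 'a' and 12 in group 'b'.
df = pd.DataFrame({
    "group_key": ["a", "a", "b", "b"],
    "column_name": [6, 9, 12, 30],
})

# The same helper with two different divisors.
by_three = df.groupby("group_key").agg({"column_name": divide(3)})
by_six = df.groupby("group_key").agg({"column_name": divide(6)})

print(by_three)
print(by_six)
```

Each call to `divide` builds a fresh function with its own divisor, so the same helper can be reused with different numbers without touching its body.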

Turning categorical variables into one-hot columns

I feel that this should be easier than it is (unless there is some pandas wizardry that I'm not aware of). Either way, it's a common job to turn categorical columns into dummy columns. Most scikit-learn estimators expect numeric input and won't accept raw strings, so you really need this if you've got categorical variables and you're using Python for machine learning!

In [6]:
def create_dummy_columns(df, list_of_cols, drop_columns = True):
    """
    Function to create dummy columns from the categorical variables
    of an input dataframe.
    The columns to dummify are specified as a list.
    The original columns are dropped unless drop_columns = False.
    
    Inputs - df, the initial dataframe
           - list_of_cols, the columns to dummify as a list of strings
           - drop_columns, whether to drop the original categorical columns
           
    Returns - dummified dataframe
    """
    # Work on a copy so the caller's dataframe is left untouched.
    df_copy = df.copy()
    dummy_dfs = [df_copy]
    for col in list_of_cols:
        # Prefix with the column name so that two columns sharing a
        # category value don't produce duplicate dummy column names.
        dummy_df = pd.get_dummies(df_copy[col], prefix = col)
        dummy_dfs.append(dummy_df)
        if drop_columns:
            # df_copy is also dummy_dfs[0], so the drop shows up in the result.
            df_copy.drop(col, axis = 1, inplace = True)
    return pd.concat(dummy_dfs, axis = 1)

Check that the shapes of the input and output dataframes are as expected - I haven't yet tested this code!
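
A shape check along those lines might look like the sketch below. The data is made up, and rather than calling the function above it uses the `columns` argument of `pd.get_dummies`, which seems to cover much of the same ground out of the box:

```python
import pandas as pd

# Made-up data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "colour": ["red", "blue", "red"],
    "size": ["S", "M", "S"],
    "price": [1.0, 2.0, 3.0],
})

# columns= names the columns to encode; each is replaced by dummy
# columns prefixed with the original column name.
dummies = pd.get_dummies(df, columns=["colour", "size"])

print(df.shape, "->", dummies.shape)
print(list(dummies.columns))
```

With two categories in each encoded column, the three input columns should become five: `price` plus two dummies each for `colour` and `size`, with the row count unchanged.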
