pd_utils package

High-level tools for common Pandas workflows


Submodules

pd_utils.corr module

pd_utils.corr.formatted_corr_df(df, cols=None)[source]

Calculates correlations on a DataFrame and displays only the lower triangle of the resulting correlation DataFrame.

Parameters
  • df (DataFrame) –

  • cols (Optional[Sequence[str]]) – subset of column names on which to calculate correlations

Return type

DataFrame

Returns
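
Example

A minimal usage sketch (column names are hypothetical):

>>> import pd_utils
>>> corr = pd_utils.formatted_corr_df(df, cols=['a', 'b', 'c'])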

pd_utils.cum module

pd_utils.cum.create_windows(periods, time, method='between')[source]
pd_utils.cum.cumulate(df, cumvars, method, periodvar='Date', byvars=None, time=None, grossify=False, multiprocess=True, replace=False)[source]

Cumulates a variable over time. Typically used to get cumulative returns.

Parameters
  • df (DataFrame) –

  • cumvars (Union[str, List[str]]) – column names to cumulate

  • method (str) – ‘between’, ‘zero’, or ‘first’. If ‘zero’, will give returns since the original date. Note: for periods before the original date, this will turn positive returns negative as we are going backwards in time. If ‘between’, will give returns since the prior requested time period. Note that the first period is period 0. If ‘first’, will give returns since the first requested time period.

  • periodvar

  • byvars (Union[str, List[str], None]) – column names to use to separate by groups

  • time (Optional[Sequence[int]]) – for use with method=’between’. Defines which periods to calculate between.

  • grossify (bool) – set to True to add one to all variables then subtract one at the end

  • multiprocess (Union[bool, int]) – set to True to use all available processors, set to False to use only one, or pass an int less than or equal to the number of processors to use that many processors

  • replace (bool) – True to return df with passed columns replaced with cumulated columns. False to return df with both passed columns and cumulated columns

Returns

Examples

For example, if our input data was for date 1/5/2006, but we had shifted dates:
     permno  date      RET  shift_date
     10516   1/5/2006  110%  1/5/2006
     10516   1/5/2006  120%  1/6/2006
     10516   1/5/2006  105%  1/7/2006
     10516   1/5/2006  130%  1/8/2006
 Then cumulate(df, 'RET', method='between', time=[1, 3], periodvar='shift_date') would return:
     permno  date      RET  shift_date  cumret
     10516   1/5/2006  110%  1/5/2006    110%
     10516   1/5/2006  120%  1/6/2006    120%
     10516   1/5/2006  105%  1/7/2006    126%
     10516   1/5/2006  130%  1/8/2006    130%
 Then cumulate(df, 'RET', method='first', periodvar='shift_date') would return:
     permno  date      RET  shift_date  cumret
     10516   1/5/2006  110%  1/5/2006    110%
     10516   1/5/2006  120%  1/6/2006    120%
     10516   1/5/2006  105%  1/7/2006    126%
     10516   1/5/2006  130%  1/8/2006    163.8%
pd_utils.cum.window_mapping(time, col, method='between')[source]

Takes a pandas series of dates as input, calculates windows, and returns a series indicating which window each observation is in. To be used with groupby.transform().
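
Example

A sketch of use with groupby.transform(), per the note above (column names and window boundaries are hypothetical; window_mapping takes time as its first argument, so a lambda is used):

>>> df['window'] = df.groupby('PERMNO')['period'].transform(
...     lambda s: pd_utils.cum.window_mapping([0, 2, 4], s)
... )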

pd_utils.datetime_utils module

class pd_utils.datetime_utils.USTradingCalendar(*args, **kwargs)[source]

Bases: pandas.tseries.holiday.AbstractHolidayCalendar

The US trading day calendar behind the function tradedays().

rules: list[Holiday] = [Holiday, USMartinLutherKingJr, USPresidentsDay, GoodFriday, USMemorialDay, Holiday, USLaborDay, USThanksgivingDay, Holiday] (all from pandas.tseries.holiday; the three unnamed Holiday entries are custom rules, presumably covering the remaining US market holidays: New Year's Day, Independence Day, and Christmas)
pd_utils.datetime_utils.convert_sas_date_to_pandas_date(sasdates)[source]

Converts a date or Series of dates loaded from a SAS7BDAT file to a pandas date type.

Parameters

sasdates (Union[Series, int]) – SAS7BDAT-loaded date(s) to convert

Return type

Union[Series, Timestamp]

Returns
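
Example

A minimal sketch, assuming the standard SAS epoch of January 1, 1960 (a SAS date of 0):

>>> import pd_utils
>>> pd_utils.convert_sas_date_to_pandas_date(0)
Timestamp('1960-01-01 00:00:00')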

pd_utils.datetime_utils.expand_months(df, datevar='Date', newdatevar='Daily Date', trade_days=True)[source]

Takes a monthly dataframe and returns a daily (trade day or calendar day) dataframe. For each row in the input data, duplicates that row over each trading/calendar day in the month of the date in that row. Creates a new date column containing the daily date.

Notes

If the input dataset has multiple observations per month, all of these will be expanded. Therefore you will have one row for each trade day for each original observation.

Parameters
  • df (DataFrame) – DataFrame containing a date variable

  • datevar (str) – name of column containing dates in the input df

  • newdatevar (str) – name of new column to be created containing daily dates

  • trade_days (bool) – True to use trading days and False to use calendar days

Returns
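
Example

A minimal usage sketch (the monthly_df name is hypothetical):

>>> daily = pd_utils.expand_months(monthly_df, datevar='Date', newdatevar='Daily Date')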

pd_utils.datetime_utils.expand_time(df, intermediate_periods=False, datevar='Date', freq='m', time=[12, 24, 36, 48, 60], newdate='Shift Date', shiftvar='Shift', custom_business_day=None)[source]

Creates new observations in the dataset advancing the time by the int or list given. Creates a new date variable.

Parameters
  • df (DataFrame) –

  • intermediate_periods (bool) – Specify intermediate_periods=True to get periods in between given time periods, e.g. passing time=[12,24,36] will get periods 12, 13, 14, …, 35, 36.

  • datevar (str) – column name of date variable

  • freq (str) – ‘d’ for daily, ‘m’ for monthly, ‘a’ for annual

  • time (List[int]) – number of periods to advance by

  • newdate (str) – name of new date in the output data

  • shiftvar (str) – name of variable which specifies how much the time has been shifted

  • custom_business_day (Optional[CustomBusinessDay]) – Only used for daily frequency. Defaults to using trading days based on US market holiday calendar. Can pass custom business days for other calendars

Returns
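
Example

A minimal usage sketch, advancing each observation by 12 and 24 months:

>>> shifted = pd_utils.expand_time(df, datevar='Date', freq='m', time=[12, 24])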

pd_utils.datetime_utils.tradedays()[source]

Used for constructing a range of dates with the pandas date_range function.

Example
>>> import pandas as pd
>>> import pd_utils
>>> pd.date_range(
...     start='1/1/2000',
...     end='1/31/2000',
...     freq=pd_utils.tradedays()
... )
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
               '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14', '2000-01-18', '2000-01-19',
               '2000-01-20', '2000-01-21', '2000-01-24', '2000-01-25',
               '2000-01-26', '2000-01-27', '2000-01-28', '2000-01-31'],
              dtype='datetime64[ns]', freq='C')
pd_utils.datetime_utils.year_month_from_date(df, date='Date', yearname='Year', monthname='Month')[source]

Takes a dataframe with a datetime column and creates year and month variables

Parameters
  • df (DataFrame) –

  • date (str) – name of date column

  • yearname (str) – name of year column to be created

  • monthname (str) – name of month column to be created

Returns
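
Example

A minimal usage sketch:

>>> df = pd_utils.year_month_from_date(df, date='Date', yearname='Year', monthname='Month')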

pd_utils.datetime_utils.year_month_from_single_date(date)[source]

pd_utils.filldata module

pd_utils.filldata.add_missing_group_rows(df, group_id_cols, non_group_id_cols, fill_method='ffill', fill_limit=None)[source]

Adds rows so that each group has all non-group IDs, optionally filling values by a pandas fill method

Parameters
  • df

  • group_id_cols (List[str]) – typically entity IDs. These IDs represent groups in the data; data will not be forward/back filled across differences in these IDs.

  • non_group_id_cols (List[str]) – typically date or time IDs. Data will be forward/back filled across differences in these IDs.

  • fill_method (Optional[str]) – pandas fill methods, None to not fill

  • fill_limit (Optional[int]) – pandas fill limit

Returns
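
Example

A minimal usage sketch (column names are hypothetical):

>>> filled = pd_utils.add_missing_group_rows(
...     df, group_id_cols=['PERMNO'], non_group_id_cols=['Date'], fill_method='ffill'
... )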

pd_utils.filldata.drop_missing_group_rows(df, fill_id_cols)[source]
pd_utils.filldata.fill_excluded_rows(df, byvars, fillvars=None, **fillna_kwargs)[source]

Takes a dataframe which does not contain all possible combinations of byvars as rows and creates those rows. If fillna_kwargs are passed, calls fillna using fillna_kwargs for fillvars.

Parameters
  • df

  • byvars – variables on which dataset should be expanded to product. Can pass a str, list of strs, or a list of pd.Series.

  • fillvars – optional variables to apply fillna to

  • fillna_kwargs – See pandas.DataFrame.fillna for kwargs, value=0 is common

Returns

Example

df:
             date     id  var
    0  2003-06-09 42223C    1
    1  2003-06-10 09255G    2

with fill_excluded_rows(df, byvars=['date','id'], fillvars='var', value=0) becomes:

              date     id  var
    0  2003-06-09 42223C    1
    1  2003-06-10 42223C    0
    2  2003-06-09 09255G    0
    3  2003-06-10 09255G    2
pd_utils.filldata.fillna_by_groups(df, byvars, exclude_cols=None, str_vars='first', num_vars='mean')[source]

Fills missing values by group, with different handling for string variables versus numeric

WARNING: do not use if index is important, it will be dropped
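
Example

A minimal usage sketch (column name is hypothetical):

>>> filled = pd_utils.fillna_by_groups(df, byvars='PERMNO')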

pd_utils.filldata.fillna_by_groups_and_keep_one_per_group(df, byvars, exclude_cols=None, str_vars='first', num_vars='mean')[source]

Fills missing values by group, with different handling for string variables versus numeric, then keeps one observation per group.

WARNING: do not use if index is important, it will be dropped

pd_utils.load module

pd_utils.load.load_sas(filepath, csv=True, **read_csv_kwargs)[source]

Loads a SAS7BDAT file into a pandas DataFrame.

Parameters
  • filepath (str) – str of location of sas7bdat file

  • csv (bool) – when set to True, saves a csv version of the data in the same directory as the sas7bdat file. The next time load_sas is called, it will load from the csv version rather than the sas7bdat, which speeds up load times about 3x. If the sas7bdat file has been modified more recently than the csv, the sas7bdat will automatically be loaded and the csv re-saved.

  • read_csv_kwargs – kwargs to pass to pd.read_csv if csv option is True

Returns
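
Example

A minimal usage sketch (file path is hypothetical):

>>> df = pd_utils.load_sas('data/crsp.sas7bdat')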

pd_utils.merge module

pd_utils.merge.apply_func_to_unique_and_merge(series, func)[source]

This function reduces the given series down to unique values, applies the function, then expands back up to the original shape of the data.

Many Pandas functions can be slow because they’re doing repeated work. This can help optimize some operations.

Parameters
  • series (Series) –

  • func (Callable) – function to be applied to the series

Return type

Series

Returns

Usage

>>> import functools
>>> import pandas as pd
>>> to_datetime = functools.partial(pd.to_datetime, format='%Y%m')
>>> apply_func_to_unique_and_merge(df['MONTH'], to_datetime)

pd_utils.merge.groupby_index(df, byvars, sortvars=None, ascending=True)[source]

Returns a dataframe which is a copy of the old one with an additional column containing an index by groups. Each time the bygroup changes, the index restarts at 0.

Parameters
Returns
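
Example

A minimal usage sketch (column names are hypothetical):

>>> df = pd_utils.groupby_index(df, byvars='PERMNO', sortvars='Date')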

pd_utils.merge.groupby_merge(df, byvars, func_str, *func_args, subset='all', replace=False)[source]

Creates a pandas groupby object, applies the aggregation function in func_str, and merges back the aggregated data to the original dataframe.

Parameters
  • df

  • byvars (Union[str, List[str]]) – column names which uniquely identify groups

  • func_str (str) – name of groupby aggregation function such as ‘min’, ‘max’, ‘sum’, ‘count’, etc.

  • func_args – arguments to pass to func

  • subset (Union[str, List[str]]) – column names for which to apply aggregation functions or ‘all’ for all columns

  • replace (bool) – True to replace original columns in the data with aggregated/transformed columns

Returns

Example

>>> import pd_utils
>>> df = pd_utils.groupby_merge(df, ['PERMNO','byvar'], 'max', subset='RET')
pd_utils.merge.left_merge_latest(df, df2, on, left_datevar='Date', right_datevar='Date', max_offset=None, backend='pandas', low_memory=False)[source]

Left merges df2 to df using on, grabbing the most recent observation of df2 (the closest right_datevar earlier than left_datevar). Useful for situations where data needs to be merged with mismatched dates and just the most recent data available is needed.

Parameters
  • df (DataFrame) – Pandas dataframe containing source data (all rows will be kept), must have on variables and left_datevar

  • df2 (DataFrame) – Pandas dataframe containing data to be merged (only the most recent rows before source data will be kept)

  • on (Union[str, List[str]]) – names of columns on which to match, excluding date

  • left_datevar (str) – name of date variable on which to merge in df

  • right_datevar (str) – name of date variable on which to merge in df2

  • max_offset (Union[int, timedelta, None]) – maximum amount of time to go back to look for a match. When datevar is a datetime column, pass datetime.timedelta. When datevar is an int column (e.g. year), pass an int. Currently only applicable for backend ‘pandas’

  • backend (str) – ‘pandas’ or ‘sql’. Specify the underlying machinery used to perform the merge. ‘pandas’ means native pandas, while ‘sql’ uses pandasql. Try ‘sql’ if you run out of memory.

  • low_memory (bool) – True to reduce memory usage but decrease calculation speed

Returns
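
Example

A minimal usage sketch, attaching the most recent earlier observation of df2 to each row of df (column names are hypothetical):

>>> merged = pd_utils.left_merge_latest(
...     df, df2, on='PERMNO', left_datevar='Date', right_datevar='Report Date'
... )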

pd_utils.plot module

pd_utils.plot.ordinal(n)
pd_utils.plot.plot_multi_axis(df, cols=None, spacing=0.1, colored_axes=True, axis_locations_in_legend=True, legend_kwargs=None, **kwargs)[source]

Plot multiple series with different y-axes

Adapted from https://stackoverflow.com/a/50655786

Parameters
  • df (DataFrame) – Data to be plotted

  • cols (Optional[List[str]]) – subset of columns to plot

  • spacing (float) – Amount of space between y-axes beyond the two which are on the sides of the box

  • colored_axes (bool) – Whether to make axis labels and ticks colored the same as the line on the graph

  • axis_locations_in_legend (bool) – Whether to add to the legend which axis corresponds to which plot

  • legend_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to ax.legend

  • kwargs – df.plot kwargs

Return type

Axes

Returns
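
Example

A minimal usage sketch (column names are hypothetical):

>>> ax = pd_utils.plot_multi_axis(df, cols=['price', 'volume', 'turnover'])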

pd_utils.port module

pd_utils.port.long_short_portfolio(df, portvar, byvars=None, retvars=None, top_minus_bot=True)[source]

Takes a df with a column of numbered portfolios and creates a new portfolio which is long the top portfolio and short the bottom portfolio.

Parameters
  • df (DataFrame) – dataframe containing a column with portfolio numbers

  • portvar (str) – name of column containing portfolios

  • byvars (Union[str, List[str], None]) – column names containing groups for portfolios. Calculates long-short within these groups. These should be the same groups in which portfolios were formed.

  • retvars (Union[str, List[str], None]) – variables to return in the long-short dataset. By default, will use all numeric variables in the df

  • top_minus_bot (bool) – True to be long the top portfolio, short the bottom portfolio. False to be long the bottom portfolio, short the top portfolio.

Returns

a df of the long-short portfolio
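
Example

A minimal usage sketch, starting from data that already has a numbered portfolio column such as the output of portfolio() below (column names are hypothetical):

>>> ls = pd_utils.long_short_portfolio(port_df, portvar='portfolio', byvars='Date')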

pd_utils.port.portfolio(df, groupvar, ngroups=10, cutoffs=None, quant_cutoffs=None, byvars=None, cutdf=None, portvar='portfolio', multiprocess=False)[source]

Constructs portfolios based on percentile values of groupvar.

If ngroups=10, will form 10 portfolios, with portfolio 1 containing the bottom 10% of groupvar values and portfolio 10 containing the top 10%.

Notes

  • Resets the index (dropping it) in the output data, so don’t use if the index is important (input data not affected)

  • If using a cutdf, MUST have the same bygroups as df. The number of observations within each bygroup can be different, but there MUST be a one-to-one match of bygroups, or this will NOT work correctly. This may require some cleaning of the cutdf first.

  • For some reason, multiprocessing seems to be slower in testing, so it is disabled by default

Parameters
  • df (DataFrame) – input data

  • groupvar (str) – name of variable in df to form portfolios on

  • ngroups (int) – number of portfolios to form. will be ignored if option cutoffs or quant_cutoffs is passed

  • cutoffs (Optional[List[Union[float, int]]]) – e.g. [100, 10000] to form three portfolios: 1 would be < 100, 2 would be between 100 and 10000, 3 would be > 10000. Cannot be used with option ngroups

  • quant_cutoffs (Optional[List[float]]) – e.g. [0.1, 0.9] to form three portfolios: 1 would be the lowest 10% of the data, 2 would be between the 10th and 90th percentiles, 3 would be the highest 10%. All will be within byvars if byvars are passed

  • byvars (Union[str, List[str], None]) – name of variable(s) in df, finds portfolios within byvars. For example if byvars=’Month’, would take each month and form portfolios based on the percentiles of the groupvar during only that month

  • cutdf (Optional[DataFrame]) – optionally determine percentiles using another dataset. See second note.

  • portvar (str) – name of portfolio variable in the output dataset

  • multiprocess (bool) – set to True to use all available processors, set to False to use only one, or pass an int less than or equal to the number of processors to use that many processors

Returns
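
Example

A minimal usage sketch, forming quintiles within each month (column names are hypothetical):

>>> port_df = pd_utils.portfolio(df, 'MktCap', ngroups=5, byvars='Month')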

pd_utils.port.portfolio_averages(df, groupvar, avgvars, ngroups=10, byvars=None, cutdf=None, wtvar=None, count=False, portvar='portfolio', avgonly=False)[source]

Creates portfolios and calculates equal- and value-weighted averages of variables within portfolios.

If ngroups=10, will form 10 portfolios, with portfolio 1 containing the bottom 10% of groupvar values and portfolio 10 containing the top 10%.

Notes

Resets the index (dropping it) in the output data, so don’t use if the index is important (input data not affected)

Parameters
  • df (DataFrame) – input data

  • groupvar (str) – name of variable in df to form portfolios on

  • avgvars (Union[str, List[str]]) – variables to be averaged

  • ngroups (int) – number of portfolios to form

  • byvars (Union[str, List[str], None]) – name of variable(s) in df, finds portfolios within byvars. For example if byvars=’Month’, would take each month and form portfolios based on the percentiles of the groupvar during only that month

  • cutdf (Optional[DataFrame]) – optionally determine percentiles using another dataset

  • wtvar (Optional[str]) – name of variable in df to use for weighting in weighted average

  • count (Union[str, bool]) – pass a variable name to get the count of non-missing values of that variable within groups.

  • portvar (str) – name of portfolio variable in the output dataset

  • avgonly (bool) – True to return only averages, False to return (averages, individual observations with portfolios)

Return type

Union[DataFrame, Tuple[DataFrame, DataFrame]]

Returns
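
Example

A minimal usage sketch, forming deciles and taking equal- and value-weighted average returns within them (column names are hypothetical; with avgonly=False a tuple is returned):

>>> avgs, ports = pd_utils.portfolio_averages(
...     df, 'MktCap', avgvars='RET', ngroups=10, byvars='Month', wtvar='MktCap'
... )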

pd_utils.query module

pd_utils.query.select_rows_by_condition_on_columns(df, cols, condition='== 1', logic='or')[source]

Selects rows of a pandas dataframe by evaluating a condition on a subset of the dataframe’s columns.

Parameters
  • df (DataFrame) –

  • cols (List[str]) – column names, the subset of columns on which to evaluate conditions

  • condition (str) – needs to contain comparison operator and right hand side of comparison. For example, ‘== 1’ checks for each row that the value of each column is equal to one.

  • logic (str) – ‘or’ or ‘and’. With ‘or’, only one of the columns in cols need to match the condition for the row to be kept. With ‘and’, all of the columns in cols need to match the condition.

Returns
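
Example

A minimal usage sketch, keeping rows where at least one flag column equals 1 (column names are hypothetical):

>>> selected = pd_utils.select_rows_by_condition_on_columns(
...     df, ['flag_a', 'flag_b'], condition='== 1', logic='or'
... )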

pd_utils.query.sql(df_list, query)[source]

Convenience function for running a pandasql query. Keeps track of which variables are of datetime type, and converts them back after running the sql query.

Notes

Ensure that dfs are passed in the order that they are used in the query.

Parameters
Returns
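
Example

A minimal sketch (column names are hypothetical, and it is assumed here that the dataframes can be referenced in the query as df1 and df2, passed in the order used, per the note above):

>>> combined = pd_utils.sql([df1, df2], 'select * from df1 left join df2 using (id)')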

pd_utils.regby module

class pd_utils.regby.ResultCounter(num_expected_results)[source]

Bases: object

__init__(num_expected_results)[source]
pd_utils.regby.df_for_reg(df, yvar, xvars, groupvar)[source]
pd_utils.regby.reg_by(df, yvar, xvars, groupvar, merge=False, cons=True, mp=False, stderr=False)[source]

Runs a regression of df[yvar] on df[xvars] by values of groupvar. Outputs a dataframe with values of groupvar and corresponding coefficients, unless merge=True, then outputs the original dataframe with the appropriate coefficients merged in.

Parameters
  • df (DataFrame) –

  • yvar (str) –

  • xvars (List[str]) –

  • groupvar (Union[str, List[str]]) – column names of columns identifying by groups

  • merge (bool) –

  • cons (bool) – True to include a constant, False to not

  • mp (Union[bool, int]) – False to use single processor, True to use all processors, int to use # processors

  • stderr (bool) – True to include standard errors of coefficients

Returns
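
Example

A minimal usage sketch, running one regression per group (column names are hypothetical):

>>> coef_df = pd_utils.reg_by(df, 'RET', ['MktRF', 'SMB', 'HML'], groupvar='PERMNO')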

pd_utils.testing module

pd_utils.testing.to_copy_paste(df, index=False, column_names=True)[source]

Takes a dataframe and prints all of its data in such a format that it can be copy-pasted to create a new dataframe from the pandas.DataFrame() constructor.

Parameters
  • df (DataFrame) –

  • index (bool) – True to include index

  • column_names (bool) – False to exclude column names

Returns
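
Example

A minimal usage sketch:

>>> pd_utils.to_copy_paste(df, index=False, column_names=True)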

pd_utils.timer module

pd_utils.timer.estimate_time(length, i, start_time, output=True)[source]

Returns the estimate of when a looping operation will be finished.

Parameters
  • length (int) – total number of iterations for the loop

  • i (int) – index of the current iteration of the loop

  • start_time (float) – time created from timeit.default_timer(), see examples

  • output (bool) – False to suppress printing estimated time. Use this if you want to just store the time for some other use or custom output.

Returns

Examples

This function goes at the end of the loop to be timed. Outside of this function at the beginning of the loop, you must start a timer object as follows:

start_time = timeit.default_timer()

So the entire loop will look like this:

import timeit

my_start_time = timeit.default_timer()
for i, item in enumerate(my_list):
    # Do loop stuff here
    estimate_time(len(my_list), i, my_start_time)

pd_utils.transform module

pd_utils.transform.averages(df, avgvars, byvars, wtvar=None, count=False, flatten=True)[source]

Returns equal- and value-weighted averages of variables within groups

Parameters
  • df (DataFrame) –

  • avgvars (Union[str, List[str]]) – variable names to take averages of

  • byvars (Union[str, List[str]]) – variable names for by groups

  • wtvar (Optional[str]) – variable to use for calculating weights in weighted average

  • count (Union[str, bool]) – pass a variable name to get the count of non-missing values of that variable within groups.

  • flatten (bool) – False to return df with multi-level index

Returns
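
Example

A minimal usage sketch, computing equal- and value-weighted average returns by month (column names are hypothetical):

>>> avgs = pd_utils.averages(df, 'RET', byvars='Month', wtvar='MktCap')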

pd_utils.transform.join_col_strings(df, cols)[source]

Takes a dataframe and column name(s) and concatenates string versions of the columns with those names. Useful for when a group is identified by several variables and we need one key variable to describe a group. Returns a pandas Series.

Parameters
  • df (DataFrame) –

  • cols (Union[str, List[str]]) – names of columns in df to be concatenated

Returns
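
Example

A minimal usage sketch (column names are hypothetical):

>>> df['group_key'] = pd_utils.join_col_strings(df, ['PERMNO', 'Date'])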

pd_utils.transform.long_to_wide(df, groupvars, values, colindex=None, colindex_only=False)[source]

Takes a “long” format DataFrame and converts to a “wide” format

Parameters
  • df (DataFrame) –

  • groupvars (Union[str, List[str]]) – variables which signify unique observations in the output dataset

  • values (Union[str, List[str]]) – variables which contain the values which need to be transposed

  • colindex (Union[str, List[str], None]) – columns containing extension for column name in the output dataset. If not specified, just uses the count of the row within the group. If a list is provided, each column value will be appended in order separated by _

  • colindex_only (bool) – If True, column names in output data will be only the colindex, and will not include the name of the values variable. Only valid when passing a single value, otherwise multiple columns would have the same name.

Returns

Examples

For example:

If we had a long dataset of returns, with returns 12, 24, 36, 48, and 60 months after the date:
        ticker    ret    months
        AA        .01    12
        AA        .15    24
        AA        .21    36
        AA       -.10    48
        AA        .22    60
and we want to get this to one observation per ticker:
        ticker    ret12    ret24    ret36    ret48    ret60
        AA        .01      .15      .21     -.10      .22
We would use:
long_to_wide(df, groupvars='ticker', values='ret', colindex='months')
pd_utils.transform.state_abbrev(df, col, toabbrev=False)[source]

Converts a DataFrame’s column of state abbreviations to state names, or vice versa

Parameters
  • df (DataFrame) –

  • col (str) – name of column containing state names or state abbreviations

  • toabbrev (bool) – True to convert state names to abbreviations, defaults to converting abbreviations to state names

Returns
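
Example

A minimal usage sketch (column name is hypothetical):

>>> df = pd_utils.state_abbrev(df, 'state', toabbrev=True)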

pd_utils.transform.var_change_by_groups(df, var, byvars, datevar='Date', numlags=1)[source]

Used for getting variable changes over time within bygroups.

Notes

Dataset is not sorted in this process. Sort the data in the order in which you wish lags to be created before running this command.

Parameters
  • df (DataFrame) – dataframe containing bygroups, a date variable, and variables of interest

  • var (Union[str, List[str]]) – column names of variables for which to get changes

  • byvars (Union[str, List[str]]) – column names of variables identifying by groups

  • datevar (str) – column name of the variable identifying periods

  • numlags (int) – number of periods to go back to get change

Returns
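
Example

A minimal usage sketch, getting one-period changes within each firm (column names are hypothetical):

>>> changes = pd_utils.var_change_by_groups(df, 'MktCap', byvars='PERMNO', datevar='Date')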

pd_utils.transform.winsorize(df, pct, subset=None, byvars=None, bot=True, top=True)[source]

Finds observations beyond the pct percentile and replaces them with the percentile value, at the bottom, the top, or both, depending on the bot and top flags. Does this for all columns, or the subset given by subset.

Parameters
  • df (DataFrame) –

  • pct (Union[float, Tuple[float, float]]) – 0 < float < 1 or list of two values 0 < float < 1. If two values are given, the first will be used for the bottom percentile and the second will be used for the top. If one value is given and both bot and top are True, will use the same value for both.

  • subset (Union[str, List[str], None]) – column name(s) to winsorize

  • byvars (Union[str, List[str], None]) – Column names of columns identifying groups in the data. Winsorizing will be done within those groups.

  • bot (bool) – True to winsorize bottom observations

  • top (bool) – True to winsorize top observations

Return type

DataFrame

Returns

Examples
>>> winsorize(df, .05, subset='RET')  # replaces observations of RET below the 5% and above the 95% values
>>> winsorize(df, (.05, .1), subset='RET')  # replaces observations of RET below the 5% and above the 90% values

pd_utils.utils module

pd_utils.utils.split(df, keepvars, keyvar='__key_var__')[source]

Splits a dataframe into a list of arrays based on a key variable

pd_utils.utils.split_gen(df, keepvars, keyvar='__key_var__')[source]

Splits a dataframe into arrays based on a key variable, yielding one array at a time (generator version of split)