pd_utils package
High-level tools for common Pandas workflows.
Submodules
pd_utils.corr module
pd_utils.cum module
- pd_utils.cum.cumulate(df, cumvars, method, periodvar='Date', byvars=None, time=None, grossify=False, multiprocess=True, replace=False)[source]
Cumulates a variable over time. Typically used to get cumulative returns.
- Parameters
df (DataFrame) –
method (str) – 'between', 'zero', or 'first'. If 'zero', will give returns since the original date. Note: for periods before the original date, this will turn positive returns negative as we are going backwards in time. If 'between', will give returns since the prior requested time period. Note that the first period is period 0. If 'first', will give returns since the first requested time period.
periodvar –
byvars (Union[str, List[str], None]) – column names to use to separate by groups
time (Optional[Sequence[int]]) – for use with method='between'. Defines which periods to calculate between.
grossify (bool) – set to True to add one to all variables, then subtract one at the end
multiprocess (Union[bool, int]) – set to True to use all available processors, set to False to use only one, or pass an int less than or equal to the number of processors to use that many processors
replace (bool) – True to return df with the passed columns replaced by cumulated columns. False to return df with both the passed columns and the cumulated columns.
- Returns
- Examples
For example, if our input data was for date 1/5/2006, but we had shifted dates:

    permno    date        RET     shift_date
    10516     1/5/2006    110%    1/5/2006
    10516     1/5/2006    120%    1/6/2006
    10516     1/5/2006    105%    1/7/2006
    10516     1/5/2006    130%    1/8/2006

Then cumulate(df, 'RET', method='between', time=[1, 3], periodvar='shift_date') would return:

    permno    date        RET     shift_date    cumret
    10516     1/5/2006    110%    1/5/2006      110%
    10516     1/5/2006    120%    1/6/2006      120%
    10516     1/5/2006    105%    1/7/2006      126%
    10516     1/5/2006    130%    1/8/2006      130%

And cumulate(df, 'RET', method='first', periodvar='shift_date') would return:

    permno    date        RET     shift_date    cumret
    10516     1/5/2006    110%    1/5/2006      110%
    10516     1/5/2006    120%    1/6/2006      120%
    10516     1/5/2006    105%    1/7/2006      126%
    10516     1/5/2006    130%    1/8/2006      163.8%
pd_utils.datetime_utils module
- class pd_utils.datetime_utils.USTradingCalendar(*args, **kwargs)[source]
Bases: pandas.tseries.holiday.AbstractHolidayCalendar
The US trading day calendar behind the function tradedays().
- rules: list[Holiday] = [pandas.tseries.holiday.Holiday, pandas.tseries.holiday.USMartinLutherKingJr, pandas.tseries.holiday.USPresidentsDay, pandas.tseries.holiday.GoodFriday, pandas.tseries.holiday.USMemorialDay, pandas.tseries.holiday.Holiday, pandas.tseries.holiday.USLaborDay, pandas.tseries.holiday.USThanksgivingDay, pandas.tseries.holiday.Holiday]
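For instance, a minimal sketch of listing the market holidays the calendar produces, using the standard pandas holidays() method (the comment's example date is illustrative):

from pd_utils.datetime_utils import USTradingCalendar

cal = USTradingCalendar()
# DatetimeIndex of 2000 US market holidays, e.g. 2000-01-17 (MLK Day)
print(cal.holidays(start='2000-01-01', end='2000-12-31'))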
- pd_utils.datetime_utils.convert_sas_date_to_pandas_date(sasdates)[source]
Converts a date or Series of dates loaded from a SAS sas7bdat file to a pandas date type.
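A minimal usage sketch, assuming the standard SAS epoch (days counted from 1960-01-01):

import pandas as pd
from pd_utils.datetime_utils import convert_sas_date_to_pandas_date

sas_dates = pd.Series([0, 366])
# 0 and 366 days after the SAS epoch: 1960-01-01 and 1961-01-01
dates = convert_sas_date_to_pandas_date(sas_dates)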
- pd_utils.datetime_utils.expand_months(df, datevar='Date', newdatevar='Daily Date', trade_days=True)[source]
Takes a monthly dataframe and returns a daily (trade day or calendar day) dataframe. For each row in the input data, duplicates that row over each trading/calendar day in the month of the date in that row. Creates a new date column containing the daily date.
- Notes
If the input dataset has multiple observations per month, all of these will be expanded. Therefore you will have one row for each trade day for each original observation.
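A minimal usage sketch, using the documented default column names:

import pandas as pd
from pd_utils.datetime_utils import expand_months

monthly = pd.DataFrame({'Date': pd.to_datetime(['2000-01-01']), 'id': ['A']})
# One row per trading day in January 2000, with the day in 'Daily Date'
daily = expand_months(monthly)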
- pd_utils.datetime_utils.expand_time(df, intermediate_periods=False, datevar='Date', freq='m', time=[12, 24, 36, 48, 60], newdate='Shift Date', shiftvar='Shift', custom_business_day=None)[source]
Creates new observations in the dataset, advancing the time by the int or list given. Creates a new date variable.
- Parameters
df (DataFrame) –
intermediate_periods (bool) – Specify intermediate_periods=True to get periods in between the given time periods, e.g. passing time=[12, 24, 36] will get periods 12, 13, 14, …, 35, 36.
freq (str) – 'd' for daily, 'm' for monthly, 'a' for annual
shiftvar (str) – name of the variable which specifies how much the time has been shifted
custom_business_day (Optional[CustomBusinessDay]) – Only used for daily frequency. Defaults to using trading days based on the US market holiday calendar. Can pass custom business days for other calendars.
- Returns
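A minimal usage sketch, using the documented defaults for datevar and newdate:

import pandas as pd
from pd_utils.datetime_utils import expand_time

df = pd.DataFrame({'Date': pd.to_datetime(['2000-01-01']), 'id': ['A']})
# Two new observations per input row, with 'Shift Date' 12 and 24 months ahead
expanded = expand_time(df, freq='m', time=[12, 24])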
- pd_utils.datetime_utils.tradedays()[source]
Used for constructing a range of dates with the pandas date_range function.
- Example
>>> import pandas as pd
>>> import pd_utils
>>> pd.date_range(
...     start='1/1/2000',
...     end='1/31/2000',
...     freq=pd_utils.tradedays()
... )
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
               '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14', '2000-01-18', '2000-01-19',
               '2000-01-20', '2000-01-21', '2000-01-24', '2000-01-25',
               '2000-01-26', '2000-01-27', '2000-01-28', '2000-01-31'],
              dtype='datetime64[ns]', freq='C')
pd_utils.filldata module
- pd_utils.filldata.add_missing_group_rows(df, group_id_cols, non_group_id_cols, fill_method='ffill', fill_limit=None)[source]
Adds rows so that each group has all non-group IDs, optionally filling values by a pandas fill method.
- Parameters
df –
group_id_cols (List[str]) – typically entity IDs. These IDs represent groups in the data; data will not be forward/back filled across differences in these IDs.
non_group_id_cols (List[str]) – typically date or time IDs. Data will be forward/back filled across differences in these IDs.
fill_method (Optional[str]) – pandas fill method, None to not fill
- Returns
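A minimal usage sketch under the default forward fill:

import pandas as pd
from pd_utils.filldata import add_missing_group_rows

df = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'year': [2000, 2001, 2000],
    'x': [1.0, 2.0, 3.0],
})
# Adds the missing ('b', 2001) row, forward-filling x within id 'b'
filled = add_missing_group_rows(df, ['id'], ['year'])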
- pd_utils.filldata.fill_excluded_rows(df, byvars, fillvars=None, **fillna_kwargs)[source]
Takes a dataframe which does not contain all possible combinations of byvars as rows and creates those rows. If fillna_kwargs are passed, calls fillna using fillna_kwargs for fillvars.
- Parameters
- Returns
- Example
df:

    date        id      var
 0  2003-06-09  42223C  1
 1  2003-06-10  09255G  2

With fill_excluded_rows(df, byvars=['date', 'id'], fillvars='var', value=0) becomes:

    date        id      var
 0  2003-06-09  42223C  1
 1  2003-06-10  42223C  0
 2  2003-06-09  09255G  0
 3  2003-06-10  09255G  2
- pd_utils.filldata.fillna_by_groups(df, byvars, exclude_cols=None, str_vars='first', num_vars='mean')[source]
Fills missing values by group, with different handling for string variables versus numeric.
WARNING: do not use if the index is important, as it will be dropped.
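A minimal usage sketch (per the documented defaults, string columns take the first value in the group and numeric columns take the group mean):

import numpy as np
import pandas as pd
from pd_utils.filldata import fillna_by_groups

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1.0, np.nan, 3.0, 5.0],
})
# The missing x in group 'a' is filled with that group's mean
filled = fillna_by_groups(df, ['group'])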
- pd_utils.filldata.fillna_by_groups_and_keep_one_per_group(df, byvars, exclude_cols=None, str_vars='first', num_vars='mean')[source]
Fills missing values by group, with different handling for string variables versus numeric, then keeps one observation per group.
WARNING: do not use if the index is important, as it will be dropped.
pd_utils.load module
- pd_utils.load.load_sas(filepath, csv=True, **read_csv_kwargs)[source]
Loads a SAS sas7bdat file into a pandas DataFrame.
- Parameters
csv (bool) – when set to True, saves a csv version of the data in the same directory as the sas7bdat. The next time load_sas is called, it will load from the csv version rather than the sas7bdat, which speeds up load times about 3x. If the sas7bdat file was modified more recently than the csv, the sas7bdat will automatically be loaded and saved to the csv again.
read_csv_kwargs – kwargs to pass to pd.read_csv if the csv option is True
- Returns
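A minimal usage sketch ('data/funda.sas7bdat' is a hypothetical path):

from pd_utils.load import load_sas

# First call reads the sas7bdat and caches a csv next to it; later calls load the csv
df = load_sas('data/funda.sas7bdat')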
pd_utils.merge module
- pd_utils.merge.apply_func_to_unique_and_merge(series, func)[source]
Reduces the given series down to its unique values, applies the function, then expands back up to the original shape of the data.
Many Pandas operations can be slow because they do repeated work. This can help optimize some operations.
- Parameters
- Return type
Series
- Returns
- Usage
>>> import functools
>>> to_datetime = functools.partial(pd.to_datetime, format='%Y%m')
>>> apply_func_to_unique_and_merge(df['MONTH'], to_datetime)
- pd_utils.merge.groupby_index(df, byvars, sortvars=None, ascending=True)[source]
Returns a copy of the dataframe with an additional column containing an index within groups. Each time the bygroup changes, the index restarts at 0.
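A minimal usage sketch (the name of the generated index column is not documented here, so the comment is illustrative):

import pandas as pd
from pd_utils.merge import groupby_index

df = pd.DataFrame({'firm': ['a', 'a', 'b'], 'ret': [0.1, 0.2, 0.3]})
# Adds an index-within-group column: 0, 1 for firm 'a', restarting at 0 for 'b'
indexed = groupby_index(df, 'firm')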
- pd_utils.merge.groupby_merge(df, byvars, func_str, *func_args, subset='all', replace=False)[source]
Creates a pandas groupby object, applies the aggregation function in func_str, and merges the aggregated data back to the original dataframe.
- Parameters
df –
byvars (Union[str, List[str]]) – column names which uniquely identify groups
func_str (str) – name of a groupby aggregation function such as 'min', 'max', 'sum', 'count', etc.
func_args – arguments to pass to func
subset (Union[str, List[str]]) – column names for which to apply aggregation functions, or 'all' for all columns
replace (bool) – True to replace original columns in the data with aggregated/transformed columns
- Returns
- Example
>>> import pd_utils
>>> df = pd_utils.groupby_merge(df, ['PERMNO', 'byvar'], 'max', subset='RET')
- pd_utils.merge.left_merge_latest(df, df2, on, left_datevar='Date', right_datevar='Date', max_offset=None, backend='pandas', low_memory=False)[source]
Left merges df2 to df using on, but grabs the most recent observation (right_datevar will be the soonest date earlier than left_datevar). Useful for situations where data needs to be merged with mismatched dates and just the most recent available data is needed.
- Parameters
df (DataFrame) – Pandas dataframe containing source data (all rows will be kept); must have the on variables and left_datevar
df2 (DataFrame) – Pandas dataframe containing data to be merged (only the most recent rows before the source data will be kept)
on (Union[str, List[str]]) – names of columns on which to match, excluding date
left_datevar (str) – name of date variable on which to merge in df
right_datevar (str) – name of date variable on which to merge in df2
max_offset (Union[int, timedelta, None]) – maximum amount of time to go back to look for a match. When datevar is a datetime column, pass a datetime.timedelta. When datevar is an int column (e.g. year), pass an int. Currently only applicable for backend 'pandas'.
backend (str) – 'pandas' or 'sql'. Specifies the underlying machinery used to perform the merge: 'pandas' means native pandas, while 'sql' uses pandasql. Try 'sql' if you run out of memory.
low_memory (bool) – True to reduce memory usage but decrease calculation speed
- Returns
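A minimal usage sketch, merging the most recent earlier fundamentals onto each price date:

import pandas as pd
from pd_utils.merge import left_merge_latest

prices = pd.DataFrame({
    'ticker': ['AA', 'AA'],
    'Date': pd.to_datetime(['2000-03-01', '2000-06-01']),
})
fundamentals = pd.DataFrame({
    'ticker': ['AA', 'AA'],
    'Date': pd.to_datetime(['2000-01-15', '2000-04-15']),
    'assets': [100.0, 110.0],
})
# Each prices row gets the latest fundamentals row dated before its Date
merged = left_merge_latest(prices, fundamentals, on='ticker')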
pd_utils.plot module
- pd_utils.plot.ordinal(n)
- pd_utils.plot.plot_multi_axis(df, cols=None, spacing=0.1, colored_axes=True, axis_locations_in_legend=True, legend_kwargs=None, **kwargs)[source]
Plot multiple series with different y-axes.
Adapted from https://stackoverflow.com/a/50655786
- Parameters
df (DataFrame) – Data to be plotted
spacing (float) – Amount of space between y-axes beyond the two which are on the sides of the box
colored_axes (bool) – Whether to make axis labels and ticks colored the same as the corresponding line on the graph
axis_locations_in_legend (bool) – Whether to add to the legend which axis corresponds to which plot
legend_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to ax.legend
kwargs – df.plot kwargs
- Return type
- Returns
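A minimal usage sketch:

import pandas as pd
from pd_utils.plot import plot_multi_axis

df = pd.DataFrame({
    'price': [10.0, 11.0, 12.5, 12.0],
    'volume': [1000, 1500, 800, 1200],
})
# One line per column, each with its own y-axis
ax = plot_multi_axis(df)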
pd_utils.port module
- pd_utils.port.long_short_portfolio(df, portvar, byvars=None, retvars=None, top_minus_bot=True)[source]
Takes a df with a column of numbered portfolios and creates a new portfolio which is long the top portfolio and short the bottom portfolio.
- Parameters
df (DataFrame) – dataframe containing a column with portfolio numbers
byvars (Union[str, List[str], None]) – column names containing groups for portfolios. Calculates long-short within these groups. These should be the same groups in which the portfolios were formed.
retvars (Union[str, List[str], None]) – variables to return in the long-short dataset. By default, will use all numeric variables in the df.
top_minus_bot (bool) – True to be long the top portfolio and short the bottom portfolio. False to be long the bottom portfolio and short the top portfolio.
- Returns
a df of the long-short portfolio
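A minimal usage sketch, starting from data that already has numbered portfolios:

import pandas as pd
from pd_utils.port import long_short_portfolio

df = pd.DataFrame({
    'portfolio': [1, 2, 3, 1, 2, 3],
    'ret': [0.01, 0.02, 0.05, 0.00, 0.03, 0.04],
})
# Long portfolio 3, short portfolio 1
ls = long_short_portfolio(df, 'portfolio')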
- pd_utils.port.portfolio(df, groupvar, ngroups=10, cutoffs=None, quant_cutoffs=None, byvars=None, cutdf=None, portvar='portfolio', multiprocess=False)[source]
Constructs portfolios based on percentile values of groupvar.
If ngroups=10, will form 10 portfolios, with portfolio 1 having the bottom 10 percent of groupvar and portfolio 10 having the top 10 percent of groupvar.
- Notes
Resets and drops the index in the output data, so don't use if the index is important (input data not affected).
If using a cutdf, it MUST have the same bygroups as df. The number of observations within each bygroup can be different, but there MUST be a one-to-one match of bygroups, or this will NOT work correctly. This may require some cleaning of the cutdf first.
For some reason, multiprocessing seems to be slower in testing, so it is disabled by default.
- Parameters
df (DataFrame) – input data
groupvar (str) – name of variable in df to form portfolios on
ngroups (int) – number of portfolios to form. Will be ignored if cutoffs or quant_cutoffs is passed.
cutoffs (Optional[List[Union[float, int]]]) – e.g. [100, 10000] to form three portfolios: 1 would be < 100, 2 would be between 100 and 10000, and 3 would be > 10000. Cannot be used with ngroups.
quant_cutoffs (Optional[List[float]]) – e.g. [0.1, 0.9] to form three portfolios: 1 would be the lowest 10% of the data, 2 would be between the 10th and 90th percentiles, and 3 would be the highest 10%. All will be within byvars if byvars are passed.
byvars (Union[str, List[str], None]) – name of variable(s) in df; finds portfolios within byvars. For example, if byvars='Month', would take each month and form portfolios based on the percentiles of groupvar during only that month.
cutdf (Optional[DataFrame]) – optionally determine percentiles using another dataset. See the second note.
portvar (str) – name of the portfolio variable in the output dataset
multiprocess (bool) – set to True to use all available processors, set to False to use only one, or pass an int less than or equal to the number of processors to use that many processors
- Returns
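A minimal usage sketch forming two portfolios on a single variable:

import pandas as pd
from pd_utils.port import portfolio

df = pd.DataFrame({'MarketCap': [1.0, 5.0, 2.0, 9.0, 4.0, 7.0]})
# Adds a 'portfolio' column: 1 for the bottom half, 2 for the top half
out = portfolio(df, 'MarketCap', ngroups=2)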
- pd_utils.port.portfolio_averages(df, groupvar, avgvars, ngroups=10, byvars=None, cutdf=None, wtvar=None, count=False, portvar='portfolio', avgonly=False)[source]
Creates portfolios and calculates equal- and value-weighted averages of variables within portfolios.
If ngroups=10, will form 10 portfolios, with portfolio 1 having the bottom 10 percent of groupvar and portfolio 10 having the top 10 percent of groupvar.
- Notes
Resets and drops the index in the output data, so don't use if the index is important (input data not affected).
- Parameters
df (DataFrame) – input data
groupvar (str) – name of variable in df to form portfolios on
byvars (Union[str, List[str], None]) – name of variable(s) in df; finds portfolios within byvars. For example, if byvars='Month', would take each month and form portfolios based on the percentiles of groupvar during only that month.
cutdf (Optional[DataFrame]) – optionally determine percentiles using another dataset
wtvar (Optional[str]) – name of variable in df to use for weighting in the weighted average
count (Union[str, bool]) – pass a variable name to get a count of non-missing values of that variable within groups
portvar (str) – name of the portfolio variable in the output dataset
avgonly (bool) – True to return only averages; False to return (averages, individual observations with portfolios)
- Return type
- Returns
pd_utils.query module
- pd_utils.query.select_rows_by_condition_on_columns(df, cols, condition='== 1', logic='or')[source]
Selects rows of a pandas dataframe by evaluating a condition on a subset of the dataframe's columns.
- Parameters
df (DataFrame) –
cols (List[str]) – column names, the subset of columns on which to evaluate conditions
condition (str) – needs to contain a comparison operator and the right-hand side of the comparison. For example, '== 1' checks for each row that the value of each column is equal to one.
logic (str) – 'or' or 'and'. With 'or', only one of the columns in cols needs to match the condition for the row to be kept. With 'and', all of the columns in cols need to match the condition.
- Returns
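A minimal usage sketch with the default condition '== 1' and logic 'or':

import pandas as pd
from pd_utils.query import select_rows_by_condition_on_columns

df = pd.DataFrame({'flag_a': [1, 0, 0], 'flag_b': [0, 1, 0], 'x': [10, 20, 30]})
# Keeps the first two rows; the third matches neither flag
kept = select_rows_by_condition_on_columns(df, ['flag_a', 'flag_b'])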
pd_utils.regby module
- pd_utils.regby.reg_by(df, yvar, xvars, groupvar, merge=False, cons=True, mp=False, stderr=False)[source]
Runs a regression of df[yvar] on df[xvars] by values of groupvar. Outputs a dataframe with values of groupvar and corresponding coefficients, unless merge=True, in which case it outputs the original dataframe with the appropriate coefficients merged in.
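A minimal usage sketch, one regression per value of 'group':

import pandas as pd
from pd_utils.regby import reg_by

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'y': [1.0, 2.1, 2.9, 2.0, 4.2, 6.1],
    'x': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})
# One row of coefficient estimates per group
coefs = reg_by(df, 'y', ['x'], 'group')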
pd_utils.testing module
pd_utils.timer module
- pd_utils.timer.estimate_time(length, i, start_time, output=True)[source]
Returns an estimate of when a looping operation will be finished.
- Parameters
- Returns
- Examples
This function goes at the end of the loop to be timed. Before the loop, you must start a timer as follows:

start_time = timeit.default_timer()

So the entire loop will look like this:

my_start_time = timeit.default_timer()
for i, item in enumerate(my_list):
    # Do loop stuff here
    estimate_time(len(my_list), i, my_start_time)
pd_utils.transform module
- pd_utils.transform.averages(df, avgvars, byvars, wtvar=None, count=False, flatten=True)[source]
Returns equal- and value-weighted averages of variables within groups.
- Parameters
df (DataFrame) –
avgvars (Union[str, List[str]]) – variable names to take averages of
byvars (Union[str, List[str]]) – variable names for by groups
wtvar (Optional[str]) – variable to use for calculating weights in the weighted average
count (Union[str, bool]) – pass a variable name to get a count of non-missing values of that variable within groups
- Returns
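A minimal usage sketch with a weight variable:

import pandas as pd
from pd_utils.transform import averages

df = pd.DataFrame({
    'portfolio': [1, 1, 2, 2],
    'ret': [0.01, 0.03, 0.02, 0.06],
    'mktcap': [100.0, 300.0, 200.0, 200.0],
})
# Equal-weighted and mktcap-weighted averages of ret within each portfolio
avgs = averages(df, 'ret', byvars='portfolio', wtvar='mktcap')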
- pd_utils.transform.join_col_strings(df, cols)[source]
Takes a dataframe and column name(s) and concatenates string versions of the columns with those names. Useful when a group is identified by several variables and we need one key variable to describe the group. Returns a pandas Series.
- pd_utils.transform.long_to_wide(df, groupvars, values, colindex=None, colindex_only=False)[source]
Takes a "long" format DataFrame and converts it to a "wide" format.
- Parameters
df (DataFrame) –
groupvars (Union[str, List[str]]) – variables which signify unique observations in the output dataset
values (Union[str, List[str]]) – variables which contain the values which need to be transposed
colindex (Union[str, List[str], None]) – columns containing the extension for the column names in the output dataset. If not specified, just uses the count of the row within the group. If a list is provided, each column value will be appended in order, separated by _.
colindex_only (bool) – If True, column names in the output data will be only the colindex and will not include the name of the values variable. Only valid when passing a single value; otherwise multiple columns would have the same name.
- Returns
- Examples
For example, if we had a long dataset of returns, with returns 12, 24, 36, 48, and 60 months after the date:

    ticker   ret    months
    AA       .01    12
    AA       .15    24
    AA       .21    36
    AA       -.10   48
    AA       .22    60

and we want to get this to one observation per ticker:

    ticker   ret12   ret24   ret36   ret48   ret60
    AA       .01     .15     .21     -.10    .22

We would use:

long_to_wide(df, groupvars='ticker', values='ret', colindex='months')
- pd_utils.transform.state_abbrev(df, col, toabbrev=False)[source]
Replaces a DataFrame's column of state abbreviations or state names with the opposite form.
- pd_utils.transform.var_change_by_groups(df, var, byvars, datevar='Date', numlags=1)[source]
Used for getting variable changes over time within bygroups.
- Notes
The dataset is not sorted in this process. Sort the data in the order in which you wish lags to be created before running this command.
- Parameters
df (DataFrame) – dataframe containing bygroups, a date variable, and variables of interest
var (Union[str, List[str]]) – column names of variables to get changes of
byvars (Union[str, List[str]]) – column names of variables identifying by groups
datevar (str) – column name of the variable identifying periods
- Returns
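A minimal usage sketch, with the data pre-sorted by date within each firm per the note above:

import pandas as pd
from pd_utils.transform import var_change_by_groups

df = pd.DataFrame({
    'firm': ['a', 'a', 'a', 'b', 'b', 'b'],
    'Date': pd.to_datetime(['2000-01-01', '2000-02-01', '2000-03-01'] * 2),
    'x': [1.0, 3.0, 6.0, 2.0, 2.5, 4.0],
})
# One-period changes of x within each firm
changed = var_change_by_groups(df, 'x', 'firm')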
- pd_utils.transform.winsorize(df, pct, subset=None, byvars=None, bot=True, top=True)[source]
Finds observations beyond the pct percentile and replaces them with the pct percentile value. Does this for all columns, or for the subset given by subset.
- Parameters
df (DataFrame) –
pct (Union[float, Tuple[float, float]]) – 0 < float < 1, or a pair of two such values. If two values are given, the first will be used for the bottom percentile and the second for the top. If one value is given and both bot and top are True, the same value will be used for both.
subset (Union[str, List[str], None]) – column name(s) to winsorize
byvars (Union[str, List[str], None]) – column names of columns identifying groups in the data. Winsorizing will be done within those groups.
- Return type
DataFrame
- Returns
- Examples
>>> winsorize(df, .05, subset='RET')  # replaces observations of RET below the 5% and above the 95% values
>>> winsorize(df, (.05, .1), subset='RET')  # replaces observations of RET below the 5% and above the 90% values