pd_utils package
High-level tools for common Pandas workflows.
Submodules
pd_utils.corr module
pd_utils.cum module
- pd_utils.cum.cumulate(df, cumvars, method, periodvar='Date', byvars=None, time=None, grossify=False, multiprocess=True, replace=False)[source]
Cumulates a variable over time. Typically used to get cumulative returns.
- Parameters
df (DataFrame) –
method (str) – 'between', 'zero', or 'first'. If 'zero', will give returns since the original date. Note: for periods before the original date, this will turn positive returns negative as we are going backwards in time. If 'between', will give returns since the prior requested time period. Note that the first period is period 0. If 'first', will give returns since the first requested time period.
periodvar –
byvars (Union[str, List[str], None]) – column names to use to separate by groups
time (Optional[Sequence[int]]) – for use with method='between'. Defines which periods to calculate between.
grossify (bool) – set to True to add one to all variables, then subtract one at the end
multiprocess (Union[bool, int]) – set to True to use all available processors, set to False to use only one, or pass an int less than or equal to the number of processors to use that many processors
replace (bool) – True to return df with the passed columns replaced by cumulated columns. False to return df with both the passed columns and the cumulated columns.
- Returns
- Examples
For example, if our input data was for date 1/5/2006, but we had shifted dates:

    permno    date        RET     shift_date
    10516     1/5/2006    110%    1/5/2006
    10516     1/5/2006    120%    1/6/2006
    10516     1/5/2006    105%    1/7/2006
    10516     1/5/2006    130%    1/8/2006

Then cumulate(df, 'RET', method='between', time=[1, 3], periodvar='shift_date') would return:

    permno    date        RET     shift_date    cumret
    10516     1/5/2006    110%    1/5/2006      110%
    10516     1/5/2006    120%    1/6/2006      120%
    10516     1/5/2006    105%    1/7/2006      126%
    10516     1/5/2006    130%    1/8/2006      130%

And cumulate(df, 'RET', method='first', periodvar='shift_date') would return:

    permno    date        RET     shift_date    cumret
    10516     1/5/2006    110%    1/5/2006      110%
    10516     1/5/2006    120%    1/6/2006      120%
    10516     1/5/2006    105%    1/7/2006      126%
    10516     1/5/2006    130%    1/8/2006      163.8%
pd_utils.datetime_utils module
- class pd_utils.datetime_utils.USTradingCalendar(*args, **kwargs)[source]
Bases: pandas.tseries.holiday.AbstractHolidayCalendar
The US trading day calendar behind the function tradedays().
- rules: list[Holiday] = [pandas.tseries.holiday.Holiday, pandas.tseries.holiday.USMartinLutherKingJr, pandas.tseries.holiday.USPresidentsDay, pandas.tseries.holiday.GoodFriday, pandas.tseries.holiday.USMemorialDay, pandas.tseries.holiday.Holiday, pandas.tseries.holiday.USLaborDay, pandas.tseries.holiday.USThanksgivingDay, pandas.tseries.holiday.Holiday]
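For instance, a minimal sketch of listing the market holidays the calendar produces, using the standard pandas holidays() method (the comment's example date is illustrative):

from pd_utils.datetime_utils import USTradingCalendar

cal = USTradingCalendar()
# DatetimeIndex of 2000 US market holidays, e.g. 2000-01-17 (MLK Day)
print(cal.holidays(start='2000-01-01', end='2000-12-31'))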
- pd_utils.datetime_utils.convert_sas_date_to_pandas_date(sasdates)[source]
Converts a date or Series of dates loaded from a SAS sas7bdat file to a pandas date type.
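A minimal usage sketch, assuming the standard SAS epoch (days counted from 1960-01-01):

import pandas as pd
from pd_utils.datetime_utils import convert_sas_date_to_pandas_date

sas_dates = pd.Series([0, 366])
# 0 and 366 days after the SAS epoch: 1960-01-01 and 1961-01-01
dates = convert_sas_date_to_pandas_date(sas_dates)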
- pd_utils.datetime_utils.expand_months(df, datevar='Date', newdatevar='Daily Date', trade_days=True)[source]
Takes a monthly dataframe and returns a daily (trade day or calendar day) dataframe. For each row in the input data, duplicates that row over each trading/calendar day in the month of the date in that row. Creates a new date column containing the daily date.
- Notes
If the input dataset has multiple observations per month, all of these will be expanded. Therefore you will have one row for each trade day for each original observation.
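A minimal usage sketch, using the documented default column names:

import pandas as pd
from pd_utils.datetime_utils import expand_months

monthly = pd.DataFrame({'Date': pd.to_datetime(['2000-01-01']), 'id': ['A']})
# One row per trading day in January 2000, with the day in 'Daily Date'
daily = expand_months(monthly)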
- pd_utils.datetime_utils.expand_time(df, intermediate_periods=False, datevar='Date', freq='m', time=[12, 24, 36, 48, 60], newdate='Shift Date', shiftvar='Shift', custom_business_day=None)[source]
Creates new observations in the dataset, advancing the time by the int or list given. Creates a new date variable.
- Parameters
df (DataFrame) –
intermediate_periods (bool) – Specify intermediate_periods=True to get periods in between the given time periods, e.g. passing time=[12, 24, 36] will get periods 12, 13, 14, …, 35, 36.
freq (str) – 'd' for daily, 'm' for monthly, 'a' for annual
shiftvar (str) – name of the variable which specifies how much the time has been shifted
custom_business_day (Optional[CustomBusinessDay]) – Only used for daily frequency. Defaults to using trading days based on the US market holiday calendar. Can pass custom business days for other calendars.
- Returns
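A minimal usage sketch, using the documented defaults for datevar and newdate:

import pandas as pd
from pd_utils.datetime_utils import expand_time

df = pd.DataFrame({'Date': pd.to_datetime(['2000-01-01']), 'id': ['A']})
# Two new observations per input row, with 'Shift Date' 12 and 24 months ahead
expanded = expand_time(df, freq='m', time=[12, 24])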
- pd_utils.datetime_utils.tradedays()[source]
Used for constructing a range of dates with the pandas date_range function.
- Example
>>> import pandas as pd
>>> import pd_utils
>>> pd.date_range(
...     start='1/1/2000',
...     end='1/31/2000',
...     freq=pd_utils.tradedays()
... )
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
               '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14', '2000-01-18', '2000-01-19',
               '2000-01-20', '2000-01-21', '2000-01-24', '2000-01-25',
               '2000-01-26', '2000-01-27', '2000-01-28', '2000-01-31'],
              dtype='datetime64[ns]', freq='C')
pd_utils.filldata module
- pd_utils.filldata.add_missing_group_rows(df, group_id_cols, non_group_id_cols, fill_method='ffill', fill_limit=None)[source]
Adds rows so that each group has all non-group IDs, optionally filling values by a pandas fill method.
- Parameters
df –
group_id_cols (List[str]) – typically entity IDs. These IDs represent groups in the data; data will not be forward/back filled across differences in these IDs.
non_group_id_cols (List[str]) – typically date or time IDs. Data will be forward/back filled across differences in these IDs.
fill_method (Optional[str]) – pandas fill method, None to not fill
- Returns
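A minimal usage sketch under the default forward fill:

import pandas as pd
from pd_utils.filldata import add_missing_group_rows

df = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'year': [2000, 2001, 2000],
    'x': [1.0, 2.0, 3.0],
})
# Adds the missing ('b', 2001) row, forward-filling x within id 'b'
filled = add_missing_group_rows(df, ['id'], ['year'])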
- pd_utils.filldata.fill_excluded_rows(df, byvars, fillvars=None, **fillna_kwargs)[source]
Takes a dataframe which does not contain all possible combinations of byvars as rows and creates those rows. If fillna_kwargs are passed, calls fillna using fillna_kwargs for fillvars.
- Parameters
- Returns
- Example
df:

    date        id      var
 0  2003-06-09  42223C  1
 1  2003-06-10  09255G  2

With fill_excluded_rows(df, byvars=['date', 'id'], fillvars='var', value=0) becomes:

    date        id      var
 0  2003-06-09  42223C  1
 1  2003-06-10  42223C  0
 2  2003-06-09  09255G  0
 3  2003-06-10  09255G  2
- pd_utils.filldata.fillna_by_groups(df, byvars, exclude_cols=None, str_vars='first', num_vars='mean')[source]
Fills missing values by group, with different handling for string variables versus numeric.
WARNING: do not use if the index is important, as it will be dropped.
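A minimal usage sketch (per the documented defaults, string columns take the first value in the group and numeric columns take the group mean):

import numpy as np
import pandas as pd
from pd_utils.filldata import fillna_by_groups

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1.0, np.nan, 3.0, 5.0],
})
# The missing x in group 'a' is filled with that group's mean
filled = fillna_by_groups(df, ['group'])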
- pd_utils.filldata.fillna_by_groups_and_keep_one_per_group(df, byvars, exclude_cols=None, str_vars='first', num_vars='mean')[source]
Fills missing values by group, with different handling for string variables versus numeric, then keeps one observation per group.
WARNING: do not use if the index is important, as it will be dropped.
pd_utils.load module
- pd_utils.load.load_sas(filepath, csv=True, **read_csv_kwargs)[source]
Loads a SAS sas7bdat file into a pandas DataFrame.
- Parameters
csv (bool) – when set to True, saves a csv version of the data in the same directory as the sas7bdat. The next time load_sas is called, it will load from the csv version rather than the sas7bdat, which speeds up load times about 3x. If the sas7bdat file was modified more recently than the csv, the sas7bdat will automatically be loaded and saved to the csv again.
read_csv_kwargs – kwargs to pass to pd.read_csv if the csv option is True
- Returns
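A minimal usage sketch ('data/funda.sas7bdat' is a hypothetical path):

from pd_utils.load import load_sas

# First call reads the sas7bdat and caches a csv next to it; later calls load the csv
df = load_sas('data/funda.sas7bdat')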
pd_utils.merge module
- pd_utils.merge.apply_func_to_unique_and_merge(series, func)[source]
Reduces the given series down to its unique values, applies the function, then expands back up to the original shape of the data.
Many Pandas operations can be slow because they do repeated work. This can help optimize some operations.
- Parameters
- Return type
Series
- Returns
- Usage
>>> import functools
>>> to_datetime = functools.partial(pd.to_datetime, format='%Y%m')
>>> apply_func_to_unique_and_merge(df['MONTH'], to_datetime)
- pd_utils.merge.groupby_index(df, byvars, sortvars=None, ascending=True)[source]
Returns a copy of the dataframe with an additional column containing an index within groups. Each time the bygroup changes, the index restarts at 0.
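A minimal usage sketch (the name of the generated index column is not documented here, so the comment is illustrative):

import pandas as pd
from pd_utils.merge import groupby_index

df = pd.DataFrame({'firm': ['a', 'a', 'b'], 'ret': [0.1, 0.2, 0.3]})
# Adds an index-within-group column: 0, 1 for firm 'a', restarting at 0 for 'b'
indexed = groupby_index(df, 'firm')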
- pd_utils.merge.groupby_merge(df, byvars, func_str, *func_args, subset='all', replace=False)[source]
Creates a pandas groupby object, applies the aggregation function in func_str, and merges the aggregated data back to the original dataframe.
- Parameters
df –
byvars (Union[str, List[str]]) – column names which uniquely identify groups
func_str (str) – name of a groupby aggregation function such as 'min', 'max', 'sum', 'count', etc.
func_args – arguments to pass to func
subset (Union[str, List[str]]) – column names for which to apply aggregation functions, or 'all' for all columns
replace (bool) – True to replace original columns in the data with aggregated/transformed columns
- Returns
- Example
>>> import pd_utils
>>> df = pd_utils.groupby_merge(df, ['PERMNO', 'byvar'], 'max', subset='RET')
- pd_utils.merge.left_merge_latest(df, df2, on, left_datevar='Date', right_datevar='Date', max_offset=None, backend='pandas', low_memory=False)[source]
Left merges df2 to df using on, but grabs the most recent observation (right_datevar will be the soonest date earlier than left_datevar). Useful for situations where data needs to be merged with mismatched dates and just the most recent available data is needed.
- Parameters
df (DataFrame) – Pandas dataframe containing source data (all rows will be kept); must have the on variables and left_datevar
df2 (DataFrame) – Pandas dataframe containing data to be merged (only the most recent rows before the source data will be kept)
on (Union[str, List[str]]) – names of columns on which to match, excluding date
left_datevar (str) – name of date variable on which to merge in df
right_datevar (str) – name of date variable on which to merge in df2
max_offset (Union[int, timedelta, None]) – maximum amount of time to go back to look for a match. When datevar is a datetime column, pass a datetime.timedelta. When datevar is an int column (e.g. year), pass an int. Currently only applicable for backend 'pandas'.
backend (str) – 'pandas' or 'sql'. Specifies the underlying machinery used to perform the merge: 'pandas' means native pandas, while 'sql' uses pandasql. Try 'sql' if you run out of memory.
low_memory (bool) – True to reduce memory usage but decrease calculation speed
- Returns
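A minimal usage sketch, merging the most recent earlier fundamentals onto each price date:

import pandas as pd
from pd_utils.merge import left_merge_latest

prices = pd.DataFrame({
    'ticker': ['AA', 'AA'],
    'Date': pd.to_datetime(['2000-03-01', '2000-06-01']),
})
fundamentals = pd.DataFrame({
    'ticker': ['AA', 'AA'],
    'Date': pd.to_datetime(['2000-01-15', '2000-04-15']),
    'assets': [100.0, 110.0],
})
# Each prices row gets the latest fundamentals row dated before its Date
merged = left_merge_latest(prices, fundamentals, on='ticker')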
pd_utils.plot module
- pd_utils.plot.ordinal(n)
- pd_utils.plot.plot_multi_axis(df, cols=None, spacing=0.1, colored_axes=True, axis_locations_in_legend=True, legend_kwargs=None, **kwargs)[source]
Plot multiple series with different y-axes.
Adapted from https://stackoverflow.com/a/50655786
- Parameters
df (DataFrame) – Data to be plotted
spacing (float) – Amount of space between y-axes beyond the two which are on the sides of the box
colored_axes (bool) – Whether to make axis labels and ticks colored the same as the corresponding line on the graph
axis_locations_in_legend (bool) – Whether to add to the legend which axis corresponds to which plot
legend_kwargs (Optional[Dict[str, Any]]) – Keyword arguments to pass to ax.legend
kwargs – df.plot kwargs
- Return type
- Returns
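A minimal usage sketch:

import pandas as pd
from pd_utils.plot import plot_multi_axis

df = pd.DataFrame({
    'price': [10.0, 11.0, 12.5, 12.0],
    'volume': [1000, 1500, 800, 1200],
})
# One line per column, each with its own y-axis
ax = plot_multi_axis(df)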
pd_utils.port module
- pd_utils.port.long_short_portfolio(df, portvar, byvars=None, retvars=None, top_minus_bot=True)[source]
Takes a df with a column of numbered portfolios and creates a new portfolio which is long the top portfolio and short the bottom portfolio.
- Parameters
df (DataFrame) – dataframe containing a column with portfolio numbers
byvars (Union[str, List[str], None]) – column names containing groups for portfolios. Calculates long-short within these groups. These should be the same groups in which the portfolios were formed.
retvars (Union[str, List[str], None]) – variables to return in the long-short dataset. By default, will use all numeric variables in the df.
top_minus_bot (bool) – True to be long the top portfolio and short the bottom portfolio. False to be long the bottom portfolio and short the top portfolio.
- Returns
a df of the long-short portfolio
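A minimal usage sketch, starting from data that already has numbered portfolios:

import pandas as pd
from pd_utils.port import long_short_portfolio

df = pd.DataFrame({
    'portfolio': [1, 2, 3, 1, 2, 3],
    'ret': [0.01, 0.02, 0.05, 0.00, 0.03, 0.04],
})
# Long portfolio 3, short portfolio 1
ls = long_short_portfolio(df, 'portfolio')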
- pd_utils.port.portfolio(df, groupvar, ngroups=10, cutoffs=None, quant_cutoffs=None, byvars=None, cutdf=None, portvar='portfolio', multiprocess=False)[source]
Constructs portfolios based on percentile values of groupvar.
If ngroups=10, will form 10 portfolios, with portfolio 1 having the bottom 10 percent of groupvar and portfolio 10 having the top 10 percent of groupvar.
- Notes
Resets and drops the index in the output data, so don't use if the index is important (input data not affected).
If using a cutdf, it MUST have the same bygroups as df. The number of observations within each bygroup can be different, but there MUST be a one-to-one match of bygroups, or this will NOT work correctly. This may require some cleaning of the cutdf first.
For some reason, multiprocessing seems to be slower in testing, so it is disabled by default.
- Parameters
df (DataFrame) – input data
groupvar (str) – name of variable in df to form portfolios on
ngroups (int) – number of portfolios to form. Will be ignored if cutoffs or quant_cutoffs is passed.
cutoffs (Optional[List[Union[float, int]]]) – e.g. [100, 10000] to form three portfolios: 1 would be < 100, 2 would be between 100 and 10000, and 3 would be > 10000. Cannot be used with ngroups.
quant_cutoffs (Optional[List[float]]) – e.g. [0.1, 0.9] to form three portfolios: 1 would be the lowest 10% of the data, 2 would be between the 10th and 90th percentiles, and 3 would be the highest 10%. All will be within byvars if byvars are passed.
byvars (Union[str, List[str], None]) – name of variable(s) in df; finds portfolios within byvars. For example, if byvars='Month', would take each month and form portfolios based on the percentiles of groupvar during only that month.
cutdf (Optional[DataFrame]) – optionally determine percentiles using another dataset. See the second note.
portvar (str) – name of the portfolio variable in the output dataset
multiprocess (bool) – set to True to use all available processors, set to False to use only one, or pass an int less than or equal to the number of processors to use that many processors
- Returns
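A minimal usage sketch forming two portfolios on a single variable:

import pandas as pd
from pd_utils.port import portfolio

df = pd.DataFrame({'MarketCap': [1.0, 5.0, 2.0, 9.0, 4.0, 7.0]})
# Adds a 'portfolio' column: 1 for the bottom half, 2 for the top half
out = portfolio(df, 'MarketCap', ngroups=2)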
- pd_utils.port.portfolio_averages(df, groupvar, avgvars, ngroups=10, byvars=None, cutdf=None, wtvar=None, count=False, portvar='portfolio', avgonly=False)[source]
Creates portfolios and calculates equal- and value-weighted averages of variables within portfolios.
If ngroups=10, will form 10 portfolios, with portfolio 1 having the bottom 10 percent of groupvar and portfolio 10 having the top 10 percent of groupvar.
- Notes
Resets and drops the index in the output data, so don't use if the index is important (input data not affected).
- Parameters
df (DataFrame) – input data
groupvar (str) – name of variable in df to form portfolios on
byvars (Union[str, List[str], None]) – name of variable(s) in df; finds portfolios within byvars. For example, if byvars='Month', would take each month and form portfolios based on the percentiles of groupvar during only that month.
cutdf (Optional[DataFrame]) – optionally determine percentiles using another dataset
wtvar (Optional[str]) – name of variable in df to use for weighting in the weighted average
count (Union[str, bool]) – pass a variable name to get a count of non-missing values of that variable within groups
portvar (str) – name of the portfolio variable in the output dataset
avgonly (bool) – True to return only averages; False to return (averages, individual observations with portfolios)
- Return type
- Returns
pd_utils.query module
- pd_utils.query.select_rows_by_condition_on_columns(df, cols, condition='== 1', logic='or')[source]
Selects rows of a pandas dataframe by evaluating a condition on a subset of the dataframe's columns.
- Parameters
df (DataFrame) –
cols (List[str]) – column names, the subset of columns on which to evaluate conditions
condition (str) – needs to contain a comparison operator and the right-hand side of the comparison. For example, '== 1' checks for each row that the value of each column is equal to one.
logic (str) – 'or' or 'and'. With 'or', only one of the columns in cols needs to match the condition for the row to be kept. With 'and', all of the columns in cols need to match the condition.
- Returns
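A minimal usage sketch with the default condition '== 1' and logic 'or':

import pandas as pd
from pd_utils.query import select_rows_by_condition_on_columns

df = pd.DataFrame({'flag_a': [1, 0, 0], 'flag_b': [0, 1, 0], 'x': [10, 20, 30]})
# Keeps the first two rows; the third matches neither flag
kept = select_rows_by_condition_on_columns(df, ['flag_a', 'flag_b'])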
pd_utils.regby module
- pd_utils.regby.reg_by(df, yvar, xvars, groupvar, merge=False, cons=True, mp=False, stderr=False)[source]
Runs a regression of df[yvar] on df[xvars] by values of groupvar. Outputs a dataframe with values of groupvar and corresponding coefficients, unless merge=True, in which case it outputs the original dataframe with the appropriate coefficients merged in.
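A minimal usage sketch, one regression per value of 'group':

import pandas as pd
from pd_utils.regby import reg_by

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'y': [1.0, 2.1, 2.9, 2.0, 4.2, 6.1],
    'x': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})
# One row of coefficient estimates per group
coefs = reg_by(df, 'y', ['x'], 'group')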
pd_utils.testing module
pd_utils.timer module
- pd_utils.timer.estimate_time(length, i, start_time, output=True)[source]
Returns an estimate of when a looping operation will be finished.
- Parameters
- Returns
- Examples
This function goes at the end of the loop to be timed. Before the loop, you must start a timer as follows:

start_time = timeit.default_timer()

So the entire loop will look like this:

my_start_time = timeit.default_timer()
for i, item in enumerate(my_list):
    # Do loop stuff here
    estimate_time(len(my_list), i, my_start_time)
pd_utils.transform module
- pd_utils.transform.averages(df, avgvars, byvars, wtvar=None, count=False, flatten=True)[source]
Returns equal- and value-weighted averages of variables within groups.
- Parameters
df (DataFrame) –
avgvars (Union[str, List[str]]) – variable names to take averages of
byvars (Union[str, List[str]]) – variable names for by groups
wtvar (Optional[str]) – variable to use for calculating weights in the weighted average
count (Union[str, bool]) – pass a variable name to get a count of non-missing values of that variable within groups
- Returns
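A minimal usage sketch with a weight variable:

import pandas as pd
from pd_utils.transform import averages

df = pd.DataFrame({
    'portfolio': [1, 1, 2, 2],
    'ret': [0.01, 0.03, 0.02, 0.06],
    'mktcap': [100.0, 300.0, 200.0, 200.0],
})
# Equal-weighted and mktcap-weighted averages of ret within each portfolio
avgs = averages(df, 'ret', byvars='portfolio', wtvar='mktcap')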
- pd_utils.transform.join_col_strings(df, cols)[source]
Takes a dataframe and column name(s) and concatenates string versions of the columns with those names. Useful when a group is identified by several variables and we need one key variable to describe the group. Returns a pandas Series.
- pd_utils.transform.long_to_wide(df, groupvars, values, colindex=None, colindex_only=False)[source]
Takes a "long" format DataFrame and converts it to a "wide" format.
- Parameters
df (DataFrame) –
groupvars (Union[str, List[str]]) – variables which signify unique observations in the output dataset
values (Union[str, List[str]]) – variables which contain the values which need to be transposed
colindex (Union[str, List[str], None]) – columns containing the extension for the column names in the output dataset. If not specified, just uses the count of the row within the group. If a list is provided, each column value will be appended in order, separated by _.
colindex_only (bool) – If True, column names in the output data will be only the colindex and will not include the name of the values variable. Only valid when passing a single value; otherwise multiple columns would have the same name.
- Returns
- Examples
For example, if we had a long dataset of returns, with returns 12, 24, 36, 48, and 60 months after the date:

    ticker   ret    months
    AA       .01    12
    AA       .15    24
    AA       .21    36
    AA       -.10   48
    AA       .22    60

and we want to get this to one observation per ticker:

    ticker   ret12   ret24   ret36   ret48   ret60
    AA       .01     .15     .21     -.10    .22

We would use:

long_to_wide(df, groupvars='ticker', values='ret', colindex='months')
- pd_utils.transform.state_abbrev(df, col, toabbrev=False)[source]
Replaces a DataFrame's column of state abbreviations or state names with the opposite form.
- pd_utils.transform.var_change_by_groups(df, var, byvars, datevar='Date', numlags=1)[source]
Used for getting variable changes over time within bygroups.
- Notes
The dataset is not sorted in this process. Sort the data in the order in which you wish lags to be created before running this command.
- Parameters
df (DataFrame) – dataframe containing bygroups, a date variable, and variables of interest
var (Union[str, List[str]]) – column names of variables to get changes of
byvars (Union[str, List[str]]) – column names of variables identifying by groups
datevar (str) – column name of the variable identifying periods
- Returns
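A minimal usage sketch, with the data pre-sorted by date within each firm per the note above:

import pandas as pd
from pd_utils.transform import var_change_by_groups

df = pd.DataFrame({
    'firm': ['a', 'a', 'a', 'b', 'b', 'b'],
    'Date': pd.to_datetime(['2000-01-01', '2000-02-01', '2000-03-01'] * 2),
    'x': [1.0, 3.0, 6.0, 2.0, 2.5, 4.0],
})
# One-period changes of x within each firm
changed = var_change_by_groups(df, 'x', 'firm')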
- pd_utils.transform.winsorize(df, pct, subset=None, byvars=None, bot=True, top=True)[source]
Finds observations beyond the pct percentile and replaces them with the pct percentile value. Does this for all columns, or for the subset given by subset.
- Parameters
df (DataFrame) –
pct (Union[float, Tuple[float, float]]) – 0 < float < 1, or a pair of two such values. If two values are given, the first will be used for the bottom percentile and the second for the top. If one value is given and both bot and top are True, the same value will be used for both.
subset (Union[str, List[str], None]) – column name(s) to winsorize
byvars (Union[str, List[str], None]) – column names of columns identifying groups in the data. Winsorizing will be done within those groups.
- Return type
DataFrame
- Returns
- Examples
>>> winsorize(df, .05, subset='RET')  # replaces observations of RET below the 5% and above the 95% values
>>> winsorize(df, (.05, .1), subset='RET')  # replaces observations of RET below the 5% and above the 90% values