regtools package¶

High-level tools for running regressions. Handles fixed effects, multiway (two or more way) clustering, hypothesis testing, lagged variables, differenced variables, interaction effects, iteration tools, and producing summaries for a variety of models including OLS, Logit, Probit, Quantile, and Fama-MacBeth.

Subpackages¶

Submodules¶

regtools.args module¶

class regtools.args.RegressionSetArgs(df, yvar, xvars_list, fe_list=None, **reg_kwargs)[source]¶

Bases: object

__init__(df, yvar, xvars_list, fe_list=None, **reg_kwargs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

keys()[source]¶

regtools.chooser module¶

regtools.chooser.any_reg(reg_type, *reg_args, **reg_kwargs)[source]¶

Runs any regression.

Parameters

reg_type¶ – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
reg_args¶ –
reg_kwargs¶ –

Returns

regtools.controls module¶

regtools.controls.suppress_controls_in_summary_df(summ_df, regressor_order, dummy_col_dicts, info_dict)[source]¶

regtools.dataprep module¶

regtools.differenced module¶

regtools.differenced.create_differenced_variables(df, diff_cols, id_col='TICKER', date_col='Date', difference_lag=1, fill_method='ffill', fill_limit=None)[source]¶: Note: partially inplace

regtools.differenced.diff_reg(df, yvar, xvars, id_col, date_col, difference_lag=1, diff_cols=None, diff_fill_method='ffill', diff_fill_limit=None, **reg_kwargs)[source]¶

Fits a differenced regression.
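To make the idea of a differenced variable concrete, here is a minimal pure-Python sketch of differencing within each entity, ordered by date. It is not the regtools implementation: the helper name difference_by_entity and the row-dict layout are hypothetical, and missing-data filling (diff_fill_method) is left out.

```python
from collections import defaultdict


def difference_by_entity(rows, value_key, id_key, date_key, difference_lag=1):
    """Hypothetical helper: difference value_key within each entity.

    Observations with no value difference_lag periods earlier get None.
    """
    by_entity = defaultdict(list)
    for row in rows:
        by_entity[row[id_key]].append(row)

    out = []
    for entity_rows in by_entity.values():
        # Differences are only meaningful in time order within one entity.
        ordered = sorted(entity_rows, key=lambda r: r[date_key])
        for i, row in enumerate(ordered):
            new_row = dict(row)
            if i >= difference_lag:
                previous = ordered[i - difference_lag][value_key]
                new_row[value_key] = row[value_key] - previous
            else:
                new_row[value_key] = None
            out.append(new_row)
    return out


# Illustrative data only.
rows = [
    {"TICKER": "A", "Date": 1, "x": 10.0},
    {"TICKER": "A", "Date": 2, "x": 13.0},
    {"TICKER": "A", "Date": 3, "x": 12.0},
    {"TICKER": "B", "Date": 1, "x": 5.0},
    {"TICKER": "B", "Date": 2, "x": 9.0},
]
diffed = difference_by_entity(rows, "x", "TICKER", "Date")
```

The first observation of each entity has nothing to subtract, which is why a differenced regression loses difference_lag rows per entity.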

Parameters

df¶ (DataFrame) –
yvar¶ (str) – column name of outcome y variable
xvars¶ (Sequence[str]) – column names of x variables for regression
id_col¶ (str) – column name of variable representing entities in the data
date_col¶ (str) – column name of variable representing time in the data
difference_lag¶ (int) – Number of lags to use for difference
diff_cols¶ (Optional[Sequence[str]]) – columns to take differences on
diff_fill_method¶ (str) – pandas fill methods, ‘ffill’ or ‘bfill’
diff_fill_limit¶ (Optional[int]) – maximum number of periods to fill missing data, default no limit
reg_kwargs¶ –

Returns

regtools.ext_statsmodels module¶

regtools.ext_statsmodels.summary_col(results, float_format='%.4f', model_names=[], stars=False, info_dict=None, regressor_order=[])[source]¶

Summarize multiple results instances side-by-side (coefs and SEs)

Parameters

results¶ – statsmodels results instance or list of result instances
float_format¶ (string) – float format for coefficients and standard errors. Default: ‘%.4f’
model_names¶ (list of strings) – list of length len(results). If the names are not unique, a roman number will be appended to all model names
stars¶ (bool) – print significance stars
info_dict¶ (dict) – dict of lambda functions to be applied to results instances to retrieve model info. To use specific information for different models, add a (nested) info_dict with model name as the key. Example: info_dict = {“N”:…, “R2”: …, “OLS”:{“R2”:…}} would only show R2 for OLS regression models, but additionally N for all other results. Default: None (use the info_dict specified in result.default_model_infos, if this property exists)
regressor_order¶ (list of strings) – list of names of the regressors in the desired order. All regressors not specified will be appended to the end of the list.
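The nested info_dict convention can be easier to follow with a small example. The sketch below is only an illustration of the lookup rule described for info_dict, not the code summary_col runs: resolve_info is a hypothetical helper, and the SimpleNamespace objects are stand-ins for fitted results with made-up attribute values.

```python
from types import SimpleNamespace


def resolve_info(info_dict, model_name, result):
    """Hypothetical helper: pick the info functions that apply to one model."""
    nested = info_dict.get(model_name)
    if isinstance(nested, dict):
        # A nested dict keyed by the model name overrides the shared entries.
        chosen = nested
    else:
        # Otherwise use the top-level entries that are functions.
        chosen = {k: f for k, f in info_dict.items() if callable(f)}
    return {label: func(result) for label, func in chosen.items()}


info_dict = {
    "N": lambda res: str(int(res.nobs)),
    "R2": lambda res: "%.2f" % res.rsquared,
    "OLS": {"R2": lambda res: "%.2f" % res.rsquared},
}

# Stand-ins for fitted results; the numbers are illustrative only.
ols_result = SimpleNamespace(nobs=120.0, rsquared=0.4312)
other_result = SimpleNamespace(nobs=98.0, rsquared=0.2179)

ols_info = resolve_info(info_dict, "OLS", ols_result)
other_info = resolve_info(info_dict, "Logit", other_result)
```

Here the model named "OLS" reports only R2, while any other model reports both N and R2.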

regtools.ext_statsmodels.update_statsmodel_result_with_new_cov_matrix(result, cov_matrix)[source]¶

Note: inplace

Statsmodels results have caching going on. Need to update all the properties which depend on the covariance matrix

regtools.interact module¶

regtools.interact.create_interaction_variables(df, interaction_tuples)[source]¶: Note: inplace

regtools.interact.delete_interaction_variables(df, interaction_tuples)[source]¶: Note: inplace

regtools.iter module¶

regtools.iter.reg_for_each_combo(df, yvar, xvars, reg_type='reg', **reg_kwargs)[source]¶

Takes each possible combination of xvars (starting from each var individually, then each pair of vars, etc. all the way up to all xvars), and regresses yvar on each set of xvars.

Parameters

df¶ (DataFrame) –
yvar¶ (str) – column name of y variable
xvars¶ (Sequence[str]) – column names of x variables
reg_type¶ (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
reg_kwargs¶ –

Returns

a list of fitted regressions
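The set of models this implies can be enumerated with itertools.combinations. The sketch below only builds the lists of x variables and runs no regressions; all_xvar_combos and the column names are hypothetical.

```python
from itertools import combinations


def all_xvar_combos(xvars):
    """Hypothetical helper: every non-empty subset of xvars, smallest first."""
    combos = []
    for size in range(1, len(xvars) + 1):
        combos.extend(list(c) for c in combinations(xvars, size))
    return combos


combos = all_xvar_combos(["size", "leverage", "age"])
```

With n x variables there are 2**n - 1 combinations, so the number of regressions roughly doubles with each added variable.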

regtools.iter.reg_for_each_combo_select_and_produce_summary(df, yvar, xvars, robust=True, cluster=False, keepnum=5, stderr=False, t_stats=True, float_format='%0.1f', regressor_order=(), **other_reg_kwargs)[source]¶

Convenience function to run regressions for every combination of xvars, select the best models, and present them in a summary format.

Parameters

df¶ (DataFrame) –
yvar¶ (str) – column name of y variable
xvars¶ (Sequence[str]) – column names of x variables
robust¶ (bool) – False to not use heteroskedasticity-robust standard errors
cluster¶ (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name, or multiple column names for multi-way clustering
keepnum¶ (int) – number of models to keep for each count of x variables. The total number of outputted regressions will be roughly keepnum * len(xvars)
stderr¶ (bool) – set to True to keep rows for standard errors below coefficient estimates
t_stats¶ (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats
float_format¶ (str) – format string for how to format results in summary
regressor_order¶ (Sequence[str]) – sequence of column names to put first in the regression results
reg_type¶ – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
other_reg_kwargs¶ –

Returns

a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results

regtools.iter.reg_for_each_lag(df, yvar, xvars, lag_tuple=(1, 2, 3, 4), reg_type='reg', **reg_kwargs)[source]¶

Convenience function to run regressions with the same y and x variables for every passed number of lags

Parameters

df¶ (DataFrame) –
yvar¶ (str) – column name of y variable
xvars¶ (Sequence[str]) – column names of x variables
lag_tuple¶ (Sequence[int]) – sequence of lag counts; for each number of lags in the sequence, the lags are created and a regression is run
reg_type¶ (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
reg_kwargs¶ –

Returns

regtools.iter.reg_for_each_lag_and_produce_summary(df, yvar, xvars, regressor_order=(), lag_tuple=(1, 2, 3, 4), consolidate_lags=True, reg_type='reg', stderr=False, t_stats=True, float_format='%0.2f', suppress_other_regressors=False, **reg_kwargs)[source]¶

Convenience function to run regressions with the same y and x variables for every passed number of lags and produce a summary.

Parameters

df¶ –
yvar¶ – column name of y variable
xvars¶ – column names of x variables
regressor_order¶ (Sequence[str]) – sequence of column names to put first in the regression results
lag_tuple¶ (Sequence[int]) – sequence of lag counts; for each number of lags in the sequence, the lags are created and a regression is run
consolidate_lags¶ (bool) – True to condense lags for a single variable into a single row
reg_type¶ (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
stderr¶ (bool) – set to True to keep rows for standard errors below coefficient estimates
t_stats¶ (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats
float_format¶ (str) – format string for how to format results in summary
suppress_other_regressors¶ (bool) – only used if regressor_order is passed. Then pass True to hide all coefficient rows besides those in regressor_order
reg_kwargs¶ –

Returns

a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results

regtools.iter.reg_for_each_xvar_set(df, yvar, xvars_list, reg_type='reg', **reg_kwargs)[source]¶

Runs regressions on the same y variable for each set of x variables passed. xvars_list should be a list of lists, where each individual list is one set of x variables for one model.

Notes

If fe is passed, it should be either a single string, to use the same fixed effects in all models, or a list whose elements are strings or None, with the same length as the number of models.

Parameters

df¶ (DataFrame) –
yvar¶ (str) – column name of y variable
xvars_list¶ (Sequence[Sequence[str]]) – sequence where each element is itself a sequence containing column names of x variables for that numbered regression
reg_type¶ (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
reg_kwargs¶ –

Returns

a list of fitted regressions
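The two accepted forms of fe can be pictured as being expanded to one entry per model. The sketch below is an assumption about how such an expansion could look, not the library's code; expand_fe and the column names are hypothetical.

```python
def expand_fe(fe, num_models):
    """Hypothetical helper: return one fixed-effects entry per model."""
    if fe is None or isinstance(fe, str):
        # A single value (or no fixed effects) applies to every model.
        return [fe] * num_models
    fe_list = list(fe)
    if len(fe_list) != num_models:
        raise ValueError(
            "fe has %d entries but there are %d models" % (len(fe_list), num_models)
        )
    return fe_list


xvars_list = [["size"], ["size", "leverage"], ["size", "leverage", "age"]]
same_fe = expand_fe("industry", len(xvars_list))
mixed_fe = expand_fe(["industry", None, "year"], len(xvars_list))
```

Passing None inside the list turns fixed effects off for that model only.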

regtools.iter.reg_for_each_xvar_set_and_produce_summary(df, yvar, xvars_list, robust=True, cluster=False, stderr=False, t_stats=True, fe=None, float_format='%0.2f', suppress_other_regressors=False, regressor_order=(), **other_reg_kwargs)[source]¶

Convenience function to run regressions for every set of xvars passed and present them in a summary format.

Notes

Only specify at most one of robust and cluster.
Don’t set both stderr and t_stats to True

Parameters

df¶ (DataFrame) –
yvar¶ (str) – column name of y variable
xvars_list¶ (Sequence[Sequence[str]]) – sequence where each element is itself a sequence containing column names of x variables for that numbered regression
robust¶ (bool) – False to not use heteroskedasticity-robust standard errors
cluster¶ (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name, or multiple column names for multi-way clustering
stderr¶ (bool) – set to True to keep rows for standard errors below coefficient estimates
t_stats¶ (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats
fe¶ (Union[str, Sequence[Optional[str]], None]) – either a single string, to use the same fixed effects in all models, or a list whose elements are strings or None, with the same length as the number of models
float_format¶ (str) – format string for how to format results in summary
suppress_other_regressors¶ (bool) – only used if regressor_order is passed. Then pass True to hide all coefficient rows besides those in regressor_order
regressor_order¶ (Sequence[str]) – sequence of column names to put first in the regression results
reg_type¶ – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
other_reg_kwargs¶ –

Returns

a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results.

regtools.iter.reg_for_each_yvar(df, yvars, xvars, reg_type='reg', **reg_kwargs)[source]¶

Convenience function to run regressions for multiple y variables with the same x variables

Parameters

df¶ –
yvars¶ – column names of y variables
xvars¶ – column names of x variables
reg_type¶ – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
reg_kwargs¶ –

Returns

a list of fitted regressions

regtools.iter.reg_for_each_yvar_and_produce_summary(df, yvars, xvars, regressor_order=(), reg_type='reg', stderr=False, t_stats=True, float_format='%0.2f', **reg_kwargs)[source]¶

Convenience function to run regressions for multiple y variables with the same x variables and present them in a summary format.

Parameters

df¶ (DataFrame) –
yvars¶ (Sequence[str]) – column names of y variables
xvars¶ (Sequence[str]) – column names of x variables
regressor_order¶ (Sequence[str]) – sequence of column names to put first in the regression results
reg_type¶ (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth
stderr¶ (bool) – set to True to keep rows for standard errors below coefficient estimates
t_stats¶ (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats
float_format¶ (str) – format string for how to format results in summary
reg_kwargs¶ –

Returns

a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results.

regtools.models module¶

regtools.models.get_model_class_by_string(model_string)[source]¶

regtools.models.get_model_name_by_string(model_string)[source]¶

Return type: str

regtools.order module¶

regtools.order.convert_regressor_order_for_diff(regressor_order, reg_kwargs)[source]¶

regtools.order.convert_regressor_order_for_lags(regressor_order, reg_kwargs)[source]¶

regtools.quantile module¶

regtools.quantile.quantile_plot_from_quantile_result_df(result_df, yvar, main_iv, outpath=None, clear_figure=True)[source]¶

Creates a plot of effect of main_iv on yvar at different quantiles. To be used after reg_for_each_quantile_produce_result_df

Parameters

result_df¶ – pd.DataFrame, result from reg_for_each_quantile_produce_result_df
yvar¶ – str, label of dependent variable
main_iv¶ – str, label of independent variable of interest
outpath¶ – str, filepath to output figure. must include matplotlib supported extension such as .pdf or .png
clear_figure¶ – bool, True to wipe the matplotlib figure from memory after running the function

Returns

regtools.quantile.quantile_reg(df, yvar, xvars, q=0.5, robust=True, cluster=False, cons=True, fe=None, interaction_tuples=None, num_lags=0, lag_variables='xvars', lag_period_var='Date', lag_id_var='TICKER', lag_fill_method='ffill', lag_fill_limit=None)[source]¶

Returns a fitted quantile regression. Takes df, produces a regression df with no missing values among the needed variables, and fits a regression model. If robust is specified, uses heteroskedasticity-robust standard errors. If cluster is specified, calculates clustered standard errors by the given variable.

Notes

Only specify at most one of robust and cluster.

Parameters

df¶ (DataFrame) –
yvar¶ (str) – column name of outcome y variable
xvars¶ (Sequence[str]) – column names of x variables for regression
q¶ (float) – quantile to use
robust¶ (bool) – set to True to use heteroskedasticity-robust standard errors
cluster¶ (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name. set to multiple column names for multiway clustering following Cameron, Gelbach, and Miller (2011). NOTE: will get exponentially slower as more cluster variables are added
cons¶ (bool) – set to False to not include a constant in the regression
fe¶ (Union[str, Sequence[str], None]) – If a str or list of strs is passed, uses these categorical variables to construct dummies for fixed effects.
interaction_tuples¶ (Union[Tuple[str, str], Sequence[Tuple[str, str]], None]) – tuple or list of tuples of column names to interact and include as xvars
num_lags¶ (int) – Number of periods to lag variables. Setting to other than 0 will activate lags
lag_variables¶ (Union[str, Sequence[str]]) – ‘all’, ‘xvars’, or list of strs of names of columns to lag for regressions.
lag_period_var¶ (str) – only used if lag_variables is not None. name of column which contains period variable for lagging
lag_id_var¶ (str) – only used if lag_variables is not None. name of column which contains identifier variable for lagging
lag_fill_method¶ (Optional[str]) – ‘ffill’ or ‘bfill’ for which method to use to fill in missing rows when creating lag variables. Set to None to not fill and have missing instead. See pandas.DataFrame.fillna for more details
lag_fill_limit¶ (Optional[int]) – maximum number of periods to fill with lag_fill_method

Returns

statsmodels regression result

regtools.quantile.reg_for_each_quantile_output_plot(main_iv, *reg_args, num_quantiles=8, main_iv_label=None, outpath=None, clear_figure=True, **reg_kwargs)[source]¶

Creates a plot of effect of main_iv on yvar at different quantiles. To be used after reg_for_each_quantile_produce_result_df

Parameters

main_iv¶ – str, column name of independent variable of interest
reg_args¶ – see quantile_reg for args
num_quantiles¶ – number of quantile regressions to run. will be spaced evenly. higher numbers produce smoother graphs but take longer to run
main_iv_label¶ – str, label of independent variable of interest
outpath¶ – str, filepath to output figure. must include matplotlib supported extension such as .pdf or .png
clear_figure¶ – bool, True to wipe the matplotlib figure from memory after running the function
reg_kwargs¶ – see quantile_reg for kwargs

Returns

result_df from reg_for_each_quantile_produce_result_df

regtools.quantile.reg_for_each_quantile_produce_result_df(main_iv, *reg_args, num_quantiles=8, **reg_kwargs)[source]¶

Produce result DataFrame of running multiple quantile regressions spaced out between the (0,1) interval
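As an illustration of what evenly spaced quantiles inside the open (0,1) interval could look like, the sketch below uses the rule q_i = i / (num_quantiles + 1). This spacing rule is an assumption made for the example; the documentation does not state the exact rule regtools uses, and evenly_spaced_quantiles is a hypothetical helper.

```python
def evenly_spaced_quantiles(num_quantiles):
    """Hypothetical helper: num_quantiles points strictly inside (0, 1)."""
    step = 1.0 / (num_quantiles + 1)
    return [step * i for i in range(1, num_quantiles + 1)]


quantiles = evenly_spaced_quantiles(8)
```

Each value would then be passed as q to a separate quantile regression, which is why larger num_quantiles gives a smoother plot but takes longer.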

Parameters

main_iv¶ – str, column name of independent variable of interest
reg_args¶ – see quantile_reg for args
num_quantiles¶ – number of quantile regressions to run. will be spaced evenly. higher numbers produce smoother graphs but take longer to run
reg_kwargs¶ – see quantile_reg for kwargs

Returns

regtools.reg module¶

regtools.reg.reg(df, yvar, xvars, robust=True, cluster=False, cons=True, fe=None, interaction_tuples=None, num_lags=0, lag_variables='xvars', lag_period_var='Date', lag_id_var='TICKER', lag_fill_method='ffill', reg_type='OLS')[source]¶

Returns a fitted regression. Takes df, produces a regression df with no missing values among the needed variables, and fits a regression model. If robust is specified, uses heteroskedasticity-robust standard errors. If cluster is specified, calculates clustered standard errors by the given variable.

Notes

Only specify at most one of robust and cluster.

Parameters

df¶ (DataFrame) –
yvar¶ (str) – column name of outcome y variable
xvars¶ (Sequence[str]) – column names of x variables for regression
robust¶ (bool) – set to True to use heteroskedasticity-robust standard errors
cluster¶ (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name. set to multiple column names for multiway clustering following Cameron, Gelbach, and Miller (2011). NOTE: will get exponentially slower as more cluster variables are added
cons¶ (bool) – set to False to not include a constant in the regression
fe¶ (Union[str, Sequence[str], None]) – If a str or list of strs is passed, uses these categorical variables to construct dummies for fixed effects.
interaction_tuples¶ (Union[Tuple[str, str], Sequence[Tuple[str, str]], None]) – tuple or list of tuples of column names to interact and include as xvars
num_lags¶ (int) – Number of periods to lag variables. Setting to other than 0 will activate lags
lag_variables¶ (Union[str, Sequence[str]]) – ‘all’, ‘xvars’, or list of strs of names of columns to lag for regressions.
lag_period_var¶ (str) – only used if lag_variables is not None. name of column which contains period variable for lagging
lag_id_var¶ (str) – only used if lag_variables is not None. name of column which contains identifier variable for lagging
lag_fill_method¶ (Optional[str]) – ‘ffill’ or ‘bfill’ for which method to use to fill in missing rows when creating lag variables. Set to None to not fill and have missing instead. See pandas.DataFrame.fillna for more details
reg_type¶ (str) – ‘OLS’, ‘probit’, or ‘logit’ for type of model

Returns

statsmodels regression result.
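The note that multiway clustering slows down exponentially follows from the structure of the Cameron, Gelbach, and Miller (2011) estimator, which combines one clustered covariance matrix per non-empty subset of the cluster variables by inclusion-exclusion: subsets of odd size are added and subsets of even size are subtracted. The sketch below only enumerates those terms and their signs and computes no covariance matrices; cluster_combination_signs and the column names are hypothetical, and this is not necessarily how regtools organizes the computation.

```python
from itertools import combinations


def cluster_combination_signs(cluster_vars):
    """Hypothetical helper: (subset, sign) terms of the inclusion-exclusion sum."""
    terms = []
    for size in range(1, len(cluster_vars) + 1):
        # Odd-sized intersections are added, even-sized ones subtracted.
        sign = 1 if size % 2 == 1 else -1
        for combo in combinations(cluster_vars, size):
            terms.append((combo, sign))
    return terms


two_way = cluster_combination_signs(["firm", "year"])
three_way = cluster_combination_signs(["firm", "year", "industry"])
```

For two-way clustering this gives V = V_firm + V_year - V_(firm, year). With G cluster variables there are 2**G - 1 terms, each requiring its own clustered covariance estimate.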

regtools.regtypes module¶

regtools.select module¶

regtools.select.select_models(reg_list, keepnum, xvars)[source]¶

Takes a list of fitted regression models and selects among them based on adjusted R-squared. For each number of variables involved in the regressions, the keepnum models with the highest values will be kept.

For example, if reg_list contains 3 regressions with two variables and 6 regressions with three variables, and keepnum is 2, will return a list of four regressions, 2 with two variables and 2 with three variables.
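The selection rule in the example above can be sketched in plain Python. The dictionaries below stand in for fitted models and their adjusted R-squared values are made up for illustration; select_top_by_size is a hypothetical helper, not the select_models implementation.

```python
from collections import defaultdict


def select_top_by_size(models, keepnum):
    """Hypothetical helper: keep the keepnum best models per number of x variables."""
    by_size = defaultdict(list)
    for model in models:
        by_size[len(model["xvars"])].append(model)

    selected = []
    for size in sorted(by_size):
        ranked = sorted(by_size[size], key=lambda m: m["rsquared_adj"], reverse=True)
        selected.extend(ranked[:keepnum])
    return selected


# Three two-variable models and six three-variable models (made-up values).
models = [
    {"xvars": ["a", "b"], "rsquared_adj": 0.10},
    {"xvars": ["a", "c"], "rsquared_adj": 0.30},
    {"xvars": ["b", "c"], "rsquared_adj": 0.20},
] + [
    {"xvars": ["a", "b", "c"], "rsquared_adj": 0.05 * i} for i in range(1, 7)
]
kept = select_top_by_size(models, keepnum=2)
```

The result holds four models: the two best two-variable models followed by the two best three-variable models.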

Parameters

reg_list¶ –
keepnum¶ (int) – number of models to keep for each count of x variables. The total number of outputted regressions will be roughly keepnum * len(xvars)
xvars¶ (Sequence[str]) – column names of x variables

Returns

regtools.tools module¶