regtools package

High-level tools for running regressions. Handles fixed effects, multiway (two or more) clustering, hypothesis testing, lagged variables, differenced variables, interaction effects, iteration tools, and summary production for a variety of models, including OLS, Logit, Probit, Quantile, and Fama-MacBeth.

Submodules

regtools.args module

class regtools.args.RegressionSetArgs(df, yvar, xvars_list, fe_list=None, **reg_kwargs)

Bases: object

__init__(df, yvar, xvars_list, fe_list=None, **reg_kwargs)

keys()

regtools.chooser module

regtools.chooser.any_reg(reg_type, *reg_args, **reg_kwargs)

Runs any regression.

Parameters
  • reg_type – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • reg_args

  • reg_kwargs
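Choosing a routine from a reg_type string can be sketched as a dictionary lookup. This is only an illustration of the idea, not the package's implementation: run_ols, run_logit, DISPATCH, and dispatch_reg are hypothetical names, and the case-insensitive lookup is an assumption of the sketch.

```python
from typing import Any, Callable, Dict


def run_ols(*args: Any, **kwargs: Any) -> str:
    # Hypothetical stand-in for an OLS estimation routine.
    return "ols result"


def run_logit(*args: Any, **kwargs: Any) -> str:
    # Hypothetical stand-in for a Logit estimation routine.
    return "logit result"


# Map each reg_type string to the callable that handles it.
DISPATCH: Dict[str, Callable[..., str]] = {
    "ols": run_ols,
    "logit": run_logit,
}


def dispatch_reg(reg_type: str, *reg_args: Any, **reg_kwargs: Any) -> str:
    """Look up reg_type (case-insensitively here) and forward the other arguments."""
    try:
        func = DISPATCH[reg_type.lower()]
    except KeyError:
        raise ValueError(f"unknown reg_type: {reg_type!r}") from None
    return func(*reg_args, **reg_kwargs)
```

An unknown string fails early with a ValueError instead of silently falling back to a default model.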

Returns

regtools.controls module

regtools.controls.suppress_controls_in_summary_df(summ_df, regressor_order, dummy_col_dicts, info_dict)

regtools.dataprep module

regtools.differenced module

regtools.differenced.create_differenced_variables(df, diff_cols, id_col='TICKER', date_col='Date', difference_lag=1, fill_method='ffill', fill_limit=None)

Note: partially inplace
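As a rough illustration of what a within-entity difference looks like, the sketch below uses plain pandas. It is not the package's code: the x_diff column name is an assumption, and the row-based difference assumes each entity has one row per period with no gaps (the fill options above presumably exist to handle gaps).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "TICKER": ["A", "A", "A", "B", "B", "B"],
        "Date": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"] * 2),
        "x": [1.0, 3.0, 6.0, 10.0, 10.0, 7.0],
    }
)

difference_lag = 1
# Sort so that "previous row" means "previous period" within each entity.
df = df.sort_values(["TICKER", "Date"])
# Difference within each entity; the first row of each entity has no
# predecessor, so its difference is missing (NaN).
# The "x_diff" column name is an assumption made for this sketch.
df["x_diff"] = df.groupby("TICKER")["x"].diff(difference_lag)
```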

regtools.differenced.diff_reg(df, yvar, xvars, id_col, date_col, difference_lag=1, diff_cols=None, diff_fill_method='ffill', diff_fill_limit=None, **reg_kwargs)

Fits a differenced regression.

Parameters
  • df (DataFrame) –

  • yvar (str) – column name of outcome y variable

  • xvars (Sequence[str]) – column names of x variables for regression

  • id_col (str) – column name of variable representing entities in the data

  • date_col (str) – column name of variable representing time in the data

  • difference_lag (int) – Number of lags to use for difference

  • diff_cols (Optional[Sequence[str]]) – columns to take differences on

  • diff_fill_method (str) – pandas fill method, either ‘ffill’ or ‘bfill’

  • diff_fill_limit (Optional[int]) – maximum number of periods to fill missing data, default no limit

  • reg_kwargs

Returns

regtools.ext_statsmodels module

regtools.ext_statsmodels.summary_col(results, float_format='%.4f', model_names=[], stars=False, info_dict=None, regressor_order=[])

Summarize multiple results instances side-by-side (coefs and SEs)

Parameters
  • results – statsmodels results instance or list of result instances

  • float_format (string) – float format for coefficients and standard errors. Default: ‘%.4f’

  • model_names (list of strings) – list of length len(results). If the names are not unique, a roman number will be appended to all model names

  • stars (bool) – print significance stars

  • info_dict (dict) – dict of lambda functions to be applied to results instances to retrieve model info. To use specific information for different models, add a (nested) info_dict with model name as the key. Example: info_dict = {“N”:…, “R2”: …, “OLS”:{“R2”:…}} would only show R2 for OLS regression models, but additionally N for all other results. Default: None (use the info_dict specified in result.default_model_infos, if this property exists)

  • regressor_order (list of strings) – list of names of the regressors in the desired order. All regressors not specified will be appended to the end of the list.
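The info_dict argument maps a row label to a function of one results instance. A minimal sketch of that shape follows, using SimpleNamespace objects as hypothetical stand-ins for fitted results (the nobs and rsquared attributes mirror those on statsmodels OLS results). It shows only how such functions are applied, not how summary_col lays out its table.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for fitted results; real results objects expose
# attributes such as nobs and rsquared.
results = [
    SimpleNamespace(nobs=120.0, rsquared=0.4567),
    SimpleNamespace(nobs=98.0, rsquared=0.3123),
]

# Each value is a function that takes one results instance and returns a string.
info_dict = {
    "N": lambda res: f"{int(res.nobs):d}",
    "R2": lambda res: f"{res.rsquared:.2f}",
}

# Apply every function to every result, giving one row per statistic.
info_rows = {
    label: [func(res) for res in results] for label, func in info_dict.items()
}
print(info_rows)
```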

regtools.ext_statsmodels.update_statsmodel_result_with_new_cov_matrix(result, cov_matrix)

Note: inplace

statsmodels results cache their derived properties, so every property that depends on the covariance matrix needs to be updated.

regtools.interact module

regtools.interact.create_interaction_variables(df, interaction_tuples)

Note: inplace

regtools.interact.delete_interaction_variables(df, interaction_tuples)

Note: inplace
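Creating and deleting interaction columns in place might look like the sketch below. The helper names and the "a*b" column naming are assumptions made for illustration; the package's own naming is not documented here.

```python
import pandas as pd


def add_interactions(df, interaction_tuples):
    """Add one product column per tuple to df in place; return the new names."""
    created = []
    for left, right in interaction_tuples:
        name = f"{left}*{right}"  # assumed naming scheme for this sketch
        df[name] = df[left] * df[right]
        created.append(name)
    return created


def drop_interactions(df, names):
    """Remove the given columns from df in place."""
    df.drop(columns=names, inplace=True)


data = pd.DataFrame({"size": [1.0, 2.0, 3.0], "leverage": [0.5, 0.25, 2.0]})
new_cols = add_interactions(data, [("size", "leverage")])
```

Because both helpers mutate the DataFrame that is passed in, callers who need the original should pass a copy.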

regtools.iter module

regtools.iter.reg_for_each_combo(df, yvar, xvars, reg_type='reg', **reg_kwargs)

Takes each possible combination of xvars (starting from each var individually, then each pair of vars, and so on up to all xvars), and regresses yvar on each set of xvars.

Parameters
  • df (DataFrame) –

  • yvar (str) – column name of y variable

  • xvars (Sequence[str]) – column names of x variables

  • reg_type (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • reg_kwargs

Returns

a list of fitted regressions
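The enumeration described above (singles, then pairs, up to the full set) can be sketched with itertools.combinations. The helper name all_xvar_combos is hypothetical, and the sketch only builds the variable sets; it does not fit any model.

```python
from itertools import combinations


def all_xvar_combos(xvars):
    """Every non-empty subset of xvars, smallest subsets first."""
    return [
        list(combo)
        for size in range(1, len(xvars) + 1)
        for combo in combinations(xvars, size)
    ]


combos = all_xvar_combos(["x1", "x2", "x3"])
for combo in combos:
    print(combo)
```

With n variables this produces 2**n - 1 sets, so the number of regressions grows quickly as xvars gets longer.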

regtools.iter.reg_for_each_combo_select_and_produce_summary(df, yvar, xvars, robust=True, cluster=False, keepnum=5, stderr=False, t_stats=True, float_format='%0.1f', regressor_order=(), **other_reg_kwargs)

Convenience function to run regressions for every combination of xvars, select the best models, and present them in a summary format.

Parameters
  • df (DataFrame) –

  • yvar (str) – column name of y variable

  • xvars (Sequence[str]) – column names of x variables

  • robust (bool) – False to not use heteroskedasticity-robust standard errors

  • cluster (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name, or multiple column names for multi-way clustering

  • keepnum (int) – number of models to keep for each number of x variables. The total number of outputted regressions will be roughly keepnum * len(xvars)

  • stderr (bool) – set to True to keep rows for standard errors below coefficient estimates

  • t_stats (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats

  • float_format (str) – format string for how to format results in summary

  • regressor_order (Sequence[str]) – sequence of column names to put first in the regression results

  • reg_type – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • other_reg_kwargs

Returns

a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results
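To see why the output size is only roughly keepnum * len(xvars), the counts can be worked out directly. The sketch assumes, as the select_models description suggests, that at most keepnum models are kept for each number of x variables; count_models is a hypothetical helper.

```python
from math import comb


def count_models(num_xvars, keepnum):
    """Return (regressions run, regressions kept) for all-subsets selection."""
    sizes = range(1, num_xvars + 1)
    total = sum(comb(num_xvars, k) for k in sizes)
    # A size with fewer than keepnum possible subsets keeps all of them.
    kept = sum(min(keepnum, comb(num_xvars, k)) for k in sizes)
    return total, kept


total, kept = count_models(num_xvars=6, keepnum=5)
print(total, kept)
```

With six x variables and keepnum=5, 63 regressions are run and 26 are kept, a little under 5 * 6 = 30 because only one model uses all six variables.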

regtools.iter.reg_for_each_lag(df, yvar, xvars, lag_tuple=(1, 2, 3, 4), reg_type='reg', **reg_kwargs)

Convenience function to run regressions with the same y and x variables for every passed number of lags

Parameters
  • df (DataFrame) –

  • yvar (str) – column name of y variable

  • xvars (Sequence[str]) – column names of x variables

  • lag_tuple (Sequence[int]) – sequence of lag counts; for each count, the variables are lagged by that many periods and a regression is run

  • reg_type (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • reg_kwargs

Returns

regtools.iter.reg_for_each_lag_and_produce_summary(df, yvar, xvars, regressor_order=(), lag_tuple=(1, 2, 3, 4), consolidate_lags=True, reg_type='reg', stderr=False, t_stats=True, float_format='%0.2f', suppress_other_regressors=False, **reg_kwargs)

Convenience function to run regressions with the same y and x variables for every passed number of lags and produce a summary.

Parameters
  • df

  • yvar – column name of y variable

  • xvars – column names of x variables

  • regressor_order (Sequence[str]) – sequence of column names to put first in the regression results

  • lag_tuple (Sequence[int]) – sequence of lag counts; for each count, the variables are lagged by that many periods and a regression is run

  • consolidate_lags (bool) – True to condense lags for a single variable into a single row

  • reg_type (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • stderr (bool) – set to True to keep rows for standard errors below coefficient estimates

  • t_stats (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats

  • float_format (str) – format string for how to format results in summary

  • suppress_other_regressors (bool) – only used if regressor_order is passed. Then pass True to hide all coefficient rows besides those in regressor_order

  • reg_kwargs

Returns

a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results
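Lagging a variable by each entry of lag_tuple can be sketched with a within-entity shift in pandas. This is an illustration rather than the package's code: the x_lag1-style column names are assumptions, and shifting by rows assumes one row per entity and period with no gaps.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "TICKER": ["A", "A", "A", "B", "B", "B"],
        "Date": [1, 2, 3, 1, 2, 3],
        "x": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
    }
)

lag_tuple = (1, 2)
df = df.sort_values(["TICKER", "Date"])
for num_lags in lag_tuple:
    # Shift within each entity so values never leak from one entity to another.
    # The "x_lag1"-style column name is an assumption made for this sketch.
    df[f"x_lag{num_lags}"] = df.groupby("TICKER")["x"].shift(num_lags)
```

Each lag costs num_lags observations per entity, so regressions with longer lags run on smaller samples.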

regtools.iter.reg_for_each_xvar_set(df, yvar, xvars_list, reg_type='reg', **reg_kwargs)

Runs regressions on the same y variable for each set of x variables passed. xvars_list should be a list of lists, where each individual list is one set of x variables for one model.

Notes

If fe is passed, it should be either a string, to use the same fixed effects in all models, or a list whose elements are strings or None, with one element per model.

Parameters
  • df (DataFrame) –

  • yvar (str) – column name of y variable

  • xvars_list (Sequence[Sequence[str]]) – sequence where each element is itself a sequence containing column names of x variables for that numbered regression

  • reg_type (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • reg_kwargs

Returns

a list of fitted regressions
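The rule for fe in the Notes above amounts to expanding a single value into one entry per model. A minimal sketch of that rule follows; expand_fe is a hypothetical helper, not part of the package.

```python
from typing import List, Optional, Sequence, Union


def expand_fe(
    fe: Union[str, Sequence[Optional[str]], None], num_models: int
) -> List[Optional[str]]:
    """Return one fixed-effects entry (column name or None) per model."""
    if fe is None or isinstance(fe, str):
        # A single value applies to every model.
        return [fe] * num_models
    fe_list = list(fe)
    if len(fe_list) != num_models:
        raise ValueError(
            f"got {len(fe_list)} fe entries for {num_models} models"
        )
    return fe_list


xvars_list = [["x1"], ["x1", "x2"], ["x1", "x2", "x3"]]
per_model_fe = expand_fe([None, "industry", "industry"], len(xvars_list))
```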

regtools.iter.reg_for_each_xvar_set_and_produce_summary(df, yvar, xvars_list, robust=True, cluster=False, stderr=False, t_stats=True, fe=None, float_format='%0.2f', suppress_other_regressors=False, regressor_order=(), **other_reg_kwargs)

Convenience function to run regressions for every set of xvars passed and present them in a summary format.

Notes

  • Only specify at most one of robust and cluster.

  • Don’t set both stderr and t_stats to True

Parameters
  • df (DataFrame) –

  • yvar (str) – column name of y variable

  • xvars_list (Sequence[Sequence[str]]) – sequence where each element is itself a sequence containing column names of x variables for that numbered regression

  • robust (bool) – False to not use heteroskedasticity-robust standard errors

  • cluster (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name, or multiple column names for multi-way clustering

  • stderr (bool) – set to True to keep rows for standard errors below coefficient estimates

  • t_stats (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats

  • fe (Union[str, Sequence[Optional[str]], None]) – either a string, to use the same fixed effects in all models, or a list whose elements are strings or None, with one element per model

  • float_format (str) – format string for how to format results in summary

  • suppress_other_regressors (bool) – only used if regressor_order is passed. Then pass True to hide all coefficient rows besides those in regressor_order

  • regressor_order (Sequence[str]) – sequence of column names to put first in the regression results

  • reg_type – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • other_reg_kwargs

Returns

a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results.
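The t_stats option replaces each standard error with the ratio of the coefficient to its standard error. The arithmetic is sketched below with made-up numbers; it assumes float_format is a printf-style format string like the '%0.2f' default and does not reproduce the package's summary layout.

```python
# Illustrative values only, not output from a real regression.
coefs = {"x1": 0.50, "x2": -1.50}
std_errs = {"x1": 0.25, "x2": 0.50}

float_format = "%0.2f"

# t-statistic = coefficient / standard error, formatted for display.
t_stats = {name: coefs[name] / std_errs[name] for name in coefs}
formatted = {name: float_format % value for name, value in t_stats.items()}
print(formatted)
```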

regtools.iter.reg_for_each_yvar(df, yvars, xvars, reg_type='reg', **reg_kwargs)

Convenience function to run regressions for multiple y variables with the same x variables

Parameters
  • df

  • yvars – column names of y variables

  • xvars – column names of x variables

  • reg_type – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • reg_kwargs

Returns

a list of fitted regressions

regtools.iter.reg_for_each_yvar_and_produce_summary(df, yvars, xvars, regressor_order=(), reg_type='reg', stderr=False, t_stats=True, float_format='%0.2f', **reg_kwargs)

Convenience function to run regressions for multiple y variables with the same x variables and present them in a summary format.

Parameters
  • df (DataFrame) –

  • yvars (Sequence[str]) – column names of y variables

  • xvars (Sequence[str]) – column names of x variables

  • regressor_order (Sequence[str]) – sequence of column names to put first in the regression results

  • reg_type (str) – ‘diff’ for difference regression, ‘ols’ for OLS, ‘probit’ for Probit, ‘logit’ for Logit, ‘quantile’ for Quantile, or ‘fmb’ for Fama-MacBeth

  • stderr (bool) – set to True to keep rows for standard errors below coefficient estimates

  • t_stats (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats

  • float_format (str) – format string for how to format results in summary

  • reg_kwargs

Returns

a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results.

regtools.models module

regtools.models.get_model_class_by_string(model_string)

regtools.models.get_model_name_by_string(model_string)

Return type

str

regtools.order module

regtools.order.convert_regressor_order_for_diff(regressor_order, reg_kwargs)

regtools.order.convert_regressor_order_for_lags(regressor_order, reg_kwargs)

regtools.quantile module

regtools.quantile.quantile_plot_from_quantile_result_df(result_df, yvar, main_iv, outpath=None, clear_figure=True)

Creates a plot of the effect of main_iv on yvar at different quantiles. To be used after reg_for_each_quantile_produce_result_df.

Parameters
  • result_df – pd.DataFrame, result from reg_for_each_quantile_produce_result_df

  • yvar – str, label of dependent variable

  • main_iv – str, label of independent variable of interest

  • outpath – str, filepath to output figure. must include matplotlib supported extension such as .pdf or .png

  • clear_figure – bool, True to wipe the matplotlib figure from memory after running the function

Returns

regtools.quantile.quantile_reg(df, yvar, xvars, q=0.5, robust=True, cluster=False, cons=True, fe=None, interaction_tuples=None, num_lags=0, lag_variables='xvars', lag_period_var='Date', lag_id_var='TICKER', lag_fill_method='ffill', lag_fill_limit=None)

Returns a fitted quantile regression. Takes df, produces a regression df with no missing values among the needed variables, and fits a regression model. If robust is specified, uses heteroskedasticity-robust standard errors. If cluster is specified, calculates clustered standard errors by the given variable.

Notes

Only specify at most one of robust and cluster.

Parameters
  • df (DataFrame) –

  • yvar (str) – column name of outcome y variable

  • xvars (Sequence[str]) – column names of x variables for regression

  • q (float) – quantile to use

  • robust (bool) – set to True to use heteroskedasticity-robust standard errors

  • cluster (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name. set to multiple column names for multiway clustering following Cameron, Gelbach, and Miller (2011). NOTE: will get exponentially slower as more cluster variables are added

  • cons (bool) – set to False to not include a constant in the regression

  • fe (Union[str, Sequence[str], None]) – If a str or list of strs is passed, uses these categorical variables to construct dummies for fixed effects.

  • interaction_tuples (Union[Tuple[str, str], Sequence[Tuple[str, str]], None]) – tuple or list of tuples of column names to interact and include as xvars

  • num_lags (int) – Number of periods to lag variables. Setting to other than 0 will activate lags

  • lag_variables (Union[str, Sequence[str]]) – ‘all’, ‘xvars’, or list of strs of names of columns to lag for regressions.

  • lag_period_var (str) – only used if lag_variables is not None. name of column which contains period variable for lagging

  • lag_id_var (str) – only used if lag_variables is not None. name of column which contains identifier variable for lagging

  • lag_fill_method (Optional[str]) – ‘ffill’ or ‘bfill’ for which method to use to fill in missing rows when creating lag variables. Set to None to not fill and have missing instead. See pandas.DataFrame.fillna for more details

  • lag_fill_limit (Optional[int]) – maximum number of periods to fill with lag_fill_method

Returns

statsmodels regression result
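What the q parameter controls can be illustrated without any regression library. Quantile regression minimizes the check (pinball) loss, and for a constant-only model the minimizer is the sample q-quantile. The sketch below is a toy search over observed values, not how the package or statsmodels fits the model; check_loss and total_loss are hypothetical helpers.

```python
def check_loss(residual, q):
    """Check (pinball) loss for one residual at quantile q."""
    # Positive residuals are weighted by q, negative ones by (1 - q).
    return residual * (q - (1.0 if residual < 0 else 0.0))


def total_loss(values, candidate, q):
    """Sum of check losses when every value is predicted by candidate."""
    return sum(check_loss(v - candidate, q) for v in values)


values = [1.0, 2.0, 3.0, 4.0, 100.0]

# Constant-only model: search the observed values for the best single
# prediction at each quantile.
best_median = min(values, key=lambda c: total_loss(values, c, 0.5))
best_q75 = min(values, key=lambda c: total_loss(values, c, 0.75))
print(best_median, best_q75)
```

The outlier at 100 does not move the q=0.5 fit away from 3.0, which is one common reason to prefer a median regression over OLS.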

regtools.quantile.reg_for_each_quantile_output_plot(main_iv, *reg_args, num_quantiles=8, main_iv_label=None, outpath=None, clear_figure=True, **reg_kwargs)

Creates a plot of the effect of main_iv on yvar at different quantiles. To be used after reg_for_each_quantile_produce_result_df.

Parameters
  • main_iv – str, column name of independent variable of interest

  • reg_args – see quantile_reg for args

  • num_quantiles – number of quantile regressions to run. will be spaced evenly. higher numbers produce smoother graphs but take longer to run

  • main_iv_label – str, label of independent variable of interest

  • outpath – str, filepath to output figure. must include matplotlib supported extension such as .pdf or .png

  • clear_figure – bool, True to wipe the matplotlib figure from memory after running the function

  • reg_kwargs – see quantile_reg for kwargs

Returns

result_df from reg_for_each_quantile_produce_result_df
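One way to space num_quantiles quantiles evenly inside (0, 1) is to take the interior points that split the interval into num_quantiles + 1 equal parts. The exact grid the package uses is not stated here, so this is an assumption made for illustration; evenly_spaced_quantiles is a hypothetical helper.

```python
def evenly_spaced_quantiles(num_quantiles):
    """Interior points that split (0, 1) into num_quantiles + 1 equal parts."""
    return [i / (num_quantiles + 1) for i in range(1, num_quantiles + 1)]


quantiles = evenly_spaced_quantiles(4)
print(quantiles)
```

Excluding 0 and 1 keeps every q strictly inside the interval, where a quantile regression is well defined.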

regtools.quantile.reg_for_each_quantile_produce_result_df(main_iv, *reg_args, num_quantiles=8, **reg_kwargs)

Produces a result DataFrame from running multiple quantile regressions spaced evenly over the (0, 1) interval

Parameters
  • main_iv – str, column name of independent variable of interest

  • reg_args – see quantile_reg for args

  • num_quantiles – number of quantile regressions to run. will be spaced evenly. higher numbers produce smoother graphs but take longer to run

  • reg_kwargs – see quantile_reg for kwargs

Returns

regtools.reg module

regtools.reg.reg(df, yvar, xvars, robust=True, cluster=False, cons=True, fe=None, interaction_tuples=None, num_lags=0, lag_variables='xvars', lag_period_var='Date', lag_id_var='TICKER', lag_fill_method='ffill', reg_type='OLS')

Returns a fitted regression. Takes df, produces a regression df with no missing values among the needed variables, and fits a regression model. If robust is specified, uses heteroskedasticity-robust standard errors. If cluster is specified, calculates clustered standard errors by the given variable.

Notes

Only specify at most one of robust and cluster.

Parameters
  • df (DataFrame) –

  • yvar (str) – column name of outcome y variable

  • xvars (Sequence[str]) – column names of x variables for regression

  • robust (bool) – set to True to use heteroskedasticity-robust standard errors

  • cluster (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name. set to multiple column names for multiway clustering following Cameron, Gelbach, and Miller (2011). NOTE: will get exponentially slower as more cluster variables are added

  • cons (bool) – set to False to not include a constant in the regression

  • fe (Union[str, Sequence[str], None]) – If a str or list of strs is passed, uses these categorical variables to construct dummies for fixed effects.

  • interaction_tuples (Union[Tuple[str, str], Sequence[Tuple[str, str]], None]) – tuple or list of tuples of column names to interact and include as xvars

  • num_lags (int) – Number of periods to lag variables. Setting to other than 0 will activate lags

  • lag_variables (Union[str, Sequence[str]]) – ‘all’, ‘xvars’, or list of strs of names of columns to lag for regressions.

  • lag_period_var (str) – only used if lag_variables is not None. name of column which contains period variable for lagging

  • lag_id_var (str) – only used if lag_variables is not None. name of column which contains identifier variable for lagging

  • lag_fill_method (Optional[str]) – ‘ffill’ or ‘bfill’ for which method to use to fill in missing rows when creating lag variables. Set to None to not fill and have missing instead. See pandas.DataFrame.fillna for more details

  • reg_type (str) – ‘OLS’, ‘probit’, or ‘logit’ for type of model

Returns

statsmodels regression result.
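The data preparation described above (drop rows with missing values among the needed columns, then build fixed-effect dummies) can be sketched in pandas. This is a sketch under assumptions, not the package's code: build_design is a hypothetical helper, it handles a single fe column, and dropping the first dummy level to avoid collinearity with the constant is a choice made here.

```python
import pandas as pd


def build_design(df, yvar, xvars, fe=None):
    """Return (y, X) with complete rows only and optional fixed-effect dummies."""
    needed = [yvar, *xvars] + ([fe] if fe else [])
    # Keep only the needed columns, then drop any row with a missing value.
    sample = df[needed].dropna()
    X = sample[list(xvars)]
    if fe:
        # One dummy per category, omitting the first as the base level.
        dummies = pd.get_dummies(sample[fe], prefix=fe, drop_first=True)
        X = pd.concat([X, dummies], axis=1)
    return sample[yvar], X


data = pd.DataFrame(
    {
        "y": [1.0, 2.0, None, 4.0, 5.0],
        "x": [1.0, None, 3.0, 4.0, 5.0],
        "industry": ["A", "B", "A", "B", "C"],
    }
)
y, X = build_design(data, "y", ["x"], fe="industry")
```

Rows with a missing y or x are dropped before the dummies are built, so a category that appears only in incomplete rows produces no column.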

regtools.regtypes module

regtools.select module

regtools.select.select_models(reg_list, keepnum, xvars)

Takes a list of fitted regression models and selects among them based on adjusted R-squared. For each number of variables involved in the regressions, the keepnum models with the highest adjusted R-squared are kept.

For example, if reg_list contains 3 regressions with two variables and 6 regressions with three variables, and keepnum is 2, the function returns a list of four regressions: 2 with two variables and 2 with three variables.
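The selection rule in the example above can be sketched with plain Python. Here each model is represented as a hypothetical (xvars, adjusted R-squared) pair rather than a fitted result, and select_top_models is an illustrative helper, not the package's function.

```python
from collections import defaultdict


def select_top_models(models, keepnum):
    """Keep the keepnum highest adjusted R-squared models per number of xvars.

    models is a list of (xvars, adj_r2) pairs.
    """
    by_size = defaultdict(list)
    for xvars, adj_r2 in models:
        by_size[len(xvars)].append((xvars, adj_r2))
    selected = []
    for size in sorted(by_size):
        ranked = sorted(by_size[size], key=lambda m: m[1], reverse=True)
        selected.extend(ranked[:keepnum])
    return selected


# 3 two-variable models and 6 three-variable models with made-up scores.
two_var = [(("a", "b"), 0.10), (("a", "c"), 0.30), (("b", "c"), 0.20)]
three_var = [(("a", "b", str(i)), 0.05 * i) for i in range(1, 7)]
kept = select_top_models(two_var + three_var, keepnum=2)
```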

Parameters
  • reg_list

  • keepnum (int) – number of models to keep for each number of x variables. The total number of outputted regressions will be roughly keepnum * len(xvars)

  • xvars (Sequence[str]) – column names of x variables

Returns

regtools.tools module