regtools package
High-level tools for running regressions. Handles fixed effects, multi-way clustering, hypothesis testing, lagged variables, differenced variables, interaction effects, iteration tools, and producing summaries for a variety of models including OLS, Logit, Probit, Quantile, and Fama-MacBeth.
Subpackages
Submodules
regtools.args module
regtools.chooser module
regtools.controls module
regtools.dataprep module
regtools.differenced module
regtools.differenced.create_differenced_variables(df, diff_cols, id_col='TICKER', date_col='Date', difference_lag=1, fill_method='ffill', fill_limit=None)
Note: partially inplace
regtools.differenced.diff_reg(df, yvar, xvars, id_col, date_col, difference_lag=1, diff_cols=None, diff_fill_method='ffill', diff_fill_limit=None, **reg_kwargs)
Fits a differenced regression.
- Parameters
df (DataFrame) –
xvars (Sequence[str]) – column names of x variables for regression
id_col (str) – column name of variable representing entities in the data
date_col (str) – column name of variable representing time in the data
difference_lag (int) – number of lags to use for the difference
diff_cols (Optional[Sequence[str]]) – columns to take differences on
diff_fill_method (str) – pandas fill method, 'ffill' or 'bfill'
diff_fill_limit (Optional[int]) – maximum number of periods to fill missing data, default no limit
reg_kwargs –
- Returns
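As a rough illustration of what differencing with id_col, date_col, and difference_lag means, the sketch below computes within-entity differences in plain Python. It is not the regtools implementation: the helper name difference_by_entity and the sample rows are hypothetical, the data are assumed to have no gaps in dates, and the fill options (diff_fill_method, diff_fill_limit) are not modeled.

```python
from collections import defaultdict


def difference_by_entity(rows, value_key, id_key="TICKER", date_key="Date", difference_lag=1):
    """Return {(entity, date): value_t - value_{t - difference_lag}} within each entity.

    Hypothetical helper for illustration only; assumes one row per entity per date.
    """
    by_entity = defaultdict(list)
    for row in rows:
        by_entity[row[id_key]].append(row)

    diffs = {}
    for entity, entity_rows in by_entity.items():
        # Differences only make sense on time-ordered observations.
        entity_rows.sort(key=lambda r: r[date_key])
        for i, row in enumerate(entity_rows):
            if i < difference_lag:
                # No earlier observation to subtract, so the difference is missing.
                diffs[(entity, row[date_key])] = None
            else:
                previous = entity_rows[i - difference_lag]
                diffs[(entity, row[date_key])] = row[value_key] - previous[value_key]
    return diffs


rows = [
    {"TICKER": "AAA", "Date": 1, "x": 10.0},
    {"TICKER": "AAA", "Date": 2, "x": 12.0},
    {"TICKER": "AAA", "Date": 3, "x": 15.0},
    {"TICKER": "BBB", "Date": 1, "x": 5.0},
    {"TICKER": "BBB", "Date": 2, "x": 4.0},
]
diffs = difference_by_entity(rows, "x")
```

The first observation of each entity has no difference, which is why a differenced regression typically loses difference_lag rows per entity.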
regtools.ext_statsmodels module
regtools.ext_statsmodels.summary_col(results, float_format='%.4f', model_names=[], stars=False, info_dict=None, regressor_order=[])
Summarize multiple results instances side-by-side (coefs and SEs)
- Parameters
results (statsmodels results instance or list of result instances) –
float_format (string) – float format for coefficients and standard errors. Default: '%.4f'
model_names (list of strings) – list of length len(results). If the names are not unique, a roman number will be appended to all model names
stars (bool) – print significance stars
info_dict (dict) – dict of lambda functions to be applied to results instances to retrieve model info. To use specific information for different models, add a (nested) info_dict with model name as the key. Example: info_dict = {"N":…, "R2": …, "OLS":{"R2":…}} would only show R2 for OLS regression models, but additionally N for all other results. Default: None (use the info_dict specified in result.default_model_infos, if this property exists)
regressor_order (list of strings) – list of names of the regressors in the desired order. All regressors not specified will be appended to the end of the list.
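The shape of a flat info_dict can be sketched as follows. This is only an illustration of "a dict of lambda functions applied to results instances": fake_result is a hypothetical stand-in object carrying the nobs and rsquared attributes that fitted statsmodels OLS results expose, and the loop at the end mimics what a summary routine might do with each entry rather than reproducing summary_col itself.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a fitted result; not a real statsmodels object.
fake_result = SimpleNamespace(nobs=120.0, rsquared=0.4321)

# Each value is a function that takes a results instance and returns a display string.
info_dict = {
    "N": lambda res: "{:d}".format(int(res.nobs)),
    "R2": lambda res: "{:.2f}".format(res.rsquared),
}

# Apply every entry to the result to build the extra rows of the summary.
model_info = {label: func(fake_result) for label, func in info_dict.items()}
```

Here model_info maps each row label to the formatted string that would appear under the model's column.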
regtools.interact module
regtools.iter module
regtools.iter.reg_for_each_combo(df, yvar, xvars, reg_type='reg', **reg_kwargs)
Takes each possible combination of xvars (starting from each var individually, then each pair of vars, etc., all the way up to all xvars), and regresses yvar on each set of xvars.
- Parameters
- Returns
a list of fitted regressions
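The enumeration described above (singles, then pairs, up to the full set) can be sketched with itertools.combinations. The helper name all_xvar_combos and the variable names are hypothetical, and this shows only the ordering of x-variable sets, not how regtools fits each model.

```python
from itertools import combinations


def all_xvar_combos(xvars):
    """Return every non-empty subset of xvars, ordered by subset size."""
    combos = []
    for size in range(1, len(xvars) + 1):
        combos.extend(list(combo) for combo in combinations(xvars, size))
    return combos


combos = all_xvar_combos(["size", "leverage", "momentum"])
```

With n x variables there are 2**n - 1 such sets, so the number of regressions grows quickly as variables are added.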
regtools.iter.reg_for_each_combo_select_and_produce_summary(df, yvar, xvars, robust=True, cluster=False, keepnum=5, stderr=False, t_stats=True, float_format='%0.1f', regressor_order=(), **other_reg_kwargs)
Convenience function to run regressions for every combination of xvars, select the best models, and present them in a summary format.
- Parameters
df (DataFrame) –
robust (bool) – set to False to not use heteroskedasticity-robust standard errors
cluster (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name, or multiple column names for multi-way clustering
keepnum (int) – number to keep for each amount of x variables. The total number of outputted regressions will be roughly keepnum * len(xvars)
stderr (bool) – set to True to keep rows for standard errors below coefficient estimates
t_stats (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats
float_format (str) – format string for how to format results in summary
regressor_order (Sequence[str]) – sequence of column names to put first in the regression results
reg_type – 'diff' for difference regression, 'ols' for OLS, 'probit' for Probit, 'logit' for Logit, 'quantile' for Quantile, or 'fmb' for Fama-MacBeth
other_reg_kwargs –
- Returns
a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results
regtools.iter.reg_for_each_lag(df, yvar, xvars, lag_tuple=(1, 2, 3, 4), reg_type='reg', **reg_kwargs)
Convenience function to run regressions with the same y and x variables for every passed number of lags
- Parameters
df (DataFrame) –
lag_tuple (Sequence[int]) – sequence containing how many lags to create and run regressions for each variable
reg_type (str) – 'diff' for difference regression, 'ols' for OLS, 'probit' for Probit, 'logit' for Logit, 'quantile' for Quantile, or 'fmb' for Fama-MacBeth
reg_kwargs –
- Returns
regtools.iter.reg_for_each_lag_and_produce_summary(df, yvar, xvars, regressor_order=(), lag_tuple=(1, 2, 3, 4), consolidate_lags=True, reg_type='reg', stderr=False, t_stats=True, float_format='%0.2f', suppress_other_regressors=False, **reg_kwargs)
Convenience function to run regressions with the same y and x variables for every passed number of lags and produce a summary.
- Parameters
df –
yvar – column name of y variable
xvars – column names of x variables
regressor_order (Sequence[str]) – sequence of column names to put first in the regression results
lag_tuple (Sequence[int]) – sequence containing how many lags to create and run regressions for each variable
consolidate_lags (bool) – True to condense lags for a single variable into a single row
reg_type (str) – 'diff' for difference regression, 'ols' for OLS, 'probit' for Probit, 'logit' for Logit, 'quantile' for Quantile, or 'fmb' for Fama-MacBeth
stderr (bool) – set to True to keep rows for standard errors below coefficient estimates
t_stats (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats
float_format (str) – format string for how to format results in summary
suppress_other_regressors (bool) – only used if regressor_order is passed. Then pass True to hide all coefficient rows besides those in regressor_order
reg_kwargs –
- Returns
a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results
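To make the role of lag_tuple concrete, the sketch below builds one lagged copy of a series per entry in lag_tuple, which is the input each per-lag regression would use. It is a simplified illustration: lag_values is a hypothetical helper, the values are assumed to belong to a single entity and to be sorted by period already, and no filling of missing periods is performed.

```python
def lag_values(values, num_lags):
    """Shift a time-ordered list back by num_lags periods, padding with None."""
    padded = [None] * num_lags + list(values)
    # Trim to the original length so the lagged series lines up row by row.
    return padded[: len(values)]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
lag_tuple = (1, 2, 3, 4)

# One lagged series per requested lag; each would feed its own regression.
lagged = {num_lags: lag_values(x, num_lags) for num_lags in lag_tuple}
```

Each additional lag costs one more leading observation, so regressions with longer lags are estimated on fewer rows.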
regtools.iter.reg_for_each_xvar_set(df, yvar, xvars_list, reg_type='reg', **reg_kwargs)
Runs regressions on the same y variable for each set of x variables passed. xvars_list should be a list of lists, where each individual list is one set of x variables for one model.
- Notes
If fe is passed, should either pass a string to use fe in all models, or a list of strings or None of same length as num models
- Parameters
df (DataFrame) –
xvars_list (Sequence[Sequence[str]]) – sequence where each element is itself a sequence containing column names of x variables for that numbered regression
reg_type (str) – 'diff' for difference regression, 'ols' for OLS, 'probit' for Probit, 'logit' for Logit, 'quantile' for Quantile, or 'fmb' for Fama-MacBeth
reg_kwargs –
- Returns
a list of fitted regressions
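The fe rule in the notes (one string for all models, or one entry per model) can be expressed as a small normalization step. The helper expand_fe below is hypothetical and only illustrates the rule; treating fe=None as "no fixed effects in any model" and raising ValueError on a length mismatch are assumptions, not documented regtools behavior.

```python
def expand_fe(fe, num_models):
    """Return a list with one fixed-effects setting per model."""
    if fe is None or isinstance(fe, str):
        # A single value applies to every model.
        return [fe] * num_models
    fe_list = list(fe)
    if len(fe_list) != num_models:
        raise ValueError(
            "fe has {} entries but there are {} models".format(len(fe_list), num_models)
        )
    return fe_list


xvars_list = [["size"], ["size", "leverage"], ["size", "leverage", "momentum"]]
same_fe = expand_fe("Industry", len(xvars_list))
mixed_fe = expand_fe(["Industry", None, "Year"], len(xvars_list))
```

In the mixed case the second model is fitted without fixed effects while the first and third each use their own categorical variable.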
regtools.iter.reg_for_each_xvar_set_and_produce_summary(df, yvar, xvars_list, robust=True, cluster=False, stderr=False, t_stats=True, fe=None, float_format='%0.2f', suppress_other_regressors=False, regressor_order=(), **other_reg_kwargs)
Convenience function to run regressions for every set of xvars passed and present them in a summary format.
- Notes
Only specify at most one of robust and cluster.
Don't set both stderr and t_stats to True
- Parameters
df (DataFrame) –
xvars_list (Sequence[Sequence[str]]) – sequence where each element is itself a sequence containing column names of x variables for that numbered regression
robust (bool) – set to False to not use heteroskedasticity-robust standard errors
cluster (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name, or multiple column names for multi-way clustering
stderr (bool) – set to True to keep rows for standard errors below coefficient estimates
t_stats (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats
fe (Union[str, Sequence[Optional[str]], None]) – If fe is passed, should either pass a string to use fe in all models, or a list of strings or None of same length as num models
float_format (str) – format string for how to format results in summary
suppress_other_regressors (bool) – only used if regressor_order is passed. Then pass True to hide all coefficient rows besides those in regressor_order
regressor_order (Sequence[str]) – sequence of column names to put first in the regression results
reg_type – 'diff' for difference regression, 'ols' for OLS, 'probit' for Probit, 'logit' for Logit, 'quantile' for Quantile, or 'fmb' for Fama-MacBeth
other_reg_kwargs –
- Returns
a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results.
regtools.iter.reg_for_each_yvar(df, yvars, xvars, reg_type='reg', **reg_kwargs)
Convenience function to run regressions for multiple y variables with the same x variables
regtools.iter.reg_for_each_yvar_and_produce_summary(df, yvars, xvars, regressor_order=(), reg_type='reg', stderr=False, t_stats=True, float_format='%0.2f', **reg_kwargs)
Convenience function to run regressions for multiple y variables with the same x variables and present them in a summary format.
- Parameters
df (DataFrame) –
regressor_order (Sequence[str]) – sequence of column names to put first in the regression results
reg_type (str) – 'diff' for difference regression, 'ols' for OLS, 'probit' for Probit, 'logit' for Logit, 'quantile' for Quantile, or 'fmb' for Fama-MacBeth
stderr (bool) – set to True to keep rows for standard errors below coefficient estimates
t_stats (bool) – set to True to keep rows for standard errors below coefficient estimates and convert them to t-stats
float_format (str) – format string for how to format results in summary
reg_kwargs –
- Returns
a tuple of (reg_list, summary) where reg_list is a list of fitted regression models, and summary is a single dataframe of results.
regtools.models module
regtools.order module
regtools.quantile module
regtools.quantile.quantile_plot_from_quantile_result_df(result_df, yvar, main_iv, outpath=None, clear_figure=True)
Creates a plot of the effect of main_iv on yvar at different quantiles. To be used after reg_for_each_quantile_produce_result_df
- Parameters
result_df – pd.DataFrame, result from reg_for_each_quantile_produce_result_df
yvar – str, label of dependent variable
main_iv – str, label of independent variable of interest
outpath – str, filepath to output figure. Must include a matplotlib-supported extension such as .pdf or .png
clear_figure – bool, True to wipe memory of matplotlib figure after running function
- Returns
regtools.quantile.quantile_reg(df, yvar, xvars, q=0.5, robust=True, cluster=False, cons=True, fe=None, interaction_tuples=None, num_lags=0, lag_variables='xvars', lag_period_var='Date', lag_id_var='TICKER', lag_fill_method='ffill', lag_fill_limit=None)
Returns a fitted quantile regression. Takes df, produces a regression df with no missing values among the needed variables, and fits a regression model. If robust is specified, uses heteroskedasticity-robust standard errors. If cluster is specified, calculates clustered standard errors by the given variable.
- Notes
Only specify at most one of robust and cluster.
- Parameters
df (DataFrame) –
xvars (Sequence[str]) – column names of x variables for regression
robust (bool) – set to True to use heteroskedasticity-robust standard errors
cluster (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name. Set to multiple column names for multiway clustering following Cameron, Gelbach, and Miller (2011). NOTE: will get exponentially slower as more cluster variables are added
cons (bool) – set to False to not include a constant in the regression
fe (Union[str, Sequence[str], None]) – If a str or list of strs is passed, uses these categorical variables to construct dummies for fixed effects.
interaction_tuples (Union[Tuple[str, str], Sequence[Tuple[str, str]], None]) – tuple or list of tuples of column names to interact and include as xvars
num_lags (int) – Number of periods to lag variables. Setting to other than 0 will activate lags
lag_variables (Union[str, Sequence[str]]) – 'all', 'xvars', or list of strs of names of columns to lag for regressions.
lag_period_var (str) – only used if lag_variables is not None. Name of column which contains period variable for lagging
lag_id_var (str) – only used if lag_variables is not None. Name of column which contains identifier variable for lagging
lag_fill_method (Optional[str]) – 'ffill' or 'bfill' for which method to use to fill in missing rows when creating lag variables. Set to None to not fill and have missing instead. See pandas.DataFrame.fillna for more details
lag_fill_limit (Optional[int]) – maximum number of periods to fill with lag_fill_method
- Returns
statsmodels regression result
regtools.quantile.reg_for_each_quantile_output_plot(main_iv, *reg_args, num_quantiles=8, main_iv_label=None, outpath=None, clear_figure=True, **reg_kwargs)
Creates a plot of the effect of main_iv on yvar at different quantiles. To be used after reg_for_each_quantile_produce_result_df
- Parameters
main_iv – str, column name of independent variable of interest
reg_args – see quantile_reg for args
num_quantiles – number of quantile regressions to run. Quantiles will be spaced evenly. Higher numbers produce smoother graphs but take longer to run
main_iv_label – str, label of independent variable of interest
outpath – str, filepath to output figure. Must include a matplotlib-supported extension such as .pdf or .png
clear_figure – bool, True to wipe memory of matplotlib figure after running function
reg_kwargs – see quantile_reg for kwargs
- Returns
result_df from reg_for_each_quantile_produce_result_df
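The documentation says only that the quantiles are "spaced evenly", without giving the exact grid. One plausible reading, shown purely as an assumption, is to split the open interval (0, 1) into num_quantiles + 1 equal steps so that the endpoints 0 and 1 are never used. The helper evenly_spaced_quantiles is hypothetical and may not match the grid regtools actually uses.

```python
def evenly_spaced_quantiles(num_quantiles):
    """Return num_quantiles values strictly between 0 and 1 with equal spacing.

    Assumed spacing for illustration; the actual regtools grid may differ.
    """
    step = 1 / (num_quantiles + 1)
    # Round to avoid floating-point noise such as 0.6000000000000001.
    return [round(step * (i + 1), 4) for i in range(num_quantiles)]


quantiles = evenly_spaced_quantiles(4)
```

Each value would be passed as q to a separate quantile regression, which is why a larger num_quantiles gives a smoother plot at the cost of more model fits.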
regtools.reg module
regtools.reg.reg(df, yvar, xvars, robust=True, cluster=False, cons=True, fe=None, interaction_tuples=None, num_lags=0, lag_variables='xvars', lag_period_var='Date', lag_id_var='TICKER', lag_fill_method='ffill', reg_type='OLS')
Returns a fitted regression. Takes df, produces a regression df with no missing values among the needed variables, and fits a regression model. If robust is specified, uses heteroskedasticity-robust standard errors. If cluster is specified, calculates clustered standard errors by the given variable.
- Notes
Only specify at most one of robust and cluster.
- Parameters
df (DataFrame) –
xvars (Sequence[str]) – column names of x variables for regression
robust (bool) – set to True to use heteroskedasticity-robust standard errors
cluster (Union[bool, str, Sequence[str]]) – set to a column name to calculate standard errors within clusters given by unique values of given column name. Set to multiple column names for multiway clustering following Cameron, Gelbach, and Miller (2011). NOTE: will get exponentially slower as more cluster variables are added
cons (bool) – set to False to not include a constant in the regression
fe (Union[str, Sequence[str], None]) – If a str or list of strs is passed, uses these categorical variables to construct dummies for fixed effects.
interaction_tuples (Union[Tuple[str, str], Sequence[Tuple[str, str]], None]) – tuple or list of tuples of column names to interact and include as xvars
num_lags (int) – Number of periods to lag variables. Setting to other than 0 will activate lags
lag_variables (Union[str, Sequence[str]]) – 'all', 'xvars', or list of strs of names of columns to lag for regressions.
lag_period_var (str) – only used if lag_variables is not None. Name of column which contains period variable for lagging
lag_id_var (str) – only used if lag_variables is not None. Name of column which contains identifier variable for lagging
lag_fill_method (Optional[str]) – 'ffill' or 'bfill' for which method to use to fill in missing rows when creating lag variables. Set to None to not fill and have missing instead. See pandas.DataFrame.fillna for more details
reg_type (str) – 'OLS', 'probit', or 'logit' for type of model
- Returns
statsmodels regression result.
regtools.regtypes module
regtools.select module
regtools.select.select_models(reg_list, keepnum, xvars)
Takes a list of fitted regression models and selects among them based on adjusted R-squared. For each number of variables involved in the regressions, the keepnum regressions with the highest adjusted R-squared will be kept.
For example, if reg_list contains 3 regressions with two variables and 6 regressions with three variables, and keepnum is 2, select_models will return a list of four regressions, 2 with two variables and 2 with three variables.
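The selection rule in that example can be sketched in plain Python. This is an illustration of the described behavior, not the regtools code: each model is represented by a hypothetical (xvars, adjusted R-squared) pair instead of a fitted statsmodels result, and select_top_models is an invented name.

```python
from collections import defaultdict


def select_top_models(models, keepnum):
    """Keep the keepnum models with the highest adjusted R-squared per model size.

    models is a list of (xvars, adj_r2) pairs standing in for fitted regressions.
    """
    by_size = defaultdict(list)
    for xvars, adj_r2 in models:
        by_size[len(xvars)].append((xvars, adj_r2))

    selected = []
    for size in sorted(by_size):
        # Rank models of the same size from best to worst fit.
        ranked = sorted(by_size[size], key=lambda model: model[1], reverse=True)
        selected.extend(ranked[:keepnum])
    return selected


# 3 two-variable models and 6 three-variable models, as in the example above.
two_var = [(["a", "b"], 0.10), (["a", "c"], 0.30), (["b", "c"], 0.20)]
three_var = [(["a", "b", "z{}".format(i)], 0.05 * i) for i in range(6)]

selected = select_top_models(two_var + three_var, keepnum=2)
```

Grouping by the number of variables before ranking keeps small models in the output even though larger models usually fit better.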