Note
Click here to download the full example code or to run this example in your browser via Binder
Additional Research Tools¶
I’ve been using Python actively for research since 2015. One of the beauties of Python is that it’s very easy to write your own functions, modules, and packages for workflows you do often. Every time that I hit something which was pretty difficult in Python, I built a tool for it to make it easy. The result after these years of doing this is that I’ve built up a lot of tools that make empirical research in Python easier. Let’s take a look through them.
Table of Contents¶
pyexlatex
¶
Generate LaTeX directly from Python with a simplified API.
NOTE: You must have a LaTeX distribution installed on your machine for this package to work. Tested with MikTeX and TeXLive on Windows and Linux.
NOTE: It is highly recommended to run this example in Jupyer Lab so that PDFs will be outputted inline in Jupyter.
Find more at the documentation.
The most basic example:
import pyexlatex as pl
doc = pl.Document('woo')
doc
Out:
<Document>
As a LaTeX str:
print(doc)
Out:
\documentclass[]{article}
\usepackage{amsmath}
\usepackage{pdflscape}
\usepackage{booktabs}
\usepackage{array}
\usepackage{threeparttable}
\usepackage{fancyhdr}
\usepackage{lastpage}
\usepackage{textcomp}
\usepackage{dcolumn}
\newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{.}{D{.}{.}{-1}}
\usepackage[T1]{fontenc}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage[margin=0.8in, bottom=1.2in]{geometry}
\usepackage[page]{appendix}
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0pt}
\fancyhead{}
\rfoot{Page \thepage\ of \pageref{LastPage}}
\cfoot{}
\begin{document}
woo
\end{document}
Object-oriented API example:
Out:
<Document>
Template-driven API example:
Out:
<Document>
A combination works as well.
Out:
<Document>
Equations are fine too.
Out:
<Document>
You can create tables from DataFrames
.
# Create a DataFrame for example
df = pd.DataFrame(
[
(1, 2, 'Stuff'),
(3, 4, 'Thing'),
(5, 6, 'Other Thing'),
],
columns=['a', 'b', 'c']
)
table = pl.Table.from_list_of_lists_of_dfs([[df]])
pl.Document([table])
Out:
<Document>
Publication-quality multi-panel tables with captions, below text, consolidation of indices, etc. are supported.
df.set_index('c', inplace=True)
table = pl.Table.from_list_of_lists_of_dfs(
[
[df, df],
[df, df]
],
shape=(1, 2),
include_index=True,
panel_names=['Top Panel', 'Bottom Panel'],
caption='My First Complex Table',
below_text="""
Some description of my table. Isn't it nice to be able to do everything all in one command?
""",
label='tables:one'
)
content.append(table)
pl.Document(content)
Out:
<Document>
Figures are supported as well, with an integration for matplotlib
and therefore pandas
as well (can also be loaded from file).
ax = df.plot()
fig = ax.get_figure()
pl_fig = pl.Figure.from_dict_of_names_and_plt_figures(
{
'My Subfigure': fig, # more subfigures can be passed in the same way
},
'.', # output location
figure_name='My Figure',
label='figs:one',
position_str_name_dict={
'My Subfigure': r'[t]{0.95\linewidth}' # LaTeX positioning strings accepted (be sure to use r'' to escape \)
}
)
content.append(pl_fig)
pl.Document(content)
Out:
<Document>
Table/Figure references work just fine and you can use the objects directly if desired.
Out:
<Document>
Support for citations as well. There is an easier way to create these
using bibtex_gen
as shown in that section.
bibtex_item = pl.BibTexArticle(
'using-pyexlatex',
'Nick DeRobertis',
'How to Use pyexlatex',
'The Journal of Awesome Stuff',
'2020',
volume='Vol 1',
pages='1-2',
)
content.extend([
pl.Section(
[
['As shown by', pl.CiteT('using-pyexlatex'), pl.Monospace('pyexlatex'), 'is pretty awesome.']
],
title='Example for Using Citations'
),
pl.Bibliography([bibtex_item], style_name='jof')
])
pl.Document(content)
Out:
<Document>
Add metadata to the document such as author, title, etc.
footnotes = {
'nick': pl.Footnote(
"University of Florida, PhD Candidate, Tel: (352)392-4669, Email: Nicholas.DeRobertis@Warrington.ufl.edu"
),
'other': pl.Footnote(
"Example University, Professor"
)
}
abstract = """
A short abstract which is included for example purposes. There is a lot of configuration available for how the document
itself renders. Feel free to take this as an example of something that looks pretty good and then look through the
documentation for how to modify it.
"""
doc = pl.Document(
content,
authors=[
f'{pl.SmallCaps("Nick DeRobertis")}{footnotes["nick"]}',
f'{pl.SmallCaps("Other Person")}{footnotes["other"]}',
],
title='The title of my paper',
abstract=abstract,
page_modifier_str='margin=1.0in',
section_numbering_styles=dict(
section=r'\Roman{section}',
subsection=r'\thesection.\Alph{subsection}',
subsubsection=r'\thesubsection.\arabic{subsubsection}',
subfigure=r'\roman{subfigure}',
),
floats_at_end=True,
font_size=12,
line_spacing=2,
tables_relative_font_size=-2,
page_style='fancyplain',
custom_headers=[
pl.Header(pl.SmallCaps('My Short Title'), align='left'),
pl.Header(pl.SmallCaps(['Page ', pl.ThisPageNumber()]))
],
page_numbers=False,
separate_abstract_page=True,
extra_title_page_lines=[
[pl.Italics('JEL Classification:'), 'E42, E44, E52, G12, G15, [add more here]'],
[pl.Italics('Keywords:'), 'Thing; stuff; other stuf'],
],
)
doc
Out:
<Document>
LaTeX presentations with Beamer are supported as well.
pres_content = [
pl.Frame(
[
'Some text',
pl.Block(
[
'more text'
],
title='My Block'
),
pl_fig
]
)
]
pl.Presentation(pres_content)
Out:
<Item(name=document, contents=\begin{frame}
Some text
\begin{block}{My Block}
more text
\end{block}
\begin{figure}
\includegraphics[width=0.95\linewidth]{Sources/My_Subfigure.pdf}
\caption{My Figure}
\label{figs:one}
\end{figure}
\end{frame})>
And with sections, metadata, frame templates, etc.
pl_fig
pres_content = [
pl.Section(
[
pl.DimRevealListFrame(
[
'some',
'bullet',
'points'
],
title='First Frame'
),
],
title='First Section'
),
pl.Section(
[
pl.Frame(
pl_fig,
title='Second Frame'
),
],
title='Second Section'
)
]
pl.Presentation(
pres_content,
title='My Presentation',
authors=['Nick DeRobertis', 'Some Person'],
short_title='Pres',
subtitle='An Example Presentation',
short_author='ND',
institutions=[
['University of Florida'],
['University of Florida', 'Some other Place']
],
short_institution='UF',
nav_header=True,
toc_sections=True
)
Out:
<Item(name=document, contents=\title[Pres]{My Presentation}
\subtitle{An Example Presentation}
\author[ND]{Nick DeRobertis\inst{1}, Some Person\inst{2}}
\date{\today}
\begin{frame}
\titlepage
\label{title-frame}
\end{frame}
\begin{section}{First Section}
\begin{frame}
\frametitle{First Frame}
\begin{itemize}
\item<+-> \textcolor<.(1)->{black!30}{some}
\vfill
\item<+-> \textcolor<.(1)->{black!30}{bullet}
\vfill
\item<+-> points
\end{itemize}
\end{frame}
\end{section}
\begin{section}{Second Section}
\begin{frame}
\frametitle{Second Frame}
\begin{figure}
\includegraphics[width=0.95\linewidth]{Sources/My_Subfigure.pdf}
\caption{My Figure}
\label{figs:one}
\end{figure}
\end{frame}
\end{section})>
By default it produces the “slides” version that you would use while presenting. It can also produce a “handouts” version which removes all the effects (overlays).
pl.Presentation(
pres_content,
title='My Presentation',
authors=['Nick DeRobertis', 'Some Person'],
short_title='Pres',
subtitle='An Example Presentation',
short_author='ND',
institutions=[
['University of Florida'],
['University of Florida', 'Some other Place']
],
short_institution='UF',
nav_header=True,
toc_sections=True,
handouts=True # add this to remove presentation effects, good for distributing the PDF
)
Out:
<Item(name=document, contents=\title[Pres]{My Presentation}
\subtitle{An Example Presentation}
\author[ND]{Nick DeRobertis\inst{1}, Some Person\inst{2}}
\date{\today}
\begin{frame}
\titlepage
\label{title-frame}
\end{frame}
\begin{section}{First Section}
\begin{frame}
\frametitle{First Frame}
\begin{itemize}
\item some
\vfill
\item bullet
\vfill
\item points
\end{itemize}
\end{frame}
\end{section}
\begin{section}{Second Section}
\begin{frame}
\frametitle{Second Frame}
\begin{figure}
\includegraphics[width=0.95\linewidth]{Sources/My_Subfigure.pdf}
\caption{My Figure}
\label{figs:one}
\end{figure}
\end{frame}
\end{section})>
Some Clean Up¶
Not important, just cleaning up temporary files from the example.
import os
temp_files = [
'My Subfigure.pdf',
]
for file in temp_files:
os.remove(file)
regtools
¶
High-level tools for running regressions.
Find more at the documentation.
Some Setup¶
Create a DataFrame with Y and X variables and a known relationship between them. Also fill some cells with missing values.
df = pd.DataFrame(
np.random.random((100, 4)),
columns=['X1', 'X2', 'X3', r'$\epsilon$']
)
df['Y'] = df['X1'] * 5 + df['X2'] * 10 + df['X3'] * 20 + df[r'$\epsilon$'] * 10
df['f1'] = np.random.choice(['a', 'b', 'c'], size=(100,))
df['f2'] = np.random.choice(['d', 'e', 'f'], size=(100,))
df['date'] = pd.to_datetime(np.random.choice(['1/1/2000', '1/2/2000', '1/3/2000'], size=(100,)))
df.iloc[1, 2] = nan
df.iloc[3, 4] = nan
df.head()
X1 | X2 | X3 | $\epsilon$ | Y | f1 | f2 | date | |
---|---|---|---|---|---|---|---|---|
0 | 0.898230 | 0.130509 | 0.237716 | 0.225186 | 12.802420 | c | e | 2000-01-02 |
1 | 0.168765 | 0.439407 | NaN | 0.387186 | 27.583171 | c | e | 2000-01-03 |
2 | 0.084249 | 0.904845 | 0.937915 | 0.462776 | 32.855745 | c | f | 2000-01-03 |
3 | 0.500707 | 0.423299 | 0.949985 | 0.665030 | NaN | c | d | 2000-01-03 |
4 | 0.476204 | 0.017644 | 0.643010 | 0.311633 | 18.533991 | c | f | 2000-01-01 |
All regression automatically drop values with missing rows. By default they run with heteroskedasticity-robust standard errors and a constant.
Dep. Variable: | Y | R-squared: | 0.862 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.858 |
Method: | Least Squares | F-statistic: | 231.0 |
Date: | Wed, 19 Feb 2020 | Prob (F-statistic): | 3.09e-43 |
Time: | 15:02:24 | Log-Likelihood: | -236.19 |
No. Observations: | 98 | AIC: | 480.4 |
Df Residuals: | 94 | BIC: | 490.7 |
Df Model: | 3 | ||
Covariance Type: | HC1 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 2.6209 | 0.984 | 2.663 | 0.008 | 0.692 | 4.550 |
X1 | 6.5772 | 0.968 | 6.795 | 0.000 | 4.680 | 8.474 |
X2 | 11.6667 | 0.995 | 11.723 | 0.000 | 9.716 | 13.617 |
X3 | 20.8877 | 0.927 | 22.529 | 0.000 | 19.071 | 22.705 |
Omnibus: | 6.133 | Durbin-Watson: | 2.113 |
---|---|---|---|
Prob(Omnibus): | 0.047 | Jarque-Bera (JB): | 2.947 |
Skew: | 0.137 | Prob(JB): | 0.229 |
Kurtosis: | 2.196 | Cond. No. | 7.12 |
Warnings:
[1] Standard Errors are heteroscedasticity robust (HC1)
Run multiple regressions in one go with iteration tools. All functions also support fixed effects and multiway clustering.
reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary(
df,
'Y',
[
['X1'],
['X1', 'X2'],
['X1', 'X3'],
['X1', 'X2', 'X3']
],
fe=[['f1', 'f2']],
entity_var='f1',
time_var='date',
cluster=['f1', 'f2'],
robust=False
)
summ
Out:
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/linearmodels/panel/results.py:85: RuntimeWarning: invalid value encountered in sqrt
return Series(np.sqrt(np.diag(self.cov)), self._var_names, name="std_error")
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater
return (a < x) & (x < b)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less
return (a < x) & (x < b)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:1827: RuntimeWarning: invalid value encountered in greater_equal
cond2 = (x >= np.asarray(_b)) & cond0
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:17: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['type'] = ['estimate', 'stderr'] * int(len(df.index) / 2)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:20: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['regressor'] = [i for sublist in [[j] * 2 for j in df.index[0::2]] for i in sublist]
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:31: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
] = t_values
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/pandas/core/frame.py:3997: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
errors=errors,
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:170: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['type'] = ['estimate', 'stderr'] * int(len(df.index) / 2)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['regressor'] = [i for sublist in [[j] * 2 for j in df.index[0::2]] for i in sublist]
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:177: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['idx'] = [i for i in range(len(df))]
Y I | Y II | Y III | Y IIII | |
---|---|---|---|---|
R-squared | nan | nan | nan | nan |
X1 | 2.28 | 3.59* | 5.29*** | 6.83*** |
(0.83) | (1.67) | (3.33) | (8.78) | |
X2 | 10.12*** | 11.09 | ||
(4.02) | ||||
X3 | 20.26*** | 20.63*** | ||
(12.78) | (42.98) | |||
Intercept | 20.88*** | 14.60*** | 10.06*** | 2.94*** |
(15.96) | (5.62) | (6.87) | (4.07) | |
f1 Fixed Effects | Yes | Yes | Yes | Yes |
f2 Fixed Effects | Yes | Yes | Yes | Yes |
Cluster by f1 | Yes | Yes | Yes | Yes |
Cluster by f2 | Yes | Yes | Yes | Yes |
N | 99 | 99 | 98 | 98 |
Default is OLS. Other supported types: Probit, Logit, Quantile,
Fama-Macbeth. Just pass the string to reg_type
.
reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary(
df,
'Y',
[
['X1'],
['X1', 'X2'],
['X1', 'X3'],
['X1', 'X2', 'X3']
],
reg_type='quantile',
q=0.9
)
summ
reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary(
df,
'Y',
[
['X1'],
['X1', 'X2'],
['X1', 'X3'],
['X1', 'X2', 'X3']
],
reg_type='fama-macbeth',
entity_var='f1',
time_var='date'
)
summ
Out:
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/linearmodels/panel/model.py:2631: InferenceUnavailableWarning: The number of time-series observation available to estimate cross-sectional
regressions, 3, is less than the number of parameters in the model. Parameter
inference is not available.
InferenceUnavailableWarning,
Y I | Y II | Y III | Y IIII | |
---|---|---|---|---|
R-squared | nan | nan | nan | nan |
X1 | 2.67 | 3.87 | 4.34*** | 6.16*** |
(0.72) | (1.15) | (2.67) | (4.00) | |
X2 | 10.24*** | 11.13*** | ||
(4.63) | (15.45) | |||
X3 | 20.60*** | 20.93*** | ||
(14.72) | (43.63) | |||
Intercept | 20.53*** | 14.26*** | 10.25*** | 2.99*** |
(11.98) | (5.19) | (10.81) | (3.83) | |
N | 99 | 99 | 98 | 98 |
pd-utils
¶
Additional utilities to work with Pandas.
Find more at the documentation.
Some Setup¶
df1 = pd.DataFrame(
[
("001076", "3/1/1995"),
("001076", "4/1/1995"),
("001722", "1/1/2012"),
("001722", "7/1/2012"),
("001722", nan),
(nan, "1/1/2012"),
],
columns=["GVKEY", "Date"],
)
df1["Date"] = pd.to_datetime(df1["Date"])
df2 = pd.DataFrame(
[
("001076", "2/1/1995"),
("001076", "3/2/1995"),
("001722", "11/1/2011"),
("001722", "10/1/2011"),
("001722", nan),
(nan, "1/1/2012"),
],
columns=["GVKEY", "Date"],
)
df2["Date"] = pd.to_datetime(df2["Date"])
df3 = pd.DataFrame(
data=[
(10516, "a", "1/1/2000", 1.01, 0),
(10516, "a", "1/2/2000", 1.02, 1),
(10516, "a", "1/3/2000", 1.03, 1),
(10516, "a", "1/4/2000", 1.04, 0),
(10516, "b", "1/1/2000", 1.05, 1),
(10516, "b", "1/2/2000", 1.06, 1),
(10516, "b", "1/3/2000", 1.07, 1),
(10516, "b", "1/4/2000", 1.08, 1),
(10517, "a", "1/1/2000", 1.09, 0),
(10517, "a", "1/2/2000", 1.1, 0),
(10517, "a", "1/3/2000", 1.11, 0),
(10517, "a", "1/4/2000", 1.12, 1),
],
columns=["PERMNO", "byvar", "Date", "RET", "weight"],
)
df1
df2
df3
PERMNO | byvar | Date | RET | weight | |
---|---|---|---|---|---|
0 | 10516 | a | 1/1/2000 | 1.01 | 0 |
1 | 10516 | a | 1/2/2000 | 1.02 | 1 |
2 | 10516 | a | 1/3/2000 | 1.03 | 1 |
3 | 10516 | a | 1/4/2000 | 1.04 | 0 |
4 | 10516 | b | 1/1/2000 | 1.05 | 1 |
5 | 10516 | b | 1/2/2000 | 1.06 | 1 |
6 | 10516 | b | 1/3/2000 | 1.07 | 1 |
7 | 10516 | b | 1/4/2000 | 1.08 | 1 |
8 | 10517 | a | 1/1/2000 | 1.09 | 0 |
9 | 10517 | a | 1/2/2000 | 1.10 | 0 |
10 | 10517 | a | 1/3/2000 | 1.11 | 0 |
11 | 10517 | a | 1/4/2000 | 1.12 | 1 |
tradedays
¶
Work directly with US market trading days.
import pd_utils
pd.date_range(
start='1/1/2000',
end='1/31/2000',
freq=pd_utils.tradedays()
)
Out:
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
'2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
'2000-01-13', '2000-01-14', '2000-01-18', '2000-01-19',
'2000-01-20', '2000-01-21', '2000-01-24', '2000-01-25',
'2000-01-26', '2000-01-27', '2000-01-28', '2000-01-31'],
dtype='datetime64[ns]', freq='C')
left_merge_latest
¶
Merge the latest data available in the right DataFrame
to the left
DataFrame
.
pd_utils.left_merge_latest(
df1,
df2,
on='GVKEY',
max_offset=pd.Timedelta(days=30),
# max_offset=pd_utils.tradedays() * 20
)
GVKEY | Date | Date_y | |
---|---|---|---|
0 | 001076 | 1995-03-01 | 1995-02-01 |
1 | 001076 | 1995-04-01 | 1995-03-02 |
2 | 001722 | 2012-01-01 | NaT |
3 | 001722 | 2012-07-01 | NaT |
4 | 001722 | NaT | NaT |
5 | NaN | 2012-01-01 | NaT |
averages
¶
Equal and value-weighted averages, optionally by groups
pd_utils.averages(
df3,
'RET',
['PERMNO', 'byvar'],
wtvar='weight',
)
PERMNO | byvar | RET | RET_wavg | |
---|---|---|---|---|
0 | 10516 | a | 1.025 | 1.025 |
1 | 10516 | b | 1.065 | 1.065 |
2 | 10517 | a | 1.105 | 1.120 |
portfolio
¶
Form porfolios from some numeric column.
pd_utils.portfolio(
df3,
'RET',
ngroups=3,
# cutoffs=[1.02, 1.07],
# quant_cutoffs=[0.2],
byvars='Date',
)
PERMNO | byvar | Date | RET | weight | portfolio | |
---|---|---|---|---|---|---|
0 | 10516 | a | 1/1/2000 | 1.01 | 0 | 1 |
1 | 10516 | a | 1/2/2000 | 1.02 | 1 | 1 |
2 | 10516 | a | 1/3/2000 | 1.03 | 1 | 1 |
3 | 10516 | a | 1/4/2000 | 1.04 | 0 | 1 |
4 | 10516 | b | 1/1/2000 | 1.05 | 1 | 2 |
5 | 10516 | b | 1/2/2000 | 1.06 | 1 | 2 |
6 | 10516 | b | 1/3/2000 | 1.07 | 1 | 2 |
7 | 10516 | b | 1/4/2000 | 1.08 | 1 | 2 |
8 | 10517 | a | 1/1/2000 | 1.09 | 0 | 3 |
9 | 10517 | a | 1/2/2000 | 1.10 | 0 | 3 |
10 | 10517 | a | 1/3/2000 | 1.11 | 0 | 3 |
11 | 10517 | a | 1/4/2000 | 1.12 | 1 | 3 |
long_to_wide
¶
Pandas has a built in wide_to_long
function but not
long_to_wide
. There is .pivot
but it can’t handle multiple by
variables.
pd_utils.long_to_wide(
df3,
["PERMNO", "byvar"],
"RET",
colindex="Date",
colindex_only=True
)
PERMNO | byvar | weight | 1/1/2000 | 1/2/2000 | 1/3/2000 | 1/4/2000 | |
---|---|---|---|---|---|---|---|
0 | 10516 | a | 0 | 1.01 | 1.02 | 1.03 | 1.04 |
1 | 10516 | a | 1 | 1.01 | 1.02 | 1.03 | 1.04 |
2 | 10516 | b | 1 | 1.05 | 1.06 | 1.07 | 1.08 |
3 | 10517 | a | 0 | 1.09 | 1.10 | 1.11 | 1.12 |
4 | 10517 | a | 1 | 1.09 | 1.10 | 1.11 | 1.12 |
winsorize
¶
Winsorize data, optionally by groups, optionally a subset of columns, and optionally only to top or bottom.
pd_utils.winsorize(
df3,
0.4,
subset="RET",
byvars=["PERMNO", "byvar"],
)
PERMNO | byvar | Date | RET | weight | |
---|---|---|---|---|---|
0 | 10516 | a | 1/1/2000 | 1.022624 | 0 |
1 | 10516 | a | 1/2/2000 | 1.022624 | 1 |
2 | 10516 | a | 1/3/2000 | 1.026720 | 1 |
3 | 10516 | a | 1/4/2000 | 1.026720 | 0 |
4 | 10516 | b | 1/1/2000 | 1.062624 | 1 |
5 | 10516 | b | 1/2/2000 | 1.062624 | 1 |
6 | 10516 | b | 1/3/2000 | 1.066720 | 1 |
7 | 10516 | b | 1/4/2000 | 1.066720 | 1 |
8 | 10517 | a | 1/1/2000 | 1.102624 | 0 |
9 | 10517 | a | 1/2/2000 | 1.102624 | 0 |
10 | 10517 | a | 1/3/2000 | 1.106720 | 0 |
11 | 10517 | a | 1/4/2000 | 1.106720 | 1 |
formatted_corr_df
¶
Nicely formatted correlations.
pd_utils.formatted_corr_df(df3)
PERMNO | RET | weight | |
---|---|---|---|
PERMNO | 1.00 | ||
RET | 0.82 | 1.00 | |
weight | -0.48 | -0.12 | 1.00 |
datacode
¶
Data pipelines for humans.
NOTE: Under active development. API is experimental and subject to change.
Find more at the documentation.
Features:
Deal with the concept of variables rather than columns in a DataFrame
Apply transformations to variables to update both the values and name of the variable, but still be able to say it’s the same variable
E.g. take lag of variable “A”, now it is shown as A\(_{t - 1}\) but you can still work with it as the same variable without parsing the name
Access variables by
short_keys
and tab-completion but have them displayed with the label in theDataFrame
.Associate symbols and descriptions with variables. Generates symbols by default and you can override.
Attach data pipelines to generated data sources. It checks when the original sources were last modified, and if they were more recently modified than the pipeline was run, will run the pipeline again automatically.
Easier merges with data merge pipelines and smart merge options
Everything is extendible so you can add your own custom logic
Describe your data sources in detail to enable some features:
Built-in transformations are index-aware. E.g. you have described that rows are indexed by firm and time. When you take a lag of the variable, it will automatically realize it should take the lag across the time dimension and within the firm
(Planned feature): Tell it what variables you want, and it will figure out the merges to make it happen
# TODO [#1]: add examples for datacode
objcache
¶
Easily store Python objects for later (cache results).
Find more at the documentation.
I use this in my workflow so that I can run analysis and store tables and figures with little effort, then later when I generate the paper I retrieve the tables and figures from the cache. That way I can update everything by running the analysis then generating the paper, or I can update just the text in the paper and generate it quickly using the pre-existing tables and figures.
Out:
5
Some Cleanup¶
Not important, just to clean up files generated from the example.
import os
temp_files = [
'cache.zodb',
'cache.zodb.index',
'cache.zodb.lock',
'cache.zodb.tmp'
]
for file in temp_files:
os.remove(file)
pyfileconf
¶
Function and class configuration as Python files, helpful for managing multiple complex configuations.
NOTE: Under active development. API is experimental and subject to change.
Find more at the documentation.
Features:
Easy way to have multiple configurations for a single function or class
Generates Python file templates for configuration, complete with all the arguments, type annotations, and default values of function or class
Run/get configured functions/classes from Python or the command line
Update configurations at run-time in a Python script
Easy to do config-based scripting. E.g.: Run the whole analysis with 3, 4, and 5 portfolios.
Works very well with
datacode
where you need to have many variables, sources, etc. configured
# TODO [#2]: add examples for pyfileconf
Total running time of the script: ( 0 minutes 3.167 seconds)