Additional Research Tools¶

I’ve been using Python actively for research since 2015. One of the beauties of Python is that it’s very easy to write your own functions, modules, and packages for workflows you repeat often. Every time I hit something that was difficult to do in Python, I built a tool to make it easy. After years of doing this, I’ve built up a lot of tools that make empirical research in Python easier. Let’s take a look through them.

Table of Contents¶

**pyexlatex**: Generate LaTeX directly from Python with a simplified API
**regtools**: High-level tools for running regressions
**pd-utils**: Additional utilities to work with Pandas
**datacode**: Data pipelines for humans
**bibtex_gen**: Citation management using Mendeley API and BibTeX
**objcache**: Easily store Python objects for later (cache results)
**pyfileconf**: Function and class configuration as Python files, helpful for managing multiple complex configurations

Some General Imports¶

Import some packages we’ll need across the examples.

import pandas as pd
import numpy as np
from numpy import nan

`pyexlatex`¶

Generate LaTeX directly from Python with a simplified API.

NOTE: You must have a LaTeX distribution installed on your machine for this package to work. Tested with MiKTeX and TeX Live on Windows and Linux.

NOTE: It is highly recommended to run this example in JupyterLab so that PDFs are displayed inline.

Find more at the documentation.

The most basic example:

import pyexlatex as pl

doc = pl.Document('woo')
doc

Out:

<Document>

As a LaTeX str:

print(doc)

Out:

\documentclass[]{article}
\usepackage{amsmath}
\usepackage{pdflscape}
\usepackage{booktabs}
\usepackage{array}
\usepackage{threeparttable}
\usepackage{fancyhdr}
\usepackage{lastpage}
\usepackage{textcomp}
\usepackage{dcolumn}
\newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{.}{D{.}{.}{-1}}
\usepackage[T1]{fontenc}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage[margin=0.8in, bottom=1.2in]{geometry}
\usepackage[page]{appendix}
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0pt}
\fancyhead{}
\rfoot{Page \thepage\  of \pageref{LastPage}}
\cfoot{}
\begin{document}
woo
\end{document}

Object-oriented API example:

my_value = 5

contents = [
    pl.Section(
        [
            f'Some text. My value is {my_value}.',
            pl.UnorderedList([
                'A bullet',
                'List'
            ])
        ],
        title='First Section'
    )
]

doc = pl.Document(contents)
doc

Out:

<Document>

Template-driven API example:

template = """
{% filter Section(title='First Section') %}

Some text. My value is {{ my_value }}.
{{ [
    'A bullet',
    'List'
] | UnorderedList }}

{% endfilter %}
"""

class MyModel(pl.Model):
    my_value = 5

content = [MyModel(template_str=template)]
doc = pl.Document(content)
doc

Out:

<Document>

A combination works as well.

content = [
    MyModel(template_str=template),
    pl.Section(
        [
            f'Some text. My value is {my_value}.',
            pl.UnorderedList([
                'A bullet',
                'List'
            ])
        ],
        title='Second Section'
    )
]

doc = pl.Document(content)
doc

Out:

<Document>

Equations are fine too.

content.append(
    pl.Section(
        [
            ['You can use inline equations', pl.Equation('y = mx + b'),
             'by default, or pass inline=False to separate them', pl.Equation('E = mc^2', inline=False)]
        ],
        title='Equations Example'
    )
)
pl.Document(content)

Out:

<Document>

You can create tables from DataFrames.

# Create a DataFrame for example
df = pd.DataFrame(
    [
        (1, 2, 'Stuff'),
        (3, 4, 'Thing'),
        (5, 6, 'Other Thing'),
    ],
    columns=['a', 'b', 'c']
)

table = pl.Table.from_list_of_lists_of_dfs([[df]])
pl.Document([table])

Out:

<Document>

Publication-quality multi-panel tables with captions, below text, consolidation of indices, etc. are supported.

df.set_index('c', inplace=True)

table = pl.Table.from_list_of_lists_of_dfs(
    [
        [df, df],
        [df, df]
    ],
    shape=(1, 2),
    include_index=True,
    panel_names=['Top Panel', 'Bottom Panel'],
    caption='My First Complex Table',
    below_text="""
    Some description of my table. Isn't it nice to be able to do everything all in one command?
    """,
    label='tables:one'
)
content.append(table)
pl.Document(content)

Out:

<Document>

Figures are supported as well, with an integration for matplotlib and therefore for pandas plots. Figures can also be loaded from file.

ax = df.plot()
fig = ax.get_figure()

pl_fig = pl.Figure.from_dict_of_names_and_plt_figures(
    {
        'My Subfigure': fig,  # more subfigures can be passed in the same way
    },
    '.',  # output location
    figure_name='My Figure',
    label='figs:one',
    position_str_name_dict={
        'My Subfigure': r'[t]{0.95\linewidth}'  # LaTeX positioning strings accepted (be sure to use r'' to escape \)
    }
)
content.append(pl_fig)
pl.Document(content)

[Figure: matplotlib plot produced by df.plot()]

Out:

<Document>

Table/Figure references work just fine and you can use the objects directly if desired.

content.append(
    pl.Section(
        [
            ['See Table', pl.Ref(table.label), 'and Figure', pl.Ref(pl_fig.label)]
        ],
        title='Example for Table and Figure References'
    )
)
pl.Document(content)

Out:

<Document>

Citations are supported as well. There is an easier way to create these using bibtex_gen, as shown in that section.

bibtex_item = pl.BibTexArticle(
    'using-pyexlatex',
    'Nick DeRobertis',
    'How to Use pyexlatex',
    'The Journal of Awesome Stuff',
    '2020',
    volume='Vol 1',
    pages='1-2',
)

content.extend([
    pl.Section(
        [
            ['As shown by', pl.CiteT('using-pyexlatex'), pl.Monospace('pyexlatex'), 'is pretty awesome.']
        ],
        title='Example for Using Citations'
    ),
    pl.Bibliography([bibtex_item], style_name='jof')
])
pl.Document(content)

Out:

<Document>

Add metadata to the document such as author, title, etc.

footnotes = {
        'nick': pl.Footnote(
            "University of Florida, PhD Candidate, Tel: (352)392-4669, Email: Nicholas.DeRobertis@Warrington.ufl.edu"
        ),
        'other': pl.Footnote(
            "Example University, Professor"
        )
    }

abstract = """
A short abstract which is included for example purposes. There is a lot of configuration available for how the document
itself renders. Feel free to take this as an example of something that looks pretty good and then look through the
documentation for how to modify it.
"""

doc = pl.Document(
    content,
    authors=[
        f'{pl.SmallCaps("Nick DeRobertis")}{footnotes["nick"]}',
        f'{pl.SmallCaps("Other Person")}{footnotes["other"]}',
    ],
    title='The title of my paper',
    abstract=abstract,
    page_modifier_str='margin=1.0in',
    section_numbering_styles=dict(
        section=r'\Roman{section}',
        subsection=r'\thesection.\Alph{subsection}',
        subsubsection=r'\thesubsection.\arabic{subsubsection}',
        subfigure=r'\roman{subfigure}',
    ),
    floats_at_end=True,
    font_size=12,
    line_spacing=2,
    tables_relative_font_size=-2,
    page_style='fancyplain',
    custom_headers=[
        pl.Header(pl.SmallCaps('My Short Title'), align='left'),
        pl.Header(pl.SmallCaps(['Page ', pl.ThisPageNumber()]))
    ],
    page_numbers=False,
    separate_abstract_page=True,
    extra_title_page_lines=[
        [pl.Italics('JEL Classification:'), 'E42, E44, E52, G12, G15, [add more here]'],
        [pl.Italics('Keywords:'), 'Thing; stuff; other stuff'],
    ],
)
doc

Out:

<Document>

LaTeX presentations with Beamer are supported as well.

pres_content = [
    pl.Frame(
        [
            'Some text',
            pl.Block(
                [
                    'more text'
                ],
                title='My Block'
            ),
            pl_fig
        ]
    )
]
pl.Presentation(pres_content)

Out:

<Item(name=document, contents=\begin{frame}
Some text
\begin{block}{My Block}
more text
\end{block}
\begin{figure}
\includegraphics[width=0.95\linewidth]{Sources/My_Subfigure.pdf}
\caption{My Figure}
\label{figs:one}
\end{figure}
\end{frame})>

And with sections, metadata, frame templates, etc.

pl_fig

pres_content = [
    pl.Section(
        [
            pl.DimRevealListFrame(
                [
                    'some',
                    'bullet',
                    'points'
                ],
                title='First Frame'
            ),
        ],
        title='First Section'
    ),
    pl.Section(
        [
            pl.Frame(
                pl_fig,
                title='Second Frame'
            ),
        ],
        title='Second Section'
    )

]
pl.Presentation(
    pres_content,
    title='My Presentation',
    authors=['Nick DeRobertis', 'Some Person'],
    short_title='Pres',
    subtitle='An Example Presentation',
    short_author='ND',
    institutions=[
        ['University of Florida'],
        ['University of Florida', 'Some other Place']
    ],
    short_institution='UF',
    nav_header=True,
    toc_sections=True
)

Out:

<Item(name=document, contents=\title[Pres]{My Presentation}
\subtitle{An Example Presentation}
\author[ND]{Nick DeRobertis\inst{1}, Some Person\inst{2}}
\date{\today}
\begin{frame}
\titlepage
\label{title-frame}
\end{frame}
\begin{section}{First Section}
\begin{frame}
\frametitle{First Frame}
\begin{itemize}
\item<+-> \textcolor<.(1)->{black!30}{some}
\vfill
\item<+-> \textcolor<.(1)->{black!30}{bullet}
\vfill
\item<+-> points
\end{itemize}
\end{frame}
\end{section}
\begin{section}{Second Section}
\begin{frame}
\frametitle{Second Frame}
\begin{figure}
\includegraphics[width=0.95\linewidth]{Sources/My_Subfigure.pdf}
\caption{My Figure}
\label{figs:one}
\end{figure}
\end{frame}
\end{section})>

By default it produces the “slides” version that you would use while presenting. It can also produce a “handouts” version which removes all the effects (overlays).

pl.Presentation(
    pres_content,
    title='My Presentation',
    authors=['Nick DeRobertis', 'Some Person'],
    short_title='Pres',
    subtitle='An Example Presentation',
    short_author='ND',
    institutions=[
        ['University of Florida'],
        ['University of Florida', 'Some other Place']
    ],
    short_institution='UF',
    nav_header=True,
    toc_sections=True,
    handouts=True  # add this to remove presentation effects, good for distributing the PDF
)

Out:

<Item(name=document, contents=\title[Pres]{My Presentation}
\subtitle{An Example Presentation}
\author[ND]{Nick DeRobertis\inst{1}, Some Person\inst{2}}
\date{\today}
\begin{frame}
\titlepage
\label{title-frame}
\end{frame}
\begin{section}{First Section}
\begin{frame}
\frametitle{First Frame}
\begin{itemize}
\item some
\vfill
\item bullet
\vfill
\item points
\end{itemize}
\end{frame}
\end{section}
\begin{section}{Second Section}
\begin{frame}
\frametitle{Second Frame}
\begin{figure}
\includegraphics[width=0.95\linewidth]{Sources/My_Subfigure.pdf}
\caption{My Figure}
\label{figs:one}
\end{figure}
\end{frame}
\end{section})>

Some Clean Up¶

Not important, just cleaning up temporary files from the example.

import os

temp_files = [
    'My Subfigure.pdf',
]

for file in temp_files:
    os.remove(file)

`regtools`¶

High-level tools for running regressions.

Find more at the documentation.

Some Setup¶

Create a DataFrame with Y and X variables and a known relationship between them. Also fill some cells with missing values.

df = pd.DataFrame(
    np.random.random((100, 4)),
    columns=['X1', 'X2', 'X3', r'$\epsilon$']
)
df['Y'] = df['X1'] * 5 + df['X2'] * 10 + df['X3'] * 20 + df[r'$\epsilon$'] * 10
df['f1'] = np.random.choice(['a', 'b', 'c'], size=(100,))
df['f2'] = np.random.choice(['d', 'e', 'f'], size=(100,))
df['date'] = pd.to_datetime(np.random.choice(['1/1/2000', '1/2/2000', '1/3/2000'], size=(100,)))
df.iloc[1, 2] = nan
df.iloc[3, 4] = nan
df.head()

	X1	X2	X3	$\epsilon$	Y	f1	f2	date
0	0.898230	0.130509	0.237716	0.225186	12.802420	c	e	2000-01-02
1	0.168765	0.439407	NaN	0.387186	27.583171	c	e	2000-01-03
2	0.084249	0.904845	0.937915	0.462776	32.855745	c	f	2000-01-03
3	0.500707	0.423299	0.949985	0.665030	NaN	c	d	2000-01-03
4	0.476204	0.017644	0.643010	0.311633	18.533991	c	f	2000-01-01

All regressions automatically drop rows with missing values. By default they run with heteroskedasticity-robust standard errors and a constant.

import regtools

result = regtools.reg(
    df,
    'Y',
    ['X1', 'X2', 'X3']
)
result.summary()

OLS Regression Results
Dep. Variable:	Y	R-squared:	0.862
Model:	OLS	Adj. R-squared:	0.858
Method:	Least Squares	F-statistic:	231.0
Date:	Wed, 19 Feb 2020	Prob (F-statistic):	3.09e-43
Time:	15:02:24	Log-Likelihood:	-236.19
No. Observations:	98	AIC:	480.4
Df Residuals:	94	BIC:	490.7
Df Model:	3
Covariance Type:	HC1

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	2.6209	0.984	2.663	0.008	0.692	4.550
X1	6.5772	0.968	6.795	0.000	4.680	8.474
X2	11.6667	0.995	11.723	0.000	9.716	13.617
X3	20.8877	0.927	22.529	0.000	19.071	22.705

Omnibus:	6.133	Durbin-Watson:	2.113
Prob(Omnibus):	0.047	Jarque-Bera (JB):	2.947
Skew:	0.137	Prob(JB):	0.229
Kurtosis:	2.196	Cond. No.	7.12

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC1)
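
To make those defaults concrete, here is a minimal sketch using only NumPy and pandas. It is not how regtools is implemented; the simulated data, the noise scale, and every variable name below are assumptions for illustration. It drops incomplete rows, adds a constant, and computes HC1 heteroskedasticity-robust standard errors by hand.

```python
import numpy as np
import pandas as pd

# Simulated data (an assumption for this sketch, not the data used above)
rng = np.random.default_rng(0)
sim = pd.DataFrame(rng.random((100, 3)), columns=['X1', 'X2', 'X3'])
sim['Y'] = (
    5 * sim['X1'] + 10 * sim['X2'] + 20 * sim['X3']
    + rng.normal(scale=0.5, size=100)
)
sim.iloc[1, 2] = np.nan  # missing X3
sim.iloc[3, 3] = np.nan  # missing Y

# 1. Drop any row with a missing value in the variables used
clean = sim[['Y', 'X1', 'X2', 'X3']].dropna()

# 2. Add a constant column to the regressors
X = np.column_stack([np.ones(len(clean)), clean[['X1', 'X2', 'X3']].to_numpy()])
y = clean['Y'].to_numpy()
n, k = X.shape

# 3. OLS coefficients
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# 4. HC1 robust covariance: sandwich estimator scaled by n / (n - k)
resid = y - X @ beta
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv
std_err = np.sqrt(np.diag(cov_hc1))

hc1_result = pd.DataFrame(
    {'coef': beta, 'std err': std_err},
    index=['const', 'X1', 'X2', 'X3'],
)
print(hc1_result)
```

Two rows contain a missing value, so 98 of the 100 observations enter the estimation, which mirrors the observation count in the summary above.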

Run multiple regressions in one go with iteration tools. All functions also support fixed effects and multiway clustering.

reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary(
    df,
    'Y',
    [
        ['X1'],
        ['X1', 'X2'],
        ['X1', 'X3'],
        ['X1', 'X2', 'X3']
    ],
    fe=[['f1', 'f2']],
    entity_var='f1',
    time_var='date',
    cluster=['f1', 'f2'],
    robust=False
)
summ

Out:

/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/linearmodels/panel/results.py:85: RuntimeWarning: invalid value encountered in sqrt
  return Series(np.sqrt(np.diag(self.cov)), self._var_names, name="std_error")
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater
  return (a < x) & (x < b)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less
  return (a < x) & (x < b)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:1827: RuntimeWarning: invalid value encountered in greater_equal
  cond2 = (x >= np.asarray(_b)) & cond0
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:17: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['type'] = ['estimate', 'stderr'] * int(len(df.index) / 2)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:20: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['regressor'] = [i for sublist in [[j] * 2 for j in df.index[0::2]] for i in sublist]
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:31: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ] = t_values
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/pandas/core/frame.py:3997: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:170: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['type'] = ['estimate', 'stderr'] * int(len(df.index) / 2)
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['regressor'] = [i for sublist in [[j] * 2 for j in df.index[0::2]] for i in sublist]
/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:177: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['idx'] = [i for i in range(len(df))]

	Y I	Y II	Y III	Y IIII
R-squared	nan	nan	nan	nan

X1	2.28	3.59*	5.29***	6.83***
	(0.83)	(1.67)	(3.33)	(8.78)
X2		10.12***		11.09
		(4.02)
X3			20.26***	20.63***
			(12.78)	(42.98)
Intercept	20.88***	14.60***	10.06***	2.94***
	(15.96)	(5.62)	(6.87)	(4.07)
f1 Fixed Effects	Yes	Yes	Yes	Yes
f2 Fixed Effects	Yes	Yes	Yes	Yes
Cluster by f1	Yes	Yes	Yes	Yes
Cluster by f2	Yes	Yes	Yes	Yes
N	99	99	98	98
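
The idea of looping over regressor sets and collecting one column per model can be sketched with plain NumPy and pandas. This is only an illustration under stated assumptions: the data is simulated, the names `xvar_sets` and `summary` are made up for the sketch, and it does no fixed effects, clustering, or significance formatting, all of which regtools handles for you.

```python
import numpy as np
import pandas as pd

# Simulated data (an assumption for this sketch)
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.random((200, 3)), columns=['X1', 'X2', 'X3'])
data['Y'] = (
    5 * data['X1'] + 10 * data['X2'] + 20 * data['X3']
    + rng.normal(scale=0.1, size=200)
)

xvar_sets = [['X1'], ['X1', 'X2'], ['X1', 'X3'], ['X1', 'X2', 'X3']]

columns = {}
for i, xvars in enumerate(xvar_sets, start=1):
    # Constant plus the regressors for this specification
    X = np.column_stack([np.ones(len(data)), data[xvars].to_numpy()])
    coefs, *_ = np.linalg.lstsq(X, data['Y'].to_numpy(), rcond=None)
    columns[f'Model {i}'] = pd.Series(coefs, index=['Intercept'] + xvars)

# Align all models on a common set of rows; absent regressors become NaN
summary = pd.DataFrame(columns).reindex(['X1', 'X2', 'X3', 'Intercept'])
print(summary.round(2))
```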

The default is OLS. Other supported types are Probit, Logit, Quantile, and Fama-MacBeth; pass the name as a string to reg_type.

reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary(
    df,
    'Y',
    [
        ['X1'],
        ['X1', 'X2'],
        ['X1', 'X3'],
        ['X1', 'X2', 'X3']
    ],
    reg_type='quantile',
    q=0.9
)
summ

reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary(
    df,
    'Y',
    [
        ['X1'],
        ['X1', 'X2'],
        ['X1', 'X3'],
        ['X1', 'X2', 'X3']
    ],
    reg_type='fama-macbeth',
    entity_var='f1',
    time_var='date'
)
summ

Out:

/home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/linearmodels/panel/model.py:2631: InferenceUnavailableWarning: The number of time-series observation available to estimate cross-sectional
regressions, 3, is less than the number of parameters in the model. Parameter
inference is not available.
  InferenceUnavailableWarning,

	Y I	Y II	Y III	Y IIII
R-squared	nan	nan	nan	nan

X1	2.67	3.87	4.34***	6.16***
	(0.72)	(1.15)	(2.67)	(4.00)
X2		10.24***		11.13***
		(4.63)		(15.45)
X3			20.60***	20.93***
			(14.72)	(43.63)
Intercept	20.53***	14.26***	10.25***	2.99***
	(11.98)	(5.19)	(10.81)	(3.83)
N	99	99	98	98

`pd-utils`¶

Additional utilities to work with Pandas.

Find more at the documentation.

Some Setup¶

df1 = pd.DataFrame(
    [
        ("001076", "3/1/1995"),
        ("001076", "4/1/1995"),
        ("001722", "1/1/2012"),
        ("001722", "7/1/2012"),
        ("001722", nan),
        (nan, "1/1/2012"),
    ],
    columns=["GVKEY", "Date"],
)
df1["Date"] = pd.to_datetime(df1["Date"])

df2 = pd.DataFrame(
    [
        ("001076", "2/1/1995"),
        ("001076", "3/2/1995"),
        ("001722", "11/1/2011"),
        ("001722", "10/1/2011"),
        ("001722", nan),
        (nan, "1/1/2012"),
    ],
    columns=["GVKEY", "Date"],
)
df2["Date"] = pd.to_datetime(df2["Date"])

df3 = pd.DataFrame(
    data=[
        (10516, "a", "1/1/2000", 1.01, 0),
        (10516, "a", "1/2/2000", 1.02, 1),
        (10516, "a", "1/3/2000", 1.03, 1),
        (10516, "a", "1/4/2000", 1.04, 0),
        (10516, "b", "1/1/2000", 1.05, 1),
        (10516, "b", "1/2/2000", 1.06, 1),
        (10516, "b", "1/3/2000", 1.07, 1),
        (10516, "b", "1/4/2000", 1.08, 1),
        (10517, "a", "1/1/2000", 1.09, 0),
        (10517, "a", "1/2/2000", 1.1, 0),
        (10517, "a", "1/3/2000", 1.11, 0),
        (10517, "a", "1/4/2000", 1.12, 1),
    ],
    columns=["PERMNO", "byvar", "Date", "RET", "weight"],
)

df1

df2

df3

	PERMNO	byvar	Date	RET	weight
0	10516	a	1/1/2000	1.01	0
1	10516	a	1/2/2000	1.02	1
2	10516	a	1/3/2000	1.03	1
3	10516	a	1/4/2000	1.04	0
4	10516	b	1/1/2000	1.05	1
5	10516	b	1/2/2000	1.06	1
6	10516	b	1/3/2000	1.07	1
7	10516	b	1/4/2000	1.08	1
8	10517	a	1/1/2000	1.09	0
9	10517	a	1/2/2000	1.10	0
10	10517	a	1/3/2000	1.11	0
11	10517	a	1/4/2000	1.12	1

`tradedays`¶

Work directly with US market trading days.

import pd_utils

pd.date_range(
    start='1/1/2000',
    end='1/31/2000',
    freq=pd_utils.tradedays()
)

Out:

DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
               '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14', '2000-01-18', '2000-01-19',
               '2000-01-20', '2000-01-21', '2000-01-24', '2000-01-25',
               '2000-01-26', '2000-01-27', '2000-01-28', '2000-01-31'],
              dtype='datetime64[ns]', freq='C')
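
The `freq='C'` in the output above indicates a custom business day frequency. For comparison, a rough equivalent can be built with pandas alone. This sketch assumes the US federal holiday calendar is an acceptable stand-in; it is not the exchange calendar (for example, Good Friday is a market holiday but not a federal one), so treat it as an approximation rather than a replacement for `pd_utils.tradedays()`.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Business days that skip weekends and US federal holidays (approximation)
approx_tradedays = CustomBusinessDay(calendar=USFederalHolidayCalendar())

days = pd.date_range(start='2000-01-01', end='2000-01-31', freq=approx_tradedays)
print(len(days))
print(pd.Timestamp('2000-01-17') in days)  # Martin Luther King Jr. Day
```

For January 2000 this yields the same 20 dates as above, with the January 17 holiday excluded.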

`left_merge_latest`¶

Merge the latest data available in the right DataFrame to the left DataFrame.

pd_utils.left_merge_latest(
    df1,
    df2,
    on='GVKEY',
    max_offset=pd.Timedelta(days=30),
#     max_offset=pd_utils.tradedays() * 20
)

	GVKEY	Date	Date_y
0	001076	1995-03-01	1995-02-01
1	001076	1995-04-01	1995-03-02
2	001722	2012-01-01	NaT
3	001722	2012-07-01	NaT
4	001722	NaT	NaT
5	NaN	2012-01-01	NaT
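
A similar "latest available row" merge can be sketched with `pandas.merge_asof`. This is not the pd_utils implementation, and the frames below are a trimmed copy of the example data without the missing keys, since `merge_asof` does not accept nulls in the merge key. The column name `Date_right` is made up for the sketch.

```python
import pandas as pd

left = pd.DataFrame({
    'GVKEY': ['001076', '001076', '001722'],
    'Date': pd.to_datetime(['1995-03-01', '1995-04-01', '2012-01-01']),
})
right = pd.DataFrame({
    'GVKEY': ['001076', '001076', '001722'],
    'Date_right': pd.to_datetime(['1995-02-01', '1995-03-02', '2011-11-01']),
})

# Both frames must be sorted by their date key
merged = pd.merge_asof(
    left.sort_values('Date'),
    right.sort_values('Date_right'),
    left_on='Date',
    right_on='Date_right',
    by='GVKEY',                        # match only within the same GVKEY
    direction='backward',              # take the latest right row at or before Date
    tolerance=pd.Timedelta(days=30),   # ignore right rows that are too old
)
print(merged)
```

The 2012 row stays unmatched because the most recent right-hand date for that GVKEY is about two months earlier, outside the 30-day tolerance.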

`averages`¶

Compute equal-weighted and value-weighted averages, optionally by groups.

pd_utils.averages(
    df3,
    'RET',
    ['PERMNO', 'byvar'],
    wtvar='weight',
)

	PERMNO	byvar	RET	RET_wavg
0	10516	a	1.025	1.025
1	10516	b	1.065	1.065
2	10517	a	1.105	1.120
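
To show what the two columns mean, here is a plain pandas sketch of the same calculation. It is an illustration, not the pd_utils code; the helper column `ret_x_w` is made up for the sketch. The equal-weighted average is the group mean, and the value-weighted average is the sum of RET times weight divided by the sum of weights.

```python
import pandas as pd

returns = pd.DataFrame({
    'PERMNO': [10516] * 8 + [10517] * 4,
    'byvar': ['a'] * 4 + ['b'] * 4 + ['a'] * 4,
    'RET': [1.01, 1.02, 1.03, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09, 1.10, 1.11, 1.12],
    'weight': [0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
})

grouped = (
    returns.assign(ret_x_w=returns['RET'] * returns['weight'])
    .groupby(['PERMNO', 'byvar'])
)

avgs = pd.DataFrame({
    'RET': grouped['RET'].mean(),                                    # equal-weighted
    'RET_wavg': grouped['ret_x_w'].sum() / grouped['weight'].sum(),  # value-weighted
}).reset_index()
print(avgs)
```

In the last group only the final observation has a nonzero weight, which is why its weighted average equals that observation's return of 1.12.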

`portfolio`¶

Form portfolios from some numeric column.

pd_utils.portfolio(
    df3,
    'RET',
    ngroups=3,
#     cutoffs=[1.02, 1.07],
#     quant_cutoffs=[0.2],
    byvars='Date',
)

	PERMNO	byvar	Date	RET	weight	portfolio
0	10516	a	1/1/2000	1.01	0	1
1	10516	a	1/2/2000	1.02	1	1
2	10516	a	1/3/2000	1.03	1	1
3	10516	a	1/4/2000	1.04	0	1
4	10516	b	1/1/2000	1.05	1	2
5	10516	b	1/2/2000	1.06	1	2
6	10516	b	1/3/2000	1.07	1	2
7	10516	b	1/4/2000	1.08	1	2
8	10517	a	1/1/2000	1.09	0	3
9	10517	a	1/2/2000	1.10	0	3
10	10517	a	1/3/2000	1.11	0	3
11	10517	a	1/4/2000	1.12	1	3
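
The equal-sized grouping within each date can be approximated with `pandas.qcut`. This is a sketch under the assumption that quantile-based bins are what is wanted; it does not cover the `cutoffs` or `quant_cutoffs` options, and it is not the pd_utils implementation.

```python
import pandas as pd

rets = pd.DataFrame({
    'Date': ['1/1/2000', '1/2/2000', '1/3/2000', '1/4/2000'] * 3,
    'RET': [1.01, 1.02, 1.03, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09, 1.10, 1.11, 1.12],
})

# Within each date, sort RET into 3 quantile bins numbered 1 to 3
rets['portfolio'] = (
    rets.groupby('Date')['RET']
    .transform(lambda s: pd.qcut(s, 3, labels=False))
    .astype(int)
    + 1
)
print(rets)
```

Each date has three observations, so each one lands in its own portfolio, matching the assignment in the table above.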

`long_to_wide`¶

Pandas has a built-in wide_to_long function but not long_to_wide. There is .pivot but it can’t handle multiple by variables.

pd_utils.long_to_wide(
    df3,
    ["PERMNO", "byvar"],
    "RET",
    colindex="Date",
    colindex_only=True
)

	PERMNO	byvar	weight	1/1/2000	1/2/2000	1/3/2000	1/4/2000
0	10516	a	0	1.01	1.02	1.03	1.04
1	10516	a	1	1.01	1.02	1.03	1.04
2	10516	b	1	1.05	1.06	1.07	1.08
3	10517	a	0	1.09	1.10	1.11	1.12
4	10517	a	1	1.09	1.10	1.11	1.12
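
For a sense of the reshaping involved, the same long-to-wide operation with two by variables can be written in plain pandas using `set_index` and `unstack`. This sketch leaves out the weight column, so it produces one row per PERMNO and byvar pair rather than the repeated rows seen above; it is an illustration, not the pd_utils implementation.

```python
import pandas as pd

long_df = pd.DataFrame({
    'PERMNO': [10516] * 8 + [10517] * 4,
    'byvar': ['a'] * 4 + ['b'] * 4 + ['a'] * 4,
    'Date': ['1/1/2000', '1/2/2000', '1/3/2000', '1/4/2000'] * 3,
    'RET': [1.01, 1.02, 1.03, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09, 1.10, 1.11, 1.12],
})

wide = (
    long_df.set_index(['PERMNO', 'byvar', 'Date'])['RET']
    .unstack('Date')   # move the Date level into the columns
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'Date' label on the column axis
print(wide)
```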

`winsorize`¶

Winsorize data, optionally by groups, optionally a subset of columns, and optionally only to top or bottom.

pd_utils.winsorize(
    df3,
    0.4,
    subset="RET",
    byvars=["PERMNO", "byvar"],
)

	PERMNO	byvar	Date	RET	weight
0	10516	a	1/1/2000	1.022624	0
1	10516	a	1/2/2000	1.022624	1
2	10516	a	1/3/2000	1.026720	1
3	10516	a	1/4/2000	1.026720	0
4	10516	b	1/1/2000	1.062624	1
5	10516	b	1/2/2000	1.062624	1
6	10516	b	1/3/2000	1.066720	1
7	10516	b	1/4/2000	1.066720	1
8	10517	a	1/1/2000	1.102624	0
9	10517	a	1/2/2000	1.102624	0
10	10517	a	1/3/2000	1.106720	0
11	10517	a	1/4/2000	1.106720	1
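
Winsorizing by group amounts to clipping each group's values at its own lower and upper quantiles. The sketch below shows that idea with pandas defaults. The data and the helper `winsorize_series` are made up for illustration, and because pandas interpolates quantiles linearly, the cutoffs will generally not match the ones pd_utils computes above.

```python
import pandas as pd

values = pd.DataFrame({
    'group': ['a'] * 5 + ['b'] * 5,
    'value': [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 20.0, 30.0, 40.0, -500.0],
})


def winsorize_series(s: pd.Series, pct: float) -> pd.Series:
    """Clip a series at its pct and (1 - pct) quantiles."""
    lower, upper = s.quantile(pct), s.quantile(1 - pct)
    return s.clip(lower=lower, upper=upper)


# Apply the clipping separately within each group
values['value_w'] = values.groupby('group')['value'].transform(
    lambda s: winsorize_series(s, 0.2)
)
print(values)
```

The extreme values 100 and -500 are pulled in to their group's cutoffs, while values already inside the cutoffs are left unchanged.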

`formatted_corr_df`¶

Nicely formatted correlations.

pd_utils.formatted_corr_df(df3)

	PERMNO	RET	weight
PERMNO	1.00
RET	0.82	1.00
weight	-0.48	-0.12	1.00
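
The lower-triangular layout above can be reproduced in plain pandas by formatting the correlation matrix and blanking the upper triangle. This is a sketch of the presentation only, with made-up data and names; it is not the pd_utils implementation and does not claim to cover its other options.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 1.0, 4.0, 3.0],
    'c': [4.0, 3.0, 2.0, 1.0],
    'label': list('wxyz'),  # non-numeric column, excluded below
})

corr = sample.select_dtypes('number').corr()

# True on and below the diagonal, False above it
lower = np.tril(np.ones(corr.shape, dtype=bool))

# Format to two decimals, then blank out the upper triangle
formatted = corr.apply(lambda col: col.map('{:.2f}'.format)).where(lower, '')
print(formatted)
```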

`datacode`¶

Data pipelines for humans.

NOTE: Under active development. API is experimental and subject to change.

Find more at the documentation.

Features:

- Deal with the concept of variables rather than columns in a DataFrame
- Apply transformations to variables to update both the values and name of the variable, but still be able to say it’s the same variable
  - E.g. take lag of variable “A”, now it is shown as A$_{t - 1}$ but you can still work with it as the same variable without parsing the name
- Access variables by short_keys and tab-completion but have them displayed with the label in the DataFrame.
- Associate symbols and descriptions with variables. Generates symbols by default and you can override.
- Attach data pipelines to generated data sources. It checks when the original sources were last modified, and if they were more recently modified than the pipeline was run, will run the pipeline again automatically.
- Easier merges with data merge pipelines and smart merge options
- Everything is extendible so you can add your own custom logic
- Describe your data sources in detail to enable some features:
  - Built-in transformations are index-aware. E.g. you have described that rows are indexed by firm and time. When you take a lag of the variable, it will automatically realize it should take the lag across the time dimension and within the firm
  - (Planned feature): Tell it what variables you want, and it will figure out the merges to make it happen
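
To illustrate what an index-aware lag has to do, here is the manual version in plain pandas. It does not use the datacode API, which is experimental and not shown in this document; the frame, the firm names, and the `A_lag1` column name are assumptions for the sketch. The point is that the lag must follow the time dimension and must never cross from one firm into the next.

```python
import pandas as pd

panel = pd.DataFrame({
    'firm': ['firm1'] * 3 + ['firm2'] * 3,
    'date': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03'] * 2),
    'A': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Sort by firm, then time, so shifting moves along the time dimension
panel = panel.sort_values(['firm', 'date'])

# Shift within each firm so the first row of firm2 does not inherit firm1's last value
panel['A_lag1'] = panel.groupby('firm')['A'].shift(1)
print(panel)
```

The first observation of each firm has no lag and is left missing, which is the behavior a naive `panel['A'].shift(1)` would get wrong for the second firm.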

# TODO [#1]: add examples for datacode

`bibtex_gen`¶

Citation management using Mendeley API and BibTeX.

Find more at the documentation.

`objcache`¶

Easily store Python objects for later (cache results).

Find more at the documentation.

I use this in my workflow so that I can run analysis and store tables and figures with little effort, then later when I generate the paper I retrieve the tables and figures from the cache. That way I can update everything by running the analysis then generating the paper, or I can update just the text in the paper and generate it quickly using the pre-existing tables and figures.

from objcache import ObjectCache

cache = ObjectCache('cache.zodb', ('a', 'b'))
cache.store(5)

# Later session
cache = ObjectCache('cache.zodb', ('a', 'b'))
result = cache.get()
print(result)

Out:

5

Some Cleanup¶

Not important, just to clean up files generated from the example.

import os

temp_files = [
    'cache.zodb',
    'cache.zodb.index',
    'cache.zodb.lock',
    'cache.zodb.tmp'
]

for file in temp_files:
    os.remove(file)

`pyfileconf`¶

Function and class configuration as Python files, helpful for managing multiple complex configurations.

NOTE: Under active development. API is experimental and subject to change.

Find more at the documentation.

Features:

- Easy way to have multiple configurations for a single function or class
- Generates Python file templates for configuration, complete with all the arguments, type annotations, and default values of function or class
- Run/get configured functions/classes from Python or the command line
- Update configurations at run-time in a Python script
  - Easy to do config-based scripting. E.g.: Run the whole analysis with 3, 4, and 5 portfolios.
- Works very well with datacode where you need to have many variables, sources, etc. configured

# TODO [#2]: add examples for pyfileconf
