.. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_additional_research_tools.py: Additional Research Tools ========================= I've been using Python actively for research since 2015. One of the beauties of Python is that it's very easy to write your own functions, modules, and packages for workflows you do often. Every time that I hit something which was pretty difficult in Python, I built a tool for it to make it easy. The result after these years of doing this is that I've built up a lot of tools that make empirical research in Python easier. Let's take a look through them. Table of Contents ----------------- - `**pyexlatex**: Generate LaTeX directly from Python with a simplified API <#pyexlatex>`__ - `**regtools**: High-level tools for running regressions <#regtools>`__ - `**pd-utils**: Additional utilities to work with Pandas <#pd-utils>`__ - `**datacode**: Data pipelines for humans <#datacode>`__ - `**bibtex\_gen**: Citation management using Mendeley API and BibTeX <#bibtex_gen>`__ - `**objcache**: Easily store Python objects for later (cache results) <#objcache>`__ - `**pyfileconf**: Function and class configuration as Python files, helpful for managing multiple complex configuations <#pyfileconf>`__ Some General Imports ~~~~~~~~~~~~~~~~~~~~ Import some packages we'll need across the examples. .. code-block:: default import pandas as pd import numpy as np from numpy import nan ``pyexlatex`` ------------- Generate LaTeX directly from Python with a simplified API. NOTE: You must have a LaTeX distribution installed on your machine for this package to work. Tested with MikTeX and TeXLive on Windows and Linux. NOTE: It is highly recommended to run this example in Jupyer Lab so that PDFs will be outputted inline in Jupyter. Find more at `the documentation `__. The most basic example: .. code-block:: default import pyexlatex as pl doc = pl.Document('woo') doc .. rst-class:: sphx-glr-script-out Out: .. code-block:: none As a LaTeX str: .. code-block:: default print(doc) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none \documentclass[]{article} \usepackage{amsmath} \usepackage{pdflscape} \usepackage{booktabs} \usepackage{array} \usepackage{threeparttable} \usepackage{fancyhdr} \usepackage{lastpage} \usepackage{textcomp} \usepackage{dcolumn} \newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}} \newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}} \newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}} \newcolumntype{.}{D{.}{.}{-1}} \usepackage[T1]{fontenc} \usepackage{caption} \usepackage{subcaption} \usepackage{graphicx} \usepackage[margin=0.8in, bottom=1.2in]{geometry} \usepackage[page]{appendix} \pagestyle{fancy} \renewcommand{\headrulewidth}{0pt} \fancyhead{} \rfoot{Page \thepage\ of \pageref{LastPage}} \cfoot{} \begin{document} woo \end{document} Object-oriented API example: .. code-block:: default my_value = 5 contents = [ pl.Section( [ f'Some text. My value is {my_value}.', pl.UnorderedList([ 'A bullet', 'List' ]) ], title='First Section' ) ] doc = pl.Document(contents) doc .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Template-driven API example: .. code-block:: default template = """ {% filter Section(title='First Section') %} Some text. My value is {{ my_value }}. 
{{ [ 'A bullet', 'List' ] | UnorderedList }} {% endfilter %} """ class MyModel(pl.Model): my_value = 5 content = [MyModel(template_str=template)] doc = pl.Document(content) doc .. rst-class:: sphx-glr-script-out Out: .. code-block:: none A combination works as well. .. code-block:: default content = [ MyModel(template_str=template), pl.Section( [ f'Some text. My value is {my_value}.', pl.UnorderedList([ 'A bullet', 'List' ]) ], title='Second Section' ) ] doc = pl.Document(content) doc .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Equations are fine too. .. code-block:: default content.append( pl.Section( [ ['You can use inline equations', pl.Equation('y = mx + b'), 'by default, or pass inline=False to separate them', pl.Equation('E = MC^2', inline=False)] ], title='Equations Example' ) ) pl.Document(content) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none You can create tables from ``DataFrames``. .. code-block:: default # Create a DataFrame for example df = pd.DataFrame( [ (1, 2, 'Stuff'), (3, 4, 'Thing'), (5, 6, 'Other Thing'), ], columns=['a', 'b', 'c'] ) table = pl.Table.from_list_of_lists_of_dfs([[df]]) pl.Document([table]) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Publication-quality multi-panel tables with captions, below text, consolidation of indices, etc. are supported. .. code-block:: default df.set_index('c', inplace=True) table = pl.Table.from_list_of_lists_of_dfs( [ [df, df], [df, df] ], shape=(1, 2), include_index=True, panel_names=['Top Panel', 'Bottom Panel'], caption='My First Complex Table', below_text=""" Some description of my table. Isn't it nice to be able to do everything all in one command? """, label='tables:one' ) content.append(table) pl.Document(content) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Figures are supported as well, with an integration for ``matplotlib`` and therefore ``pandas`` as well (can also be loaded from file). .. code-block:: default ax = df.plot() fig = ax.get_figure() pl_fig = pl.Figure.from_dict_of_names_and_plt_figures( { 'My Subfigure': fig, # more subfigures can be passed in the same way }, '.', # output location figure_name='My Figure', label='figs:one', position_str_name_dict={ 'My Subfigure': r'[t]{0.95\linewidth}' # LaTeX positioning strings accepted (be sure to use r'' to escape \) } ) content.append(pl_fig) pl.Document(content) .. image:: /auto_examples/images/sphx_glr_additional_research_tools_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Table/Figure references work just fine and you can use the objects directly if desired. .. code-block:: default content.append( pl.Section( [ ['See Table', pl.Ref(table.label), 'and Figure', pl.Ref(pl_fig.label)] ], title='Example for Table and Figure References' ) ) pl.Document(content) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Support for citations as well. There is an easier way to create these using ``bibtex_gen`` `as shown in that section <#bibtex_gen>`__. .. code-block:: default bibtex_item = pl.BibTexArticle( 'using-pyexlatex', 'Nick DeRobertis', 'How to Use pyexlatex', 'The Journal of Awesome Stuff', '2020', volume='Vol 1', pages='1-2', ) content.extend([ pl.Section( [ ['As shown by', pl.CiteT('using-pyexlatex'), pl.Monospace('pyexlatex'), 'is pretty awesome.'] ], title='Example for Using Citations' ), pl.Bibliography([bibtex_item], style_name='jof') ]) pl.Document(content) .. rst-class:: sphx-glr-script-out Out: .. 
code-block:: none Add metadata to the document such as author, title, etc. .. code-block:: default footnotes = { 'nick': pl.Footnote( "University of Florida, PhD Candidate, Tel: (352)392-4669, Email: Nicholas.DeRobertis@Warrington.ufl.edu" ), 'other': pl.Footnote( "Example University, Professor" ) } abstract = """ A short abstract which is included for example purposes. There is a lot of configuration available for how the document itself renders. Feel free to take this as an example of something that looks pretty good and then look through the documentation for how to modify it. """ doc = pl.Document( content, authors=[ f'{pl.SmallCaps("Nick DeRobertis")}{footnotes["nick"]}', f'{pl.SmallCaps("Other Person")}{footnotes["other"]}', ], title='The title of my paper', abstract=abstract, page_modifier_str='margin=1.0in', section_numbering_styles=dict( section=r'\Roman{section}', subsection=r'\thesection.\Alph{subsection}', subsubsection=r'\thesubsection.\arabic{subsubsection}', subfigure=r'\roman{subfigure}', ), floats_at_end=True, font_size=12, line_spacing=2, tables_relative_font_size=-2, page_style='fancyplain', custom_headers=[ pl.Header(pl.SmallCaps('My Short Title'), align='left'), pl.Header(pl.SmallCaps(['Page ', pl.ThisPageNumber()])) ], page_numbers=False, separate_abstract_page=True, extra_title_page_lines=[ [pl.Italics('JEL Classification:'), 'E42, E44, E52, G12, G15, [add more here]'], [pl.Italics('Keywords:'), 'Thing; stuff; other stuf'], ], ) doc .. rst-class:: sphx-glr-script-out Out: .. code-block:: none LaTeX presentations with Beamer are supported as well. .. code-block:: default pres_content = [ pl.Frame( [ 'Some text', pl.Block( [ 'more text' ], title='My Block' ), pl_fig ] ) ] pl.Presentation(pres_content) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none And with sections, metadata, frame templates, etc. .. code-block:: default pl_fig pres_content = [ pl.Section( [ pl.DimRevealListFrame( [ 'some', 'bullet', 'points' ], title='First Frame' ), ], title='First Section' ), pl.Section( [ pl.Frame( pl_fig, title='Second Frame' ), ], title='Second Section' ) ] pl.Presentation( pres_content, title='My Presentation', authors=['Nick DeRobertis', 'Some Person'], short_title='Pres', subtitle='An Example Presentation', short_author='ND', institutions=[ ['University of Florida'], ['University of Florida', 'Some other Place'] ], short_institution='UF', nav_header=True, toc_sections=True ) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none \textcolor<.(1)->{black!30}{some} \vfill \item<+-> \textcolor<.(1)->{black!30}{bullet} \vfill \item<+-> points \end{itemize} \end{frame} \end{section} \begin{section}{Second Section} \begin{frame} \frametitle{Second Frame} \begin{figure} \includegraphics[width=0.95\linewidth]{Sources/My_Subfigure.pdf} \caption{My Figure} \label{figs:one} \end{figure} \end{frame} \end{section})> By default it produces the "slides" version that you would use while presenting. It can also produce a "handouts" version which removes all the effects (overlays). .. code-block:: default pl.Presentation( pres_content, title='My Presentation', authors=['Nick DeRobertis', 'Some Person'], short_title='Pres', subtitle='An Example Presentation', short_author='ND', institutions=[ ['University of Florida'], ['University of Florida', 'Some other Place'] ], short_institution='UF', nav_header=True, toc_sections=True, handouts=True # add this to remove presentation effects, good for distributing the PDF ) .. rst-class:: sphx-glr-script-out Out: .. 
code-block:: none Some Clean Up ~~~~~~~~~~~~~ Not important, just cleaning up temporary files from the example. .. code-block:: default import os temp_files = [ 'My Subfigure.pdf', ] for file in temp_files: os.remove(file) ``regtools`` ------------ High-level tools for running regressions. Find more at `the documentation `__. Some Setup ~~~~~~~~~~ Create a DataFrame with Y and X variables and a known relationship between them. Also fill some cells with missing values. .. code-block:: default df = pd.DataFrame( np.random.random((100, 4)), columns=['X1', 'X2', 'X3', r'$\epsilon$'] ) df['Y'] = df['X1'] * 5 + df['X2'] * 10 + df['X3'] * 20 + df[r'$\epsilon$'] * 10 df['f1'] = np.random.choice(['a', 'b', 'c'], size=(100,)) df['f2'] = np.random.choice(['d', 'e', 'f'], size=(100,)) df['date'] = pd.to_datetime(np.random.choice(['1/1/2000', '1/2/2000', '1/3/2000'], size=(100,))) df.iloc[1, 2] = nan df.iloc[3, 4] = nan df.head() .. only:: builder_html .. raw:: html
         X1        X2        X3  $\epsilon$          Y f1 f2       date
0  0.898230  0.130509  0.237716    0.225186  12.802420  c  e 2000-01-02
1  0.168765  0.439407       NaN    0.387186  27.583171  c  e 2000-01-03
2  0.084249  0.904845  0.937915    0.462776  32.855745  c  f 2000-01-03
3  0.500707  0.423299  0.949985    0.665030        NaN  c  d 2000-01-03
4  0.476204  0.017644  0.643010    0.311633  18.533991  c  f 2000-01-01


All regressions automatically drop rows with missing values. By default they run with heteroskedasticity-robust standard errors and include a constant. .. code-block:: default import regtools result = regtools.reg( df, 'Y', ['X1', 'X2', 'X3'] ) result.summary() .. only:: builder_html .. raw:: html
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.862
Model:                            OLS   Adj. R-squared:                  0.858
Method:                 Least Squares   F-statistic:                     231.0
Date:                Wed, 19 Feb 2020   Prob (F-statistic):           3.09e-43
Time:                        15:02:24   Log-Likelihood:                -236.19
No. Observations:                  98   AIC:                             480.4
Df Residuals:                      94   BIC:                             490.7
Df Model:                           3
Covariance Type:                  HC1
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.6209      0.984      2.663      0.008       0.692       4.550
X1             6.5772      0.968      6.795      0.000       4.680       8.474
X2            11.6667      0.995     11.723      0.000       9.716      13.617
X3            20.8877      0.927     22.529      0.000      19.071      22.705
==============================================================================
Omnibus:                        6.133   Durbin-Watson:                   2.113
Prob(Omnibus):                  0.047   Jarque-Bera (JB):                2.947
Skew:                           0.137   Prob(JB):                        0.229
Kurtosis:                       2.196   Cond. No.                         7.12
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC1)

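For reference, a regression like the one above can also be run with ``statsmodels`` directly. The sketch below is only an approximation of what ``regtools.reg`` does for you (drop incomplete rows, add a constant, use HC1 standard errors); the actual internals may differ.

.. code-block:: default

    import statsmodels.api as sm

    # Keep only complete rows, mirroring the automatic missing-value handling
    clean = df[['Y', 'X1', 'X2', 'X3']].dropna()

    # Add a constant and fit OLS with HC1 (heteroskedasticity-robust) standard errors
    X = sm.add_constant(clean[['X1', 'X2', 'X3']])
    plain_result = sm.OLS(clean['Y'], X).fit(cov_type='HC1')
    plain_result.summary()
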
Run multiple regressions in one go with iteration tools. All functions also support fixed effects and multiway clustering. .. code-block:: default reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary( df, 'Y', [ ['X1'], ['X1', 'X2'], ['X1', 'X3'], ['X1', 'X2', 'X3'] ], fe=[['f1', 'f2']], entity_var='f1', time_var='date', cluster=['f1', 'f2'], robust=False ) summ .. rst-class:: sphx-glr-script-out Out: .. code-block:: none /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/linearmodels/panel/results.py:85: RuntimeWarning: invalid value encountered in sqrt return Series(np.sqrt(np.diag(self.cov)), self._var_names, name="std_error") /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater return (a < x) & (x < b) /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less return (a < x) & (x < b) /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py:1827: RuntimeWarning: invalid value encountered in greater_equal cond2 = (x >= np.asarray(_b)) & cond0 /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:17: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['type'] = ['estimate', 'stderr'] * int(len(df.index) / 2) /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:20: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. 
Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['regressor'] = [i for sublist in [[j] * 2 for j in df.index[0::2]] for i in sublist] /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_with_indexer(indexer, value) /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/tstat.py:31: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy ] = t_values /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/pandas/core/frame.py:3997: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy errors=errors, /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:170: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['type'] = ['estimate', 'stderr'] * int(len(df.index) / 2) /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:173: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['regressor'] = [i for sublist in [[j] * 2 for j in df.index[0::2]] for i in sublist] /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/regtools/summarize/__init__.py:177: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['idx'] = [i for i in range(len(df))] .. only:: builder_html .. raw:: html
                        Y I       Y II      Y III     Y IIII
R-squared               nan        nan        nan        nan
X1                     2.28      3.59*    5.29***    6.83***
                     (0.83)     (1.67)     (3.33)     (8.78)
X2                             10.12***                11.09
                                (4.02)
X3                                       20.26***   20.63***
                                          (12.78)    (42.98)
Intercept          20.88***   14.60***   10.06***    2.94***
                    (15.96)     (5.62)     (6.87)     (4.07)
f1 Fixed Effects        Yes        Yes        Yes        Yes
f2 Fixed Effects        Yes        Yes        Yes        Yes
Cluster by f1           Yes        Yes        Yes        Yes
Cluster by f2           Yes        Yes        Yes        Yes
N                        99         99         98         98


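To make clear what the fixed effects and clustering rows in the summary mean, here is a rough sketch in plain ``statsmodels``: entity dummies for the fixed effects and one-way clustered standard errors. This is illustration only, not how ``regtools`` implements it (it builds on ``linearmodels`` and supports multiway clustering).

.. code-block:: default

    import statsmodels.formula.api as smf

    # Drop incomplete rows first so the cluster groups line up with the estimation sample
    clean = df.dropna(subset=['Y', 'X1', 'X2', 'X3'])

    # C(f1) and C(f2) dummies give the fixed effects; cluster standard errors by f1
    fe_result = smf.ols('Y ~ X1 + X2 + X3 + C(f1) + C(f2)', data=clean).fit(
        cov_type='cluster', cov_kwds={'groups': clean['f1']}
    )
    fe_result.summary()
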
Default is OLS. Other supported types: Probit, Logit, Quantile, Fama-Macbeth. Just pass the string to ``reg_type``. .. code-block:: default reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary( df, 'Y', [ ['X1'], ['X1', 'X2'], ['X1', 'X3'], ['X1', 'X2', 'X3'] ], reg_type='quantile', q=0.9 ) summ reg_list, summ = regtools.reg_for_each_xvar_set_and_produce_summary( df, 'Y', [ ['X1'], ['X1', 'X2'], ['X1', 'X3'], ['X1', 'X2', 'X3'] ], reg_type='fama-macbeth', entity_var='f1', time_var='date' ) summ .. rst-class:: sphx-glr-script-out Out: .. code-block:: none /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/linearmodels/panel/model.py:2631: InferenceUnavailableWarning: The number of time-series observation available to estimate cross-sectional regressions, 3, is less than the number of parameters in the model. Parameter inference is not available. InferenceUnavailableWarning, .. only:: builder_html .. raw:: html
                        Y I       Y II      Y III     Y IIII
R-squared               nan        nan        nan        nan
X1                     2.67       3.87    4.34***    6.16***
                     (0.72)     (1.15)     (2.67)     (4.00)
X2                             10.24***             11.13***
                                (4.63)               (15.45)
X3                                       20.60***   20.93***
                                          (14.72)    (43.63)
Intercept          20.53***   14.26***   10.25***    2.99***
                    (11.98)     (5.19)    (10.81)     (3.83)
N                        99         99         98         98


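The quantile regression case above (``reg_type='quantile', q=0.9``) can also be run directly with ``statsmodels`` if you only need a single model; a minimal sketch for comparison, not the regtools internals:

.. code-block:: default

    import statsmodels.formula.api as smf

    # Quantile regression of Y on the X variables at the 90th percentile
    quant_result = smf.quantreg('Y ~ X1 + X2 + X3', data=df).fit(q=0.9)
    quant_result.summary()
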
``pd-utils`` ------------ Additional utilities to work with Pandas. Find more at `the documentation `__. Some Setup ~~~~~~~~~~ .. code-block:: default df1 = pd.DataFrame( [ ("001076", "3/1/1995"), ("001076", "4/1/1995"), ("001722", "1/1/2012"), ("001722", "7/1/2012"), ("001722", nan), (nan, "1/1/2012"), ], columns=["GVKEY", "Date"], ) df1["Date"] = pd.to_datetime(df1["Date"]) df2 = pd.DataFrame( [ ("001076", "2/1/1995"), ("001076", "3/2/1995"), ("001722", "11/1/2011"), ("001722", "10/1/2011"), ("001722", nan), (nan, "1/1/2012"), ], columns=["GVKEY", "Date"], ) df2["Date"] = pd.to_datetime(df2["Date"]) df3 = pd.DataFrame( data=[ (10516, "a", "1/1/2000", 1.01, 0), (10516, "a", "1/2/2000", 1.02, 1), (10516, "a", "1/3/2000", 1.03, 1), (10516, "a", "1/4/2000", 1.04, 0), (10516, "b", "1/1/2000", 1.05, 1), (10516, "b", "1/2/2000", 1.06, 1), (10516, "b", "1/3/2000", 1.07, 1), (10516, "b", "1/4/2000", 1.08, 1), (10517, "a", "1/1/2000", 1.09, 0), (10517, "a", "1/2/2000", 1.1, 0), (10517, "a", "1/3/2000", 1.11, 0), (10517, "a", "1/4/2000", 1.12, 1), ], columns=["PERMNO", "byvar", "Date", "RET", "weight"], ) df1 df2 df3 .. only:: builder_html .. raw:: html
    PERMNO byvar      Date   RET  weight
0    10516     a  1/1/2000  1.01       0
1    10516     a  1/2/2000  1.02       1
2    10516     a  1/3/2000  1.03       1
3    10516     a  1/4/2000  1.04       0
4    10516     b  1/1/2000  1.05       1
5    10516     b  1/2/2000  1.06       1
6    10516     b  1/3/2000  1.07       1
7    10516     b  1/4/2000  1.08       1
8    10517     a  1/1/2000  1.09       0
9    10517     a  1/2/2000  1.10       0
10   10517     a  1/3/2000  1.11       0
11   10517     a  1/4/2000  1.12       1


``tradedays`` ~~~~~~~~~~~~~ Work directly with US market trading days. .. code-block:: default import pd_utils pd.date_range( start='1/1/2000', end='1/31/2000', freq=pd_utils.tradedays() ) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12', '2000-01-13', '2000-01-14', '2000-01-18', '2000-01-19', '2000-01-20', '2000-01-21', '2000-01-24', '2000-01-25', '2000-01-26', '2000-01-27', '2000-01-28', '2000-01-31'], dtype='datetime64[ns]', freq='C') ``left_merge_latest`` ~~~~~~~~~~~~~~~~~~~~~ Merge the latest data available in the right ``DataFrame`` to the left ``DataFrame``. .. code-block:: default pd_utils.left_merge_latest( df1, df2, on='GVKEY', max_offset=pd.Timedelta(days=30), # max_offset=pd_utils.tradedays() * 20 ) .. only:: builder_html .. raw:: html
    GVKEY       Date     Date_y
0  001076 1995-03-01 1995-02-01
1  001076 1995-04-01 1995-03-02
2  001722 2012-01-01        NaT
3  001722 2012-07-01        NaT
4  001722        NaT        NaT
5     NaN 2012-01-01        NaT


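A similar "most recent value within a window" merge can be written in plain pandas with ``merge_asof``. This sketch is only for comparison and is not how ``left_merge_latest`` is implemented; note that it has to drop the rows with missing keys first, since ``merge_asof`` will not accept them.

.. code-block:: default

    # merge_asof requires sorted, non-missing merge keys
    left = df1.dropna(subset=['GVKEY', 'Date']).sort_values('Date')
    right = df2.dropna(subset=['GVKEY', 'Date']).sort_values('Date')

    pd.merge_asof(
        left,
        right.rename(columns={'Date': 'Date_y'}),
        left_on='Date',
        right_on='Date_y',
        by='GVKEY',
        tolerance=pd.Timedelta(days=30),  # look back at most 30 days
        direction='backward',
    )
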
``averages`` ~~~~~~~~~~~~ Equal- and value-weighted averages, optionally by groups. .. code-block:: default pd_utils.averages( df3, 'RET', ['PERMNO', 'byvar'], wtvar='weight', ) .. only:: builder_html .. raw:: html
   PERMNO byvar    RET  RET_wavg
0   10516     a  1.025     1.025
1   10516     b  1.065     1.065
2   10517     a  1.105     1.120


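The same equal- and value-weighted averages can be computed in plain pandas, which also spells out exactly what the weighting means; a comparison sketch only.

.. code-block:: default

    # Equal-weighted mean by group
    ew = df3.groupby(['PERMNO', 'byvar'])['RET'].mean()

    # Value-weighted mean by group, using the weight column
    vw = df3.groupby(['PERMNO', 'byvar']).apply(
        lambda g: np.average(g['RET'], weights=g['weight'])
    )

    pd.DataFrame({'RET': ew, 'RET_wavg': vw}).reset_index()
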
``portfolio`` ~~~~~~~~~~~~~ Form portfolios from some numeric column. .. code-block:: default pd_utils.portfolio( df3, 'RET', ngroups=3, # cutoffs=[1.02, 1.07], # quant_cutoffs=[0.2], byvars='Date', ) .. only:: builder_html .. raw:: html
    PERMNO byvar      Date   RET  weight  portfolio
0    10516     a  1/1/2000  1.01       0          1
1    10516     a  1/2/2000  1.02       1          1
2    10516     a  1/3/2000  1.03       1          1
3    10516     a  1/4/2000  1.04       0          1
4    10516     b  1/1/2000  1.05       1          2
5    10516     b  1/2/2000  1.06       1          2
6    10516     b  1/3/2000  1.07       1          2
7    10516     b  1/4/2000  1.08       1          2
8    10517     a  1/1/2000  1.09       0          3
9    10517     a  1/2/2000  1.10       0          3
10   10517     a  1/3/2000  1.11       0          3
11   10517     a  1/4/2000  1.12       1          3


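The core of portfolio formation is splitting a numeric column into quantile bins within each group, which in plain pandas is roughly a grouped ``pd.qcut``. A sketch of that idea only; ``pd_utils.portfolio`` adds explicit cutoffs, quantile cutoffs, and edge-case handling.

.. code-block:: default

    out = df3.copy()

    # Within each Date, assign RET to one of 3 quantile bins, labeled 1 through 3
    out['portfolio'] = (
        out.groupby('Date')['RET'].transform(lambda s: pd.qcut(s, 3, labels=False)) + 1
    )
    out
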
``long_to_wide`` ~~~~~~~~~~~~~~~~ Pandas has a built-in ``wide_to_long`` function but not ``long_to_wide``. There is ``.pivot`` but it can't handle multiple by variables. .. code-block:: default pd_utils.long_to_wide( df3, ["PERMNO", "byvar"], "RET", colindex="Date", colindex_only=True ) .. only:: builder_html .. raw:: html
   PERMNO byvar  weight  1/1/2000  1/2/2000  1/3/2000  1/4/2000
0   10516     a       0      1.01      1.02      1.03      1.04
1   10516     a       1      1.01      1.02      1.03      1.04
2   10516     b       1      1.05      1.06      1.07      1.08
3   10517     a       0      1.09      1.10      1.11      1.12
4   10517     a       1      1.09      1.10      1.11      1.12


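In plain pandas the closest equivalent is ``pivot_table`` with a list of index columns. A rough comparison sketch; it differs from ``long_to_wide`` in how duplicates and the extra ``weight`` column are handled.

.. code-block:: default

    wide = (
        df3.pivot_table(index=['PERMNO', 'byvar', 'weight'], columns='Date', values='RET')
        .reset_index()
    )
    wide.columns.name = None  # drop the leftover 'Date' axis name
    wide
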
``winsorize`` ~~~~~~~~~~~~~ Winsorize data, optionally by groups, optionally for a subset of columns, and optionally only at the top or bottom. .. code-block:: default pd_utils.winsorize( df3, 0.4, subset="RET", byvars=["PERMNO", "byvar"], ) .. only:: builder_html .. raw:: html
    PERMNO byvar      Date       RET  weight
0    10516     a  1/1/2000  1.022624       0
1    10516     a  1/2/2000  1.022624       1
2    10516     a  1/3/2000  1.026720       1
3    10516     a  1/4/2000  1.026720       0
4    10516     b  1/1/2000  1.062624       1
5    10516     b  1/2/2000  1.062624       1
6    10516     b  1/3/2000  1.066720       1
7    10516     b  1/4/2000  1.066720       1
8    10517     a  1/1/2000  1.102624       0
9    10517     a  1/2/2000  1.102624       0
10   10517     a  1/3/2000  1.106720       0
11   10517     a  1/4/2000  1.106720       1


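Conceptually, winsorizing at 0.4 caps each group's values at its 40th and 60th percentiles. A plain pandas sketch of that idea follows; ``pd_utils.winsorize`` computes the caps somewhat differently, so the numbers above will not match this exactly.

.. code-block:: default

    def winsorize_series(s, limit):
        # Cap values below the lower quantile and above the upper quantile
        lower, upper = s.quantile(limit), s.quantile(1 - limit)
        return s.clip(lower=lower, upper=upper)

    out = df3.copy()
    out['RET'] = out.groupby(['PERMNO', 'byvar'])['RET'].transform(winsorize_series, limit=0.4)
    out
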
``formatted_corr_df`` ~~~~~~~~~~~~~~~~~~~~~ Nicely formatted correlations. .. code-block:: default pd_utils.formatted_corr_df(df3) .. only:: builder_html .. raw:: html
        PERMNO    RET  weight
PERMNO    1.00
RET       0.82   1.00
weight   -0.48  -0.12    1.00


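A similar display can be produced in plain pandas by rounding the correlation matrix and blanking the upper triangle; a comparison sketch only.

.. code-block:: default

    corr = df3[['PERMNO', 'RET', 'weight']].corr().round(2)

    # Keep only the lower triangle, like the formatted output above
    lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool)))
    lower.fillna('')
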
``datacode`` ------------ Data pipelines for humans. NOTE: Under active development. API is experimental and subject to change. Find more at `the documentation `__. Features: - Deal with the concept of variables rather than columns in a DataFrame - Apply transformations to variables to update both the values and name of the variable, but still be able to say it's the same variable - E.g. take lag of variable "A", now it is shown as A\ :math:`_{t - 1}` but you can still work with it as the same variable without parsing the name - Access variables by ``short_keys`` and tab-completion but have them displayed with the label in the ``DataFrame``. - Associate symbols and descriptions with variables. Generates symbols by default and you can override. - Attach data pipelines to generated data sources. It checks when the original sources were last modified, and if they were more recently modified than the pipeline was run, will run the pipeline again automatically. - Easier merges with data merge pipelines and smart merge options - Everything is extendible so you can add your own custom logic - Describe your data sources in detail to enable some features: - Built-in transformations are index-aware. E.g. you have described that rows are indexed by firm and time. When you take a lag of the variable, it will automatically realize it should take the lag across the time dimension and within the firm - (Planned feature): Tell it what variables you want, and it will figure out the merges to make it happen .. code-block:: default # TODO [#1]: add examples for datacode ``bibtex_gen`` -------------- Citation management using Mendeley API and BibTeX. Find more at `the documentation `__. ``objcache`` ------------ Easily store Python objects for later (cache results). Find more at `the documentation `__. I use this in my workflow so that I can run analysis and store tables and figures with little effort, then later when I generate the paper I retrieve the tables and figures from the cache. That way I can update everything by running the analysis then generating the paper, or I can update just the text in the paper and generate it quickly using the pre-existing tables and figures. .. code-block:: default from objcache import ObjectCache cache = ObjectCache('cache.zodb', ('a', 'b')) cache.store(5) # Later session cache = ObjectCache('cache.zodb', ('a', 'b')) result = cache.get() print(result) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none 5 Some Cleanup ~~~~~~~~~~~~ Not important, just to clean up files generated from the example. .. code-block:: default import os temp_files = [ 'cache.zodb', 'cache.zodb.index', 'cache.zodb.lock', 'cache.zodb.tmp' ] for file in temp_files: os.remove(file) ``pyfileconf`` -------------- Function and class configuration as Python files, helpful for managing multiple complex configuations. NOTE: Under active development. API is experimental and subject to change. Find more at `the documentation `__. Features: - Easy way to have multiple configurations for a single function or class - Generates Python file templates for configuration, complete with all the arguments, type annotations, and default values of function or class - Run/get configured functions/classes from Python or the command line - Update configurations at run-time in a Python script - Easy to do config-based scripting. E.g.: Run the whole analysis with 3, 4, and 5 portfolios. - Works very well with ``datacode`` where you need to have many variables, sources, etc. configured .. 
code-block:: default # TODO [#2]: add examples for pyfileconf .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 3.167 seconds) .. _sphx_glr_download_auto_examples_additional_research_tools.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: binder-badge .. image:: https://mybinder.org/badge_logo.svg :target: https://mybinder.org/v2/gh/nickderobertis/py-research-workflows/gh-pages?urlpath=lab/tree/notebooks/auto_examples/additional_research_tools.ipynb :width: 150 px .. container:: sphx-glr-download :download:`Download Python source code: additional_research_tools.py ` .. container:: sphx-glr-download :download:`Download Jupyter notebook: additional_research_tools.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_