.. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_data_management_in_python.py: Data Management in Python Using Standard Tooling ================================================ Prior to beginning the tutorial, please install the following packages via ``pip``: - pandas - matplotlib - statsmodels - seaborn - openpyxl - xlrd **Note**: if you're using Anaconda, some of these will be installed already. **Note**: if you're running the example in Binder, no installation is necessary Overview -------- ``pandas`` is a large library which includes a data structure called a ``DataFrame`` which is originally based on R's ``data.frame``. In Python, it is built on top of the array datatype in the similarly popular ``numpy`` library. Each column in a ``DataFrame`` is a ``pandas Series``. I use ``numpy`` for some specific functions or when I need higher performance than ``pandas``, but ``pandas`` is much more convenient to use. Here is what we're going to cover: - `**Selecting Data** <#Selecting-Data>`__ - `**Aggregating** <#Aggregating>`__ - `**Merging** <#Merging>`__ - `**Time series** <#Time-series>`__ - `**Plotting** <#Plotting>`__ - `**Regressions** <#Regressions>`__ - `**Input and Output** <#Input-and-Output>`__ Give me a ``DataFrame``! ------------------------ A ``DataFrame`` can be created in many ways, including: - From a ``csv`` or Excel file - From a ``SAS7BDAT`` (SAS data) or ``dta`` (Stata data) file - From other Python data structures (list of tuples, dictionaries, ``numpy`` arrays) Here I will create an example ``DataFrame`` from a list of tuples. At the end I will show loading and writing to files. .. code-block:: default import pandas as pd #this is the convention for importing pandas, then you can use pd. for functions df = pd.DataFrame( data=[ ('Walmart', 'FL', '1/2/2000', .02), ('Walmart', 'FL', '1/3/2000', .03), ('Walmart', 'FL', '1/4/2000', .04), ('Trader Joes', 'GA', '1/2/2000', .06), ('Trader Joes', 'GA', '1/3/2000', .07), ('Trader Joes', 'GA', '1/4/2000', .08), ('Publix', 'FL', '1/2/2000', .1), ('Publix', 'FL', '1/3/2000', .11), ('Publix', 'FL', '1/4/2000', .12), ], columns = ['Company', 'State', 'Date', 'Return'] ) ``pandas`` combined with Jupyter gives you a nice representation of your data by simply typing the name of the variable storing your ``DataFrame``: .. code-block:: default df .. only:: builder_html .. raw:: html

	Company	State	Date	Return
0	Walmart	FL	1/2/2000	0.02
1	Walmart	FL	1/3/2000	0.03
2	Walmart	FL	1/4/2000	0.04
3	Trader Joes	GA	1/2/2000	0.06
4	Trader Joes	GA	1/3/2000	0.07
5	Trader Joes	GA	1/4/2000	0.08
6	Publix	FL	1/2/2000	0.10
7	Publix	FL	1/3/2000	0.11
8	Publix	FL	1/4/2000	0.12

Working with Data in Pandas --------------------------- One of the ``DataFrame``\ s greatest strengths is how flexibly they can be split, combined, and aggregated. Selecting Data ~~~~~~~~~~~~~~ .. code-block:: default df[df['State'] == 'FL'] # read: dataframe where the dataframe column 'state' is 'FL' df.iloc[1:3] # give me the second through the third rows df.iloc[:, 2] # all rows for the third column (looks different because it's a Series) df['Company'] # company column (Series) best_grocery_stores = ['Trader Joes', 'Publix'] # only rows where company is in the best grocery stores and has a high return, # but also give me only the company and return columns df.loc[df['Company'].isin(best_grocery_stores) & (df['Return'] > 0.07), ['Company', 'Return']] .. only:: builder_html .. raw:: html

	Company	Return
5	Trader Joes	0.08
6	Publix	0.10
7	Publix	0.11
8	Publix	0.12

Aggregating ~~~~~~~~~~~ ``DataFrame``\ s have a ``.groupby`` which works similarly to group by in a SQL (proc SQL) command. .. code-block:: default df.groupby('Company') .. rst-class:: sphx-glr-script-out Out: .. code-block:: none To make it useful, we must aggregate the data somehow: .. code-block:: default df.groupby(['State','Date']).mean() #also .median, .std, .count .. only:: builder_html .. raw:: html

		Return
State	Date
FL	1/2/2000	0.06
	1/3/2000	0.07
	1/4/2000	0.08
GA	1/2/2000	0.06
	1/3/2000	0.07
	1/4/2000	0.08

Note that there the index becomes the groupby columns. If we want keep the columns in the ``DataFrame``, pass ``as_index=False``. .. code-block:: default df.groupby(['State','Date'], as_index=False).mean() #also .median, .std, .count .. only:: builder_html .. raw:: html

	State	Date	Return
0	FL	1/2/2000	0.06
1	FL	1/3/2000	0.07
2	FL	1/4/2000	0.08
3	GA	1/2/2000	0.06
4	GA	1/3/2000	0.07
5	GA	1/4/2000	0.08

Note that the shape of the data when using plain groupby is whatever the shape of the unique values of the groupby columns. If instead we want to add a column to our ``DataFrame`` representing the aggregated values, use ``.transform`` on top of ``groupby``. This example also shows how to assign a new column to a ``DataFrame``. .. code-block:: default df['State Return Average'] = df.groupby(['State','Date']).transform('mean') df .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State Return Average
0	Walmart	FL	1/2/2000	0.02	0.06
1	Walmart	FL	1/3/2000	0.03	0.07
2	Walmart	FL	1/4/2000	0.04	0.08
3	Trader Joes	GA	1/2/2000	0.06	0.06
4	Trader Joes	GA	1/3/2000	0.07	0.07
5	Trader Joes	GA	1/4/2000	0.08	0.08
6	Publix	FL	1/2/2000	0.10	0.06
7	Publix	FL	1/3/2000	0.11	0.07
8	Publix	FL	1/4/2000	0.12	0.08

Columns can be combined with basic math operations .. code-block:: default df['Ratio'] = df['Return'] / df['State Return Average'] df .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State Return Average	Ratio
0	Walmart	FL	1/2/2000	0.02	0.06	0.333333
1	Walmart	FL	1/3/2000	0.03	0.07	0.428571
2	Walmart	FL	1/4/2000	0.04	0.08	0.500000
3	Trader Joes	GA	1/2/2000	0.06	0.06	1.000000
4	Trader Joes	GA	1/3/2000	0.07	0.07	1.000000
5	Trader Joes	GA	1/4/2000	0.08	0.08	1.000000
6	Publix	FL	1/2/2000	0.10	0.06	1.666667
7	Publix	FL	1/3/2000	0.11	0.07	1.571429
8	Publix	FL	1/4/2000	0.12	0.08	1.500000

Functions can be applied to columns or the entire ``DataFrame``: .. code-block:: default import numpy as np # convention for importing numpy. def sort_ratios(value): # If the value is missing or is not a number, return as is # Without this, the function will error out as soon as it hits either of those if pd.isnull(value) or not isinstance(value, np.float): return value # Otherwise, sort into categories based on the value if value == 1: return 'Even' if value < 1: return 'Low' if value >= 1: return 'High' df['Ratio Size'] = df['Ratio'].apply(sort_ratios) # apply function to ratio column, save result as ratio size column df df.applymap(sort_ratios) # apply function to all values in df, but only display and don't save back to df .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State Return Average	Ratio	Ratio Size
0	Walmart	FL	1/2/2000	Low	Low	Low	Low
1	Walmart	FL	1/3/2000	Low	Low	Low	Low
2	Walmart	FL	1/4/2000	Low	Low	Low	Low
3	Trader Joes	GA	1/2/2000	Low	Low	Even	Even
4	Trader Joes	GA	1/3/2000	Low	Low	Even	Even
5	Trader Joes	GA	1/4/2000	Low	Low	Even	Even
6	Publix	FL	1/2/2000	Low	Low	High	High
7	Publix	FL	1/3/2000	Low	Low	High	High
8	Publix	FL	1/4/2000	Low	Low	High	High

Merging ~~~~~~~ See here for more details. Let's create a ``DataFrame`` containing information on employment rates in the various states and merge it to this dataset. .. code-block:: default employment_df = pd.DataFrame( data=[ ('FL', 0.06), ('GA', 0.08), ('PA', 0.07) ], columns=['State', 'Unemployment'] ) employment_df df = df.merge(employment_df, how='left', on='State') df .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State Return Average	Ratio	Ratio Size	Unemployment
0	Walmart	FL	1/2/2000	0.02	0.06	0.333333	Low	0.06
1	Walmart	FL	1/3/2000	0.03	0.07	0.428571	Low	0.06
2	Walmart	FL	1/4/2000	0.04	0.08	0.500000	Low	0.06
3	Trader Joes	GA	1/2/2000	0.06	0.06	1.000000	Even	0.08
4	Trader Joes	GA	1/3/2000	0.07	0.07	1.000000	Even	0.08
5	Trader Joes	GA	1/4/2000	0.08	0.08	1.000000	Even	0.08
6	Publix	FL	1/2/2000	0.10	0.06	1.666667	High	0.06
7	Publix	FL	1/3/2000	0.11	0.07	1.571429	High	0.06
8	Publix	FL	1/4/2000	0.12	0.08	1.500000	High	0.06

Appending is similarly simple. Here I will append a slightly modified ``DataFrame`` to itself: .. code-block:: default copy_df = df.copy() copy_df['Extra Column'] = 5 copy_df.drop('Ratio Size', axis=1, inplace=True) # inplace=True means it gets dropped in the existing DataFrame df.append(copy_df) .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State Return Average	Ratio	Ratio Size	Unemployment	Extra Column
0	Walmart	FL	1/2/2000	0.02	0.06	0.333333	Low	0.06	NaN
1	Walmart	FL	1/3/2000	0.03	0.07	0.428571	Low	0.06	NaN
2	Walmart	FL	1/4/2000	0.04	0.08	0.500000	Low	0.06	NaN
3	Trader Joes	GA	1/2/2000	0.06	0.06	1.000000	Even	0.08	NaN
4	Trader Joes	GA	1/3/2000	0.07	0.07	1.000000	Even	0.08	NaN
5	Trader Joes	GA	1/4/2000	0.08	0.08	1.000000	Even	0.08	NaN
6	Publix	FL	1/2/2000	0.10	0.06	1.666667	High	0.06	NaN
7	Publix	FL	1/3/2000	0.11	0.07	1.571429	High	0.06	NaN
8	Publix	FL	1/4/2000	0.12	0.08	1.500000	High	0.06	NaN
0	Walmart	FL	1/2/2000	0.02	0.06	0.333333	NaN	0.06	5.0
1	Walmart	FL	1/3/2000	0.03	0.07	0.428571	NaN	0.06	5.0
2	Walmart	FL	1/4/2000	0.04	0.08	0.500000	NaN	0.06	5.0
3	Trader Joes	GA	1/2/2000	0.06	0.06	1.000000	NaN	0.08	5.0
4	Trader Joes	GA	1/3/2000	0.07	0.07	1.000000	NaN	0.08	5.0
5	Trader Joes	GA	1/4/2000	0.08	0.08	1.000000	NaN	0.08	5.0
6	Publix	FL	1/2/2000	0.10	0.06	1.666667	NaN	0.06	5.0
7	Publix	FL	1/3/2000	0.11	0.07	1.571429	NaN	0.06	5.0
8	Publix	FL	1/4/2000	0.12	0.08	1.500000	NaN	0.06	5.0

We can append to the side as well! (concatenate) .. code-block:: default temp_df = pd.concat([df, copy_df], axis=1) temp_df .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State Return Average	Ratio	Ratio Size	Unemployment	Company	State	Date	Return	State Return Average	Ratio	Unemployment	Extra Column
0	Walmart	FL	1/2/2000	0.02	0.06	0.333333	Low	0.06	Walmart	FL	1/2/2000	0.02	0.06	0.333333	0.06	5
1	Walmart	FL	1/3/2000	0.03	0.07	0.428571	Low	0.06	Walmart	FL	1/3/2000	0.03	0.07	0.428571	0.06	5
2	Walmart	FL	1/4/2000	0.04	0.08	0.500000	Low	0.06	Walmart	FL	1/4/2000	0.04	0.08	0.500000	0.06	5
3	Trader Joes	GA	1/2/2000	0.06	0.06	1.000000	Even	0.08	Trader Joes	GA	1/2/2000	0.06	0.06	1.000000	0.08	5
4	Trader Joes	GA	1/3/2000	0.07	0.07	1.000000	Even	0.08	Trader Joes	GA	1/3/2000	0.07	0.07	1.000000	0.08	5
5	Trader Joes	GA	1/4/2000	0.08	0.08	1.000000	Even	0.08	Trader Joes	GA	1/4/2000	0.08	0.08	1.000000	0.08	5
6	Publix	FL	1/2/2000	0.10	0.06	1.666667	High	0.06	Publix	FL	1/2/2000	0.10	0.06	1.666667	0.06	5
7	Publix	FL	1/3/2000	0.11	0.07	1.571429	High	0.06	Publix	FL	1/3/2000	0.11	0.07	1.571429	0.06	5
8	Publix	FL	1/4/2000	0.12	0.08	1.500000	High	0.06	Publix	FL	1/4/2000	0.12	0.08	1.500000	0.06	5

Be careful, ``pandas`` allows you to have multiple columns with the same name (generally a bad idea): .. code-block:: default temp_df['Unemployment'] .. only:: builder_html .. raw:: html

	Unemployment	Unemployment
0	0.06	0.06
1	0.06	0.06
2	0.06	0.06
3	0.08	0.08
4	0.08	0.08
5	0.08	0.08
6	0.06	0.06
7	0.06	0.06
8	0.06	0.06

Time series ~~~~~~~~~~~ See here for more details. Lagging ^^^^^^^ Lags are easy with ``pandas``. The number in shift below represents the number of rows to lag. .. code-block:: default df.sort_values(['Company', 'Date'], inplace=True) df['Lag Return'] = df['Return'].shift(1) df .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State Return Average	Ratio	Ratio Size	Unemployment	Lag Return
6	Publix	FL	1/2/2000	0.10	0.06	1.666667	High	0.06	NaN
7	Publix	FL	1/3/2000	0.11	0.07	1.571429	High	0.06	0.10
8	Publix	FL	1/4/2000	0.12	0.08	1.500000	High	0.06	0.11
3	Trader Joes	GA	1/2/2000	0.06	0.06	1.000000	Even	0.08	0.12
4	Trader Joes	GA	1/3/2000	0.07	0.07	1.000000	Even	0.08	0.06
5	Trader Joes	GA	1/4/2000	0.08	0.08	1.000000	Even	0.08	0.07
0	Walmart	FL	1/2/2000	0.02	0.06	0.333333	Low	0.06	0.08
1	Walmart	FL	1/3/2000	0.03	0.07	0.428571	Low	0.06	0.02
2	Walmart	FL	1/4/2000	0.04	0.08	0.500000	Low	0.06	0.03

But really we want the lagged value to come from the same firm: .. code-block:: default df['Lag Return'] = df.groupby('Company')['Return'].shift(1) df .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State Return Average	Ratio	Ratio Size	Unemployment	Lag Return
6	Publix	FL	1/2/2000	0.10	0.06	1.666667	High	0.06	NaN
7	Publix	FL	1/3/2000	0.11	0.07	1.571429	High	0.06	0.10
8	Publix	FL	1/4/2000	0.12	0.08	1.500000	High	0.06	0.11
3	Trader Joes	GA	1/2/2000	0.06	0.06	1.000000	Even	0.08	NaN
4	Trader Joes	GA	1/3/2000	0.07	0.07	1.000000	Even	0.08	0.06
5	Trader Joes	GA	1/4/2000	0.08	0.08	1.000000	Even	0.08	0.07
0	Walmart	FL	1/2/2000	0.02	0.06	0.333333	Low	0.06	NaN
1	Walmart	FL	1/3/2000	0.03	0.07	0.428571	Low	0.06	0.02
2	Walmart	FL	1/4/2000	0.04	0.08	0.500000	Low	0.06	0.03

Things get slightly more complicated if you want to take into account missing dates within a firm. Then you must fill the ``DataFrame`` with missing data for those excluded dates, then run the above function, then drop those missing rows. A bit too much for this tutorial, but I have code available for this upon request. Resampling ^^^^^^^^^^ ``pandas`` has a lot of convenient methods for changing the frequency of the data. Here I will create a df containing intraday returns for the three companies .. code-block:: default import datetime from itertools import product firms = df['Company'].unique().tolist() # list of companies in df dates = df['Date'].unique().tolist() # list of dates in df num_periods_per_day = 13 #30 minute intervals combos = product(firms, dates, [i+1 for i in range(num_periods_per_day)]) # all combinations of company, date, and period number data_tuples = [ ( combo[0], # company datetime.datetime.strptime(combo[1], '%m/%d/%Y') + datetime.timedelta(hours=9.5, minutes=30 * combo[2]), #datetime np.random.rand() * 100 # price ) for combo in combos ] intraday_df = pd.DataFrame(data_tuples, columns=['Company', 'Datetime', 'Price']) intraday_df.head() # now the df is quite long, so we can use df.head() and df.tail() to see beginning and end of df .. only:: builder_html .. raw:: html

	Company	Datetime	Price
0	Publix	2000-01-02 10:00:00	25.250127
1	Publix	2000-01-02 10:30:00	78.297648
2	Publix	2000-01-02 11:00:00	18.903450
3	Publix	2000-01-02 11:30:00	42.139811
4	Publix	2000-01-02 12:00:00	53.523835

First must set the date variable as the index to do resampling .. code-block:: default intraday_df.set_index('Datetime', inplace=True) Now we can resample to aggregate: .. code-block:: default intraday_df.groupby('Company').resample('1D').mean() .. only:: builder_html .. raw:: html

		Price
Company	Datetime
Publix	2000-01-02	51.581870
	2000-01-03	49.985571
	2000-01-04	49.441363
Trader Joes	2000-01-02	44.736422
	2000-01-03	59.731791
	2000-01-04	38.360582
Walmart	2000-01-02	65.725321
	2000-01-03	50.844221
	2000-01-04	42.806827

Or we can increase the frequency of the data, using ``bfill`` to backward fill or ``ffill`` to forward fill. Here I specify to backward fill but only go back one period at most. .. code-block:: default intraday_df.groupby('Company').resample('10min').bfill(limit=1).head(10) .. only:: builder_html .. raw:: html

		Company	Price
Company	Datetime
Publix	2000-01-02 10:00:00	Publix	25.250127
	2000-01-02 10:10:00	NaN	NaN
	2000-01-02 10:20:00	Publix	78.297648
	2000-01-02 10:30:00	Publix	78.297648
	2000-01-02 10:40:00	NaN	NaN
	2000-01-02 10:50:00	Publix	18.903450
	2000-01-02 11:00:00	Publix	18.903450
	2000-01-02 11:10:00	NaN	NaN
	2000-01-02 11:20:00	Publix	42.139811
	2000-01-02 11:30:00	Publix	42.139811

Plotting ~~~~~~~~ Oh yeah, we've got graphs too. ``pandas``' plotting functionality is built on top of the popular ``matplotlib`` library, which is a graphing library based on ``MATLAB``'s graphing functionality. .. code-block:: default # we've got to run this magic once per session if we want graphics to show up in the notebook # %matplotlib inline ``pandas`` tries to guess what you want to plot. By default it will put each numeric column as a y variable and the index as the x variable. .. code-block:: default df.plot() .. image:: /auto_examples/images/sphx_glr_data_management_in_python_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out Out: .. code-block:: none But we can tell it specifically what we want to do. Maybe we want one plot for each company showing only how the company return moves relative to the state average return over time. .. code-block:: default df df['Date'] = pd.to_datetime(df['Date']) # convert date from string type to datetime type df.groupby('Company').plot(y=['Return', 'State Return Average'], x='Date') .. rst-class:: sphx-glr-horizontal * .. image:: /auto_examples/images/sphx_glr_data_management_in_python_002.png :class: sphx-glr-multi-img * .. image:: /auto_examples/images/sphx_glr_data_management_in_python_003.png :class: sphx-glr-multi-img * .. image:: /auto_examples/images/sphx_glr_data_management_in_python_004.png :class: sphx-glr-multi-img .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Company Publix AxesSubplot(0.125,0.11;0.775x0.77) Trader Joes AxesSubplot(0.125,0.11;0.775x0.77) Walmart AxesSubplot(0.125,0.11;0.775x0.77) dtype: object ``pandas`` exposes most of the plots in ``matplotlib``. Supported types include: - ‘line’ : line plot (default) - ‘bar’ : vertical bar plot - ‘barh’ : horizontal bar plot - ‘hist’ : histogram - ‘box’ : boxplot - ‘kde’ : Kernel Density Estimation plot - ‘density’ : same as ‘kde’ - ‘area’ : area plot - ‘pie’ : pie plot - ‘scatter’ : scatter plot - ‘hexbin’ : hexbin plot .. code-block:: default df.drop('Ratio', axis=1).plot(kind='box', figsize=(15,8)) df.plot(y=['Return', 'Unemployment'], kind='kde') market_share_df = pd.DataFrame( data=[ ('Trader Joes', .2), ('Walmart', .5), ('Publix', .3) ], columns=['Company', 'Market Share'] ).set_index('Company') market_share_df.plot(y='Market Share', kind='pie', figsize=(6,6)) .. rst-class:: sphx-glr-horizontal * .. image:: /auto_examples/images/sphx_glr_data_management_in_python_005.png :class: sphx-glr-multi-img * .. image:: /auto_examples/images/sphx_glr_data_management_in_python_006.png :class: sphx-glr-multi-img * .. image:: /auto_examples/images/sphx_glr_data_management_in_python_007.png :class: sphx-glr-multi-img .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Check out the ``seaborn`` package for some cool high level plotting capabilities. .. code-block:: default import seaborn as sns # convention for importing seaborn sns.pairplot(df[['Company', 'Return', 'State Return Average']], hue='Company') .. image:: /auto_examples/images/sphx_glr_data_management_in_python_008.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out Out: .. code-block:: none Regressions ----------- Alright, we've got some cleaned up data. Now we can run regressions on them with the ``statsmodels`` module. Here I will show the "formula" approach to ``statsmodels``, which is just one of the two main interfaces. The ``formula`` approach will feel similar to specifying a regression in ``R`` or ``Stata``. However we can also directly pass ``DataFrames`` containing the y and x variables rather than specifying a formula. .. code-block:: default df import statsmodels.formula.api as smf # convention for importing statsmodels model = smf.ols(formula="Return ~ Unemployment", data=df) result = model.fit() result.summary() .. rst-class:: sphx-glr-script-out Out: .. code-block:: none /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/scipy/stats/stats.py:1535: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9 "anyway, n=%i" % int(n)) .. only:: builder_html .. raw:: html

OLS Regression Results
Dep. Variable:	Return	R-squared:	0.000
Model:	OLS	Adj. R-squared:	-0.143
Method:	Least Squares	F-statistic:	0.000
Date:	Wed, 19 Feb 2020	Prob (F-statistic):	1.00
Time:	15:02:22	Log-Likelihood:	17.751
No. Observations:	9	AIC:	-31.50
Df Residuals:	7	BIC:	-31.11
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	0.0700	0.091	0.770	0.466	-0.145	0.285
Unemployment	5.551e-16	1.350	4.11e-16	1.000	-3.191	3.191

Omnibus:	1.292	Durbin-Watson:	0.765
Prob(Omnibus):	0.524	Jarque-Bera (JB):	0.667
Skew:	0.000	Prob(JB):	0.716
Kurtosis:	1.666	Cond. No.	107.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Looks like a regression summary should. We can also use fixed effects and interaction terms. Here showing fixed effects: .. code-block:: default df.rename(columns={'Ratio Size': 'ratio_size'}, inplace=True) # statsmodels formula doesn't like spaces in column names model2 = smf.ols(formula="Return ~ Unemployment + C(ratio_size)", data=df) result2 = model2.fit() result2.summary() .. only:: builder_html .. raw:: html

OLS Regression Results
Dep. Variable:	Return	R-squared:	0.941
Model:	OLS	Adj. R-squared:	0.922
Method:	Least Squares	F-statistic:	48.00
Date:	Wed, 19 Feb 2020	Prob (F-statistic):	0.000204
Time:	15:02:22	Log-Likelihood:	30.501
No. Observations:	9	AIC:	-55.00
Df Residuals:	6	BIC:	-54.41
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	0.0696	0.006	12.163	0.000	0.056	0.084
C(ratio_size)[T.High]	0.0401	0.008	4.919	0.003	0.020	0.060
C(ratio_size)[T.Low]	-0.0399	0.008	-4.892	0.003	-0.060	-0.020
Unemployment	0.0056	0.001	7.867	0.000	0.004	0.007

Omnibus:	2.380	Durbin-Watson:	2.333
Prob(Omnibus):	0.304	Jarque-Bera (JB):	0.844
Skew:	-0.000	Prob(JB):	0.656
Kurtosis:	1.500	Cond. No.	3.54e+17

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.96e-35. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Now interaction terms .. code-block:: default # use * for interaction keeping individual variables, : for only interaction model3 = smf.ols(formula="Return ~ Unemployment*Ratio", data=df) result3 = model3.fit() result3.summary() .. only:: builder_html .. raw:: html

OLS Regression Results
Dep. Variable:	Return	R-squared:	0.928
Model:	OLS	Adj. R-squared:	0.904
Method:	Least Squares	F-statistic:	38.83
Date:	Wed, 19 Feb 2020	Prob (F-statistic):	0.000369
Time:	15:02:22	Log-Likelihood:	29.609
No. Observations:	9	AIC:	-53.22
Df Residuals:	6	BIC:	-52.63
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	0.0020	0.017	0.121	0.907	-0.039	0.043
Unemployment	-0.0020	0.195	-0.010	0.992	-0.479	0.475
Ratio	0.0680	0.014	4.849	0.003	0.034	0.102
Unemployment:Ratio	0.0020	0.195	0.010	0.992	-0.476	0.480

Omnibus:	0.037	Durbin-Watson:	2.212
Prob(Omnibus):	0.982	Jarque-Bera (JB):	0.148
Skew:	0.057	Prob(JB):	0.929
Kurtosis:	2.383	Cond. No.	1.77e+17

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 6.15e-34. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Want to throw together these models into one summary table? .. code-block:: default from statsmodels.iolib.summary2 import summary_col reg_list = [result, result2, result3] summ = summary_col( reg_list, stars=True, info_dict = { 'N': lambda x: "{0:d}".format(int(x.nobs)), 'Adj-R2': lambda x: "{:.2f}".format(x.rsquared_adj) } ) summ .. only:: builder_html .. raw:: html

	Return I	Return II	Return III
C(ratio_size)[T.High]		0.0401***
		(0.0082)
C(ratio_size)[T.Low]		-0.0399***
		(0.0082)
Intercept	0.0700	0.0696***	0.0020
	(0.0909)	(0.0057)	(0.0167)
R-squared	-0.1429	0.9216	0.9044
	0.0000	0.9412	0.9283
Ratio			0.0680***
			(0.0140)
Unemployment	0.0000	0.0056***	-0.0020
	(1.3496)	(0.0007)	(0.1951)
Unemployment:Ratio			0.0020
			(0.1953)
N	9	9	9
Adj-R2	-0.14	0.92	0.90

On the backend, ``statsmodels``' ``summary_col`` uses a ``DataFrame`` which we can access as: .. code-block:: default summ.tables[0] .. only:: builder_html .. raw:: html

	Return I	Return II	Return III
C(ratio_size)[T.High]		0.0401***
		(0.0082)
C(ratio_size)[T.Low]		-0.0399***
		(0.0082)
Intercept	0.0700	0.0696***	0.0020
	(0.0909)	(0.0057)	(0.0167)
R-squared	-0.1429	0.9216	0.9044
	0.0000	0.9412	0.9283
Ratio			0.0680***
			(0.0140)
Unemployment	0.0000	0.0056***	-0.0020
	(1.3496)	(0.0007)	(0.1951)
Unemployment:Ratio			0.0020
			(0.1953)
N	9	9	9
Adj-R2	-0.14	0.92	0.90

Therefore we can write functions to do any cleanup we want on the summary, leveraging ``pandas``: .. code-block:: default def replace_fixed_effects_cols(df): """ hackish way to do this just for example """ out_df = df.iloc[4:] #remove fixed effect dummy rows out_df.loc['Ratio Size Fixed Effects'] = ('No', 'Yes', 'No') return out_df clean_summ = replace_fixed_effects_cols(summ.tables[0]) clean_summ .. rst-class:: sphx-glr-script-out Out: .. code-block:: none /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_with_indexer(indexer, value) .. only:: builder_html .. raw:: html

	Return I	Return II	Return III
Intercept	0.0700	0.0696***	0.0020
	(0.0909)	(0.0057)	(0.0167)
R-squared	-0.1429	0.9216	0.9044
	0.0000	0.9412	0.9283
Ratio			0.0680***
			(0.0140)
Unemployment	0.0000	0.0056***	-0.0020
	(1.3496)	(0.0007)	(0.1951)
Unemployment:Ratio			0.0020
			(0.1953)
N	9	9	9
Adj-R2	-0.14	0.92	0.90
Ratio Size Fixed Effects	No	Yes	No

Pretty cool, right? Since it's a ``DataFrame``, we can even output it to LaTeX: .. code-block:: default clean_summ.to_latex() .. rst-class:: sphx-glr-script-out Out: .. code-block:: none '\\begin{tabular}{llll}\n\\toprule\n{} & Return I & Return II & Return III \\\\\n\\midrule\nIntercept & 0.0700 & 0.0696*** & 0.0020 \\\\\n & (0.0909) & (0.0057) & (0.0167) \\\\\nR-squared & -0.1429 & 0.9216 & 0.9044 \\\\\n & 0.0000 & 0.9412 & 0.9283 \\\\\nRatio & & & 0.0680*** \\\\\n & & & (0.0140) \\\\\nUnemployment & 0.0000 & 0.0056*** & -0.0020 \\\\\n & (1.3496) & (0.0007) & (0.1951) \\\\\nUnemployment:Ratio & & & 0.0020 \\\\\n & & & (0.1953) \\\\\nN & 9 & 9 & 9 \\\\\nAdj-R2 & -0.14 & 0.92 & 0.90 \\\\\nRatio Size Fixed Effects & No & Yes & No \\\\\n\\bottomrule\n\\end{tabular}\n' Looks messy here, but you can output it to a file. However it's only outputting the direct LaTeX for the table so we can add a wrapper so it will compile as document: .. code-block:: default def _latex_text_wrapper(text): begin_text = r""" \documentclass[12pt]{article} \usepackage{booktabs} \begin{document} \begin{table} """ end_text = r""" \end{table} \end{document} """ return begin_text + text + end_text def to_latex(df, filepath='temp.tex'): latex = df.to_latex() full_latex = _latex_text_wrapper(latex) with open(filepath, 'w') as f: f.write(full_latex) to_latex(clean_summ) # created temp.tex in this folder. Go look and try to compile Input and Output ---------------- Let's output our existing ``DataFrame`` to some different formats and then show it can be loaded in through those formats as well. .. code-block:: default # We are not using the index, so don't write it to file df.to_csv('temp.csv', index=False) df.to_excel('temp.xlsx', index=False) df.to_stata('temp.dta', write_index=False) # NOTE: it is possible to read from SAS7BDAT but not write to it pd.read_csv('temp.csv') pd.read_excel('temp.xlsx') pd.read_stata('temp.dta') # pd.read_sas('temp.sas7bdat') #doesn't exist because we couldn't write to it. But if you already have sas data this will work .. rst-class:: sphx-glr-script-out Out: .. code-block:: none /home/runner/.local/share/virtualenvs/py-research-workflows-rjN0B_bW/lib/python3.7/site-packages/pandas/io/stata.py:2252: InvalidColumnName: Not all pandas column names were valid Stata variable names. The following replacements have been made: b'State Return Average' -> State_Return_Average b'Lag Return' -> Lag_Return If this is not what you expect, please make sure you have Stata-compliant column names in your DataFrame (strings only, max 32 characters, only alphanumerics and underscores, no Stata reserved words) warnings.warn(ws, InvalidColumnName) .. only:: builder_html .. raw:: html

	Company	State	Date	Return	State_Return_Average	Ratio	ratio_size	Unemployment	Lag_Return
0	Publix	FL	2000-01-02	0.10	0.06	1.666667	High	0.06	NaN
1	Publix	FL	2000-01-03	0.11	0.07	1.571429	High	0.06	0.10
2	Publix	FL	2000-01-04	0.12	0.08	1.500000	High	0.06	0.11
3	Trader Joes	GA	2000-01-02	0.06	0.06	1.000000	Even	0.08	NaN
4	Trader Joes	GA	2000-01-03	0.07	0.07	1.000000	Even	0.08	0.06
5	Trader Joes	GA	2000-01-04	0.08	0.08	1.000000	Even	0.08	0.07
6	Walmart	FL	2000-01-02	0.02	0.06	0.333333	Low	0.06	NaN
7	Walmart	FL	2000-01-03	0.03	0.07	0.428571	Low	0.06	0.02
8	Walmart	FL	2000-01-04	0.04	0.08	0.500000	Low	0.06	0.03

Some Clean Up ------------- This section is not important, just cleaning up the temporary files we just generated. .. code-block:: default import os clean_files = [ 'temp.csv', 'temp.xlsx', 'temp.dta', 'temp.tex', ] for file in clean_files: os.remove(file) .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 4.168 seconds) .. _sphx_glr_download_auto_examples_data_management_in_python.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: binder-badge .. image:: https://mybinder.org/badge_logo.svg :target: https://mybinder.org/v2/gh/nickderobertis/py-research-workflows/gh-pages?urlpath=lab/tree/notebooks/auto_examples/data_management_in_python.ipynb :width: 150 px .. container:: sphx-glr-download :download:`Download Python source code: data_management_in_python.py ` .. container:: sphx-glr-download :download:`Download Jupyter notebook: data_management_in_python.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_