Replicating Gu Kelly & Xiu (2020)

This blog post is an attempt to replicate the core elements of the paper Empirical Asset Pricing via Machine Learning by Shihao Gu, Brian Kelly, and Dacheng Xiu¹. In its essence, the paper presents a horse-race for off-the-shelf machine learning (ML) models to predict return premia. For the replication I often refer to the Online Appendix of the paper, which is available here.

Warning

To render this blog post, I used a virtual machine with 32 CPU cores and 192 GB of memory. Given the dataset I work with is huge, the later may play a major role for replication on your own machine. However, the data-intensive tasks can be split into many smaller tasks, so with minor adjustments everybody should be able to replicate everything I do. I provide some reflections on reducing the computational burden at the end of the blog post.

The code relies on the following Python packages. Note that the packages sklearn and tensorflow are only needed if you want to implement the random forests, elastic net, or neural networks, respectively.

import numpy as np
import pandas as pd
import requests
import zipfile
import joblib
from pathlib import Path
from io import BytesIO
from plotnine import *
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

Data preparation

Download stock-characteristics

we build a large collection of stock-level predictive characteristics based on the cross-section of stock returns literature. These include 94 characteristics (61 of which are updated annually, 13 are updated quarterly, and 20 are updated monthly). In addition, we include 74 industry dummies corresponding to the first two digits of Standard Industrial Classification (SIC) codes.

The authors provide the dataset with stock-level characteristics online. While the original sample from the published version ended in December 2016, the author furnished the stock-level characteristics dataset until 2021. I download this (large) dataset directly from within Python from the Dacheng Xiu’s homepage. The data is stored as a .csv file within a zipped folder. For such purposes, the archive package is useful. I set timeout = 1200, which allows the Python session to download the dataset for 20 minutes. To avoid multiple downloads, I first check if the .csv file is already stored in the data-folder. If not, the download will be initiated and the .csv will be saved in the data-folder.

timeout = 1200

url = "https://dachxiu.chicagobooth.edu/download/datashare.zip"
datashare_path = Path("data/datashare.csv")
datashare_path.parent.mkdir(parents=True, exist_ok=True)

if not datashare_path.exists():
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()

    with zipfile.ZipFile(BytesIO(response.content)) as z:
        characteristics = pd.read_csv(z.open("datashare.csv"))

    characteristics.to_csv(datashare_path, index=False)

characteristics = pd.read_csv(datashare_path)

characteristics = (
    characteristics.assign(
        DATE=lambda df: (
            pd.to_datetime(
                df["DATE"],
                format="%Y%m%d",
                errors="raise",
            )
            .dt.to_period("M")
            .dt.to_timestamp()
        )
    )
    .rename(columns={"DATE": "month"})
    .pipe(
        lambda df: df.rename(
            columns={
                col: f"characteristic_{col}"
                for col in df.columns
                if col not in ["permno", "month", "sic2"]
            }
        )
    )
)

characteristics = characteristics.dropna(subset=["sic2"]).assign(
    sic2=lambda df: df["sic2"].astype("category")
)

The characteristics appear in raw format. As many machine-learning applications are sensitive to the scaling of the data, one typically employs some form of standardization.

We cross-sectionally rank all stock characteristics period-by-period and map these ranks into the [-1, 1] interval (footnote 29).

The idea is that at each date, the cross-section of each predictor should be scaled such that the maximum value is 1 and the minimum value is -1. The cross-sectional ranking is time-consuming. The function below explicitly handles NA values, so they do not tamper with the ranking.

def rank_transform(x):
    x = pd.Series(x)

    rank_x = x.rank(method="average", na_option="keep")
    min_rank = 1.0
    max_rank = float(x.notna().sum())

    if max_rank == 0:  # only NAs
        return pd.Series(np.nan, index=x.index)
    elif max_rank == 1:  # exactly one non-NA
        return pd.Series(np.nan, index=x.index)
    else:
        scaled = 2 * ((rank_x - min_rank) / (max_rank - min_rank) - 0.5)
        return scaled


char_cols = [col for col in characteristics.columns if "characteristic" in col]

characteristics[char_cols] = characteristics.groupby("month")[char_cols].transform(
    rank_transform
)

Warning

In this blog post I only run code which fits a random forest. Random forests are known for being low-maintenance when it comes to feature engineering and indeed, the transformation and scaling of the predictors does not affect the fitted Random forest. However, as I also show how to fit methods which are sensitive to the scaling of the predictors, I include this crucial part of data preparation.

Another issue is missing characteristics, which we replace with the cross-sectional median at each month for each stock, respectively (footnote 30).

The paper actually claims that each missing value is replaced by the cross-sectional median at each month. However, one of the paper’s coauthors also claims otherwise:

NA values are set to zero (Q&A file furnished by Shihoa Gu)

I interpret the divergence as follows: First, I replace missing values with the cross-sectional median. However, for some months, few characteristics (e.g. absacc) do not have any value at all, leaving the cross-sectional median undefined. Thus, in a second step, I replace remaining missing characteristics with zero.

characteristics[char_cols] = characteristics.groupby("month")[char_cols].transform(
    lambda x: x.fillna(x.median(skipna=True))
)

characteristics[char_cols] = characteristics[char_cols].fillna(0)

We obtain monthly total individual equity returns from CRSP for all firms listed in the NYSE, AMEX, and NASDAQ. Our sample begins in March 1957 (the start date of the S&P 500) (p. 2248)

To retrieve CRSP excess returns, we merge the stock-level characteristics with the prepared monthly CRSP data as described in WRDS, CRSP, and Compustat. Note that in the book we start the CRSP sample in the year 1960 (partially due to data availability). The difference of three years may explain some deviations in the predictive performance. You can easily adjust the variable start_date <- ymd("1957-03-01") when running the code to reproduce the file tidy_finance_r.sqlite.

To create portfolio sorts based on the predictions, mktcap_lag remains in the sample (albeit it should be mentioned that the main results in Section 2.4.2 of the paper are derived computing equal-weighted portfolios).

We also construct eight macroeconomic predictors following the variable definitions detailed in Welch and Goyal (2008), including dividend-price ratio (dp), earnings-price ratio (ep), book-to-market ratio (bm), net equity expansion (ntis), Treasury-bill rate (tbl), term spread (tms), default spread (dfy), and stock variance (svar). (p. 2248)

We rely on the preparation of the macroeconomic predictors from the corresponding chapter in Accessing and Managing Financial Data. The following code snippets merges the stock-level characteristics with the CRSP sample and the macroeconomic predictors. Note that no further adjustment of the timing of excess returns and stock-level characteristics is necessary:

In this dataset, we’ve already adjusted the lags. (e.g. When DATE=19570329 in our dataset, you can use the monthly RET at 195703 as the response variable.) (readme.txt from the available dataset on Dacheng Xiu’s homepage).

To remain consistent, we only lag the macropredictors by one month.

crsp_monthly = pd.read_parquet("data/crsp_monthly.parquet")
macro_predictors = pd.read_parquet("data/macro_predictors.parquet")

crsp_monthly = (
    crsp_monthly.assign(
        month=lambda df: (
            pd.to_datetime(df["date"], errors="raise")
            .dt.to_period("M")
            .dt.to_timestamp()
        )
    )
    .loc[:, ["month", "permno", "mktcap_lag", "ret"]]
    .rename(columns={"ret": "ret_excess"})
)

macro_predictors = (
    macro_predictors.assign(
        month=lambda df: (
            pd.to_datetime(df["date"], errors="raise")
            .dt.to_period("M")
            .dt.to_timestamp()
        )
    )
    .loc[:, ["month", "dp", "ep", "bm", "ntis", "tbl", "tms", "dfy", "svar"]]
    .rename(
        columns={
            col: f"macro_{col}" for col in macro_predictors.columns if col != "month"
        }
    )
)

macro_predictors = macro_predictors.sort_values("month").assign(
    **{
        col: macro_predictors[col].shift(1)
        for col in macro_predictors.columns
        if col != "month"
    }
)

characteristics = (
    characteristics.merge(crsp_monthly, on=["month", "permno"], how="inner")
    .merge(macro_predictors, on="month", how="inner")
    .sort_values(["month", "permno"])
    .assign(macro_intercept=1)
    .loc[
        :,
        ["permno", "month", "ret_excess", "mktcap_lag", "sic2"]
        + [col for col in characteristics.columns if "macro" in col]
        + [col for col in characteristics.columns if "characteristic" in col],
    ]
)

Create dataset with all interaction terms

Before applying machine-learning models, the authors include interaction terms between macroeconomic predictors and stock characteristics in the sample.

We include 74 industry dummies corresponding to the first two digits of Standard Industrial Classification (SIC) codes. […] [The stock-level covariates also include] interactions between stock-level characteristics and macroeconomic state variables. The total number of covariates is 94×(8+1)+74=920. (p. 2249)

The following recipe creates dummies for each sic2 code and adds interaction terms between macro variables and stock characteristics. Note that data for the recipe is only required to identify the columns in the sample. No computations are performed at the first step. For that reason, I call head() to keep the computational burden a bit lower without affecting the outcome.

id_cols = ["permno", "month", "mktcap_lag"]

y = characteristics["ret_excess"]
X = characteristics.drop(columns=id_cols + ["ret_excess"])

char_cols = [c for c in X.columns if "characteristic" in c]
macro_cols = [c for c in X.columns if "macro" in c]

for c in char_cols:
    for m in macro_cols:
        X[f"{c}_x_{m}"] = X[c] * X[m]

X = pd.get_dummies(X, columns=["sic2"], prefix="sic2")

Next, I create the full dataset containing 3,046,332 monthly observations and 920 covariates.

characteristics_prepared = pd.concat([characteristics[id_cols], y, X], axis=1).assign(
    year=lambda df: df["month"].dt.year
)

I store the prepared file for your convenience. Furthermore, the authors never use the entire dataset for training, instead, only a fraction needs to be available in memory at any given point in time.

for year, df_year in characteristics_prepared.groupby("year"):
    df_year.drop(columns="year").to_parquet(
        f"results/characteristics_prepared/year={year}.parquet", index=False
    )

Sample splitting and tuning

We divide the 60 years of data into 18 years of training sample (1957–1974), 12 years of validation sample (1975–1986), and the remaining 30 years (1987–2016) for out-of-sample testing. Because machine learning algorithms are computationally intensive, we avoid recursively refitting models each month. Instead, we refit once every year as most of our signals are updated once per year. Each time we refit, we increase the training sample by 1 year. We maintain the same size of the validation sample, but roll it forward to include the most recent 12 months

I visualize my understanding of this data-budget below.

validation_length = 12
all_years = np.arange(1967, 2022)
oos_years = np.arange(1987, 2022)

estimation_periods = pd.DataFrame(
    {
        "oos_year": oos_years,
    }
)

estimation_periods = estimation_periods.assign(
    validation_end=lambda df: df["oos_year"] - 1,
    validation_start=lambda df: df["oos_year"] - validation_length,
    training_start=1957,
    training_end=lambda df: df["validation_start"] - 1,
).loc[
    :,
    [
        "training_start",
        "training_end",
        "validation_start",
        "validation_end",
        "oos_year",
    ],
]

rows = []

for _, row in estimation_periods.iterrows():
    for year in all_years:
        if row.training_start <= year <= row.training_end:
            classification = "Training"
        elif row.validation_start <= year <= row.validation_end:
            classification = "Validation"
        elif year == row.oos_year:
            classification = "OOS"
        else:
            continue
        rows.append(
            {"year": year, "oos_year": row.oos_year, "classification": classification}
        )

visualization_data = pd.DataFrame(rows)

(
    ggplot(visualization_data, aes(x="year", y="oos_year", color="classification"))
    + geom_point(size=1)
    + labs(title="Data classification timeline", x=None, y=None, color=None)
    + theme_minimal()
    + theme(axis_text_y=element_blank(), axis_ticks_y=element_blank())
)

The simple scheme makes model tuning computationally less complex. As there is no need to create several validation sets and thus to reestimate the models several times, the overall time to compute predictions is reduced substantially. One may discuss the substantial sampling variation of the resulting MSPE from the validation sample (typically cross-validation is used to obtain more than one estimate of the prediction error). However, the authors clearly state:

We do not use cross-validation to maintain the temporal ordering of the data. (footnote 32)

Model specification

The paper describes a collection of machine learning methods that was used in the analysis. I refer to Section 1 for a detailed description of the individual models. This blog post focuses on the implementation issues that arise when applying off-the-shelf ML methods on such a large dataset.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from sklearn.model_selection import ParameterGrid
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet

Table A.5 (online appendix) describes the set of hyperparameters and their potential values used for tuning each machine learning model.

In the following I rely on the well-documented tidymodels workflow (e.g. Factor Selection with Machine Learning) to estimate the random forest model.

For the random forest, Table A5 in the Online Appendix states that the number of grown trees is fixed to 300. The number of features in each split is set to \(\in \{3, 5, 10, 20, 30, 50, ...\}\). It is not clear to me what \(...\) means in that context, for that reason I fix the number of choices to six. Tree depth is controlled indirectly via the minimum number of samples per leaf min_samples_leaf, with values [5000, 10000] considered in the grid. This setting roughly correpsonds to pruning: a small min_samples_leaf allows deeper trees, while a larger value produces more pruned trees. I set the option n_jobs=1to use all available CPU cores.

The workflow is implemented using a sklearn Pipeline, with the random forest as the main estimator:

rf_model = RandomForestRegressor(
    n_estimators=300,  # trees
    max_features=None,  # placeholder (mtry tuned later)
    min_samples_leaf=1,  # placeholder (min_n tuned later)
    n_jobs=-1,  # use all cores
    random_state=42,
)

param_grid = {
    "rf__max_features": [3, 5, 10, 20, 30, 50],
    "rf__min_samples_leaf": [5000, 10000],
}

grid = list(ParameterGrid(param_grid))

id_cols_new = ["permno", "month", "mktcap_lag", "year"]

X = characteristics_prepared.drop(columns=id_cols_new + ["ret_excess"])
y = characteristics_prepared["ret_excess"]

Model fitting

The following performs the model estimation to create predictions for all the years until 2021. For a complete replication of the procedure laid out in the paper, you’ll have to run a loop over all out-of-sample years (j).

j = 0

split_dates = estimation_periods.iloc[j]
oos_year = int(split_dates["oos_year"])
pred_file = Path(f"results/predictions_{oos_year}.parquet")

id_cols = ["permno", "month", "mktcap_lag", "year"]

X_train, y_train = characteristics_prepared[
    characteristics_prepared["month"].dt.year < split_dates["training_end"]
].pipe(lambda df: (df.drop(columns=id_cols + ["ret_excess"]), df["ret_excess"]))

X_val, y_val = characteristics_prepared[
    (characteristics_prepared["month"].dt.year >= split_dates["validation_start"])
    & (characteristics_prepared["month"].dt.year <= split_dates["validation_end"])
].pipe(lambda df: (df.drop(columns=id_cols + ["ret_excess"]), df["ret_excess"]))

X_combined = pd.concat([X_train, X_val], axis=0)
y_combined = pd.concat([y_train, y_val], axis=0)

test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val))])

ps = PredefinedSplit(test_fold)

I tune the model and - after selecting the optimal tuning parameter value based on the prediction error in the validation set - fit the penalized regression using the available data (training and validation test set). The process can be faster by distribution the parameter tuning across multiple workers. This is done by setting n_jobs=-1. In sklearn, GridSearchCV handles both the evaluation of multiple hyperparameter combinations and cros-validation in parralel.

rf_model = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=42)

param_grid = {"max_features": [3, 5, 10, 20, 30, 50], "min_samples_leaf": [5000, 10000]}

grid_search = GridSearchCV(
    rf_model, param_grid, cv=ps, scoring="neg_root_mean_squared_error", n_jobs=-1
)

grid_search.fit(X_combined, y_combined)

If more computation time is available, it is likely advisable to expand the number of evaluation by providing a wider grid of possible parameter constellations. Next, we finalize the process by selecting the hyperparameter constellation which delivered the smallest mean squared prediction error and fit the model once more using the entire (training and validation) datasets.

ml_fit_best = grid_search.best_params_

Trained models contain a lot of information which may be redundant for the task at hand: generating predictions. The final model can be saved efficiently for reproductibility and later use using joblib. To provide an intermediary anchor (also to ensure reprocudibility), I follow this example to store the trained model. Julia Silge provided some reflections on best practices here which I tried to incorporate.

rf_final = RandomForestRegressor(
    n_estimators=300, n_jobs=-1, random_state=42, **ml_fit_best
)

rf_final.fit(X_combined, y_combined)

joblib.dump(rf_final, f"results/fitted_workflow_{int(split_dates['oos_year'])}.joblib")

out_of_sample = characteristics_prepared[
    characteristics_prepared["month"].dt.year == oos_year
].copy()

X_oos = out_of_sample.drop(columns=id_cols + ["ret_excess"])

out_of_sample[".pred"] = rf_final.predict(X_oos)

out_of_sample.to_parquet(pred_file)

print(f"Finished {oos_year}")

Other ML models

While this blog post only shows how to implement the prediction procedure based on the random forest, it is no problem to generate predictions for the remaining main methods (elastic net, neural networks, …) presented in the original paper. Below, I provide code for models that come as close as possible to the specifications laid out in the Online Appendix of the paper.

For OLS, ENet, GLM, and GBRT, we present their robust versions using Huber loss, which perform better than the version without. (p. 2250)

For the sake of simplicity, I only show how to fit linear models using \(l2\) loss functions. In R, the package hqreg implements regularization paths for Lasso or elastic net models with Huber loss, but Python’s sklearn does not directly support Huber-penalized regularization paths.

For instance, to fit the elastic net it seems necessary to evaluate the mean-squared prediction error in the validation sample for only two values of the penalty term. Instead of tuning the elastic net mixture coefficient, I follow the paper and show how to fix its value to \(0.5\).

elastic_net = ElasticNet(
    alpha=1.0,  
    l1_ratio=0.5,  
    max_iter=10000,
)

penalty_grid = {"alpha": [1e-4, 1e-1]}

For the neural network, the paper specifies the manifold architecture choices

Our shallowest neural network has a single hidden layer of 32 neurons, which we denoted NN1. Next, NN2 has two hidden layers with 32 and 16 neurons, respectively; NN3 has three hidden layers with 32, 16, and 8 neurons, respectively; NN4 has four hidden layers with 32, 16, 8, and 4 neurons, respectively; and NN5 has five hidden layers with 32, 16, 8, 4, and 2 neurons, respectively. (p. 2244)

We use the same activation function at all nodes, and choose a popular functional form in recent literature known as the rectified linear unit (ReLU) (p. 2244)

In the following I show the exemplary implementation of the neural network with two hidden layers using keras. Most information is taken from Table A5 in the online appendix. The table mentions a parameter patience = 5 which I understand as a command on how many iterations with no improvement before stopping the optimization process.

def build_model(learn_rate=0.001, penalty=1e-5, input_dim=None):

    model = keras.Sequential(
        [
            layers.Dense(
                32,
                activation="relu",
                kernel_regularizer=regularizers.l2(penalty),
                input_shape=(input_dim,),
            ),
            layers.Dense(
                16, activation="relu", kernel_regularizer=regularizers.l2(penalty)
            ),
            layers.Dense(1),  # regression output
        ]
    )

    optimizer = keras.optimizers.Adam(learning_rate=learn_rate)

    model.compile(optimizer=optimizer, loss="mse")

    return model


early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)

nn_grid = {"learn_rate": [0.001, 0.01], "penalty": [1e-5, 1e-3]}

nn_grid = list(ParameterGrid(nn_grid))

Machine-learning portfolios

After fitting the model(s), there are plenty ways to evaluate the predictive performance. In the remainder of the blog post I focus on the economically most meaningful (imho) application presented in the paper:

At the end of each month, we calculate 1-month-ahead out-of-sample stock return predictions for each method. We then sort stocks into deciles based on each model’s forecasts. We reconstitute portfolios each month using value weights. Finally, we construct a zero-net-investment portfolio that buys the highest expected return stocks (decile 10) and sells the lowest (decile 1). (p. 2264)

The following code chunks performs the predictions for every out-of-sample year (starting from 1987): First, I predict stock excess returns based on the fitted model, then we sort assets into portfolios based on the prediction. In line with the original paper, I chose an equal-weighted scheme which is arguably more sensitive to trading cost (see the discussion on p. 2265 of the paper). Note that in the original paper, the model parameters are updated each year.

fitted_workflow = joblib.load("results/fitted_workflow_1987.joblib")


def create_predictions(oos_year):

    out_of_sample = characteristics_prepared[
        characteristics_prepared["month"].dt.year == oos_year
    ].copy()

    id_cols = ["permno", "month", "mktcap_lag", "year"]

    X_oos = out_of_sample.drop(columns=id_cols + ["ret_excess"])

    out_of_sample[".pred"] = fitted_workflow.predict(X_oos)

    out_of_sample = out_of_sample[
        ["permno", "month", "mktcap_lag", "ret_excess", ".pred"]
    ]

    return out_of_sample


oos_predictions = pd.concat(
    [create_predictions(y) for y in estimation_periods["oos_year"]], axis=0
).reset_index(drop=True)

Warning

Note that I am calling the fitted model from the first out-of-sample year (1987) in each iteration. In other words, the model parameters do not get updated every year as in the original paper. If you estimated the model for each value of \(j\) as described above, you can replace the line above to replicate the original procedure.

To assign portfolios using the predictions, I implement a procedure similar to the assign_portfolio() from the tidyfinance package in R. I use pandas.qcutto divide the predicted reutns into deciles (portfolios 1-10).

def assign_portfolio(group):
    group = group.copy()
    group["portfolio"] = (
        pd.qcut(group[".pred"], 10, labels=False, duplicates="drop") + 1
    )
    return group


ml_portfolios = oos_predictions.groupby("month", group_keys=False).apply(
    assign_portfolio
)

ml_portfolios = (
    ml_portfolios.groupby(["portfolio", "month"])
    .agg(ret_predicted=(".pred", "mean"), ret_excess=("ret_excess", "mean"))
    .reset_index()
)

Finally, we evaluate the performance of the ML-portfolios. The table reports the predicted return, the average realized return, the standard deviation, and the out-of-sample Sharpe ratio for each of the decile portfolios as well as the long-short zero-cost portfolio.

hml = ml_portfolios.pivot(index="month", columns="portfolio")

hml_portfolio = pd.DataFrame(
    {
        "month": hml.index,
        "portfolio": "H-L",
        "ret_excess": hml["ret_excess"][10] - hml["ret_excess"][1],
        "ret_predicted": hml["ret_predicted"][10] - hml["ret_predicted"][1],
    }
).reset_index(drop=True)

final_portfolios = pd.concat([ml_portfolios, hml_portfolio], axis=0)

summary = final_portfolios.groupby("portfolio").agg(
    predicted_mean=("ret_predicted", "mean"),
    realized_mean=("ret_excess", "mean"),
    realized_sd=("ret_excess", "std"),
)

summary["sharpe_ratio"] = summary["realized_mean"] / summary["realized_sd"]

print(summary)

summary_reset = summary.reset_index()
summary_reset["portfolio"] = summary_reset["portfolio"].astype(
    str
)  
summary_reset.to_parquet("results/portfolio_summary.parquet", index=False)

Footnotes

The Review of Financial Studies, Volume 33, Issue 5, May 2020, Pages 2223–2273, https://doi.org/10.1093/rfs/hhaa009 ↩︎