Deep Learning in Investing: Opportunity in Unstructured Data
July 2020
Introduction
Deep learning is a machine learning technique utilizing complex, multi-layered statistical models, often with tens of millions or billions of parameters. Its recent ascent has been fueled by the rise of vast datasets and cheap computing.

Deep learning is widely used in the fields of computer vision, natural language processing, and speech recognition, which are characterized by large, complex, unstructured datasets. However, it has seen limited adoption in investment management. We believe this is because most investors are still trying to use it on traditional structured data to directly predict asset prices. However, structured financial data is not fertile ground for deep learning.

Investing is a niche industry with specialized documents only accessible to highly trained domain experts. Transfer learning helps us transcend this limitation by bringing in knowledge gained from bigger, broader domains. Transfer learning lowers barriers to entry, so that deep learning is no longer the plaything of the big tech oligopoly. Multimillion-dollar datasets and hardware not required!

This paper revolves around two practical investment case studies. First, we show how transfer learning can be used to produce state-of-the-art results in earnings call sentiment analysis. Second, we use a proprietary dataset of 1,000 alphas to show the limitations of using deep learning directly to predict asset prices.
Exhibit 1
Powered By Deep Learning
Source: Sparkline, Waymo, Apple

Part 1: Unstructured Data

Warning: Natural language processing (NLP) is an extremely fast-moving field and it is possible that some of the ideas here may become outdated or even contradicted in the near future.
In general, artificial intelligence begins its wave of disruption by first automating the most routine parts of our jobs. A significant portion of the financial analyst's day is spent reading textual documents ranging from financial news to broker research. In the age of big data, this has become an increasingly overwhelming task. Fortunately, deep learning can greatly streamline the way we consume this data.

From Word Vectors to Language Models

Our June 2019 paper, Investment Management in the Machine Learning Age, discussed word embeddings (word2vec). Introduced in 2013, word embeddings are matrices that encode the relationships between words. We showed the graphic below, which illustrates how the words used in 10-Ks cluster based on common meaning.
Exhibit 2
10-K Word Embeddings
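As a rough illustration of how embeddings like those behind Exhibit 2 can be trained, the sketch below fits a small word2vec model on tokenized 10-K sentences. The toy corpus, gensim 4.x API, and query word are assumptions for illustration only, not the pipeline used for the exhibit.

# Minimal word2vec sketch (assumes gensim >= 4.0 and a pre-tokenized 10-K corpus).
from gensim.models import Word2Vec

# Hypothetical corpus: each filing sentence is a list of lowercase tokens.
corpus = [
    ["net", "revenue", "increased", "due", "to", "higher", "subscription", "sales"],
    ["goodwill", "impairment", "charges", "reduced", "operating", "income"],
    # ... millions more sentences parsed from 10-K filings ...
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # dimensionality of each word vector
    window=5,          # context window around each word
    min_count=1,       # keep rare words for this toy example
    workers=4,
)

# Words used in similar contexts end up near each other in the embedding space.
print(model.wv.most_similar("revenue", topn=5))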
Exhibit 4
Guess What's Behind the [MASK]?

A statistical language model is a probability [MASK] over sequences of [MASK]. Given such a [MASK], it assigns a probability to the [MASK] sequence.

Source: Sparkline, Wikipedia

After training on millions of documents, language models can do some cool stuff. The most obvious application is autocompletion, where we guess the word (or sequence of words) given a prompt.

Exhibit 5
Autocompletion
Source: Google

One important feature of language models is that they do not require humans to manually create the training data. Text can be automatically parsed into training examples, such as by randomly masking words. This enables us to cheaply create massive training corpuses from millions of websites, books, articles, and other written media.

However, moving from word embeddings to language models has its drawbacks. More complex models are more powerful but require more data and compute to train. Given its simple architecture, we showed that word2vec produced impressive results when trained on a relatively small sample of 100,000 10-Ks. And the training process took only a few minutes on standard hardware.

By comparison, the language model GPT-2 has 1.5 billion parameters and was trained on 8 million web pages. It has been estimated that training GPT-2 cost $20-50K in computing budget spent over 1-10 months. Even putting aside time and money, there simply aren't enough 10-Ks in existence to train a model of this size. We could of course use a smaller model, but then we would have to sacrifice performance.

Transfer Learning

The big breakthrough came in early 2018 when language modeling was combined with transfer learning. The idea behind transfer learning is to first "pre-train" a model on a large general-purpose dataset, then "fine tune" it on a smaller domain-specific dataset for a specialized task. In our example above, we could pre-train GPT-2 on 8 million web pages then fine tune it on our 100,000 10-Ks. This avoids having to train the model from scratch on a small dataset.

Exhibit 6
Transfer Learning
Source: Sparkline

Language models are extremely useful for the pre-training stage of transfer learning. It turns out the ability to predict words requires a significant level of semantic awareness. This broad linguistic understanding is foundational for many other NLP tasks. For example, tasks as disparate as translation, question answering, and named entity recognition all benefit from starting with a pre-trained language model.

In practice, fine tuning involves starting with a pre-trained language model and swapping out the final layer, exchanging it for the specific building block that meets your needs. For example, if we want to do classification, we replace the final layer of the language model with a classifier head. We then retrain the model for the new task, adjusting the existing model's weights to incorporate learnings from the fine-tuning dataset.
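To make the head-swapping idea concrete, here is a minimal sketch using the open-source Hugging Face transformers library: it loads a pre-trained BERT, attaches a freshly initialized two-class head, and fine tunes the full stack on a handful of labeled examples. The model name, example texts, and training settings are illustrative assumptions, not the paper's actual configuration.

# Minimal fine-tuning sketch (assumes: pip install torch transformers).
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained language model and replace its final layer with a
# randomly initialized 2-class classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny illustrative fine-tuning set (positive = 1, negative = 0).
texts = ["Margins expanded and guidance was raised.",
         "We expect continued weakness in demand next quarter."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the small labeled dataset
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()  # gradients flow through the head AND the pre-trained layers
    optimizer.step()
    optimizer.zero_grad()

# Predict sentiment for a new snippet.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer("Revenue beat expectations.", return_tensors="pt")).logits
print(logits.softmax(dim=-1))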
Exhibit 7
One Model, Many Uses
Source: Sparkline

One way to better understand fine tuning is by analogy to computer vision, where transfer learning had been widely applied prior to its crossover to NLP. In these models, the lower layers capture basic features such as edges and textures, while the higher layers depict more complete objects such as eyes, faces, legs, and dogs.

In our context, the lower layers of the neural network capture the fundamental building blocks of language (e.g. words), while the higher layers contain higher-level linguistic concepts. The final layer is dedicated to our specific task. Fine tuning allows our model to utilize the fundamental knowledge from earlier layers, while adjusting the end output to our specific task.

Exhibit 8
Lower Layers Encode Lower-Level Features

We are extremely blessed that the NLP research community has embraced the open source philosophy. Anyone can freely download massive language models that have been pre-trained on millions of documents. This saves hundreds of thousands of dollars, weeks of training time, and the redundancy of researchers constantly having to reinvent the wheel. With the heavy lifting out of the way, the fine tuning process is quite cheap and tractable even for less-resourced teams.

The NLP 🚀

The combination of language modeling and transfer learning opened the floodgates for a wave of innovation. Over the past couple years, Google, Facebook, Microsoft, OpenAI and others have introduced a succession of models building on this foundational concept.

These models have gotten bigger and bigger as datasets, computing resources and modeling techniques have improved (Exhibit 9). In Feb 2018, the state-of-the-art ELMo model had 94 million parameters. By Oct 2019, the T5 transformer had pushed the frontier to 11 billion parameters. Last month, GPT-3 was released with 175 billion parameters. The exponential trendline shows that we have experienced a 10x increase in model size every 8.5 months since pre-trained language models were introduced in 2018.
Exhibit 9
NLP 🚀
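As a sanity check on the trendline in Exhibit 9, the short calculation below recovers the growth cadence directly from the two endpoints quoted in the text (ELMo's 94 million parameters in Feb 2018 and GPT-3's 175 billion in mid-2020). The exact month count is an assumption.

# Back-of-the-envelope check of the "10x every 8.5 months" trendline.
import math

elmo_params = 94e6      # ELMo, Feb 2018
gpt3_params = 175e9     # GPT-3, roughly May/June 2020
months_elapsed = 28     # assumed gap between the two releases

orders_of_magnitude = math.log10(gpt3_params / elmo_params)  # ~3.3
months_per_10x = months_elapsed / orders_of_magnitude        # ~8.6

print(f"Roughly one 10x increase every {months_per_10x:.1f} months")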
These breakthroughs have made their way into the real world. In Oct 2019, Google converted its search engine to BERT. The results are so good that researchers are being forced to confront their ethical implications. For example, OpenAI decided to release GPT-2 in multiple phases to give [...]

"Understood. I'd say that we probably lost $0.5 million to $0.75 million in the fourth quarter of the year due to some of those headwinds as an approximation for the combination of outages, weathers and the like."
[...] yourself. Given its specificity, it shouldn't be a surprise that no open-source dataset of earnings calls with binary sentiment labels exists.

This leads us to two more challenges faced by those in niche domains such as investing. First, it is a general principle that cost per label increases with domain specificity. While pretty much anyone can identify images of stop signs, it requires years of training to recognize signs of financial fraud. Second, even if money were no object, large datasets in niche industries may simply not exist. For instance, there are only a finite number of observations on which to train a model to find the next Enron or Wirecard.

While the media are obsessed with hyping "big data", in many cases it is unrealistic to simply throw more data at the problem. We may be better served working to extract the most insight from the limited data we do have.

Cross-Training for Computers

With this in mind, we ran an experiment to see how well we could do in an extremely data-constrained environment. We labeled 100 earnings call transcript snippets by hand, classifying each as positive or negative. We used 50 to train the model and 50 to evaluate its out-of-sample performance. Compared to the 25,000 training samples in IMDb, a 50-observation training set is extremely small.

We used BERT as our representative deep learning model. BERT has 340 million parameters, so it should be no surprise that training on 50 observations did not work. We achieved testing accuracy of 54%, indistinguishable from random chance. For comparison, we also trained a simpler model -- logistic regression. This also did not work. Natural language is very complex.

As a benchmark, we tested the old-school dictionary approach. We used the Loughran-McDonald lexicon, which was created by two finance professors and is widely used in the industry. We classified texts based on the net occurrence of positive and negative words. Loughran-McDonald achieved a respectable accuracy of 68%. In a sense, dictionary methods are a form of transfer learning. Instead of artificial neural networks, we rely on Profs. Loughran and McDonald's actual neurons, pre-trained over their many years of experience in the field.

So far, machine learning has failed. In order to get reasonable results, we would need a significantly larger training sample. But now let's see if transfer learning can help. We use the training progression below.

Exhibit 13
Transfer Learning for Earnings Call Sentiment
Source: Sparkline

BERT was originally pre-trained to perform language modeling on a large corpus of books and Wikipedia articles. Instead of initializing our model with random values, we can use these pre-trained weights. But books and Wikipedia articles differ greatly from earnings calls in structure, tone, and vocabulary. Thus, we continue BERT's education. This time we have it read earnings call transcripts. Fortunately, language model training does not require us to manually label any data. Thus, we can give BERT tens of thousands of unlabeled transcripts to study without our supervision.

BERT now understands both general English language and financial jargon. However, it has never done sentiment analysis. We correct this with one final transfer learning step. We train BERT on the IMDb dataset from earlier. While movie reviews are quite different from earnings calls, the sentiment analysis task is highly relevant. Think of all these steps as cross-training for computers. Putting in thousands of reps in the pre-season allows BERT to perform on game day.
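The domain-adaptation step in this progression, continuing BERT's language-model training on unlabeled transcripts, might look roughly like the sketch below, again using the Hugging Face transformers library. The file path, model name, and training settings are illustrative assumptions rather than the paper's actual setup.

# Continued masked-language-model pre-training on unlabeled transcripts
# (assumes: pip install torch transformers).
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical corpus: one transcript snippet per line in a text file.
with open("earnings_call_snippets.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

encodings = tokenizer(lines, truncation=True, max_length=128)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# The collator randomly masks 15% of tokens, so no human labels are needed --
# the original words themselves are the prediction targets.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-earnings-lm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-earnings-lm")  # reload later with a classification head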
Exhibit 14
The Power of Transfer Learning
Source: Sparkline
We find that each transfer learning step increases the performance of the model. With all three, we achieve 89% accuracy. This is a full 21 percentage points better than Loughran-McDonald. This result is kind of incredible. We spent an hour labeling and now have a model that can extract transcript sentiment automatically with much greater accuracy than the current industry standard.

BERT and its successors are extremely large models. Thus, one might assume they are only useful for huge companies like Google or Facebook with their billions of search records and social interactions. The beauty of transfer learning is that it allows us to take advantage of the vast resources baked into pre-trained language models for use with small, specialized datasets.

The fundamental techniques demonstrated here can be used for many other NLP tasks besides sentiment analysis. Pre-trained language models are an incredibly powerful tool, and we encourage you to think about other ways they can be applied to improve the way we utilize unstructured data in our industry.

Exhibit 15
Transfer Learning in the Matrix
Source: Sparkline, The Matrix
Optimal Complexity

Every dataset has an optimal level of model complexity. Overly simple models underfit, failing to capture all the nuances of the data. Overly complex models overfit, failing to work out of sample.

Exhibit 17
If Goldilocks Were a Statistician
Source: edpresso

The point at which optimal complexity is achieved depends on the size of the dataset. Bigger datasets can sustain more complex models. The extremely stylized chart below illustrates this point.

Exhibit 18
Model Complexity Should Match Data Size

We illustrate this point empirically using our own data. Sparkline has a library of thousands of alphas. These range from standard quant factors like price-to-book ratios to proprietary signals derived from crawling the public internet. We use a random subset of 1,000 of these signals for the experiment below.

Neural networks can be viewed as linear regression with more layers. Conversely, linear regression can be viewed as a neural network with only one layer. Thus, we begin with linear regression and successively build more complex architectures. We use feedforward neural networks with batch normalization, ReLU, and dropout. Don't worry about the details -- the main takeaway is that these networks get more complex as we add depth.

Exhibit 19
Neural Networks of Increasing Complexity
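For the curious, a network along these lines can be written in a few lines of PyTorch. The sketch below is an assumption about the architecture, not Sparkline's actual code; hidden widths of 100 and 50 are chosen because, with 1,000 input signals, they happen to reproduce the 105,501-parameter count quoted later for the 3-layer model.

# Sketch of a feedforward network with batch normalization, ReLU, and dropout.
import torch.nn as nn

def build_model(n_features=1000, hidden=(100, 50), dropout=0.5):
    """Stack Linear -> BatchNorm -> ReLU -> Dropout blocks, ending in one output."""
    layers, width = [], n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.BatchNorm1d(h), nn.ReLU(), nn.Dropout(dropout)]
        width = h
    layers.append(nn.Linear(width, 1))  # predicted forward return
    return nn.Sequential(*layers)

linear_regression = build_model(hidden=())        # 1 layer,  1,001 parameters
three_layer_net = build_model(hidden=(100, 50))   # 3 layers, 105,501 parameters

n_params = sum(p.numel() for p in three_layer_net.parameters())
print(n_params)  # 105501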
Exhibit 20
Simulated Strategy Returns
Source: Sparkline

The left panel contains the validation period. The more complex the model, the better the performance. The right panel contains the test period. The 3-layer model does the best out of sample, especially over the past couple years including the ongoing COVID-19 crisis.

The next exhibit summarizes the results using Sharpe Ratio (i.e., signal-to-noise ratio). The chart looks as if it were taken straight out of a machine learning textbook!

Exhibit 21
Sharpe Ratio and Model Complexity
Source: Sparkline

Sharpe Ratio is lower in the test period than the validation period. This is expected, as the validation period has the benefit of hindsight and alphas should naturally decay as they are discovered in the latter period.

Optimal complexity in the test period is achieved at 3 layers. This implies that linear regression is too simple. It does not capture the full intricacies of the data. On the other hand, the 5-layer neural network is too complex. It overfits the data so badly that, despite an incredible backtest, it performs only a bit better than linear regression out of sample.

Our optimal model produced a Sharpe Ratio of 1.6. This is a meaningful improvement over linear regression, which delivered a Sharpe Ratio of 1.0. We can conclude there is room for improvement moving beyond the "simple" tier of model complexity, but venturing too far into the "complex" zone leads to overfitting.

Shallow Deep Learning

Our optimal model has 3 layers and 105,501 parameters. This is a lot more than linear regression, with its measly 1 layer and 1,001 parameters. However, it pales in comparison to the deep learning architectures used on unstructured data. For example, here is ResNet-50, a popular computer vision model with 50 layers and 25 million parameters.

Exhibit 22
ResNet-50
Source: Deep Residual Learning for Image Recognition

We added our optimal model to the chart of modern NLP models from the prior section. We overfit our data at just 100,000 parameters. Yet this is 1,000 times smaller than ELMo and over 1M times smaller than GPT-3.
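Stepping back to the metric in Exhibit 21, the snippet below shows how an annualized Sharpe Ratio is typically computed from a series of simulated strategy returns. The monthly frequency and the returns themselves are assumptions for illustration, not Sparkline data.

# Annualized Sharpe Ratio of a return series (assumed monthly frequency).
import numpy as np

def sharpe_ratio(returns, periods_per_year=12):
    """Mean return divided by volatility, annualized."""
    returns = np.asarray(returns)
    return returns.mean() / returns.std(ddof=1) * np.sqrt(periods_per_year)

# Hypothetical monthly strategy returns.
simulated = np.array([0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.00, 0.02])
print(round(sharpe_ratio(simulated), 2))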
Exhibit 23
NLP 🚀 ++
Source: Sparkline (Adapted from HuggingFace)

One might argue that our results are specific to our dataset and model setup. Of course, we could further optimize the hyperparameters and architecture. We could also go to daily frequency data and further expand the number of signals. However, this would not qualitatively change our conclusion.

Deep learning models can offer an improvement upon linear regression. However, due to inherent limitations in financial data, the models quickly start overfitting with even simple architectures. The whole point of deep learning models is that they are deep -- consisting of dozens of layers and millions of parameters. Being forced to resort to "shallow deep learning" means sacrificing most of the benefit of these models.

Explainability

In addition, using deep learning is not without its tradeoffs. One significant weakness of deep learning models is that they are "black boxes". Unlike linear regression, there is no intuitive interpretation of their coefficients. With great power comes great opacity. 🕷

Fortunately, this is an active branch of AI research. We will utilize a simple technique called a "global surrogate". The idea is to train an interpretable model (in our case, linear regression) to predict the predictions of the deep learning model. To be clear, we are not trying to predict the market, only the output of the deep learning model.

The main advantage of the surrogate model is that its regression coefficients are interpretable. Weights (i.e., betas) range from -2.5% to +2.5%. We spot checked a few standard quant factors to ensure they lined up with intuition. Value, momentum, reversal, quality and size work as expected. Phew!

Exhibit 24
Deep Learning Surrogate Coefficients
Source: Sparkline

One side benefit of this approach is that we can evaluate the R-squared, or the percent of variance explained by the linear regression. If the deep learning model were completely linear -- which might happen if the underlying features were truly linear -- the surrogate would capture 100% of its variance. The less variance explained, the more nonlinearities and interactions the deep learning model is picking up.

Exhibit 25
Financial Factors Are Mostly Linear
Source: Sparkline
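Here is a minimal sketch of the global-surrogate procedure described above: fit a linear regression to the deep model's own predictions, read off the coefficients, and use R-squared to measure how much of the model's behavior is linear. The variable names, random placeholder data, and scikit-learn implementation are assumptions, not Sparkline's actual code.

# Global surrogate: explain a deep model with a linear regression fit to its outputs
# (assumes: pip install numpy scikit-learn torch, plus a trained PyTorch `deep_model`).
import numpy as np
import torch
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# X: rows are stocks, columns are the 1,000 alpha signals (placeholder data here).
X = np.random.randn(5000, 1000).astype(np.float32)

# Predictions of the trained deep model -- the surrogate's target is the model,
# not the market.
deep_model.eval()
with torch.no_grad():
    y_deep = deep_model(torch.from_numpy(X)).numpy().ravel()

surrogate = LinearRegression().fit(X, y_deep)

# Interpretable coefficients: one beta per alpha signal.
betas = surrogate.coef_

# Variance of the deep model's output explained by the linear surrogate.
r2 = r2_score(y_deep, surrogate.predict(X))
print(f"Linear surrogate explains {r2:.0%} of the deep model's variance")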
Exhibit 25 shows the variance explained for each of our models. The 1-layer model is linear regression, so the surrogate explains 100% of the variance. As we add layers, the model begins finding interesting nonlinearities and interactions in the data. The R-squared falls gradually to 62% as we increase model complexity up to 5 layers.

We found the 5-layer model overfits, so let's focus instead on the optimal 3-layer model. The linear surrogate captures 70% of the deep learning model's variance, while 30% can be explained only by nonlinearities and interactions.

This 70/30 split is quite interesting. It implies that our data are mostly linear. While complex models can add value, the gains are limited. Furthermore, there are significant drawbacks to utilizing deep learning. These include opacity, complexity and cost. There are plenty of machine learning algorithms occupying the "medium" complexity region between linear regression and deep learning that might be worth considering first.

Conclusion

Deep learning is extremely powerful but requires very large datasets to be effective. Traditional structured financial data is too small and linear to truly benefit from deep learning. While "shallow deep learning" can be useful, researchers may be better served to first consider simpler techniques.

On the other hand, deep learning is highly effective on unstructured data. Transfer learning provides the key to unlocking its potential in niche domains such as investing. Transfer learning enables us to leverage the creations of large technology companies without having to gather the data or train the models ourselves.

Unstructured data is a critical input to the investment process. However, its unmitigated growth presents a significant challenge for the industry. Fortunately, the advances in natural language processing presented here can greatly improve how we consume this data. Given that these innovations are less than a few years old, we believe there is opportunity for entrepreneurial individuals and firms to profit from the impending transformation.

Kai Wu
Founder & CIO, Sparkline Capital LP

Kai Wu is the founder and Chief Investment Officer of Sparkline Capital, an investment management firm applying state-of-the-art machine learning and computing to uncover alpha in large, unstructured data sets.

Prior to Sparkline, Kai co-founded and co-managed Kaleidoscope Capital, a quantitative hedge fund in Boston. With one other partner, he grew Kaleidoscope to $350 million in assets from institutional investors. Kai jointly managed all aspects of the company, including technology, investments, operations, trading, investor relations, and recruiting.

Previously, Kai worked at GMO, where he was a member of Jeremy Grantham's $40 billion asset allocation team. He also worked closely with the firm's equity and macro investment teams in Boston, San Francisco, London, and Sydney.

Kai graduated from Harvard College Magna Cum Laude and Phi Beta Kappa.

Disclaimer

This paper is solely for informational purposes and is not an offer or solicitation for the purchase or sale of any security, nor is it to be construed as legal or tax advice. References to securities and strategies are for illustrative purposes only and do not constitute buy or sell recommendations. The information in this report should not be used as the basis for any investment decisions.

We make no representation or warranty as to the accuracy or completeness of the information contained in this report, including third-party data sources. The views expressed are as of the publication date and subject to change at any time.

Hypothetical performance has many significant limitations and no representation is being made that such performance is achievable in the future. Past performance is no guarantee of future performance.