This is a complete guide to building your own algorithmic trading bot that trades on quarterly earnings reports (10-Q) filed with the SEC by publicly traded US companies. We will cover everything from downloading historical 10-Q filings and cleaning the text to building the machine learning model. The training set for the ML model uses the text of these historical filings as the features and the next-day price action after each filing as the label.
We then download and clean the 10-Q filings from each day and use our trained model to make predictions on each filing. Depending on the prediction, we will automatically execute a commission-free trade on that company's ticker symbol.
0. Our Project code:
1. Basic Dependencies: Python 3.4, Pandas, BeautifulSoup, yfinance, fuzzywuzzy, nltk
Python is always my go-to language for projects like these, for the same reasons many others choose it: fast development, readable syntax, and a wealth of good-quality libraries for a huge range of tasks.
Pandas is a classic for any data science project to store, manipulate, and analyze your data in dataframe tables.
yfinance is a python library that is used to retrieve stock prices from Yahoo Finance.
fuzzywuzzy provides fuzzy text similarity results for creating our 10-Q diffs.
nltk allows us to split the text of the 10-Q reports into sentences.
2. Alpaca Commission-Free Trading API (https://alpaca.markets/)
I researched and tried several solutions for equity brokers that claim to offer APIs to retail traders. Alpaca was far and away the easiest to use with the clearest documentation. (I have no affiliation with them) Other brokers that I tested were:
Interactive Brokers: Searching around the web, these guys seemed to have a reputation of being the "gold standard" in the space. However, upon testing their software it was a real eye-opener to the sad state of most retail trading APIs. It turned out they did not have an actual API as you'd expect, but instead a shockingly ancient desktop application for you to install and then libraries to automate the control of that desktop app. Their documentation was messy and convoluted. After a few attempts at running their example code and attempting some test trades, I could tell that using IB as a stable part of an algo trading bot would be possible, but a significant project in and of itself.
Think-or-swim by TD Ameritrade: It was clear that ToS's API was much newer and more sensible to use than Interactive Brokers'. However, it was also clear that it was not a mature solution. Although I did get it working, even the initial authentication process was strange and required undocumented information found on various forums. The trade execution APIs themselves appeared straightforward to use, but the written documentation on them is extremely sparse.
3. Google Cloud AutoML Natural Language
Google naturally has vast experience and investments in ML natural language processing due to the nature of their business as a search engine. After trialing several other ML techniques and solutions, Google's commercial solution produced the best accuracy of the model while providing a solution that was easy enough to use that this project would not get caught in an academic exercise of endless manual tuning and testing of various ML algorithms.
Other ML libraries trialed: Initially I tried the following ML libraries along with creating a bag of bigrams from the filings text to use as the feature set: h2o.ai, Keras, auto-sklearn, and AWS Sagemaker. A big challenge with this technique is that vectorizing the text from a bag of bigrams creates a huge number of features for each data point of the training set. There are various techniques available to deal with this, but predictive qualities may or may not be lost to varying degrees.
4. Python-edgar: A neat little python library for bulk downloading lists of historical SEC filings (https://github.com/edouardswiac/python-edgar/)
Step 1. Download List of Historical 10-Q SEC Filings
For this we will download and install python-edgar from https://github.com/edouardswiac/python-edgar/
Change directory to the folder with run.py and download historical filings with the following command:
> python run.py -y 2010
I chose to download nearly 10 years' worth (since 2010) to build our ML model with. You may download all the way back to 1993, or download less with a different year argument.
Once this is finished we can compile the results into a single master file:
> cat *.tsv > master.tsv
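To give a feel for what the index contains, here is a minimal sketch of pulling only the 10-Q rows out of it. It assumes each line is pipe-delimited with the columns CIK, company name, form type, filing date, text path, and html path, which is how I understand python-edgar's output. The helper name iter_10q_rows and the sample rows are made up for illustration.

```python
import csv
import io


def iter_10q_rows(lines):
    """Yield (cik, company, filing_date, html_path) for each 10-Q row.

    Assumes pipe-delimited rows with six columns:
    CIK | company | form type | filing date | txt path | html path
    """
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 6:
            continue  # skip blank or malformed lines
        cik, company, form_type, filing_date, _txt_path, html_path = row[:6]
        if form_type.strip() == "10-Q":
            yield cik, company, filing_date, html_path


# Made-up sample rows standing in for the contents of master.tsv.
sample = io.StringIO(
    "1000045|EXAMPLE CORP|10-Q|2019-02-14|edgar/data/1000045/a.txt|edgar/data/1000045/a-index.html\n"
    "1000045|EXAMPLE CORP|8-K|2019-03-01|edgar/data/1000045/b.txt|edgar/data/1000045/b-index.html\n"
)
rows = list(iter_10q_rows(sample))
```

For the real index you would pass an open file handle for master.tsv instead of the in-memory sample.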
Now use quarterly-earnings-machine-learning-algo/download_raw_html.py to download all the raw html 10-Q filings that are listed in the index we just created:
> python download_raw_html.py path/to/master.tsv
This will take a significant amount of time to run, as it will download many GB of data from the SEC.gov website. Each filing averages several dozen MB of data. When it finishes we will have a folder "./filings" that contains the raw html of all of the filings.
The format of the filenames is: <CIK number>_<filing type>_<filing date>_<acceptance date+time>.html
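Since the later steps rely on this naming scheme, here is a small sketch of splitting such a filename back into its fields. It assumes none of the fields contain an underscore. The example filename and the FilingName and parse_filing_name names are hypothetical, and the date fields are kept as plain strings because the exact date format is not specified here. It also accepts an optional fifth field, since the cleaned files append a ticker symbol.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class FilingName(NamedTuple):
    cik: str
    filing_type: str
    filing_date: str
    accepted: str
    ticker: Optional[str]


def parse_filing_name(filename):
    """Split '<CIK>_<type>_<filing date>_<accepted>[_<ticker>].ext' into fields."""
    parts = Path(filename).stem.split("_")
    if len(parts) == 4:
        return FilingName(*parts, None)
    if len(parts) == 5:
        return FilingName(*parts)
    raise ValueError("unexpected filename format: " + filename)


# Hypothetical filename, for illustration only.
info = parse_filing_name("1000045_10-Q_2019-02-14_20190214160512.html")
```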
Step 2. Clean The Filings
We will run the following command from the quarterly-earnings-machine-learning-algo project to clean the html from each filing into prepared text:
> python filing_cleaner.py
This will use the html files from the "filings" directory created in the previous step and output cleaned text files into a "cleaned_filings" directory.
This does the following cleaning to prepare the text for natural language processing:
Removes mostly numerical tables, usually containing quantitative financial data
Removes a limited number of stopwords (Specific dates, numbers, etc)
This will also look up the ticker symbol for trading based on the CIK number that the SEC uses to identify companies.
The format of the filenames is: <CIK number>_<filing type>_<filing date>_<acceptance date+time>_<ticker symbol>.txt
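To make the "mostly numerical tables" rule above concrete, here is one possible heuristic. It is a sketch only, not necessarily what filing_cleaner.py does. It assumes the text of each table has already been extracted (for example with BeautifulSoup) and flags a table when more than half of its tokens are numbers. The 0.5 threshold and the function names are assumptions.

```python
import re

# A token is either a run of letters or a number such as 1,200 or 3.5
TOKEN_RE = re.compile(r"[A-Za-z]+|\d[\d,.]*")


def numeric_fraction(text):
    """Return the fraction of tokens in text that are numbers."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    numeric = sum(1 for token in tokens if token[0].isdigit())
    return numeric / len(tokens)


def is_mostly_numeric(table_text, threshold=0.5):
    """True when more than `threshold` of the tokens are numbers."""
    return numeric_fraction(table_text) > threshold


financial_table = "Revenue 1,200 1,100 Net income 300 250"
prose = "The company adopted a new accounting standard in 2019"
```

With these two samples, the financial table is flagged for removal and the prose sentence is kept.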
Step 3. Download Financial Data
In this step, we download the company's stock market open prices for the day after each SEC filing and then the open price for 2 days after the filing. We use this percent price change as the target label that we would like to predict for our machine learning data points. Initially, I created the algorithm to trade on market open price vs same-day market close price. However, using that in live trading would require a day-trading account, for which the $25,000 account minimum may be out of reach for some readers.
Additionally, we restrict the SEC filings that we are going to use to ones that are filed just after market close. Releasing quarterly earnings after market hours is a practice that is generally held by most companies. We have plenty of datapoints to use so we will restrict it to these since using price action for quarterly earnings released during market hours would produce heterogeneity in our data samples.
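As a rough illustration of this filter, the sketch below keeps a filing only if its acceptance timestamp falls on a weekday at or after the 4:00 PM close. It assumes the timestamp is already in US Eastern time, and it ignores market holidays and early closes. The function name is hypothetical.

```python
from datetime import datetime, time

MARKET_CLOSE = time(16, 0)  # regular US market close, Eastern time


def filed_after_close(accepted, market_close=MARKET_CLOSE):
    """True if the filing was accepted on a weekday at or after the close.

    Assumes `accepted` is a naive datetime already expressed in Eastern time.
    """
    is_weekday = accepted.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and accepted.time() >= market_close


after_hours = filed_after_close(datetime(2019, 2, 14, 16, 5))
during_hours = filed_after_close(datetime(2019, 2, 14, 10, 30))
```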
Use pip to install yfinance if you do not already have it before running this command.
> python add_financial.py
This reads filenames with the ticker symbols from the "cleaned_filings" directory created in the previous step and outputs a financials.pkl which is a Pandas dataframe containing all these next-day price changes for each filing.
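The label calculation itself is simple. Here is a minimal sketch, assuming we already have a mapping of trading date to open price for the ticker (in the real script these prices would come from yfinance). The function names and the sample prices are made up. "Day after" here means the next trading day present in the data, so weekends and holidays are skipped automatically.

```python
from datetime import date


def percent_change(open_next_day, open_two_days_later):
    """Percent change between two open prices."""
    if open_next_day <= 0:
        raise ValueError("open price must be positive")
    return (open_two_days_later - open_next_day) / open_next_day * 100.0


def label_for_filing(filing_date, opens_by_date):
    """Percent change from the first to the second trading-day open after filing.

    Returns None when fewer than two later trading days are available.
    """
    later_days = sorted(d for d in opens_by_date if d > filing_date)
    if len(later_days) < 2:
        return None
    first, second = later_days[0], later_days[1]
    return percent_change(opens_by_date[first], opens_by_date[second])


# Made-up open prices around a Thursday filing.
opens = {
    date(2019, 2, 14): 52.0,
    date(2019, 2, 15): 50.0,
    date(2019, 2, 19): 55.0,
}
label = label_for_filing(date(2019, 2, 14), opens)
```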
Step 4. Produce Text Deltas of Each Quarterly Earnings From the Company's Last 10-Q Filing
In this step, we are going to take each cleaned quarterly earnings report and take sentence-by-sentence fuzzy diffs against the company's last 10-Q filing to remove text that also appeared in that filing. This is an important step that strips away a huge amount of excess text and creates a clean report of what the company has added since their last quarterly earnings report. This creates a beautifully clean signal to build our machine learning model on, because only the information that the company has deemed important enough to add with their latest filing will be a part of our training data.
Remember to use pip to install nltk and fuzzywuzzy dependencies before running.
> python diff_cleaned_filings.py
This command will take the cleaned text files from the "cleaned_filings" directory and output the text delta for each cleaned text file in the "whole_file_diffs" directory.
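The idea behind the diff can be sketched with the standard library alone. Note that this is a stand-in, not the project's code: it uses difflib.SequenceMatcher in place of fuzzywuzzy and a simple regex in place of nltk's sentence splitter, and the similarity threshold of 85 is an assumption. A sentence is kept only if nothing in the previous filing is similar enough to it.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive sentence splitter (stand-in for nltk's tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Similarity score from 0 to 100 (stand-in for a fuzzywuzzy ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio() * 100


def new_sentences(current_text, previous_text, threshold=85):
    """Return sentences of current_text with no close match in previous_text."""
    previous = split_sentences(previous_text)
    kept = []
    for sentence in split_sentences(current_text):
        best = max((similarity(sentence, p) for p in previous), default=0)
        if best < threshold:
            kept.append(sentence)
    return kept


previous_filing = (
    "We sell widgets in North America. Our fiscal year ends on December 31."
)
current_filing = (
    "We sell widgets in North America. Our fiscal year ends on December 31st. "
    "A major customer terminated its contract during the quarter."
)
delta = new_sentences(current_filing, previous_filing)
```

Here only the sentence about the terminated contract survives, which is exactly the kind of newly added language we want the model to see.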
Step 5. Prepare a CSV of our training data to upload to Cloud AutoML Natural Language
We will now take our cleaned 10-Q diffs (training features) and compile them into a CSV with their next-day price changes (training labels). We will create discrete training labels by splitting the price changes into 5 buckets at each 20th percentile. So our 0 bucket will have the bottom 20% (most significant price drop), and our 4 bucket will have the top 20% (most significant price increase).
> python cloudml_prepare_local_csv.py
This will output a file in the current directory called "training_data.csv" which is ready to be uploaded to Google.
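The bucketing can be sketched as follows. This version ranks the price changes and assigns each fifth of the ranking to a bucket from 0 to 4, which approximates cutting at the 20th percentiles (ties may land differently). The two-column layout of text followed by label is my assumption about the expected CSV format, and the function names are hypothetical.

```python
import csv
import io


def quintile_labels(changes):
    """Assign each price change a bucket 0..4 by its rank among all changes."""
    n = len(changes)
    order = sorted(range(n), key=lambda i: changes[i])
    labels = [0] * n
    for rank, index in enumerate(order):
        labels[index] = min(rank * 5 // n, 4)
    return labels


def write_training_csv(texts, changes, out):
    """Write one row per filing: diff text, then bucket label."""
    writer = csv.writer(out)
    for text, label in zip(texts, quintile_labels(changes)):
        writer.writerow([text, label])


# Made-up percent price changes for ten filings.
changes = [-8.0, 3.5, 0.2, -1.0, 9.1, 1.4, -3.3, 5.0, -0.4, 2.2]
labels = quintile_labels(changes)

buffer = io.StringIO()
write_training_csv(["diff text %d" % i for i in range(10)], changes, buffer)
```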
Step 6. Upload, Train, and Evaluate our ML model on Cloud AutoML Natural Language
If you do not already have a Google Cloud Platform account, you may sign up here: https://cloud.google.com/free/
Once you have an account, we can access the Cloud AutoML Natural Language console here: https://console.cloud.google.com/natural-language/dashboard
Here we will click on "AutoML Sentiment Analysis". Although we are not analyzing the sentiment of the text, per se, we will model this as a sentiment analysis problem using the stock price reaction as the measure of the "sentiment" of the quarterly earnings report.
Click on "new dataset", select sentiment analysis, and set the maximum sentiment score to 4 since we have 5 percent-price-change buckets that we created in our last step.
We will now import the CSV file we created on the last step, then click import. When the importing is finished you will get an email. Then you can go back to the dashboard and click train. If you have any issues you may refer to Google's documentation here: https://cloud.google.com/natural-language/automl/sentiment/docs/
The model will take several hours to train and we will get an email when it's complete. We can then analyze the results:
Here we see the resulting confusion matrix for our model. If you're not sure how to interpret this, you may google "confusion matrix". We can see that we achieved a nearly 30% precision and recall overall where random chance should be 20% since we have 5 buckets. That's 50% better than random chance!
If we have a look at our results for sentiment score 0 (remember this is the most negative 20th percentile of price changes) we will see we get the best precision. When our model predicts that a quarterly earnings report will produce the most dramatic price drop, it will be correct 35.63% of the time. That's over 75% better than random chance!
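The "better than random chance" figures above are relative improvements over the 20% baseline you would expect from guessing among 5 equally sized buckets. A tiny sketch of the arithmetic, with a hypothetical function name:

```python
def lift_over_chance(precision, n_classes):
    """Relative improvement (in percent) over uniform random guessing."""
    baseline = 1.0 / n_classes
    return (precision / baseline - 1.0) * 100.0


overall_lift = lift_over_chance(0.30, 5)
bucket_zero_lift = lift_over_chance(0.3563, 5)
```

Up to floating-point rounding, this gives about 50 for the overall model and about 78 for bucket 0, matching the figures quoted above.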
It makes sense that we do best with this bucket. It tells us that when a quarterly earnings report contains certain negative indicator language, it will more predictably produce a dramatic stock price drop. It goes with the old adage about "sell the news".
Step 7. Download Today's 10-Q filings, make online predictions, and start trading!
Sign up for an Alpaca broker account to start trading: https://alpaca.markets/
They allow you to quickly and easily trade on a paper money account without entering any bank details which is very nice. We can use this paper money account to get started.
Once you're signed up you will need to retrieve your API keys from the dashboard:
We will also need to retrieve our model name from the Google console. Click on your trained machine learning model on the same dashboard as the previous step.
Go to the "Test & Use" tab. Click on the "Python" tab under "Use your custom model". Note down the model name circled in black above.
We can now run our command with the following command line arguments:
> python MakeTrades.py <Alpaca API Key ID> <Alpaca Secret Key> <Google Model Name>
This command will:
1. Download the latest market day's 10-Q filings from the SEC website at https://www.sec.gov/cgi-bin/current?q1=0&q2=1&q3=
This should only be run late on a market day since this is when all the filings will be available for that day. If you attempt to run it earlier it will give you yesterday's filings.
2. Clean each 10-Q filing and diff it with the company's last 10-Q filing, as we did in our training preparation. If the company did not have a 10-Q filed in the past 3 months or so it will skip it.
3. Submit the text delta to do an online prediction with our ML model.
4. If our model returns a prediction of 0 (it is predicting the most dramatic price drop category) then it will use the Alpaca API to put in a short order for that stock that will execute on the following day's market open.
You should remember to close the short positions after they have been held for a day. You can write a script for this if you would like. You can also schedule this command with a cron job to be run at the end of each market day for complete automation.
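The trading decision in the steps above can be sketched as follows. This is not the code in MakeTrades.py. It only builds the order details and sends nothing. The field names follow Alpaca's order endpoint as I understand it, and using "opg" as the time in force to execute at the next market open is an assumption worth checking against their documentation. The position size of 1 share, the tickers, and the function names are all hypothetical.

```python
SHORT_BUCKET = 0  # the "most dramatic price drop" prediction


def build_short_order(symbol, qty):
    """Order details for a market sell at the open (a short when no shares are held)."""
    return {
        "symbol": symbol,
        "qty": qty,
        "side": "sell",
        "type": "market",
        "time_in_force": "opg",  # assumed: execute at the next market open
    }


def orders_for_predictions(predictions, qty=1):
    """Build a short order for every ticker predicted to fall into bucket 0."""
    return [
        build_short_order(symbol, qty)
        for symbol, bucket in predictions.items()
        if bucket == SHORT_BUCKET
    ]


# Made-up tickers and model predictions for one day's filings.
todays_predictions = {"AAA": 0, "BBB": 3, "CCC": 0}
orders = orders_for_predictions(todays_predictions)
```

Each dictionary could then be submitted through Alpaca's API or client library. Keeping order construction separate from submission makes it easy to review what would be traded on the paper account before anything is sent.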
Step 8. Flip It To Live And Let the Money Roll in
Hopefully this guide was valuable and can be used and expanded on to be profitable for live trading. If anything was unclear or you need any help please reach out to me.