In 2017, Facebook released an open source project called Prophet. From Facebook:
“Prophet is a procedure for forecasting time series data. It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers.”
I was intrigued, especially after reading further. This article by Chris Moffitt of pbpython.com was pretty mind-blowing in terms of how useful the tool could be for SEO. We decided to see if we could put together a simple workflow using Python notebooks, Google Analytics data, and Prophet to forecast traffic expectations and uncertainty from prior Google Analytics traffic data.
If you are familiar with Jupyter notebooks and Python, and want to jump right to start testing out the notebook, you can find our heavily commented notebook here on Google Colaboratory.
Prophet is pitched by Facebook as a forecasting tool that is robust and reliable, yet approachable to non-data scientists. At a basic level, Prophet accepts any time-series data with a date and a metric value, and Google Analytics is perfect for spitting out massive amounts of this. The article by Chris (above), as well as one from towardsdatascience.com linked in the conclusion, cover many of the details of Prophet and are recommended reading.
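To make the input format concrete, here is a minimal sketch of reshaping a Google Analytics-style export into what Prophet requires. The `date`/`sessions` column names and the values are hypothetical; the `ds` and `y` column names are what Prophet actually expects:

```python
import pandas as pd

# Hypothetical Google Analytics export: a date column and a metric column
ga = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-01-03"],
    "sessions": [1200, 1350, 1180],
})

# Prophet expects exactly two columns: 'ds' (the datestamp) and 'y' (the metric)
df = ga.rename(columns={"date": "ds", "sessions": "y"})
df["ds"] = pd.to_datetime(df["ds"])
```

Any daily metric you can pull out of Google Analytics (sessions, organic sessions for a segment, goal completions) can be dropped into this shape.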
It also accepts what are called holidays. Holidays are any historical or anticipated seasonal variations that can help explain ebbs and flows in data. In our case of predicting web traffic, we decided to look at:
- Bank Holidays from 2012 through 2020 (file)
- Algorithm updates (as reported by Moz, through 2017) (file)
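Prophet takes holidays as a DataFrame with a `holiday` label column and a `ds` date column, plus optional `lower_window`/`upper_window` columns to spread an event's effect across surrounding days. A minimal sketch (the dates below are illustrative only; the real lists come from the linked files):

```python
import pandas as pd

# Illustrative bank holiday dates
bank = pd.DataFrame({
    "holiday": "bank_holiday",
    "ds": pd.to_datetime(["2017-12-25", "2018-01-01"]),
})

# Illustrative algorithm update dates; an update's effect tends to
# roll out over days, so a window after each date is attached
algo = pd.DataFrame({
    "holiday": "algo_update",
    "ds": pd.to_datetime(["2017-03-08", "2017-12-12"]),
    "lower_window": 0,
    "upper_window": 7,
})

holidays = pd.concat([bank, algo], ignore_index=True)
```

The combined `holidays` frame is then passed to Prophet's constructor so the model can learn how much those dates historically moved traffic.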
Other than that, other key values would be the prediction time frame and also something called capacity. The prediction time frame is self-explanatory, but capacity is intriguing because it is often overlooked in SEO. Year-over-year growth is always something that clients want in return for their SEO dollars. But the reality is that, based on a website's niche and market area, there is a ceiling to which its traffic can grow without pushing into new markets or new niches. The clearest example: bobstireshop.com, over the course of 10 years, is probably not going to reach the organic traffic volume of amazon.com no matter how much "SEO" they put into it. There is a limit to the number of available searchers in their niche and/or market. In statistics, this is called the carrying capacity. Think of it as the threshold that the best-in-class website in your market/niche has reached.
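In Prophet, carrying capacity is supplied as a `cap` column on the input data and used with logistic (rather than linear) growth. A minimal sketch, where the 50,000-sessions ceiling is a made-up figure you would replace with an estimate for your own niche:

```python
import pandas as pd

# Hypothetical ds/y data, as prepared for Prophet
df = pd.DataFrame({
    "ds": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "y": [1200.0, 1350.0, 1180.0],
})

# Carrying capacity goes in a 'cap' column; 50,000 daily sessions is a
# made-up ceiling -- estimate yours from the best-in-class site in your niche
df["cap"] = 50000

# With fbprophet installed, a logistic-growth model would then be fit with:
#   from fbprophet import Prophet
#   m = Prophet(growth="logistic")
#   m.fit(df)
```

The `cap` can even vary by row if you expect the ceiling itself to change over time.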
I am not one to code from scratch what others have already done well, so my first step was to search GitHub for libraries that would handle the bulk of the code work. Luckily, we found code from Stijn Debrouwere that intuitively handled many of the Google Analytics import functions we needed. We had to fork his repo to overcome some issues importing into Colaboratory, and the fork also gives us a repo to store some of the needed files. This code handles all of the authentication and querying to the Google Analytics API.
In our notebook, we made use of the easy interface of Colaboratory to make pulling various segments and metrics of time-series data easy (see above).
Colaboratory is a recent release by Google which is essentially Jupyter notebooks that live in Google Drive and run off of Google’s computing power. Like Jupyter notebooks, they allow for inline Python code and inline markup to make very robust and intuitive documents that are easy to follow, execute, and share.
We chose Colaboratory to release this code because it is easy for users to copy to their own account, and the written comments make the code execution easy to follow (as long as you have a bit of patience). Once you have run it a few times, it goes very quickly.
Using the shared notebook
In our shared notebook, we took the time to label what each section does. There is also a Table of Contents that will give you the lay of the land of the notebook.
In the main body of the notebook, there are markup cells that just have written content explaining things about the notebook or cells. They look like this:
There are also code cells that can be run to execute code in the Python interpreter in Google’s cloud. The code cells can be run by clicking the run icon indicated below, or by highlighting the cell with a mouse click and pressing (CTRL+Enter).
Generally, for this notebook, you want to start at the top and work your way down to the bottom, where the final output will be after all cells have executed successfully. One big hangup that many people may have is generating the API credentials (client_id and client_secret) that are needed in the Google Analytics code. We have provided a brief video below that shows how to do this.
If this makes you uncomfortable, you can also reach out to one of your really geeky friends to see if they can help you out. It is really easy if you have done it a few times, but confusing if you haven't.
```python
client_id = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.apps.googleusercontent.com"
client_secret = "XXXXXXXXXXXXXXXXXXXXXXX"
```
The notebook's final output includes these plots:

- Historical data with future predictions and uncertainty interval
- Historical trend with future predictions and uncertainty interval
- Yearly and weekly seasonality
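These plots come from Prophet's built-in plotting helpers. As a sketch of that final step (the one-year horizon and the end date are assumptions, and the Prophet calls are shown as comments because they require a fitted model `m`):

```python
import pandas as pd

# Prophet's make_future_dataframe extends the history forward;
# conceptually it is just a daily date range like this one
history_end = pd.Timestamp("2018-03-01")  # hypothetical last date in the GA data
future_dates = pd.date_range(start=history_end + pd.Timedelta(days=1),
                             periods=365, freq="D")

# With a fitted Prophet model `m`, the plots would come from:
#   future = m.make_future_dataframe(periods=365)
#   forecast = m.predict(future)
#   m.plot(forecast)             # history + forecast with uncertainty interval
#   m.plot_components(forecast)  # trend, yearly, and weekly seasonality
```

The `forecast` frame also contains `yhat`, `yhat_lower`, and `yhat_upper` columns if you would rather export the numbers than look at the plots.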
In addition to the article shared in the opening, there is an excellent resource on Prophet here. I would recommend reading both, and reading through all the cells in the shared notebook, prior to digging in for the first time. There is some content about linear vs. logistic modeling that is important to understand if you want to extend the code for other purposes. Generally, this notebook gives you a framework for getting data from Google Analytics and processing it with Prophet to make forecasts. There are probably some cool ways it can be extended (or improved) that I hope you will tell me about. Also, before you begin (if you are not experienced with APIs or Python), give yourself several hours to work through it and expect some issues to pop up that you may need to head to Google for. We have tested and run many forecasts through the model with great success. Your mileage may vary.