Overview

Part of the General Assembly Data Science Bootcamp Series

GitHub link to the full Capstone Project

What is MLOps?

After recently finishing the Capstone Project, one of the “What’s Next” actions I identified was to:

  • automate the training, building, and deployment of the model
  • create an API that exposes this functionality

After looking into this a little bit more, I realized that there is a term for this activity - MLOps - the crossover of Machine Learning and DevOps. This relatively new term is defined by Towards Data Science as:

It is an engineering discipline that aims to unify ML systems development (dev) and ML systems deployment (ops) to standardize and streamline the continuous delivery of high-performing models in production.

There is more to MLOps than just that; listed below are the other operations collectively grouped under the MLOps umbrella:

  • data sourcing
  • data preparation
  • feature engineering
  • feature selection
  • model training
  • model selection
  • model building
  • maintaining data pipelines
  • deploying the model to production
  • monitoring, governing, optimizing, and maintaining models

As you can see, model building is merely a tiny part of that process. There are many MLOps tools available to facilitate all these activities; however, one tool popular with most developers is curiously missing from that list - GitHub Actions!

OK, Let’s use GitHub Actions, then!

There aren’t many resources on using GitHub Actions as an ML workflow tool, but having used Actions in previous projects, I knew that with a little work it could serve well in place of more popular tools such as Azure Machine Learning or Apache Airflow.

In the GitHub workflow above, these are the steps we went through to end up with our model deployed to an API in production:

Checkout latest code from source control

The model and ML workflow presented here are available in this GitHub repo, free to reuse. Currently the workflow is triggered manually, but it can just as easily run on a cron schedule, or as part of your CI workflow through a Pull Request or a repo update.
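As a minimal sketch (the workflow file name and trigger choices here are illustrative), the triggers and the checkout step look something like this:

```yaml
# .github/workflows/ml-pipeline.yml (illustrative file name)
name: ML Pipeline

on:
  workflow_dispatch:        # run manually from the Actions tab
  # schedule:               # or run on a cron schedule
  #   - cron: '0 6 * * 1'   # e.g. every Monday at 06:00 UTC
  # pull_request:           # or run as part of CI on a Pull Request

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout latest code from source control
        uses: actions/checkout@v3
```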

Set up Python v3.8

You’ll also notice that this workflow runs on ubuntu-latest. This is what’s great about using GitHub Actions: you get a selection of GitHub-hosted runners, and they are all free*. All we need to do is select our preferred runner and install all the dependencies that the workflow needs.

First, we install our preferred Python version and its dependencies defined in requirements.txt (a sketch of these steps follows the list below):

  • Set up Node.js environment
  • Upgrade Python and dependencies
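Assuming the dependencies live in a requirements.txt at the repo root (Node.js is included here because the Serverless CLI will need it later), these steps might look like:

```yaml
      - name: Set up Node.js environment
        uses: actions/setup-node@v3
        with:
          node-version: '16'

      - name: Set up Python v3.8
        uses: actions/setup-python@v4
        with:
          python-version: '3.8'

      - name: Upgrade Python and dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
```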

Install Chrome browser

Chrome is also required since part of our feature engineering involves scraping Wikipedia for some of the weather information. At some stage, we need Selenium to click on links and further scrape those pages.
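To illustrate the Selenium side (the URL and link text here are hypothetical, not the project’s actual scraping targets), driving headless Chrome looks roughly like this:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Chrome must run headless on a CI runner, since there is no display
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)
try:
    # Hypothetical page - the real workflow scrapes race-specific Wikipedia pages
    driver.get("https://en.wikipedia.org/wiki/2020_Formula_One_World_Championship")
    # Click through to a linked page and scrape it further, as described above
    link = driver.find_element(By.PARTIAL_LINK_TEXT, "Grand Prix")
    link.click()
    print(driver.title)
finally:
    driver.quit()
```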

Run data sourcing scripts

Our data comes from the Ergast Motor Racing API, so we read from this API and dump the data into MongoDB. You’ll notice that even though there are secrets involved, we are not sharing these in the Actions YAML file.
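Instead, the secrets are stored in the repository settings and referenced in the workflow. A sketch, with a hypothetical secret name and script path:

```yaml
      - name: Run data sourcing scripts
        env:
          # Stored under the repo's Settings → Secrets, never committed to source control
          MONGODB_URI: ${{ secrets.MONGODB_URI }}
        run: python scripts/source_data.py   # hypothetical script path
```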

Run data preparation scripts

This is an intermediate step that classifies weather information into 6 weather types - getting it ready for the model building process.
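As a minimal sketch of what that classification might look like (the keyword-to-category mapping here is illustrative, not the project’s actual rules):

```python
def classify_weather(description: str) -> str:
    """Map a scraped weather description to one of six weather types (illustrative)."""
    text = description.lower()
    if any(word in text for word in ("rain", "wet", "shower", "storm")):
        return "wet"
    if any(word in text for word in ("sun", "hot", "warm")):
        return "warm"
    if any(word in text for word in ("cold", "chilly", "snow")):
        return "cold"
    if "cloud" in text or "overcast" in text:
        return "cloudy"
    if "dry" in text or "clear" in text:
        return "dry"
    return "unknown"

print(classify_weather("Light rain showers during the race"))  # -> "wet"
```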

Perform Feature Engineering

The data we got from Ergast is not ready as-is, so we had to engineer a few features. This is easy to do from our workflow by calling some custom-made Python functions.
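For example (a hypothetical feature and column names, not necessarily the project’s), one such function might derive a driver’s age on race day:

```python
import pandas as pd

def add_driver_age(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical engineered feature: the driver's age on race day, in years."""
    df = df.copy()
    df["driver_age"] = (
        pd.to_datetime(df["race_date"]) - pd.to_datetime(df["driver_dob"])
    ).dt.days / 365.25
    return df
```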

Build ML model and score

In the previous article we identified the optimal ML model for this problem - a Bagging Regressor (using Decision Trees) - so in this step we just build the model and generate the Pickle file which our API will use to perform the on-demand predictions.
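A minimal sketch of this step, assuming the features X and target y come out of the previous steps (the hyperparameters and file name are illustrative):

```python
import pickle

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Assumes X (features) and y (target) were produced by the earlier steps
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging Regressor over Decision Trees, as identified in the previous article
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"R^2 on the test set: {model.score(X_test, y_test):.3f}")

# Serialize the trained model for the API to load at prediction time
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```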

Set up Serverless framework

Serverless makes it easy to set up our AWS Lambda as a public API. I wanted to use Pulumi - perhaps in another blog post. In the meantime, Serverless was the easiest and fastest way (for me) to push this to our production environment.
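The Serverless CLI ships as an npm package (which is why the workflow set up Node.js earlier), so this step is a one-liner:

```yaml
      - name: Set up Serverless framework
        run: npm install -g serverless
```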

Deploy ML Model and API to production

Last but not least - and certainly the most exciting part! The final step of the GitHub MLOps workflow is deploying our model (and the Serverless API) to production.
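In the workflow, this boils down to a serverless deploy step, with AWS credentials pulled from repository secrets (the secret names and stage are illustrative):

```yaml
      - name: Deploy ML Model and API to production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: serverless deploy --stage prod
```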

I opted to use Flask, since I already had scripts in Python, so it was minimal work to adapt them for the public API.
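As a rough sketch (the route and payload shape are illustrative), the Flask app loads the pickled model and exposes a predict endpoint:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model built earlier in the workflow
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])  # illustrative route
def predict():
    # Expect a JSON body like {"features": [[...], [...]]}
    features = request.get_json()["features"]
    predictions = model.predict(features)
    return jsonify(predictions=predictions.tolist())

if __name__ == "__main__":
    app.run()
```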

Once deployed, the API is available to the public. If time permits, I might create a web application that consumes this API, but as a public API, it is already ready for prime time!

Conclusion

In this article, we demonstrated the use of GitHub Actions, a general-purpose GitHub-based CI/CD tool that many developers love. Although it is not typically used for ML workflows, its general-purpose nature means that, with little effort, it can power simple ML pipelines.

There are some limitations (e.g. you wouldn’t want to use it for long-running ML tasks like exhaustive hyperparameter tuning), but in my opinion, it manages just fine for many ML scenarios.

What’s your ML workflow tool of choice?

What capabilities are missing in GitHub Actions as an ML workflow tool?

GitHub link to the full Capstone Project
