Overview

Part of the General Assembly Data Science Bootcamp Series

GitHub link to the full Capstone Project

What is MLOps?

After recently finishing the Capstone Project, one of the “What’s Next” actions I identified was to:

  • automate the training, building, and deployment of the model
  • create an API that exposes this functionality

After looking into this a little bit more, I realized that there is a term for this activity - MLOps - the crossover of Machine Learning and DevOps. This relatively new term is defined by Towards Data Science as:

It is an engineering discipline that aims to unify ML systems development (dev) and ML systems deployment (ops) to standardize and streamline the continuous delivery of high-performing models in production.

There is more to MLOps than just that; listed below are the other operations collectively grouped under the MLOps umbrella:

  • data sourcing
  • data preparation
  • feature engineering
  • feature selection
  • model training
  • model selection
  • model building
  • maintaining data pipelines
  • deploying the model to production
  • monitoring, governing, optimizing, and maintaining models

As you can see, model building is merely a tiny part of that process. There are many MLOps tools available to facilitate all these activities; however, one tool popular with most developers is curiously missing from that list - GitHub Actions!

OK, Let’s use GitHub Actions, then!

There aren’t many resources on using GitHub Actions as an ML workflow tool, but having used Actions in previous projects, I knew that with a little work it could serve well in place of more popular tools such as Azure Machine Learning or Apache Airflow.

In the GitHub workflow above, these are the steps we went through to end up with our model deployed to an API in production:

Checkout latest code from source control

The model and ML workflow presented here are available in this GitHub repo, free to reuse. Currently the workflow is triggered manually, but it can just as easily run on a cron schedule, or as part of your CI workflow through a Pull Request or a repo update.
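As a minimal sketch (the workflow file name and trigger choices here are illustrative), the triggers and the checkout step look something like this:

```yaml
# .github/workflows/ml-pipeline.yml (illustrative file name)
name: ML Pipeline

on:
  workflow_dispatch:        # run manually from the Actions tab
  # schedule:               # or run on a cron schedule
  #   - cron: '0 6 * * 1'   # e.g. every Monday at 06:00 UTC
  # pull_request:           # or run as part of CI on a Pull Request

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout latest code from source control
        uses: actions/checkout@v3
```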

Set up Python v3.8

You’ll also notice that this workflow runs on ubuntu-latest. This is what’s great about using GitHub Actions: you get a selection of GitHub-hosted runners, and they are all free*. All we need to do is select our preferred runner and install all the dependencies that the workflow needs.

First, we install our preferred Python version and its dependencies defined in requirements.txt (a sketch of these steps follows the list below):

  • Set up Node.js environment
  • Upgrade Python and dependencies
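Assuming the dependencies live in a requirements.txt at the repo root (Node.js is included here because the Serverless CLI will need it later), these steps might look like:

```yaml
      - name: Set up Node.js environment
        uses: actions/setup-node@v3
        with:
          node-version: '16'

      - name: Set up Python v3.8
        uses: actions/setup-python@v4
        with:
          python-version: '3.8'

      - name: Upgrade Python and dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
```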

Install Chrome browser

Chrome is also required since part of our feature engineering involves scraping Wikipedia for some of the weather information. At some stage, we need Selenium to click on links and further scrape those pages.
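To illustrate the Selenium side (the URL and link text here are hypothetical, not the project’s actual scraping targets), driving headless Chrome looks roughly like this:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Chrome must run headless on a CI runner, since there is no display
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)
try:
    # Hypothetical page - the real workflow scrapes race-specific Wikipedia pages
    driver.get("https://en.wikipedia.org/wiki/2020_Formula_One_World_Championship")
    # Click through to a linked page and scrape it further, as described above
    link = driver.find_element(By.PARTIAL_LINK_TEXT, "Grand Prix")
    link.click()
    print(driver.title)
finally:
    driver.quit()
```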

Run data sourcing scripts

Our data comes from the Ergast Motor Racing API, so we read from this API and dump the data into MongoDB. You’ll notice that even though there are secrets involved, we are not sharing these in the Actions YAML file.
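Instead, the secrets are stored in the repository settings and referenced in the workflow. A sketch, with a hypothetical secret name and script path:

```yaml
      - name: Run data sourcing scripts
        env:
          # Stored under the repo's Settings → Secrets, never committed to source control
          MONGODB_URI: ${{ secrets.MONGODB_URI }}
        run: python scripts/source_data.py   # hypothetical script path
```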

Run data preparation scripts

This is an intermediate step that classifies weather information into 6 weather types - getting it ready for the model building process.
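As a minimal sketch of what that classification might look like (the keyword-to-category mapping here is illustrative, not the project’s actual rules):

```python
def classify_weather(description: str) -> str:
    """Map a scraped weather description to one of six weather types (illustrative)."""
    text = description.lower()
    if any(word in text for word in ("rain", "wet", "shower", "storm")):
        return "wet"
    if any(word in text for word in ("sun", "hot", "warm")):
        return "warm"
    if any(word in text for word in ("cold", "chilly", "snow")):
        return "cold"
    if "cloud" in text or "overcast" in text:
        return "cloudy"
    if "dry" in text or "clear" in text:
        return "dry"
    return "unknown"

print(classify_weather("Light rain showers during the race"))  # -> "wet"
```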

Perform Feature Engineering

The data we got from Ergast is not ready as-is, so we had to engineer a few features. This is easy to do from our workflow by calling some custom-made Python functions.
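For example (a hypothetical feature and column names, not necessarily the project’s), one such function might derive a driver’s age on race day:

```python
import pandas as pd

def add_driver_age(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical engineered feature: the driver's age on race day, in years."""
    df = df.copy()
    df["driver_age"] = (
        pd.to_datetime(df["race_date"]) - pd.to_datetime(df["driver_dob"])
    ).dt.days / 365.25
    return df
```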

Build ML model and score

In the previous article we identified the optimal ML model for this problem - a Bagging Regressor (using Decision Trees) - so in this step we just build the model and generate the Pickle file which our API will use to perform the on-demand predictions.
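A minimal sketch of this step, assuming the features X and target y come out of the previous steps (the hyperparameters and file name are illustrative):

```python
import pickle

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Assumes X (features) and y (target) were produced by the earlier steps
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging Regressor over Decision Trees, as identified in the previous article
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"R^2 on the test set: {model.score(X_test, y_test):.3f}")

# Serialize the trained model for the API to load at prediction time
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```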

Set up Serverless framework

Serverless makes it easy to set up our AWS Lambda as a public API. I wanted to use Pulumi - perhaps in another blog post. In the meantime, Serverless was the easiest and fastest way (for me) to push this to our production environment.
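The Serverless CLI ships as an npm package (which is why the workflow set up Node.js earlier), so this step is a one-liner:

```yaml
      - name: Set up Serverless framework
        run: npm install -g serverless
```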

Deploy ML Model and API to production

Last but not least - and certainly the most exciting part! The final step of the GitHub MLOps workflow is deploying our model (and the Serverless API) to production.
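In the workflow, this boils down to a serverless deploy step, with AWS credentials pulled from repository secrets (the secret names and stage are illustrative):

```yaml
      - name: Deploy ML Model and API to production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: serverless deploy --stage prod
```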

I opted to use Flask, since I already had scripts in Python, so it was minimal work to adapt them for the public API.
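As a rough sketch (the route and payload shape are illustrative), the Flask app loads the pickled model and exposes a predict endpoint:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model built earlier in the workflow
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])  # illustrative route
def predict():
    # Expect a JSON body like {"features": [[...], [...]]}
    features = request.get_json()["features"]
    predictions = model.predict(features)
    return jsonify(predictions=predictions.tolist())

if __name__ == "__main__":
    app.run()
```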

Once deployed, the API is available to the public. If time permits, I might create a web application that consumes this API, but as a public API, it is already ready for prime time!

Conclusion

In this article, we demonstrated the use of GitHub Actions, a general-purpose GitHub-based CI/CD tool that many developers love. Although it is not typically used for ML workflows, its general-purpose nature means that, with little effort, it can power simple ML pipelines.

There are some limitations (e.g. you wouldn’t want to use it for long-running ML tasks like exhaustive hyperparameter tuning), but in my opinion, it manages just fine for many ML scenarios.

What’s your ML workflow tool of choice?

What capabilities are missing in GitHub Actions as an ML workflow tool?

GitHub link to the full Capstone Project
