Data Science Bootcamp - Week 10 - Full Stack Developer Tips

Jose 𝒥𝒪 Reyes

Machine Learning ★ AWS Community Builder ★ Master of Data Science student @ UNSW ★ Author, fullstackdeveloper.tips ★ Connect with Me!

Overview

Part of the General Assembly Data Science Bootcamp Series
Capstone Project Recap
Conclusion
What’s next
Resources

Part of the General Assembly Data Science Bootcamp Series

Github link to the full Capstone Project

Capstone Project Recap

This is a recap of my Capstone Project where I describe in high level detail the process I went through to complete the course, and achieve a machine learning model that can predict the winner of a 2021 Formula 1 race.

Before the start of this bootcamp, I had beginner level Python, even though I had been developing software for many years. I had attempted an end-to-end Data Science project once before, but when doing these type of things on your own, these are bound for failure, at least for me.

I cannot emphasize this enough, that courses like this GA Data Science bootcamp, will give you the best chance of achieving your goal. For me it was to be able to complete an end-to-end project. And I’m pretty proud of myself, but surely wouldn’t have done it if not for the course, very capable and supportive instructors and equally curious cohorts.

Problem Statement

Ever since the first season of Drive to Survive , I’ve been captivated by the drama and excitement that is Formula 1. I’ve been consuming this public API in some of my past blog posts and I thought it would be fun to continue this trend and explore the insights and predictions that can be gleaned from past race data:

predict the race winner in a race
explore the effect of factors such as Constructor/team membership, weather, home circuit advantage, age/years of experience of driver, qualifying position, etc on the outcome of the race

Even though this post is just skimming the surface, and is really meant as a high level summary, there are still lots of details in it, and will not be very interesting for many of you. If you are in this boat, please fast forward to the conclusion to skip through all the details.

Data Sources

Ergast Motor Racing has been publishing these Formula 1 results from 1950 up to the present. Majority of my data set will be from this API.

I will also be scraping some data from the following sites:

Chicane F1 - Since 1997 this website has been publishing F1 Race statistics and may have some data that is missing in the Ergast API. I did not end up needing data from this site after all, but it does have some race insights which are handy to know, like this 2021 points graph .
Wikipedia - Weather information is missing in the Ergast data set and this can be scraped from Wikipedia
World Weather Online - some weather information is also missing from Wikipedia, so we can use WWO as a backup
F1 Metrics - We are not really using any dataset from F1 Metrics, however, the author of this blog had so many past predictions and analysis, that I felt it important to consider his domain knowledge as I develop the machine learning models in this project. It is a shame that his blog updates are few and far between, however when he does, it’s gold.

What’s Machine Learning again?

I’ve been a developer for some time now, and for a while I’ve been fascinated by machine learning. In software development, I can make computers do what I want, by programming them. However, in machine learning, you are not and cannot do that.

In machine learning, you need to prepare your data in such a way so that when you show this data to your algorithms, it can understand and see the patterns and can come to a conclusion, without being explicitly programmed to do so.

This is the part that perplexed me from the start, its kind of smoke and mirrors. However having successfully gone through the process, it all makes sense to me now. And working on another end-to-end project by myself is not that unconceivable anymore.

Feature Engineering, Importance and Selection

Although the data that we extracted from Ergast is great, it needs to be improved upon if we want to be able to successfully predict the race winner. Apart from the weather information that I scraped from an external source (Wikipedia), we already had all the data we need. We just needed to create “features” from the data already there. In Data Science, we call this Feature Engineering. If you want to see how I did these, please check it out here. At the end of this process, I had about 110 features.

Next is to determine the importance of these features, since we don’t want to use all of them in the modelling stage. During this Feature Importance stage, we identify the most important and relevant features, and drop the rest. All these processes are all done with Python, here’s what I did to quantify the feature importance of the data.

Finally, once we have identified the relevant features, we can now safely drop the inconsequential features (that just contributes to model bloat), and keep the most significant ones. And finally, at this stage we are now ready to build our models.

Regression or Classification

We have a multitude of machine learning algorithms that we can use to build our model. Since this is my first attempt I wanted to use as many models as I can. At this stage I didn’t really know about any of them, much less building one that can do predictions!

In this project we used 2 types of Supervised machine learning techniques. One is called Regression models , where this type can predict a continuous number like the finishing order. The second type is called Classification models , where I used it to predict if a driver Wins a race, or Not.

Parameter search methodology

Python has made it relatively easy to build a model. There’s a library called Sci Kit Learn open source and free, where building a model can be as easy as finding an appropriate algorithm, supply the data, and parameters, and train it with your training data.

To find the most optimum parameters, this is called Hyperparameter tuning in machine learning-speak. There are libraries to do this easily, however I have opted to do this manually using Python for loops. I have tried the automated way, but for this project, I got better results without it.

I have actually combined my hyperparameter tuning with running the actual predictions, this took a while, and it depends on the type of algorithms you end up using, but for me initially it took more than 5 hours (I left it running overnight), and found out later that one of the Classification models I used (SVM) was so slow, without much difference as the other models which were much quicker to train. So I ditched that, and for all the models I used (15 in total including the Dumb Classifier), it took about 1.7 hours using my trusty old MacBook Pro.

Conclusion

Following our Champion/Challenger methodology, the winning model is the one built with Bagging Regressor (Decision Tree Regressor). This is an ensemble model fron SKLearn which uses DecistionTreeRegressor as its base estimator.

Neural Network - MLP Regressor ended up with the same precision score, however because it took 10x more in training time, I have delegated it to 2nd place.

Bagging Regressor (Decision Tree Regressor)
Train time: 0.97s
Test time: 0.42s
Grid, ConstructorRecentForm, DriverRecentForm and DriverRecentWins round out the most significant features. The last 3 are one of the features we engineered for this project.
Precision score: 61.9%
For each race, the finishing order is also available, but omitted here for brevity
Correctly predicting the winner in 13 out of 21 races in 2021 Season

What’s next

automate the training, building and deploying the model
create an API that exposes this functionality to
a web application allowing others to interact with it
get more exposure to other types of machine learning projects and domains. The F1 prediction project is but a tiny speck in the machine learning/AI universe.
keen to volunteer or collaborate with a worthwhile project that needs my help

Github link to the full Capstone Project

Resources

2023 6
2022 7
2021 9
2020 6
2019 11

2023

How to Build, Train and Deploy Your Own Recommender System – Part 2

7 minute read

We build a recommender system from the ground up with matrix factorization for implicit feedback systems. We then deploy the model to production in AWS.

How to Build, Train and Deploy Your Own Recommender System – Part 1

12 minute read

We build a recommender system from the ground up with matrix factorization for implicit feedback systems. We put it all together with Metaflow and used Comet...

Build Recommender Systems the Easy Way in AWS

15 minute read

Building and maintaining a recommender system that is tuned to your business’ products or services can take great effort. The good news is that AWS can do th...

Ethics in Data, Weekly Reflections

9 minute read

Provided in 6 weekly installments, we will cover current and relevant topics relating to ethics in data

Accelerate ML Application Development in AWS

8 minute read

Get your ML application to production quicker with Amazon Rekognition and AWS Amplify

Remember the last time you created an Entity Relationship diagram? I can’t.

3 minute read

(Re)Learning how to create conceptual models when building software

2022

Going to Production with Github Actions, Metaflow and AWS SageMaker

5 minute read

A scalable (and cost-effective) strategy to transition your Machine Learning project from prototype to production

Small to Reasonable Scale MLOps

4 minute read

An Approach to Effective and Scalable MLOps when you’re not a Giant like Google

AWS Summit 2022 Australia and New Zealand

4 minute read

Day 2 summary - AI/ML edition

AWS Summit 2022 Australia and New Zealand

4 minute read

Day 1 summary - AI/ML edition

Micro-frontends building blocks: Webpack Module Federation

4 minute read

What is Module Federation and why it’s perfect for building your Micro-frontend project

Micro-frontends building blocks: Monorepos

3 minute read

What you always wanted to know about Monorepos but were too afraid to ask

Data Science Bootcamp - MLOps on the cheap!

4 minute read

Using Github Actions as a practical (and Free*) MLOps Workflow tool for your Data Pipeline. This completes the Data Science Bootcamp Series

2021

Data Science Bootcamp - Week 10

7 minute read

Final week of the General Assembly Data Science bootcamp, and the Capstone Project has been completed!

Data Science Bootcamp - Week 5 & 6

5 minute read

Fifth and Sixth week, and we are now working with Machine Learning algorithms and a Capstone Project update

Data Science Bootcamp - Week 4

3 minute read

Fourth week into the GA Data Science bootcamp, and we find out why we have to do data visualizations at all

Data Science Bootcamp - Week 3

4 minute read

On the third week of the GA Data Science bootcamp, we explore ideas for the Capstone Project

Data Science Bootcamp - Week 2

3 minute read

We explore Exploratory Data Analysis in Pandas and start thinking about the course Capstone Project

Data Science Bootcamp - Week 1

3 minute read

Follow along as I go through General Assembly’s 10-week Data Science Bootcamp

Updating React Context does not update my component

4 minute read

Updating Context will re-render context consumers, only in this example, it doesn’t

Pre-render strategies in NextJS

8 minute read

Static Site Generation, Server Side Render or Client Side Render, what’s the difference?

Penny Pinching using the Jamstack Architecture

4 minute read

How to ace your Core Web Vitals without breaking the bank, hint, its FREE! With Netlify, Github and GatsbyJS.

2020

DynamoDB and Single-Table Design

9 minute read

Follow along as I implement DynamoDB Single-Table Design - find out the tools and methods I use to make the process easier, and finally the light-bulb moment...

Debunking 5 common misconceptions about DynamoDB

7 minute read

Use DynamoDB as it was intended, now!

Simple GraphQL consumer with Apollo Client

5 minute read

A GraphQL web client in ReactJS and Apollo

6 Steps to your first GraphQL server

6 minute read

From source to cloud using Serverless and Github Actions

Top 7 reasons why GraphQL is better than REST

7 minute read

How GraphQL promotes thoughtful software development practices

Managing React application state shouldn’t be rocket science

6 minute read

Why you might not need external state management libraries anymore

Data Science Bootcamp - Week 10

Jose 𝒥𝒪 Reyes

Part of the General Assembly Data Science Bootcamp Series

Capstone Project Recap

Problem Statement

Data Sources

What’s Machine Learning again?

Feature Engineering, Importance and Selection

Regression or Classification

Parameter search methodology

Conclusion

What’s next

Resources

2023

2022

2021

2020

2019