Data Science Bootcamp - Week 5 & 6 - Full Stack Developer Tips

Jose 𝒥𝒪 Reyes

Machine Learning ★ AWS Community Builder ★ Master of Data Science student @ UNSW ★ Author, fullstackdeveloper.tips

Overview

Part of the General Assembly Data Science Bootcamp Series
What is the difference between Statistics and Machine Learning?
- Prediction
- Inference
Assignment #2 due Monday last week
Special Guest this week
Update on the Capstone project
- Capstone Dataset
- Capstone Problem Statement
Resources

Part of the General Assembly Data Science Bootcamp Series

What is the difference between Statistics and Machine Learning?

Who knew!? I never knew they were different until I attended this course!

“The major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables.”

Machine Learning VS Statistics (Copyright becominghuman.ai)

It is still a bit unclear to me, because the lines are really blurred and the two overlap in capabilities. Perhaps the difference between inference and prediction is best shown by means of an example.

Prediction

When we aim to predict the outcome of a future race, as in the case of my Capstone Project, that is fairly obviously a prediction. In applying machine learning techniques, we typically split the data into a training set and a test set, train the model on the former, and check its accuracy on the latter. When we need to make a prediction, we pass the input variables to the trained model and get the prediction back as its output.

Machine learning is generally better at prediction, but it can also do a good job at inference.
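To make the train/test idea concrete, here is a minimal sketch using only the Python standard library. The rows are made-up (grid position, finished on the podium) pairs, not real race data, and the "model" is a deliberately simple threshold rule. The helper names `train_test_split`, `fit_threshold`, `predict` and `accuracy` are hypothetical and not something from the course.

```python
import random

# Made-up toy rows: (grid position, 1 if the driver finished on the podium else 0).
# These are NOT real race results; they only illustrate the workflow.
rows = [(1, 1), (2, 1), (3, 1), (4, 0), (5, 1), (6, 0), (7, 0), (8, 0),
        (1, 1), (2, 1), (3, 0), (4, 1), (9, 0), (10, 0), (12, 0), (15, 0)]

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict(threshold, grid):
    """Predict a podium (1) when the driver starts at or ahead of the threshold."""
    return 1 if grid <= threshold else 0

def accuracy(threshold, data):
    """Share of rows where the threshold rule matches the actual outcome."""
    hits = sum(predict(threshold, grid) == podium for grid, podium in data)
    return hits / len(data)

def fit_threshold(train):
    """'Training': pick the grid cut-off with the best accuracy on the training rows."""
    candidates = range(1, 21)
    return max(candidates, key=lambda t: accuracy(t, train))

train, test = train_test_split(rows)
threshold = fit_threshold(train)

print("learned threshold:", threshold)
print("test accuracy:", accuracy(threshold, test))
print("prediction for a driver starting 2nd:", predict(threshold, 2))
```

The test rows are never seen while the threshold is chosen, which is what makes the reported accuracy a fair estimate of how the rule would do on a future race.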

Inference

Inference is similar, but with a subtle difference. For example, say you are a Data Scientist in the Formula 1 organization, and it has been decided that a new race will be added to the global calendar. Your boss approaches you with this problem: can you create a statistical model that can infer the best country or location for that new race?

This differs from prediction because we are not forecasting a specific future outcome. Instead, we use past and current data to find relationships between variables, and use those relationships to arrive at the best country or location for the new race.

Statistical models tend to be better at making inferences, but they can also be good at making predictions.
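An inference question asks how strongly one variable is related to another, rather than what the next outcome will be. A minimal sketch, again on made-up numbers: fit a straight line by ordinary least squares and read off the slope. The variables, values and helper names (`ols_fit`, `r_squared`) are illustrative assumptions only.

```python
from statistics import mean

# Made-up toy data: grid position vs. points scored in a race.
# NOT real results; the point is interpreting the fitted relationship.
grid = [1, 2, 3, 4, 5, 6, 8, 10]
points = [25, 18, 15, 10, 12, 6, 4, 1]

def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

intercept, slope = ols_fit(grid, points)
print(f"each grid place further back changes points by {slope:.2f} on average")
print(f"R^2 = {r_squared(grid, points, intercept, slope):.2f}")
```

Here the interesting output is the slope itself (the size and direction of the relationship), not a prediction for any particular driver.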

Clear as mud, right?

Assignment #2 due Monday last week

I originally planned to write something about this Data Science bootcamp every week, and up until last week I intended to. However, too many things got in the way and I was not able to keep that up. We were also required to submit our EDA (Exploratory Data Analysis) assignments that weekend, plus a few other things, so something had to give.

Special Guest this week

This week, we had a special guest brought in by our instructor. A Senior Data Scientist from a well-known international tech company came in and gave us an hour-and-a-half talk about his Data Science journey.

He came across as smart and resourceful, but what was more exciting was that he also talked about his recent Data Science projects, both at work and, more interestingly, his personal ones. Seeing these gave our cohort a dose of motivation to soldier on with the rest of the course. Well, at least for me, that’s for sure!

Update on the Capstone project

So yeah, I am well and truly into my Capstone project. I have started getting all the required data from my sources. Instead of taking the easier way of just downloading data from sites like Kaggle and Google Dataset Search, I have decided to find, extract and transform all the data myself. I want to experience how it feels to go through the whole process. I feel that this is the only way to learn.

I have also dumped all the race and results data to a MongoDB collection. It’s been a long time since I last tinkered with Mongo, but it’s still easy and wonderful to work with. I picked it not specifically because it works well with Python, but because I am planning to write an API and application for the Capstone, if time permits.

Below is the script I used to prepare the race results Data Frame, the main data I would need to commence EDA:

import pandas as pd
from pymongo import MongoClient

# Assumption: the post does not show how `connect` and `races` were created.
# A default local MongoClient stands in for `connect`; `races` is expected to be
# an iterable of dicts with 'season' and 'round' keys, built earlier in the notebook.
connect = MongoClient()


def create_results_dataframe_from_mongodb_collection(races):
    """Flatten the stored race results into one row per driver per race."""
    db = connect.f1Oracle
    collection = db.results

    rows = []
    for race in races:
        # Query values are cast to strings, as in the original lookup
        query = {'season': str(race['season']), 'round': str(race['round'])}
        for results in collection.find(query):
            circuit = results['Circuit']
            location = circuit['Location']
            for item in results['Results']:
                driver = item['Driver']
                rows.append({
                    'Season': str(results['season']),
                    'Round': str(results['round']),
                    'Race Name': str(results['raceName']),
                    'Race Date': str(results['date']),
                    # Some races have no 'time' field; keep the placeholder default
                    'Race Time': str(results.get('time', '10:10:00Z')),
                    'Position': str(item['position']),
                    'Points': str(item['points']),
                    'Grid': str(item['grid']),
                    'Laps': str(item['laps']),
                    'Status': str(item['status']),
                    'Driver': f"{driver['givenName']} {driver['familyName']}",
                    'DOB': str(driver['dateOfBirth']),
                    'Nationality': str(driver['nationality']),
                    'Constructor': str(item['Constructor']['name']),
                    'Circuit Name': str(circuit['circuitName']),
                    'Race Url': str(results['url']),
                    'Lat': str(location['lat']),
                    'Long': str(location['long']),
                    'Locality': str(location['locality']),
                    'Country': str(location['country']),
                })

    return pd.DataFrame(rows)

# `races` is the list of race documents loaded earlier (not shown in this post)
results_df = create_results_dataframe_from_mongodb_collection(races)
results_df
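Because the function stores every value as a string, every column in the resulting Data Frame is text. Here is a small sketch of the kind of type conversion EDA usually needs first, shown on a single plain-dict row so it runs without MongoDB or pandas. `cast_row` is a hypothetical helper, and the sample row is an invented placeholder, not a real result.

```python
from datetime import datetime, timezone

# Placeholder row in the same shape the function produces (all strings).
# The values are invented for illustration only.
sample_row = {
    'Season': '2021', 'Round': '1', 'Race Date': '2021-01-01',
    'Race Time': '10:10:00Z', 'Position': '3', 'Points': '15',
    'Grid': '5', 'Laps': '56', 'Lat': '10.5', 'Long': '20.25',
}

INT_FIELDS = ('Season', 'Round', 'Position', 'Grid', 'Laps')
# Points are cast to float because fractional points are possible.
FLOAT_FIELDS = ('Points', 'Lat', 'Long')

def cast_row(row):
    """Return a copy of the row with numeric and date fields converted."""
    typed = dict(row)
    for field in INT_FIELDS:
        typed[field] = int(row[field])
    for field in FLOAT_FIELDS:
        typed[field] = float(row[field])
    # Combine date and time into one timezone-aware UTC timestamp
    # (the trailing 'Z' in the time string denotes UTC).
    stamp = f"{row['Race Date']}T{row['Race Time'].rstrip('Z')}"
    typed['Race Start'] = datetime.fromisoformat(stamp).replace(tzinfo=timezone.utc)
    return typed

typed_row = cast_row(sample_row)
print(typed_row['Position'] + 1, typed_row['Race Start'].isoformat())
```

The same conversions could be done column-wise in pandas; the per-row version is used here only to keep the sketch dependency-free.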

Capstone Dataset

The Ergast Motor Racing API publishes Formula 1 results from 1950 up to the present. The majority of my data set will come from this API.
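For illustration, here is roughly how one race's results can be addressed and unpacked. This is a sketch under assumptions: the URL pattern and the nesting of `Races` inside `RaceTable` inside `MRData` reflect the Ergast JSON format as I understand it, `results_url` and `extract_races` are hypothetical helper names, and the sample payload is a hand-written stub rather than a real response.

```python
import json

BASE_URL = "https://ergast.com/api/f1"  # assumed base URL of the Ergast API

def results_url(season, round_number):
    """Build the URL for one race's results in JSON form."""
    return f"{BASE_URL}/{season}/{round_number}/results.json"

def extract_races(payload):
    """Pull the list of race documents out of a decoded Ergast-style response."""
    return payload.get("MRData", {}).get("RaceTable", {}).get("Races", [])

# Hand-written stub in the assumed response shape; the values are placeholders.
stub = json.loads("""
{"MRData": {"RaceTable": {"Races": [
  {"season": "2021", "round": "1", "raceName": "Example Grand Prix",
   "Results": [{"position": "1",
                "Driver": {"givenName": "Ada", "familyName": "Example"}}]}
]}}}
""")

for race in extract_races(stub):
    winner = race["Results"][0]["Driver"]
    print(results_url(race["season"], race["round"]),
          race["raceName"], winner["givenName"], winner["familyName"])
```

Each race document extracted this way carries the `season`, `round`, `raceName` and `Results` keys that the Data Frame script reads back out of MongoDB.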

I will also be scraping some data from the following sites:

- Chicane F1 - This website has been publishing F1 race statistics since 1997 and may have some data that is missing from the Ergast API
- Wikipedia - Weather information is missing from the Ergast data set and can be scraped from Wikipedia
- World Weather Online - Some weather information is also missing from Wikipedia, so we can use WWO as a backup
- F1 Metrics - I am not using any dataset from F1 Metrics. However, the author of this blog has published so many past predictions and analyses that I felt it important to consider his domain knowledge as I develop the machine learning models in this project. It is a shame that his blog updates are few and far between, but when he does post, it’s gold.

Capstone Problem Statement

Ever since the first season of Drive to Survive, I’ve been captivated by the drama and excitement that is Formula 1. I’ve been consuming this public API in some of my past blog posts and I thought it would be fun to continue this trend and explore the insights and predictions that can be gleaned from past race data:

- predict the podium placers (1st, 2nd, 3rd) in a race
- predict who takes pole position in qualifying
- predict who sets the fastest lap in the race
- predict which team wins the Constructors' Championship at the end of the season
- explore the effect of factors such as constructor/team membership, weather, home circuit advantage, driver age and years of experience, qualifying position, etc. on the outcome of the race
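The podium goal, for example, needs a target column before any model can be trained, and the home circuit question needs a flag to compare against. Here is a minimal sketch of deriving both from rows shaped like the results Data Frame. The rows are invented placeholders, `add_features` is a hypothetical helper, and the nationality-to-country lookup is an assumption: it would have to be built by hand, since the data stores a nationality (such as 'British') next to a country (such as 'UK').

```python
# Invented placeholder rows using the same column names as the results Data Frame.
rows = [
    {'Driver': 'Driver A', 'Nationality': 'British', 'Country': 'UK', 'Position': '1'},
    {'Driver': 'Driver B', 'Nationality': 'Dutch', 'Country': 'UK', 'Position': '4'},
    {'Driver': 'Driver C', 'Nationality': 'Italian', 'Country': 'Italy', 'Position': '12'},
]

# Assumed hand-built lookup; a real one would need to cover every nationality.
NATIONALITY_TO_COUNTRY = {'British': 'UK', 'Dutch': 'Netherlands', 'Italian': 'Italy'}

def add_features(row):
    """Return a copy of the row with a podium target and a home-race flag."""
    enriched = dict(row)
    # Target for the podium model: finished 1st, 2nd or 3rd.
    enriched['Podium'] = int(row['Position']) <= 3
    # Home race: the circuit's country matches the driver's mapped nationality.
    home_country = NATIONALITY_TO_COUNTRY.get(row['Nationality'])
    enriched['Home Race'] = home_country == row['Country']
    return enriched

features = [add_features(row) for row in rows]
for row in features:
    print(row['Driver'], row['Podium'], row['Home Race'])
```

A nationality missing from the lookup simply yields `Home Race = False`, which is a quiet failure mode worth checking for during EDA.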

Resources
