Data Science Bootcamp - Week 4 - Full Stack Developer Tips

Jose 𝒥𝒪 Reyes

Machine Learning ★ AWS Community Builder ★ Master of Data Science student @ UNSW ★ Author, fullstackdeveloper.tips

Overview

Part of the General Assembly Data Science Bootcamp Series
Why bother with data visualization?
Week 4 - Python visualizations, aka throw away visualizations
Capstone Project proposal due this coming Monday
- Formula 1 Dataset
Resources

Part of the General Assembly Data Science Bootcamp Series

Why bother with data visualization?

While summary statistics (e.g. sum, mean, standard deviation) are important concepts in the study of data science, they are not enough when you want a more complete understanding of your data.

© Copyright General Assembly course notes

Take, for example, the summary statistics in the table above. They suggest that all the data sets are identical. Or are they? As the charts below show, that could not be further from the truth. The data sets are completely different, and without the charts we would be none the wiser!

This is known as Anscombe’s Quartet, first constructed by the statistician Francis Anscombe in 1973. He wanted to demonstrate the importance of graphing data before analyzing it, and the effect of outliers on statistical properties.

The example above highlights the shortcomings of relying on summary statistics alone. It also shows the effect of outliers on these summaries. We are visual beings, and the table alone would not have delivered the impact and understanding that the charts convey so easily.
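To make the point concrete, here is a minimal sketch that recomputes the summary statistics for the quartet. The course table itself is not reproduced in this post, so the values below are the published Anscombe (1973) data, and the names `quartet`, `pearson_r` and `summarize` are my own illustrative choices, not something from the course notes.

```python
from statistics import mean, variance

# Anscombe's quartet (published 1973 values). Datasets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation, written out by hand to avoid extra dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def summarize(xs, ys):
    """Summary statistics for one dataset, rounded to two decimal places."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "var_x": round(variance(xs), 2),  # sample variance
        "var_y": round(variance(ys), 2),
        "corr": round(pearson_r(xs, ys), 2),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)
```

The four printed rows should agree to within about a hundredth on every statistic, even though the underlying points are arranged very differently. That is exactly why the numbers alone are not enough.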

Week 4 - Python visualizations, aka throw away visualizations

Alongside exploratory data analysis, where we slice and dice the data with Python and Pandas (among the most popular tools for data scientists), we also run the data through a series of visualizations. We want to see patterns at a high level and decide early on whether we can continue, because part of data science is making sure we have enough data and that its quality is good enough.

The charts above were generated with Seaborn, a popular Python visualization library built on top of another Python visualization library, Matplotlib. Both libraries let data scientists easily create many different types of visualizations straight from their Jupyter notebooks.
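As a rough sketch of how charts like these might be produced, the snippet below reshapes the published Anscombe values into one record per point and hands them to Seaborn. The exact code behind the course charts is not shown in the notes, so the helper names `to_long_rows` and `plot_quartet`, and the choice of `sns.lmplot`, are my own assumptions.

```python
# Anscombe's quartet in "wide" form: dataset name -> (x values, y values).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def to_long_rows(data):
    """Flatten into one record per point, the 'long' layout Seaborn works well with."""
    return [
        {"dataset": name, "x": x, "y": y}
        for name, (xs, ys) in data.items()
        for x, y in zip(xs, ys)
    ]


def plot_quartet(rows):
    """Draw one scatter panel with a fitted line per dataset (hypothetical helper)."""
    # Imported here so the data-shaping code above runs without the plotting libraries.
    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame(rows)
    # lmplot builds on Matplotlib and returns a FacetGrid with one panel per dataset.
    return sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)


rows = to_long_rows(quartet)
print(len(rows), "points across", len(quartet), "datasets")
```

In a Jupyter notebook with Pandas and Seaborn installed, calling `plot_quartet(rows)` in a cell should render the four panels inline.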

When working on visualizations, especially in Jupyter notebooks, the preference is to create many of them. We were encouraged to plot as many charts as we could, the idea being to treat them as throwaway charts: they exist for the sole purpose of finding patterns in the early stages of the end-to-end data science process.

Capstone Project proposal due this coming Monday

I still have the whole weekend to complete my Capstone project proposal, as well as the Unit 2 assignment. This will be a busy weekend. I will still be updating the proposal draft below, but I have decided that my project will be using the Formula 1 Racing dataset.

Formula 1 Dataset

Ever since the first season of Drive to Survive, I’ve been captivated by the drama and excitement that is Formula 1. I’ve been consuming this public API in some of my past blog posts (DynamoDB and Single-Table Design, Simple GraphQL consumer with Apollo Client) and I thought it was fitting to continue this trend and explore the insights and predictions that can be gleaned from it:

predict the winner of a race
predict the pole sitter in qualifying
predict who will set the fastest lap
predict which team will have the fastest pit stop
explore the effect of the weather on the outcome of the race
predict who wins the Constructors' Championship at the end of the year
predict who finishes last in the next race

Resources

