Predicting whether a Formula 1 driver will finish on the podium (top 3) using historical data, advanced feature engineering, and machine learning.
- Introduction
- Key Terminology
- Problem Statement
- Data Sources
- Methodology
- Exploratory Data Analysis
- Statistical Testing
- Feature Engineering
- Modeling
- Deployment
- How to Run
- Conclusion
The world of Formula 1 is rich with data—qualifying rounds, race conditions, team strategies, driver performances, and track histories. This project leverages that data to predict one thing: will a driver finish on the podium?
- Driver — The individual competing in the race. Each F1 team has two drivers.
- Constructor (Team) — The organization that builds and races the car. Examples: Mercedes, Ferrari, Red Bull Racing.
- Grand Prix (Race, Round) — A single event in the F1 calendar, typically held over a weekend, consisting of practice sessions, qualifying, and the main race.
- Qualifying — A session that determines the starting grid for the race. A better qualifying position often improves race performance.
- Grid Position: The starting position of a driver in the race.
- Podium — The top 3 finishers in a race — 1st, 2nd, and 3rd place. These are the drivers who physically stand on the podium after the race and receive trophies.
- Pole Position — The first position on the starting grid, awarded to the fastest qualifier.
- Pit Stop — When a driver enters the pit lane to change tyres or fix minor issues. Time-consuming, but sometimes strategically vital. The pit stop itself ideally takes 2-3 seconds, but the whole process of entering and exiting the pits lasts about 20-25 seconds.
- DNF (Did Not Finish) — When a driver does not complete the race due to a crash, mechanical failure, or other issue.
Goal: Predict whether a driver will finish on the podium using a historical dataset.
This is a binary classification problem with significant class imbalance (only ~15% of drivers finish in the top 3).
- Main datasets: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020
- Pit Stops: https://www.kaggle.com/datasets/akashrane2609/formula-1-pit-stop-dataset
- Weather API and race metadata
- Historical driver/team performance
- Data cleaning & merging
- Feature engineering (both static and temporal)
- Statistical testing
- Model training with hyperparameter tuning
- Evaluation with real-world test set (2024 races)
- Deployment-ready pipeline + API
Exploration included:
- Class imbalance check
- Grid position impact
- Team and driver podium rates
- Circuit-based performance
- Weather condition summaries
- Global distribution of F1 circuits
We used:
- Chi-Squared Tests: For independence of categorical features.
- Mann-Whitney U Tests: For differences in feature distributions across classes.
- Results informed feature selection.
Key features:
- Driver Experience (race count)
- Recent Performance (last 3 races)
- Rolling Average Finish
- Constructor Podium Rate
- Track-Specific Averages
- Weather Flags (wet, windy, hot, cold)
- Binary-Encoded Categorical Features
All feature engineering is encapsulated in a reusable F1DataPreprocessor
transformer.
- Logistic Regression
- Random Forest
- Cost-sensitive learning
- SMOTE and over-sampling
- HistGradientBoostingClassifier
- LightGBM, XGBoost, CatBoost
- Ensembles with Voting and Stacking
- Optuna with AUC & F1 scores
Best Model: Random Forest with Optuna-tuned hyperparameters Test AUC: 937 Test F1: 0.72 Test Precision: 0.64
-
Custom Transformer (
F1DataPreprocessor
) -
Custom Pipeline (
F1Pipeline
) -
Joblib Model Saving:
joblib.dump(preprocessor, "f1_preprocessor.pkl") joblib.dump(model, "models/model.pkl")
A FastAPI service is provided for predicting podium chances in real-time.
The API is dockerized for easy deployment:
docker build -t f1-predictor .
docker run -p 8000:8000 f1-predictor
-
Clone the repo:
git clone https://github.com/wsiqz/formula-1.git cd formula-1
-
Install dependencies:
pip install -r requirements.txt
-
Run the pipeline notebook:
notebooks/f1.ipynb
-
Train and export the model:
joblib.dump(preprocessor, "f1_preprocessor.pkl") joblib.dump(model, "models/model.pkl")
-
Start FastAPI server:
uvicorn app.main:app --reload
This project demonstrates how domain knowledge, careful feature engineering, and rigorous modeling techniques can be combined to solve real-world predictive problems—even in complex, dynamic environments like Formula 1.