A modular Python project to scrape, process, and analyze detailed cricket player statistics from ESPN CricInfo. The data is stored in AWS S3 and visualized using Power BI dashboards.
This project builds an end-to-end data pipeline that:
- Scrapes player-level data (batting, bowling, fielding, all-rounder stats, and personal info)
- Transforms and standardizes the data
- Aggregates it into master datasets
- Uploads it to AWS S3 for dashboarding in Power BI or Tableau
The pipeline is structured using modular Python scripts and classes, with extensibility for containerization and orchestration.
| Tool | Purpose |
|---|---|
| Python | Core programming language |
| Selenium | Web scraping from CricInfo |
| pandas | Data transformation |
| boto3 | Interacting with AWS S3 |
| Power BI | Dashboard visualization |
- Scrape data using Selenium for a specific player.
- Transform data into clean, analysis-ready format.
- Aggregate data across multiple players into master datasets.
- Upload to AWS S3 in both raw and transformed forms.
- Visualize insights using Power BI.
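The transform step above amounts to coercing CricInfo's display strings into analysis-ready types. A minimal sketch with pandas (the column names and sample values are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical raw batting rows as they might come off a CricInfo-style table.
raw = pd.DataFrame({
    "Format": ["Test", "ODI", "T20I"],
    "Runs": ["8,848", "12,898", "4,008"],
    "HS": ["254*", "183", "122*"],
    "Avg": ["49.15", "57.32", "-"],
})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Strip thousands separators and cast run totals to integers.
    out["Runs"] = out["Runs"].str.replace(",", "", regex=False).astype(int)
    # A trailing '*' marks a not-out innings; keep it as a separate flag.
    out["NotOut"] = out["HS"].str.endswith("*")
    out["HS"] = out["HS"].str.rstrip("*").astype(int)
    # '-' means no data; coerce to NaN so the column stays numeric.
    out["Avg"] = pd.to_numeric(out["Avg"], errors="coerce")
    return out

clean = transform(raw)
```

The same pattern extends to the bowling, fielding, and all-rounder tables.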
cricketer-stats/
├── research lab/ # Exploratory notebooks or R&D scripts
│   ├── EDA.ipynb # EDA notebook
│   └── check.ipynb # Notebook for experimentation
├── scripts/ # Python module folder with core logic
│   ├── scraper/ # Web scraping module
│   ├── transformer/ # Data cleaning and formatting module
│   ├── loader/ # S3 upload/download logic
│   └── aggregator/ # Aggregation logic to generate master DataFrames
├── scripts.egg-info/ # Auto-generated metadata for Python packaging
├── tests/ # Test scripts for modules
│ ├── aggregator_test.py
│ ├── scraper_test.py
│ └── transformer_test.py
├── visuals/ # Power BI (.pbix) and design elements
├── .env # AWS credentials and other environment setup
├── .gitignore
├── extras.md # Feature backlog and future plans
├── README.md # Project documentation
├── requirements.txt # List of Python dependencies
├── setup.py # Package setup file for pip installation
├── test.py # Optional test runner script
├── workflow.md # Workflow explanation
- 🔍 Scrapes detailed stats: batting, bowling, fielding, all-round, and player profile
- ♻️ Clean modular classes: ScrapeData, TransformData, LoadData, Aggregator
- ☁️ Uses AWS S3 as the cloud data store
- 📊 Output ready for Power BI or Tableau
- 🧱 Code structured for easy scaling and future automation
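The Aggregator's role can be sketched in a few lines of pandas: tag each player's frame with the player's name, then stack them into one master dataset (the schema here is illustrative, not the project's actual one):

```python
import pandas as pd

# Per-player frames as the transform step might produce them.
players = {
    "Player A": pd.DataFrame({"Format": ["Test", "ODI"], "Runs": [5000, 7000]}),
    "Player B": pd.DataFrame({"Format": ["Test", "ODI"], "Runs": [4200, 9100]}),
}

def aggregate(frames: dict) -> pd.DataFrame:
    parts = []
    for name, df in frames.items():
        tagged = df.copy()
        tagged.insert(0, "Player", name)  # keep the player id as the first column
        parts.append(tagged)
    return pd.concat(parts, ignore_index=True)

master = aggregate(players)
```

The resulting master frame is what gets uploaded to S3 for the dashboards.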
1. Clone this repository:

   ```bash
   git clone https://github.com/your-username/cricketer-stats.git
   cd cricketer-stats
   ```

2. Create and configure a `.env` file with your AWS credentials:

   ```env
   AWS_ACCESS_KEY_ID=your_key
   AWS_SECRET_ACCESS_KEY=your_secret
   AWS_DEFAULT_REGION=ap-south-1
   ```

3. Install project dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install the project locally as a package:

   ```bash
   pip install -e .
   ```

5. Run any of the test scripts:

   ```bash
   python tests/aggregator_test.py
   ```

   ⚠️ Make sure to set `player_name` and `bucket_name` before running the tests.
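Once the environment is set up, the loader writes each dataset to S3 in raw and transformed form. One possible key layout, shown here purely as an assumption (the actual LoadData scheme may differ):

```python
def s3_key(player_name: str, stat: str, stage: str = "raw") -> str:
    """Build an S3 object key like 'raw/virat_kohli/batting.csv'.

    This layout is illustrative; the project's loader may use another scheme.
    """
    slug = player_name.strip().lower().replace(" ", "_")
    return f"{stage}/{slug}/{stat}.csv"

# In the real pipeline a key like this would be passed to
# boto3.client("s3").upload_file(local_path, bucket_name, key).
key = s3_key("Virat Kohli", "batting")
```

Keeping raw and transformed objects under separate prefixes makes it easy to point Power BI at only the cleaned data.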
See `extras.md` for upcoming enhancements, including:
- Dynamic ground info scraping
- Incremental loading logic
- Secret manager integration
- Docker + Airflow orchestration
- Custom exception handling
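For instance, the planned incremental loading could diff the players to process against the keys already in S3 and scrape only what is missing. A hypothetical sketch (the `raw/<player>/...` key layout is an assumption, not the project's actual scheme):

```python
def pending_players(requested: set, existing_keys: set) -> set:
    """Return players that have no raw data in S3 yet.

    existing_keys is assumed to hold keys like 'raw/<player>/batting.csv'.
    """
    already_loaded = {
        key.split("/")[1] for key in existing_keys if key.startswith("raw/")
    }
    return requested - already_loaded

# Example: only 'rohit_sharma' still needs scraping.
todo = pending_players(
    {"virat_kohli", "rohit_sharma"},
    {"raw/virat_kohli/batting.csv", "raw/virat_kohli/bowling.csv"},
)
```

In production the key listing would come from `boto3`'s `list_objects_v2` rather than a hard-coded set.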
This project is structured as an installable package using `setup.py`, which lets you import the core modules (`scraper`, `loader`, `transformer`, `aggregator`) from anywhere on your system.
From the project root, run:

```bash
pip install -e .
```

The modules can then be imported directly:

```python
from scraper import ScrapeData
from transformer import TransformData
```

This makes testing and modular development cleaner, especially across notebooks and scripts.
See `workflow.md` for a high-level overview of the entire ETL-to-dashboard pipeline.