This project demonstrates an end-to-end data pipeline for analyzing NYC Taxi trip data. It involves data cleaning, cloud storage, data modeling using a star schema on BigQuery, and visualization through Looker Studio.
- Clean and preprocess raw trip-level data for consistency and accuracy
- Store and manage data using cloud-native infrastructure (GCS + BigQuery)
- Design a dimensional data model to support analytical querying
- Build a scalable dashboard to surface operational, financial, and behavioral insights
Component | Tool/Service |
---|---|
Data Cleaning | Python (Jupyter Notebook) |
Cloud Storage | Google Cloud Storage |
Data Warehouse | BigQuery |
Dashboarding | Looker Studio |
Data Modeling | Star Schema |
-
Loaded raw NYC taxi CSV dataset using
pandas
. -
Cleaned data by:
- Parsing and formatting datetime columns.
- Filtering out invalid entries (e.g., zero/negative distance, fare, or passengers).
- Dropping duplicates and nulls in critical fields.
-
Exported cleaned data as
trips_cleaned.csv
.
View Data Cleaning Notebook
Download Cleaned CSV
- Uploaded
trips_cleaned.csv
to a GCS bucket:gs://nyc-taxi-data-cleaned/trips_cleaned.csv
- Table:
trips_cleaned_1.fact_trip
- Contains all numeric and transactional data (distance, fare, time, surcharges, tips, etc.).
dim_payment_type
: Mapspayment_type_id
→payment_type_description
dim_rate_code
: MapsRatecodeID
to rate descriptionsdim_location
: MapsLocationID
→ Borough, Zone, Service Zonedim_vendor
: MapsVendorID
to vendor names
- Implemented a star schema, joining
fact_trip
to relevant dimensions for optimized query performance and semantic clarity.
SELECT
EXTRACT(WEEK FROM tpep_pickup_datetime) AS week,
SUM(total_amount) AS weekly_revenue
FROM trips_cleaned_1.fact_trip
GROUP BY week
ORDER BY week;
SELECT
dpt.payment_type_description,
AVG(tip_amount) AS avg_tip
FROM fact_trip ft
JOIN dim_payment_type dpt ON ft.payment_type = dpt.payment_type_id
GROUP BY dpt.payment_type_description;
SELECT
dl.zone AS pickup_zone,
SUM(total_amount) AS revenue
FROM fact_trip ft
JOIN dim_location dl ON ft.PULocationID = dl.location_id
GROUP BY pickup_zone
ORDER BY revenue DESC
LIMIT 5;
Connected Looker Studio to BigQuery to build an interactive dashboard with the following sections:
- Overview: Weekly revenue/trips, top pickup zones, trip volume by hour
- Operations: Revenue/trips segmented by borough, ratecode, vendor
- Revenue Analytics: Component-wise revenue breakdown (e.g., tips, tolls), waterfall, funnel charts
├── data/
│ └── trips_cleaned.csv
├── data_exploration.ipynb
├── sql/
│ ├── create_fact_table.sql
│ ├── create_dim_tables.sql
│ └── insights.sql
└── dashboard/
└── looker_studio_link.txt
└── results/
└── top_5_rows.csv
- Cleaned data is manually uploaded to GCS. For automation, integrate with Cloud Functions or Composer.
- BigQuery views are refreshed on query execution.
- Dashboard is auto-updated via live BigQuery connection.