
I explored over 2.9 million NYC taxi trip records using BigQuery to understand which zones drive the most revenue, how pricing varies, and where demand peaks. I then presented the insights in a Looker Studio dashboard to support data-driven city transport decisions.


ruru-lyy/NYC-Taxi-Trip-EDA-Dashboard


NYC Taxi Trip Analytics – End-to-End Data Pipeline & Dashboard

This project demonstrates an end-to-end data pipeline for analyzing NYC Taxi trip data. It involves data cleaning, cloud storage, data modeling using a star schema on BigQuery, and visualization through Looker Studio.


Project Objectives

  • Clean and preprocess raw trip-level data for consistency and accuracy
  • Store and manage data using cloud-native infrastructure (GCS + BigQuery)
  • Design a dimensional data model to support analytical querying
  • Build a scalable dashboard to surface operational, financial, and behavioral insights

Tech Stack

| Component      | Tool/Service              |
|----------------|---------------------------|
| Data Cleaning  | Python (Jupyter Notebook) |
| Cloud Storage  | Google Cloud Storage      |
| Data Warehouse | BigQuery                  |
| Dashboarding   | Looker Studio             |
| Data Modeling  | Star Schema               |

Workflow Overview

1. Data Cleaning (data_exploration.ipynb)

  • Loaded raw NYC taxi CSV dataset using pandas.

  • Cleaned data by:

    • Parsing and formatting datetime columns.
    • Filtering out invalid entries (e.g., zero/negative distance, fare, or passengers).
    • Dropping duplicates and nulls in critical fields.
  • Exported cleaned data as trips_cleaned.csv.
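
The cleaning rules above can be made explicit in a few lines. The notebook itself uses pandas; the sketch below is a dependency-free stand-in using only the standard library, so it illustrates the rules rather than the notebook's actual code. The column names (tpep_pickup_datetime, trip_distance, fare_amount, passenger_count), the timestamp format, and the sample rows are assumptions for illustration.

```python
import csv
import io
from datetime import datetime

# Assumed column names and timestamp format (not confirmed by the notebook).
CRITICAL = ("tpep_pickup_datetime", "trip_distance", "fare_amount", "passenger_count")
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def clean_rows(rows):
    """Yield parsed rows, skipping nulls, invalid values, and repeated rows."""
    seen = set()
    for row in rows:
        # Drop rows with missing values in critical fields.
        if any(not (row.get(col) or "").strip() for col in CRITICAL):
            continue
        try:
            pickup = datetime.strptime(row["tpep_pickup_datetime"], TS_FORMAT)
            distance = float(row["trip_distance"])
            fare = float(row["fare_amount"])
            passengers = int(float(row["passenger_count"]))
        except ValueError:
            continue  # unparseable timestamp or number
        # Filter out zero/negative distance, fare, or passengers.
        if distance <= 0 or fare <= 0 or passengers <= 0:
            continue
        # Drop rows that repeat an earlier row in all critical fields.
        key = (pickup, distance, fare, passengers)
        if key in seen:
            continue
        seen.add(key)
        yield {
            "tpep_pickup_datetime": pickup,
            "trip_distance": distance,
            "fare_amount": fare,
            "passenger_count": passengers,
        }


# Made-up sample rows covering each rule.
sample = io.StringIO(
    "tpep_pickup_datetime,trip_distance,fare_amount,passenger_count\n"
    "2023-01-01 00:15:00,2.5,12.0,1\n"
    "2023-01-01 00:15:00,2.5,12.0,1\n"  # duplicate
    "2023-01-01 01:00:00,0.0,5.0,1\n"   # zero distance
    "2023-01-01 02:00:00,3.1,,2\n"      # missing fare
    "2023-01-02 09:30:00,1.2,8.5,2\n"
)
cleaned = list(clean_rows(csv.DictReader(sample)))
print(len(cleaned))
```

In pandas the same rules would typically map to to_datetime, boolean-mask filtering, dropna on the critical columns, and drop_duplicates.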

View Data Cleaning Notebook
Download Cleaned CSV

2. Cloud Storage

  • Uploaded trips_cleaned.csv to a GCS bucket: gs://nyc-taxi-data-cleaned/trips_cleaned.csv
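
The upload in this project was done manually. As a hedged alternative, the sketch below shows how the same step could be scripted, assuming the google-cloud-storage package is installed and Application Default Credentials are configured. The helper names split_gcs_uri and upload_to_gcs are hypothetical, and the local path data/trips_cleaned.csv is taken from the directory structure in this README.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid GCS object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_to_gcs(local_path, uri):
    """Upload a local file to the given gs:// URI."""
    # Imported lazily so the URI helper works without the client library.
    from google.cloud import storage

    bucket_name, object_name = split_gcs_uri(uri)
    client = storage.Client()  # uses Application Default Credentials
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)


bucket, name = split_gcs_uri("gs://nyc-taxi-data-cleaned/trips_cleaned.csv")
print(bucket, name)
# To run the upload (requires credentials and write access to the bucket):
# upload_to_gcs("data/trips_cleaned.csv", "gs://nyc-taxi-data-cleaned/trips_cleaned.csv")
```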

3. BigQuery Data Warehouse

a. Fact Table

  • Table: trips_cleaned_1.fact_trip
  • Contains all numeric and transactional data (distance, fare, time, surcharges, tips, etc.).

b. Dimension Tables

  • dim_payment_type: Maps payment_type_id → payment_type_description
  • dim_rate_code: Maps RatecodeID to rate descriptions
  • dim_location: Maps LocationID → Borough, Zone, Service Zone
  • dim_vendor: Maps VendorID to vendor names

c. Data Model

  • Implemented a star schema, joining fact_trip to relevant dimensions for optimized query performance and semantic clarity.
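
To make the star-schema shape concrete, here is a minimal sketch that rebuilds a tiny version of it in an in-memory SQLite database. SQLite stands in for BigQuery only so the example runs anywhere; the table and key names follow this README, while the column subset and every inserted row are made-up toy values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_payment_type (
        payment_type_id INTEGER PRIMARY KEY,
        payment_type_description TEXT
    );
    CREATE TABLE dim_location (
        location_id INTEGER PRIMARY KEY,
        borough TEXT,
        zone TEXT
    );
    -- Fact rows carry measures plus foreign keys into the dimensions.
    CREATE TABLE fact_trip (
        payment_type INTEGER REFERENCES dim_payment_type(payment_type_id),
        PULocationID INTEGER REFERENCES dim_location(location_id),
        total_amount REAL,
        tip_amount REAL
    );
    -- Toy rows for illustration only.
    INSERT INTO dim_payment_type VALUES (1, 'Credit card'), (2, 'Cash');
    INSERT INTO dim_location VALUES (132, 'Queens', 'JFK Airport'),
                                    (161, 'Manhattan', 'Midtown Center');
    INSERT INTO fact_trip VALUES (1, 132, 70.0, 12.0),
                                 (1, 161, 18.0, 3.0),
                                 (2, 161, 12.0, 0.0);
    """
)

# One fact table joined to two dimensions, aggregated by their labels.
rows = conn.execute(
    """
    SELECT dl.zone, dpt.payment_type_description,
           SUM(ft.total_amount) AS revenue, AVG(ft.tip_amount) AS avg_tip
    FROM fact_trip ft
    JOIN dim_payment_type dpt ON ft.payment_type = dpt.payment_type_id
    JOIN dim_location dl ON ft.PULocationID = dl.location_id
    GROUP BY dl.zone, dpt.payment_type_description
    ORDER BY revenue DESC
    """
).fetchall()
for row in rows:
    print(row)
```

The fact table stores only IDs and measures, so descriptive labels such as zone names are kept once in the dimensions and attached at query time.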

Sample Analytical Queries

Weekly Revenue Trend

SELECT
  EXTRACT(WEEK FROM tpep_pickup_datetime) AS week,
  SUM(total_amount) AS weekly_revenue
FROM trips_cleaned_1.fact_trip
GROUP BY week
ORDER BY week;
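
A note on week numbering: in BigQuery, EXTRACT(WEEK FROM ...) returns a value from 0 to 53, weeks begin on Sunday, and dates before the first Sunday of the year fall in week 0. The result carries no year, so data spanning several years would merge same-numbered weeks. The sketch below reproduces the grouping in plain Python using strftime("%U"), which follows the same Sunday-based convention; the function name weekly_revenue and the sample trips are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def weekly_revenue(trips):
    """Sum total_amount per Sunday-based week number (0-53)."""
    totals = defaultdict(float)
    for pickup, total_amount in trips:
        week = int(pickup.strftime("%U"))  # Sunday-first week of the year
        totals[week] += total_amount
    return dict(sorted(totals.items()))


# Made-up trips: (pickup datetime, total_amount)
trips = [
    (datetime(2023, 1, 1, 8, 0), 20.0),   # Sunday
    (datetime(2023, 1, 7, 23, 0), 15.0),  # Saturday, same week
    (datetime(2023, 1, 8, 0, 5), 30.0),   # next Sunday starts a new week
]
print(weekly_revenue(trips))
```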

Average Tip by Payment Type

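-- Assumes the dimension tables live in the same dataset as fact_trip.
SELECT
  dpt.payment_type_description,
  AVG(ft.tip_amount) AS avg_tip
FROM trips_cleaned_1.fact_trip ft
JOIN trips_cleaned_1.dim_payment_type dpt ON ft.payment_type = dpt.payment_type_id
GROUP BY dpt.payment_type_description;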

Top Pickup Zones by Revenue

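-- Assumes the dimension tables live in the same dataset as fact_trip.
SELECT
  dl.zone AS pickup_zone,
  SUM(ft.total_amount) AS revenue
FROM trips_cleaned_1.fact_trip ft
JOIN trips_cleaned_1.dim_location dl ON ft.PULocationID = dl.location_id
GROUP BY pickup_zone
ORDER BY revenue DESC
LIMIT 5;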

Open BigQuery Dataset


Looker Studio Dashboard

Connected Looker Studio to BigQuery to build an interactive dashboard with the following pages:

  • Overview: Weekly revenue/trips, top pickup zones, trip volume by hour
  • Operations: Revenue/trips segmented by borough, ratecode, vendor
  • Revenue Analytics: Component-wise revenue breakdown (e.g., tips, tolls), waterfall, funnel charts

Directory Structure

├── data/
│   └── trips_cleaned.csv
├── data_exploration.ipynb
├── sql/
│   ├── create_fact_table.sql
│   ├── create_dim_tables.sql
│   └── insights.sql
├── dashboard/
│   └── looker_studio_link.txt
└── results/
    └── top_5_rows.csv

Deployment & Automation Notes

  • Cleaned data is manually uploaded to GCS. For automation, integrate with Cloud Functions or Composer.
  • BigQuery views are logical, so they reflect the latest table data each time they are queried.
  • Dashboard is auto-updated via live BigQuery connection.
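
As one possible building block for the automation mentioned above, the sketch below loads the cleaned CSV from GCS into BigQuery with a load job. It is not part of this repository: it assumes the google-cloud-bigquery package and configured credentials, the function name load_csv_to_bigquery is hypothetical, and the destination table trips_cleaned_1.trips_raw is a made-up staging name. A Cloud Function or Composer task could call a function like this after each upload.

```python
SOURCE_URI = "gs://nyc-taxi-data-cleaned/trips_cleaned.csv"
TABLE_ID = "trips_cleaned_1.trips_raw"  # hypothetical staging table


def load_csv_to_bigquery(uri, table_id):
    """Load a CSV from GCS into a BigQuery table, replacing its contents."""
    # Imported lazily; requires the google-cloud-bigquery package.
    from google.cloud import bigquery

    client = bigquery.Client()  # project and credentials come from the environment
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # header row
        autodetect=True,      # infer the schema; an explicit schema is safer
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # block until the load job finishes
    return client.get_table(table_id).num_rows


# To run the load (requires credentials and an existing dataset):
# print(load_csv_to_bigquery(SOURCE_URI, TABLE_ID))
```

WRITE_TRUNCATE replaces the table on every run, which keeps reruns idempotent; an append-based design would need its own deduplication step.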
