This project presents a full-cycle data science solution for analyzing and deriving insights from retail transactional data using Python and Power BI. It involves structured data cleaning, EDA, customer segmentation using RFM + KMeans, Market Basket Analysis using the Apriori algorithm, and business dashboard creation. Outputs are stored in CSV, visualized with Matplotlib/Seaborn, and optionally integrated with a MySQL database or BI dashboards (Power BI/Tableau).
This project aims to derive strategic insights from customer purchase data in an e-commerce/retail environment by:
- Identifying customer purchasing patterns and trends
- Segmenting customers based on behavioral metrics (Recency, Frequency, Monetary)
- Generating association rules for product bundling
- Recommending strategies for targeted marketing and inventory optimization
- Visualizing insights through an interactive Power BI dashboard
- Source: Kaggle, Online Retail II Dataset
- Period: Dec 2009 to Dec 2011
- Size: 779,000+ records
- Columns:
Feature | Description |
---|---|
InvoiceNo | Unique transaction ID |
StockCode | Unique product ID |
Description | Product description |
Quantity | Quantity purchased |
InvoiceDate | Date and time of transaction |
UnitPrice | Price per item |
CustomerID | Unique customer ID |
Country | Country of purchase |
Layer | Technology |
---|---|
Language | Python 3.10+ |
Data Handling | Pandas, NumPy |
Visualization | Matplotlib, Seaborn |
ML Algorithms | KMeans (Scikit-learn), Apriori (mlxtend) |
Database | MySQL (via `mysql-connector-python`) |
Notebook | Jupyter Notebook |
Dashboard (opt.) | Power BI (4-page executive dashboard) |
customer_purchase_analysis/
├── data/                              # Raw dataset
│   └── online_retail.csv
│
├── scripts/                           # Modular ETL/ML scripts
│   ├── _init_.py
│   ├── utils.py
│   ├── data_cleaning.py
│   ├── mysql_pipeline.py
│   ├── eda_analysis.py
│   ├── rfm_segmentation.py
│   └── market_basket.py
│
├── notebooks/                         # Main pipeline orchestrator
│   └── purchase_analysis.ipynb
│
├── outputs/
│   ├── data/
│   │   ├── clean_online_retail.csv
│   │   ├── rfm_segments.csv
│   │   └── association_rules.csv
│   └── figures/
│       ├── eda_fig/
│       ├── rfm_fig/
│       └── mba_fig/
│
├── logs/
│   └── process_log.log
├── Reports/
│   ├── Customer_Purchase_Analysis.pbix
│   ├── Customer_Purchase_Analysis.pdf
│   ├── Customer_Purchase_Analysis.pptx
│   ├── BI_Executive_Summary.png
│   ├── BI_Sales_Trend.png
│   ├── BI_RFM_Segments.png
│   └── BI_Market_Basket.png
├── requirements.txt
└── README.md
- Drops duplicates
- Handles missing values (esp. `CustomerID`)
- Removes missing or invalid values
- Creates new columns like `TotalPrice`
- Logs all steps and saves the cleaned file
Cleaned dataset: outputs/data/clean_online_retail.csv
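The cleaning steps above can be sketched in pandas as follows. This is a minimal illustration, not the contents of `data_cleaning.py`: the function name `clean_retail` is hypothetical, and treating non-positive `Quantity` or `UnitPrice` as "invalid" is an assumption about what the project filters out. Logging and file output are omitted.

```python
import pandas as pd


def clean_retail(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a raw transactions frame."""
    out = df.drop_duplicates()
    # Rows without a customer ID cannot be attributed to a segment.
    out = out.dropna(subset=["CustomerID"])
    # Assumption: non-positive quantities or prices count as invalid rows.
    out = out[(out["Quantity"] > 0) & (out["UnitPrice"] > 0)].copy()
    # Derived column used by the revenue and RFM steps.
    out["TotalPrice"] = out["Quantity"] * out["UnitPrice"]
    return out.reset_index(drop=True)


# Tiny made-up sample: one duplicate, one missing customer, one negative quantity.
raw = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A4"],
    "StockCode": ["S1", "S1", "S2", "S3", "S4"],
    "Quantity": [2, 2, 1, -1, 3],
    "UnitPrice": [1.5, 1.5, 4.0, 2.0, 2.0],
    "CustomerID": [10.0, 10.0, None, 11.0, 12.0],
})
cleaned = clean_retail(raw)
print(cleaned)
```

Only the `A1` and `A4` rows survive in this sample; the others are dropped by the duplicate, missing-customer, and validity filters respectively.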
- Inserts cleaned data into MySQL and retrieves it back
- Optional for production deployment and data integration
- Handles deduplication and backup


- Top 10 selling products
- Monthly & daily revenue trends
- Hourly purchase patterns (peak times)
- Country-wise revenue distribution
Outputs: outputs/figures/eda_fig/
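As a rough sketch, the EDA views listed above reduce to a handful of pandas group-bys. The sample rows below are made up, and the variable names are illustrative rather than taken from `eda_analysis.py`; plotting with Matplotlib/Seaborn is left out.

```python
import pandas as pd

# Made-up rows shaped like the cleaned dataset.
df = pd.DataFrame({
    "Description": ["MUG", "MUG", "PLATE", "BOWL"],
    "Quantity": [3, 2, 4, 1],
    "TotalPrice": [6.0, 4.0, 20.0, 3.0],
    "Country": ["United Kingdom", "Germany", "United Kingdom", "Germany"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05 10:15", "2010-01-20 13:40",
        "2010-02-02 10:05", "2010-02-11 16:30",
    ]),
})

# Top 10 products by units sold.
top_products = df.groupby("Description")["Quantity"].sum().nlargest(10)

# Revenue per calendar month.
monthly_revenue = df.groupby(df["InvoiceDate"].dt.to_period("M"))["TotalPrice"].sum()

# Number of line items per hour of day, to locate peak times.
hourly_orders = df.groupby(df["InvoiceDate"].dt.hour).size()

# Revenue per country, largest first.
country_revenue = (
    df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False)
)

print(top_products, monthly_revenue, hourly_orders, country_revenue, sep="\n\n")
```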
- Total Revenue
- Unique Customers
- Quantity Sold
- Average Order Value
- Core KPIs (Revenue, Quantity, AOV, etc.)
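A minimal sketch of how these KPIs could be computed from the cleaned data. The helper `compute_kpis` is hypothetical, and defining Average Order Value as total revenue divided by the number of distinct invoices is an assumption; the project may define it differently.

```python
import pandas as pd


def compute_kpis(df: pd.DataFrame) -> dict:
    """Return the headline KPIs for a cleaned transactions frame."""
    revenue = float(df["TotalPrice"].sum())
    orders = int(df["InvoiceNo"].nunique())
    return {
        "total_revenue": revenue,
        "unique_customers": int(df["CustomerID"].nunique()),
        "quantity_sold": int(df["Quantity"].sum()),
        # Assumption: AOV = revenue per distinct invoice.
        "average_order_value": revenue / orders if orders else 0.0,
    }


# Made-up sample: invoice A1 has two line items, A2 has one.
sample = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2"],
    "CustomerID": [10, 10, 11],
    "Quantity": [2, 1, 5],
    "TotalPrice": [10.0, 5.0, 15.0],
})
kpis = compute_kpis(sample)
print(kpis)
```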
- Calculates Recency, Frequency, and Monetary values
- Removes outliers using IQR
- Scales features and applies KMeans clustering, yielding 4 customer segments
- Uses silhouette + elbow methods to determine the optimal `k`
- Segments labeled for business use:
  - Loyal Valuable Customers
  - Recent High-Spenders
  - Occasional Low-Spenders
  - Inactive Spenders
Outputs: outputs/figures/rfm_fig/
RFM: outputs/data/rfm_segments.csv
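The RFM pipeline described above might look roughly like the sketch below. It runs on synthetic transactions, so the chosen `k` will not necessarily match the 4 segments reported for the real dataset. The helper names (`build_rfm`, `drop_iqr_outliers`), the 1.5 IQR multiplier, and the `k` search range of 2 to 6 are assumptions; only the silhouette criterion is shown, and the elbow plot is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

RFM_COLS = ["Recency", "Frequency", "Monetary"]


def build_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """One row per customer: days since last order, order count, total spend."""
    return tx.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("TotalPrice", "sum"),
    )


def drop_iqr_outliers(rfm: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep rows inside [Q1 - k*IQR, Q3 + k*IQR] on every RFM column."""
    keep = pd.Series(True, index=rfm.index)
    for col in RFM_COLS:
        q1, q3 = rfm[col].quantile(0.25), rfm[col].quantile(0.75)
        iqr = q3 - q1
        keep &= rfm[col].between(q1 - k * iqr, q3 + k * iqr)
    return rfm[keep]


# Synthetic transactions standing in for the cleaned dataset.
rng = np.random.default_rng(0)
n = 400
tx = pd.DataFrame({
    "CustomerID": rng.integers(1, 61, size=n),
    "InvoiceNo": np.arange(n),
    "InvoiceDate": pd.Timestamp("2011-01-01")
    + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
    "TotalPrice": rng.gamma(2.0, 20.0, size=n).round(2),
})
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = drop_iqr_outliers(build_rfm(tx, snapshot))
X = StandardScaler().fit_transform(rfm[RFM_COLS])

# Score each candidate k by mean silhouette and keep the best one.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

final = KMeans(n_clusters=best_k, n_init=10, random_state=0)
rfm = rfm.assign(Segment=final.fit_predict(X))
print(best_k, rfm["Segment"].value_counts().to_dict())
```

Cluster IDs from KMeans are arbitrary integers; mapping them to business labels such as "Loyal Valuable Customers" would be a separate step based on each cluster's mean Recency, Frequency, and Monetary values.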
- Applies Apriori algorithm to find frequent itemsets
- Generates association rules (support, confidence, lift)
- Visualizes top rules (bubble chart, lift bar chart)
- Great for cross-selling & bundling strategies
Outputs: outputs/figures/mba_fig/
Rules: outputs/data/association_rules.csv
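To make the three rule metrics concrete, the sketch below computes support, confidence, and lift for every ordered item pair in a handful of made-up baskets. It is a brute-force illustration in plain pandas, not the Apriori algorithm itself (there is no candidate pruning and only pairs are considered), and the item names are invented for the example.

```python
from itertools import permutations

import pandas as pd

# Made-up baskets; each set is one invoice.
baskets = [
    {"teacups", "gift wrap"},
    {"teacups", "gift wrap", "candles"},
    {"teacups"},
    {"candles"},
    {"gift wrap", "candles"},
]
items = sorted(set().union(*baskets))

# One-hot basket matrix: rows are invoices, columns are items.
onehot = pd.DataFrame(
    [[item in basket for item in items] for basket in baskets],
    columns=items,
)

rows = []
for antecedent, consequent in permutations(items, 2):
    support_a = onehot[antecedent].mean()
    support_c = onehot[consequent].mean()
    support_ac = (onehot[antecedent] & onehot[consequent]).mean()
    if support_ac == 0:
        continue  # the pair never occurs together
    confidence = support_ac / support_a  # P(consequent | antecedent)
    rows.append({
        "antecedent": antecedent,
        "consequent": consequent,
        "support": support_ac,
        "confidence": confidence,
        "lift": confidence / support_c,  # > 1 means positive association
    })

rules = (
    pd.DataFrame(rows)
    .sort_values("lift", ascending=False)
    .reset_index(drop=True)
)
print(rules)
```

A lift above 1 indicates the two items are bought together more often than independence would predict, which is what makes a pair a candidate for bundling.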
- 4 Pages:
  - Executive Summary
  - Sales Analysis
  - Customer Segments
  - Association Rules
Insight | Value |
---|---|
70% of revenue | Comes from top 20% of customers |
Peak time | 10 AM to 2 PM on weekdays |
Best countries | UK (80%), Germany, Netherlands |
Bundling | "Gift box set" + "Teacups" has 62% confidence |
Segment | 4 clusters with tailored marketing strategies |
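A revenue-concentration figure like the first row of the table can be checked with a short calculation. The sketch below is one possible way to do it; the helper `top_share` is hypothetical and the per-customer revenue values are made up, so the printed share is not the project's result.

```python
import numpy as np
import pandas as pd


def top_share(customer_revenue: pd.Series, top_fraction: float = 0.2) -> float:
    """Share of total revenue contributed by the top `top_fraction` of customers."""
    ranked = customer_revenue.sort_values(ascending=False)
    n_top = max(1, int(np.ceil(len(ranked) * top_fraction)))
    return float(ranked.iloc[:n_top].sum() / ranked.sum())


# Made-up revenue per customer (index = CustomerID).
revenue = pd.Series(
    [500, 300, 50, 40, 30, 20, 20, 20, 10, 10],
    index=range(1, 11),
)
share = top_share(revenue)
print(f"Top 20% of customers contribute {share:.0%} of revenue")
```

On the real data, `customer_revenue` would be the `TotalPrice` sum grouped by `CustomerID`.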
- Inventory planning based on top co-purchases
- Loyalty programs for high-value customers
- Targeted email offers during peak purchase times
- Executive dashboards via Power BI (optional)
- Customer Segments:
  - Loyal Valuable Customers
  - Inactive Spenders
  - Occasional Low Spenders
  - Recent High Spenders
- Example Association Rule:
  - If a user buys "Set of Teacups", they are 62% likely to buy "Gift Wrap"

4-Page Executive BI Dashboard (`Reports/`):
- Total Revenue, Orders, Customers
- Revenue by Country
- Segment distribution (from RFM)

- Monthly/Weekly Revenue Trend
- Top Products Sold
- Peak Hour Purchases

- RFM Cluster Scatter Plots
- Segment-specific KPIs

- Rules Table (A → B)
- Top Rules by Lift
- Scatter: Confidence vs Support

- Python 3.10+
- MySQL Server (optional)
- Jupyter Notebook
git clone https://github.com/Ayesha24banu/Customer-Purchase-Behaviour-Analysis-in-Retail.git
cd Customer-Purchase-Behaviour-Analysis-in-Retail
pip install -r requirements.txt
jupyter notebook notebooks/purchase_analysis.ipynb
`purchase_analysis.ipynb` and `mysql_pipeline.py`.
Jupyter Notebook: https://drive.google.com/file/d/1Jip6S2ppr5XhR7zQi5dREoGPfj3NtdXh/view?usp=drive_link
Power BI: https://drive.google.com/file/d/1NLckyX9VrAv5E3ddQYDTrqJXJIr0i5D8/view?usp=drive_link
- RFM segmentation helps personalize marketing and optimize offers.
- Market Basket Analysis guides product placement, bundling, and inventory management.
- Visual outputs can be used by business teams with minimal technical effort.
- Live segmentation using streaming data
- Recommendation engine using collaborative filtering
- Customer lifetime value prediction
- Streamlit app for business teams
- AutoML for dynamic segmentation
- NLP analysis on customer reviews
- Real-time customer segmentation pipeline
- API-based deployment via FastAPI or Flask
- `purchase_analysis.ipynb`: Master notebook
- `rfm_segments.csv`: RFM clustering results
- `association_rules.csv`: Market basket rules results
- Visual charts in `/outputs/figures`
- MySQL-ready table insertions (optional)
- Power BI dashboard images in `Reports/`
Thanks to the UCI & Kaggle community for the retail dataset.
Ayesha Banu
- M.Sc. Computer Science | Gold Medalist
- Data Scientist | Data Analyst | Full-Stack Python Developer | GenAI Enthusiast
- LinkedIn
- Project: Customer Purchase Behavior Analysis in Retail -- 2025
Distributed under the MIT License. See the `LICENSE` file for details.