ETL and visualisation of the Online Retail Transaction dataset, which contains information on transactions made by customers through an online retail platform.
This project demonstrates an end-to-end ETL (Extract, Transform, Load) process and exploratory data analysis for a retail transactions dataset. The ETL process cleans and transforms the dataset by removing missing values, eliminating cancellations (InvoiceNo starting with “C”), and creating calculated fields such as Sales (Quantity × UnitPrice).
The Data Visualisation stage explores invoice basket sizes by country, sales trends, time of day and day of week sales patterns using Matplotlib, Seaborn, and Plotly.
The dataset includes data on the products that were purchased, the quantity of each product, the date and time of each transaction, the price of each product, the unique identifier for each customer who made a purchase, and the country where each customer is located.
Column Descriptors
| Column Name | Description |
|---|---|
| StockCode | A code used to identify the product that was purchased |
| Description | A brief description of the product that was purchased |
| Quantity | The quantity of the product that was purchased |
| InvoiceDate | The date and time that the purchase was made |
| UnitPrice | The price of one unit of the product that was purchased |
| CustomerID | The unique identifier for the customer who made the purchase |
| Country | The country where the customer who made the purchase is located |
- Clean the raw data by removing anomalies such as cancellations
- Analyse sales trends over time and also the pattern of sales by day of week and time of day.
- Understand geographic distribution of sales.
- The raw data contains records such as Invoice Cancellations which need to be removed for further analysis. This was confirmed to be the case, so these records were removed.
- There is a Sales peak in the run-up to Christmas.
Plotted weekly sales trends and checked for spikes in week before Christmas. - A small proportion of countries drive most revenue.
Rank countries by sales and plot as a sorted chart.
- Load raw dataset.
- Inspect for missing values and duplicates.
- Remove cancellations (
InvoiceNostarting with “C”) and their matching original invoices. - Add
Salescolumn (Quantity × UnitPrice). - Save cleaned dataset to CSV.
- Load cleaned dataset.
- Aggregate and plot:
- Monthly and Weekly sales trends.
- Sales by country.
- Customer purchase frequency.
- Day of Week and Time of Day Sales heatmap.
- Present Country Sales on a World Map to provide an interactive visualisation.
- A bubble chart is used to analyze average invoice basket sizes and average basket values by country.
| Requirement | Visualisation |
|---|---|
| Top-selling countries | Bar chart of ranked countries by sales |
| Seasonality | Monthly and Weekly sales line charts |
| Geographic distribution | Map of sales by country |
| Basket sizes | Totalbubble chart is used to analyze average invoice basket sizes |
- Data Cleaning:
dropna, string filtering withstr.startswith, outlier filtering. - Feature Engineering: New
Salescolumn. - Aggregation:
groupbywith sum/count to derive KPIs. - Visualisation:
- Matplotlib & Seaborn for static charts.
- Plotly Express for interactive exploration.
- Date Handling: Extracting month/year/week/day of week/time of day from
InvoiceDatefor trend analysis.
- Google and Copilot were used to get the correct syntax for the functions needed.
-
CustomerIDvalues are anonymised. There is no personal data.
- None.
- Used Google and Copilot to get the correct Python syntax.
-
data manipulation - numpy, pandas
-
visualisation - matplotlib, seaborn, plotly.express
- Dataset: Kaggle - Online Retail Transactions
- Plot styles inspired by Seaborn and Plotly documentation.
- Thanks to Code Institute facilitators for project guidance.
