The goal of this project was to work with an inherited legacy front-end e-commerce web portal, Pulsar Shop, and build out a specific API microservice to be used by the front-end client. The microservice was to be a RESTful API that would need to handle web-scale traffic on all 4 read routes.
For the purposes of this project, web-scale traffic was defined as:
- The ability to handle a minimum of 100 requests/second (RPS)
- <2,000ms of latency
- <1% error rate
After various optimizations, this API microservice handles real-world traffic loads of up to 500 RPS on 2 of the 4 read routes, and up to 1,000 RPS on the other 2, with an error rate of 0%.
An exhaustive break-down of my daily process, hurdles, and successes can be found in my Notion Engineering Journal.
I opted for 2 deployment methods for testing purposes.

The first method utilized Docker-Compose to spin up one container each for:
- Node Server
- Postgres DB
- Redis DB
- Nginx Load-balancer
This allowed for a fast and consistent developer experience for both local testing and production testing. Additionally, this allowed my entire service to be easily deployed to one AWS EC2 t2.micro instance.
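For illustration, a minimal `docker-compose.yml` for this setup might look like the following; the service names, images, and ports are assumptions, not the exact file used:

```yaml
# Minimal sketch of the four-container setup; names, images, and ports are illustrative
services:
  node-server:
    build: ./server
    expose:
      - "3000"          # only reachable by other containers (i.e., Nginx)
  postgres:
    image: postgres:14
    environment:
      POSTGRES_DB: products
  redis:
    image: redis:7
  nginx:
    image: nginx:latest
    ports:
      - "80:80"         # single public entry point
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - node-server
```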
The second method consisted of spinning up 4 EC2 t2.micro instances, one each for:
- Node server 1
- Node server 2
- Nginx load-balancing server
- PostgreSQL DB with Redis server
This allowed for a more highly available system that could withstand higher loads, since I wasn't load testing just one EC2 instance but distributing the load among 4. Ultimately, this became my official system architecture.
| Method | Endpoint | Purpose | Response Code |
| --- | --- | --- | --- |
| GET | `/products/list` | Retrieves the list of products. | 200 |
| GET | `/products/:product_id` | Returns all product-level information for a specified product by ID. | 200 |
| GET | `/products/:product_id/styles` | Returns all the styles available for the given product by ID. | 200 |
| GET | `/products/:product_id/related` | Returns the IDs of all the products related to the product specified by ID. | 200 |
| PUT | `/products/:product_id` | Updates a product by ID. | 200 |
| POST | `/products` | Creates a new product. | 201 |
| DELETE | `/products/:product_id` | Deletes a product by ID. | 200 |
Since I had 6 CSV files (`products`, `features`, `styles`, `related`, `photos`, and `skus`) that made up all of my product information, and I knew the data shape the front end was expecting, the first order of business was to create a data model that would conform to that shape.
Given that the data naturally lent itself to a relationship-driven schema, I knew a relational DB would be a good choice. I ultimately chose PostgreSQL for its powerful aggregate functions, as well as its ability to build JSON objects directly within queries.
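As a toy example of what that looks like, Postgres' `json_build_object` and `json_agg` can assemble the nested JSON shape the front end expects directly in SQL (the table and column names here are illustrative):

```sql
-- Build one JSON object per product, with its features nested as a JSON array
SELECT json_build_object(
  'id', p.id,
  'name', p.name,
  'features', (
    SELECT json_agg(json_build_object('feature', f.feature, 'value', f.value))
    FROM features f
    WHERE f.product_id = p.id
  )
) AS product
FROM products p
WHERE p.id = 1;
```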
An ETL process was required before beginning the project, since all of the data had to be sanitized and normalized, and represented ~5 GB of data, or 49M+ records. Opening each individual CSV to work on proved impossible, given that each CSV had anywhere from 1 million to 26 million rows and easily overwhelmed the RAM on my machine. Therefore, I was forced to do in-depth research on Node Streams in order to (see the sketch after this list):
- Create a readable Stream from each CSV, since it was impossible to open all of them in memory
- Pipe in a CSV parser with specific configurations to open and read the CSV
- Pipe in a custom transformer function, extending Node's Transform class, in order to normalize all rows and columns
- Pipe to a writable Stream to produce my clean CSV files
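Here is a sketch of that pipeline, assuming the `csv-parse` and `csv-stringify` packages; the file names and normalization rules are illustrative:

```typescript
import { createReadStream, createWriteStream } from 'fs';
import { pipeline, Transform, TransformCallback } from 'stream';
import { parse } from 'csv-parse';
import { stringify } from 'csv-stringify';

// Custom transformer: extends Node's Transform class to normalize each row
class RowNormalizer extends Transform {
  constructor() {
    super({ objectMode: true }); // rows arrive as parsed arrays, not raw bytes
  }
  _transform(row: string[], _enc: BufferEncoding, done: TransformCallback) {
    // Example normalization: trim whitespace, convert 'null' strings to empty fields
    const clean = row.map((col) => (col.trim() === 'null' ? '' : col.trim()));
    done(null, clean);
  }
}

pipeline(
  createReadStream('data/products.csv'),        // 1. readable stream from the raw CSV
  parse({ relax_quotes: true }),                // 2. CSV parser with specific configuration
  new RowNormalizer(),                          // 3. normalize all rows and columns
  stringify(),                                  // 4. serialize rows back to CSV text
  createWriteStream('data/clean/products.csv'), // 5. writable stream for the clean file
  (err) => (err ? console.error('ETL failed:', err) : console.log('ETL complete')),
);
```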
Once the "Extraction" and "Transformation" parts of the process were done, I needed to build out an automated way to "Load" my ~5GBs worth of data into Docker so that it would be accessible within the running instance of Postgres to then copy it into the database. Through a lot of trial and error, I found that I could utilize Docker's /docker-entrypoint-initdb.d
entrypoint and updated my Dockerfile to load in 3 scripts:
- One to init my database schema
- Another to copy over the CSV files into their appropriate tables
- And the last one to create the indexes; this script was added later, once my optimization work had revealed which indexes to create
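A sketch of the relevant Dockerfile section; the script and path names are illustrative:

```dockerfile
FROM postgres:14

# Make the cleaned CSVs available inside the container so COPY FROM can read them
COPY ./data/clean /data

# Scripts in /docker-entrypoint-initdb.d run in alphabetical order on first startup
COPY ./scripts/01-init-schema.sql    /docker-entrypoint-initdb.d/
COPY ./scripts/02-copy-csvs.sql      /docker-entrypoint-initdb.d/
COPY ./scripts/03-create-indexes.sql /docker-entrypoint-initdb.d/
```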
With my ETL process finally complete, I was able to run my service locally using Docker-Compose!
As I began to think about how to design my API (the routes, how to respond to them, and how to interact with my database), I found that I wanted to follow a layered architecture approach in my codebase. This would allow me to divide concerns and responsibilities into different files and directories, permitting direct communication only between certain layers.
- The Application layer would handle the basic setup of my server and the connection to my routes, as well as any middlewares and Swagger specs later on.
- The Routes layer defined all of my routes and served as the connection to the Controllers layer; it is also where I handled input validation using `express-validator`.
- The Controllers layer is where all of the service's business logic would live for each of my endpoints, deciding which HTTP codes to return to the client; it also served as the connection to the Model layer.
- The Model layer housed all of my logic for interacting with my Postgres database (and Redis cache later on).
- The Persistence layer is where my database lived, located in the `/database` directory outside of the `/server` directory; it instantiates and exports the `Pg.Pool` interface as well as the `Redis.Client` interface.
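Mapped onto the repository, the layers look roughly like this (directory names beyond `/server` and `/database` are illustrative):

```
├── database/          # Persistence layer: exports Pg.Pool and Redis.Client
└── server/
    ├── index.ts       # Application layer: server setup, middlewares, Swagger
    ├── routes/        # Routes layer: route definitions + express-validator checks
    ├── controllers/   # Controllers layer: business logic, HTTP status codes
    └── models/        # Model layer: Postgres (and later Redis) access
```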
An initial test using Postman on the most computationally heavy query showed a response time of over 1 minute! With this benchmark in mind, my initial goal was to optimize the SQL query itself down to under 50ms, with a stretch goal of 10ms.
I had initially used the `Pg.Client` class from `node-postgres`, since it was simple to set up, and began making one-off manual tests using Postman. However, according to the docs, by utilizing a Client instead of a Pool I was in danger of quickly exhausting the available, idle clients. This could have the negative effect of causing my server to time out with an error or hang indefinitely. Additionally, when using `Pg.Pool`, you must release a client back to the pool after each request, thereby ensuring an available client at scale unless the machine itself is overwhelmed.
This change from `Pg.Client` to `Pg.Pool` led to a 375% increase in performance.
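A minimal sketch of the pooled setup with `node-postgres`; the connection settings and query are illustrative:

```typescript
import { Pool } from 'pg';

const pool = new Pool({
  host: 'localhost',
  database: 'products',
  max: 20,                   // cap on concurrent clients in the pool
  idleTimeoutMillis: 30_000, // reclaim clients idle longer than 30s
});

export const getProductById = async (productId: number) => {
  // pool.query checks out a client, runs the query, and releases the client
  // back to the pool automatically -- no manual client.release() needed
  const { rows } = await pool.query('SELECT * FROM products WHERE id = $1', [
    productId,
  ]);
  return rows[0];
};
```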
Once I had learned to properly query my database using connection pooling, I moved on to the biggest optimization of my queries: indexing. However, I knew I couldn't just index everything, since the overhead cost of creating those indexes would negate any performance gains I was hoping to achieve. With that mindset, I started to think about how to identify what to index, and Postgres' `EXPLAIN ANALYZE` command came to my rescue!
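Usage is as simple as prefixing a query (the query below is illustrative):

```sql
-- Prints the query plan plus actual execution time, exposing sequential scans
EXPLAIN ANALYZE
SELECT * FROM styles WHERE product_id = 1;
```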
I wanted to understand what the cost of my queries was for each route:
- `GET /products/:product_id/styles`:
- `GET /products/:product_id/related`:
After analyzing each of my queries, I realized that sequential scans were happening on all queries that weren't looking a product up by ID, and these were severely limiting my performance. I identified that this was happening on 5 specific table/column relationships (see the index sketch after this list):
- `features` and `product_id`
- `styles` and `product_id`
- `photos` and `styles_id`
- `skus` and `styles_id`
- `related` and `product_id`
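A sketch of the corresponding index definitions (the index names are illustrative):

```sql
CREATE INDEX idx_features_product_id ON features (product_id);
CREATE INDEX idx_styles_product_id   ON styles (product_id);
CREATE INDEX idx_photos_styles_id    ON photos (styles_id);
CREATE INDEX idx_skus_styles_id      ON skus (styles_id);
CREATE INDEX idx_related_product_id  ON related (product_id);
```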
Once I had created these 5 indexes, I saw the following reductions in query time:
- 495% improvement (≈6× faster) on GET `/products/list`:
- 7,565% improvement (≈77× faster) on GET `/products/:product_id`:
- 1,014,070% improvement (≈10,100× faster) on GET `/products/:product_id/styles`:
- 47,757% improvement (≈480× faster) on GET `/products/:product_id/related`:
All my queries were WELL BELOW the stated goal of 50ms, and even faster than the stretch goal of 10ms. Now it was time to stress test locally before deploying and testing on an EC2 instance!
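For context, a minimal Artillery test definition looks something like this; the target, route, and rates are illustrative rather than my exact test config:

```yaml
config:
  target: "http://localhost:3000"
  phases:
    - duration: 60     # run for 60 seconds
      arrivalRate: 100 # spawn 100 new virtual users per second (100 RPS)
scenarios:
  - flow:
      - get:
          url: "/products/1/styles"
```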
Using Artillery to stress test my service, I was able to successfully go up to 100 RPS, but eventually crashed out at 1,000 RPS, as can be seen below:
For my caching strategy, I decided to use the "lazy-loading" cache-aside strategy by placing my Redis cache on the same server as my Postgres DB:
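In code, the read path for cache-aside might look like the sketch below, assuming the node-redis v4 client; the key naming, TTL, and query are illustrative:

```typescript
import { createClient } from 'redis';
import { pool } from '../database'; // hypothetical Pg.Pool export

const redisClient = createClient({ url: 'redis://localhost:6379' });
await redisClient.connect(); // assumes an ESM module with top-level await

export const getStyles = async (productId: number) => {
  const cacheKey = `styles:${productId}`;

  // 1. Check the cache first
  const cached = await redisClient.get(cacheKey);
  if (cached) return JSON.parse(cached); // cache hit: skip Postgres entirely

  // 2. Cache miss: fall through to Postgres
  const { rows } = await pool.query(
    'SELECT * FROM styles WHERE product_id = $1',
    [productId],
  );

  // 3. Lazily populate the cache for subsequent reads (1-hour TTL)
  await redisClient.set(cacheKey, JSON.stringify(rows), { EX: 3600 });
  return rows;
};
```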
Once I had seen improved performance using my caching strategy, I decided to implement a load-balancing technique, since the higher the RPS I pushed, the higher my latency climbed. I reasoned that the load on the single server was too much. I realized that Docker-Compose has a pretty nifty feature for scaling out any service via the `docker-compose up --scale <service-name>=<number>` syntax, which let me scale up my containers within my one EC2 instance.
For load balancing, I utilized Nginx's default round-robin strategy:
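The relevant Nginx config is small, since round-robin is the default when no other strategy is named in the `upstream` block (the hostnames and ports here are assumptions):

```nginx
upstream node_servers {
    # With no balancing directive, Nginx alternates requests round-robin
    server node-server-1:3000;
    server node-server-2:3000;
}

server {
    listen 80;
    location / {
        proxy_pass http://node_servers;
    }
}
```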
After comparing the performance of my Dockerized service on one EC2 instance versus the 4 distributed EC2 instances, I came away with 2 primary conclusions:
- While I was easily able to handle at least 200 RPS on my service after implementing Redis and scaling up my Docker containers behind Nginx, I realized that my biggest bottleneck became my t2.micro's limited CPU. I actually started seeing worse results after implementing Nginx: while Docker made it super easy to deploy and even scale my service, it was too much load for the one machine.
- Therefore, I made the correct decision to horizontally scale my service across 4 EC2 t2.micro machines, and this is where I saw the greatest gains: throughput increased 2.5× (from 200 RPS to 500 RPS) on my 2 most computationally costly read routes (GET Product by ID and GET Product Styles) and 1.33× (from 750 RPS to 1,000 RPS) on my two other read routes (GET Products List and GET Related Product IDs).
If I had more time, I would have been interested in testing on a larger machine (t2.small, t2.medium, or even t2.large) with more CPUs, so that I could look into utilizing Node's cluster module to take advantage of running my service across multiple cores.
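A sketch of what that could look like with Node's built-in `cluster` module; the port and handler are illustrative:

```typescript
import cluster from 'node:cluster';
import { cpus } from 'node:os';
import http from 'node:http';

if (cluster.isPrimary) {
  // Primary process: fork one worker per CPU core
  for (let i = 0; i < cpus().length; i++) cluster.fork();
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} died; forking a replacement`);
    cluster.fork();
  });
} else {
  // Each worker runs its own server; connections are distributed across workers
  http
    .createServer((_req, res) => res.end('ok'))
    .listen(3000, () => console.log(`worker ${process.pid} listening`));
}
```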
Cesar Roman - LinkedIn - [email protected]
git clone https://github.com/cesartheroman/products-api
cd products-api/server
npm install
npm run server/index.ts
If you'd like to contribute, please fork the repository and open a pull request to the `main` branch.