Performance improvement for part2 of datalake challenge

Basically i was fascinated with this part of the challenge, and spent some time a few weeks ago trying to find a way to improve it's performance. The problem of this case was simple: You have a massive file with 70k json objects with this structure {"productId": "{number}", "image": "{img_url"}, but {number} and {img_url} was repeated several times in the file. This is what we have to do:

Iterate the whole file
Check if the the {img_url} returns 2XX
Return a new json object for which productId with this structure: {"productId": "{number}", "images": "[{list of unique and 2XX images for this productId, with the max value of 3}]"}

While having in mind that performance (time/cpu usage) is very important. So avoid making more than one request for which {img_url}

Solution

For this version of the solution, i decided to use NodeJs instead of Python. After a bit of testing with requests in both, nodeJs with axios had a pretty big win (in average, 70k requests with NodeJs/Axios took ~17s, with python it took ~2min). So here it is a step by step guide of what the script do:

Use fs and readLine to read the file linebyLine
Created a map, with {productId} -> [{image}, {image}... all images for this id]
Iterate this map, with this process for the images:
- Before iterating the images, create a new list that'll receive the images with status 200 for this product (max of 3)
- Is this image already in my new list? If yes, pass
- Is this image already cached? If yes, append to new list. If not, make a request and if it returns 200, save it in the cache.
- When we have 3 images on the new list, or we finished the images for this productId, print the object with the new structure: {"productId": "{number}", "images": new List}

With this solution, I could get an average time of 16s. My first solution using python and Redis my average time was 80s. There's still a lot of room to gain in terms of CPU usage.

Setup

For running this solution, you'll need to install the dependencies

$ npm install

After that, you still need to run the service that the script needs to make requests for which {img_url}

$ node server.js

Usage

Finally, you can run the script with the command below, this'll run the script and at the end will show the time/cpu usage

$ time node .

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
dump		dump
.gitignore		.gitignore
README.md		README.md
cache.js		cache.js
index.js		index.js
package.json		package.json
server.js		server.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Performance improvement for part2 of datalake challenge

Solution

Setup

Usage

About

Releases

Packages

Languages

MarlonCorreia/datalake-part2-improvement

Folders and files

Latest commit

History

Repository files navigation

Performance improvement for part2 of datalake challenge

Solution

Setup

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages