This repository has been archived by the owner on Apr 17, 2018. It is now read-only.

Link parsing is CPU intensive #41

Open
keith-turner opened this issue Jan 8, 2016 · 2 comments

Comments

@keith-turner
Member

While running webindex on EC2, I have noticed that the link parsing done by the load task is very CPU intensive. This is usually the bottleneck for loading data when running one load task per node.

For example, on a 20-node m3.xlarge EC2 cluster with 20 load tasks running, the maximum load rate is around 1000 pages/sec. As load on the system increases from having more data (caused by compactions, etc.), that work takes more CPU and causes the load rate to drop.

@keith-turner
Member Author

We could create a standalone test to measure the performance of this code. I suspect it's slow, but it may not be. It's hard to tell on a cluster with lots of other things going on.
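A rough idea of what such a standalone test could look like, as a minimal sketch only. It is written in Python with the standard-library `html.parser` as a stand-in, so it does not exercise webindex's actual parsing code; `LinkExtractor`, `extract_links`, `make_page`, and `measure_pages_per_sec` are hypothetical names, and the synthetic pages are an assumption, not real crawl data. The point is just to time link extraction in isolation on one core, away from everything else running on a cluster.

```python
import time
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (stand-in for the real parser)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Parse one page and return its anchor hrefs in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def make_page(num_links):
    """Build a synthetic page; real pages would be larger and messier."""
    anchors = "".join(
        f'<a href="http://example{i % 7}.com/page{i}">link {i}</a>\n'
        for i in range(num_links)
    )
    return f"<html><body>{anchors}</body></html>"


def measure_pages_per_sec(pages, repeats=3):
    """Return the best observed single-threaded parse rate over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for page in pages:
            extract_links(page)
        best = min(best, time.perf_counter() - start)
    return len(pages) / best


if __name__ == "__main__":
    pages = [make_page(100) for _ in range(200)]
    print(f"{measure_pages_per_sec(pages):.0f} pages/sec on one core")
```

Comparing the single-core rate against the roughly 50 pages/sec per node implied by the cluster numbers above would give a first hint of whether parsing alone can explain the bottleneck.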

Talked w/ @mikewalch offline. He mentioned that the load task filters out pages that only have links within their own domain. He thinks a lot of pages may be filtered. The task could possibly be spending time on this.
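To make the filtering cost concrete, here is a minimal sketch of that kind of check, assuming it compares hostnames. This is not webindex's actual implementation, which may compare domains differently; `host_of` and `has_external_link` are hypothetical names.

```python
from urllib.parse import urljoin, urlsplit


def host_of(url):
    """Lower-cased hostname of a URL, or '' when it has none (e.g. mailto:)."""
    return (urlsplit(url).hostname or "").lower()


def has_external_link(page_url, hrefs):
    """True if any href resolves to a host other than the page's own host."""
    page_host = host_of(page_url)
    for href in hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        target_host = host_of(urljoin(page_url, href))
        if target_host and target_host != page_host:
            return True  # early exit: one external link is enough to keep the page
    return False
```

Note that under this sketch a page that ends up filtered is the expensive case: every one of its links has to be resolved and compared before it can be dropped, so a high filter rate would mean a lot of parsing work that produces no loaded data.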

@mikewalch
Contributor

This may have been fixed by #54. However, it might still be nice to create a test to verify parsing performance.


2 participants