Link parsing is CPU intensive #41

keith-turner · 2016-01-08T18:18:25Z

While running webindex on EC2 I have noticed the link parsing done by the load task is very CPU intensive. This is usually the bottleneck for loading data when running one load task per node.

For example, on a 20-node m3.xlarge EC2 cluster with 20 load tasks running, the maximum load rate is around 1000 pages/sec. As the system holds more data, the extra load (compactions, etc.) takes more CPU and the load rate drops.

keith-turner · 2016-01-08T18:56:31Z

Could create a standalone test to measure the performance of this code. I suspect it's slow, but it may not be. It's hard to tell on a cluster with lots of other things going on.
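A minimal sketch of what such a standalone test might look like. Everything here is an assumption for illustration: `LinkParseBench`, `extractLinks`, and `syntheticPage` are hypothetical names, and the regex-based extractor is only a stand-in for whatever parsing the load task actually does. The idea is just to time the parse call in isolation, on one thread, and report pages/sec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkParseBench {
  // Stand-in parser (assumption): swap in the load task's real parsing call here.
  private static final Pattern HREF =
      Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

  // Returns the value of every double-quoted href attribute in the HTML.
  public static List<String> extractLinks(String html) {
    List<String> links = new ArrayList<>();
    Matcher m = HREF.matcher(html);
    while (m.find()) {
      links.add(m.group(1));
    }
    return links;
  }

  // Builds a synthetic page containing numLinks anchors spread over 10 hosts.
  public static String syntheticPage(int numLinks) {
    StringBuilder sb = new StringBuilder("<html><body>");
    for (int i = 0; i < numLinks; i++) {
      sb.append("<a href=\"http://example").append(i % 10)
          .append(".com/page").append(i).append("\">link</a>");
    }
    return sb.append("</body></html>").toString();
  }

  // Parses the same page `iterations` times and returns the observed pages/sec.
  public static double pagesPerSecond(String page, int iterations) {
    long sink = 0; // accumulate results so the JIT cannot discard the parse
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      sink += extractLinks(page).size();
    }
    long elapsedNanos = Math.max(1, System.nanoTime() - start);
    if (sink < 0) {
      throw new IllegalStateException("unreachable; keeps sink live");
    }
    return iterations / (elapsedNanos / 1e9);
  }

  public static void main(String[] args) {
    String page = syntheticPage(100);
    pagesPerSecond(page, 2_000); // warm-up pass so the JIT has compiled the hot path
    System.out.printf("%.0f pages/sec on one thread%n", pagesPerSecond(page, 20_000));
  }
}
```

Comparing the single-thread number against the roughly 50 pages/sec per node implied above (1000 pages/sec over 20 nodes) would show whether parsing alone can explain the bottleneck. Real crawl data would be more representative than a synthetic page, and a harness such as JMH would give more trustworthy numbers than a hand-rolled timing loop.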

Talked w/ @mikewalch offline; he mentioned that the load task filters out pages that only have links within the domain. He thinks a lot of pages may be filtered, so the task could possibly be spending time on this.
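To check how much of the work goes into pages that are then discarded, the filter could be measured separately. The sketch below is not the load task's actual code: `DomainFilterSketch`, `hostOf`, and `hasExternalLink` are hypothetical names, and comparing exact hosts is an assumption (the real filter may compare domains differently). It keeps a page only if at least one link points to a different host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class DomainFilterSketch {
  // Returns the lowercase host of an absolute URL, or null for relative or malformed URLs.
  public static String hostOf(String url) {
    try {
      String host = new URI(url).getHost();
      return host == null ? null : host.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException e) {
      return null;
    }
  }

  // True if any link points at a host other than the page's own host.
  // Relative and unparseable links are treated as internal (assumption).
  public static boolean hasExternalLink(String pageUrl, List<String> links) {
    String pageHost = hostOf(pageUrl);
    if (pageHost == null) {
      return false;
    }
    for (String link : links) {
      String linkHost = hostOf(link);
      if (linkHost != null && !linkHost.equals(pageHost)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> internalOnly = Arrays.asList("/about", "http://a.com/contact");
    List<String> mixed = Arrays.asList("/about", "http://b.com/post");
    System.out.println(hasExternalLink("http://a.com/", internalOnly)); // page would be filtered
    System.out.println(hasExternalLink("http://a.com/", mixed)); // page would be kept
  }
}
```

Counting how many pages in a sample fail this check would give the fraction of parsing effort spent on pages that never get loaded.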

mikewalch · 2016-03-08T17:38:09Z

This may have been fixed by #54. However, it might be nice to still create a test to verify parsing performance.
