Page load times and performance
Note: the following was drafted following a team discussion on Feb 15, 2021, and provides a summary of factors that are involved in page load time. This document will be supplemented in the future.
There are a number of factors that go into the total time it takes to deliver search results to a user over the web. Due to the architecture of the search index provided in the crow_backend, a growing corpus will probably not be the significant factor, at least up to a point.
The indexing methodology used -- storing the texts and the tokenized words in keyed indices in an SQL database -- is designed to perform lookups across millions of rows of data. For example, as of 2021, Crow's corpus texts are stored in 10,911 rows, and the tokenized words account for 103,769 distinct rows in a separate index. Our increase from ~7,000 texts to ~10,000 texts in the last year almost certainly had no perceptible impact on retrieving search results. Put another way, a 30% increase in the size of the corpus does not result in a proportional increase in load time. We should theoretically be able to increase to a million texts without a perceptible increase in load times for the end user. We recently performed a 'smoke test', importing our corpus texts twice, resulting in ~20,000 texts. Subsequent searches did not take perceptibly longer; results displayed, even for the most memory-intensive phrase searches, in less than 1 second.
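For illustration, here is a minimal sketch of that keyed-index idea using SQLite; the table and column names are hypothetical, and the real crow_backend schema will differ:

```python
# Simplified sketch of the keyed-index approach described above.
# Table and column names are hypothetical, not the actual crow_backend schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE texts (
    text_id INTEGER PRIMARY KEY,
    body    TEXT NOT NULL
);
CREATE TABLE tokens (
    word        TEXT NOT NULL,      -- the tokenized word
    text_id     INTEGER NOT NULL,   -- which corpus text it appears in
    occurrences INTEGER NOT NULL,   -- how many times it occurs in that text
    FOREIGN KEY (text_id) REFERENCES texts (text_id)
);
-- The keyed index: lookups by word touch only the matching rows,
-- so query time depends on the number of matches, not the corpus size.
CREATE INDEX idx_tokens_word ON tokens (word);
""")

# Finding every text containing a word is an index lookup, not a full scan.
matches = conn.execute(
    "SELECT text_id, occurrences FROM tokens WHERE word = ?", ("however",)
).fetchall()
```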
The frequency table results we display do require real-time calculation. Those operations are performed in CPU memory after the matching texts have been retrieved through the process above. However, we pre-index a fair amount of information to reduce the CPU work involved: each token already carries data on which corpus text it is present in and how many times it occurs in that text. This makes the frequency calculation simpler than having to find the number of instances in a subset of the corpus and calculate their frequency from scratch. Phrase searches are another matter, since those can't be indexed. For that operation, the contents of the matching texts need to be searched in real time, but still in CPU memory, not through files (which would be much slower). We will theoretically hit a CPU memory limit at some point in the future, at which point we would have to pay for more CPU and memory. We should do more research, but it is unlikely we're near that yet, and the takeaway here is that CPU memory calculations shouldn't significantly affect the load time for the end user anyway.
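Continuing the hypothetical schema from the sketch above, here is a rough illustration of the difference: per-token frequency is a sum over pre-indexed counts, while a phrase search has to scan the retrieved text bodies in memory. The function names are illustrative, not crow_backend code:

```python
# Frequency vs. phrase search, on the hypothetical schema sketched earlier.

def token_frequency(conn, word, text_ids):
    """Sum the pre-indexed per-text counts for a token across a set of matching texts."""
    if not text_ids:
        return 0
    placeholders = ",".join("?" * len(text_ids))
    row = conn.execute(
        f"SELECT SUM(occurrences) FROM tokens "
        f"WHERE word = ? AND text_id IN ({placeholders})",
        (word, *text_ids),
    ).fetchone()
    return row[0] or 0

def phrase_matches(texts, phrase):
    """Phrase searches can't use the token index: scan the retrieved texts in memory."""
    return [text_id for text_id, body in texts if phrase in body]
```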
Given that folks are reporting slower load times -- albeit intermittently -- what's the cause, if it's not the size of the corpus? The cause could lie in several places in the page load cycle:
1. The backend is "warming" its cache of data, which would result in a slower load time on a first or second search, but normal load times after that (a minimal sketch of this effect appears after this list).
2. Logging of user actions, or logging of notices based on those actions, requires additional writes to the database, which would slow down the return time.
3. Additional user authentication actions the backend performs require separate operations that could slow things down somewhat.
4. The content delivery network (CDN), which takes care of providing static assets to browsers, could be temporarily slower, or may need to retrieve new assets, depending on what is already in the user's browser cache.
5. The user's internet service provider (ISP) could be a factor (less likely).
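To make the first point concrete, here is a minimal, hypothetical sketch of cache warming; run_search stands in for the real backend query, not actual crow_backend code:

```python
# Minimal illustration of cache "warming" (factor 1 above).
# run_search is a stand-in for the real backend query.
import functools
import time

@functools.lru_cache(maxsize=256)
def run_search(query):
    time.sleep(0.5)                 # pretend this is the expensive database work
    return f"results for {query!r}"

start = time.perf_counter()
run_search("writing process")       # cold cache: pays the full cost
print(f"cold: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
run_search("writing process")       # warm cache: served from memory
print(f"warm: {time.perf_counter() - start:.2f}s")
```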
One important point: all of the above can be considered "fixed costs," things that are not going to increase page load time as the corpus grows.
If we were to rearchitect the backend around an outsourced solution such as ElasticSearch, factors 1 & 2 above would be reduced, but 3, 4 and 5 would still be present, and we would be introducing another factor: the time it takes for the outsourced search engine to return its results to the backend so that the backend can manipulate & send those results to the end user.
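As a hedged illustration of that extra hop, the sketch below times a round trip to a hypothetical Elasticsearch-style HTTP endpoint; the host, index name, and field name are assumptions, not an actual Crow deployment:

```python
# Hypothetical sketch of the extra round trip an outsourced engine introduces.
# Assumes an Elasticsearch-compatible service at localhost:9200 with a "texts"
# index containing a "body" field; none of this reflects a real Crow setup.
import time
import requests

query = {"query": {"match_phrase": {"body": "on the other hand"}}}

start = time.perf_counter()
resp = requests.post("http://localhost:9200/texts/_search", json=query, timeout=10)
elapsed = time.perf_counter() - start

hits = resp.json()["hits"]["hits"]
# This network + search time is added on top of the backend's own processing,
# which still has to reshape the hits before returning them to the browser.
print(f"search round trip: {elapsed:.3f}s, {len(hits)} hits returned")
```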
Given this, I think a good action item would be to schedule a meeting to collaboratively replicate the circumstances where load times are greater, and then figure out where the bottlenecks are and look for performance tuning opportunities.
Separately, developers can use "profiling" tools to identify exactly what parts of the system are using the most memory and which are taking the longest time.
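As an example of what that looks like in Python (the backend's own stack may call for a different profiler), cProfile reports where time is being spent:

```python
# Example of profiling a code path with Python's built-in cProfile;
# expensive_search is a stand-in for whatever path is being investigated.
import cProfile
import pstats

def expensive_search():
    # stand-in for the code path being investigated
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
expensive_search()
profiler.disable()

# Show the functions that consumed the most cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```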