Description
Is your feature request related to a problem? Please describe
We are proposing a roadmap item for OpenSearch query performance improvement, focused on the IO characteristics of queries.
Query performance in OpenSearch depends largely on the performance of the underlying storage device (for cases where the whole working set of a query cannot fit into memory).
We ran an experiment with the Big5 workload on an OpenSearch deployment whose underlying storage device has a per-seek latency 3x that of Amazon EBS, and we found that query performance on a cold page cache shows a 5x regression (even for a simple term query).
We believe there are ways to hide the IO latency of the underlying storage device so that it does not translate directly into query latency. To prove this, we performed a number of experiments to validate our hypothesis and to improve the performance of the term query on the Big5 workload.
And with all the optimizations combined, we were able to improve query latencies (for the simple term query in Big5) by 5x.
Our high-level proposal is to make sure that the OpenSearch process is able to saturate the underlying resources (ideally the CPU) when a query arrives, or until we hit the critical chain of reads limit.
In order to achieve this goal of improving cold-start query latency, first and foremost we need benchmarks that exercise the IO path for queries.
Next, we will use profiles from these benchmarks to identify opportunities for parallel and asynchronous IO. Lucene already has some form of this through prefetch and inter- and intra-segment concurrency. However, these optimizations either lack complete coverage in terms of maximum IO parallelism, or their defaults are tuned assuming the data is already available in the page cache.
Third, we are also looking into implementing our own buffer pool with direct IO and io_uring, to gain more control over IO and memory management. We will instrument the page access path and optimize the query path depending on access patterns and data locality.
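As a rough illustration of the direction (not a concrete design), here is a minimal sketch of such a buffer pool in Java, assuming the JDK's `ExtendedOpenOption.DIRECT` for O_DIRECT reads; the class, page size, and eviction policy are hypothetical placeholders, and a real implementation would add io_uring submission and a smarter replacement policy:

```java
import com.sun.nio.file.ExtendedOpenOption;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: an LRU page pool fronting a file opened with O_DIRECT. */
final class DirectIoBufferPool implements AutoCloseable {
    private static final int PAGE_SIZE = 4096; // assumed block size; must match the device

    private final FileChannel channel;
    private final int maxPages;
    // accessOrder=true turns this into an LRU map; eldest pages are evicted on overflow.
    private final LinkedHashMap<Long, ByteBuffer> pages;

    DirectIoBufferPool(Path file, int maxPages) throws IOException {
        // O_DIRECT bypasses the kernel page cache, so the application decides residency.
        this.channel = FileChannel.open(file, StandardOpenOption.READ, ExtendedOpenOption.DIRECT);
        this.maxPages = maxPages;
        this.pages = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> eldest) {
                return size() > DirectIoBufferPool.this.maxPages;
            }
        };
    }

    /** Returns the page containing {@code offset}, reading it with direct IO on a miss. */
    synchronized ByteBuffer page(long offset) throws IOException {
        long pageNo = offset / PAGE_SIZE;
        ByteBuffer cached = pages.get(pageNo);
        if (cached != null) {
            return cached.duplicate();
        }
        // O_DIRECT requires buffer address, file offset, and length to be block-aligned.
        ByteBuffer buf = ByteBuffer.allocateDirect(PAGE_SIZE * 2).alignedSlice(PAGE_SIZE);
        buf.limit(PAGE_SIZE);
        channel.read(buf, pageNo * (long) PAGE_SIZE);
        buf.flip();
        pages.put(pageNo, buf);
        return buf.duplicate();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```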
Describe the solution you'd like
In principle, we are proposing the following changes in OpenSearch:
- Benchmarking enhancement
- Non-blocking IO and buffer pool
- IO concurrency
- Prefetching and per query optimizations
- Fetch phase optimization
Benchmarking enhancement
For benchmarking enhancements, we have the following proposals.
- Build cold start testing into opensearch-benchmark.
- OpenSearch benchmarks don't have enough randomization to simulate an "average" case. We want to reproduce a scenario where there is variation in query inputs (field names and values) as well as query types, which relates more closely to an actual user workload (see the sketch after this list).
- Fetch phase benchmarking, focusing on how performance varies with the number of documents that need to be returned as part of a query, is not included in the existing benchmarks.
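To make the randomization idea concrete, here is a hypothetical Java sketch of a parameter source that varies query type, field, and value per iteration; the field names and query shapes are placeholders, not the actual workload definition:

```java
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: vary query type, field, and value per benchmark iteration. */
final class RandomizedQuerySource {
    // Field names are placeholders; a real workload would sample them from the corpus.
    private static final List<String> FIELDS = List.of("process.name", "log.level", "agent.version");
    private static final List<String> TYPES = List.of("term", "match", "range");

    private final Random random = new Random(42); // fixed seed keeps runs reproducible

    String nextQuery(List<String> sampledValues) {
        String type = TYPES.get(random.nextInt(TYPES.size()));
        String field = FIELDS.get(random.nextInt(FIELDS.size()));
        String value = sampledValues.get(random.nextInt(sampledValues.size()));
        return switch (type) {
            case "term" -> "{\"query\":{\"term\":{\"" + field + "\":\"" + value + "\"}}}";
            case "match" -> "{\"query\":{\"match\":{\"" + field + "\":\"" + value + "\"}}}";
            default -> "{\"query\":{\"range\":{\"" + field + "\":{\"gte\":\"" + value + "\"}}}}";
        };
    }
}
```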
Non-blocking IO and buffer pool
Non-blocking IO helps hide IO latency by enabling us to submit more concurrent IO requests and utilize compute more efficiently. Our focus is going to be on the ability to submit as many IO requests concurrently as the instance supports, or as many as possible until we hit the critical chain of reads limit.
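For illustration, a minimal sketch of the submission pattern we are after, using the JDK's `AsynchronousFileChannel` (which on Linux is backed by a hidden thread pool rather than true kernel async IO, which is part of why we are also looking at io_uring below); the file, offsets, and read length are placeholders:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

/** Sketch: submit all reads up front so the device sees a deep queue, then await results. */
final class ConcurrentReads {
    static List<ByteBuffer> readAll(Path file, List<Long> offsets, int length) throws Exception {
        try (AsynchronousFileChannel ch =
                AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            List<Future<Integer>> pending = new ArrayList<>();
            List<ByteBuffer> buffers = new ArrayList<>();
            for (long offset : offsets) {
                ByteBuffer buf = ByteBuffer.allocate(length);
                buffers.add(buf);
                pending.add(ch.read(buf, offset)); // submitted immediately, completes later
            }
            for (Future<Integer> f : pending) {
                f.get(); // block only after every read has been submitted
            }
            return buffers;
        }
    }
}
```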
Java supports virtual threads as of JDK 21 (out of preview), delivered as part of Project Loom. Virtual threads are lightweight user-mode threads that run on top of platform threads.
We are proposing to use a virtual thread per search request and per segment slice for query execution, instead of relying on the search and searcher threadpools (or in addition to them).
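A minimal sketch of that threading model, assuming the slice tasks are handed in by the caller (the real change would hook into the query phase itself):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: one virtual thread per segment slice instead of a fixed searcher pool. */
final class VirtualThreadSliceSearch {
    static <R> List<R> searchSlices(List<Callable<R>> sliceTasks) throws Exception {
        // Virtual threads are cheap to create, so there is no pool to size or tune.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<R>> futures = new ArrayList<>();
            for (Callable<R> task : sliceTasks) {
                futures.add(executor.submit(task)); // one virtual thread per slice
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocking here parks, it does not burn a core
            }
            return results;
        }
    }
}
```

Note that this only pays off if the IO underneath does not pin the carrier thread, which is exactly the problem described next.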
Through experiments we found that the mmap and niofs directories do not provide application-level non-blocking IO, because a virtual thread performing blocking file IO pins its carrier thread. So we are going to build an io_uring-based directory, which our experiments have shown to be capable of submitting more concurrent IO requests at the application level.
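To make the interplay concrete, here is a hypothetical skeleton; `IoUringQueue` and `submitRead` are stand-ins for a binding we would build (e.g. over the Panama FFI), not an existing JDK or Lucene API:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical skeleton: IoUringQueue stands in for an io_uring binding we would
 * build (e.g. over the Panama FFI); submitRead is not an existing JDK or Lucene API.
 */
interface IoUringQueue {
    /** Enqueues a read SQE; the future completes once its CQE has been reaped. */
    CompletableFuture<Integer> submitRead(int fd, ByteBuffer dst, long offset);
}

final class IoUringReader {
    private final IoUringQueue queue;
    private final int fd;

    IoUringReader(IoUringQueue queue, int fd) {
        this.queue = queue;
        this.fd = fd;
    }

    /**
     * A blocking read built on the async submission: join() parks the calling virtual
     * thread without pinning its carrier, unlike blocking reads through mmap/niofs.
     */
    ByteBuffer read(long offset, int length) {
        ByteBuffer buf = ByteBuffer.allocateDirect(length);
        queue.submitRead(fd, buf, offset).join();
        buf.flip();
        return buf;
    }
}
```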
IO concurrency
Another workstream we are going to focus on is IO concurrency, and our primary ways to achieve it are concurrent segment search and intra-segment search. Apart from our existing performance testing mechanisms, we are also proposing to test intra-segment search and concurrent segment search on cold start and average cases.
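For reference, Lucene's stock inter-segment (slice-level) concurrency is enabled by constructing the `IndexSearcher` with an executor; a minimal sketch with a placeholder term query:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: Lucene runs one task per segment slice when the searcher gets an executor. */
final class ConcurrentSegmentSearch {
    static TopDocs search(Directory dir, String field, String value) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(dir);
                ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            // Passing an executor enables inter-segment (slice-level) concurrency;
            // intra-segment concurrency additionally splits a single large segment.
            IndexSearcher searcher = new IndexSearcher(reader, executor);
            return searcher.search(new TermQuery(new Term(field, value)), 10);
        }
    }
}
```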
Prefetching and per-query IO concurrency
We also believe there is a lot of potential for IO concurrency outside of stock inter- and intra-segment search. For instance, as part of our experiments we dove deep into the term query and found that, even with concurrent segment search and intra-segment search, a significant amount of IO (the term lookup in the term dictionary) was happening sequentially. We also didn't see much performance difference between runs with and without concurrent segment search across different query types, especially in cold start tests. This makes us think there is a huge opportunity for per-query IO concurrency and prefetching enhancements (see the sketch below).
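As a minimal sketch of the kind of per-query concurrency we mean, the per-segment term dictionary lookup can be fanned out instead of performed leaf by leaf; a production version would instead build on Lucene's prefetch hooks such as `IndexInput#prefetch`:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: overlap per-segment term dictionary lookups instead of seeking sequentially. */
final class ParallelTermLookup {
    static List<Boolean> seekAll(IndexReader reader, String field, BytesRef term) throws Exception {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Boolean>> pending = new ArrayList<>();
            for (LeafReaderContext leaf : reader.leaves()) {
                // One lookup task per segment, so the term dictionary reads overlap.
                pending.add(executor.submit(() -> {
                    Terms terms = leaf.reader().terms(field);
                    if (terms == null) {
                        return false;
                    }
                    TermsEnum te = terms.iterator();
                    return te.seekExact(term);
                }));
            }
            List<Boolean> found = new ArrayList<>();
            for (Future<Boolean> f : pending) {
                found.add(f.get());
            }
            return found;
        }
    }
}
```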
Fetch phase optimization
In general, query phase latency dominates fetch phase latency for most OpenSearch benchmark workloads, and a lot of users fetch only a small number of docs. Most OpenSearch benchmark workloads don't fetch many docs per query, and even the profile API does not include fetch phase latency. This makes us think that the fetch phase has not been a top concern for search performance so far, especially on fast storage subsystems. However, in our experiments on slower storage devices, we saw that fetching even a small number of documents causes a large regression in latencies, which we were able to improve significantly as mentioned in #18780 (see the sketch below).
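One way to hide fetch-phase IO latency, shown here as a hedged sketch rather than the actual change in #18780, is to retrieve stored fields for the top hits concurrently instead of one at a time:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: fetch stored fields for the top hits concurrently rather than one by one. */
final class ConcurrentFetchPhase {
    static List<Document> fetch(IndexReader reader, int[] docIds) throws Exception {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Document>> pending = new ArrayList<>();
            for (int docId : docIds) {
                // StoredFields instances are not thread-safe, so obtain one per task.
                pending.add(executor.submit(() -> reader.storedFields().document(docId)));
            }
            List<Document> docs = new ArrayList<>();
            for (Future<Document> f : pending) {
                docs.add(f.get());
            }
            return docs;
        }
    }
}
```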
Related component
Benchmarking enhancement - (opensearch-project/opensearch-benchmark-workloads#684)
Bufferpool - (#18838) (#18873)
io interface - (#18839, #18931)
virtual threads - (#18840)
Prefetching and per-query IO concurrency - (#18841)
FetchPhase optimization - (#18780)
IO concurrency in term query - (#18782)
Intra-segment search - (#18338)
BKD prefetching - (apache/lucene#15376, apache/lucene#15197)
Describe alternatives you've considered
We initially thought that virtual threads, with as much concurrency as possible (through inter-segment and intra-segment search), should be sufficient to improve latencies on slower storage devices. But in our experiments, we didn't see much improvement from that alone.
Additional context
No response