Merged
5 changes: 5 additions & 0 deletions docs/explain/index.md
Original file line number Diff line number Diff line change
@@ -13,6 +13,11 @@ about applications and use cases of CrateDB, trying to put things into a
bigger picture and joining things together to help answer the question _why_?


:::{rubric} 2021
:::

- {ref}`indexing-and-storage`

:::{rubric} 2018
:::

1 change: 1 addition & 0 deletions docs/feature/document/index.md
@@ -1,3 +1,4 @@
(container)=
(document)=
(object)=

20 changes: 6 additions & 14 deletions docs/feature/index/index.md
@@ -84,23 +84,16 @@ with solutions from other vendors.



::::{info-card}
:::{grid-item}
:columns: auto 9 9 9
**Blog: Indexing and Storage in CrateDB**

{{ '{}[Indexing and Storage in CrateDB]'.format(blog) }}

::::{card} Blog: Indexing and Storage in CrateDB
:link: indexing-and-storage
:link-type: ref
Learn about the fundamentals of the CrateDB storage layer,
looking at the three main Lucene structures that are used within CrateDB:
Inverted Indexes for text values, BKD-trees for numeric values, and Doc Values.
:::
:::{grid-item}
:columns: auto 3 3 3
{tags-primary}`Fundamentals` \
Inverted indexes for text values, BKD trees for numeric values, and doc values.
+++
{tags-primary}`Fundamentals`
{tags-secondary}`Converged Indexing`
{tags-secondary}`Deep Dive`
:::
::::


@@ -159,7 +152,6 @@ bit thin.
[Elasticsearch for Dummies]: https://dzone.com/articles/elasticsearch-for-dummies
[Elasticsearch: Documents and Indices]: https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html
[Independent comparison of CrateDB and MongoDB using Time Series Benchmark Suite]: https://blog.nyrkio.com/wp-content/uploads/2024/07/Nyrkio-comparison-of-CrateDB-and-MongoDB-with-TSBS-v2.pdf
[Indexing and Storage in CrateDB]: https://cratedb.com/blog/indexing-and-storage-in-cratedb
[Searching and Indexing With Apache Lucene]: https://dzone.com/articles/apache-lucene-a-high-performance-and-full-featured
[Time Series Benchmark on CrateDB and MongoDB]: https://blog.nyrkio.com/2024/07/11/timeseries-benchmark-on-cratedb-and-mongodb/
[TimescaleDB Time Series Benchmark Suite (TSBS)]: https://github.com/timescale/tsbs
36 changes: 9 additions & 27 deletions docs/feature/search/fts/index.md
@@ -61,6 +61,7 @@ of a [search engine].
- {ref}`vector-search`
- {ref}`hybrid-search`
- {ref}`query`
- {ref}`storage-layer`

{tags-primary}`SQL`
{tags-primary}`Full-Text Search`
@@ -301,41 +302,23 @@ by exploring how to manage a dataset of Netflix titles.
:::{rubric} Explanations
:::

::::{info-card}
:::{grid-item}
:columns: auto 9 9 9
**Indexing and Storage in CrateDB**
:::{card} Indexing and Storage in CrateDB
:link: indexing-and-storage
:link-type: ref

This article explores the internal workings of the storage layer in CrateDB,
with a focus on Lucene's indexing strategies.

{hyper-navigate}`Indexing and Storage in CrateDB <[Indexing and Storage in CrateDB]>`

The CrateDB storage layer is based on Lucene indexes.
Lucene offers scalable and high-performance indexing which enables efficient search
Lucene offers scalable and high-performance indexing, which enables efficient search
and aggregations over documents and rapid updates to the existing documents.
We will look at the three main Lucene structures that are used within CrateDB:
Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values.

:Inverted Index:
You will learn how inverted indexes are implemented in Lucene and CrateDB.

:BKD Tree:
Better understand the BKD tree, starting from KD trees, and how this data
structure supports range queries in CrateDB.

:Doc Values:
This data structure supports more efficient querying document fields by id,
performs column-oriented retrieval of data, and improves the performance of
aggregation and sorting operations.
+++
CrateDB uses three important Lucene data structures:
Inverted indexes for text values, BKD trees for numeric values, and doc values.

:::
:::{grid-item}
:columns: auto 3 3 3
{tags-primary}`Introduction` \
{tags-primary}`Introduction`
{tags-secondary}`Lucene Indexing`
:::
::::


:::{card} Indexing Text for Both Effective Search and Accurate Analysis
@@ -374,7 +357,6 @@ effective-search
[BM25: The Next Generation of Lucene Relevance]: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
[BM25 vs. Lucene Default Similarity]: https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
[full-text search]: https://en.wikipedia.org/wiki/Full_text_search
[Indexing and Storage in CrateDB]: https://cratedb.com/blog/indexing-and-storage-in-cratedb
[MATCH predicate]: inv:crate-reference#predicates_match
[Okapi BM25]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/okapi_trec3.pdf
[search engine]: https://en.wikipedia.org/wiki/Search_engine
25 changes: 17 additions & 8 deletions docs/feature/storage/index.md
@@ -1,3 +1,4 @@
(storage-internals)=
(storage-layer)=
# Storage Layer

@@ -11,8 +12,8 @@ The CrateDB storage layer is based on Lucene.
By default, all fields are indexed,
nested or not, but the indexing can be turned off selectively.

This page enumerates some concepts of Lucene. The article [Indexing and Storage in
CrateDB] goes into more details by exploring its internal workings.
This page enumerates some concepts of Lucene. The article {ref}`indexing-and-storage`
goes into more details by exploring its internal workings.

## Lucene

@@ -49,7 +50,7 @@ Elasticsearch are building upon the same technologies.
## Data structures

CrateDB uses three main data structures of Lucene:
Inverted indexes for text values, BKD trees for numeric values, and DocValues.
Inverted indexes for text values, BKD trees for numeric values, and doc values.

- **Inverted index**

@@ -69,7 +70,7 @@ Inverted indexes for text values, BKD trees for numeric values, and DocValues.

To optimize numeric range queries, Lucene uses an implementation of the Block KD (BKD)
tree data structure. The BKD tree index structure is suitable for indexing large
multi-dimensional point data sets. It is an I/O-efficient dynamic data structure based
multidimensional point data sets. It is an I/O-efficient dynamic data structure based
on the KD tree. Contrary to its predecessors, the BKD tree maintains its high space
utilization and excellent query and update performance regardless of the number of
updates performed on it.
@@ -78,17 +79,25 @@ Inverted indexes for text values, BKD trees for numeric values, and DocValues.
including fields defined as `TIMESTAMP` types, supporting performant date range
queries.

- **DocValues**
- **Doc values**

Because Lucene's inverted index data structure implementation is not optimal for
finding field values by given document identifier, and for performing column-oriented
retrieval of data, the DocValues data structure is used for those purposes instead.
retrieval of data, the doc values data structure is used for those purposes instead.

DocValues is a column-based data storage built at document index time. They store
Doc values is a column-based data storage built at document index time. They store
all field values that are not analyzed as strings in a compact column, making it more
effective for sorting and aggregations.

## See also

- {ref}`indexing-and-storage`


:::{toctree}
:hidden:
indexing-and-storage
:::


[column-based store]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/storage.html
[Indexing and Storage in CrateDB]: https://cratedb.com/blog/indexing-and-storage-in-cratedb