-
Notifications
You must be signed in to change notification settings - Fork 5.5k
[Do not Review]feat(connector): Add similiarity search capabilties using Jvector lib #26721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
bibith4
wants to merge
10
commits into
prestodb:master
Choose a base branch
from
bibith4:jvector-presto-integration-tech-preview
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
[Do not Review]feat(connector): Add similiarity search capabilties using Jvector lib #26721
bibith4
wants to merge
10
commits into
prestodb:master
from
bibith4:jvector-presto-integration-tech-preview
+1,935
−27
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- build indexes and mappings - store indexes and mappings files in s3 - map node id to row id - enable TVF Top ANN K search support Co-authored-by: Nivin C S <[email protected]> Co-authored-by: Shijin K <[email protected]>
|
Contributor
Reviewer's GuideAdds approximate nearest neighbor (ANN) similarity search capability to the Iceberg connector using the JVector library, including index building, S3 storage of index and node→row-id mappings, a TVF-based query interface, and connector wiring guarded by a new configuration flag. Sequence diagram for ANN table function query executionsequenceDiagram
actor User
participant PrestoEngine
participant IcebergConnector
participant IcebergMetadata as IcebergAbstractMetadata
participant SplitManager as IcebergSplitManager
participant PageSourceProvider as IcebergPageSourceProvider
participant ANNPageSource
participant JVectorIndex
User->>PrestoEngine: SELECT * FROM TABLE(approx_nearest_neighbors(ARRAY[...], 'schema.table.column', 10))
PrestoEngine->>IcebergConnector: resolve table function approx_nearest_neighbors
IcebergConnector->>IcebergConnector: getTableFunctions()
IcebergConnector-->>PrestoEngine: ApproxNearestNeighborsFunction
PrestoEngine->>ApproxNearestNeighborsFunction: analyze(arguments)
ApproxNearestNeighborsFunction-->>PrestoEngine: IcebergAnnTableFunctionHandle
PrestoEngine->>IcebergMetadata: applyTableFunction(handle)
IcebergMetadata->>IcebergMetadata: check icebergConfig.isSimilaritySearchEnabled()
IcebergMetadata-->>PrestoEngine: TableFunctionApplicationResult(IcebergAnnTableHandle, columnHandles)
PrestoEngine->>SplitManager: getSplits(IcebergAnnTableHandle)
SplitManager->>SplitManager: check icebergConfig.isSimilaritySearchEnabled()
SplitManager->>SplitManager: create IcebergSplit(ann=true, queryVector, topN)
SplitManager-->>PrestoEngine: FixedSplitSource(IcebergSplit)
PrestoEngine->>PageSourceProvider: createPageSource(split with ann=true)
PageSourceProvider->>PageSourceProvider: if similaritySearchEnabled and split.isAnn()
PageSourceProvider->>ANNPageSource: new ANNPageSource(FixedPageSource, queryVector, topN, tableLocation, HdfsFileIO)
PageSourceProvider-->>PrestoEngine: ANNPageSource
loop scan results
PrestoEngine->>ANNPageSource: getNextPage()
ANNPageSource->>JVectorIndex: load index and NodeRowIdMapping from S3 via HdfsFileIO
ANNPageSource->>JVectorIndex: search topN using queryVector
JVectorIndex-->>ANNPageSource: nodeIds with scores
ANNPageSource->>ANNPageSource: map nodeIds to rowIds
ANNPageSource-->>PrestoEngine: Page(row_id)
end
PrestoEngine-->>User: result rows with nearest neighbor row_id values
Sequence diagram for building a vector index via proceduresequenceDiagram
actor User
participant PrestoEngine
participant BuildVecProc as BuildVectorIndexProcedure
participant TxManager as IcebergTransactionManager
participant MetadataFactory as IcebergMetadataFactory
participant Metadata as ConnectorMetadata
participant PageSourceProvider as ConnectorPageSourceProvider
participant VectorBuilder as IcebergVectorIndexBuilder
participant JVector
participant HdfsFileIO
participant S3
User->>PrestoEngine: CALL system.CREATE_VEC_INDEX('catalog.schema.table.column')
PrestoEngine->>BuildVecProc: invoke buildVectorIndex(session, columnPath)
BuildVecProc->>BuildVecProc: parse catalog, schema, table, column
BuildVecProc->>MetadataFactory: create()
MetadataFactory-->>BuildVecProc: ConnectorMetadata
BuildVecProc->>TxManager: put(transactionHandle, metadata)
BuildVecProc->>VectorBuilder: buildAndSaveVectorIndex(metadata, pageSourceProvider, transactionHandle, session, schemaTableName, columnName, indexName, catalogName, similarityFunction, m, efConstruction)
VectorBuilder->>Metadata: getTableHandle(schemaTableName)
VectorBuilder->>Metadata: getTypeManager()
VectorBuilder->>Metadata: getIcebergTable(schemaTableName)
VectorBuilder->>HdfsFileIO: resolve tableLocation and FileIO
loop scan data files
VectorBuilder->>PageSourceProvider: createPageSource(split, layout, [vectorColumn, row_id])
PageSourceProvider-->>VectorBuilder: ConnectorPageSource
loop pages
VectorBuilder->>VectorBuilder: read vectors and rowIds into memory
end
end
VectorBuilder->>JVector: normalize vectors and build ImmutableGraphIndex using GraphIndexBuilder
VectorBuilder->>VectorBuilder: create NodeRowIdMapping(rowIds)
VectorBuilder->>HdfsFileIO: open temp local files for index and mapping
VectorBuilder->>VectorBuilder: write OnDiskGraphIndex and mapping to temp files
VectorBuilder->>S3: upload index and mapping to tableLocation/.vector_index via HdfsOutputFile
VectorBuilder-->>BuildVecProc: Path to index file
BuildVecProc->>TxManager: remove(transactionHandle)
BuildVecProc-->>PrestoEngine: success
PrestoEngine-->>User: procedure completed
Updated class diagram for ANN similarity search and index supportclassDiagram
class IcebergConfig {
- boolean similaritySearchEnabled
+ boolean isSimilaritySearchEnabled()
+ IcebergConfig setSimilaritySearchEnabled(boolean similaritySearchEnabled)
}
class IcebergConnector {
- Set~ConnectorTableFunction~ connectorTableFunctions
+ Set~ConnectorTableFunction~ getTableFunctions()
}
class IcebergAbstractMetadata {
- IcebergConfig icebergConfig
+ TypeManager getTypeManager()
+ Optional~TableFunctionApplicationResult~ applyTableFunction(ConnectorSession session, ConnectorTableFunctionHandle handle)
}
class IcebergSplit {
- boolean ann
- List~Float~ queryVector
- int topN
+ boolean isAnn()
+ List~Float~ getQueryVector()
+ int getTopN()
}
class IcebergSplitManager {
- IcebergConfig icebergConfig
+ ConnectorSplitSource getSplits(ConnectorTransactionHandle transaction, ConnectorSession session, ConnectorTableHandle table, SplitSchedulingContext splitSchedulingContext)
}
class IcebergTableLayoutHandle {
- Optional~String~ tableLocation
+ Optional~String~ getTableLocation()
}
class IcebergTableLayoutHandle_Builder {
- Optional~String~ tableLocation
+ IcebergTableLayoutHandle_Builder setTableLocation(Optional~String~ tableLocation)
+ IcebergTableLayoutHandle build()
}
class ApproxNearestNeighborsFunction {
<<table_function_provider>>
+ ConnectorTableFunction get()
}
class ApproxNearestNeighborsFunction_QueryFunction {
+ TableFunctionAnalysis analyze(ConnectorSession session, ConnectorTransactionHandle transaction, Map~String,Argument~ arguments)
}
class ApproxNearestNeighborsFunction_QualifiedNameParts {
- String schema
- String table
- String column
+ String getSchema()
+ String getTable()
+ String getColumn()
}
class ApproxNearestNeighborsFunction_IcebergAnnTableHandle {
- List~Float~ queryVector
- int limit
+ ApproxNearestNeighborsFunction_IcebergAnnTableHandle(List~Float~ queryVector, int limit, String schema, String table)
+ List~Float~ getInputVector()
+ int getLimit()
}
class ApproxNearestNeighborsFunction_IcebergAnnTableFunctionHandle {
- List~Float~ queryVector
- int limit
- ConnectorTableHandle tableHandle
- List~ColumnHandle~ columnHandles
+ ApproxNearestNeighborsFunction_IcebergAnnTableFunctionHandle(String schema, String table, ScalarArgument inputVector, ScalarArgument limit, List~ColumnHandle~ columnHandles)
+ List~Float~ getInputVector()
+ int getLimit()
+ ConnectorTableHandle getTableHandle()
+ List~ColumnHandle~ getColumnHandles()
}
class ANNPageSource {
- ConnectorPageSource delegate
- List~Float~ queryVector
- int topN
- boolean finished
- String tableLocation
- HdfsFileIO hdfsFileIO
+ ANNPageSource(ConnectorPageSource delegate, List~Float~ queryVector, int topN, String tableLocation, HdfsFileIO hdfsFileIO)
+ Page getNextPage()
+ long getCompletedBytes()
+ long getCompletedPositions()
+ long getReadTimeNanos()
+ boolean isFinished()
+ long getSystemMemoryUsage()
+ void close()
}
class IcebergVectorIndexBuilder {
+ static Path buildAndSaveVectorIndex(ConnectorMetadata metadata, ConnectorPageSourceProvider pageSourceProvider, ConnectorTransactionHandle transactionHandle, ConnectorSession session, SchemaTableName schemaTableName, String columnName, String indexName, String catalogName, String similarityFunction, int m, int efConstruction)
- static VectorData readVectorsFromTable(ConnectorMetadata metadata, ConnectorPageSourceProvider pageSourceProvider, ConnectorTransactionHandle transactionHandle, ConnectorSession session, SchemaTableName schemaTableName, String columnName)
- static IcebergTableLayoutHandle createTableLayoutHandle(IcebergTableHandle tableHandle, List~IcebergColumnHandle~ columns)
- static VectorSimilarityFunction getVectorSimilarityFunction(String similarityFunction)
}
class IcebergVectorIndexBuilder_VectorData {
- List~float[]~ vectors
- List~Long~ rowIds
}
class ListRandomAccessVectorValues {
- List~float[]~ vectors
- int dimension
- Constructor arrayVectorFloatConstructor
+ ListRandomAccessVectorValues(List~float[]~ vectors, int dimension)
+ int size()
+ int dimension()
+ VectorFloat getVector(int ord)
+ boolean isValueShared()
+ RandomAccessVectorValues copy()
}
class CustomVectorFloat {
- float[] values
+ CustomVectorFloat(float[] values)
+ VectorFloat toArrayVectorFloat()
+ CustomVectorFloat get()
+ float get(int index)
+ void set(int index, float value)
+ int length()
+ void copyFrom(VectorFloat src, int srcOffset, int destOffset, int length)
+ void zero()
+ int getHashCode()
+ long ramBytesUsed()
+ float[] vectorValue()
+ CustomVectorFloat copy()
+ float[] getFloatArray()
}
class NodeRowIdMapping {
- long[] nodeToRowId
+ NodeRowIdMapping(List~Long~ rowIds)
+ long getRowId(int nodeId)
+ int size()
+ void save(OutputStream out)
+ static NodeRowIdMapping load(InputStream in)
}
class BuildVectorIndexProcedure {
- IcebergTransactionManager transactionManager
- IcebergMetadataFactory metadataFactory
- ConnectorPageSourceProvider pageSourceProvider
+ BuildVectorIndexProcedure(IcebergTransactionManager transactionManager, IcebergMetadataFactory metadataFactory, ConnectorPageSourceProvider pageSourceProvider)
+ Procedure get()
+ void buildVectorIndex(ConnectorSession session, String columnPath)
}
IcebergConfig <|.. IcebergAbstractMetadata
IcebergConfig <|.. IcebergSplitManager
IcebergConfig <|.. IcebergConnector
IcebergConnector o--> ConnectorTableFunction
IcebergConnector --> ApproxNearestNeighborsFunction
IcebergAbstractMetadata --> ApproxNearestNeighborsFunction_IcebergAnnTableFunctionHandle : uses
IcebergAbstractMetadata --> IcebergTableLayoutHandle : setsTableLocation
IcebergSplitManager --> ApproxNearestNeighborsFunction_IcebergAnnTableHandle : checksInstance
IcebergSplitManager --> IcebergSplit : createsAnnSplit
IcebergTableLayoutHandle_Builder --> IcebergTableLayoutHandle : builds
ApproxNearestNeighborsFunction_QueryFunction --> ApproxNearestNeighborsFunction_IcebergAnnTableFunctionHandle : creates
ApproxNearestNeighborsFunction_IcebergAnnTableFunctionHandle --> ApproxNearestNeighborsFunction_IcebergAnnTableHandle : wraps
ANNPageSource --> NodeRowIdMapping : uses
ANNPageSource --> IcebergVectorIndexBuilder : consumesIndex
IcebergVectorIndexBuilder --> IcebergVectorIndexBuilder_VectorData : uses
IcebergVectorIndexBuilder --> NodeRowIdMapping : creates
IcebergVectorIndexBuilder --> ListRandomAccessVectorValues : uses
IcebergVectorIndexBuilder --> CustomVectorFloat : uses
BuildVectorIndexProcedure --> IcebergVectorIndexBuilder : calls
BuildVectorIndexProcedure --> IcebergTransactionManager : uses
BuildVectorIndexProcedure --> IcebergMetadataFactory : uses
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
be49301 to
47ebd99
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Motivation and Context
Impact
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.
If release note is NOT required, use: