-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First draft of relationship embeddings #379
base: dev
Are you sure you want to change the base?
Conversation
WalkthroughThe pull request introduces a new Changes
Sequence DiagramsequenceDiagram
participant User
participant RelationshipModel
participant LanceDBAdapter
participant VectorDB
User->>RelationshipModel: Create Relationship
RelationshipModel-->>User: Relationship Instance
User->>LanceDBAdapter: create_relationships(collection, relationships)
LanceDBAdapter->>VectorDB: Generate Embeddings
LanceDBAdapter->>VectorDB: Merge Insert Relationships
VectorDB-->>LanceDBAdapter: Confirmation
LanceDBAdapter-->>User: Relationship Stored
Poem
Tip CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (4)
cognee/infrastructure/engine/models/Relationship.py (1)
45-53
: Consider security implications of using pickle for serializationUsing
pickle
for serialization can be insecure if untrusted data is deserialized, as it may lead to arbitrary code execution. Sincepickle
is not safe against erroneous or maliciously constructed data, consider using a safer serialization method if the data might come from untrusted sources.If security is a concern, prefer using
to_json
andfrom_json
methods for serialization and deserialization:# Remove pickle methods if not needed -def to_pickle(self) -> bytes: - """Serialize the instance to pickle-compatible bytes.""" - return pickle.dumps(self.dict()) - -@classmethod -def from_pickle(cls, pickled_data: bytes): - """Deserialize the instance from pickled bytes.""" - data = pickle.loads(pickled_data) - return cls(**data)cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (3)
96-100
: DefineLanceRelationship
class at the module levelDefining the
LanceRelationship
class inside the method can cause issues with pickling, serialization, and type annotations. It's generally better to define classes at the module level for clarity and reusability.Move the
LanceRelationship
class definition outside of thecreate_relationships
method:+# Define LanceRelationship at the module level class LanceRelationship(LanceModel): id: str vector: Vector(vector_size) payload: dict
Adjust the method to access
vector_size
appropriately.
116-116
: ReplaceUsing
logging
module to log messages, which provides better control over logging levels and output destinations.Apply this diff to replace the
-import asyncio +import asyncio +import logging +logger = logging.getLogger(__name__) # ... print(f"Inserted {len(relationships)} relationships into LanceDB") +logger.info(f"Inserted {len(relationships)} relationships into LanceDB")Ensure that the logging configuration is properly set up elsewhere in your application.
112-115
: Add error handling for database operationsCurrently, there is no exception handling around the database operations when inserting relationships. To improve robustness, consider adding a try-except block to catch and handle any exceptions that might occur during the operation.
Apply this diff to add error handling:
try: await collection.merge_insert("id") \ .when_matched_update_all() \ .when_not_matched_insert_all() \ .execute(lance_relationships) +except Exception as e: + logger.error(f"Failed to insert relationships into LanceDB: {e}") + # Handle exception accordinglyEnsure that the
logger
is defined and properly configured.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
(2 hunks)cognee/infrastructure/engine/__init__.py
(1 hunks)cognee/infrastructure/engine/models/Relationship.py
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cognee/infrastructure/engine/init.py
🔇 Additional comments (3)
cognee/infrastructure/engine/models/Relationship.py (2)
40-43
: Verify Pydantic version for model_validate_json
method
The method model_validate_json
is available in Pydantic version 2 and above. If the codebase is using Pydantic version 1, you should use parse_raw
instead.
Please verify the Pydantic version being used. If it's version 1.x, update the method as follows:
@classmethod
def from_json(cls, json_str: str):
"""Deserialize the instance from a JSON string."""
- return cls.model_validate_json(json_str)
+ return cls.parse_raw(json_str)
62-62
: Verify Pydantic version for model_validate
method
The method model_validate
is available in Pydantic version 2 and above. If the codebase is using Pydantic version 1, you should use parse_obj
instead.
Please verify the Pydantic version being used. If it's version 1.x, update the method as follows:
def from_dict(cls, data: Dict[str, Any]) -> "Relationship":
"""Deserialize model from a dictionary."""
- return cls.model_validate(data)
+ return cls.parse_obj(data)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)
82-83
: Verify that create_collection
correctly handles the Relationship
model
When creating a collection with the Relationship
model as the payload schema, ensure that any private attributes or methods are excluded from the schema to prevent unintended data from being stored.
Check if get_data_point_schema
properly processes the Relationship
model:
def get_data_point_schema(self, model_type):
return copy_model(
model_type,
include_fields = {
"id": (str, ...),
+ # Include necessary fields explicitly
+ "source_id": (UUID, ...),
+ "target_id": (UUID, ...),
+ "relationship_type": (str, ...),
+ "weight": (Optional[float], None),
+ # Exclude private attributes and methods
},
exclude_fields = ["_metadata"],
)
Alternatively, consider defining a specific schema for the collection that includes only the required fields.
|
||
def get_embeddable_properties(self): | ||
"""Retrieve embeddable properties for edge embeddings.""" | ||
return {field: getattr(self, field, None) for field in self._metadata["index_fields"]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure _metadata
is not None
before accessing keys
In the get_embeddable_properties
method, accessing self._metadata["index_fields"]
without checking if _metadata
is not None
can lead to an exception if _metadata
is None
.
Apply this diff to add a check:
def get_embeddable_properties(self):
+ if self._metadata and "index_fields" in self._metadata:
return {field: getattr(self, field, None) for field in self._metadata["index_fields"]}
+ return {}
This ensures that you only access index_fields
when _metadata
is properly initialized.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
return {field: getattr(self, field, None) for field in self._metadata["index_fields"]} | |
if self._metadata and "index_fields" in self._metadata: | |
return {field: getattr(self, field, None) for field in self._metadata["index_fields"]} | |
return {} |
"""Retrieve names of embeddable properties.""" | ||
return self._metadata["index_fields"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle potential None
in _metadata
when accessing index_fields
Similarly, in the get_embeddable_property_names
method, ensure that _metadata
is not None
before accessing index_fields
.
Apply this diff to add a check:
def get_embeddable_property_names(self):
+ if self._metadata and "index_fields" in self._metadata:
return self._metadata["index_fields"]
+ return []
This prevents potential TypeError
or KeyError
exceptions.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"""Retrieve names of embeddable properties.""" | |
return self._metadata["index_fields"] | |
def get_embeddable_property_names(self): | |
"""Retrieve names of embeddable properties.""" | |
if self._metadata and "index_fields" in self._metadata: | |
return self._metadata["index_fields"] | |
return [] |
_metadata: Optional[RelationshipMetaData] = { | ||
"index_fields": [], | ||
"type": "Relationship" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct the _metadata
field type or default value
The _metadata
field is annotated as Optional[RelationshipMetaData]
but is assigned a default value that includes the key "type"
, which is not defined in the RelationshipMetaData
TypedDict. This can lead to type inconsistencies and potential runtime errors.
Option 1: Update RelationshipMetaData
to include the "type"
key.
class RelationshipMetaData(TypedDict):
index_fields: list[str]
+ type: str
Option 2: Adjust the _metadata
default value to match the specified type.
_metadata: Optional[RelationshipMetaData] = {
"index_fields": [],
- "type": "Relationship"
}
Ensure that the _metadata
field's type annotation matches its default value.
Committable suggestion skipped: line range outside the PR's diff.
data_vectors = await self.embed_data([ | ||
" ".join([str(v) for v in rel.get_embeddable_properties().values()]) | ||
for rel in relationships | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle potential None
values in embeddable properties
When generating embeddings, if any embeddable property of a relationship is None
, converting it to a string will result in the string "None"
, which might negatively impact the embeddings.
Apply this diff to filter out None
values:
data_vectors = await self.embed_data([
- " ".join([str(v) for v in rel.get_embeddable_properties().values()])
+ " ".join([str(v) for v in rel.get_embeddable_properties().values() if v is not None])
for rel in relationships
])
This ensures that only valid property values are included in the embeddings.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
data_vectors = await self.embed_data([ | |
" ".join([str(v) for v in rel.get_embeddable_properties().values()]) | |
for rel in relationships | |
]) | |
data_vectors = await self.embed_data([ | |
" ".join([str(v) for v in rel.get_embeddable_properties().values() if v is not None]) | |
for rel in relationships | |
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's talk
|
||
|
||
class Relationship(BaseModel): | ||
__tablename__ = "relationship" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edge-type embeddings are already implemented here:
#251
This collected the distinct edge types, creates a pedantic model for edge types: https://github.com/topoteretes/cognee/pull/251/files#diff-676c14525be801de96c5734e9d56bb784f3aaf40fc60d4f030a16e06f17317f9
and embed the relationship name into the edge_type collection.
payload: dict | ||
|
||
# Prepare LanceDB-compatible data points | ||
lance_relationships = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this, we have the index datapoints method, which indexes the datapoint pydantic object based on its index_fields property.
Summary by CodeRabbit
New Features
Relationship
class to model relationships between nodes, including various serialization methods.Documentation
Relationship
class.