First draft of relationship embeddings #379

Open, wants to merge 1 commit into base: dev

Conversation

Vasilije1990
Contributor

@Vasilije1990 Vasilije1990 commented Dec 17, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a new method to create and store relationship embeddings in the database.
    • Added a new Relationship class to model relationships between nodes, including various serialization methods.
  • Documentation

    • Updated import statements to include the new Relationship class.


coderabbitai bot commented Dec 17, 2024

Walkthrough

The pull request introduces a new Relationship model and associated infrastructure for managing relationships between nodes in the Cognee system. The changes include creating a Relationship class with comprehensive serialization methods, updating the engine's initialization to export the new class, and extending the LanceDB adapter to support storing relationship embeddings. This enhancement provides a structured way to represent and manage relationships with versioning, metadata, and embedding capabilities.

Changes

File | Change Summary
cognee/infrastructure/engine/models/Relationship.py | New file defining Relationship class with methods for serialization, versioning, and embedding
cognee/infrastructure/engine/__init__.py | Added import for Relationship class to expose it in the engine module
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py | Added create_relationships method to support storing relationship embeddings in LanceDB

Sequence Diagram

sequenceDiagram
    participant User
    participant RelationshipModel
    participant LanceDBAdapter
    participant VectorDB

    User->>RelationshipModel: Create Relationship
    RelationshipModel-->>User: Relationship Instance
    User->>LanceDBAdapter: create_relationships(collection, relationships)
    LanceDBAdapter->>VectorDB: Generate Embeddings
    LanceDBAdapter->>VectorDB: Merge Insert Relationships
    VectorDB-->>LanceDBAdapter: Confirmation
    LanceDBAdapter-->>User: Relationship Stored

Poem

🐰 In databases deep and wide,
Relationships now smoothly glide
With LanceDB's vector might
Connections bloom, data takes flight!
A rabbit's code, precise and bright 🌟


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
cognee/infrastructure/engine/models/Relationship.py (1)

45-53: Consider security implications of using pickle for serialization

Using pickle for serialization can be insecure if untrusted data is deserialized, as it may lead to arbitrary code execution. Since pickle is not safe against erroneous or maliciously constructed data, consider using a safer serialization method if the data might come from untrusted sources.

If security is a concern, prefer using to_json and from_json methods for serialization and deserialization:

 # Remove pickle methods if not needed
-def to_pickle(self) -> bytes:
-    """Serialize the instance to pickle-compatible bytes."""
-    return pickle.dumps(self.dict())
-
-@classmethod
-def from_pickle(cls, pickled_data: bytes):
-    """Deserialize the instance from pickled bytes."""
-    data = pickle.loads(pickled_data)
-    return cls(**data)
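
To illustrate the concern, here is a minimal sketch using only the standard library. The Payload class and the sample record are hypothetical and not taken from the PR. It shows that unpickling can invoke whatever callable the byte stream names, whereas JSON parsing only yields plain data.

```python
import json
import pickle


class Payload:
    """Hypothetical object whose pickle form names a callable to run on load."""

    def __reduce__(self):
        # pickle stores (callable, args) and calls callable(*args) inside loads().
        # A harmless builtin is used here; crafted bytes could name any importable callable.
        return (len, ("abcd",))


blob = pickle.dumps(Payload())
unpickled = pickle.loads(blob)  # calls len("abcd") instead of rebuilding a Payload

# JSON round trip: parsing produces only dicts, lists, strings, numbers, booleans and None.
record = {"source_id": "a", "target_id": "b", "relationship_type": "works_at"}
restored = json.loads(json.dumps(record))
```

The value returned by pickle.loads is the result of the named call, which is why the pickle methods are only safe for data the application produced itself.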
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (3)

96-100: Define LanceRelationship class at the module level

Defining the LanceRelationship class inside the method can cause issues with pickling, serialization, and type annotations. It's generally better to define classes at the module level for clarity and reusability.

Move the LanceRelationship class definition outside of the create_relationships method:

+# Define LanceRelationship at the module level
 class LanceRelationship(LanceModel):
     id: str
     vector: Vector(vector_size)
     payload: dict

Adjust the method to access vector_size appropriately.


116-116: Replace print statement with logging

Using print statements in production code is not recommended. Consider using the logging module to log messages, which provides better control over logging levels and output destinations.

Apply this diff to replace the print statement:

 import asyncio
+import logging
+
+logger = logging.getLogger(__name__)

 # ...

-print(f"Inserted {len(relationships)} relationships into LanceDB")
+logger.info(f"Inserted {len(relationships)} relationships into LanceDB")

Ensure that the logging configuration is properly set up elsewhere in your application.
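
A small variation that may be worth considering is passing the count as a logging argument instead of using an f-string, so the message is only formatted when the INFO level is enabled. This is a sketch with the standard library only; the logger name, report_insert, and ListHandler are hypothetical names for illustration.

```python
import logging

logger = logging.getLogger("lancedb_adapter_example")  # hypothetical logger name


def report_insert(relationships):
    # Lazy %-style arguments: the message is only formatted if INFO is enabled.
    logger.info("Inserted %d relationships into LanceDB", len(relationships))


class ListHandler(logging.Handler):
    """Collects formatted messages so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
report_insert(["r1", "r2", "r3"])
```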


112-115: Add error handling for database operations

Currently, there is no exception handling around the database operations when inserting relationships. To improve robustness, consider adding a try-except block to catch and handle any exceptions that might occur during the operation.

Apply this diff to add error handling:

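-await collection.merge_insert("id") \
-    .when_matched_update_all() \
-    .when_not_matched_insert_all() \
-    .execute(lance_relationships)
+try:
+    await collection.merge_insert("id") \
+        .when_matched_update_all() \
+        .when_not_matched_insert_all() \
+        .execute(lance_relationships)
+except Exception as e:
+    logger.error(f"Failed to insert relationships into LanceDB: {e}")
+    # Handle exception accordingly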

Ensure that the logger is defined and properly configured.
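
One way to fill in the "handle exception accordingly" step is to log with the traceback and re-raise, so the caller can decide whether to retry or abort. The sketch below runs with the standard library only; FailingCollection, its upsert method, and store_relationships are hypothetical stand-ins, not the LanceDB API or the adapter's actual method.

```python
import asyncio
import logging

logger = logging.getLogger("lancedb_adapter_example")  # hypothetical logger name


class FailingCollection:
    """Hypothetical stand-in for a table whose write fails."""

    async def upsert(self, rows):
        raise ConnectionError("storage unavailable")


async def store_relationships(collection, rows):
    try:
        await collection.upsert(rows)
    except Exception:
        # logger.exception records the traceback; re-raising leaves the decision to the caller.
        logger.exception("Failed to insert %d relationships", len(rows))
        raise
    return len(rows)


async def main():
    try:
        await store_relationships(FailingCollection(), [{"id": "1"}])
    except ConnectionError as error:
        return str(error)
    return None


outcome = asyncio.run(main())
```

Swallowing the exception after logging, as in the diff above, would make a failed write look like success to the caller, which is the main reason to re-raise here.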

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45cb2c3 and 4545f22.

📒 Files selected for processing (3)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (2 hunks)
  • cognee/infrastructure/engine/__init__.py (1 hunks)
  • cognee/infrastructure/engine/models/Relationship.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cognee/infrastructure/engine/__init__.py
🔇 Additional comments (3)
cognee/infrastructure/engine/models/Relationship.py (2)

40-43: Verify Pydantic version for model_validate_json method

The method model_validate_json is available in Pydantic version 2 and above. If the codebase is using Pydantic version 1, you should use parse_raw instead.

Please verify the Pydantic version being used. If it's version 1.x, update the method as follows:

 @classmethod
 def from_json(cls, json_str: str):
     """Deserialize the instance from a JSON string."""
-    return cls.model_validate_json(json_str)
+    return cls.parse_raw(json_str)

62-62: Verify Pydantic version for model_validate method

The method model_validate is available in Pydantic version 2 and above. If the codebase is using Pydantic version 1, you should use parse_obj instead.

Please verify the Pydantic version being used. If it's version 1.x, update the method as follows:

 def from_dict(cls, data: Dict[str, Any]) -> "Relationship":
     """Deserialize model from a dictionary."""
-    return cls.model_validate(data)
+    return cls.parse_obj(data)
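
If the code has to run under both major Pydantic versions, one option is to dispatch on which classmethod the model provides. This is a minimal sketch: V1Style and V2Style are hypothetical stand-ins that only mimic the method names parse_raw (Pydantic 1.x) and model_validate_json (Pydantic 2.x), so it runs without Pydantic installed.

```python
import json


def from_json_compat(model_cls, json_str):
    """Call the JSON validator that the given model class provides."""
    if hasattr(model_cls, "model_validate_json"):  # Pydantic 2.x method name
        return model_cls.model_validate_json(json_str)
    return model_cls.parse_raw(json_str)  # Pydantic 1.x method name


class V2Style:
    """Hypothetical class exposing only the 2.x-style method."""

    @classmethod
    def model_validate_json(cls, json_str):
        return ("v2", json.loads(json_str))


class V1Style:
    """Hypothetical class exposing only the 1.x-style method."""

    @classmethod
    def parse_raw(cls, json_str):
        return ("v1", json.loads(json_str))


payload = '{"relationship_type": "works_at"}'
v2_result = from_json_compat(V2Style, payload)
v1_result = from_json_compat(V1Style, payload)
```

The same pattern would apply to model_validate versus parse_obj; pinning a single Pydantic major version in the project dependencies is the simpler alternative.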
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)

82-83: Verify that create_collection correctly handles the Relationship model

When creating a collection with the Relationship model as the payload schema, ensure that any private attributes or methods are excluded from the schema to prevent unintended data from being stored.

Check if get_data_point_schema properly processes the Relationship model:

 def get_data_point_schema(self, model_type):
     return copy_model(
         model_type,
         include_fields = {
             "id": (str, ...),
+            # Include necessary fields explicitly
+            "source_id": (UUID, ...),
+            "target_id": (UUID, ...),
+            "relationship_type": (str, ...),
+            "weight": (Optional[float], None),
+            # Exclude private attributes and methods
         },
         exclude_fields = ["_metadata"],
     )

Alternatively, consider defining a specific schema for the collection that includes only the required fields.


def get_embeddable_properties(self):
    """Retrieve embeddable properties for edge embeddings."""
    return {field: getattr(self, field, None) for field in self._metadata["index_fields"]}

⚠️ Potential issue

Ensure _metadata is not None before accessing keys

In the get_embeddable_properties method, accessing self._metadata["index_fields"] without checking if _metadata is not None can lead to an exception if _metadata is None.

Apply this diff to add a check:

 def get_embeddable_properties(self):
-    return {field: getattr(self, field, None) for field in self._metadata["index_fields"]}
+    if self._metadata and "index_fields" in self._metadata:
+        return {field: getattr(self, field, None) for field in self._metadata["index_fields"]}
+    return {}

This ensures that you only access index_fields when _metadata is properly initialized.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    return {field: getattr(self, field, None) for field in self._metadata["index_fields"]}
+    if self._metadata and "index_fields" in self._metadata:
+        return {field: getattr(self, field, None) for field in self._metadata["index_fields"]}
+    return {}

Comment on lines +69 to +70
    """Retrieve names of embeddable properties."""
    return self._metadata["index_fields"]

⚠️ Potential issue

Handle potential None in _metadata when accessing index_fields

Similarly, in the get_embeddable_property_names method, ensure that _metadata is not None before accessing index_fields.

Apply this diff to add a check:

 def get_embeddable_property_names(self):
-    return self._metadata["index_fields"]
+    if self._metadata and "index_fields" in self._metadata:
+        return self._metadata["index_fields"]
+    return []

This prevents potential TypeError or KeyError exceptions.

📝 Committable suggestion

Suggested change
-    """Retrieve names of embeddable properties."""
-    return self._metadata["index_fields"]
+    """Retrieve names of embeddable properties."""
+    if self._metadata and "index_fields" in self._metadata:
+        return self._metadata["index_fields"]
+    return []
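
Both guards can share a single lookup if the properties method is built on the names method. This is a sketch under assumptions: RelationshipSketch is a hypothetical plain-Python stand-in for the PR's Relationship model, with only the attributes needed to show the behaviour.

```python
class RelationshipSketch:
    """Hypothetical stand-in for the PR's Relationship model."""

    def __init__(self, relationship_type, weight=None, metadata=None):
        self.relationship_type = relationship_type
        self.weight = weight
        self._metadata = metadata

    def get_embeddable_property_names(self):
        # Covers both a missing _metadata and a missing "index_fields" key.
        return list((self._metadata or {}).get("index_fields", []))

    def get_embeddable_properties(self):
        # Reuses the guarded lookup, so the None check lives in one place.
        return {
            name: getattr(self, name, None)
            for name in self.get_embeddable_property_names()
        }


indexed = RelationshipSketch("works_at", 0.5, {"index_fields": ["relationship_type"]})
bare = RelationshipSketch("works_at")
```

Here indexed.get_embeddable_properties() yields the indexed field, while bare falls back to an empty dict and an empty list instead of raising.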

Comment on lines +23 to +26
_metadata: Optional[RelationshipMetaData] = {
    "index_fields": [],
    "type": "Relationship"
}

⚠️ Potential issue

Correct the _metadata field type or default value

The _metadata field is annotated as Optional[RelationshipMetaData] but is assigned a default value that includes the key "type", which is not defined in the RelationshipMetaData TypedDict. This can lead to type inconsistencies and potential runtime errors.

Option 1: Update RelationshipMetaData to include the "type" key.

 class RelationshipMetaData(TypedDict):
     index_fields: list[str]
+    type: str

Option 2: Adjust the _metadata default value to match the specified type.

 _metadata: Optional[RelationshipMetaData] = {
     "index_fields": [],
-    "type": "Relationship"
 }

Ensure that the _metadata field's type annotation matches its default value.
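
As a sketch of Option 1, using only the standard library: the TypedDict declares both keys, so the default value matches the annotation. Note that TypedDict is a static-typing construct and is not enforced at runtime, so the mismatch would normally be reported by a type checker rather than raised as an exception. The default_metadata and declared_keys names are illustrative only.

```python
from typing import TypedDict


class RelationshipMetaData(TypedDict):
    index_fields: list[str]
    type: str  # Option 1: declare the key that the default value already uses


default_metadata: RelationshipMetaData = {"index_fields": [], "type": "Relationship"}

# Introspection only lists the declared keys; nothing here validates the dict at runtime.
declared_keys = set(RelationshipMetaData.__annotations__)
```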

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +88 to +91
data_vectors = await self.embed_data([
    " ".join([str(v) for v in rel.get_embeddable_properties().values()])
    for rel in relationships
])

⚠️ Potential issue

Handle potential None values in embeddable properties

When generating embeddings, if any embeddable property of a relationship is None, converting it to a string will result in the string "None", which might negatively impact the embeddings.

Apply this diff to filter out None values:

 data_vectors = await self.embed_data([
-    " ".join([str(v) for v in rel.get_embeddable_properties().values()])
+    " ".join([str(v) for v in rel.get_embeddable_properties().values() if v is not None])
     for rel in relationships
 ])

This ensures that only valid property values are included in the embeddings.

📝 Committable suggestion

Suggested change
-data_vectors = await self.embed_data([
-    " ".join([str(v) for v in rel.get_embeddable_properties().values()])
-    for rel in relationships
-])
+data_vectors = await self.embed_data([
+    " ".join([str(v) for v in rel.get_embeddable_properties().values() if v is not None])
+    for rel in relationships
+])
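
The effect of the filter is easy to see in isolation. In this sketch, embedding_text is a hypothetical helper name and the property dict is made-up sample data; only the join expression mirrors the suggestion.

```python
def embedding_text(properties):
    """Join non-None property values into the text sent to the embedder."""
    return " ".join(str(value) for value in properties.values() if value is not None)


sample = {"relationship_type": "works_at", "weight": None}

filtered = embedding_text(sample)
# Without the filter, None is stringified and becomes part of the embedded text.
unfiltered = " ".join(str(value) for value in sample.values())
```

Because the test is "is not None" and not a truthiness check, legitimate falsy values such as 0 or an empty string are still included.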


@hajdul88 hajdul88 left a comment


Let's talk



class Relationship(BaseModel):
    __tablename__ = "relationship"

Edge-type embeddings are already implemented here:
#251

That PR collects the distinct edge types, creates a Pydantic model for edge types: https://github.com/topoteretes/cognee/pull/251/files#diff-676c14525be801de96c5734e9d56bb784f3aaf40fc60d4f030a16e06f17317f9

and embeds the relationship name into the edge_type collection.
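
The "collect the distinct edge types" step can be sketched roughly as follows. This is not the code from #251: the (source, target, relationship_name) tuple layout and the distinct_edge_types helper are assumptions made for illustration.

```python
def distinct_edge_types(edges):
    """Return each relationship name once, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops duplicates.
    return list(dict.fromkeys(relationship_name for _, _, relationship_name in edges))


# Hypothetical edges as (source, target, relationship_name) tuples.
edges = [
    ("alice", "acme", "works_at"),
    ("bob", "acme", "works_at"),
    ("alice", "bob", "knows"),
]
edge_types = distinct_edge_types(edges)
```

Each distinct name would then be embedded once, instead of once per edge.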

payload: dict

# Prepare LanceDB-compatible data points
lance_relationships = [

I think we already have the index data points method for this; it indexes a data point Pydantic object based on its index_fields property.
