
Add versioning to the data point model #378

Status: Open — wants to merge 12 commits into base branch dev

Conversation

@Vasilije1990 Vasilije1990 (Contributor) commented Dec 17, 2024

Summary by CodeRabbit

  • New Features

    • Introduced new fields in the DataPoint model: created_at, updated_at, version, and type.
    • Added methods for serialization: to_json, from_json, to_pickle, from_pickle, to_dict, and from_dict.
    • Added a method to update the version along with the timestamp.
    • New optional attributes in BaseConfig for enhanced configuration capabilities.
  • Improvements

    • Enhanced timestamp handling for created_at and updated_at.
    • Improved error handling in the profiling workflow.
    • Conditional configuration for success and failure callbacks in OpenAIAdapter based on monitoring tool settings.

coderabbitai bot (Contributor) commented Dec 17, 2024

Walkthrough

The changes modify the DataPoint model in the Cognee infrastructure, introducing new fields and methods to enhance metadata handling and versioning. The model now includes additional attributes like created_at, version, type, and others, with timestamps stored as integers in milliseconds since the epoch. The modifications also update method signatures to include docstrings and add a new update_version method for managing version information. Additionally, the workflow configuration for profiling has been improved, and new optional attributes have been added to the BaseConfig class.
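The fields and methods described above can be sketched roughly as follows. A plain dataclass is used purely for illustration (the actual model extends pydantic's `BaseModel` and carries more fields); the class name `DataPointSketch` and helper `_now_ms` are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now_ms() -> int:
    # Milliseconds since the Unix epoch, in UTC.
    return int(datetime.now(timezone.utc).timestamp() * 1000)


@dataclass
class DataPointSketch:
    created_at: int = field(default_factory=_now_ms)
    updated_at: int = field(default_factory=_now_ms)
    version: str = "0.1"
    type: str = "text"

    def update_version(self, new_version: str) -> None:
        # Set the new version and refresh the modification timestamp.
        self.version = new_version
        self.updated_at = _now_ms()
```

Note that `created_at` is set once at construction, while `updated_at` is refreshed on every version bump.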

Changes

File Changes

  • cognee/infrastructure/engine/models/DataPoint.py - Added fields: created_at, updated_at, version, type; updated method signatures with docstrings; added methods: update_version, to_json, from_json, to_pickle, from_pickle, to_dict, from_dict; simplified return logic in get_embeddable_data
  • .github/workflows/profiling.yaml - Added packages: parso, jedi; enhanced error handling in the "Compare profiling results" step
  • cognee/base_config.py - Added attributes: langfuse_public_key, langfuse_secret_key, langfuse_host
  • cognee/infrastructure/llm/openai/adapter.py - Added imports for MonitoringTool and get_base_config; updated constructor to configure callbacks based on monitoring_tool

Poem

🐰 A DataPoint's Tale of Growth

Metadata dancing, version so bright,
Timestamps ticking with rabbit delight,
New fields sprouting like carrots in spring,
Our model now has an extra swing!
Hop, hop, evolve with coding might! 🥕


coderabbitai bot (Contributor) left a review comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
cognee/infrastructure/engine/models/DataPoint.py (3)

11-14: Consider making MetaData more type-safe

The MetaData type definition could be more specific about the allowed values.

Consider this improvement:

class MetaData(TypedDict):
-    index_fields: list[str]
+    index_fields: list[str]
+    type: Literal["DataPoint"]

36-47: Consider adding docstring type hints and return value

The method documentation could be more explicit about types and return values.

Consider enhancing the docstring:

     @classmethod
     def get_embeddable_data(cls, data_point):
-        """Retrieve embeddable data based on metadata's index_fields."""
+        """Retrieve embeddable data based on metadata's index_fields.
+        
+        Args:
+            data_point: DataPoint instance to extract data from
+            
+        Returns:
+            str | Any: The stripped string value if the attribute is a string,
+                      otherwise the original attribute value
+        """

16-29: Consider adding database indexes and documenting versioning strategy

Since this is a database model with timestamp fields and versioning:

  1. Consider adding database indexes for created_at and updated_at fields to optimize queries by time range
  2. Document the versioning strategy, including:
    • When and how versions should be incremented
    • The meaning of version numbers
    • Migration strategy for existing data
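Point 1 might look like the following, assuming a SQL backend (the table and index names here are hypothetical and shown with SQLite for illustration; the repository's actual storage engine may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_points ("
    " id TEXT PRIMARY KEY,"
    " created_at INTEGER NOT NULL,"
    " updated_at INTEGER NOT NULL,"
    " version TEXT NOT NULL"
    ")"
)
# Indexes on the timestamp columns let time-range queries avoid full scans.
conn.execute("CREATE INDEX idx_data_points_created_at ON data_points (created_at)")
conn.execute("CREATE INDEX idx_data_points_updated_at ON data_points (updated_at)")

conn.execute(
    "INSERT INTO data_points VALUES ('a1', 1734393600000, 1734393600000, '0.1')"
)
rows = conn.execute(
    "SELECT id FROM data_points WHERE created_at >= ?", (1734393600000,)
).fetchall()
```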
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45cb2c3 and 6fb3b4a.

📒 Files selected for processing (1)
  • cognee/infrastructure/engine/models/DataPoint.py (1 hunks)

Comment on lines +20 to +21
created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))

🛠️ Refactor suggestion

Simplify timestamp creation and add validation

The timestamp creation could be simplified and should validate against negative values.

Consider this improvement:

-    created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
-    updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
+    created_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )
+    updated_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )

Comment on lines 61 to 64
def update_version(self, new_version: str):
"""Update the version and updated_at timestamp."""
self.version = new_version
self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)

🛠️ Refactor suggestion

Improve version update method with validation

The update_version method should validate the version format and reuse timestamp logic.

Consider this improvement:

     def update_version(self, new_version: str):
         """Update the version and updated_at timestamp."""
+        if not re.match(r"^\d+\.\d+$", new_version):
+            raise ValueError("Version must be in format 'X.Y'")
         self.version = new_version
-        self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)
+        self.updated_at = Field(
+            default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000)
+        ).default_factory()

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 22 to 24
version: str = "0.1" # Default version
source: Optional[str] = None # Path to file, URL, etc.
type: Optional[str] = "text" # "text", "file", "image", "video"

🛠️ Refactor suggestion

Add version format validation and type enumeration

The version string and type field should have proper validation.

Consider these improvements:

+from enum import Enum
+import re
+
+class DataPointType(str, Enum):
+    TEXT = "text"
+    FILE = "file"
+    IMAGE = "image"
+    VIDEO = "video"

class DataPoint(BaseModel):
    # ... other fields ...
-    version: str = "0.1"  # Default version
-    type: Optional[str] = "text"  # "text", "file", "image", "video"
+    version: str = Field(
+        default="0.1",
+        regex=r"^\d+\.\d+$"
+    )
+    type: Optional[DataPointType] = Field(default=DataPointType.TEXT)

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai bot (Contributor) left a review comment

Actionable comments posted: 5

♻️ Duplicate comments (3)
cognee/infrastructure/engine/models/DataPoint.py (3)

20-21: 🛠️ Refactor suggestion

Add validation for timestamp fields

The timestamp fields should include validation to prevent negative values.

-    created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
-    updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
+    created_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )
+    updated_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )

22-24: 🛠️ Refactor suggestion

Add version format validation and type enumeration

The version string and type field should have proper validation.

+from enum import Enum
+import re
+
+class DataPointType(str, Enum):
+    TEXT = "text"
+    FILE = "file"
+    IMAGE = "image"
+    VIDEO = "video"

-    version: str = "0.1"  # Default version
-    type: Optional[str] = "text"  # "text", "file", "image", "video"
+    version: str = Field(
+        default="0.1",
+        pattern=r"^\d+\.\d+$"
+    )
+    type: Optional[DataPointType] = Field(default=DataPointType.TEXT)

59-62: 🛠️ Refactor suggestion

Improve version update method with validation

The update_version method should validate the version format.

     def update_version(self, new_version: str):
         """Update the version and updated_at timestamp."""
+        if not re.match(r"^\d+\.\d+$", new_version):
+            raise ValueError("Version must be in format 'X.Y'")
         self.version = new_version
         self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)
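The validation suggested above, pulled out as a standalone helper so it can be unit-tested (the function name `validate_version` is hypothetical):

```python
import re

VERSION_RE = re.compile(r"^\d+\.\d+$")


def validate_version(value: str) -> str:
    # Accepts strings like "0.1" or "12.3"; rejects anything else.
    if not VERSION_RE.match(value):
        raise ValueError("Version must be in format 'X.Y'")
    return value
```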
🧹 Nitpick comments (1)
cognee/infrastructure/llm/openai/adapter.py (1)

42-43: Remove commented code

The commented callback configuration lines should be removed as they are not being used and may cause confusion.

-            # litellm.success_callback = ["langfuse"]
-            # litellm.failure_callback = ["langfuse"]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fb3b4a and 7657b8e.

📒 Files selected for processing (4)
  • .github/workflows/profiling.yaml (1 hunks)
  • cognee/base_config.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (2 hunks)
  • cognee/infrastructure/llm/openai/adapter.py (2 hunks)
🔇 Additional comments (2)
.github/workflows/profiling.yaml (1)

60-61: Verify necessity of new dependencies

The newly added packages parso and jedi aren't directly used in the workflow. Please verify if these are necessary dependencies or if they're being installed for a specific purpose.

✅ Verification successful

Dependencies are necessary for code analysis functionality

The parso and jedi packages are actively used in the codebase for Python code analysis:

  • Used in get_local_dependencies.py for parsing code and extracting code entities
  • Used in extract_code_parts.py for parsing source code
  • Used in top_down_repo_parse.py for project-wide code analysis

These packages are essential dependencies for the repository's code analysis functionality and their installation in the workflow is justified.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if parso and jedi are required dependencies
poetry show --tree | grep -A 5 -E "parso|jedi"

Length of output: 145


Script:

#!/bin/bash
# Search for any references to parso or jedi in Python files
rg -l "import (parso|jedi)" -t python

# Search for any references in requirements files
fd "requirements.*\.txt$" --exec grep -l "parso\|jedi" {} \;

# Search for any references in pyproject.toml or setup files
fd "(pyproject\.toml|setup\.py|setup\.cfg)$" --exec grep -l "parso\|jedi" {} \;

# Look for any documentation or comments mentioning these packages
rg -l "parso|jedi" -g "!*.{pyc,pyo,pyd}" -g "!.git"

Length of output: 442


Script:

#!/bin/bash
# Check the content of files that reference these packages
rg "parso|jedi" poetry.lock -A 5 -B 5

# Check how these packages are used in the Python files
rg "parso|jedi" cognee/tasks/repo_processor/get_local_dependencies.py
rg "parso|jedi" cognee/tasks/repo_processor/extract_code_parts.py
rg "parso|jedi" cognee/tasks/repo_processor/top_down_repo_parse.py

Length of output: 5192

cognee/infrastructure/llm/openai/adapter.py (1)

9-9: LGTM: Import statements are correctly added

The new imports for MonitoringTool and get_base_config are appropriately placed and necessary for the monitoring configuration feature.

Also applies to: 13-13

Comment on lines 13 to 15
langfuse_public_key: Optional[str] = os.getenv("LANGFUSE_PUBLIC_KEY")
langfuse_secret_key: Optional[str] = os.getenv("LANGFUSE_SECRET_KEY")
langfuse_host: Optional[str] = os.environ["LANGFUSE_HOST"]

⚠️ Potential issue

Fix potential KeyError in environment variable access

The direct access to LANGFUSE_HOST using os.environ[] could raise a KeyError if the environment variable isn't set. Consider using os.getenv() for consistent behavior with other environment variables.

-    langfuse_host: Optional[str] = os.environ["LANGFUSE_HOST"]
+    langfuse_host: Optional[str] = os.getenv("LANGFUSE_HOST")
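The difference between the two access styles is easy to demonstrate (a minimal sketch; the variable is deliberately unset first):

```python
import os

# Ensure the variable is absent for the demonstration.
os.environ.pop("LANGFUSE_HOST", None)

# os.getenv returns None (or a supplied default) when the variable is unset...
host = os.getenv("LANGFUSE_HOST")

# ...whereas subscripting os.environ raises KeyError.
try:
    os.environ["LANGFUSE_HOST"]
    raised = False
except KeyError:
    raised = True
```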

Comment on lines 36 to 37
@classmethod
@classmethod

⚠️ Potential issue

Remove duplicate @classmethod decorator

The @classmethod decorator is duplicated.

-    @classmethod
-    @classmethod
+    @classmethod

Comment on lines +74 to +83
# Pickle Serialization
def to_pickle(self) -> bytes:
"""Serialize the instance to pickle-compatible bytes."""
return pickle.dumps(self.dict())

@classmethod
def from_pickle(self, pickled_data: bytes):
"""Deserialize the instance from pickled bytes."""
data = pickle.loads(pickled_data)
return self(**data)

🛠️ Refactor suggestion

⚠️ Potential issue

Security concern: Replace pickle with a safer serialization method

Using pickle for serialization poses a security risk as it can execute arbitrary code during deserialization. Consider using a safer alternative like JSON or MessagePack.

-    def to_pickle(self) -> bytes:
-        """Serialize the instance to pickle-compatible bytes."""
-        return pickle.dumps(self.dict())
-
-    @classmethod
-    def from_pickle(self, pickled_data: bytes):
-        """Deserialize the instance from pickled bytes."""
-        data = pickle.loads(pickled_data)
-        return self(**data)
+    def to_bytes(self) -> bytes:
+        """Serialize the instance to bytes using JSON."""
+        return self.json().encode('utf-8')
+
+    @classmethod
+    def from_bytes(cls, data: bytes):
+        """Deserialize the instance from JSON bytes."""
+        return cls.parse_raw(data)
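The security concern is concrete: unpickling invokes whatever callable the payload names via `__reduce__`, so untrusted bytes can run attacker-chosen code. A harmless demonstration (the class name is made up; a real attacker would smuggle in something like `os.system` instead of `sorted`):

```python
import pickle


class LooksInnocent:
    def __reduce__(self):
        # On unpickling, pickle calls this callable with these arguments.
        # Any importable callable can be substituted here.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(LooksInnocent())
result = pickle.loads(payload)  # invokes sorted([3, 1, 2]), not a constructor
```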

Comment on lines 39 to 47
base_config = get_base_config()
if base_config.monitoring_tool == MonitoringTool.LANGFUSE:
# set callbacks
# litellm.success_callback = ["langfuse"]
# litellm.failure_callback = ["langfuse"]
self.aclient.success_callback = ["langfuse"]
self.aclient.failure_callback = ["langfuse"]
self.client.success_callback = ["langfuse"]
self.client.failure_callback = ["langfuse"]

🛠️ Refactor suggestion

Refactor callback configuration and add error handling

The current implementation has several areas for improvement:

  1. Duplicate callback configuration for both clients
  2. Missing error handling for base_config
  3. Missing documentation for the monitoring feature

Consider refactoring like this:

+    def _configure_langfuse_callbacks(self, client):
+        """Configure Langfuse callbacks for the given client."""
+        client.success_callback = ["langfuse"]
+        client.failure_callback = ["langfuse"]

     def __init__(
         self,
         api_key: str,
         endpoint: str,
         api_version: str,
         model: str,
         transcription_model: str,
         streaming: bool = False,
     ):
+        """Initialize OpenAI adapter with optional Langfuse monitoring.
+        
+        Args:
+            api_key (str): OpenAI API key
+            endpoint (str): API endpoint
+            api_version (str): API version
+            model (str): Model identifier
+            transcription_model (str): Model for transcription
+            streaming (bool, optional): Enable streaming. Defaults to False.
+        """
         self.aclient = instructor.from_litellm(litellm.acompletion)
         self.client = instructor.from_litellm(litellm.completion)
         self.transcription_model = transcription_model
         self.model = model
         self.api_key = api_key
         self.endpoint = endpoint
         self.api_version = api_version
         self.streaming = streaming

-        base_config = get_base_config()
-        if base_config.monitoring_tool == MonitoringTool.LANGFUSE:
-            self.aclient.success_callback = ["langfuse"]
-            self.aclient.failure_callback = ["langfuse"]
-            self.client.success_callback = ["langfuse"]
-            self.client.failure_callback = ["langfuse"]
+        try:
+            base_config = get_base_config()
+            if base_config and base_config.monitoring_tool == MonitoringTool.LANGFUSE:
+                self._configure_langfuse_callbacks(self.aclient)
+                self._configure_langfuse_callbacks(self.client)
+        except Exception as e:
+            # Log the error but don't fail initialization
+            print(f"Warning: Failed to configure monitoring: {str(e)}")

💡 Codebase verification

Missing Langfuse environment documentation and host configuration

The Langfuse integration requires proper documentation and configuration:

  • The required environment variables (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST) are not documented in any .md files
  • LANGFUSE_HOST is accessed directly with os.environ[] instead of os.getenv(), which could cause crashes if not set
🔗 Analysis chain

Verify Langfuse configuration requirements

Since Langfuse monitoring is being added, we should ensure all necessary configuration is present.


Let me check the actual Langfuse configuration content to verify if all necessary settings are present.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for Langfuse configuration in base config and environment variables

# Check if BaseConfig contains Langfuse configuration
echo "Checking BaseConfig for Langfuse configuration..."
rg -l "langfuse" --type py

# Check for environment variable documentation
echo "Checking for environment variable documentation..."
find . -type f -name "*.md" -exec grep -l "LANGFUSE" {} \;

Length of output: 456


Script:

#!/bin/bash
# Check Langfuse configuration details

# Check BaseConfig for Langfuse settings
echo "=== BaseConfig Langfuse settings ==="
rg -A 5 -B 5 "langfuse" cognee/base_config.py

# Check environment variables and configuration in data models
echo -e "\n=== Data models Langfuse configuration ==="
rg -A 5 -B 5 "langfuse" cognee/shared/data_models.py

# Check if Langfuse secret/public key settings exist
echo -e "\n=== Environment variables for Langfuse ==="
rg -i "LANGFUSE.*KEY|LANGFUSE.*SECRET" .

Length of output: 1434

base_config = get_base_config()
if base_config.monitoring_tool == MonitoringTool.LANGFUSE:
# set callbacks
# litellm.success_callback = ["langfuse"]
Contributor:

Let's remove these commented lines.

Contributor Author:

Done

id: UUID = Field(default_factory = uuid4)
updated_at: Optional[datetime] = datetime.now(timezone.utc)
id: UUID = Field(default_factory=uuid4)
created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
Contributor:

What is the difference between datetime.now(timezone.utc) and this one?

Contributor Author:

created_at is when the initial record was created, updated at is any change that happens
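To make the reviewer's question concrete: the old field held a timezone-aware `datetime` object, while the new fields store the same instant as an integer millisecond count. The two representations are interconvertible:

```python
from datetime import datetime, timezone

# A timezone-aware datetime object (the previous representation)...
now = datetime.now(timezone.utc)

# ...versus the same instant as integer milliseconds since the epoch
# (what created_at / updated_at store now).
now_ms = int(now.timestamp() * 1000)

# Round-trip back to a datetime, accurate to within one millisecond.
roundtrip = datetime.fromtimestamp(now_ms / 1000, tz=timezone.utc)
```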

id: UUID = Field(default_factory=uuid4)
created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
version: str = "0.1" # Default version
Contributor:

I would keep it as a number, and we can just increase it with each version. (1, 2, 3, 4...)

Contributor Author:

kk
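The reviewer's suggestion — an integer version incremented on each change — removes the need for format validation entirely. A minimal sketch (class and method names are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class VersionedSketch:
    version: int = 1
    updated_at: int = 0

    def bump_version(self) -> None:
        # With an integer version there is no string format to validate:
        # each change simply increments the counter and stamps the time.
        self.version += 1
        self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)
```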

created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
version: str = "0.1" # Default version
source: Optional[str] = None # Path to file, URL, etc.
Contributor:

source is a Document model related property, doesn't belong to this general DataPoint model.

Contributor Author:

fair

updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
version: str = "0.1" # Default version
source: Optional[str] = None # Path to file, URL, etc.
type: Optional[str] = "text" # "text", "file", "image", "video"
Contributor:

Same for type, doesn't belong here.

Contributor Author:

Laslzo asked me for this one, due to retriever logic. In general I agree

topological_rank: Optional[int] = 0
extra: Optional[str] = "extra" # For additional properties
Contributor:

extra is unnecessary. All other models that extend DataPoint can add their own properties.


@classmethod
Contributor:

Double @classmethod

def update_version(self, new_version: str):
"""Update the version and updated_at timestamp."""
self.version = new_version
Contributor:

If we have a number as a version, we can do +1 here then.

Contributor Author

Fixed


# JSON Serialization
def to_json(self) -> str:
Contributor

Why do we need this serialization?

Contributor Author

So you can parallelize tasks, since you had issues with that. Pickle or JSON.
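To make the parallelization argument concrete, here is a stdlib-only sketch of the JSON round-trip the PR implements on `DataPoint` (there via Pydantic's model methods). The `DataPointSketch` class and its fields are simplified stand-ins for illustration, not the actual model:

```python
import json
import time
import uuid


class DataPointSketch:
    """Simplified stand-in for the PR's DataPoint model (illustration only)."""

    def __init__(self, id=None, version=1, created_at=None, updated_at=None):
        self.id = id or str(uuid.uuid4())
        self.version = version
        now_ms = int(time.time() * 1000)
        self.created_at = created_at if created_at is not None else now_ms
        self.updated_at = updated_at if updated_at is not None else now_ms

    def to_json(self) -> str:
        # Serialize to a JSON string so the object can cross process boundaries.
        return json.dumps(self.__dict__)

    @classmethod
    def from_json(cls, payload: str) -> "DataPointSketch":
        # Rebuild an equivalent instance on the other side of the boundary.
        return cls(**json.loads(payload))


point = DataPointSketch(version=3)
restored = DataPointSketch.from_json(point.to_json())
assert restored.__dict__ == point.__dict__
```

Because the payload is plain JSON, it can be handed to worker processes or queues without the arbitrary-code-execution risk that pickle carries.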

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/infrastructure/engine/models/DataPoint.py (1)

71-80: ⚠️ Potential issue

Security concern: Remove pickle serialization

Using pickle for serialization poses a security risk as it can execute arbitrary code during deserialization. Since JSON serialization is already implemented and serves the same purpose, the pickle methods should be removed.

-    # Pickle Serialization
-    def to_pickle(self) -> bytes:
-        """Serialize the instance to pickle-compatible bytes."""
-        return pickle.dumps(self.dict())
-
-    @classmethod
-    def from_pickle(self, pickled_data: bytes):
-        """Deserialize the instance from pickled bytes."""
-        data = pickle.loads(pickled_data)
-        return self(**data)

If binary serialization is needed for parallelization (as mentioned in past comments), consider using MessagePack or Protocol Buffers instead.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7657b8e and 2bfc657.

📒 Files selected for processing (3)
  • cognee/base_config.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (2 hunks)
  • cognee/infrastructure/llm/openai/adapter.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cognee/infrastructure/llm/openai/adapter.py
  • cognee/base_config.py
🔇 Additional comments (4)
cognee/infrastructure/engine/models/DataPoint.py (4)

20-21: Add validation for timestamp fields

The timestamp fields should validate against negative values.

-    created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
-    updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
+    created_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )
+    updated_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )
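The `ge=0` constraint above guards against negative epoch values. A stdlib sketch of the same millisecond-since-epoch computation with an equivalent plain guard (the function names here are illustrative, not from the PR):

```python
from datetime import datetime, timezone


def utc_now_ms() -> int:
    """Milliseconds since the Unix epoch, as used by created_at/updated_at."""
    return int(datetime.now(timezone.utc).timestamp() * 1000)


def validate_timestamp(ms: int) -> int:
    """Plain-Python mirror of Pydantic's ge=0 field constraint (illustration)."""
    if ms < 0:
        raise ValueError("timestamp must be >= 0 milliseconds since epoch")
    return ms


stamp = validate_timestamp(utc_now_ms())
```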

82-89: LGTM: Dict serialization methods are well implemented

The implementation correctly uses Pydantic's model_dump and model_validate methods, following best practices for dictionary serialization.


16-18: Add tests for versioning functionality

The new versioning feature needs test coverage to ensure correct behavior, especially for:

  • Version incrementation
  • Timestamp updates
  • Serialization/deserialization of versioned objects

Would you like me to help create test cases for these scenarios?


56-59: ⚠️ Potential issue

Update version increment logic for integer version

Since version is being changed to an integer, the increment logic needs to be updated.

     def update_version(self):
         """Update the version and updated_at timestamp."""
-        self.version += 1
+        self.version = self.version + 1  # Explicit increment for clarity
         self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)

Likely invalid or redundant comment.
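For reference, the numeric-version behavior discussed in this thread can be exercised in isolation. A minimal stdlib stand-in for the model (class name and fields are illustrative, not the actual `DataPoint`):

```python
import time


class VersionedPoint:
    """Minimal stand-in for DataPoint's versioning behavior (illustration)."""

    def __init__(self):
        now_ms = int(time.time() * 1000)
        self.version = 1
        self.created_at = now_ms
        self.updated_at = now_ms

    def update_version(self) -> None:
        # Bump the numeric version and refresh the update timestamp together,
        # so the two never drift apart.
        self.version += 1
        self.updated_at = int(time.time() * 1000)


p = VersionedPoint()
p.update_version()
assert p.version == 2
assert p.updated_at >= p.created_at
```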

cognee/infrastructure/engine/models/DataPoint.py (outdated, resolved)
Vasilije1990 and others added 2 commits January 6, 2025 13:49
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

gitguardian bot commented Jan 6, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
9573981 Triggered Generic Password 667f973 notebooks/hr_demo.ipynb View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
cognee/infrastructure/engine/models/DataPoint.py (3)

9-10: ⚠️ Potential issue

Security Risk: Avoid using pickle for serialization

Using pickle for serialization poses a security risk as it can execute arbitrary code during deserialization. Consider using JSON serialization which is already implemented in the class.

Since JSON serialization methods are already implemented (to_json, from_json), we should remove the pickle-based methods entirely.


20-21: 🛠️ Refactor suggestion

Add validation for timestamp fields

The timestamp fields should validate against negative values.

-    created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
-    updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
+    created_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0,
+        description="Creation timestamp in milliseconds since epoch"
+    )
+    updated_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0,
+        description="Last update timestamp in milliseconds since epoch"
+    )

71-80: ⚠️ Potential issue

Remove pickle serialization methods

These methods should be removed as they pose security risks. Use the JSON or dict serialization methods instead.

🧹 Nitpick comments (2)
cognee/infrastructure/engine/models/DataPoint.py (2)

23-23: Document type field constraints

Since the type field was specifically requested for retriever logic, please add documentation about its purpose and valid values.

-    type: Optional[str] = "text"  # "text", "file", "image", "video"
+    type: Optional[str] = Field(
+        default="text",
+        description="Content type for retriever logic. Valid values: text, file, image, video"
+    )
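A stricter alternative to documenting the valid values in a description is to encode them in the type itself. A stdlib sketch using `typing.Literal` with a runtime guard (names are illustrative; the PR keeps `type` as `Optional[str]`):

```python
from typing import Literal, get_args

# The four content types named in the PR's comment.
ContentType = Literal["text", "file", "image", "video"]

VALID_TYPES = set(get_args(ContentType))


def validate_content_type(value: str) -> str:
    """Runtime guard mirroring the documented valid values (illustration)."""
    if value not in VALID_TYPES:
        raise ValueError(f"invalid content type: {value!r}")
    return value


assert validate_content_type("text") == "text"
```

With Pydantic, annotating the field as `ContentType` would make the library enforce this at validation time instead.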

Line range hint 34-42: Add docstring to get_embeddable_data method

For consistency with other methods, please add a docstring explaining the purpose and return value of this method.

     @classmethod
     def get_embeddable_data(self, data_point):
+        """Retrieve the primary embeddable data from the data point.
+        
+        Args:
+            data_point: The data point instance to extract data from
+        
+        Returns:
+            The stripped string value of the first index field if it's a string,
+            otherwise returns the raw value
+        """
         if data_point._metadata and len(data_point._metadata["index_fields"]) > 0 \
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2bfc657 and d5243b4.

📒 Files selected for processing (1)
  • cognee/infrastructure/engine/models/DataPoint.py (2 hunks)
🔇 Additional comments (4)
cognee/infrastructure/engine/models/DataPoint.py (4)

30-33: LGTM: Proper configuration for private attributes

The Pydantic configuration correctly handles the private _metadata field.


56-59: LGTM: Clean version update implementation

The version update implementation correctly follows the numeric versioning approach as discussed in previous comments.


61-69: LGTM: Secure JSON serialization implementation

The JSON serialization methods correctly utilize Pydantic's built-in functionality.


82-89: LGTM: Proper dictionary serialization implementation

The dictionary serialization methods correctly use Pydantic's model_dump and model_validate functions.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
cognee/infrastructure/llm/openai/adapter.py (1)

Fix indentation issues throughout the file

The following formatting issues need to be addressed:

  • Remove indentation for class-level attributes (name, model, api_key, api_version)
  • Fix method parameter indentation to use 4 spaces instead of 8 (e.g., in __init__, acreate_structured_output)
  • Normalize indentation in multi-line string content (e.g., in the content field of messages)
🔗 Analysis chain

Line range hint 1-200: Fix formatting issues

The pipeline indicates formatting issues that need to be addressed with Ruff formatter.

Run the following command to fix formatting:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check specific formatting issues in this file
rg --line-number '[ ]{2,}' cognee/infrastructure/llm/openai/adapter.py

Length of output: 5223

🧰 Tools
🪛 GitHub Actions: ruff format

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.

♻️ Duplicate comments (1)
cognee/infrastructure/llm/openai/adapter.py (1)

48-53: 🛠️ Refactor suggestion

Refactor callback configuration

Previous review comments already suggested refactoring this callback configuration pattern.

As previously suggested, consider extracting this into a separate method:

+    def _configure_langfuse_callbacks(self, client):
+        """Configure Langfuse callbacks for the given client."""
+        client.success_callback = ["langfuse"]
+        client.failure_callback = ["langfuse"]

     def __init__(self, ...):
         # ...
         try:
             if base_config.monitoring_tool == MonitoringTool.LANGFUSE:
-                self.aclient.success_callback = ["langfuse"]
-                self.aclient.failure_callback = ["langfuse"]
-                self.client.success_callback = ["langfuse"]
-                self.client.failure_callback = ["langfuse"]
+                self._configure_langfuse_callbacks(self.aclient)
+                self._configure_langfuse_callbacks(self.client)
         except Exception as e:
             print(f"Warning: Failed to configure monitoring: {str(e)}")
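The suggested helper can be exercised with dummy client objects to show why extracting it removes duplication. A self-contained sketch (the `DummyClient` and `AdapterSketch` names are illustrative stand-ins, not the real litellm clients or `OpenAIAdapter`):

```python
class DummyClient:
    """Stand-in for the sync/async LLM client objects (illustration only)."""

    def __init__(self):
        self.success_callback = None
        self.failure_callback = None


class AdapterSketch:
    def __init__(self):
        self.client = DummyClient()
        self.aclient = DummyClient()

    def _configure_langfuse_callbacks(self, client) -> None:
        # One place to wire callbacks, applied to any client instance.
        client.success_callback = ["langfuse"]
        client.failure_callback = ["langfuse"]

    def configure(self) -> None:
        self._configure_langfuse_callbacks(self.client)
        self._configure_langfuse_callbacks(self.aclient)


adapter = AdapterSketch()
adapter.configure()
assert adapter.client.success_callback == ["langfuse"]
assert adapter.aclient.failure_callback == ["langfuse"]
```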
🧰 Tools
🪛 GitHub Actions: ruff format

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5243b4 and 667f973.

📒 Files selected for processing (1)
  • cognee/infrastructure/llm/openai/adapter.py (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/infrastructure/llm/openai/adapter.py

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test

Comment on lines +54 to +60


@observe(as_type='generation')
async def acreate_structured_output(self, text_input: str, system_prompt: str,
                                        response_model: Type[BaseModel]) -> BaseModel:


Contributor

🛠️ Refactor suggestion

Standardize decorator usage and add error handling

The @observe decorator usage is inconsistent between methods and lacks error handling:

  1. acreate_structured_output uses as_type='generation'
  2. create_structured_output uses default parameters
  3. Missing fallback if decorator import fails

Consider standardizing the decorator usage:

-    @observe(as_type='generation')
+    @observe(as_type='generation', name='async_structured_output')
     async def acreate_structured_output(self, text_input: str, system_prompt: str,
                                     response_model: Type[BaseModel]) -> BaseModel:
         """Generate a response from a user query."""

-    @observe
+    @observe(as_type='generation', name='structured_output')
     def create_structured_output(self, text_input: str, system_prompt: str,
                              response_model: Type[BaseModel]) -> BaseModel:

Also, add docstring documentation for the monitoring functionality:

def create_structured_output(self, text_input: str, system_prompt: str,
                           response_model: Type[BaseModel]) -> BaseModel:
    """Generate a response from a user query.
    
    Args:
        text_input (str): The input text to process
        system_prompt (str): The system prompt to use
        response_model (Type[BaseModel]): The Pydantic model for response validation
        
    Returns:
        BaseModel: The structured response
        
    Note:
        This method is monitored by Langfuse when monitoring_tool is configured.
    """
🧰 Tools
🪛 GitHub Actions: ruff format

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.

Comment on lines +15 to +20

monitoring = get_base_config().monitoring_tool
if monitoring == MonitoringTool.LANGFUSE:
    from langfuse.decorators import observe


Contributor

🛠️ Refactor suggestion

Improve initialization robustness and error handling

The global configuration setup has several potential issues:

  1. Using global variables for configuration can lead to initialization order problems
  2. Missing error handling for get_base_config()
  3. Potential circular dependency with base_config import

Consider moving this logic into a class method:

-monitoring = get_base_config().monitoring_tool
-if monitoring == MonitoringTool.LANGFUSE:
-    from langfuse.decorators import observe
+def get_monitoring_config():
+    try:
+        config = get_base_config()
+        return config.monitoring_tool if config else None
+    except Exception as e:
+        print(f"Warning: Failed to load monitoring configuration: {str(e)}")
+        return None
+
+def import_langfuse():
+    try:
+        from langfuse.decorators import observe
+        return observe
+    except ImportError:
+        print("Warning: Langfuse package not installed")
+        return lambda x: x  # No-op decorator
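The no-op decorator fallback suggested above can be verified in isolation by forcing the import to fail. A sketch under the assumption that the module name is parameterized (the `load_observe` helper is illustrative, not part of the PR):

```python
import importlib


def load_observe(module_name: str = "langfuse.decorators"):
    """Return the real observe decorator if importable, else a no-op decorator."""
    try:
        module = importlib.import_module(module_name)
        return module.observe
    except ImportError:
        # Fallback keeps decorated functions callable without the package.
        return lambda fn: fn


# A module name that certainly does not exist forces the fallback path.
observe = load_observe("definitely_missing_module_xyz")


@observe
def answer() -> int:
    return 42
```

With this pattern, code that decorates methods with `@observe` keeps working whether or not Langfuse is installed.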
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
monitoring = get_base_config().monitoring_tool
if monitoring == MonitoringTool.LANGFUSE:
from langfuse.decorators import observe
def get_monitoring_config():
    try:
        config = get_base_config()
        return config.monitoring_tool if config else None
    except Exception as e:
        print(f"Warning: Failed to load monitoring configuration: {str(e)}")
        return None

def import_langfuse():
    try:
        from langfuse.decorators import observe
        return observe
    except ImportError:
        print("Warning: Langfuse package not installed")
        return lambda x: x  # No-op decorator
🧰 Tools
🪛 GitHub Actions: ruff format

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
