
Add versioning to the data point model #378

Status: Open — wants to merge 12 commits into base branch dev

Conversation

@Vasilije1990 Vasilije1990 (Contributor) commented Dec 17, 2024

Summary by CodeRabbit

  • New Features

    • Introduced new fields in the DataPoint model: created_at, updated_at, version, and type.
    • Added methods for serialization: to_json, from_json, to_pickle, from_pickle, to_dict, and from_dict.
    • Added a method to update the version along with the timestamp.
    • New optional attributes in BaseConfig for enhanced configuration capabilities.
  • Improvements

    • Enhanced timestamp handling for created_at and updated_at.
    • Improved error handling in the profiling workflow.
    • Conditional configuration for success and failure callbacks in OpenAIAdapter based on monitoring tool settings.

coderabbitai bot (Contributor) commented Dec 17, 2024

Walkthrough

The changes modify the DataPoint model in the Cognee infrastructure, introducing new fields and methods to enhance metadata handling and versioning. The model now includes additional attributes like created_at, version, type, and others, with timestamps stored as integers in milliseconds since the epoch. The modifications also update method signatures to include docstrings and add a new update_version method for managing version information. Additionally, the workflow configuration for profiling has been improved, and new optional attributes have been added to the BaseConfig class.
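The fields and methods described above can be sketched roughly as follows. A plain dataclass is used purely for illustration (the actual model extends pydantic's `BaseModel` and carries more fields); the class name `DataPointSketch` and helper `_now_ms` are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now_ms() -> int:
    # Milliseconds since the Unix epoch, in UTC.
    return int(datetime.now(timezone.utc).timestamp() * 1000)


@dataclass
class DataPointSketch:
    created_at: int = field(default_factory=_now_ms)
    updated_at: int = field(default_factory=_now_ms)
    version: str = "0.1"
    type: str = "text"

    def update_version(self, new_version: str) -> None:
        # Set the new version and refresh the modification timestamp.
        self.version = new_version
        self.updated_at = _now_ms()
```

Note that `created_at` is set once at construction, while `updated_at` is refreshed on every version bump.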

Changes

File Changes

  • cognee/infrastructure/engine/models/DataPoint.py - Added fields: created_at, updated_at, version, type; updated method signatures with docstrings; added methods: update_version, to_json, from_json, to_pickle, from_pickle, to_dict, from_dict; simplified return logic in get_embeddable_data
  • .github/workflows/profiling.yaml - Added packages: parso, jedi; enhanced error handling in the "Compare profiling results" step
  • cognee/base_config.py - Added attributes: langfuse_public_key, langfuse_secret_key, langfuse_host
  • cognee/infrastructure/llm/openai/adapter.py - Added imports for MonitoringTool and get_base_config; updated constructor to configure callbacks based on monitoring_tool

Poem

🐰 A DataPoint's Tale of Growth

Metadata dancing, version so bright,
Timestamps ticking with rabbit delight,
New fields sprouting like carrots in spring,
Our model now has an extra swing!
Hop, hop, evolve with coding might! 🥕


coderabbitai bot (Contributor) left a review comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
cognee/infrastructure/engine/models/DataPoint.py (3)

11-14: Consider making MetaData more type-safe

The MetaData type definition could be more specific about the allowed values.

Consider this improvement:

class MetaData(TypedDict):
-    index_fields: list[str]
+    index_fields: list[str]
+    type: Literal["DataPoint"]

36-47: Consider adding docstring type hints and return value

The method documentation could be more explicit about types and return values.

Consider enhancing the docstring:

     @classmethod
     def get_embeddable_data(cls, data_point):
-        """Retrieve embeddable data based on metadata's index_fields."""
+        """Retrieve embeddable data based on metadata's index_fields.
+        
+        Args:
+            data_point: DataPoint instance to extract data from
+            
+        Returns:
+            str | Any: The stripped string value if the attribute is a string,
+                      otherwise the original attribute value
+        """

16-29: Consider adding database indexes and documenting versioning strategy

Since this is a database model with timestamp fields and versioning:

  1. Consider adding database indexes for created_at and updated_at fields to optimize queries by time range
  2. Document the versioning strategy, including:
    • When and how versions should be incremented
    • The meaning of version numbers
    • Migration strategy for existing data
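Point 1 might look like the following, assuming a SQL backend (the table and index names here are hypothetical and shown with SQLite for illustration; the repository's actual storage engine may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_points ("
    " id TEXT PRIMARY KEY,"
    " created_at INTEGER NOT NULL,"
    " updated_at INTEGER NOT NULL,"
    " version TEXT NOT NULL"
    ")"
)
# Indexes on the timestamp columns let time-range queries avoid full scans.
conn.execute("CREATE INDEX idx_data_points_created_at ON data_points (created_at)")
conn.execute("CREATE INDEX idx_data_points_updated_at ON data_points (updated_at)")

conn.execute(
    "INSERT INTO data_points VALUES ('a1', 1734393600000, 1734393600000, '0.1')"
)
rows = conn.execute(
    "SELECT id FROM data_points WHERE created_at >= ?", (1734393600000,)
).fetchall()
```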
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45cb2c3 and 6fb3b4a.

📒 Files selected for processing (1)
  • cognee/infrastructure/engine/models/DataPoint.py (1 hunks)

Comment on lines +20 to +21
created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))

🛠️ Refactor suggestion

Simplify timestamp creation and add validation

The timestamp creation could be simplified and should validate against negative values.

Consider this improvement:

-    created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
-    updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
+    created_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )
+    updated_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )

Comment on lines 61 to 64
def update_version(self, new_version: str):
"""Update the version and updated_at timestamp."""
self.version = new_version
self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)

🛠️ Refactor suggestion

Improve version update method with validation

The update_version method should validate the version format and reuse timestamp logic.

Consider this improvement:

     def update_version(self, new_version: str):
         """Update the version and updated_at timestamp."""
+        if not re.match(r"^\d+\.\d+$", new_version):
+            raise ValueError("Version must be in format 'X.Y'")
         self.version = new_version
-        self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)
+        self.updated_at = Field(
+            default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000)
+        ).default_factory()

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 22 to 24
version: str = "0.1" # Default version
source: Optional[str] = None # Path to file, URL, etc.
type: Optional[str] = "text" # "text", "file", "image", "video"

🛠️ Refactor suggestion

Add version format validation and type enumeration

The version string and type field should have proper validation.

Consider these improvements:

+from enum import Enum
+import re
+
+class DataPointType(str, Enum):
+    TEXT = "text"
+    FILE = "file"
+    IMAGE = "image"
+    VIDEO = "video"

class DataPoint(BaseModel):
    # ... other fields ...
-    version: str = "0.1"  # Default version
-    type: Optional[str] = "text"  # "text", "file", "image", "video"
+    version: str = Field(
+        default="0.1",
+        regex=r"^\d+\.\d+$"
+    )
+    type: Optional[DataPointType] = Field(default=DataPointType.TEXT)

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai bot (Contributor) left a review comment

Actionable comments posted: 5

♻️ Duplicate comments (3)
cognee/infrastructure/engine/models/DataPoint.py (3)

20-21: 🛠️ Refactor suggestion

Add validation for timestamp fields

The timestamp fields should include validation to prevent negative values.

-    created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
-    updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
+    created_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )
+    updated_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )

22-24: 🛠️ Refactor suggestion

Add version format validation and type enumeration

The version string and type field should have proper validation.

+from enum import Enum
+import re
+
+class DataPointType(str, Enum):
+    TEXT = "text"
+    FILE = "file"
+    IMAGE = "image"
+    VIDEO = "video"

-    version: str = "0.1"  # Default version
-    type: Optional[str] = "text"  # "text", "file", "image", "video"
+    version: str = Field(
+        default="0.1",
+        pattern=r"^\d+\.\d+$"
+    )
+    type: Optional[DataPointType] = Field(default=DataPointType.TEXT)

59-62: 🛠️ Refactor suggestion

Improve version update method with validation

The update_version method should validate the version format.

     def update_version(self, new_version: str):
         """Update the version and updated_at timestamp."""
+        if not re.match(r"^\d+\.\d+$", new_version):
+            raise ValueError("Version must be in format 'X.Y'")
         self.version = new_version
         self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)
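The validation suggested above, pulled out as a standalone helper so it can be unit-tested (the function name `validate_version` is hypothetical):

```python
import re

VERSION_RE = re.compile(r"^\d+\.\d+$")


def validate_version(value: str) -> str:
    # Accepts strings like "0.1" or "12.3"; rejects anything else.
    if not VERSION_RE.match(value):
        raise ValueError("Version must be in format 'X.Y'")
    return value
```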
🧹 Nitpick comments (1)
cognee/infrastructure/llm/openai/adapter.py (1)

42-43: Remove commented code

The commented callback configuration lines should be removed as they are not being used and may cause confusion.

-            # litellm.success_callback = ["langfuse"]
-            # litellm.failure_callback = ["langfuse"]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fb3b4a and 7657b8e.

📒 Files selected for processing (4)
  • .github/workflows/profiling.yaml (1 hunks)
  • cognee/base_config.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (2 hunks)
  • cognee/infrastructure/llm/openai/adapter.py (2 hunks)
🔇 Additional comments (2)
.github/workflows/profiling.yaml (1)

60-61: Verify necessity of new dependencies

The newly added packages parso and jedi aren't directly used in the workflow. Please verify if these are necessary dependencies or if they're being installed for a specific purpose.

✅ Verification successful

Dependencies are necessary for code analysis functionality

The parso and jedi packages are actively used in the codebase for Python code analysis:

  • Used in get_local_dependencies.py for parsing code and extracting code entities
  • Used in extract_code_parts.py for parsing source code
  • Used in top_down_repo_parse.py for project-wide code analysis

These packages are essential dependencies for the repository's code analysis functionality and their installation in the workflow is justified.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if parso and jedi are required dependencies
poetry show --tree | grep -A 5 -E "parso|jedi"

Length of output: 145


Script:

#!/bin/bash
# Search for any references to parso or jedi in Python files
rg -l "import (parso|jedi)" -t python

# Search for any references in requirements files
fd "requirements.*\.txt$" --exec grep -l "parso\|jedi" {} \;

# Search for any references in pyproject.toml or setup files
fd "(pyproject\.toml|setup\.py|setup\.cfg)$" --exec grep -l "parso\|jedi" {} \;

# Look for any documentation or comments mentioning these packages
rg -l "parso|jedi" -g "!*.{pyc,pyo,pyd}" -g "!.git"

Length of output: 442


Script:

#!/bin/bash
# Check the content of files that reference these packages
rg "parso|jedi" poetry.lock -A 5 -B 5

# Check how these packages are used in the Python files
rg "parso|jedi" cognee/tasks/repo_processor/get_local_dependencies.py
rg "parso|jedi" cognee/tasks/repo_processor/extract_code_parts.py
rg "parso|jedi" cognee/tasks/repo_processor/top_down_repo_parse.py

Length of output: 5192

cognee/infrastructure/llm/openai/adapter.py (1)

9-9: LGTM: Import statements are correctly added

The new imports for MonitoringTool and get_base_config are appropriately placed and necessary for the monitoring configuration feature.

Also applies to: 13-13

Comment on lines 13 to 15
langfuse_public_key: Optional[str] = os.getenv("LANGFUSE_PUBLIC_KEY")
langfuse_secret_key: Optional[str] = os.getenv("LANGFUSE_SECRET_KEY")
langfuse_host: Optional[str] = os.environ["LANGFUSE_HOST"]

⚠️ Potential issue

Fix potential KeyError in environment variable access

The direct access to LANGFUSE_HOST using os.environ[] could raise a KeyError if the environment variable isn't set. Consider using os.getenv() for consistent behavior with other environment variables.

-    langfuse_host: Optional[str] = os.environ["LANGFUSE_HOST"]
+    langfuse_host: Optional[str] = os.getenv("LANGFUSE_HOST")
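The difference between the two access styles is easy to demonstrate (a minimal sketch; the variable is deliberately unset first):

```python
import os

# Ensure the variable is absent for the demonstration.
os.environ.pop("LANGFUSE_HOST", None)

# os.getenv returns None (or a supplied default) when the variable is unset...
host = os.getenv("LANGFUSE_HOST")

# ...whereas subscripting os.environ raises KeyError.
try:
    os.environ["LANGFUSE_HOST"]
    raised = False
except KeyError:
    raised = True
```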

Comment on lines 36 to 37
@classmethod
@classmethod

⚠️ Potential issue

Remove duplicate @classmethod decorator

The @classmethod decorator is duplicated.

-    @classmethod
-    @classmethod
+    @classmethod

Comment on lines +74 to +83
# Pickle Serialization
def to_pickle(self) -> bytes:
"""Serialize the instance to pickle-compatible bytes."""
return pickle.dumps(self.dict())

@classmethod
def from_pickle(self, pickled_data: bytes):
"""Deserialize the instance from pickled bytes."""
data = pickle.loads(pickled_data)
return self(**data)

🛠️ Refactor suggestion

⚠️ Potential issue

Security concern: Replace pickle with a safer serialization method

Using pickle for serialization poses a security risk as it can execute arbitrary code during deserialization. Consider using a safer alternative like JSON or MessagePack.

-    def to_pickle(self) -> bytes:
-        """Serialize the instance to pickle-compatible bytes."""
-        return pickle.dumps(self.dict())
-
-    @classmethod
-    def from_pickle(self, pickled_data: bytes):
-        """Deserialize the instance from pickled bytes."""
-        data = pickle.loads(pickled_data)
-        return self(**data)
+    def to_bytes(self) -> bytes:
+        """Serialize the instance to bytes using JSON."""
+        return self.json().encode('utf-8')
+
+    @classmethod
+    def from_bytes(cls, data: bytes):
+        """Deserialize the instance from JSON bytes."""
+        return cls.parse_raw(data)
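The security concern is concrete: unpickling invokes whatever callable the payload names via `__reduce__`, so untrusted bytes can run attacker-chosen code. A harmless demonstration (the class name is made up; a real attacker would smuggle in something like `os.system` instead of `sorted`):

```python
import pickle


class LooksInnocent:
    def __reduce__(self):
        # On unpickling, pickle calls this callable with these arguments.
        # Any importable callable can be substituted here.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(LooksInnocent())
result = pickle.loads(payload)  # invokes sorted([3, 1, 2]), not a constructor
```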

Comment on lines 39 to 47
base_config = get_base_config()
if base_config.monitoring_tool == MonitoringTool.LANGFUSE:
# set callbacks
# litellm.success_callback = ["langfuse"]
# litellm.failure_callback = ["langfuse"]
self.aclient.success_callback = ["langfuse"]
self.aclient.failure_callback = ["langfuse"]
self.client.success_callback = ["langfuse"]
self.client.failure_callback = ["langfuse"]

🛠️ Refactor suggestion

Refactor callback configuration and add error handling

The current implementation has several areas for improvement:

  1. Duplicate callback configuration for both clients
  2. Missing error handling for base_config
  3. Missing documentation for the monitoring feature

Consider refactoring like this:

+    def _configure_langfuse_callbacks(self, client):
+        """Configure Langfuse callbacks for the given client."""
+        client.success_callback = ["langfuse"]
+        client.failure_callback = ["langfuse"]

     def __init__(
         self,
         api_key: str,
         endpoint: str,
         api_version: str,
         model: str,
         transcription_model: str,
         streaming: bool = False,
     ):
+        """Initialize OpenAI adapter with optional Langfuse monitoring.
+        
+        Args:
+            api_key (str): OpenAI API key
+            endpoint (str): API endpoint
+            api_version (str): API version
+            model (str): Model identifier
+            transcription_model (str): Model for transcription
+            streaming (bool, optional): Enable streaming. Defaults to False.
+        """
         self.aclient = instructor.from_litellm(litellm.acompletion)
         self.client = instructor.from_litellm(litellm.completion)
         self.transcription_model = transcription_model
         self.model = model
         self.api_key = api_key
         self.endpoint = endpoint
         self.api_version = api_version
         self.streaming = streaming

-        base_config = get_base_config()
-        if base_config.monitoring_tool == MonitoringTool.LANGFUSE:
-            self.aclient.success_callback = ["langfuse"]
-            self.aclient.failure_callback = ["langfuse"]
-            self.client.success_callback = ["langfuse"]
-            self.client.failure_callback = ["langfuse"]
+        try:
+            base_config = get_base_config()
+            if base_config and base_config.monitoring_tool == MonitoringTool.LANGFUSE:
+                self._configure_langfuse_callbacks(self.aclient)
+                self._configure_langfuse_callbacks(self.client)
+        except Exception as e:
+            # Log the error but don't fail initialization
+            print(f"Warning: Failed to configure monitoring: {str(e)}")

💡 Codebase verification

Missing Langfuse environment documentation and host configuration

The Langfuse integration requires proper documentation and configuration:

  • The required environment variables (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST) are not documented in any .md files
  • LANGFUSE_HOST is accessed directly with os.environ[] instead of os.getenv(), which could cause crashes if not set
🔗 Analysis chain

Verify Langfuse configuration requirements

Since Langfuse monitoring is being added, we should ensure all necessary configuration is present.


Let me check the actual Langfuse configuration content to verify if all necessary settings are present.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for Langfuse configuration in base config and environment variables

# Check if BaseConfig contains Langfuse configuration
echo "Checking BaseConfig for Langfuse configuration..."
rg -l "langfuse" --type py

# Check for environment variable documentation
echo "Checking for environment variable documentation..."
find . -type f -name "*.md" -exec grep -l "LANGFUSE" {} \;

Length of output: 456


Script:

#!/bin/bash
# Check Langfuse configuration details

# Check BaseConfig for Langfuse settings
echo "=== BaseConfig Langfuse settings ==="
rg -A 5 -B 5 "langfuse" cognee/base_config.py

# Check environment variables and configuration in data models
echo -e "\n=== Data models Langfuse configuration ==="
rg -A 5 -B 5 "langfuse" cognee/shared/data_models.py

# Check if Langfuse secret/public key settings exist
echo -e "\n=== Environment variables for Langfuse ==="
rg -i "LANGFUSE.*KEY|LANGFUSE.*SECRET" .

Length of output: 1434

base_config = get_base_config()
if base_config.monitoring_tool == MonitoringTool.LANGFUSE:
# set callbacks
# litellm.success_callback = ["langfuse"]
Contributor:

Let's remove these commented lines.

Contributor Author:

Done

id: UUID = Field(default_factory = uuid4)
updated_at: Optional[datetime] = datetime.now(timezone.utc)
id: UUID = Field(default_factory=uuid4)
created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
Contributor:

What is the difference between datetime.now(timezone.utc) and this one?

Contributor Author:

created_at is when the initial record was created, updated at is any change that happens
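To make the reviewer's question concrete: the old field held a timezone-aware `datetime` object, while the new fields store the same instant as an integer millisecond count. The two representations are interconvertible:

```python
from datetime import datetime, timezone

# A timezone-aware datetime object (the previous representation)...
now = datetime.now(timezone.utc)

# ...versus the same instant as integer milliseconds since the epoch
# (what created_at / updated_at store now).
now_ms = int(now.timestamp() * 1000)

# Round-trip back to a datetime, accurate to within one millisecond.
roundtrip = datetime.fromtimestamp(now_ms / 1000, tz=timezone.utc)
```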

id: UUID = Field(default_factory=uuid4)
created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
version: str = "0.1" # Default version
Contributor:

I would keep it as a number, and we can just increase it with each version. (1, 2, 3, 4...)

Contributor Author:

kk
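The reviewer's suggestion — an integer version incremented on each change — removes the need for format validation entirely. A minimal sketch (class and method names are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class VersionedSketch:
    version: int = 1
    updated_at: int = 0

    def bump_version(self) -> None:
        # With an integer version there is no string format to validate:
        # each change simply increments the counter and stamps the time.
        self.version += 1
        self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)
```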

created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
version: str = "0.1" # Default version
source: Optional[str] = None # Path to file, URL, etc.
Contributor:

source is a Document model related property, doesn't belong to this general DataPoint model.

Contributor Author:

fair

updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
version: str = "0.1" # Default version
source: Optional[str] = None # Path to file, URL, etc.
type: Optional[str] = "text" # "text", "file", "image", "video"
Contributor:

Same for type, doesn't belong here.

Contributor Author:

Laslzo asked me for this one, due to retriever logic. In general I agree

topological_rank: Optional[int] = 0
extra: Optional[str] = "extra" # For additional properties
Contributor:

extra is unnecessary. All other models that extend DataPoint can add their own properties.


@classmethod
Contributor:

Double @classmethod

def update_version(self, new_version: str):
"""Update the version and updated_at timestamp."""
self.version = new_version
Contributor:

If we have a number as a version, we can do +1 here then.

Contributor Author

Fixed


# JSON Serialization
def to_json(self) -> str:
Contributor

Why do we need this serialization?

Contributor Author

So you can parallelize tasks, since you had issues with that. Pickle or JSON.
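To make the parallelization argument concrete, here is a stdlib-only sketch of the JSON round-trip the PR implements on `DataPoint` (there via Pydantic's model methods). The `DataPointSketch` class and its fields are simplified stand-ins for illustration, not the actual model:

```python
import json
import time
import uuid


class DataPointSketch:
    """Simplified stand-in for the PR's DataPoint model (illustration only)."""

    def __init__(self, id=None, version=1, created_at=None, updated_at=None):
        self.id = id or str(uuid.uuid4())
        self.version = version
        now_ms = int(time.time() * 1000)
        self.created_at = created_at if created_at is not None else now_ms
        self.updated_at = updated_at if updated_at is not None else now_ms

    def to_json(self) -> str:
        # Serialize to a JSON string so the object can cross process boundaries.
        return json.dumps(self.__dict__)

    @classmethod
    def from_json(cls, payload: str) -> "DataPointSketch":
        # Rebuild an equivalent instance on the other side of the boundary.
        return cls(**json.loads(payload))


point = DataPointSketch(version=3)
restored = DataPointSketch.from_json(point.to_json())
assert restored.__dict__ == point.__dict__
```

Because the payload is plain JSON, it can be handed to worker processes or queues without the arbitrary-code-execution risk that pickle carries.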

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/infrastructure/engine/models/DataPoint.py (1)

71-80: ⚠️ Potential issue

Security concern: Remove pickle serialization

Using pickle for serialization poses a security risk as it can execute arbitrary code during deserialization. Since JSON serialization is already implemented and serves the same purpose, the pickle methods should be removed.

-    # Pickle Serialization
-    def to_pickle(self) -> bytes:
-        """Serialize the instance to pickle-compatible bytes."""
-        return pickle.dumps(self.dict())
-
-    @classmethod
-    def from_pickle(self, pickled_data: bytes):
-        """Deserialize the instance from pickled bytes."""
-        data = pickle.loads(pickled_data)
-        return self(**data)

If binary serialization is needed for parallelization (as mentioned in past comments), consider using MessagePack or Protocol Buffers instead.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7657b8e and 2bfc657.

📒 Files selected for processing (3)
  • cognee/base_config.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (2 hunks)
  • cognee/infrastructure/llm/openai/adapter.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cognee/infrastructure/llm/openai/adapter.py
  • cognee/base_config.py
🔇 Additional comments (4)
cognee/infrastructure/engine/models/DataPoint.py (4)

20-21: Add validation for timestamp fields

The timestamp fields should validate against negative values.

-    created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
-    updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
+    created_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )
+    updated_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0
+    )
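The `ge=0` constraint above guards against negative epoch values. A stdlib sketch of the same millisecond-since-epoch computation with an equivalent plain guard (the function names here are illustrative, not from the PR):

```python
from datetime import datetime, timezone


def utc_now_ms() -> int:
    """Milliseconds since the Unix epoch, as used by created_at/updated_at."""
    return int(datetime.now(timezone.utc).timestamp() * 1000)


def validate_timestamp(ms: int) -> int:
    """Plain-Python mirror of Pydantic's ge=0 field constraint (illustration)."""
    if ms < 0:
        raise ValueError("timestamp must be >= 0 milliseconds since epoch")
    return ms


stamp = validate_timestamp(utc_now_ms())
```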

82-89: LGTM: Dict serialization methods are well implemented

The implementation correctly uses Pydantic's model_dump and model_validate methods, following best practices for dictionary serialization.


16-18: Add tests for versioning functionality

The new versioning feature needs test coverage to ensure correct behavior, especially for:

  • Version incrementation
  • Timestamp updates
  • Serialization/deserialization of versioned objects

Would you like me to help create test cases for these scenarios?


56-59: ⚠️ Potential issue

Update version increment logic for integer version

Since version is being changed to an integer, the increment logic needs to be updated.

     def update_version(self):
         """Update the version and updated_at timestamp."""
-        self.version += 1
+        self.version = self.version + 1  # Explicit increment for clarity
         self.updated_at = int(datetime.now(timezone.utc).timestamp() * 1000)

Likely invalid or redundant comment.
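For reference, the numeric-version behavior discussed in this thread can be exercised in isolation. A minimal stdlib stand-in for the model (class name and fields are illustrative, not the actual `DataPoint`):

```python
import time


class VersionedPoint:
    """Minimal stand-in for DataPoint's versioning behavior (illustration)."""

    def __init__(self):
        now_ms = int(time.time() * 1000)
        self.version = 1
        self.created_at = now_ms
        self.updated_at = now_ms

    def update_version(self) -> None:
        # Bump the numeric version and refresh the update timestamp together,
        # so the two never drift apart.
        self.version += 1
        self.updated_at = int(time.time() * 1000)


p = VersionedPoint()
p.update_version()
assert p.version == 2
assert p.updated_at >= p.created_at
```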

cognee/infrastructure/engine/models/DataPoint.py (outdated, resolved)
Vasilije1990 and others added 2 commits January 6, 2025 13:49
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

gitguardian bot commented Jan 6, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
9573981 Triggered Generic Password 667f973 notebooks/hr_demo.ipynb View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
cognee/infrastructure/engine/models/DataPoint.py (3)

9-10: ⚠️ Potential issue

Security Risk: Avoid using pickle for serialization

Using pickle for serialization poses a security risk as it can execute arbitrary code during deserialization. Consider using JSON serialization which is already implemented in the class.

Since JSON serialization methods are already implemented (to_json, from_json), we should remove the pickle-based methods entirely.


20-21: 🛠️ Refactor suggestion

Add validation for timestamp fields

The timestamp fields should validate against negative values.

-    created_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
-    updated_at: int = Field(default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000))
+    created_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0,
+        description="Creation timestamp in milliseconds since epoch"
+    )
+    updated_at: int = Field(
+        default_factory=lambda: int(datetime.now(timezone.utc).timestamp() * 1000),
+        ge=0,
+        description="Last update timestamp in milliseconds since epoch"
+    )

71-80: ⚠️ Potential issue

Remove pickle serialization methods

These methods should be removed as they pose security risks. Use the JSON or dict serialization methods instead.

🧹 Nitpick comments (2)
cognee/infrastructure/engine/models/DataPoint.py (2)

23-23: Document type field constraints

Since the type field was specifically requested for retriever logic, please add documentation about its purpose and valid values.

-    type: Optional[str] = "text"  # "text", "file", "image", "video"
+    type: Optional[str] = Field(
+        default="text",
+        description="Content type for retriever logic. Valid values: text, file, image, video"
+    )
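A stricter alternative to documenting the valid values in a description is to encode them in the type itself. A stdlib sketch using `typing.Literal` with a runtime guard (names are illustrative; the PR keeps `type` as `Optional[str]`):

```python
from typing import Literal, get_args

# The four content types named in the PR's comment.
ContentType = Literal["text", "file", "image", "video"]

VALID_TYPES = set(get_args(ContentType))


def validate_content_type(value: str) -> str:
    """Runtime guard mirroring the documented valid values (illustration)."""
    if value not in VALID_TYPES:
        raise ValueError(f"invalid content type: {value!r}")
    return value


assert validate_content_type("text") == "text"
```

With Pydantic, annotating the field as `ContentType` would make the library enforce this at validation time instead.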

Line range hint 34-42: Add docstring to get_embeddable_data method

For consistency with other methods, please add a docstring explaining the purpose and return value of this method.

     @classmethod
     def get_embeddable_data(self, data_point):
+        """Retrieve the primary embeddable data from the data point.
+        
+        Args:
+            data_point: The data point instance to extract data from
+        
+        Returns:
+            The stripped string value of the first index field if it's a string,
+            otherwise returns the raw value
+        """
         if data_point._metadata and len(data_point._metadata["index_fields"]) > 0 \
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2bfc657 and d5243b4.

📒 Files selected for processing (1)
  • cognee/infrastructure/engine/models/DataPoint.py (2 hunks)
🔇 Additional comments (4)
cognee/infrastructure/engine/models/DataPoint.py (4)

30-33: LGTM: Proper configuration for private attributes

The Pydantic configuration correctly handles the private _metadata field.


56-59: LGTM: Clean version update implementation

The version update implementation correctly follows the numeric versioning approach as discussed in previous comments.


61-69: LGTM: Secure JSON serialization implementation

The JSON serialization methods correctly utilize Pydantic's built-in functionality.


82-89: LGTM: Proper dictionary serialization implementation

The dictionary serialization methods correctly use Pydantic's model_dump and model_validate functions.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
cognee/infrastructure/llm/openai/adapter.py (1)

Fix indentation issues throughout the file

The following formatting issues need to be addressed:

  • Remove indentation for class-level attributes (name, model, api_key, api_version)
  • Fix method parameter indentation to use 4 spaces instead of 8 (e.g., in __init__, acreate_structured_output)
  • Normalize indentation in multi-line string content (e.g., in the content field of messages)
🔗 Analysis chain

Line range hint 1-200: Fix formatting issues

The pipeline indicates formatting issues that need to be addressed with Ruff formatter.

Run the following command to fix formatting:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check specific formatting issues in this file
rg --line-number '[ ]{2,}' cognee/infrastructure/llm/openai/adapter.py

Length of output: 5223

🧰 Tools
🪛 GitHub Actions: ruff format

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.

♻️ Duplicate comments (1)
cognee/infrastructure/llm/openai/adapter.py (1)

48-53: 🛠️ Refactor suggestion

Refactor callback configuration

Previous review comments already suggested refactoring this callback configuration pattern.

As previously suggested, consider extracting this into a separate method:

+    def _configure_langfuse_callbacks(self, client):
+        """Configure Langfuse callbacks for the given client."""
+        client.success_callback = ["langfuse"]
+        client.failure_callback = ["langfuse"]

     def __init__(self, ...):
         # ...
         try:
             if base_config.monitoring_tool == MonitoringTool.LANGFUSE:
-                self.aclient.success_callback = ["langfuse"]
-                self.aclient.failure_callback = ["langfuse"]
-                self.client.success_callback = ["langfuse"]
-                self.client.failure_callback = ["langfuse"]
+                self._configure_langfuse_callbacks(self.aclient)
+                self._configure_langfuse_callbacks(self.client)
         except Exception as e:
             print(f"Warning: Failed to configure monitoring: {str(e)}")
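The suggested helper can be exercised with dummy client objects to show why extracting it removes duplication. A self-contained sketch (the `DummyClient` and `AdapterSketch` names are illustrative stand-ins, not the real litellm clients or `OpenAIAdapter`):

```python
class DummyClient:
    """Stand-in for the sync/async LLM client objects (illustration only)."""

    def __init__(self):
        self.success_callback = None
        self.failure_callback = None


class AdapterSketch:
    def __init__(self):
        self.client = DummyClient()
        self.aclient = DummyClient()

    def _configure_langfuse_callbacks(self, client) -> None:
        # One place to wire callbacks, applied to any client instance.
        client.success_callback = ["langfuse"]
        client.failure_callback = ["langfuse"]

    def configure(self) -> None:
        self._configure_langfuse_callbacks(self.client)
        self._configure_langfuse_callbacks(self.aclient)


adapter = AdapterSketch()
adapter.configure()
assert adapter.client.success_callback == ["langfuse"]
assert adapter.aclient.failure_callback == ["langfuse"]
```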
🧰 Tools
🪛 GitHub Actions: ruff format

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5243b4 and 667f973.

📒 Files selected for processing (1)
  • cognee/infrastructure/llm/openai/adapter.py (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/infrastructure/llm/openai/adapter.py

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test

Comment on lines +54 to +60


@observe(as_type='generation')
async def acreate_structured_output(self, text_input: str, system_prompt: str,
                                        response_model: Type[BaseModel]) -> BaseModel:


Contributor

🛠️ Refactor suggestion

Standardize decorator usage and add error handling

The @observe decorator usage is inconsistent between methods and lacks error handling:

  1. acreate_structured_output uses as_type='generation'
  2. create_structured_output uses default parameters
  3. Missing fallback if decorator import fails

Consider standardizing the decorator usage:

-    @observe(as_type='generation')
+    @observe(as_type='generation', name='async_structured_output')
     async def acreate_structured_output(self, text_input: str, system_prompt: str,
                                     response_model: Type[BaseModel]) -> BaseModel:
         """Generate a response from a user query."""

-    @observe
+    @observe(as_type='generation', name='structured_output')
     def create_structured_output(self, text_input: str, system_prompt: str,
                              response_model: Type[BaseModel]) -> BaseModel:

Also, add docstring documentation for the monitoring functionality:

def create_structured_output(self, text_input: str, system_prompt: str,
                           response_model: Type[BaseModel]) -> BaseModel:
    """Generate a response from a user query.
    
    Args:
        text_input (str): The input text to process
        system_prompt (str): The system prompt to use
        response_model (Type[BaseModel]): The Pydantic model for response validation
        
    Returns:
        BaseModel: The structured response
        
    Note:
        This method is monitored by Langfuse when monitoring_tool is configured.
    """
🧰 Tools
🪛 GitHub Actions: ruff format

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.

Comment on lines +15 to +20

monitoring = get_base_config().monitoring_tool
if monitoring == MonitoringTool.LANGFUSE:
    from langfuse.decorators import observe


Contributor

🛠️ Refactor suggestion

Improve initialization robustness and error handling

The global configuration setup has several potential issues:

  1. Using global variables for configuration can lead to initialization order problems
  2. Missing error handling for get_base_config()
  3. Potential circular dependency with base_config import

Consider moving this logic into a class method:

-monitoring = get_base_config().monitoring_tool
-if monitoring == MonitoringTool.LANGFUSE:
-    from langfuse.decorators import observe
+def get_monitoring_config():
+    try:
+        config = get_base_config()
+        return config.monitoring_tool if config else None
+    except Exception as e:
+        print(f"Warning: Failed to load monitoring configuration: {str(e)}")
+        return None
+
+def import_langfuse():
+    try:
+        from langfuse.decorators import observe
+        return observe
+    except ImportError:
+        print("Warning: Langfuse package not installed")
+        return lambda x: x  # No-op decorator
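The no-op decorator fallback suggested above can be verified in isolation by forcing the import to fail. A sketch under the assumption that the module name is parameterized (the `load_observe` helper is illustrative, not part of the PR):

```python
import importlib


def load_observe(module_name: str = "langfuse.decorators"):
    """Return the real observe decorator if importable, else a no-op decorator."""
    try:
        module = importlib.import_module(module_name)
        return module.observe
    except ImportError:
        # Fallback keeps decorated functions callable without the package.
        return lambda fn: fn


# A module name that certainly does not exist forces the fallback path.
observe = load_observe("definitely_missing_module_xyz")


@observe
def answer() -> int:
    return 42
```

With this pattern, code that decorates methods with `@observe` keeps working whether or not Langfuse is installed.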
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
monitoring = get_base_config().monitoring_tool
if monitoring == MonitoringTool.LANGFUSE:
from langfuse.decorators import observe
def get_monitoring_config():
    try:
        config = get_base_config()
        return config.monitoring_tool if config else None
    except Exception as e:
        print(f"Warning: Failed to load monitoring configuration: {str(e)}")
        return None

def import_langfuse():
    try:
        from langfuse.decorators import observe
        return observe
    except ImportError:
        print("Warning: Langfuse package not installed")
        return lambda x: x  # No-op decorator
🧰 Tools
🪛 GitHub Actions: ruff format

[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
