[THIS-1049] Neat alpha autofix infrastructure#1569
Conversation
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. |
☂️ Python Coverage
Overall Coverage
New Files
Modified Files
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1569 +/- ##
==========================================
- Coverage 91.76% 91.73% -0.03%
==========================================
Files 121 122 +1
Lines 7065 7161 +96
==========================================
+ Hits 6483 6569 +86
- Misses 582 592 +10
🚀 New features to boost your workflow:
|
fb4de40 to
4a7321c
Compare
Add ability for validators to provide automatic fixes for issues they detect. This implements the core infrastructure and fixes for: Infrastructure: - FixAction class extending ResourceChange for atomic schema fixes - Helper functions for generating auto IDs (constraints, indexes) - Orchestrator support for apply_fixes parameter - Tracking of applied fixes in provenance Fixable validators: - MissingRequiresConstraint: Adds requires constraints - SuboptimalRequiresConstraint: Removes suboptimal constraints - RequiresConstraintCycle: Removes constraints to break cycles - MissingReverseDirectRelationTargetIndex: Adds indexes All fixes use __auto suffix for easy identification and are only applied when explicitly enabled via the orchestrator.
4a7321c to
945ab97
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a FixAction class for applying atomic fixes to schema issues, along with helper functions for generating constraint and index identifiers, ensuring they adhere to CDF's length limits by truncating and hashing if necessary. It also includes validators for identifying and resolving performance-related issues such as missing requires constraints and unindexed reverse direct relations, as well as breaking constraint cycles. The review comments suggest an iterative approach to applying fixes to avoid conflicts and ensure a more robust autofix mechanism, as applying one fix can change the data model in a way that invalidates other pending fixes.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new autofix infrastructure for DMS validators, allowing FixAction objects to automatically resolve validation issues. The changes include defining FixAction and helper functions for auto-generated IDs, integrating fix application into the DmsDataModelRulesOrchestrator, and implementing fix methods for several performance-related validators. The new fix methods correctly generate FixAction objects to add/remove constraints and indexes. Comprehensive unit and end-to-end tests have been added to verify the functionality of the fix actions and their ability to resolve validation issues.
doctrino
left a comment
There was a problem hiding this comment.
I like the FixAction structure. I suggest utilizing the pydantic model_copy method for efficient update. See comment
nikokaoja
left a comment
There was a problem hiding this comment.
Comments in the first pass are related to:
- scope of PR
- readability of code
…xed snapshot - Merge _fix_actions.py + _fix_helpers.py into _fix.py - Move as_resource_update onto FixAction as a method (uses model_copy instead of in-place mutation) - Group fixes by resource ID in orchestrator with conflict detection - Return fixed_snapshot from orchestrator.run() for provenance - Revert validator changes to separate follow-up PR Co-authored-by: Cursor <cursoragent@cursor.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a robust infrastructure for automatically fixing data model validation issues. The core components like FixAction and FixApplicator are well-designed and the integration into the existing provenance and session management through transform_physical is clean. The changes are extensive but logical, and the addition of tests for the new components is great.
I've found a critical bug in the conflict detection logic within FixApplicator and a high-severity issue related to incorrect provenance tracking for applied fixes. Please see the detailed comments for suggestions on how to resolve these.
| for action in self._fix_actions: | ||
| fix_by_resource_id[action.resource_id].append(action) | ||
|
|
||
| resources_list_lookup: dict[type, dict[SchemaResourceId, DataModelResource]] = { |
There was a problem hiding this comment.
We should likely refactor RequestSchema to use dicts for more efficient lookup instead of having to do the lookup table here. Tracking this here: https://cognitedata.atlassian.net/browse/THIS-1068
There was a problem hiding this comment.
Then we can avoid maintaining SchemaSnapshot class, as that one in an exactly the shape your FixApplicator needs
There was a problem hiding this comment.
I am wandering if we attach a resource look up on RequestSchema could be an alternative @doctrino
nikokaoja
left a comment
There was a problem hiding this comment.
Good work, some optional refactoring which can be done in another PR (not critical)
| for action in self._fix_actions: | ||
| fix_by_resource_id[action.resource_id].append(action) | ||
|
|
||
| resources_list_lookup: dict[type, dict[SchemaResourceId, DataModelResource]] = { |
There was a problem hiding this comment.
Then we can avoid maintaining SchemaSnapshot class, as that one in an exactly the shape your FixApplicator needs
| for action in self._fix_actions: | ||
| fix_by_resource_id[action.resource_id].append(action) | ||
|
|
||
| resources_list_lookup: dict[type, dict[SchemaResourceId, DataModelResource]] = { |
There was a problem hiding this comment.
I am wandering if we attach a resource look up on RequestSchema could be an alternative @doctrino
| if resource_lookup is None: | ||
| raise RuntimeError( | ||
| f"{type(self).__name__}: Unsupported resource type {type(resource_id)}. This is a bug in NEAT." | ||
| ) | ||
| resource = resource_lookup.get(resource_id) | ||
| if resource is None: | ||
| raise RuntimeError( | ||
| f"{type(self).__name__}: Resource {resource_id} not found in schema. This is a bug in NEAT." | ||
| ) |
| def _check_no_field_path_conflicts(self, changes: list[FieldChange]) -> None: | ||
| """Raise if any changes touch a field_path already modified by a previous change.""" | ||
| seen_paths: set[str] = set() |
| def make_auto_id(base_id: str) -> str: | ||
| """Generate an auto-generated identifier with truncation if needed. | ||
|
|
||
| CDF has a 43-character limit on constraint/index identifiers. This function | ||
| ensures the ID stays within that limit while maintaining uniqueness. | ||
|
|
||
| Args: | ||
| base_id: The primary identifier to use (e.g., external_id or property_id). | ||
|
|
||
| Returns: | ||
| For short base_ids (≤37 chars): "{base_id}__auto" | ||
| For long base_ids (>37 chars): "{truncated_id}_{hash}__auto" | ||
| """ | ||
| if len(base_id) <= MAX_BASE_LENGTH_NO_HASH: | ||
| return f"{base_id}{AUTO_SUFFIX}" | ||
|
|
||
| hash_suffix = hashlib.sha256(base_id.encode()).hexdigest()[:HASH_LENGTH] | ||
| truncated_id = base_id[:MAX_BASE_LENGTH_WITH_HASH] | ||
| return f"{truncated_id}_{hash_suffix}{AUTO_SUFFIX}" | ||
|
|
||
|
|
||
| def make_auto_constraint_id(dst: ContainerReference) -> str: | ||
| """Generate a constraint identifier for auto-generated requires constraints.""" | ||
| return make_auto_id(dst.external_id) | ||
|
|
||
|
|
||
| def make_auto_index_id(property_id: str) -> str: | ||
| """Generate an index identifier for auto-generated indexes.""" | ||
| return make_auto_id(property_id) |
There was a problem hiding this comment.
Feel free to move this into utils under a dedicated module identifiers or similar
| def transform_physical(self, activity: Callable, on_success: OnSuccess | None = None) -> Change: | ||
| """Transform the current physical data model and record in provenance.""" | ||
| change, transformed_model = self._do_activity(activity, on_success) | ||
|
|
||
| if transformed_model: | ||
| self.physical_data_model.append(transformed_model) | ||
|
|
||
| self.provenance.append(change) | ||
| return change |
There was a problem hiding this comment.
If possible transform_physical should return None and updates to the provenance (change) occur here, as in the example of read_physical.
This is optional refactoring, not nesseraly needed to be done in this PR
| applicator = FixApplicator(self._store.physical_data_model[-1], on_success.pending_fixes) | ||
| post_fix_on_success = self._create_on_success() | ||
| change = self._store.transform_physical(applicator.apply_fixes, post_fix_on_success) | ||
| change.applied_fixes = on_success.pending_fixes |
There was a problem hiding this comment.
Optional refactoring (in another PR):
_create_on_success can be extended allowing creation of data model transformer on_success object, then _read_validate_fix becomes simpler.
| applicator = FixApplicator(self._store.physical_data_model[-1], on_success.pending_fixes) | ||
| post_fix_on_success = self._create_on_success() | ||
| change = self._store.transform_physical(applicator.apply_fixes, post_fix_on_success) | ||
| change.applied_fixes = on_success.pending_fixes |
There was a problem hiding this comment.
Optional refactoring (in another PR):
_create_on_success can be extended allowing creation of data model transformer on_success object, then _read_validate_fix becomes
| target_entity: str | None = field(default="FailedEntity") | ||
| issues: IssueList | None = field(default=None) | ||
| errors: IssueList | None = field(default=None) | ||
| applied_fixes: list[FixAction] | None = field(default=None) |
There was a problem hiding this comment.
optional nitpic , keep it simple
| applied_fixes: list[FixAction] | None = field(default=None) | |
| fixes: list[FixAction] | None = field(default=None) |
Description
Add core infrastructure for automatic fixing data models. Introduces
FixAction(immutable data model for field-level changes),FixApplicator(groups fixes by resource, checks for conflicts, and applies them efficiently via in-place mutation of a deep copy), andtransform_physicalonNeatStoreto integrate fix application into the provenance pipeline. The orchestrator and session layer are wired up to support the read-validate-fix flow behind an alpha feature flag.The code path for fixe functionality is disabled until exposed in a later PR.
Bump