-
Notifications
You must be signed in to change notification settings - Fork 117
Refactoring Catalog.get_dataset()
#1249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideThis PR refactors the dataset lookup API by changing Sequence diagram for dataset lookup with new get_dataset signaturesequenceDiagram
participant Caller
participant Catalog
participant Metastore
Caller->>Catalog: get_dataset(name, namespace_name, project_name)
Catalog->>Metastore: get_dataset(name, namespace_name, project_name)
Metastore-->>Catalog: DatasetRecord
Catalog-->>Caller: DatasetRecord
Class diagram for refactored Catalog.get_dataset and Metastore.get_datasetclassDiagram
class Catalog {
+get_dataset(name: str, namespace_name: Optional[str] = None, project_name: Optional[str] = None) DatasetRecord
}
class Metastore {
+get_dataset(name: str, namespace_name: Optional[str] = None, project_name: Optional[str] = None, conn=None) DatasetRecord
}
Catalog --> Metastore : uses
class Project {
+name: str
+namespace: Namespace
}
class Namespace {
+name: str
}
class DatasetRecord {
+name: str
+project: Project
+latest_version: str
+has_version(version: str): bool
}
Metastore --> Project
Project --> Namespace
Catalog --> DatasetRecord
Metastore --> DatasetRecord
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ilongin - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/data_storage/metastore.py:921` </location>
<code_context>
) -> DatasetRecord:
"""Creates new dataset."""
- project_id = project_id or self.default_project.id
+ if not project_id:
+ project = self.default_project
+ else:
+ project = self.get_project_by_id(project_id)
query = self._datasets_insert().values(
</code_context>
<issue_to_address>
Logic for determining project_id in create_dataset may be redundant.
Since project.id is always used in the insert, fetching the full project object when project_id is already provided may be unnecessary. Consider simplifying to avoid the extra lookup.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
if not project_id: | ||
project = self.default_project | ||
else: | ||
project = self.get_project_by_id(project_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: Logic for determining project_id in create_dataset may be redundant.
Since project.id is always used in the insert, fetching the full project object when project_id is already provided may be unnecessary. Consider simplifying to avoid the extra lookup.
for r in catalog.get_dataset("dogs_custom_columns").get_version("1.0.0").preview: | ||
assert isinstance(r.get("file__last_modified"), str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Avoid loops in tests. (no-loop-in-tests
)
Explanation
Avoid complex code, like loops, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
ds = self._parse_dataset(self.db.execute(query, conn=conn)) | ||
if not ds: | ||
raise DatasetNotFoundError( | ||
f"Dataset {name} not found in project with id {project_id}" | ||
f"Dataset {name} not found in namespace {namespace_name}" | ||
f" and project {project_name}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): We've found these issues:
- Use named expression to simplify assignment and conditional (
use-named-expression
) - Lift code into else after jump in control flow (
reintroduce-else
) - Swap if/else branches (
swap-if-else-branches
)
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1249 +/- ##
==========================================
+ Coverage 88.74% 88.75% +0.01%
==========================================
Files 153 153
Lines 13848 13843 -5
Branches 1938 1939 +1
==========================================
- Hits 12289 12287 -2
+ Misses 1103 1100 -3
Partials 456 456
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Deploying datachain-documentation with
|
Latest commit: |
8060b62
|
Status: | ✅ Deploy successful! |
Preview URL: | https://165cd897.datachain-documentation.pages.dev |
Branch Preview URL: | https://ilongin-1237-refactore-get-d.datachain-documentation.pages.dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use fully qualified dataset names instead of 3 params: name, namespace, project.
self, | ||
name: str, | ||
namespace_name: Optional[str] = None, | ||
project_name: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use full project name instead of additional params - ns1.ns2.ds3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is internal API where we already have it split this into 3 parts. In public API we use fully qualified name. Honestly I wouldn't change this atm as it would again require me some refactoring which would take time and we already have this kind of signature in other internal methods so not much would be changed. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's approve to move fast. But we are accumulating tech depth using different notations in internal and external APIs. Please create a followup issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
Refactoring
Catalog.get_dataset()
to acceptproject_name
andnamespace_name
instead of already instantiatedProject
. This is because in more than one place we have project and namespace name available, but don't have wholeProject
instance and we needed to fetch it from DB because of that which was basically +1 SQL command that could've been avoided.Summary by Sourcery
Refactor dataset lookup to use namespace and project names instead of passing Project objects throughout the codebase
Enhancements: