Refactoring Catalog.get_dataset() #1249

Open: ilongin wants to merge 2 commits into main

ilongin (Contributor) commented Jul 18, 2025

This PR refactors Catalog.get_dataset() to accept project_name and namespace_name instead of an already instantiated Project. In more than one place we have the project and namespace names available but not the whole Project instance, so we had to fetch it from the DB first, which cost one extra SQL query that could have been avoided.
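
To make the saved round trip concrete, here is a minimal sketch that counts simulated queries for both call shapes. `FakeMetastore` and its three methods are hypothetical stand-ins invented for illustration, not the actual datachain classes or method names.

```python
from dataclasses import dataclass


@dataclass
class Project:
    id: int
    name: str
    namespace_name: str


class FakeMetastore:
    """Toy stand-in (hypothetical) that counts simulated SQL round trips."""

    def __init__(self) -> None:
        self.queries = 0
        self._projects = {("dev", "animals"): Project(1, "animals", "dev")}
        self._datasets = {("dev", "animals", "dogs"): "dogs-record"}

    def get_project(self, name: str, namespace_name: str) -> Project:
        self.queries += 1  # one query just to load the Project row
        return self._projects[(namespace_name, name)]

    def get_dataset_by_project(self, name: str, project: Project) -> str:
        self.queries += 1  # one query to load the dataset itself
        return self._datasets[(project.namespace_name, project.name, name)]

    def get_dataset_by_names(
        self, name: str, namespace_name: str, project_name: str
    ) -> str:
        self.queries += 1  # a single query keyed directly by the names
        return self._datasets[(namespace_name, project_name, name)]


# Old shape: the caller only knows the names, so it loads the Project first.
old = FakeMetastore()
project = old.get_project("animals", "dev")
old_record = old.get_dataset_by_project("dogs", project)

# New shape: the names go straight into the dataset lookup.
new = FakeMetastore()
new_record = new.get_dataset_by_names("dogs", "dev", "animals")

print(old.queries, new.queries)
```

Both paths return the same record; the only difference in this toy model is the number of simulated queries.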

Summary by Sourcery

Refactor dataset lookup to use namespace and project names instead of passing Project objects throughout the codebase

Enhancements:

  • Change Catalog.get_dataset signature to accept namespace_name and project_name parameters
  • Update Metastore.get_dataset interface and queries to use namespace and project names
  • Modify all internal callers and tests to pass namespace_name and project_name rather than Project instances

sourcery-ai bot (Contributor) commented Jul 18, 2025

Reviewer's Guide

This PR refactors the dataset lookup API by changing Catalog.get_dataset (and its metastore counterpart) to accept namespace_name and project_name instead of a Project object, centralizes default resolution of namespace/project, and updates all callers and tests accordingly.

Sequence diagram for dataset lookup with new get_dataset signature

sequenceDiagram
    participant Caller
    participant Catalog
    participant Metastore
    Caller->>Catalog: get_dataset(name, namespace_name, project_name)
    Catalog->>Metastore: get_dataset(name, namespace_name, project_name)
    Metastore-->>Catalog: DatasetRecord
    Catalog-->>Caller: DatasetRecord

Class diagram for refactored Catalog.get_dataset and Metastore.get_dataset

classDiagram
    class Catalog {
        +get_dataset(name: str, namespace_name: Optional[str] = None, project_name: Optional[str] = None) DatasetRecord
    }
    class Metastore {
        +get_dataset(name: str, namespace_name: Optional[str] = None, project_name: Optional[str] = None, conn=None) DatasetRecord
    }
    Catalog --> Metastore : uses
    class Project {
        +name: str
        +namespace: Namespace
    }
    class Namespace {
        +name: str
    }
    class DatasetRecord {
        +name: str
        +project: Project
        +latest_version: str
        +has_version(version: str): bool
    }
    Metastore --> Project
    Project --> Namespace
    Catalog --> DatasetRecord
    Metastore --> DatasetRecord
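
The class diagram can be read as the following Python sketch. The dataclasses mirror the diagram's fields, but `InMemoryMetastore` and the `DEFAULT_NAMESPACE` / `DEFAULT_PROJECT` placeholder values are assumptions made for illustration; the real metastore runs a SQL query and the real defaults come from the project's own configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Namespace:
    name: str


@dataclass(frozen=True)
class Project:
    name: str
    namespace: Namespace


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    project: Project
    latest_version: str


class InMemoryMetastore:
    """Hypothetical stand-in; a real metastore would run a SQL query."""

    def __init__(self, records):
        self._by_key = {
            (r.project.namespace.name, r.project.name, r.name): r for r in records
        }

    def get_dataset(
        self,
        name: str,
        namespace_name: Optional[str] = None,
        project_name: Optional[str] = None,
        conn=None,
    ) -> DatasetRecord:
        return self._by_key[(namespace_name, project_name, name)]


class Catalog:
    # Placeholder defaults, assumed for this sketch only.
    DEFAULT_NAMESPACE = "default_ns"
    DEFAULT_PROJECT = "default_project"

    def __init__(self, metastore: InMemoryMetastore) -> None:
        self.metastore = metastore

    def get_dataset(
        self,
        name: str,
        namespace_name: Optional[str] = None,
        project_name: Optional[str] = None,
    ) -> DatasetRecord:
        # Resolve defaults once, then delegate straight to the metastore.
        return self.metastore.get_dataset(
            name,
            namespace_name=namespace_name or self.DEFAULT_NAMESPACE,
            project_name=project_name or self.DEFAULT_PROJECT,
        )


default_project = Project("default_project", Namespace("default_ns"))
animals = Project("animals", Namespace("dev"))
catalog = Catalog(
    InMemoryMetastore(
        [
            DatasetRecord("dogs", default_project, "1.0.0"),
            DatasetRecord("cats", animals, "2.1.0"),
        ]
    )
)
```

With this setup, `catalog.get_dataset("dogs")` resolves through the placeholder defaults, while `catalog.get_dataset("cats", "dev", "animals")` names the namespace and project explicitly.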

File-Level Changes

Change: Refactor Catalog.get_dataset signature and internals
Details:
  • Replace the project: Project parameter with namespace_name and project_name
  • Resolve default namespace/project inside the method for listing and default cases
  • Remove the old try/except and delegate directly to metastore.get_dataset
Files:
  • src/datachain/catalog/catalog.py

Change: Update Metastore.get_dataset interface and query logic
Details:
  • Extend signature to accept namespace_name and project_name instead of project_id
  • Adjust SQL query to join namespaces and projects tables by name
  • Modify dataset creation and version methods to call new signature
Files:
  • src/datachain/data_storage/metastore.py

Change: Propagate new get_dataset signature across all callers
Details:
  • Replace calls passing Project objects with namespace/project name arguments
  • Align helper modules (listing, query, delta, datachain core, CLI) to new API
  • Simplify redundant code around dataset resolution
Files:
  • src/datachain/query/dataset.py
  • src/datachain/listing.py
  • src/datachain/lib/dc/datasets.py
  • src/datachain/lib/dc/datachain.py
  • src/datachain/delta.py
  • src/datachain/cli/commands/datasets.py

Change: Adjust functional tests to reflect new signature
Details:
  • Remove passing Project to get_dataset in tests
  • Verify namespace/project defaults by calling get_dataset(name) only
Files:
  • tests/func/test_datasets.py
  • tests/func/test_pull.py
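
The "join namespaces and projects tables by name" change can be sketched with the standard-library `sqlite3` module. The table and column names below are an assumed, simplified schema chosen for illustration, and `DatasetNotFoundError` is a local stand-in class; only the error message wording is taken from the PR diff.

```python
import sqlite3


class DatasetNotFoundError(Exception):
    """Local stand-in for the error raised when no dataset matches."""


def get_dataset(conn, name, namespace_name, project_name):
    # One query: filter by names through the joins instead of loading
    # the project row first and then filtering by its id.
    row = conn.execute(
        """
        SELECT d.id, d.name
        FROM datasets AS d
        JOIN projects AS p ON d.project_id = p.id
        JOIN namespaces AS n ON p.namespace_id = n.id
        WHERE d.name = ? AND p.name = ? AND n.name = ?
        """,
        (name, project_name, namespace_name),
    ).fetchone()
    if row is None:
        raise DatasetNotFoundError(
            f"Dataset {name} not found in namespace {namespace_name}"
            f" and project {project_name}"
        )
    return row


# Assumed, simplified schema for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE namespaces (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, namespace_id INTEGER);
    CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT, project_id INTEGER);
    INSERT INTO namespaces VALUES (1, 'dev');
    INSERT INTO projects VALUES (1, 'animals', 1);
    INSERT INTO datasets VALUES (1, 'dogs', 1);
    """
)

print(get_dataset(conn, "dogs", "dev", "animals"))
```

A lookup with a name that matches no row, for example `get_dataset(conn, "cats", "dev", "animals")`, raises `DatasetNotFoundError` in this sketch.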

Possibly linked issues

  • Initial DataChain Commit #1: The PR refactors Catalog.get_dataset to accept namespace_name and project_name, directly addressing the issue.

sourcery-ai bot (Contributor) left a comment

Hey @ilongin - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/data_storage/metastore.py:921` </location>
<code_context>
     ) -> DatasetRecord:
         """Creates new dataset."""
-        project_id = project_id or self.default_project.id
+        if not project_id:
+            project = self.default_project
+        else:
+            project = self.get_project_by_id(project_id)

         query = self._datasets_insert().values(
</code_context>

<issue_to_address>
Logic for determining project_id in create_dataset may be redundant.

Since project.id is always used in the insert, fetching the full project object when project_id is already provided may be unnecessary. Consider simplifying to avoid the extra lookup.
</issue_to_address>

Comment on lines +921 to +924
if not project_id:
    project = self.default_project
else:
    project = self.get_project_by_id(project_id)

nitpick: Logic for determining project_id in create_dataset may be redundant.

Since project.id is always used in the insert, fetching the full project object when project_id is already provided may be unnecessary. Consider simplifying to avoid the extra lookup.
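
A minimal sketch of the two shapes this comment contrasts, assuming only the id is needed downstream. `ToyMetastore`, `resolve_with_lookup`, and `resolve_id_only` are hypothetical names for illustration; if `create_dataset` needs other Project fields later, or relies on the lookup to validate that the project exists, the extra fetch is not redundant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    id: int
    name: str


class ToyMetastore:
    """Hypothetical stand-in that counts project lookups."""

    def __init__(self) -> None:
        self.default_project = Project(id=1, name="default")
        self._projects = {1: self.default_project, 2: Project(id=2, name="animals")}
        self.lookups = 0

    def get_project_by_id(self, project_id: int) -> Project:
        self.lookups += 1  # stands in for one SQL query
        return self._projects[project_id]

    def resolve_with_lookup(self, project_id: Optional[int] = None) -> int:
        # Shape of the reviewed code: fetch the whole Project, then use its id.
        if not project_id:
            project = self.default_project
        else:
            project = self.get_project_by_id(project_id)
        return project.id

    def resolve_id_only(self, project_id: Optional[int] = None) -> int:
        # Shape the comment suggests: no fetch when the id is already known.
        # Note: this does not check that the project actually exists.
        return project_id or self.default_project.id


store = ToyMetastore()
with_lookup = store.resolve_with_lookup(2)
lookups_after_first = store.lookups
id_only = store.resolve_id_only(2)
print(with_lookup, id_only, lookups_after_first, store.lookups)
```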

Comment on lines +732 to 733
for r in catalog.get_dataset("dogs_custom_columns").get_version("1.0.0").preview:
    assert isinstance(r.get("file__last_modified"), str)

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To achieve that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
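
One way to apply the "move the complex logic into helpers" option is sketched below. `non_str_values` is a hypothetical helper and `PREVIEW` is made-up stand-in data for the rows that `catalog.get_dataset("dogs_custom_columns").get_version("1.0.0").preview` would yield; the test body itself ends up as a single assertion with no loop.

```python
def non_str_values(rows, key):
    """Return the values stored under `key` that are not strings."""
    return [row.get(key) for row in rows if not isinstance(row.get(key), str)]


# Made-up stand-in for the dataset preview rows used in the real test.
PREVIEW = [
    {"file__last_modified": "2025-07-18T09:00:00+00:00"},
    {"file__last_modified": "2025-07-18T09:05:00+00:00"},
]


def test_last_modified_is_serialized_as_str():
    # No loop in the test: an empty list means every value was a str.
    assert non_str_values(PREVIEW, "file__last_modified") == []
```

A failing run reports the offending values directly, since the helper returns them rather than a bare boolean.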

Comment on lines 1254 to 1259
         ds = self._parse_dataset(self.db.execute(query, conn=conn))
         if not ds:
             raise DatasetNotFoundError(
-                f"Dataset {name} not found in project with id {project_id}"
+                f"Dataset {name} not found in namespace {namespace_name}"
+                f" and project {project_name}"
             )

issue (code-quality): We've found these issues:

codecov bot commented Jul 18, 2025

Codecov Report

Attention: Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.

Project coverage is 88.75%. Comparing base (5bd9d5f) to head (60e6717).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
src/datachain/catalog/catalog.py | 83.33% | 3 Missing ⚠️
src/datachain/cli/commands/datasets.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1249      +/-   ##
==========================================
+ Coverage   88.74%   88.75%   +0.01%     
==========================================
  Files         153      153              
  Lines       13848    13843       -5     
  Branches     1938     1939       +1     
==========================================
- Hits        12289    12287       -2     
+ Misses       1103     1100       -3     
  Partials      456      456              
Flag | Coverage Δ
datachain | 88.69% <88.57%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines | Coverage Δ
src/datachain/data_storage/metastore.py | 93.86% <100.00%> (+0.05%) ⬆️
src/datachain/delta.py | 92.77% <100.00%> (ø)
src/datachain/lib/dc/datachain.py | 91.40% <100.00%> (ø)
src/datachain/lib/dc/datasets.py | 95.12% <100.00%> (ø)
src/datachain/listing.py | 85.32% <100.00%> (+1.25%) ⬆️
src/datachain/query/dataset.py | 93.37% <ø> (-0.02%) ⬇️
src/datachain/cli/commands/datasets.py | 71.95% <0.00%> (+0.86%) ⬆️
src/datachain/catalog/catalog.py | 85.97% <83.33%> (-0.06%) ⬇️

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 8060b62
Status: ✅  Deploy successful!
Preview URL: https://165cd897.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1237-refactore-get-d.datachain-documentation.pages.dev

dmpetrov (Member) left a comment

Let's use fully qualified dataset names instead of 3 params: name, namespace, project.

self,
name: str,
namespace_name: Optional[str] = None,
project_name: Optional[str] = None,
Member

Let's use the fully qualified dataset name instead of additional params, e.g. ns1.ns2.ds3

Contributor Author

This is an internal API where we already have the name split into 3 parts; in the public API we use the fully qualified name. Honestly, I wouldn't change this at the moment: it would require more refactoring, which would take time, and we already have this kind of signature in other internal methods, so not much would change. WDYT?

Member

Let's approve to move fast. But we are accumulating tech debt by using different notations in internal and external APIs. Please create a follow-up issue.
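
For the follow-up discussion, the two notations in this thread can be bridged by a small adapter like the one sketched here. `DatasetRef` and `split_qualified_name` are hypothetical names, and the sketch assumes a fully qualified name has exactly three dot-separated parts (namespace, project, dataset) with no dots inside any part; a bare dataset name leaves namespace and project unset so that defaults can be applied elsewhere.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    namespace_name: Optional[str]
    project_name: Optional[str]
    name: str


def split_qualified_name(full_name: str) -> DatasetRef:
    """Split 'namespace.project.name' (assumed format) into its three parts."""
    parts = full_name.split(".")
    if any(not part for part in parts):
        raise ValueError(f"Empty component in dataset name: {full_name!r}")
    if len(parts) == 1:
        # Bare name: namespace and project are left for default resolution.
        return DatasetRef(None, None, parts[0])
    if len(parts) == 3:
        return DatasetRef(*parts)
    raise ValueError(
        f"Expected 'name' or 'namespace.project.name', got {full_name!r}"
    )


ref = split_qualified_name("ns1.ns2.ds3")
print(ref)
```

An internal method that takes three parameters could then be called as `get_dataset(ref.name, ref.namespace_name, ref.project_name)`, keeping a single public notation at the boundary.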

@ilongin ilongin requested a review from dmpetrov August 4, 2025 09:45
dmpetrov (Member) left a comment

LG

Successfully merging this pull request may close these issues.

Refactor Catalog.get_dataset() to accept project_name and namespace_name instead of Project
3 participants