add filesystem copy helper #2806

Open
sh-rp wants to merge 4 commits into devel from feat/2631-experimental-exclude-dlt-tables

Conversation

sh-rp
Collaborator

@sh-rp sh-rp commented Jun 24, 2025

Description

This PR implements a copy / import helper for the filesystem. When used, all files iterated by the filesystem resource are forwarded unchanged into the destination filesystem and no dlt metadata tables are created.

TODO

  • Implement test stubs
  • Set up CI to test the filesystem source against all buckets (the credentials "problem" is already solved in the copy tests)
  • Update docs to explain how to use this feature

netlify bot commented Jun 24, 2025

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 7693143
🔍 Latest deploy log https://app.netlify.com/projects/dlt-hub-docs/deploys/686fc8032e0fee0008231926

@sh-rp sh-rp force-pushed the feat/2631-experimental-exclude-dlt-tables branch from a414d2b to c8b6667 Compare June 24, 2025 12:43
@sh-rp sh-rp changed the title add experimental setting toallow to disable creation of internal dlt table folders for filesystem add experimental setting to allow to disable creation of internal dlt table folders for filesystem Jun 25, 2025
@sh-rp sh-rp marked this pull request as ready for review June 25, 2025 11:50
@sh-rp sh-rp requested a review from rudolfix June 27, 2025 08:02
Collaborator

@rudolfix rudolfix left a comment

I think the implementation itself is good, but I'm not sure this should be user-facing. My understanding is that what we need is some kind of copy command. I'd wrap this in a helper function that:

  1. uses the filesystem source to read files (the user needs to pass an instance of the resource)
  2. adds a transformer / transform that imports the files (https://dlthub.com/docs/api_reference/dlt/extract/extractors#with_file_import). We do not normalize those files, as the user is apparently not interested in schemas :)
  3. sets additional config to prevent the tables from being created, so obscure settings stay hidden. We should also block creation of the state at the pipeline level
  4. runs the supplied pipeline
  5. (optional) in copy mode, allows files of any type, not only those known by dlt
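
The steps above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `FakePipeline`, `mark_for_import`, and `copy_files` are hypothetical stand-ins written with the standard library only, and the `experimental_exclude_dlt_tables` key is an assumed name derived from the environment variable discussed in this thread. The real helper would take a dlt filesystem resource and use dlt's file import mechanism instead.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Iterator, List, Optional


class FakePipeline:
    """Stand-in for a pipeline object. This is NOT the dlt Pipeline class."""

    def __init__(self) -> None:
        # Mimics the shape of `pipeline.destination.config_params`.
        self.destination = SimpleNamespace(config_params={})
        self.loaded: List[Dict[str, Any]] = []

    def run(self, data: Iterable[Dict[str, Any]], **kwargs: Any) -> int:
        # Record what would be loaded and return the number of items.
        self.loaded = list(data)
        return len(self.loaded)


def mark_for_import(items: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Step 2 (hypothetical): tag each file for import as-is, with no normalization."""
    for item in items:
        yield {"import_file": item["file_url"]}


def copy_files(
    pipeline: FakePipeline,
    file_items: Iterable[Dict[str, Any]],
    run_kwargs: Optional[Dict[str, Any]] = None,
) -> int:
    # Step 3: hide the obscure setting inside the helper (assumed key name).
    pipeline.destination.config_params["experimental_exclude_dlt_tables"] = True
    # Steps 1, 2 and 4: read the items, mark them for import, run the pipeline.
    return pipeline.run(mark_for_import(file_items), **(run_kwargs or {}))


pipeline = FakePipeline()
count = copy_files(
    pipeline,
    [{"file_url": "file:///data/a.csv"}, {"file_url": "file:///data/b.bin"}],
)
```

The point of the wrapper is that the caller only supplies a pipeline and a source of file items; the setting that suppresses metadata tables never appears in user code.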

@sh-rp sh-rp self-assigned this Jun 30, 2025
@sh-rp sh-rp linked an issue Jul 7, 2025 that may be closed by this pull request
@sh-rp sh-rp changed the title add experimental setting to allow to disable creation of internal dlt table folders for filesystem add filesystem copy helper Jul 10, 2025
# for string we assume it is a local file path
# TODO: does this make sense?
elif isinstance(item, str):
    local_file_path = item
Collaborator Author

we should raise here if there is a data item that we can't use

Collaborator

this should never happen... filesystem yields FileItemDict. on anything else we should raise...
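
A minimal sketch of the "raise on anything else" behavior suggested here, assuming a dict-like file item. `FileItemStandIn` and `resolve_local_path` are hypothetical names used only for illustration; they are not dlt's `FileItemDict` or code from this PR, and the `local_path` key is an assumption of this sketch.

```python
from typing import Any


class FileItemStandIn(dict):
    """Hypothetical stand-in for the file item type yielded by the filesystem source."""


def resolve_local_path(item: Any) -> str:
    """Return the local path of a file item, raising on any other item type."""
    if isinstance(item, FileItemStandIn):
        return item["local_path"]
    # Plain strings and everything else are rejected instead of being guessed at.
    raise TypeError(
        f"Cannot copy item of type {type(item).__name__}; expected a file item"
    )


path = resolve_local_path(FileItemStandIn(local_path="/tmp/a.csv"))
```

Rejecting plain strings removes the "assume it is a local file path" branch that the TODO in the diff questions.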

Collaborator

@rudolfix rudolfix left a comment

this looks pretty good! maybe we can ask our users if they like such command?

ext = os.path.splitext(local_file_path)[1][1:]

# TODO: should we raise? Should it be configurable?
if ext not in [
Collaborator

hmmm, maybe we can copy anything that matches the glob? but then we need more hacks in filesystem to allow any file type
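
One way to make the extension check configurable, as the TODO in the diff asks, is sketched below. The extension is extracted the same way as in the diff (`os.path.splitext(...)[1][1:]`); everything else is an assumption of this sketch: `should_copy`, the `allowed` parameter, and the `on_unknown` switch are hypothetical names, and the list of extensions dlt actually accepts is not reproduced here.

```python
import os
from typing import Collection, Optional


def file_extension(path: str) -> str:
    """Return the extension without the leading dot ('' if there is none)."""
    return os.path.splitext(path)[1][1:]


def should_copy(
    path: str,
    allowed: Optional[Collection[str]] = None,
    on_unknown: str = "raise",
) -> bool:
    """Decide whether a file is copied.

    allowed=None means "copy anything that matched the glob".
    on_unknown is either "raise" or "skip" (hypothetical switch).
    """
    if allowed is None:
        return True
    ext = file_extension(path)
    if ext in allowed:
        return True
    if on_unknown == "skip":
        return False
    raise ValueError(f"Unsupported file extension '{ext}' for file {path}")


# Example values only; these are not the extensions checked in the PR.
copy_any = should_copy("data/blob.bin")
skipped = should_copy("data/blob.bin", allowed={"csv"}, on_unknown="skip")
```

With `allowed=None` the glob alone decides what is copied, which corresponds to the "copy anything" option raised above.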

"""Copy a file from the source to the destination.

Args:
    resource: The source to copy from.
Collaborator

outdated

raise ValueError(f"Invalid resource parameter type: {type(items)}")

with custom_environ({"DESTINATION__EXPERIMENTAL_EXCLUDE_DLT_TABLES": "True"}):
    pipeline.run(items, **(run_kwargs or {}))
Collaborator

you can set this directly on pipeline.destination.config_params, no need to mock the env variable.

Development

Successfully merging this pull request may close these issues.

Prevent creation of _dlt folders
2 participants