add filesystem copy helper #2806
base: devel
Conversation
Force-pushed from a414d2b to c8b6667
I think the implementation itself is good, but I'm not sure this should be user facing. My understanding is that what we need is some kind of copy command. I'd wrap this in a helper function (see the sketch after this list):
- it uses the filesystem source to read files (the user needs to pass an instance of the resource)
- we add a transformer / transform that imports the files (https://dlthub.com/docs/api_reference/dlt/extract/extractors#with_file_import); we do not normalize those files, as the user is apparently not interested in schemas :)
- we set additional config to prevent the tables from being created, so the obscure settings stay hidden. we should also block creation of the state on the pipeline level
- we should run the supplied pipeline
- (optional): copy mode enables files of any type, not only those known by dlt
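A minimal sketch of such a wrapper, assuming the filesystem resource yields FileItemDict items and that dlt.mark.with_file_import forwards a local file to the loader as-is; the helper name copy_files and the temp-file download step are illustrative, not the PR's actual API:

import os
import tempfile

import dlt
from dlt.sources.filesystem import FileItemDict


@dlt.transformer()
def import_file(items):
    # forward each file unchanged to the loader; no rows are extracted,
    # so nothing is normalized and no schema is inferred for file contents
    for item in items:
        if not isinstance(item, FileItemDict):
            raise ValueError(f"expected FileItemDict, got {type(item)}")
        ext = os.path.splitext(item["file_name"])[1][1:]
        local_path = os.path.join(tempfile.mkdtemp(), item["file_name"])
        item.fsspec.download(item["file_url"], local_path)
        yield dlt.mark.with_file_import(local_path, ext)


def copy_files(pipeline: dlt.Pipeline, resource, **run_kwargs):
    # hide the extra config from the user and run the supplied pipeline
    # on filesystem resource | transformer
    pipeline.run(resource | import_file, **run_kwargs)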
# for string we assume it is a local file path
# TODO: does this make sense?
elif isinstance(item, str):
    local_file_path = item
we should raise here if there is a data item that we can't use
this should never happen... filesystem yields FileItemDict. on anything else we should raise...
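The guard being asked for might look like this (the exception type and message are placeholders, not from the PR):

for item in items:
    if not isinstance(item, FileItemDict):
        # filesystem is only expected to yield FileItemDict items
        raise ValueError(f"expected FileItemDict, got {type(item)}")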
this looks pretty good! maybe we can ask our users if they would like such a command?
ext = os.path.splitext(local_file_path)[1][1:]

# TODO: should we raise? Should it be configurable?
if ext not in [
hmmm maybe we can copy anything that is in the glob? but then we need more hacks in filesystem to allow any file type
"""Copy a file from the source to the destination. | ||
|
||
Args: | ||
resource: The source to copy from. |
outdated
raise ValueError(f"Invalid resource parameter type: {type(items)}")

with custom_environ({"DESTINATION__EXPERIMENTAL_EXCLUDE_DLT_TABLES": "True"}):
    pipeline.run(items, **(run_kwargs or {}))
you can set this directly on the pipeline.destination.config_params, no need to mock the env variable.
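A sketch of that suggestion, assuming config_params is a plain dict and that the key spelling mirrors the env variable (both are assumptions):

# set the experimental flag on the destination factory instead of
# mocking DESTINATION__EXPERIMENTAL_EXCLUDE_DLT_TABLES in the environment
pipeline.destination.config_params["experimental_exclude_dlt_tables"] = True
pipeline.run(items, **(run_kwargs or {}))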
Description
This PR implements a copy / import helper for the filesystem. When used, all files iterated by the filesystem resource are forwarded unchanged into the destination filesystem and no dlt metadata tables are created.
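A hypothetical usage example, reusing the copy_files wrapper sketched in the review thread above (the helper name and signature are assumptions, since the final public API is still under discussion):

import dlt
from dlt.sources.filesystem import filesystem

# select the files to copy with the filesystem resource, then forward them
# unchanged into the destination bucket via the hypothetical copy_files helper
files = filesystem(bucket_url="s3://source-bucket/data", file_glob="**/*.parquet")
pipeline = dlt.pipeline("copy_example", destination="filesystem")
copy_files(pipeline, files)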
TODO