passing list of files (e.g., parquets) as input from previous task #634
-
Hi all, I'm looking for some help structuring my tasks correctly when I have a set or unknown number of input files. In the past I have worked with dask on a set of parquet files on disk, or used duckdb to treat a folder of files as a database. Sometimes I specify the number of partitions to create, and sometimes I let dask do the partitioning. I think this is separate from the dask + pytask option that handles distributed computing of tasks. So my question is: what is the best way in the pytask workflow to pass this "distributed" data to a task? I've included a toy example (I'm not sure it would actually run, but it hopefully gets the point across) of a basic workflow that takes a collection of input CSVs, uses dask to read them, partitions them, and saves them to disk, followed by a function that does some sort of analysis and outputs results. Between the first and second task, is this the best approach, or is there something else?
Replies: 2 comments 1 reply
-
Hi @nkruskamp! Can you take a look at this guide that explains provisional nodes? I believe the DirectoryNode is what you are looking for. If you have some feedback for the guide, let me know.
-
Hi @tobiasraabe, this is great and exactly what I needed to get the tasks running. Another situation I have to address is when a task has an unknown number of inputs that are not saved in the same folder but have to be specified separately. Using a parameter function to create the dictionary of arguments, each task run could have 1 to n input files with the exact paths listed. My initial test is to pass the input files as a list, along with an equal-length list of sheet names (or other needed variables). That seems to work, but I wanted to see if there is a more pytask way to accomplish this. Thanks again for your help!
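For readers following along, the parallel-list approach described above might look roughly like the sketch below. It is not taken from the thread: the `SPECS` mapping, the file names, the `bld` output folder, and both helper functions are hypothetical, and the code uses only the standard library so it runs without pytask installed.

```python
from pathlib import Path

# Hypothetical specification: task id -> (input file, sheet name) pairs.
# The files may live in different folders, so each path is listed explicitly.
SPECS = {
    "survey_2020": [("raw/a.xlsx", "Sheet1")],
    "survey_2021": [("raw/b.xlsx", "Data"), ("extra/c.xlsx", "Summary")],
}


def build_task_kwargs(specs, out_dir=Path("bld")):
    """Return {task_id: kwargs} with paths and sheet names as parallel lists."""
    kwargs_by_id = {}
    for task_id, pairs in specs.items():
        kwargs_by_id[task_id] = {
            # Building both lists from the same pairs keeps them the same length.
            "paths": [Path(path) for path, _ in pairs],
            "sheet_names": [sheet for _, sheet in pairs],
            "produces": out_dir / f"{task_id}.parquet",
        }
    return kwargs_by_id


def pair_inputs(paths, sheet_names):
    """Re-pair the parallel lists inside a task, failing early on a mismatch."""
    if len(paths) != len(sheet_names):
        raise ValueError("paths and sheet_names must have the same length")
    return list(zip(paths, sheet_names))
```

With pytask, each entry of the returned dictionary could presumably be registered in a loop with `@task(id=task_id, kwargs=kwargs)`. This assumes that pytask collects the `Path` objects inside the `paths` list as dependencies, which is worth confirming against the pytask documentation for the version in use.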