Dev/download data file dataflow#544

Open
EreboPSilva wants to merge 3 commits into main from dev/download-data-file-dataflow

@EreboPSilva
Member

Requirements

When creating your Pull request, please fill out the template below:

PR details

Is this a fix/update/new feature?

Fix/Update.

Include a short description

To prevent HiveDownloadData from overwriting data, which wastes time and resources, it has been slightly modified to skip files that already exist in the destination folder and have a valid md5sum hash, and to handle the dataflow into HiveCalculateReadLength based on the presence or absence of a value for the file already in the read_length_table. To accomplish this, a minor change to the config module was needed (just the addition of a static param).

Commits are messy, as I tried two different approaches before realising how the pre_cleanup subroutines actually work.

Include links to JIRA tickets

https://embl.atlassian.net/browse/ENSGENEBUI-3582

Testing

Have you tested it?

Keeping this as a draft while testing it, to allow others to test it as well (@JackCurragh).

Assign to the weekly GitHub reviewer

If you are a member of Ensembl, please check the Genebuild weekly Rotas and assign this week's GitHub reviewer to the PR

No reviews yet.

With this change the module will not delete everything and re-download.
Instead, for an already existing file it will check the md5sum and, if it
matches the one in the csv file, skip that file. If the file is not in
the csv or the md5sum does not match (indicating a corrupted file),
it will proceed as usual (delete and re-download).
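
For reviewers who want the skip rule spelled out, here is a minimal sketch of the check described above. It is written in Python purely for illustration and is not the module's own code; the names `md5sum` and `needs_download`, and the idea of passing the csv hash in as a plain string, are assumptions made for this example.

```python
import hashlib
import os


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_md5):
    """Hypothetical helper: decide whether a file must be (re-)downloaded.

    expected_md5 stands in for the hash listed in the csv file, or None
    when the file has no entry there.
    """
    if not os.path.isfile(path):
        return True  # nothing on disk yet, download as usual
    if expected_md5 and md5sum(path) == expected_md5.lower():
        return False  # present and verified, so skip it
    # No hash to verify against, or a mismatch (possibly corrupted):
    # delete the file and download it again.
    os.remove(path)
    return True
```

Only a file that is both present and hash-verified is kept; every other case falls through to the original delete and re-download behaviour.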
For debugging and clarity purposes
pre_cleanup now only deletes files with md5 mismatch or no md5 defined,
leaving verified files for fetch_input to handle. fetch_input exits early
when a file is present and md5-verified, querying the read_length_table
to decide whether to dataflow to HiveCalculateReadLength or skip it,
avoiding duplicate job INSERT errors on re-runs and manual file placement.
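
The early-exit decision in fetch_input described above can be sketched roughly as follows. This is an illustrative Python outline, not the module's code: the `Action` enum, the `decide_action` function, and the dict standing in for the read_length_table are all assumptions made for the example.

```python
from enum import Enum


class Action(Enum):
    DOWNLOAD = "download"
    DATAFLOW_READ_LENGTH = "dataflow_read_length"
    SKIP = "skip"


def decide_action(file_verified, read_lengths, filename):
    """Hypothetical outline of the fetch_input early-exit decision.

    file_verified: True when the file is present and its md5 matched.
    read_lengths:  dict standing in for the read_length_table,
                   mapping filename -> stored read length.
    """
    if not file_verified:
        # pre_cleanup has already removed the bad or unverifiable file,
        # so the normal download path runs.
        return Action.DOWNLOAD
    if read_lengths.get(filename) is None:
        # Verified file with no stored read length, for example a
        # manually placed file: still flow to HiveCalculateReadLength.
        return Action.DATAFLOW_READ_LENGTH
    # Verified file with a stored read length: flowing again would
    # create a duplicate job on a re-run, so skip it.
    return Action.SKIP
```

For instance, `decide_action(True, {"a.fastq.gz": 150}, "a.fastq.gz")` gives `Action.SKIP`, while the same call with an empty dict gives `Action.DATAFLOW_READ_LENGTH`.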

These changes were required because of a misunderstanding of the behaviour
of pre_cleanup, and to cater to the niche but possible scenario of manually
added files whose read length calculation might have been skipped
by the previous approach.
@EreboPSilva EreboPSilva marked this pull request as ready for review April 10, 2026 09:59
Contributor

@ens-ftricomi ens-ftricomi left a comment

is it ready for review?

2 participants