Dev/download data file dataflow by EreboPSilva · Pull Request #544 · Ensembl/ensembl-analysis

EreboPSilva · 2026-03-06T16:08:15Z

Requirements

When creating your Pull request, please fill out the template below:

PR details

Is this a fix/ update/ new feature?

Fix/Update.

Include a short description

To prevent HiveDownloadData from overwriting data, which wastes time and resources, it has been slightly modified to skip files that already exist in the destination folder and have a valid md5sum hash, and to handle the dataflow into HiveCalculateReadLength depending on whether a value for the file is already recorded there. To accomplish this, a minor change to the config module was needed (just the addition of a static param).

Commits are messy, as I tried two different approaches before realising how the pre_cleanup subroutines actually work.

Include links to JIRA tickets

https://embl.atlassian.net/browse/ENSGENEBUI-3582

Testing

Have you tested it?

Keeping this as a draft while I test it, to allow others to test it as well (@JackCurragh).

Assign to the weekly GitHub reviewer

If you are a member of Ensembl, please check the Genebuild weekly Rotas and assign this week's GitHub reviewer to the PR

No reviews yet.

With this change the module no longer deletes everything and re-downloads it. Instead, for a file that already exists, it checks the md5sum and, if it matches the one in the csv file, skips that file. If the file is not in the csv, or the md5sum does not match (indicating a corrupted file), it proceeds as usual (delete and re-download).
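The skip-or-redownload check described above can be sketched as follows. This is an illustrative Python stand-in, not the Perl code in HiveDownloadData; the function names `md5sum` and `needs_download` are hypothetical, and the expected hash is assumed to have already been read from the csv file.

```python
import hashlib
from pathlib import Path
from typing import Optional


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_md5: Optional[str]) -> bool:
    """Hypothetical helper: True when the file must be (re-)downloaded.

    A file is kept only if it exists AND an expected md5 is known AND
    the hashes match. Every other case falls back to delete and re-download.
    """
    if not path.exists():
        return True
    if not expected_md5:  # file not listed in the csv, or no md5 given
        return True
    return md5sum(path) != expected_md5.strip().lower()
```

Treating a missing md5 the same as a mismatch keeps the old behaviour as the default, so only files that can be positively verified are skipped.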

For debugging and clarity purposes

pre_cleanup now only deletes files with an md5 mismatch or no md5 defined, leaving verified files for fetch_input to handle. fetch_input exits early when a file is present and md5-verified, querying the read_length_table to decide whether to dataflow to HiveCalculateReadLength or skip it. This avoids duplicate job INSERT errors on re-runs and after manual file placement. These changes were needed because I had misunderstood the behaviour of pre_cleanup, and to cater for the niche but possible scenario where files are added manually and their read length calculation might have been skipped under the previous approach.
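The decision that fetch_input makes can be summarised in a small sketch. This is again illustrative Python rather than the module's Perl; the `Action` enum, the `decide` function, and the use of a plain dict as a stand-in for the read_length_table lookup are all assumptions made for the example.

```python
from enum import Enum
from typing import Mapping


class Action(Enum):
    DOWNLOAD = "download"                        # proceed as usual
    FLOW_TO_READ_LENGTH = "flow_to_read_length"  # file kept, read length still missing
    SKIP = "skip"                                # file kept, read length already stored


def decide(filename: str,
           file_present: bool,
           md5_verified: bool,
           read_lengths: Mapping[str, int]) -> Action:
    """Hypothetical summary of the early-exit logic described above.

    read_lengths stands in for a query against the read_length_table:
    a filename is a key only if its read length has already been calculated.
    """
    if not (file_present and md5_verified):
        return Action.DOWNLOAD
    if filename not in read_lengths:
        # Covers manually placed files that were never measured.
        return Action.FLOW_TO_READ_LENGTH
    # Nothing left to do; flowing again would create a duplicate job.
    return Action.SKIP
```

For example, a verified file that was copied in by hand and has no entry yet yields `Action.FLOW_TO_READ_LENGTH`, while the same file on a later re-run yields `Action.SKIP`.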

ens-ftricomi

Is it ready for review?

EreboPSilva added 3 commits February 23, 2026 14:12

Added better reporting

f484c99

For debugging and clarity purposes

EreboPSilva marked this pull request as ready for review April 10, 2026 09:59

ens-ftricomi requested review from JackCurragh and ens-ftricomi April 23, 2026 09:43

ens-ftricomi reviewed Apr 23, 2026

EreboPSilva wants to merge 3 commits into main from dev/download-data-file-dataflow

2 participants
