Dev/download data file dataflow#544
Open
EreboPSilva wants to merge 3 commits into
With this change the module no longer deletes everything and re-downloads it. Instead, for a file that already exists, it checks the md5sum and, if it matches the one in the CSV file, skips the download. If the file is not in the CSV or the md5sum does not match (indicating a corrupted file), it proceeds as usual (delete and re-download).
For debugging and clarity purposes
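The check described above can be sketched as follows. This is a minimal Python illustration, not the module's actual code: the function names `md5_of` and `needs_download` are hypothetical, and the expected checksum is assumed to have been read from the CSV beforehand.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_md5):
    """Decide whether a file must be (re-)downloaded.

    expected_md5 is the checksum listed in the CSV for this file,
    or None/"" when the file is not listed (assumption: the caller
    has already parsed the CSV).
    """
    path = Path(path)
    if not path.is_file():
        return True   # nothing on disk yet
    if not expected_md5:
        return True   # no checksum to verify against: proceed as usual
    # A mismatch is treated as a corrupted file: delete and re-download.
    return md5_of(path) != expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat for large sequencing files; only a file that exists and matches its listed checksum is skipped.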
pre_cleanup now only deletes files with an md5 mismatch or no md5 defined, leaving verified files for fetch_input to handle. fetch_input exits early when a file is present and md5-verified, querying the read_length_table to decide whether to dataflow to HiveCalculateReadLength or skip it. This avoids duplicate job INSERT errors on re-runs and after manual file placement. These changes were needed because I had misunderstood the behaviour of pre_cleanup, and to cater for the niche but possible scenario of manually added files whose read length calculation might have been skipped in the previous version.
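The split of responsibilities between pre_cleanup and fetch_input can be summarised as a small decision table. The sketch below is illustrative Python, not the module's code: the `Action` enum and both function names are hypothetical, and the three booleans are assumed to have been computed already (with `md5_ok` false when no md5 is defined for the file).

```python
from enum import Enum


class Action(Enum):
    DOWNLOAD = "download"                        # fetch the file as usual
    FLOW_TO_READ_LENGTH = "flow_to_read_length"  # dataflow to HiveCalculateReadLength
    SKIP = "skip"                                # nothing left to do


def pre_cleanup_should_delete(file_exists, md5_ok):
    """pre_cleanup removes only files that cannot be verified."""
    return file_exists and not md5_ok


def fetch_input_action(file_exists, md5_ok, has_read_length):
    """Early-exit logic for a file that is present and verified.

    has_read_length: whether the read_length_table already holds
    a value for this file (assumed to be queried by the caller).
    """
    if not (file_exists and md5_ok):
        return Action.DOWNLOAD
    if has_read_length:
        # Flowing again would try to create a job that already exists.
        return Action.SKIP
    # Covers manually placed files with no read length recorded yet.
    return Action.FLOW_TO_READ_LENGTH
```

For example, a verified file dropped into the folder by hand gives `fetch_input_action(True, True, False)`, which returns `Action.FLOW_TO_READ_LENGTH`, while the same file on a re-run returns `Action.SKIP`.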
Contributor
ens-ftricomi
left a comment
Is it ready for review?
Requirements
When creating your Pull request, please fill out the template below:
PR details
Is this a fix/ update/ new feature?
Fix/Update.
Include a short description
To prevent HiveDownloadData from overwriting data and wasting time and resources, it has been slightly modified to skip files that exist in the destination folder and have a valid md5sum hash, and to handle the dataflow into HiveCalculateReadLength based on whether a read length value already exists for the file. To accomplish this, a minor change to the config module was needed (just the addition of a static param).
Commits are messy, as I tried two different approaches before realising how the pre_cleanup subroutines actually work.
Include links to JIRA tickets
https://embl.atlassian.net/browse/ENSGENEBUI-3582
Testing
Have you tested it?
Keeping this as a draft while I test it, so that others can test it as well (@JackCurragh).
Assign to the weekly GitHub reviewer
If you are a member of Ensembl, please check the Genebuild weekly Rotas and assign this week's GitHub reviewer to the PR