Conversation
Vcf sanity check
There was a problem hiding this comment.
Pull Request Overview
This pull request implements a comprehensive update to the BIGapp project with new VCF file validation capabilities, Docker infrastructure improvements, and enhanced genotype matrix processing. The update introduces VCF sanity checking across all modules that process genomic data and modernizes the build/deployment workflow with multi-architecture Docker support.
Key changes:
- Added comprehensive VCF sanity checking functionality with exported functions for validation and user feedback
- Refactored Docker build system with dependency image separation and multi-architecture support via GitHub Actions
- Enhanced genotype matrix processing to use explicit Gmatrix parameters for improved polyploid support
- Updated app version display to use dynamic package versioning and added citation information across help files
Reviewed Changes
Copilot reviewed 36 out of 36 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| R/vcf_sanity_check.R | New comprehensive VCF validation functions with checks for headers, columns, format consistency, and data integrity |
| tests/testthat/test-vcf_sanity_check.R | Test coverage for VCF sanity checking functionality |
| R/utils.R | Added VCF compression detection utility and integrated sanity checks into read_geno_file function |
| R/GS_functions.R | Updated genotype matrix formatting and relationship matrix generation with explicit Gmatrix parameters |
| Dockerfile/Dockerfile.deps | Modernized Docker build with dependency separation and multi-architecture support |
| .github/workflows/dockerhub-on-version.yml | Automated Docker build and push workflow triggered by version changes |
| R/mod_*.R | Integration of VCF sanity checks across all genomic analysis modules |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| } | ||
| gzfile(vcf_path, open = "rt") | ||
| } else if(is_gz == "bzip2 (.bz2)") { | ||
| if (verbose) warning("File is compressed th bzip2 (.bz2), which is not supported.") |
There was a problem hiding this comment.
Typo in warning message: 'th' should be 'with'.
| if (verbose) warning("File is compressed th bzip2 (.bz2), which is not supported.") | |
| if (verbose) warning("File is compressed with bzip2 (.bz2), which is not supported.") |
| if(is.null(checks$checks)){ | ||
| shinyalert( | ||
| title = "File Error", | ||
| text = checks$message, |
There was a problem hiding this comment.
The function accesses checks$message in line 437, but this property doesn't exist in the vcf_sanity_check object structure (which has checks, messages, duplicates, ploidy_max). This should be checks$messages.
| text = checks$message, | |
| text = checks$messages, |
| if (!inherits(x, "vcf_sanity_check")) stop("Object must be of class 'vcf_sanity_check'.") | ||
|
|
||
| if(is.null(x$checks)) { | ||
| warning(paste("Checks could not be performed. Because:", x$message)) |
There was a problem hiding this comment.
Similar to the previous issue, x$message should be x$messages since the vcf_sanity_check object structure uses 'messages' not 'message'.
| warning(paste("Checks could not be performed. Because:", x$message)) | |
| warning(paste("Checks could not be performed. Because:", x$messages)) |
| warning_if_true = NULL) | ||
|
|
||
| print(checks) | ||
| print(checks_result) |
There was a problem hiding this comment.
Debug print statements should be removed from production code.
| print(checks_result) |
| warning_if_true = NULL) | ||
|
|
||
| print(checks) | ||
| print(checks_result) |
There was a problem hiding this comment.
Debug print statements should be removed from production code.
| print(checks_result) |
| read_geno_file <- function(file_path, requires = c("GT")){ | ||
| read_geno_file <- function(file_path, requires = c("GT"), ploidy, check=TRUE) { | ||
| if (grepl("\\.csv$", file_path)) { | ||
| geno <- read.csv(geno_path, header = TRUE, row.names = 1, check.names = FALSE) |
There was a problem hiding this comment.
Variable 'geno_path' is undefined. It should be 'file_path' based on the function parameter.
| geno <- read.csv(geno_path, header = TRUE, row.names = 1, check.names = FALSE) | |
| geno <- read.csv(file_path, header = TRUE, row.names = 1, check.names = FALSE) |
| # --- CHROM and POS column checks --- | ||
| chrom_pos <- do.call(rbind, chrom_pos) | ||
| checks["chrom_info"] <- all(chrom_pos[,1] != "." | chrom_pos[,1] != "" | !is.na(chrom_pos[,1])) | ||
| checks["pos_info"] <- all(chrom_pos[,2] != "." | chrom_pos[,2] != "" | !is.na(chrom_pos[,2])) |
There was a problem hiding this comment.
Logical error in condition: using OR (|) operators means this will always be TRUE. Should use AND (&) operators to check that values are NOT missing: all(chrom_pos[,1] != "." & chrom_pos[,1] != "" & !is.na(chrom_pos[,1]))
| checks["pos_info"] <- all(chrom_pos[,2] != "." | chrom_pos[,2] != "" | !is.na(chrom_pos[,2])) | |
| checks["chrom_info"] <- all(chrom_pos[,1] != "." & chrom_pos[,1] != "" & !is.na(chrom_pos[,1])) | |
| checks["pos_info"] <- all(chrom_pos[,2] != "." & chrom_pos[,2] != "" & !is.na(chrom_pos[,2])) |
| # --- CHROM and POS column checks --- | ||
| chrom_pos <- do.call(rbind, chrom_pos) | ||
| checks["chrom_info"] <- all(chrom_pos[,1] != "." | chrom_pos[,1] != "" | !is.na(chrom_pos[,1])) | ||
| checks["pos_info"] <- all(chrom_pos[,2] != "." | chrom_pos[,2] != "" | !is.na(chrom_pos[,2])) |
There was a problem hiding this comment.
Same logical error as above: using OR (|) operators means this will always be TRUE. Should use AND (&) operators: all(chrom_pos[,2] != "." & chrom_pos[,2] != "" & !is.na(chrom_pos[,2]))
| checks["pos_info"] <- all(chrom_pos[,2] != "." | chrom_pos[,2] != "" | !is.na(chrom_pos[,2])) | |
| checks["chrom_info"] <- all(chrom_pos[,1] != "." & chrom_pos[,1] != "" & !is.na(chrom_pos[,1])) | |
| checks["pos_info"] <- all(chrom_pos[,2] != "." & chrom_pos[,2] != "" & !is.na(chrom_pos[,2])) |
This pull request introduces a major update to the Docker build and deployment workflow for the BIGapp project, refactors the Dockerfile setup for improved efficiency and multi-architecture support, and adds new VCF sanity checking functionality to the R Shiny application. It also updates the app version and makes several improvements to the genotype matrix formatting and filtering logic.
Docker build and deployment improvements:
.github/workflows/dockerhub-on-version.yml) to automatically build and push multi-architecture Docker images (amd64 and arm64) to DockerHub when the package version changes, including manifest creation for multi-arch support.Dockerfileto use a dependency image (bigapp-deps), improving build caching and separating base dependencies from app-specific installation. Runtime improvements include health checks and proper user setup.Dockerfile.depsto build and cache R/CRAN/Bioc dependencies efficiently, including fallback logic for missing packages and version snapshotting for auditability.VCF sanity checking and error handling:
vcf_sanity_check,vcf_sanity_messages) and integrated VCF sanity checks into both the DosageCall and Filtering modules, providing user feedback and halting analysis if checks fail. [1] [2] [3]Genotype matrix and relationship matrix logic:
get_relationship_matto use theGmatrixfunction with explicit parameters (ploidy.correction,ratio,missingValue) for more robust polyploid support and consistency. [1] [2]format_geno_matrixto restrict rrBLUP codification to diploid and Gmatrix scenarios only.User interface and filtering enhancements:
utils::packageVersion.Version and dependency updates:
1.5.1inDESCRIPTION, and removed theBIGrremote from Imports (now handled via Docker dependencies). [1] [2]Let me know if you want to dive deeper into any of these areas or need help understanding how the new Docker workflow or VCF sanity checks work!