feat: Add job for metadata extraction and generate BagIt packages #1366
NishaSharma14 merged 3 commits into development from
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```
@@             Coverage Diff             @@
##        development    #1366     +/-  ##
=================================================
- Coverage       72.92%   71.15%   -1.78%
- Complexity       2358     2412      +54
=================================================
  Files             209      210       +1
  Lines            9094     9319     +225
=================================================
- Hits             6632     6631       -1
- Misses           2462     2688     +226
```
Pull request overview
This PR adds a new job and console command system for extracting metadata from study files and generating BagIt-compliant packages. The implementation introduces a queued job that downloads study ZIP files, calls external APIs (NMRKit and bioschema), extracts metadata and images, and creates BagIt manifest files. It also removes an older ExtractSpectra command that is being replaced by this new workflow.
Changes:
- Adds metadata extraction and BagIt generation job infrastructure with configurable retry logic and timeout settings
- Introduces database columns to track processing status and logs for the metadata extraction workflow
- Adds whikloj/bagittools dependency for BagIt package creation and related PEAR dependencies
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| database/migrations/2026_02_19_120638_add_metadata_extraction_bagit_generation_columns_to_studies_table.php | Adds status and logs columns to studies table for tracking metadata extraction progress |
| config/nmrxiv.php | Adds configuration section for spectra parsing with API endpoints, storage settings, and timeout values |
| composer.json | Adds whikloj/bagittools dependency |
| composer.lock | Updates lock file with new BagIt-related dependencies (pear/archive_tar, pear/pear-core-minimal, whikloj/bagittools) |
| app/Jobs/ProcessMetadataExtractionBagitGenerationJob.php | Implements queued job for downloading study data, calling APIs, extracting images, and generating BagIt manifests |
| app/Console/Commands/QueueMetadataExtractionBagitGenerationJobs.php | Adds console command to queue metadata extraction jobs for eligible studies with filtering options |
| app/Console/Commands/ExtractSpectra.php | Removes legacy command that is being replaced by the new job-based workflow |
| .gitignore | Excludes the spectra_parse storage directory |
| .env.example | Adds example environment variables for spectra parsing configuration |
```php
Log::warning("Download failed: {$e->getMessage()}. Retrying...");
sleep(2);
```
The job uses hard-coded sleep(2) for retry delays. In a production environment with many concurrent jobs, this fixed delay could lead to inefficient retry patterns. Consider implementing exponential backoff for retries or making the retry delay configurable through the config file.
Suggested change:

```php
$baseDelaySeconds = (int) config('nmrxiv.spectra_parsing.retry_delay_seconds', 2);
$backoffFactor = (int) config('nmrxiv.spectra_parsing.retry_backoff_factor', 2);
$delaySeconds = $baseDelaySeconds * ($backoffFactor ** ($attempt - 1));
Log::warning("Download failed: {$e->getMessage()}. Retrying in {$delaySeconds} seconds...");
sleep($delaySeconds);
```
```php
// Update bag with checksums
$bag->update();

// Package the bag (this generates manifests)
$bag->package($bagPath);
```
The Bag::create() and Bag::package() methods are called with the same $bagPath parameter, but according to BagItTools documentation, package() should receive an output path for the packaged bag, not the source path. This could cause the method to overwrite the original directory structure. Verify the correct usage of the BagItTools library API.
Suggested change:

```php
// Update bag with checksums and manifests
$bag->update();
```
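For orientation, a minimal sketch of keeping the bag directory and the packaged output separate, under the assumption that whikloj/bagittools exposes `Bag::create()`, `update()`, and `package()` as used in this PR (exact semantics should be checked against the library's documentation):

```php
use whikloj\BagItTools\Bag;

// Create a bag in place from the payload directory; the payload is
// moved under data/ and bagit.txt is written.
$bag = Bag::create($bagPath);

// Recalculate checksums and write manifest/tagmanifest files.
$bag->update();

// Package to a separate archive path (the extension selects the
// format), instead of reusing the source directory path.
$bag->package($bagPath.'.zip');
```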
(resolved) ...ions/2026_02_19_120638_add_metadata_extraction_bagit_generation_columns_to_studies_table.php
```php
use App\Models\Study;
use Illuminate\Console\Command;

class QueueMetadataExtractionBagitGenerationJobs extends Command
```
The console command is missing comprehensive test coverage. Following the codebase convention where similar commands like VerifyFileIntegrityCommand have corresponding tests, this command should have tests covering the various options (--limit, --ids, --fresh, --retry-failed) and ensure jobs are dispatched correctly.
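As a sketch only, a feature test along these lines (Laravel's `Queue::fake()` with PHPUnit; the artisan command name and factory state below are assumptions, since the command signature is not shown in this excerpt) could cover the dispatch behavior:

```php
use App\Jobs\ProcessMetadataExtractionBagitGenerationJob;
use App\Models\Study;
use Illuminate\Support\Facades\Queue;
use Tests\TestCase;

class QueueMetadataExtractionBagitGenerationJobsTest extends TestCase
{
    public function test_limit_option_caps_dispatched_jobs(): void
    {
        Queue::fake();

        // Hypothetical factory state; adjust to the real Study factory.
        Study::factory()->count(5)->create([
            'metadata_bagit_generation_status' => 'pending',
        ]);

        // 'studies:extract-metadata' is a placeholder command name.
        $this->artisan('studies:extract-metadata', ['--limit' => 2])
            ->assertSuccessful();

        Queue::assertPushed(ProcessMetadataExtractionBagitGenerationJob::class, 2);
    }
}
```

Similar cases would exercise `--ids`, `--fresh`, and `--retry-failed`.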
```php
protected function processStudy(Study $study): array
{
    // Remove NMRXIV: prefix if present (e.g., "NMRXIV:S1295" -> "S1295")
    $studyIdentifier = str_replace('NMRXIV:', '', $study->identifier);
    $disk = Storage::disk(config('nmrxiv.spectra_parsing.storage_disk', 'local'));
    $basePath = config('nmrxiv.spectra_parsing.storage_path', 'spectra_parse');
    $baseDir = "{$basePath}/{$studyIdentifier}";
    $dataDir = "{$baseDir}/data";
    $zipPath = null;

    try {
        // Step 1: Download ZIP file
        Log::info("Step 1/7: Downloading ZIP file for study {$study->id}");
        $zipPath = $this->downloadWithRetry($study->download_url, $this->retries);

        // Step 2: Extract ZIP to data directory
        Log::info('Step 2/7: Extracting ZIP archive...');
        $studyName = $this->extractZip($zipPath, $disk->path($dataDir));

        // Step 3: Call NMRKit API
        Log::info('Step 3/7: Calling NMRKit API...');
        $jsonData = $this->callNMRKitAPI($study->download_url, $this->retries);

        // Step 4: Fetch Bio-Schema
        Log::info('Step 4/7: Fetching bio-schema...');
        $bioSchema = null;
        try {
            $bioSchema = $this->fetchBioSchema($studyIdentifier, $this->retries);
        } catch (\Exception $e) {
            Log::warning("Bio-schema fetch failed: {$e->getMessage()}. Continuing without bio-schema...");
        }

        // Step 5: Create nmrxiv-meta structure
        Log::info('Step 5/7: Creating nmrxiv-meta structure...');
        $metaDir = "{$dataDir}/{$studyName}/nmrxiv-meta";
        $imagesDir = "{$metaDir}/images";

        if (! $disk->exists($metaDir)) {
            $disk->makeDirectory($metaDir, 0755, true);
        }

        // Clean up old images directory to prevent duplicates from previous runs
        if ($disk->exists($imagesDir)) {
            // Delete all PNG files in the images directory
            $oldImages = $disk->files($imagesDir);
            foreach ($oldImages as $oldImage) {
                $disk->delete($oldImage);
            }
            Log::info('Cleaned up '.count($oldImages).' old image files');
        } else {
            $disk->makeDirectory($imagesDir, 0755, true);
        }

        // Clean up spectra data
        if (isset($jsonData['data']['spectra']) && is_array($jsonData['data']['spectra'])) {
            foreach ($jsonData['data']['spectra'] as &$spectra) {
                unset($spectra['data']);
                unset($spectra['meta']);
                unset($spectra['originalData']);
                unset($spectra['originalInfo']);
            }
            unset($spectra);
        }

        // Extract and save images as PNG files
        $imageCount = 0;
        if (isset($jsonData['images']) && is_array($jsonData['images'])) {
            foreach ($jsonData['images'] as $imageData) {
                if (isset($imageData['id']) && isset($imageData['image'])) {
                    $imageId = $imageData['id'];
                    $base64Data = $imageData['image'];

                    // Save PNG file
                    $pngPath = "{$imagesDir}/{$imageId}.png";
                    $this->savePNGFromBase64($base64Data, $disk->path($pngPath));
                    $imageCount++;
                }
            }
        }

        // Save S{identifier}.nmrium (full API response with base64 images intact)
        $nmriumPath = "{$metaDir}/{$studyIdentifier}.nmrium";
        $formattedJson = json_encode($jsonData, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
        $disk->put($nmriumPath, $formattedJson);

        // Save bio-schema.json
        if ($bioSchema !== null) {
            $bioSchemaPath = "{$metaDir}/bio-schema.json";
            $bioSchemaJson = json_encode($bioSchema, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
            $disk->put($bioSchemaPath, $bioSchemaJson);
        }

        // Step 6: Generate BagIt manifests
        Log::info('Step 6/7: Generating BagIt manifests...');
        $this->generateBagItManifests($disk->path($baseDir));

        return [
            'imageCount' => $imageCount,
            'location' => $disk->path($baseDir),
        ];
    } finally {
        // Step 7: Cleanup temporary files (always runs, even on exception)
        if ($zipPath && file_exists($zipPath)) {
            Log::info('Step 7/7: Cleaning up temporary ZIP file...');
            @unlink($zipPath);
        }
    }
}
```
Potential disk space exhaustion risk: The job downloads ZIP files, extracts them, generates images, and creates BagIt packages without any checks on available disk space. For large studies or when processing many studies concurrently, this could fill up the disk. Consider implementing disk space checks before processing or adding cleanup of old processed studies.
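One hedged mitigation, assuming the configured disk is a local filesystem: compare free space against a configurable floor before processing and release the job back onto the queue when space is short (the config key here is illustrative, not from the PR):

```php
// Illustrative guard; 'min_free_disk_bytes' is a hypothetical config key.
$requiredBytes = (int) config('nmrxiv.spectra_parsing.min_free_disk_bytes', 5 * 1024 ** 3);
$freeBytes = disk_free_space($disk->path(''));

if ($freeBytes !== false && $freeBytes < $requiredBytes) {
    Log::warning("Deferring study {$study->id}: {$freeBytes} bytes free, {$requiredBytes} required.");
    $this->release(300); // retry in 5 minutes, when space may have been freed

    return;
}
```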
```php
    ]);
}

// Don't rethrow - let the job complete so it doesn't retry infinitely
```
The job attempts to catch and handle exceptions but doesn't rethrow them, which prevents Laravel's queue retry mechanism from working. When an exception is caught and the job completes normally (line 104), failed jobs won't be retried even though $tries is set to 3. Consider either allowing the exception to propagate for retriable errors, or implementing manual retry logic within the job.
Suggested change:

```php
throw $e;
```
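A middle-ground sketch that keeps queue retries for transient failures while short-circuiting permanent ones (the exception class chosen and the logs column name are illustrative):

```php
try {
    $this->processStudy($study);
} catch (\Illuminate\Http\Client\ConnectionException $e) {
    // Transient network failure: rethrow so Laravel's queue retry
    // mechanism ($tries / backoff) takes over.
    throw $e;
} catch (\Throwable $e) {
    // Permanent failure: record it and stop retrying.
    $study->update([
        'metadata_bagit_generation_status' => 'failed',
        'metadata_bagit_generation_logs' => $e->getMessage(),
    ]);
    $this->fail($e);
}
```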
```php
/**
 * Manually generate BagIt manifests.
 */
protected function generateBagItManually(string $bagPath): void
```
Missing explicit return type declaration. According to the PHP guidelines in this codebase, all methods should have explicit return type declarations. Add ': void' to the method signature.
```php
protected function extractZip(string $zipPath, string $extractTo): string
{
    $zip = new ZipArchive;

    if ($zip->open($zipPath) !== true) {
        throw new \Exception("Failed to open ZIP file: {$zipPath}");
    }

    // Get the root folder name from first entry
    $studyName = null;
    if ($zip->numFiles > 0) {
        $firstEntry = $zip->getNameIndex(0);
        $parts = explode('/', $firstEntry);
        $studyName = $parts[0];
    }

    if (! $studyName) {
        throw new \Exception('Could not determine study name from ZIP');
    }

    // Extract all files
    $zip->extractTo($extractTo);
    $zip->close();
```
The ZIP extraction does not validate the extracted file paths, which could be vulnerable to path traversal attacks (Zip Slip vulnerability). A malicious ZIP file could contain entries with paths like "../../../etc/passwd" that escape the intended extraction directory. Validate that all extracted file paths remain within the intended directory before extraction.
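A defensive sketch using plain ZipArchive: scan entry names before extraction and reject anything that could escape the target directory (the check is deliberately blunt and also rejects legitimate names containing ".."):

```php
for ($i = 0; $i < $zip->numFiles; $i++) {
    $entry = $zip->getNameIndex($i);

    // Reject absolute paths and any ".." segment that could climb
    // out of the extraction directory (Zip Slip).
    if (str_starts_with($entry, '/') || str_contains($entry, '..')) {
        $zip->close();
        throw new \Exception("Unsafe ZIP entry path: {$entry}");
    }
}

$zip->extractTo($extractTo);
$zip->close();
```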
```php
protected function downloadWithRetry(string $url, int $retries): string
{
    $attempt = 0;
    $lastException = null;

    while ($attempt < $retries) {
        try {
            $attempt++;
            Log::debug("Download attempt {$attempt}/{$retries}...");

            $tempPath = storage_path('app/temp_'.uniqid().'.zip');
            $timeout = config('nmrxiv.spectra_parsing.download_timeout', 300);
            $response = Http::timeout($timeout)->get($url);
```
The download URL from the Study model is used directly in HTTP requests without validation. If a malicious actor can control the download_url field, they could potentially cause SSRF (Server-Side Request Forgery) attacks by pointing to internal network resources. Consider validating that the URL points to expected domains or implementing allowlist-based URL validation.
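A hedged allowlist sketch; the config key and the idea of enumerating permitted hosts are assumptions, not part of the PR:

```php
// 'allowed_download_hosts' is a hypothetical config key.
$allowedHosts = config('nmrxiv.spectra_parsing.allowed_download_hosts', []);
$host = parse_url($url, PHP_URL_HOST);

if (! is_string($host) || ! in_array($host, $allowedHosts, true)) {
    throw new \Exception('Download URL host is not in the allowlist.');
}
```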
```php
// Reset status to pending
Study::where('metadata_bagit_generation_status', 'failed')
    ->whereIn('id', $failedStudies->pluck('id'))
    ->update(['metadata_bagit_generation_status' => 'pending']);
```
When using the --retry-failed option, the status is reset to 'pending' before dispatching jobs. However, if job dispatching fails for any reason after the status update, those studies will be marked as 'pending' but won't actually have jobs in the queue. Consider updating the status to 'pending' only after successful job dispatch, or use a transaction to ensure atomicity.
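One way to sketch the safer ordering, marking each study pending only after its job is actually on the queue (the per-study loop is illustrative; the column name is from the diff):

```php
foreach ($failedStudies as $study) {
    ProcessMetadataExtractionBagitGenerationJob::dispatch($study);

    // Update status only once the job has been dispatched.
    $study->update(['metadata_bagit_generation_status' => 'pending']);
}
```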
resolve job failure issue
…nerationJobs command