feat: Add job for metadata extraction and generate BagIt packages #1366
NishaSharma14 merged 3 commits into development from
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```
@@             Coverage Diff             @@
##        development    #1366     +/-  ##
=================================================
- Coverage       72.92%   71.15%   -1.78%
- Complexity       2358     2412      +54
=================================================
  Files             209      210       +1
  Lines            9094     9319     +225
=================================================
- Hits             6632     6631       -1
- Misses           2462     2688     +226
```
Pull request overview
This PR adds a new job and console command system for extracting metadata from study files and generating BagIt-compliant packages. The implementation introduces a queued job that downloads study ZIP files, calls external APIs (NMRKit and bioschema), extracts metadata and images, and creates BagIt manifest files. It also removes an older ExtractSpectra command that is being replaced by this new workflow.
Changes:
- Adds metadata extraction and BagIt generation job infrastructure with configurable retry logic and timeout settings
- Introduces database columns to track processing status and logs for the metadata extraction workflow
- Adds whikloj/bagittools dependency for BagIt package creation and related PEAR dependencies
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| database/migrations/2026_02_19_120638_add_metadata_extraction_bagit_generation_columns_to_studies_table.php | Adds status and logs columns to studies table for tracking metadata extraction progress |
| config/nmrxiv.php | Adds configuration section for spectra parsing with API endpoints, storage settings, and timeout values |
| composer.json | Adds whikloj/bagittools dependency |
| composer.lock | Updates lock file with new BagIt-related dependencies (pear/archive_tar, pear/pear-core-minimal, whikloj/bagittools) |
| app/Jobs/ProcessMetadataExtractionBagitGenerationJob.php | Implements queued job for downloading study data, calling APIs, extracting images, and generating BagIt manifests |
| app/Console/Commands/QueueMetadataExtractionBagitGenerationJobs.php | Adds console command to queue metadata extraction jobs for eligible studies with filtering options |
| app/Console/Commands/ExtractSpectra.php | Removes legacy command that is being replaced by the new job-based workflow |
| .gitignore | Excludes the spectra_parse storage directory |
| .env.example | Adds example environment variables for spectra parsing configuration |
```php
Log::warning("Download failed: {$e->getMessage()}. Retrying...");
sleep(2);
```
The job uses hard-coded sleep(2) for retry delays. In a production environment with many concurrent jobs, this fixed delay could lead to inefficient retry patterns. Consider implementing exponential backoff for retries or making the retry delay configurable through the config file.
Suggested change:

```php
$baseDelaySeconds = (int) config('nmrxiv.spectra_parsing.retry_delay_seconds', 2);
$backoffFactor = (int) config('nmrxiv.spectra_parsing.retry_backoff_factor', 2);
$delaySeconds = $baseDelaySeconds * ($backoffFactor ** ($attempt - 1));
Log::warning("Download failed: {$e->getMessage()}. Retrying in {$delaySeconds} seconds...");
sleep($delaySeconds);
```
```php
// Update bag with checksums
$bag->update();

// Package the bag (this generates manifests)
$bag->package($bagPath);
```
The Bag::create() and Bag::package() methods are called with the same $bagPath parameter, but according to BagItTools documentation, package() should receive an output path for the packaged bag, not the source path. This could cause the method to overwrite the original directory structure. Verify the correct usage of the BagItTools library API.
Suggested change:

```php
// Update bag with checksums and manifests
$bag->update();
```
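For orientation, a minimal sketch of keeping the bag directory and the packaged output separate, under the assumption that whikloj/bagittools exposes `Bag::create()`, `update()`, and `package()` as used in this PR (exact semantics should be checked against the library's documentation):

```php
use whikloj\BagItTools\Bag;

// Create a bag in place from the payload directory; the payload is
// moved under data/ and bagit.txt is written.
$bag = Bag::create($bagPath);

// Recalculate checksums and write manifest/tagmanifest files.
$bag->update();

// Package to a separate archive path (the extension selects the
// format), instead of reusing the source directory path.
$bag->package($bagPath.'.zip');
```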
(resolved) ...ions/2026_02_19_120638_add_metadata_extraction_bagit_generation_columns_to_studies_table.php
```php
use App\Models\Study;
use Illuminate\Console\Command;

class QueueMetadataExtractionBagitGenerationJobs extends Command
```
The console command is missing comprehensive test coverage. Following the codebase convention where similar commands like VerifyFileIntegrityCommand have corresponding tests, this command should have tests covering the various options (--limit, --ids, --fresh, --retry-failed) and ensure jobs are dispatched correctly.
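As a sketch only, a feature test along these lines (Laravel's `Queue::fake()` with PHPUnit; the artisan command name and factory state below are assumptions, since the command signature is not shown in this excerpt) could cover the dispatch behavior:

```php
use App\Jobs\ProcessMetadataExtractionBagitGenerationJob;
use App\Models\Study;
use Illuminate\Support\Facades\Queue;
use Tests\TestCase;

class QueueMetadataExtractionBagitGenerationJobsTest extends TestCase
{
    public function test_limit_option_caps_dispatched_jobs(): void
    {
        Queue::fake();

        // Hypothetical factory state; adjust to the real Study factory.
        Study::factory()->count(5)->create([
            'metadata_bagit_generation_status' => 'pending',
        ]);

        // 'studies:extract-metadata' is a placeholder command name.
        $this->artisan('studies:extract-metadata', ['--limit' => 2])
            ->assertSuccessful();

        Queue::assertPushed(ProcessMetadataExtractionBagitGenerationJob::class, 2);
    }
}
```

Similar cases would exercise `--ids`, `--fresh`, and `--retry-failed`.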
```php
protected function processStudy(Study $study): array
{
    // Remove NMRXIV: prefix if present (e.g., "NMRXIV:S1295" -> "S1295")
    $studyIdentifier = str_replace('NMRXIV:', '', $study->identifier);
    $disk = Storage::disk(config('nmrxiv.spectra_parsing.storage_disk', 'local'));
    $basePath = config('nmrxiv.spectra_parsing.storage_path', 'spectra_parse');
    $baseDir = "{$basePath}/{$studyIdentifier}";
    $dataDir = "{$baseDir}/data";
    $zipPath = null;

    try {
        // Step 1: Download ZIP file
        Log::info("Step 1/7: Downloading ZIP file for study {$study->id}");
        $zipPath = $this->downloadWithRetry($study->download_url, $this->retries);

        // Step 2: Extract ZIP to data directory
        Log::info('Step 2/7: Extracting ZIP archive...');
        $studyName = $this->extractZip($zipPath, $disk->path($dataDir));

        // Step 3: Call NMRKit API
        Log::info('Step 3/7: Calling NMRKit API...');
        $jsonData = $this->callNMRKitAPI($study->download_url, $this->retries);

        // Step 4: Fetch Bio-Schema
        Log::info('Step 4/7: Fetching bio-schema...');
        $bioSchema = null;
        try {
            $bioSchema = $this->fetchBioSchema($studyIdentifier, $this->retries);
        } catch (\Exception $e) {
            Log::warning("Bio-schema fetch failed: {$e->getMessage()}. Continuing without bio-schema...");
        }

        // Step 5: Create nmrxiv-meta structure
        Log::info('Step 5/7: Creating nmrxiv-meta structure...');
        $metaDir = "{$dataDir}/{$studyName}/nmrxiv-meta";
        $imagesDir = "{$metaDir}/images";

        if (! $disk->exists($metaDir)) {
            $disk->makeDirectory($metaDir, 0755, true);
        }

        // Clean up old images directory to prevent duplicates from previous runs
        if ($disk->exists($imagesDir)) {
            // Delete all PNG files in the images directory
            $oldImages = $disk->files($imagesDir);
            foreach ($oldImages as $oldImage) {
                $disk->delete($oldImage);
            }
            Log::info('Cleaned up '.count($oldImages).' old image files');
        } else {
            $disk->makeDirectory($imagesDir, 0755, true);
        }

        // Clean up spectra data
        if (isset($jsonData['data']['spectra']) && is_array($jsonData['data']['spectra'])) {
            foreach ($jsonData['data']['spectra'] as &$spectra) {
                unset($spectra['data']);
                unset($spectra['meta']);
                unset($spectra['originalData']);
                unset($spectra['originalInfo']);
            }
            unset($spectra);
        }

        // Extract and save images as PNG files
        $imageCount = 0;
        if (isset($jsonData['images']) && is_array($jsonData['images'])) {
            foreach ($jsonData['images'] as $imageData) {
                if (isset($imageData['id']) && isset($imageData['image'])) {
                    $imageId = $imageData['id'];
                    $base64Data = $imageData['image'];

                    // Save PNG file
                    $pngPath = "{$imagesDir}/{$imageId}.png";
                    $this->savePNGFromBase64($base64Data, $disk->path($pngPath));
                    $imageCount++;
                }
            }
        }

        // Save S{identifier}.nmrium (full API response with base64 images intact)
        $nmriumPath = "{$metaDir}/{$studyIdentifier}.nmrium";
        $formattedJson = json_encode($jsonData, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
        $disk->put($nmriumPath, $formattedJson);

        // Save bio-schema.json
        if ($bioSchema !== null) {
            $bioSchemaPath = "{$metaDir}/bio-schema.json";
            $bioSchemaJson = json_encode($bioSchema, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
            $disk->put($bioSchemaPath, $bioSchemaJson);
        }

        // Step 6: Generate BagIt manifests
        Log::info('Step 6/7: Generating BagIt manifests...');
        $this->generateBagItManifests($disk->path($baseDir));

        return [
            'imageCount' => $imageCount,
            'location' => $disk->path($baseDir),
        ];
    } finally {
        // Step 7: Cleanup temporary files (always runs, even on exception)
        if ($zipPath && file_exists($zipPath)) {
            Log::info('Step 7/7: Cleaning up temporary ZIP file...');
            @unlink($zipPath);
        }
    }
}
```
Potential disk space exhaustion risk: The job downloads ZIP files, extracts them, generates images, and creates BagIt packages without any checks on available disk space. For large studies or when processing many studies concurrently, this could fill up the disk. Consider implementing disk space checks before processing or adding cleanup of old processed studies.
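One hedged mitigation, assuming the configured disk is a local filesystem: compare free space against a configurable floor before processing and release the job back onto the queue when space is short (the config key here is illustrative, not from the PR):

```php
// Illustrative guard; 'min_free_disk_bytes' is a hypothetical config key.
$requiredBytes = (int) config('nmrxiv.spectra_parsing.min_free_disk_bytes', 5 * 1024 ** 3);
$freeBytes = disk_free_space($disk->path(''));

if ($freeBytes !== false && $freeBytes < $requiredBytes) {
    Log::warning("Deferring study {$study->id}: {$freeBytes} bytes free, {$requiredBytes} required.");
    $this->release(300); // retry in 5 minutes, when space may have been freed

    return;
}
```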
```php
    ]);
}

// Don't rethrow - let the job complete so it doesn't retry infinitely
```
The job attempts to catch and handle exceptions but doesn't rethrow them, which prevents Laravel's queue retry mechanism from working. When an exception is caught and the job completes normally (line 104), failed jobs won't be retried even though $tries is set to 3. Consider either allowing the exception to propagate for retriable errors, or implementing manual retry logic within the job.
Suggested change:

```php
throw $e;
```
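A middle-ground sketch that keeps queue retries for transient failures while short-circuiting permanent ones (the exception class chosen and the logs column name are illustrative):

```php
try {
    $this->processStudy($study);
} catch (\Illuminate\Http\Client\ConnectionException $e) {
    // Transient network failure: rethrow so Laravel's queue retry
    // mechanism ($tries / backoff) takes over.
    throw $e;
} catch (\Throwable $e) {
    // Permanent failure: record it and stop retrying.
    $study->update([
        'metadata_bagit_generation_status' => 'failed',
        'metadata_bagit_generation_logs' => $e->getMessage(),
    ]);
    $this->fail($e);
}
```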
```php
/**
 * Manually generate BagIt manifests.
 */
protected function generateBagItManually(string $bagPath): void
```
Missing explicit return type declaration. According to the PHP guidelines in this codebase, all methods should have explicit return type declarations. Add ': void' to the method signature.
```php
protected function extractZip(string $zipPath, string $extractTo): string
{
    $zip = new ZipArchive;

    if ($zip->open($zipPath) !== true) {
        throw new \Exception("Failed to open ZIP file: {$zipPath}");
    }

    // Get the root folder name from first entry
    $studyName = null;
    if ($zip->numFiles > 0) {
        $firstEntry = $zip->getNameIndex(0);
        $parts = explode('/', $firstEntry);
        $studyName = $parts[0];
    }

    if (! $studyName) {
        throw new \Exception('Could not determine study name from ZIP');
    }

    // Extract all files
    $zip->extractTo($extractTo);
    $zip->close();
```
The ZIP extraction does not validate the extracted file paths, which could be vulnerable to path traversal attacks (Zip Slip vulnerability). A malicious ZIP file could contain entries with paths like "../../../etc/passwd" that escape the intended extraction directory. Validate that all extracted file paths remain within the intended directory before extraction.
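A defensive sketch using plain ZipArchive: scan entry names before extraction and reject anything that could escape the target directory (the check is deliberately blunt and also rejects legitimate names containing ".."):

```php
for ($i = 0; $i < $zip->numFiles; $i++) {
    $entry = $zip->getNameIndex($i);

    // Reject absolute paths and any ".." segment that could climb
    // out of the extraction directory (Zip Slip).
    if (str_starts_with($entry, '/') || str_contains($entry, '..')) {
        $zip->close();
        throw new \Exception("Unsafe ZIP entry path: {$entry}");
    }
}

$zip->extractTo($extractTo);
$zip->close();
```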
```php
protected function downloadWithRetry(string $url, int $retries): string
{
    $attempt = 0;
    $lastException = null;

    while ($attempt < $retries) {
        try {
            $attempt++;
            Log::debug("Download attempt {$attempt}/{$retries}...");

            $tempPath = storage_path('app/temp_'.uniqid().'.zip');
            $timeout = config('nmrxiv.spectra_parsing.download_timeout', 300);
            $response = Http::timeout($timeout)->get($url);
```
The download URL from the Study model is used directly in HTTP requests without validation. If a malicious actor can control the download_url field, they could potentially cause SSRF (Server-Side Request Forgery) attacks by pointing to internal network resources. Consider validating that the URL points to expected domains or implementing allowlist-based URL validation.
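A hedged allowlist sketch; the config key and the idea of enumerating permitted hosts are assumptions, not part of the PR:

```php
// 'allowed_download_hosts' is a hypothetical config key.
$allowedHosts = config('nmrxiv.spectra_parsing.allowed_download_hosts', []);
$host = parse_url($url, PHP_URL_HOST);

if (! is_string($host) || ! in_array($host, $allowedHosts, true)) {
    throw new \Exception('Download URL host is not in the allowlist.');
}
```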
```php
// Reset status to pending
Study::where('metadata_bagit_generation_status', 'failed')
    ->whereIn('id', $failedStudies->pluck('id'))
    ->update(['metadata_bagit_generation_status' => 'pending']);
```
When using the --retry-failed option, the status is reset to 'pending' before dispatching jobs. However, if job dispatching fails for any reason after the status update, those studies will be marked as 'pending' but won't actually have jobs in the queue. Consider updating the status to 'pending' only after successful job dispatch, or use a transaction to ensure atomicity.
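One way to sketch the safer ordering, marking each study pending only after its job is actually on the queue (the per-study loop is illustrative; the column name is from the diff):

```php
foreach ($failedStudies as $study) {
    ProcessMetadataExtractionBagitGenerationJob::dispatch($study);

    // Update status only once the job has been dispatched.
    $study->update(['metadata_bagit_generation_status' => 'pending']);
}
```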
resolve job failure issue
…nerationJobs command