
feat: Add job for metadata extraction and generate BagIt packages#1366

Merged
NishaSharma14 merged 3 commits into development from issue-#1363 on Feb 23, 2026
Conversation

@NishaSharma14
Contributor

No description provided.

@codecov

codecov bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 0% with 225 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.15%. Comparing base (fd35ae4) to head (354f1a9).

Files with missing lines Patch % Lines
...bs/ProcessMetadataExtractionBagitGenerationJob.php 0.00% 225 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                @@
##             development    #1366      +/-   ##
=================================================
- Coverage          72.92%   71.15%   -1.78%     
- Complexity          2358     2412      +54     
=================================================
  Files                209      210       +1     
  Lines               9094     9319     +225     
=================================================
- Hits                6632     6631       -1     
- Misses              2462     2688     +226     
Flag Coverage Δ
unittests 71.15% <0.00%> (-1.78%) ⬇️

Flags with carried forward coverage won't be shown.


Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a new job and console command system for extracting metadata from study files and generating BagIt-compliant packages. The implementation introduces a queued job that downloads study ZIP files, calls external APIs (NMRKit and bio-schema), extracts metadata and images, and creates BagIt manifest files. It also removes an older ExtractSpectra command that is being replaced by this new workflow.

Changes:

  • Adds metadata extraction and BagIt generation job infrastructure with configurable retry logic and timeout settings
  • Introduces database columns to track processing status and logs for the metadata extraction workflow
  • Adds whikloj/bagittools dependency for BagIt package creation and related PEAR dependencies

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 20 comments.

Show a summary per file
File Description
database/migrations/2026_02_19_120638_add_metadata_extraction_bagit_generation_columns_to_studies_table.php Adds status and logs columns to studies table for tracking metadata extraction progress
config/nmrxiv.php Adds configuration section for spectra parsing with API endpoints, storage settings, and timeout values
composer.json Adds whikloj/bagittools dependency
composer.lock Updates lock file with new BagIt-related dependencies (pear/archive_tar, pear/pear-core-minimal, whikloj/bagittools)
app/Jobs/ProcessMetadataExtractionBagitGenerationJob.php Implements queued job for downloading study data, calling APIs, extracting images, and generating BagIt manifests
app/Console/Commands/QueueMetadataExtractionBagitGenerationJobs.php Adds console command to queue metadata extraction jobs for eligible studies with filtering options
app/Console/Commands/ExtractSpectra.php Removes legacy command that is being replaced by the new job-based workflow
.gitignore Excludes the spectra_parse storage directory
.env.example Adds example environment variables for spectra parsing configuration

Comment on lines +246 to +247
Log::warning("Download failed: {$e->getMessage()}. Retrying...");
sleep(2);

Copilot AI Feb 19, 2026


The job uses hard-coded sleep(2) for retry delays. In a production environment with many concurrent jobs, this fixed delay could lead to inefficient retry patterns. Consider implementing exponential backoff for retries or making the retry delay configurable through the config file.

Suggested change
-Log::warning("Download failed: {$e->getMessage()}. Retrying...");
-sleep(2);
+$baseDelaySeconds = (int) config('nmrxiv.spectra_parsing.retry_delay_seconds', 2);
+$backoffFactor = (int) config('nmrxiv.spectra_parsing.retry_backoff_factor', 2);
+$delaySeconds = $baseDelaySeconds * ($backoffFactor ** ($attempt - 1));
+Log::warning("Download failed: {$e->getMessage()}. Retrying in {$delaySeconds} seconds...");
+sleep($delaySeconds);

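For reference, the backoff arithmetic in the suggestion can be isolated into a small helper. This is a sketch, not code from the PR: the defaults mirror the config values proposed above, and the upper cap (`$maxDelaySeconds`) is an added assumption so delays stay bounded across attempts.

```php
<?php
// Compute the delay (in seconds) before a given 1-based retry attempt,
// using exponential backoff with an upper cap. $baseDelaySeconds and
// $backoffFactor mirror the suggested config keys; the cap is an
// illustrative addition.
function retryDelaySeconds(
    int $attempt,
    int $baseDelaySeconds = 2,
    int $backoffFactor = 2,
    int $maxDelaySeconds = 60
): int {
    $delay = $baseDelaySeconds * ($backoffFactor ** ($attempt - 1));

    return min($delay, $maxDelaySeconds);
}
```

The retry loop would then call `sleep(retryDelaySeconds($attempt))` instead of a fixed `sleep(2)`.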
Comment on lines +386 to +391
// Update bag with checksums
$bag->update();

// Package the bag (this generates manifests)
$bag->package($bagPath);


Copilot AI Feb 19, 2026


The Bag::create() and Bag::package() methods are called with the same $bagPath parameter, but according to BagItTools documentation, package() should receive an output path for the packaged bag, not the source path. This could cause the method to overwrite the original directory structure. Verify the correct usage of the BagItTools library API.

Suggested change
-// Update bag with checksums
-$bag->update();
-
-// Package the bag (this generates manifests)
-$bag->package($bagPath);
+// Update bag with checksums and manifests
+$bag->update();
use App\Models\Study;
use Illuminate\Console\Command;

class QueueMetadataExtractionBagitGenerationJobs extends Command

Copilot AI Feb 19, 2026


The console command is missing comprehensive test coverage. Following the codebase convention where similar commands like VerifyFileIntegrityCommand have corresponding tests, this command should have tests covering the various options (--limit, --ids, --fresh, --retry-failed) and ensure jobs are dispatched correctly.

Comment on lines +110 to +217
protected function processStudy(Study $study): array
{
    // Remove NMRXIV: prefix if present (e.g., "NMRXIV:S1295" -> "S1295")
    $studyIdentifier = str_replace('NMRXIV:', '', $study->identifier);
    $disk = Storage::disk(config('nmrxiv.spectra_parsing.storage_disk', 'local'));
    $basePath = config('nmrxiv.spectra_parsing.storage_path', 'spectra_parse');
    $baseDir = "{$basePath}/{$studyIdentifier}";
    $dataDir = "{$baseDir}/data";
    $zipPath = null;

    try {
        // Step 1: Download ZIP file
        Log::info("Step 1/7: Downloading ZIP file for study {$study->id}");
        $zipPath = $this->downloadWithRetry($study->download_url, $this->retries);

        // Step 2: Extract ZIP to data directory
        Log::info('Step 2/7: Extracting ZIP archive...');
        $studyName = $this->extractZip($zipPath, $disk->path($dataDir));

        // Step 3: Call NMRKit API
        Log::info('Step 3/7: Calling NMRKit API...');
        $jsonData = $this->callNMRKitAPI($study->download_url, $this->retries);

        // Step 4: Fetch Bio-Schema
        Log::info('Step 4/7: Fetching bio-schema...');
        $bioSchema = null;
        try {
            $bioSchema = $this->fetchBioSchema($studyIdentifier, $this->retries);
        } catch (\Exception $e) {
            Log::warning("Bio-schema fetch failed: {$e->getMessage()}. Continuing without bio-schema...");
        }

        // Step 5: Create nmrxiv-meta structure
        Log::info('Step 5/7: Creating nmrxiv-meta structure...');
        $metaDir = "{$dataDir}/{$studyName}/nmrxiv-meta";
        $imagesDir = "{$metaDir}/images";

        if (! $disk->exists($metaDir)) {
            $disk->makeDirectory($metaDir, 0755, true);
        }

        // Clean up old images directory to prevent duplicates from previous runs
        if ($disk->exists($imagesDir)) {
            // Delete all PNG files in the images directory
            $oldImages = $disk->files($imagesDir);
            foreach ($oldImages as $oldImage) {
                $disk->delete($oldImage);
            }
            Log::info('Cleaned up '.count($oldImages).' old image files');
        } else {
            $disk->makeDirectory($imagesDir, 0755, true);
        }

        // Clean up spectra data
        if (isset($jsonData['data']['spectra']) && is_array($jsonData['data']['spectra'])) {
            foreach ($jsonData['data']['spectra'] as &$spectra) {
                unset($spectra['data']);
                unset($spectra['meta']);
                unset($spectra['originalData']);
                unset($spectra['originalInfo']);
            }
            unset($spectra);
        }

        // Extract and save images as PNG files
        $imageCount = 0;
        if (isset($jsonData['images']) && is_array($jsonData['images'])) {
            foreach ($jsonData['images'] as $imageData) {
                if (isset($imageData['id']) && isset($imageData['image'])) {
                    $imageId = $imageData['id'];
                    $base64Data = $imageData['image'];

                    // Save PNG file
                    $pngPath = "{$imagesDir}/{$imageId}.png";
                    $this->savePNGFromBase64($base64Data, $disk->path($pngPath));
                    $imageCount++;
                }
            }
        }

        // Save S{identifier}.nmrium (full API response with base64 images intact)
        $nmriumPath = "{$metaDir}/{$studyIdentifier}.nmrium";
        $formattedJson = json_encode($jsonData, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
        $disk->put($nmriumPath, $formattedJson);

        // Save bio-schema.json
        if ($bioSchema !== null) {
            $bioSchemaPath = "{$metaDir}/bio-schema.json";
            $bioSchemaJson = json_encode($bioSchema, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
            $disk->put($bioSchemaPath, $bioSchemaJson);
        }

        // Step 6: Generate BagIt manifests
        Log::info('Step 6/7: Generating BagIt manifests...');
        $this->generateBagItManifests($disk->path($baseDir));

        return [
            'imageCount' => $imageCount,
            'location' => $disk->path($baseDir),
        ];
    } finally {
        // Step 7: Cleanup temporary files (always runs, even on exception)
        if ($zipPath && file_exists($zipPath)) {
            Log::info('Step 7/7: Cleaning up temporary ZIP file...');
            @unlink($zipPath);
        }
    }
}

Copilot AI Feb 19, 2026


Potential disk space exhaustion risk: The job downloads ZIP files, extracts them, generates images, and creates BagIt packages without any checks on available disk space. For large studies or when processing many studies concurrently, this could fill up the disk. Consider implementing disk space checks before processing or adding cleanup of old processed studies.

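A preflight guard of the kind this comment asks for could be as simple as checking `disk_free_space()` before Step 1. The sketch below is illustrative only; the function name and the idea of deriving `$requiredBytes` from the ZIP size are assumptions, not part of the PR.

```php
<?php
// Sketch of a preflight disk-space guard: refuse to start processing a
// study unless the target filesystem has at least $requiredBytes free.
// disk_free_space() returns false on failure; treat that as "not enough"
// so the job fails safe rather than filling the disk.
function hasEnoughDiskSpace(string $path, float $requiredBytes): bool
{
    $free = disk_free_space($path);

    if ($free === false) {
        return false;
    }

    return $free >= $requiredBytes;
}
```

The job could call this with, say, a multiple of the download's Content-Length (ZIP plus extracted data plus images) and mark the study failed early instead of exhausting the disk mid-run.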
]);
}

// Don't rethrow - let the job complete so it doesn't retry infinitely

Copilot AI Feb 19, 2026


The job attempts to catch and handle exceptions but doesn't rethrow them, which prevents Laravel's queue retry mechanism from working. When an exception is caught and the job completes normally (line 104), failed jobs won't be retried even though $tries is set to 3. Consider either allowing the exception to propagate for retriable errors, or implementing manual retry logic within the job.

Suggested change
-// Don't rethrow - let the job complete so it doesn't retry infinitely
+throw $e;

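One way to reconcile "don't retry forever" with Laravel's `$tries` mechanism is to rethrow only errors that are plausibly transient, and swallow the rest. The classifier below is a sketch under that assumption; the marker list is illustrative and not the PR's logic.

```php
<?php
// Decide whether a caught exception is worth rethrowing so the queue's
// retry mechanism ($tries) can take over. Transient network/HTTP
// failures are retried; permanent errors (bad ZIP, missing study name)
// are not. The keyword list here is purely illustrative.
function isRetriableFailure(\Throwable $e): bool
{
    $transientMarkers = ['timeout', 'timed out', 'connection refused', '502', '503'];
    $message = strtolower($e->getMessage());

    foreach ($transientMarkers as $marker) {
        if (str_contains($message, $marker)) {
            return true;
        }
    }

    return false;
}
```

Inside the job's catch block, `if (isRetriableFailure($e)) { throw $e; }` would let Laravel retry transient failures while still marking permanent ones as failed exactly once.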
/**
* Manually generate BagIt manifests.
*/
protected function generateBagItManually(string $bagPath): void

Copilot AI Feb 19, 2026


Missing explicit return type declaration. According to the PHP guidelines in this codebase, all methods should have explicit return type declarations. Add ': void' to the method signature.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +258 to +280
protected function extractZip(string $zipPath, string $extractTo): string
{
    $zip = new ZipArchive;

    if ($zip->open($zipPath) !== true) {
        throw new \Exception("Failed to open ZIP file: {$zipPath}");
    }

    // Get the root folder name from first entry
    $studyName = null;
    if ($zip->numFiles > 0) {
        $firstEntry = $zip->getNameIndex(0);
        $parts = explode('/', $firstEntry);
        $studyName = $parts[0];
    }

    if (! $studyName) {
        throw new \Exception('Could not determine study name from ZIP');
    }

    // Extract all files
    $zip->extractTo($extractTo);
    $zip->close();

Copilot AI Feb 19, 2026


The ZIP extraction does not validate the extracted file paths, which could be vulnerable to path traversal attacks (Zip Slip vulnerability). A malicious ZIP file could contain entries with paths like "../../../etc/passwd" that escape the intended extraction directory. Validate that all extracted file paths remain within the intended directory before extraction.

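The usual guard against Zip Slip is to normalise each entry name and confirm it cannot climb above the extraction root before calling `extractTo()`. A standalone sketch of such a check (not code from the PR):

```php
<?php
// Returns true only when a ZIP entry name, joined to the extraction
// root, cannot escape that root (the "Zip Slip" check). It works on the
// raw entry string, so it can run before anything touches the disk.
function isSafeZipEntry(string $entryName): bool
{
    // Reject absolute paths and Windows drive-letter entries outright.
    if (str_starts_with($entryName, '/') || preg_match('/^[A-Za-z]:/', $entryName)) {
        return false;
    }

    // Normalise segment by segment, resolving "." and "..".
    $parts = [];
    foreach (explode('/', str_replace('\\', '/', $entryName)) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;
        }
        if ($segment === '..') {
            if (array_pop($parts) === null) {
                return false; // tried to climb above the extraction root
            }
            continue;
        }
        $parts[] = $segment;
    }

    return true;
}
```

Before `$zip->extractTo($extractTo)`, the method could loop `for ($i = 0; $i < $zip->numFiles; $i++)` over `$zip->getNameIndex($i)` and abort if any entry fails this check.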
Comment on lines +222 to +234
protected function downloadWithRetry(string $url, int $retries): string
{
    $attempt = 0;
    $lastException = null;

    while ($attempt < $retries) {
        try {
            $attempt++;
            Log::debug("Download attempt {$attempt}/{$retries}...");

            $tempPath = storage_path('app/temp_'.uniqid().'.zip');
            $timeout = config('nmrxiv.spectra_parsing.download_timeout', 300);
            $response = Http::timeout($timeout)->get($url);

Copilot AI Feb 19, 2026


The download URL from the Study model is used directly in HTTP requests without validation. If a malicious actor can control the download_url field, they could potentially cause SSRF (Server-Side Request Forgery) attacks by pointing to internal network resources. Consider validating that the URL points to expected domains or implementing allowlist-based URL validation.

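Allowlist-based URL validation of the kind suggested here can be done with `parse_url()` before the HTTP call. A sketch follows; the host list is a placeholder, not the project's actual configuration.

```php
<?php
// Allowlist-based URL validation sketch: only https URLs whose host is
// explicitly permitted may be fetched, which blocks SSRF attempts such
// as http://169.254.169.254/ or internal hostnames. The allowed-host
// list is a placeholder for illustration.
function isAllowedDownloadUrl(string $url, array $allowedHosts): bool
{
    $parts = parse_url($url);

    if ($parts === false || ($parts['scheme'] ?? '') !== 'https') {
        return false;
    }

    $host = strtolower($parts['host'] ?? '');

    return in_array($host, $allowedHosts, true);
}
```

`downloadWithRetry()` could call this first and throw before the loop, with the allowed hosts read from config rather than hard-coded.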
Comment on lines +70 to +74

// Reset status to pending
Study::where('metadata_bagit_generation_status', 'failed')
    ->whereIn('id', $failedStudies->pluck('id'))
    ->update(['metadata_bagit_generation_status' => 'pending']);

Copilot AI Feb 19, 2026


When using the --retry-failed option, the status is reset to 'pending' before dispatching jobs. However, if job dispatching fails for any reason after the status update, those studies will be marked as 'pending' but won't actually have jobs in the queue. Consider updating the status to 'pending' only after successful job dispatch, or use a transaction to ensure atomicity.

Suggested change
-// Reset status to pending
-Study::where('metadata_bagit_generation_status', 'failed')
-    ->whereIn('id', $failedStudies->pluck('id'))
-    ->update(['metadata_bagit_generation_status' => 'pending']);

@NishaSharma14 NishaSharma14 merged commit 90e86be into development Feb 23, 2026
3 of 5 checks passed
@NishaSharma14 NishaSharma14 deleted the issue-#1363 branch February 23, 2026 14:21


Linked issue: Metadata extraction and BagIt Generation

3 participants