MOSAIQ schema-level harmonisation

This document describes the first MOSAIQ harmonisation layer for ISD and ARAUS. The scope is deliberately narrow: it prepares a common schema and conservative ISO 12913 semantic representation for later benchmark work.

What schema-level harmonisation means

Schema-level harmonisation means representing heterogeneous source datasets with a shared record structure. In this repository, that structure covers:

  • dataset identity
  • data source and citation
  • access and licence
  • study type
  • acoustic environment
  • people or participants
  • context
  • sound sources
  • audio modality
  • visual modality
  • perceptual annotation framework
  • ISO 12913 perceived affective quality fields
  • acoustic and psychoacoustic indicators
  • derived feature records
  • missingness
  • provenance
  • validation status

The shared schema is stored at:

shared_schemas/schema_level_harmonisation.schema.json

What this implementation does

This implementation adds a minimal schema-level layer:

  • a shared JSON schema for dataset-level and sample-level harmonisation records
  • controlled vocabularies for access, study, audio, visual, missingness, and provenance fields
  • explicit harmonisation-level status fields for structural, semantic, feature, statistical, and benchmark-split harmonisation
  • a checklist file for documenting what is aligned, partially aligned, documented only, or not performed: mappings/harmonisation_checklist.json
  • a lightweight graph model for explainable schema-level information fusion: mappings/mosaiq_harmonisation_graph.json
  • a multi-view alignment block for audio, visual, context, perception, and feature views
  • a harmonisation potential score that reports structural readiness without claiming statistical equivalence
  • canonical ISO 12913 PAQ item names: pleasant, vibrant, eventful, chaotic, annoying, monotonous, uneventful, and calm
  • separate derived ISO coordinate fields: pleasantness and eventfulness
  • mapping tables for ISD and ARAUS: mappings/isd_to_mosaiq_schema.json and mappings/araus_to_mosaiq_schema.json
  • two demonstration JSONL records: examples/harmonised_samples/isd_sample.jsonl and examples/harmonised_samples/araus_sample.jsonl
  • a lightweight validator: scripts/validate_schema_harmonisation.py
  • a small ISO helper module: scripts/iso12913.py

The example records use existing values from the current MOSAIQ CSV files where available. Fields that are unavailable in those rows are represented through explicit missingness records rather than fabricated values.

Each mapping record now also includes mapping_confidence, evidence_type, review_status, ambiguity_note, and source_column_examples. These fields make the harmonisation auditable: a direct ISO PAQ mapping can be marked as high-confidence, while modality or feature metadata can remain medium-confidence until source documentation and extraction provenance are reviewed.

What this implementation does not do

This layer does not create a fully harmonised benchmark dataset.

It also does not perform:

  • statistical harmonisation
  • domain adaptation
  • distribution matching
  • label rescaling
  • imputation of missing labels
  • train/validation/test split creation
  • cross-framework mapping to SAM, EmojiGrid, valence/arousal, annoyance, or other non-ISO frameworks

Existing split columns in the current ARAUS/ISD tables may still be documented as source metadata, but this task does not introduce new benchmark splits.

ISO 12913 semantic boundary

ISD and ARAUS both use ISO 12913 soundscape constructs, so the semantic layer is ISO-only. MOSAIQ canonical PAQ fields preserve the eight ISO perceived affective quality items and keep original field names in each mapping record.

Item-level ratings are stored under:

perception.iso_12913.paq

Derived Method A coordinates are stored separately under:

perception.iso_12913.derived_coordinates

This separation avoids mixing raw PAQ ratings with derived pleasantness and eventfulness coordinates.

MOSAIQ also separates value layers:

  • raw_response: participant-level item ratings, when represented
  • aggregated_sample_annotation: clip/sample-level PAQ summaries such as means
  • derived_coordinate: pleasantness/eventfulness coordinates, either computed with ISO Method A or copied from the source as already-derived values
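
The separation between item-level ratings and derived coordinates can be checked mechanically. The following is a minimal sketch, not the repository validator: it assumes a record is a nested dict that uses the two paths named above, and the function name check_iso_separation is hypothetical.

```python
CANONICAL_PAQ_ITEMS = {
    "pleasant", "vibrant", "eventful", "chaotic",
    "annoying", "monotonous", "uneventful", "calm",
}
DERIVED_COORDINATES = {"pleasantness", "eventfulness"}


def check_iso_separation(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the two layers are separate."""
    iso = record.get("perception", {}).get("iso_12913", {})
    paq = iso.get("paq", {})
    derived = iso.get("derived_coordinates", {})
    problems = []
    for name in sorted(set(paq) - CANONICAL_PAQ_ITEMS):
        problems.append(f"non-canonical PAQ field: {name}")
    for name in sorted(set(derived) - DERIVED_COORDINATES):
        problems.append(f"unexpected derived coordinate: {name}")
    return problems


# Illustrative record (values are made up): a derived coordinate
# has been placed among the raw PAQ items by mistake.
record = {
    "perception": {
        "iso_12913": {
            "paq": {"pleasant": 3.2, "calm": 3.9, "pleasantness": 0.4},
            "derived_coordinates": {"eventfulness": 0.1},
        }
    }
}
separation_problems = check_iso_separation(record)
print(separation_problems)  # ['non-canonical PAQ field: pleasantness']
```

Returning a list of problems rather than raising on the first one lets a caller report every misplaced field in a single pass.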

Structural, semantic, and statistical harmonisation

Structural harmonisation defines the shared record layout: where dataset, sample, modality, feature, missingness, provenance, and validation information should be stored.

Semantic harmonisation defines the meaning of fields. For this task, semantic harmonisation is limited to ISO 12913 concepts already used by ISD and ARAUS.

Statistical harmonisation would address differences in label distributions, collection settings, domains, participant populations, or sampling strategy. That work is intentionally out of scope here and should be documented as a future benchmark-construction step.

Multi-view alignment and graph representation

Inspired by the information-fusion and multimodal entity-alignment literature, the example records include a lightweight alignment block:

{
  "audio_view": "available",
  "visual_view": "available",
  "context_view": "available",
  "perception_view": "available",
  "feature_view": "partial",
  "alignment_status": "partially_schema_aligned",
  "unresolved_issues": ["sound source taxonomy not reported"]
}

This is a schema-level alignment statement only. It records whether each modality/view can be represented, not whether distributions or labels are statistically aligned.
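
As a rough illustration of how such a block might be consumed, the sketch below lists the views that are absent or not marked "available". It assumes the five view keys shown in the example; the helper name incomplete_views is hypothetical and is not part of the repository.

```python
VIEW_KEYS = (
    "audio_view", "visual_view", "context_view",
    "perception_view", "feature_view",
)


def incomplete_views(alignment: dict) -> list[str]:
    """List views that are missing from the block or not marked 'available'."""
    return [key for key in VIEW_KEYS if alignment.get(key) != "available"]


# The alignment block from the example above.
alignment = {
    "audio_view": "available",
    "visual_view": "available",
    "context_view": "available",
    "perception_view": "available",
    "feature_view": "partial",
    "alignment_status": "partially_schema_aligned",
    "unresolved_issues": ["sound source taxonomy not reported"],
}
needs_review = incomplete_views(alignment)
print(needs_review)  # ['feature_view']
```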

The knowledge_graph block records simple relationships such as:

  • Dataset has_sample Sample
  • Sample has_audio AudioAsset
  • Sample has_visual VisualAsset
  • Sample has_annotation PAQAnnotation
  • PAQAnnotation uses_framework_item ISO12913Item
  • Sample has_feature FeatureRecord

This graph-style representation supports traceability and explanation, but it does not implement knowledge-graph learning or automatic entity matching.
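
A node/edge consistency check over these relationships could look like the sketch below. The in-memory layout (a dict of node id to node type, and edges as triples) is an assumption made for illustration and may differ from the structure in mappings/mosaiq_harmonisation_graph.json; the function name and node ids are hypothetical.

```python
# Relationship triples listed above: (source type, relation, target type).
ALLOWED_RELATIONS = {
    ("Dataset", "has_sample", "Sample"),
    ("Sample", "has_audio", "AudioAsset"),
    ("Sample", "has_visual", "VisualAsset"),
    ("Sample", "has_annotation", "PAQAnnotation"),
    ("PAQAnnotation", "uses_framework_item", "ISO12913Item"),
    ("Sample", "has_feature", "FeatureRecord"),
}


def check_graph(nodes: dict[str, str], edges: list[tuple[str, str, str]]) -> list[str]:
    """nodes maps node id -> node type; edges are (source id, relation, target id)."""
    problems = []
    for source, relation, target in edges:
        if source not in nodes or target not in nodes:
            problems.append(f"dangling edge: {source} {relation} {target}")
            continue
        triple = (nodes[source], relation, nodes[target])
        if triple not in ALLOWED_RELATIONS:
            problems.append(f"unexpected relation: {' '.join(triple)}")
    return problems


# Hypothetical ids; "v1" is referenced by an edge but never declared as a node.
nodes = {"isd": "Dataset", "s1": "Sample", "a1": "AudioAsset"}
edges = [
    ("isd", "has_sample", "s1"),
    ("s1", "has_audio", "a1"),
    ("s1", "has_visual", "v1"),
]
graph_problems = check_graph(nodes, edges)
print(graph_problems)  # ['dangling edge: s1 has_visual v1']
```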

Harmonisation potential

harmonisation_potential is a structural readiness score in the range [0, 1]. It is useful for reporting whether a sample or dataset is ready for later benchmark construction. It is not a performance metric and not a statistical harmonisation score.

Example components include:

  • PAQ completeness
  • audio metadata completeness
  • visual metadata availability
  • context completeness
  • provenance completeness
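
The source does not state how the components are combined. One simple possibility, shown here purely as an assumption, is an unweighted mean of component scores that each lie in [0, 1]; the function name and component keys below are hypothetical.

```python
def harmonisation_potential(components: dict[str, float]) -> float:
    """Unweighted mean of component scores, each expected to lie in [0, 1]."""
    if not components:
        raise ValueError("at least one component score is required")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"component {name!r} is outside [0, 1]: {value}")
    return sum(components.values()) / len(components)


# Illustrative component scores, not taken from either dataset.
potential = harmonisation_potential({
    "paq_completeness": 1.0,
    "audio_metadata_completeness": 1.0,
    "visual_metadata_availability": 0.0,
    "context_completeness": 0.5,
    "provenance_completeness": 0.5,
})
print(potential)  # 0.6
```

Because every input is bounded by [0, 1], the mean stays in the same range, which matches the documented range of the score. A weighted scheme would need the weights to be recorded alongside the score.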

Missingness

Missingness is represented explicitly as a list of records:

{
  "field": "visual",
  "status": "not_reported",
  "reason": "The source row has no populated video asset field."
}

Allowed statuses are:

  • available
  • not_collected
  • not_reported
  • not_accessible
  • not_applicable
  • unknown

The examples use missingness records for unavailable source citation fields, unreported sound-source details, and unavailable visual/audio assets.
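
A controlled-vocabulary check for missingness records can be sketched as follows. It assumes that field and status are required and that reason is optional, which the source does not confirm; the function name is hypothetical.

```python
ALLOWED_MISSINGNESS_STATUSES = {
    "available", "not_collected", "not_reported",
    "not_accessible", "not_applicable", "unknown",
}


def check_missingness(records: list[dict]) -> list[str]:
    """Return one message per problem found in a list of missingness records."""
    problems = []
    for index, item in enumerate(records):
        if not item.get("field"):
            problems.append(f"record {index}: missing 'field'")
        status = item.get("status")
        if status not in ALLOWED_MISSINGNESS_STATUSES:
            problems.append(f"record {index}: unknown status {status!r}")
    return problems


records = [
    {
        "field": "visual",
        "status": "not_reported",
        "reason": "The source row has no populated video asset field.",
    },
    # "missing" is not in the controlled vocabulary and should be flagged.
    {"field": "sound_sources", "status": "missing"},
]
missingness_problems = check_missingness(records)
print(missingness_problems)  # ["record 1: unknown status 'missing'"]
```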

Provenance

Provenance is also represented explicitly as a list of records:

{
  "source_type": "metadata_file",
  "note": "Values copied from datasets/ISD/data/clips.csv."
}

Allowed source types are:

  • paper
  • dataset_documentation
  • metadata_file
  • manual_inspection
  • code_extraction
  • author_communication
  • unknown

Acoustic and psychoacoustic indicators should include either a method or a provenance note when present.
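
Both rules can be expressed as small checks. In this sketch the indicator keys method and provenance_note are assumed names, since the source does not give the indicator record layout, and both helper names are hypothetical.

```python
ALLOWED_SOURCE_TYPES = {
    "paper", "dataset_documentation", "metadata_file", "manual_inspection",
    "code_extraction", "author_communication", "unknown",
}


def check_provenance(records: list[dict]) -> list[str]:
    """Flag provenance records whose source_type is outside the vocabulary."""
    return [
        f"record {index}: unknown source_type {item.get('source_type')!r}"
        for index, item in enumerate(records)
        if item.get("source_type") not in ALLOWED_SOURCE_TYPES
    ]


def indicator_is_documented(indicator: dict) -> bool:
    """True when an indicator carries a non-empty method or provenance note."""
    return bool(indicator.get("method") or indicator.get("provenance_note"))


provenance = [
    {"source_type": "metadata_file",
     "note": "Values copied from datasets/ISD/data/clips.csv."},
]
provenance_problems = check_provenance(provenance)
print(provenance_problems)  # []

# Illustrative indicator records; names and values are not from either dataset.
undocumented = {"name": "LAeq", "value": 62.1}
documented = {"name": "LAeq", "value": 62.1, "method": "computed from calibrated audio"}
print(indicator_is_documented(undocumented), indicator_is_documented(documented))  # False True
```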

ISO Method A utility

The helper function in scripts/iso12913.py can compute ISO Method A pleasantness and eventfulness when all eight PAQ items are available:

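# Assumes scripts/ is on the import path, for example when run from inside scripts/.
from iso12913 import compute_method_a_coordinates

# All eight canonical PAQ items must be present.
coords = compute_method_a_coordinates({
    "pleasant": 3,
    "vibrant": 4,
    "eventful": 3,
    "chaotic": 2,
    "annoying": 2,
    "monotonous": 1,
    "uneventful": 2,
    "calm": 4,
})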

The utility raises an error if required PAQ items are missing. It does not overwrite existing pleasantness or eventfulness values; callers must decide how to store any computed result and should record the computation in provenance.
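
For readers unfamiliar with Method A, the sketch below shows the projection as given in ISO/TS 12913-3, with the diagonal item pairs weighted by cos 45°. It is an independent illustration and not the code in scripts/iso12913.py: the function name method_a_sketch is hypothetical, and the optional division by 4 + √32 assumes ratings on a 1 to 5 scale, so the helper's actual return format and scaling may differ.

```python
import math

PAQ_ITEMS = (
    "pleasant", "vibrant", "eventful", "chaotic",
    "annoying", "monotonous", "uneventful", "calm",
)


def method_a_sketch(paq: dict[str, float], normalise: bool = True) -> dict[str, float]:
    """Project eight PAQ ratings onto pleasantness and eventfulness."""
    missing = [item for item in PAQ_ITEMS if item not in paq]
    if missing:
        raise ValueError(f"missing PAQ items: {', '.join(missing)}")
    w = math.cos(math.radians(45))  # weight for the diagonal item pairs
    pleasantness = (
        (paq["pleasant"] - paq["annoying"])
        + w * (paq["calm"] - paq["chaotic"])
        + w * (paq["vibrant"] - paq["monotonous"])
    )
    eventfulness = (
        (paq["eventful"] - paq["uneventful"])
        + w * (paq["chaotic"] - paq["calm"])
        + w * (paq["vibrant"] - paq["monotonous"])
    )
    if normalise:
        # Largest possible magnitude on a 1-5 scale: 4 + 8 * cos(45 deg) = 4 + sqrt(32).
        scale = 4 + math.sqrt(32)
        pleasantness /= scale
        eventfulness /= scale
    return {"pleasantness": pleasantness, "eventfulness": eventfulness}


sketch_coords = method_a_sketch({
    "pleasant": 3, "vibrant": 4, "eventful": 3, "chaotic": 2,
    "annoying": 2, "monotonous": 1, "uneventful": 2, "calm": 4,
})
print(round(sketch_coords["pleasantness"], 4))  # 0.4697
print(round(sketch_coords["eventfulness"], 4))  # 0.1768
```

With normalisation on, both coordinates fall in [-1, 1] for 1 to 5 ratings. Whichever scaling is used should be recorded in provenance so that computed values are not confused with coordinates copied from the source.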

Validation

Run the schema-level harmonisation validator from the repository root:

uv run python scripts/validate_schema_harmonisation.py

The validator checks:

  • required top-level fields
  • controlled vocabulary values
  • harmonisation-level statuses
  • multi-view alignment statuses
  • dataset_id and sample_id
  • perception.framework == ISO_12913 for ISD and ARAUS examples
  • missingness statuses
  • provenance records
  • canonical ISO PAQ field names
  • separation of raw PAQ items and derived ISO coordinates
  • lightweight graph node/edge consistency
  • harmonisation potential scores
  • mapping confidence/evidence/review fields
  • explicit visual missingness when no visual asset is available
  • method or provenance for acoustic and psychoacoustic indicators

Expected current summary:

files checked: 2
records checked: 2
mapping files checked: 2
checklist files checked: 1
graph model files checked: 1
warnings: 0
errors: 0

Literature-informed design notes

This MOSAIQ layer adapts three ideas from the referenced information-fusion literature while keeping the implementation conservative:

  • Nan et al. (2022), DOI 10.1016/j.inffus.2022.01.001: harmonisation should be reported with explicit dataset properties, missingness, provenance, reproducibility, and non-goals.
  • Holzinger et al. (2022), DOI 10.1016/j.inffus.2021.10.007: information fusion benefits from explainable, verifiable, graph-style representations.
  • Zhu et al. (2023), DOI 10.1016/j.inffus.2023.101935: multimodal alignment is better treated as multiple views rather than a single flattened feature table.