A knowledge engineering pipeline for building a Hybrid Intelligence (HI) knowledge graph from research papers and competition scenarios, using:
- LLM-based structured extraction into JSON
- deterministic Python-based RDF generation
- external linking, validation, querying, metrics, and embeddings
The project is built on top of the Hybrid Intelligence Ontology and the HINT thesaurus and was developed as an academic-engineering project for Knowledge Graphs & Semantic Technologies.
Current checked-in repository state:
- 11 HHAI papers
- 3 HI competition scenarios
- 14 source artifacts
- 3839 triples in the final merged graph
- 641 individuals
- 229 external links (including DOI-based owl:sameAs for all 11 papers)
- SHACL conformance: 0 violations
- 5 competency-question SPARQL query outputs (each with CSV and PNG chart)
- provenance / repair audit outputs in output/audit/
Canonical merged graph:
output/merged_kg.ttl
Populating an ontology from research papers and scenario descriptions is slow and error-prone when done manually. Important knowledge about:
- agents
- goals
- tasks
- capabilities
- contexts
- interactions
- evaluations
- experiments
is usually embedded in natural language rather than available in machine-readable form.
This repository addresses that problem through a hybrid workflow:
- an LLM extracts structured semantic content from text
- Python turns that structured output into RDF/Turtle deterministically
This design keeps the pipeline:
- inspectable
- debuggable
- reproducible downstream
- more academically defensible than direct LLM-to-RDF generation
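The split between extraction and serialization can be illustrated with a minimal sketch. It is not the repository's stage 03 code: the namespace `http://example.org/hi-kg/`, the `tasks` and `label` keys, and the helper names are assumptions made for illustration (only `required_capabilities` is a field name mentioned in this README). The point is that once the JSON is fixed, the emitted triples are fully determined by it.

```python
import json

# Hypothetical namespace and JSON keys, chosen only for this illustration.
BASE = "http://example.org/hi-kg/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def slug(label: str) -> str:
    """Turn a free-text label into a stable IRI fragment (no escaping beyond spaces)."""
    return "_".join(label.lower().split())


def record_to_ntriples(source_id: str, record: dict) -> list[str]:
    """Map one extracted record to a sorted, de-duplicated list of N-Triples lines."""
    triples = set()
    for task in record.get("tasks", []):
        task_iri = f"{BASE}{source_id}/task/{slug(task['label'])}"
        triples.add(f"<{task_iri}> <{RDF_TYPE}> <{BASE}Task> .")
        for cap in task.get("required_capabilities", []):
            cap_iri = f"{BASE}{source_id}/capability/{slug(cap)}"
            triples.add(f"<{task_iri}> <{BASE}requiresCapability> <{cap_iri}> .")
    # Sorting removes any dependence on dict or set iteration order.
    return sorted(triples)


extracted = json.loads(
    '{"tasks": [{"label": "Triage patients",'
    ' "required_capabilities": ["Risk assessment", "Explanation"]}]}'
)
lines = record_to_ntriples("paper_01", extracted)
print("\n".join(lines))
```

Running the function twice on the same JSON yields byte-identical output, which is what makes the downstream stages repeatable without calling the LLM again.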
The current project pipeline produces a knowledge graph that:
- integrates 11 research papers and 3 competition scenarios
- links to external resources through 229 external links (including DOI-based owl:sameAs for all 11 papers)
- passes SHACL validation with zero violations
- supports five non-trivial competency-question queries (each with CSV and PNG chart)
- includes metrics, community analysis, embeddings, and ablation results
- explicitly reports repaired and unresolved ABox links instead of hiding them
- models hi:hasSubTask decompositions in three papers (paper_01, paper_07, paper_11)
- resolves all previously unresolved requiresCapability links in paper_09
This makes the repository suitable both as:
- an academic submission
- a portfolio-ready knowledge graph engineering project
- data/hi-ontology.ttl: Hybrid Intelligence Ontology
- data/hi-thesaurus.ttl: HINT thesaurus
- data/hi-ontology-extensions.ttl: project-specific ontology extensions
- papers/*.pdf: 11 HHAI paper PDFs
- scenarios/scenarios.csv: 3 HI competition scenarios
- prompts/extraction_prompt.txt: prompt template used during structured extraction
01 -> 02 -> 03 -> 04 -> 05
parse -> LLM extraction -> RDF instances -> external linking -> merge
This is the canonical graph-construction workflow and produces:
output/merged_kg.ttl
06 -> 10 -> 07 -> 08 -> 09
reasoning -> SHACL -> SPARQL -> metrics -> embeddings
These stages are downstream consumers of the merged graph, not the core build path.
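The two stage sequences can also be written down as data, which makes it easy to restart a build from an intermediate stage. The sketch below is a hypothetical helper, not a script that exists in this repository: `plan`, `run`, `start_at`, and `dry_run` are illustrative names, while the script file names are the ones used in this README's run instructions. It assumes it is executed from the src/ directory.

```python
import subprocess
import sys

# Script names as used in this README; the runner itself is hypothetical.
CORE_BUILD = [
    "01_parse_sources.py",
    "02_extract_metadata.py",
    "03_generate_instances.py",
    "04_add_external_links.py",
    "05_merge_and_validate.py",
]
DOWNSTREAM = [
    "06_reasoning.py",
    "10_shacl_validation.py",
    "07_sparql_queries.py",
    "08_kg_metrics.py",
    "09_kg_embeddings.py",
]


def plan(include_downstream: bool = False, start_at: str = "01") -> list[str]:
    """Return the scripts to run, in order, beginning at the given core stage prefix."""
    core = [script for script in CORE_BUILD if script[:2] >= start_at]
    return core + (DOWNSTREAM if include_downstream else [])


def run(scripts: list[str], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for script in scripts:
        command = [sys.executable, script]
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failing stage


# Dry run: rebuild from the checked-in JSON (stage 03 onwards), then run the analyses.
run(plan(include_downstream=True, start_at="03"))
```

Keeping the downstream list separate reflects the distinction made above: those scripts read output/merged_kg.ttl and can be rerun without touching the build path.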
Parses PDFs and scenario rows into normalized text files.
Output:
output/text/*.txt
Uses the OpenAI API to extract structured JSON from normalized text.
Output:
output/json/*.json
Deterministically converts JSON into RDF/Turtle instance graphs.
Outputs:
- output/instances/*.ttl
- output/hi-thesaurus-extensions.ttl
- output/audit/instance_generation/
- output/audit/instance_generation_summary.json
- output/audit/instance_generation_summary.csv
Adds external links to concepts and instances using static mappings and query-based entity linking.
Output:
output/external_links.ttl
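As an illustration of the static-mapping side of this stage, a DOI-based owl:sameAs link can be produced with plain string handling. This is a sketch under assumptions: the paper IRI and the DOI are placeholders, and `normalize_doi` and `doi_same_as` are hypothetical helpers rather than functions from 04_add_external_links.py.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip a leading resolver or 'doi:' prefix and lower-case the DOI."""
    doi = raw.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # DOI names are case-insensitive, so lower-casing gives one canonical form.
    return doi.lower()


def doi_same_as(paper_iri: str, raw_doi: str) -> str:
    """Return one N-Triples line linking a paper individual to its DOI IRI.

    Note: DOIs with characters that are not valid in IRIs would need
    percent-encoding, which this sketch does not attempt.
    """
    return f"<{paper_iri}> <{OWL_SAME_AS}> <https://doi.org/{normalize_doi(raw_doi)}> ."


# Placeholder IRI and DOI, not taken from the repository.
link = doi_same_as("http://example.org/hi-kg/paper_01", " DOI:10.1234/ABC.5678 ")
print(link)
```

Normalizing before emitting the link keeps reruns from producing two different IRIs for the same paper.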
Merges schema, thesaurus, extensions, instances, and links into a canonical KG and performs structural validation checks.
It also loads protected manual patch files from output/manual_patches/ after the generated instances and links.
Outputs:
- output/merged_kg.ttl
- output/merged_kg.nt
- output/audit/merge_validation_summary.json
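The load order described for the merge stage can be sketched as a file listing with the manual patches last. This is an assumption-laden illustration, not the logic of 05_merge_and_validate.py: `merge_inputs` is a hypothetical helper, and the inclusion of output/hi-thesaurus-extensions.ttl in the merge is a guess. Because an RDF merge is a set union, the order mainly matters for logging and auditing rather than for the resulting triple set.

```python
import tempfile
from pathlib import Path


def merge_inputs(root: Path) -> list[Path]:
    """List the Turtle inputs in load order; manual patches come last."""
    data, output = root / "data", root / "output"
    ordered = [
        data / "hi-ontology.ttl",
        data / "hi-thesaurus.ttl",
        data / "hi-ontology-extensions.ttl",
        output / "hi-thesaurus-extensions.ttl",  # assumed to be part of the merge
    ]
    ordered += sorted((output / "instances").glob("*.ttl"))
    ordered.append(output / "external_links.ttl")
    ordered += sorted((output / "manual_patches").rglob("*.ttl"))
    # Skip anything that is absent so a partial checkout still yields a usable list.
    return [path for path in ordered if path.exists()]


# Demonstration on a throwaway directory that mimics the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in (
        "data/hi-ontology.ttl",
        "output/instances/paper_02.ttl",
        "output/instances/paper_01.ttl",
        "output/external_links.ttl",
        "output/manual_patches/instances/paper_01.ttl",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("", encoding="utf-8")
    names = [p.relative_to(root).as_posix() for p in merge_inputs(root)]

print("\n".join(names))
```

In a real merge each listed file would then be parsed into one graph, for example with rdflib.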
Runs OWL reasoning with Pellet through Owlready2.
Builds SHACL shapes and validates the merged KG.
Outputs:
- output/shacl/hi_shapes.ttl
- output/shacl/validation_report.ttl
- output/shacl/validation_summary.txt
Runs five competency-question-oriented SPARQL queries and exports result tables/charts.
Output:
output/queries/
Computes ontology-level and graph-level metrics and produces visualizations.
Output:
output/metrics/
Trains KG embedding models, performs link prediction, visualizes embeddings, and runs an ablation experiment using a combined KG.
Output:
output/embeddings/
.
├── data/ # Ontology, thesaurus, populated KG, extensions
├── docs/ # Project contract / course guidance files
├── output/
│ ├── text/ # Parsed source text
│ ├── json/ # LLM-generated structured metadata
│ ├── manual_patches/ # Protected manual RDF patches loaded after generated files
│ ├── instances/ # Per-source RDF instance graphs
│ ├── audit/ # Repair/provenance summaries
│ ├── queries/ # SPARQL outputs
│ ├── metrics/ # Graph metrics and visualisations
│ ├── embeddings/ # KGE outputs and ablation artifacts
│ ├── shacl/ # SHACL shapes and validation reports
│ ├── external_links.ttl # External links
│ ├── merged_kg.ttl # Canonical merged graph
│ └── merged_kg.nt # N-Triples export for reasoning tools
├── papers/ # Source paper PDFs
├── prompts/ # Extraction prompt
├── scenarios/ # Scenario CSV input
├── src/ # Numbered pipeline scripts
├── tests/ # Lightweight canonical-path checks
├── AGENTS.md # Repo rules for coding agents
├── requirements.txt
└── README.md
Install Python dependencies:
pip install -r requirements.txt
Current requirements.txt covers the main Python dependencies used by the numbered scripts, including:
- rdflib
- openai
- pdfplumber
- requests
- matplotlib
- networkx
- numpy
- owlready2
- pandas
- pykeen
- pyshacl
- scikit-learn
- scipy
- torch
- pytest
Stage 02_extract_metadata.py requires an OpenAI API key.
Set it in the environment before running extraction.
PowerShell:
$env:OPENAI_API_KEY="your-key-here"
Some stages require runtime support beyond Python packages:
- 06_reasoning.py requires Java for Pellet / Owlready2 reasoning
- 04_add_external_links.py requires internet access for Wikidata lookups
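These prerequisites can be checked up front so that a long run does not fail halfway. The following is a hypothetical preflight helper, not part of the repository; `missing_prerequisites` and its parameters are illustrative names. It covers the API key and Java only; a network check for stage 04 is left out because a reliable one depends on the environment.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def missing_prerequisites(
    stages: set[str],
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return human-readable problems for the requested stage numbers."""
    problems = []
    # Stage 02 calls the OpenAI API and needs a key in the environment.
    if "02" in stages and not env.get("OPENAI_API_KEY"):
        problems.append("stage 02: OPENAI_API_KEY is not set")
    # Stage 06 runs Pellet through Owlready2, which needs a Java runtime.
    if "06" in stages and which("java") is None:
        problems.append("stage 06: no 'java' executable found on PATH")
    return problems


for problem in missing_prerequisites({"02", "06"}):
    print(problem)
```

Passing `env` and `which` as parameters keeps the check easy to test without changing the real environment.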
Run commands from the src/ directory.
cd src/
# 01. Parse PDFs and scenarios into normalized text
python 01_parse_sources.py
# 02. Extract structured metadata with the OpenAI API
python 02_extract_metadata.py
# 03. Generate deterministic RDF instance graphs
python 03_generate_instances.py
# 04. Add external links
python 04_add_external_links.py
# 05. Merge and validate
python 05_merge_and_validate.py
# 06. OWL reasoning
python 06_reasoning.py
# 10. SHACL validation
python 10_shacl_validation.py
# 07. SPARQL competency queries
python 07_sparql_queries.py
# 08. KG metrics
python 08_kg_metrics.py
# 09. Embeddings + ablation
python 09_kg_embeddings.py
This repository supports two practical reproducibility modes.
Rebuild the entire pipeline from raw papers and scenarios.
Requires:
- OpenAI API access for stage 02
- internet access for stage 04
- Java for stage 06
Reuse the checked-in intermediate outputs and rerun the deterministic stages and analyses.
This allows reproducibility of:
- RDF generation
- merging
- validation
- SPARQL querying
- metrics
- embeddings
without repeating the LLM extraction step.
This distinction is important: without API access the project cannot be rebuilt from raw inputs, but it is substantially reproducible downstream of the checked-in artifacts.
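Before choosing the second mode, it can help to confirm that every normalized text file has a matching extraction JSON. The sketch below assumes that files in output/text/ and output/json/ share the same stem (for example paper_09.txt and paper_09.json); that naming, the scenario file name, and the helper `sources_missing_json` are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def sources_missing_json(text_dir: Path, json_dir: Path) -> list[str]:
    """Return source ids that have normalized text but no extraction JSON."""
    text_ids = {path.stem for path in text_dir.glob("*.txt")}
    json_ids = {path.stem for path in json_dir.glob("*.json")}
    return sorted(text_ids - json_ids)


# Demonstration on a throwaway directory; file names are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    text_dir, json_dir = Path(tmp, "text"), Path(tmp, "json")
    text_dir.mkdir()
    json_dir.mkdir()
    for name in ("paper_01.txt", "paper_09.txt", "scenario_01.txt"):
        (text_dir / name).write_text("", encoding="utf-8")
    for name in ("paper_01.json", "paper_09.json"):
        (json_dir / name).write_text("{}", encoding="utf-8")
    missing = sources_missing_json(text_dir, json_dir)

print(missing)
```

An empty result means stage 02 can be skipped and the rebuild can start at stage 03.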
Several post-pipeline improvements were applied as direct edits to checked-in artifacts rather than as pipeline changes:
- DOI external links: added to output/external_links.ttl directly and to src/expanded_concept_mappings.py so they are picked up by stage 05 and persist through future stage 04 reruns.
- hi:hasSubTask decompositions: originally added directly to output/instances/paper_01.ttl, paper_07.ttl, and paper_11.ttl, but now migrated to output/manual_patches/instances/ as protected patch files loaded during stage 05. The corresponding output/json/ files were not updated, so a full stage 03 rebuild alone would not recreate these decompositions.
- paper_09 capability alignment: corrected in both output/json/paper_09.json and output/instances/paper_09.ttl, so a stage 03 re-run from the corrected JSON would reproduce the fix.
- output/text/: normalized text extracted from papers and scenarios
- output/json/: structured LLM extraction outputs
- output/instances/: deterministic RDF instance graphs
- output/hi-thesaurus-extensions.ttl: proposed HINT concept extensions
- output/external_links.ttl: external links
- output/merged_kg.ttl: canonical merged KG
- output/merged_kg.nt: N-Triples export
- output/audit/instance_generation/: per-source repair / unresolved audit files
- output/audit/instance_generation_summary.json
- output/audit/instance_generation_summary.csv
- output/audit/merge_validation_summary.json
- output/shacl/: SHACL shapes and reports
- output/queries/: SPARQL outputs
- output/metrics/: graph metrics and visualizations
- output/embeddings/: model comparison, link prediction, embeddings, ablation outputs
The current checked-in snapshot includes:
- 14 normalized text files in output/text/
- 14 JSON extraction files in output/json/
- 14 instance graphs in output/instances/
- 1 canonical merged graph in output/merged_kg.ttl
Key current graph statistics:
- 3839 triples
- 641 individuals
- 11 papers
- 3 competition scenarios
- 229 external links
These values describe the current repository snapshot and should not be interpreted as benchmark claims.
The project uses multiple quality-control layers.
The most important quality-control design choice is the separation between:
- LLM extraction
- deterministic RDF serialization
This keeps the KG generation process inspectable and reduces prompt-driven RDF variability.
05_merge_and_validate.py performs merge-time validation checks on key graph structures.
Current merged validation summary:
- 3839 triples
- 0 validation errors
- 0 validation warnings
- explicit counts for:
  - extracted links
  - repaired links
  - unresolved matches
  - linked external actions
10_shacl_validation.py validates the merged KG against SHACL constraints.
Current status:
- CONFORMS
- 0 violations
Lightweight regression checks exist in:
tests/test_canonical_graph_contract.py
These focus on key canonical relations and the explicit query path used in Query 5.
A key late-stage improvement in this repository is that repaired and unresolved ABox links are no longer hidden.
Current audited counts from the instance-generation / merge summaries include:
- 46 direct task -> requiresCapability links
- 30 direct execution -> realizesTask links
- 5 fallback task -> requiresCapability repairs
- 1 fallback execution -> realizesTask repair
- 5 unresolved capability matches
- 1 unresolved execution-task match
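To put these counts in proportion, the share of direct, fallback, and unresolved cases per relation can be computed as below. This is a worked-arithmetic sketch: it assumes the three categories are mutually exclusive and together cover all attempted links for a relation, which the audit summaries may or may not define that way.

```python
# Counts copied from the audited list above.
counts = {
    "task -> requiresCapability": {"direct": 46, "fallback": 5, "unresolved": 5},
    "execution -> realizesTask": {"direct": 30, "fallback": 1, "unresolved": 1},
}


def shares(by_status: dict[str, int]) -> dict[str, float]:
    """Convert absolute counts into fractions of the per-relation total."""
    total = sum(by_status.values())
    return {status: count / total for status, count in by_status.items()}


report = {relation: shares(by_status) for relation, by_status in counts.items()}
for relation, by_status in report.items():
    summary = ", ".join(f"{status} {value:.1%}" for status, value in by_status.items())
    print(f"{relation}: {summary}")
```

Under that assumption, roughly 82% of the capability links and about 94% of the execution-task links were produced directly, with the remainder split evenly between fallback repairs and unresolved matches.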
The three previously unresolved requiresCapability links in paper_09 were resolved by a direct correction to both output/json/paper_09.json and output/instances/paper_09.ttl. The correction aligned the capability concepts assigned to agents with the capability concepts referenced in the task required_capabilities fields (the two sides of LLM extraction had been misaligned).
This is important academically: the project now exposes structural repairs explicitly instead of silently masking them.
The repository currently includes five competency-question-oriented SPARQL query outputs, covering:
- team composition and roles per use case
- capability distribution across papers
- constraints and phenomena across use cases
- interaction patterns and methods
- evaluation and experiment structure
Current query stage highlights:
- 15 results for team composition / roles
- 11 capability-distribution results (filter: > 1 capability per paper)
- 10 constraint / phenomenon results (filter: >= 2 total per use case)
- 29 interaction / method results
- 11 evaluation / experiment results
Each query produces both a CSV result table and a PNG chart. These outputs are available in:
output/queries/
08_kg_metrics.py produced ontology-level and graph-level metrics, including:
- 25 OWL classes
- 48 object properties
- 11 datatype properties
- 292 SKOS concepts
- 921 graph nodes
- 2560 graph edges
- largest weakly connected component: 98.3% of nodes
- 15 Louvain communities
- modularity: 0.5883
This provides a useful graph-level view of cohesion, connectivity, and community structure.
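For readers unfamiliar with the connectivity figure, the fraction of nodes in the largest weakly connected component can be computed by ignoring edge direction and grouping nodes with a union-find structure. The sketch below is a pure-Python stand-in on a toy edge list; it is not the code in 08_kg_metrics.py, and `largest_wcc_fraction` is a hypothetical helper.

```python
from collections import Counter


def largest_wcc_fraction(edges: list[tuple[str, str]]) -> float:
    """Fraction of nodes in the largest weakly connected component."""
    parent: dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Edge direction is ignored: both endpoints end up in the same set.
    for source, target in edges:
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_source] = root_target

    if not parent:
        return 0.0
    sizes = Counter(find(node) for node in list(parent))
    return max(sizes.values()) / len(parent)


toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
fraction = largest_wcc_fraction(toy_edges)
print(f"{fraction:.1%}")
```

In the toy graph, three of the five nodes fall into one component. A value close to 100%, as reported above, indicates that almost all nodes are reachable from one another when direction is ignored.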
09_kg_embeddings.py was run successfully on the project KG and on a combined KG for ablation.
Main result:
- DistMult was the best-performing model
Own KG:
- MRR = 0.3132
- Hits@10 = 0.5564
Combined KG:
- MRR = 0.4226
- Hits@10 = 0.5855
This suggests that enriching the training graph with the additional populated KG improved the embedding quality for the best-performing model.
The raw predicted links should be interpreted cautiously because the training graph mixes instance-level, thesaurus-level, and schema-level structure.
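The two reported metrics follow from the rank that the model assigns to each held-out true triple. The sketch below shows the definitions on a toy list of ranks and then computes the relative change between the reported own-KG and combined-KG scores. The toy ranks are invented for illustration, and PyKEEN offers several rank variants (for example optimistic, pessimistic, and realistic), so this is a simplified reading of the numbers rather than a reproduction of the evaluation.

```python
def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Average of 1/rank over all evaluated triples (ranks start at 1)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks: list[int], k: int = 10) -> float:
    """Share of evaluated triples whose true entity is ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


# Toy ranks, invented for illustration only.
toy_ranks = [1, 2, 5, 20]
toy_mrr = mean_reciprocal_rank(toy_ranks)
toy_hits = hits_at_k(toy_ranks, k=10)
print(f"toy MRR={toy_mrr:.4f}, toy Hits@10={toy_hits:.2f}")

# Scores reported above for DistMult.
own = {"MRR": 0.3132, "Hits@10": 0.5564}
combined = {"MRR": 0.4226, "Hits@10": 0.5855}
relative_gain = {name: (combined[name] - own[name]) / own[name] for name in own}
for name, gain in relative_gain.items():
    print(f"{name}: {gain:+.1%} relative change")
```

On the reported scores this gives a relative MRR gain of roughly 35% and a Hits@10 gain of roughly 5%, so the combined graph mostly improved how high the correct answers rank rather than how often they appear in the top ten.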
The repository has several known limitations.
Stage 02_extract_metadata.py depends on an external LLM API and therefore is not fully self-contained without API access.
Author and affiliation linking uses conservative query-based matching and does not guarantee perfect entity resolution.
A small number of fallback-generated and unresolved links remain. These are now explicitly audited. The paper_09 requiresCapability misalignments were resolved by direct JSON and TTL correction rather than pipeline re-generation.
The embedding experiments are useful enrichment, but raw link prediction outputs include semantically noisy candidates and should not be overclaimed.
Pellet reasoning through Owlready2 requires Java and is more environment-sensitive than the rest of the pipeline.
The hi:hasSubTask decompositions for paper_01, paper_07, and paper_11 are stored in output/manual_patches/instances/ and merged after the generated instance files. The corresponding output/json/ files were not updated, so a full rebuild from stage 03 alone would not reproduce these decompositions unless the manual patches are also retained.
This repository now contains a complete academic-engineering pipeline that:
- builds a Hybrid Intelligence KG from 11 papers and 3 scenarios
- produces a final graph of 3839 triples
- adds 229 external links, including DOI-based owl:sameAs for all 11 papers
- passes SHACL validation with zero violations
- answers five competency questions, each with a CSV result table and PNG chart
- provides graph metrics, community analysis, embeddings, and ablation results
- exposes repaired and unresolved ABox cases through explicit audit outputs
- models hi:hasSubTask task decompositions in three papers
- resolves all requiresCapability misalignments in paper_09
The strongest current deliverables for review are:
- output/merged_kg.ttl
- output/shacl/validation_summary.txt
- output/audit/merge_validation_summary.json
- output/queries/
- output/metrics/
- output/embeddings/
- Python
- RDF / Turtle / N-Triples
- rdflib
- OpenAI API
- SHACL / pySHACL
- Owlready2 / Pellet
- NetworkX / Matplotlib
- PyKEEN / PyTorch
This repository was developed as a course project and research-engineering prototype. Reuse of included ontology resources, papers, or derived outputs should respect the original licenses and academic context.