---
title: 'LLM-Annotator: Automated Named Entity Recognition Annotation for Energy Materials Literature Using Large Language Models'
tags:
authors:
affiliations:
date: 19 April 2026
bibliography: paper.bib
---

# Summary
LLM-Annotator is an open-source Python web application for automated named entity recognition
(NER) annotation of scientific literature, designed specifically for energy materials and chemistry
research. The tool integrates large language models (LLMs)—currently OpenAI GPT-4o and Anthropic
Claude 3.7 Sonnet—with domain-aware validation algorithms to extract structured entity annotations
from unstructured text. It supports both flat and hierarchical (nested) annotation modes, a
multi-stage position validation and correction pipeline, LLM-based quality assessment, automated
co-reference detection, and export to JSON and CoNLL formats compatible with standard NER training
pipelines such as spaCy. A Streamlit-based browser interface makes the tool accessible to
researchers without programming expertise. On a 50-paper gold-standard corpus of energy materials
literature, the system achieves an overall F1 score of 0.88—comparable to expert human
annotators—while reducing annotation time by 94% and API cost by up to 98% relative to fully
manual workflows.
# Statement of need

The volume of scientific literature in energy materials research—spanning batteries, photovoltaics, thermoelectrics, and fuel cells—grows faster than domain experts can manually curate. Critical knowledge about material compositions, synthesis protocols, characterisation techniques, and performance metrics remains locked in free text, which limits its accessibility for high-throughput screening, machine learning model training, and digital twin construction [@tshitoyan2019; @olivetti2020]. Converting this literature into structured, machine-readable databases is a recognised bottleneck in modern materials discovery pipelines [@kim2017; @kononova2019].
Existing scientific NLP tools such as ChemDataExtractor [@mavracic2021] and MatScholar
[@trewartha2022] address parts of this need but share three important limitations: (1) they do
not support nested or overlapping entities, which are common in materials descriptions (e.g.,
a lithium-doped mesoporous TiO2 scaffold); (2) they rely on fixed, pre-trained entity schemas
that are difficult to extend to emerging material classes; and (3) they lack integrated quality
assessment mechanisms. No existing open tool simultaneously supports hierarchical annotation,
character-level position validation, phantom-entity detection, and NER-compatible export in a
single pipeline. LLM-Annotator directly addresses these gaps, serving materials informatics
researchers, NLP practitioners building domain-specific NER models, and any scientist who needs
to convert literature corpora into structured datasets.
# State of the field

Scientific text mining for materials science has a substantial history. Early approaches used
rule-based parsing and chemical dictionaries [@swain2016]. Deep learning models such as BiLSTM-CRF
and BERT-based architectures [@devlin2019] advanced performance on NER tasks but require large
annotated training datasets, which are scarce in specialised domains [@gupta2022]. This annotation
bottleneck motivated research into transfer learning and the use of LLMs for information
extraction [@jablonka2024]. Recent work has demonstrated that GPT-4-class models can perform
competitive zero-shot and few-shot extraction of chemical entities [@zheng2023], though
hallucination and positional accuracy remain open challenges [@ji2023]. LLM-Annotator
contributes to this landscape by coupling LLM extraction with deterministic validation layers and
a human-in-the-loop interface, combining the contextual power of LLMs with the reliability
requirements of machine learning data pipelines.
# Functionality

Documents are segmented into overlapping chunks (200–4,000 characters, configurable) with absolute character offset tracking. Each chunk is submitted to the selected LLM using a structured prompt that defines entity types, labelling rules, and a required JSON output schema. Two prompt templates support flat (non-overlapping) and nested (hierarchical) annotation modes. Extracted entities are deduplicated across chunk boundaries and subjected to structural validation. Character-level position accuracy is verified via a three-stage correction procedure: (1) local search within a ±50-character window; (2) global document search; (3) fuzzy-normalised matching. Entities that fail all three stages are flagged for manual review. A secondary LLM evaluation pass assesses label correctness and returns per-entity recommendations (keep / relabel / delete) with confidence scores.
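The three-stage correction procedure can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name, return conventions, and the simplified fuzzy-matching step are assumptions made for this example.

```python
import re

def correct_position(text: str, entity: str, start: int, window: int = 50):
    """Verify or repair a character offset reported by the LLM (sketch)."""
    # Offset already correct: nothing to do.
    if text[start:start + len(entity)] == entity:
        return start
    # Stage 1: local search within a +/-50-character window.
    lo = max(0, start - window)
    idx = text.find(entity, lo, start + window + len(entity))
    if idx != -1:
        return idx
    # Stage 2: global document search.
    idx = text.find(entity)
    if idx != -1:
        return idx
    # Stage 3: fuzzy-normalised match (case- and whitespace-insensitive);
    # the returned offset is approximate because normalisation can shift positions.
    norm_text = re.sub(r"\s+", " ", text).lower()
    norm_entity = re.sub(r"\s+", " ", entity).lower()
    idx = norm_text.find(norm_entity)
    if idx != -1:
        return idx
    return None  # all stages failed: flag the entity for manual review
```

A `None` result corresponds to the manual-review flag described above.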
The nested annotation mode encodes parent–child entity relationships in which smaller entities are fully contained within a compositional or procedural parent. This is essential for representing complex materials science concepts such as nested dopant–host–architecture descriptions, where a flat schema would lose structural meaning. Two-level nesting achieves 78% exact-match accuracy; three-level nesting achieves 64%.
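A nested annotation for a phrase like "lithium-doped mesoporous TiO2 scaffold" might be represented as below. The field and label names are illustrative assumptions, not the tool's exact JSON schema; the containment check mirrors the structural validation described above.

```python
# Hypothetical nested annotation; labels and keys are illustrative only.
annotation = {
    "text": "lithium-doped mesoporous TiO2 scaffold",
    "label": "MATERIAL",
    "start": 0,
    "end": 38,
    "children": [
        {"text": "lithium", "label": "DOPANT",
         "start": 0, "end": 7, "children": []},
        {"text": "TiO2", "label": "CHEMICAL_FORMULA",
         "start": 25, "end": 29, "children": []},
    ],
}

def validate_nesting(node: dict) -> bool:
    """Return True if every child span lies fully inside its parent's span."""
    for child in node["children"]:
        if not (node["start"] <= child["start"] and child["end"] <= node["end"]):
            return False
        if not validate_nesting(child):
            return False
    return True
```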
After initial LLM annotation, a pattern-matching module identifies all co-referent mentions of each annotated entity throughout the document, increasing entity-instance recall by 287% (3.87×) with less than 2% additional processing time. Annotations are exported as (i) structured JSON preserving full metadata, or (ii) CoNLL-format IOB sequences tokenised by spaCy, enabling immediate use for training domain-specific NER models that operate independently of LLM API access.
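The character-offset-to-IOB conversion at the heart of the CoNLL export can be sketched as follows. The tool itself tokenises with spaCy; this self-contained example substitutes whitespace tokenisation and ignores entities that start mid-token, so it is a simplification rather than the actual export routine.

```python
def to_conll(text, entities):
    """Convert character-offset annotations to (token, IOB-tag) rows.

    `entities` is a list of (start, end, label) character spans.
    Simplified sketch: whitespace tokenisation stands in for spaCy.
    """
    rows = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)   # absolute offset of this token
        pos = start + len(token)
        tag = "O"
        for (s, e, label) in entities:
            if start == s:               # token opens the entity span
                tag = f"B-{label}"
            elif s < start < e:          # token continues the entity span
                tag = f"I-{label}"
        rows.append((token, tag))
    return rows
```

Writing one `token\ttag` pair per line, with blank lines between sentences, yields the CoNLL IOB format consumed by standard NER training pipelines.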
The Streamlit interface guides users through model selection, tag definition via CSV upload,
annotation, interactive review, and export without programming. Tag schemas are user-defined,
making the tool adaptable to any materials subdomain or scientific field. The software requires
Python ≥ 3.9 and depends on the `streamlit`, `openai`, `anthropic`, `spacy`, and `pandas`
packages, all available on PyPI.
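A user-defined tag schema uploaded as CSV might look like the following. The header and column names here are assumptions for illustration; consult the repository's documentation for the exact format the uploader expects.

```python
import csv
import io

# Hypothetical tag-schema CSV (columns are illustrative, not the tool's spec).
TAG_CSV = """tag,description
MATERIAL,Material names such as lithium cobalt oxide
CHEMICAL_FORMULA,Formulas such as LiCoO2
SYNTHESIS_METHOD,Methods such as sol-gel synthesis
"""

tags = list(csv.DictReader(io.StringIO(TAG_CSV)))
tag_names = [row["tag"] for row in tags]
```

Because the schema is plain data rather than a pre-trained label set, extending the tool to a new material class is a matter of editing a CSV file, not retraining a model.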
# Research applications

LLM-Annotator enables several research capabilities that were previously impractical at scale.
First, it accelerates the construction of large materials databases by processing documents in
minutes rather than hours, making comprehensive corpus-level annotation economically feasible.
Second, CoNLL export allows researchers to bootstrap domain-specific NER models from
automatically annotated data, producing models that operate without ongoing LLM API costs.
Third, the hierarchical annotation mode supports knowledge graph construction by preserving
compositional and procedural entity relationships that are critical for digital twin development
and data-driven materials design [@tao2019]. In domain case studies, the system processed 40
battery-materials papers (3,827 annotations, 7.6 hours versus an estimated 134 hours manually)
and 40 perovskite solar-cell papers (3,156 annotations, 6.8 hours versus an estimated 110 hours
manually), in each case yielding a >93% reduction in processing time and enabling corpus-level
trend analyses that would otherwise be prohibitively expensive.
| Entity Type | Precision | Recall | F1 |
|---|---|---|---|
| Material Names | 0.92 | 0.92 | 0.92 |
| Chemical Formulas | 0.91 | 0.90 | 0.91 |
| Synthesis Methods | 0.88 | 0.86 | 0.87 |
| Characterisation Techniques | 0.87 | 0.85 | 0.86 |
| Performance Metrics | 0.85 | 0.83 | 0.84 |
| Experimental Conditions | 0.84 | 0.82 | 0.83 |
| Process Parameters | 0.86 | 0.84 | 0.85 |
| Overall | 0.89 | 0.87 | 0.88 |
: System performance on the gold-standard corpus (n = 13,941 test instances across 15 expert-annotated papers; Cohen's κ = 0.84; 95% CI via bootstrap resampling, B = 1,000).
# AI usage disclosure

The software itself uses GPT-4o and Claude 3.7 Sonnet at runtime to perform entity extraction and quality assessment, as described throughout this paper. These model interactions are a core part of the tool's functionality and are disclosed to end users in the interface. During preparation of this manuscript, AI assistance was used for copy-editing of the paper text. All scientific content, experimental design, results, and conclusions were produced, reviewed, and validated by the author.
# Acknowledgements

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2193/1 – 390951807.