---
title: 'LLM-Annotator: Automated Named Entity Recognition Annotation for Energy Materials Literature Using Large Language Models'
tags:
authors:
affiliations:
date: 19 April 2026
bibliography: paper.bib
---

# Summary
LLM-Annotator is an open-source Python web application for automated named entity recognition
(NER) annotation of scientific literature, designed specifically for energy materials and chemistry
research. The tool integrates large language models (LLMs)—currently OpenAI GPT-4o and Anthropic
Claude 3.7 Sonnet—with domain-aware validation algorithms to extract structured entity annotations
from unstructured text. It supports both flat and hierarchical (nested) annotation modes, a
multi-stage position validation and correction pipeline, LLM-based quality assessment, automated
co-reference detection, and export to JSON and CoNLL formats compatible with standard NER training
pipelines such as spaCy. A Streamlit-based browser interface makes the tool accessible to
researchers without programming expertise. On a 50-paper gold-standard corpus of energy materials
literature, the system achieves an overall F1 score of 0.88—comparable to expert human
annotators—while reducing annotation time by 94% and API cost by up to 98% relative to fully
manual workflows.
# Statement of need

The volume of scientific literature in energy materials research—spanning batteries, photovoltaics, thermoelectrics, and fuel cells—grows faster than domain experts can manually curate. Critical knowledge about material compositions, synthesis protocols, characterisation techniques, and performance metrics remains locked in free text, which limits its accessibility for high-throughput screening, machine learning model training, and digital twin construction [@tshitoyan2019; @olivetti2020]. Converting this literature into structured, machine-readable databases is a recognised bottleneck in modern materials discovery pipelines [@kim2017; @kononova2019].
Existing scientific NLP tools such as ChemDataExtractor [@mavracic2021] and MatScholar
[@trewartha2022] address parts of this need but share three important limitations: (1) they do
not support nested or overlapping entities, which are common in materials descriptions (e.g.,
a lithium-doped mesoporous TiO2 scaffold); (2) they rely on fixed, pre-trained entity schemas
that are difficult to extend to emerging material classes; and (3) they lack integrated quality
assessment mechanisms. No existing open tool simultaneously supports hierarchical annotation,
character-level position validation, phantom-entity detection, and NER-compatible export in a
single pipeline. LLM-Annotator directly addresses these gaps, serving materials informatics
researchers, NLP practitioners building domain-specific NER models, and any scientist who needs
to convert literature corpora into structured datasets.
# State of the field

Scientific text mining for materials science has a substantial history. Early approaches used
rule-based parsing and chemical dictionaries [@swain2016]. Deep learning models such as BiLSTM-CRF
and BERT-based architectures [@devlin2019] advanced performance on NER tasks but require large
annotated training datasets, which are scarce in specialised domains [@gupta2022]. This annotation
bottleneck motivated research into transfer learning and the use of LLMs for information
extraction [@jablonka2024]. Recent work has demonstrated that GPT-4-class models can perform
competitive zero-shot and few-shot extraction of chemical entities [@zheng2023], though
hallucination and positional accuracy remain open challenges [@ji2023]. LLM-Annotator
contributes to this landscape by coupling LLM extraction with deterministic validation layers and
a human-in-the-loop interface, combining the contextual power of LLMs with the reliability
requirements of machine learning data pipelines.
# Functionality

Documents are segmented into overlapping chunks (200–4,000 characters, configurable) with absolute character offset tracking. Each chunk is submitted to the selected LLM using a structured prompt that defines entity types, labelling rules, and a required JSON output schema. Two prompt templates support flat (non-overlapping) and nested (hierarchical) annotation modes. Extracted entities are deduplicated across chunk boundaries and subjected to structural validation. Character-level position accuracy is verified via a three-stage correction procedure: (1) local search within a ±50-character window; (2) global document search; (3) fuzzy-normalised matching. Entities that fail all three stages are flagged for manual review. A secondary LLM evaluation pass assesses label correctness and returns per-entity recommendations (keep / relabel / delete) with confidence scores.
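The three-stage correction procedure can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name, return conventions, and the simplified fuzzy-matching step are assumptions made for this example.

```python
import re

def correct_position(text: str, entity: str, start: int, window: int = 50):
    """Verify or repair a character offset reported by the LLM (sketch)."""
    # Offset already correct: nothing to do.
    if text[start:start + len(entity)] == entity:
        return start
    # Stage 1: local search within a +/-50-character window.
    lo = max(0, start - window)
    idx = text.find(entity, lo, start + window + len(entity))
    if idx != -1:
        return idx
    # Stage 2: global document search.
    idx = text.find(entity)
    if idx != -1:
        return idx
    # Stage 3: fuzzy-normalised match (case- and whitespace-insensitive);
    # the returned offset is approximate because normalisation can shift positions.
    norm_text = re.sub(r"\s+", " ", text).lower()
    norm_entity = re.sub(r"\s+", " ", entity).lower()
    idx = norm_text.find(norm_entity)
    if idx != -1:
        return idx
    return None  # all stages failed: flag the entity for manual review
```

A `None` result corresponds to the manual-review flag described above.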
The nested annotation mode encodes parent–child entity relationships in which smaller entities are fully contained within a compositional or procedural parent. This is essential for representing complex materials science concepts such as nested dopant–host–architecture descriptions, where a flat schema would lose structural meaning. Two-level nesting achieves 78% exact-match accuracy; three-level nesting achieves 64%.
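A nested annotation for a phrase like "lithium-doped mesoporous TiO2 scaffold" might be represented as below. The field and label names are illustrative assumptions, not the tool's exact JSON schema; the containment check mirrors the structural validation described above.

```python
# Hypothetical nested annotation; labels and keys are illustrative only.
annotation = {
    "text": "lithium-doped mesoporous TiO2 scaffold",
    "label": "MATERIAL",
    "start": 0,
    "end": 38,
    "children": [
        {"text": "lithium", "label": "DOPANT",
         "start": 0, "end": 7, "children": []},
        {"text": "TiO2", "label": "CHEMICAL_FORMULA",
         "start": 25, "end": 29, "children": []},
    ],
}

def validate_nesting(node: dict) -> bool:
    """Return True if every child span lies fully inside its parent's span."""
    for child in node["children"]:
        if not (node["start"] <= child["start"] and child["end"] <= node["end"]):
            return False
        if not validate_nesting(child):
            return False
    return True
```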
After initial LLM annotation, a pattern-matching module identifies all co-referent mentions of each annotated entity throughout the document, increasing entity-instance recall by 287% (3.87×) with less than 2% additional processing time. Annotations are exported as (i) structured JSON preserving full metadata, or (ii) CoNLL-format IOB sequences tokenised by spaCy, enabling immediate use for training domain-specific NER models that operate independently of LLM API access.
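The character-offset-to-IOB conversion at the heart of the CoNLL export can be sketched as follows. The tool itself tokenises with spaCy; this self-contained example substitutes whitespace tokenisation and ignores entities that start mid-token, so it is a simplification rather than the actual export routine.

```python
def to_conll(text, entities):
    """Convert character-offset annotations to (token, IOB-tag) rows.

    `entities` is a list of (start, end, label) character spans.
    Simplified sketch: whitespace tokenisation stands in for spaCy.
    """
    rows = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)   # absolute offset of this token
        pos = start + len(token)
        tag = "O"
        for (s, e, label) in entities:
            if start == s:               # token opens the entity span
                tag = f"B-{label}"
            elif s < start < e:          # token continues the entity span
                tag = f"I-{label}"
        rows.append((token, tag))
    return rows
```

Writing one `token\ttag` pair per line, with blank lines between sentences, yields the CoNLL IOB format consumed by standard NER training pipelines.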
The Streamlit interface guides users through model selection, tag definition via CSV upload,
annotation, interactive review, and export without programming. Tag schemas are user-defined,
making the tool adaptable to any materials subdomain or scientific field. The software requires
Python ≥ 3.9 and depends on the `streamlit`, `openai`, `anthropic`, `spacy`, and `pandas`
packages, all available on PyPI.
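A user-defined tag schema uploaded as CSV might look like the following. The header and column names here are assumptions for illustration; consult the repository's documentation for the exact format the uploader expects.

```python
import csv
import io

# Hypothetical tag-schema CSV (columns are illustrative, not the tool's spec).
TAG_CSV = """tag,description
MATERIAL,Material names such as lithium cobalt oxide
CHEMICAL_FORMULA,Formulas such as LiCoO2
SYNTHESIS_METHOD,Methods such as sol-gel synthesis
"""

tags = list(csv.DictReader(io.StringIO(TAG_CSV)))
tag_names = [row["tag"] for row in tags]
```

Because the schema is plain data rather than a pre-trained label set, extending the tool to a new material class is a matter of editing a CSV file, not retraining a model.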
# Research applications

LLM-Annotator enables several research capabilities that were previously impractical at scale.
First, it accelerates the construction of large materials databases by processing documents in
minutes rather than hours, making comprehensive corpus-level annotation economically feasible.
Second, CoNLL export allows researchers to bootstrap domain-specific NER models from
automatically annotated data, producing models that operate without ongoing LLM API costs.
Third, the hierarchical annotation mode supports knowledge graph construction by preserving
compositional and procedural entity relationships that are critical for digital twin development
and data-driven materials design [@tao2019]. In domain case studies, the system processed 40
battery-materials papers (3,827 annotations, 7.6 hours versus an estimated 134 hours manually)
and 40 perovskite solar-cell papers (3,156 annotations, 6.8 hours versus an estimated 110 hours
manually), in each case yielding a >93% reduction in processing time and enabling corpus-level
trend analyses that would otherwise be prohibitively expensive.
| Entity Type | Precision | Recall | F1 |
|---|---|---|---|
| Material Names | 0.92 | 0.92 | 0.92 |
| Chemical Formulas | 0.91 | 0.90 | 0.91 |
| Synthesis Methods | 0.88 | 0.86 | 0.87 |
| Characterisation Techniques | 0.87 | 0.85 | 0.86 |
| Performance Metrics | 0.85 | 0.83 | 0.84 |
| Experimental Conditions | 0.84 | 0.82 | 0.83 |
| Process Parameters | 0.86 | 0.84 | 0.85 |
| Overall | 0.89 | 0.87 | 0.88 |
: System performance on the gold-standard corpus (n = 13,941 test instances across 15 expert-annotated papers; Cohen's κ = 0.84; 95% CI via bootstrap resampling, B = 1,000).
# AI usage disclosure

The software itself uses GPT-4o and Claude 3.7 Sonnet at runtime to perform entity extraction and quality assessment, as described throughout this paper. These model interactions are a core part of the tool's functionality and are disclosed to end users in the interface. During preparation of this manuscript, AI assistance was used for copy-editing of the paper text. All scientific content, experimental design, results, and conclusions were produced, reviewed, and validated by the author.
# Acknowledgements

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2193/1 – 390951807.