diff --git a/CHANGELOG.md b/CHANGELOG.md
index f738e16c..db4b1017 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,10 @@
# CHANGELOG
-## 1.0 (combined with 0.6.0)
+## 1.0.1
+
+- Updated technical design document to reflect 1.0 functionality
+
+## 1.0
### Sentence corpus creation
diff --git a/docs/images/system-architecture.svg b/docs/images/system-architecture.svg
index ba7d868e..0c967ba8 100644
--- a/docs/images/system-architecture.svg
+++ b/docs/images/system-architecture.svg
@@ -1 +1 @@
-
+
diff --git a/docs/technical-design.md b/docs/technical-design.md
index 0c895397..c3435dd7 100644
--- a/docs/technical-design.md
+++ b/docs/technical-design.md
@@ -3,450 +3,147 @@ title: Technical Design
hide: [navigation]
---
-# Technical Design Document
+# Technical Design
-*This serves as a living document to record the current design decisions for this project. The expectation is not a comprehensive listing of issues or tasks, but rather a recording of the key structural and process elements, as well as primary deliverables, that will guide development.*
-
-*The primary audience for this document is internal to the development team.*
-
-*Note: This document was originally written for planning. It will be updated during development, but it may be out of date.*
-
-______________________________________________________________________
-
-## `remarx`
-
-**Project Title:** Citing Marx: Die Neue Zeit, 1891-1918
-
-**RSE Lead:** Rebecca Koeser
-
-**RSE Team Members:** Laure Thompson, Hao Tan, Mary Naydan, Jeri Wieringa
-
-**Faculty PI:** Edward Baring
-
-**Development Start Date:** 2025-08-04
-
-**Target Release Date:** Fall 2025
+*Updated January 2026 to reflect the 1.0 release.*
______________________________________________________________________
### Statement of Purpose
-*Primary goals for the software*
-
-The primary goal of this software tool is to provide a simple user interface to enable researchers to upload files for two sets of German language content to be compared, and download a dataset of similar passages between the two sets of content.
-
-The primary use case for this tool is to identify direct quotes of Karl Marx's *Manifest der Kommunistischen Partei* (Communist Manifesto) and the first volume of *Das Kapital* (Kapital) within a subset of *Die Neue Zeit* (DNZ) articles.
-
-### Task Identification and Definition
-
-| **Task** | **Definition** |
-| :----------------------- | :--------------------------------------------------------------------------------------------- |
-| Quotation Identification | Determine which passages of text quote a source text and identify the passages that are quoted |
-
-### Out of Scope
-
-- Non-German quote detection
-- Cross-lingual quote detection is out of scope for this phase, but the tool will be built with an eye towards future expansion/support of other languages
-- Paraphrase / indirect quote detection
-- Attribution, citation, and named-reference detection
-
-## Requirements
-
-### Functional Requirements
+The `remarx` software package provides a simple user interface that enables researchers to find quotations in German-language content, and supports generating a dataset of similar passages (i.e., quotations or partial quotations) between two sets of texts.
-*What the software must do*
+The immediate use case motivating the development of this software was to identify direct quotes of Karl Marx's _Manifest der Kommunistischen Partei_ (Communist Manifesto) and the first volume of _Das Kapital_ (Kapital) within a subset of _Die Neue Zeit_ (DNZ) articles.
-- Reads plaintext files corresponding to original (i.e., source of quotes) and reuse (i.e., where quotations occur) texts
-- Builds a quotation corpus that operates at a sentence-level granularity
+The core functionality takes text content from _original_ (i.e., source of quotes) and _reuse_ (i.e., where quotations occur) documents and builds a sentence- or passage-level quotation corpus. The functionality is primarily accessible through a web-based interface, which consists of two Marimo notebooks running in application mode; it is also accessible through two command-line scripts. Both interfaces provide entry points to the same functionality in the `remarx` project. The application is designed to run locally on researcher machines.
-### Non-Functional Requirements
-
-*How the software must behave*
-
-- Assumes input content is written in German (will be built with an eye towards supporting other languages in future, but not in this version)
-- Is optimized for the DNZ-Marx use case
-- Can be adapted to other monolingual and crosslingual settings, but with no quality guarantees
-- Compatible with any *German* original (e.g., other texts written by Marx) and reuse texts (e.g., earlier/later issues of DNZ, other journals)
-- Uses sentence embeddings as a core data representation
-- For sentence embedding construction, uses a preexisting model available through sentence-transformers or huggingface
-- Developed and tested on Python 3.12
-- If this software is run locally it should be compatible with macOS, linux, and windows based operating systems; and be compatible with both x86 and ARM architectures.
-
-### System Architecture
-
-*High-level description of the system and its components*
+### Application Architecture

-The system is composed of two separate software programs: the sentence corpus builder and the core pipeline.
-
-The sentence corpus builder is a program for creating the input CSV files for the core pipeline.
-It takes as input some number of texts and outputs a CSV file containing sentence-level text and metadata for these input texts.
-This program will have custom solutions for the primary use case of the tool (TEI-XML, ALTO-XML) as well as plaintext for general use.
+The application has two components: [Sentence Corpus Builder](#sentence-corpus-builder) and [Quote Finder](#quote-finder).
-The core pipeline has two key inputs: reuse and original sentence corpora. Each of these inputs are one or more CSV files where each row corresponds to a sentence from a reuse or original text (i.e., as produced by the sentence corpus builder). Ultimately, the system will output a CSV file containing the identified quotations which will be called the quoted corpus.
+#### Default directory structure
-The core pipeline can be broken down into two key components:
+For convenience, the application recommends (but does not require) a default directory structure under a `remarx-data` directory in the user's home directory. The Sentence Corpus Builder web interface prompts the user to save sentence output files in the default original or reuse location (with an option to store them elsewhere), and the Quote Finder web interface defaults to loading content from the original and reuse corpora folders.
-1. Sentence-Level Quote Detection
-2. Quote Compilation
+```
+$HOME/remarx-data/
+├── corpora/
+│   ├── original/
+│   └── reuse/
+└── quotes/
+```
-**Sentence-Level Quote Detection.** This component takes an original and reuse sentence corpus as input and outputs a corpus of sentence-pairs corresponding to quotes.
+Sentence embedding cache files are named based on the input sentence corpus file they are generated from, and are created in the same directory as the input file.
-**Quote Compilation.** In this step, the quote sentence pairs from the last step are refined into our final quotes corpus. Since quotes can be more than one sentence long, the primary goal of this component is to merge quote sentence-pairs that correspond to the same multi-sentence quote.
+Quote Finder output is saved to the `quotes` directory by default.
-- Minimum functionality: merge passages with sequential pairs of sentences from both corpora
+### Sentence Corpus Builder
-#### Infrastructure and Deployment
+The Sentence Corpus Builder is used to convert input content into the format needed for the Quote Finder.
+It takes a single file in a supported format (TEI XML, zipfile of ALTO XML pages, or plain text) and outputs a CSV file containing a sentence-level text corpus generated from the input text.
-*Where the software will run; includes all different environments expected for development, staging/testing, and production, and any additional resources needed or expected (e.g., new VMs, OnDemand developer access, HPC, TigerData, etc).*
+It is implemented as a Python function that extracts sentences from an input text and generates a corresponding
+sentence-level corpus CSV file. It consists of the following steps:
-**Environments:**
+- text extraction
+- sentence segmentation
+- sentence filtering (excluding two-word and numeric- or punctuation-only "sentences")
-- Development: RSE local machines as possible; della as needed
-- Staging: della
- - To ensure software works on a common environment, we plan to use della as our staging environment for acceptance testing & quality assurance testing
-- Production: researcher machines (via installable python package, using uv and uv run if possible) or della via a shared Open OnDemand app (to be determined based on scale of data and compute requirements)
+Supported text input formats:
-**Resources:**
+- TEI XML. Yields all body content first followed by all footnotes, to aid in consolidating sequential sentences
+- Zip file of ALTO-XML files. Uses block-level segmentation tags to determine what content to include and to provide article metadata and page numbers
+- Plain text (TXT)
-- TigerData project partition
-- Della access
-- OnDemand developer access
+TEI and ALTO input processing has been customized to suit specific project inputs for this phase (MEGA digital editions, transcriptions generated by customized segmentation and training in eScriptorium). The software may require modification to work well for other inputs.
-### Component Details
+Sentence segmentation relies on off-the-shelf NLP tools with German support (currently spaCy, which was evaluated in comparison with Stanza). It uses a German language model by default and may perform poorly for other languages.
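+
+The following is a minimal, illustrative sketch of the segmentation and filtering steps for plain-text input, using spaCy's `de_core_news_sm` model; the function name, identifier format, and exact filtering rules are assumptions for illustration rather than the actual `remarx` API.
+
+```python
+# Illustrative sketch only: segmentation + filtering for a plain-text input.
+# Model name, id format, and filtering rules are assumptions, not remarx API.
+import csv
+from pathlib import Path
+
+import spacy
+
+
+def build_sentence_corpus(text_path: Path, output_csv: Path) -> None:
+    """Segment a plain-text file into sentences and write a corpus CSV."""
+    nlp = spacy.load("de_core_news_sm")  # German pipeline (assumed model name)
+    doc = nlp(text_path.read_text(encoding="utf-8"))
+
+    with output_csv.open("w", newline="", encoding="utf-8") as fh:
+        writer = csv.DictWriter(fh, fieldnames=["sent_id", "file", "sent_index", "text"])
+        writer.writeheader()
+        index = 0
+        for sent in doc.sents:
+            sentence = sent.text.strip()
+            # Rough filtering, as described above: skip very short "sentences"
+            # and those with no alphabetic characters (numbers/punctuation only).
+            if len(sentence.split()) <= 2 or not any(c.isalpha() for c in sentence):
+                continue
+            writer.writerow(
+                {
+                    "sent_id": f"{text_path.stem}_{index}",
+                    "file": text_path.name,
+                    "sent_index": index,
+                    "text": sentence,
+                }
+            )
+            index += 1
+```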
-*Description of each component and its functionality*
+#### Sentence Corpus output
-#### Sentence Corpus Builder
-
-This component is a python program that extracts sentences from some number of input texts and compiles them into a single sentence-level corpus CSV file.
-It can be broken into two core stages: text extraction and sentence segmentation.
-
-Text extraction will be customized to each of the supported input types:
-
-- MEGA digital TEI-XML files
-- Communist Manifesto TXT files as produced by a custom, one-time script converting HTML files to plaintext
-- Transcription ALTO-XML files as produced by the research team’s custom text transcription and segmentation pipeline
-- Plaintext files (TXT)
+A CSV file with each row corresponding to a single sentence in either an original or reuse text. This corpus may include additional data (e.g., document metadata for citation); this will be ignored by the system but preserved and passed through in the generated corpus.
-Each of these solutions will involve extracting the text from the corresponding file type that can be subsequently passed to sentence segmentation.
+| **Field** | **Description** | **Type** | **Required?** |
+| :------------- | :------------------------------------------------------------------ | :------- | :----------------------------------- |
+| `sent_id` | Unique identifier for each sentence | String | Yes |
+| `file` | Source document filename 1 | String | Yes |
+| `sent_index` | Sentence-level index within document (0-based indexing) | Integer | Yes |
+| `text` | Sentence text content | String | Yes |
+| `section_type` | What text section the sentence comes from (e.g., text vs. footnote) | String | Optional; TEI/ALTO only 2 |
+| `page_number` | Page of the source text where the sentence begins | String | Optional; TEI/ALTO only 2 |
+| `title` | Title of the article the sentence comes from, when available | String | Optional; ALTO only 2 |
+| `author` | Author of the article the sentence comes from, when available | String | Optional; ALTO only 2 |
+| `page_file` | Page file document within the ALTO zip file | String | Optional; ALTO only |
+| `line_number` | Line number of the source text where the sentence begins | String | Optional; TEI only |
-Sentence segmentation will rely on an off-the-shelf NLP package that supports sentence segmentation for German. We expect to use stanza for this purpose, but will have research team review segmentation quality to determine if we need to select a different tool. This step expects the input texts to be in German and may perform poorly for other languages.
+1. For ALTO XML, the `file` field is the zip file that contains a set of ALTO page XML files.
+2. Depends on ALTO custom segmentation; may not always be present.
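+
+For illustration only, a sentence corpus generated from a TEI input might contain rows like the following (all values are invented):
+
+```
+sent_id,file,sent_index,text,section_type,page_number
+kapital_sent_0,kapital_mega.xml,0,"Die Ware ist zunächst ein äußerer Gegenstand.",text,11
+kapital_sent_1,kapital_mega.xml,1,"Ein Ding, das durch seine Eigenschaften menschliche Bedürfnisse befriedigt.",text,11
+```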
-This program will require unique input filenames to ensure sentences can be linked back to their original input file.
+### Quote Finder
-This component must also provide some linking mechanism for the passages of the output file to be linked back to their original document locations (e.g., starting character indices, line numbers). This component may also extract additional metadata such as article title or page number.
+The Quote Finder requires two sets of inputs: original and reuse sentence corpora. Each input is a CSV file as generated by the Sentence Corpus Builder, where each row corresponds to a sentence from a reuse or original text. Currently, the system allows selecting multiple original corpora and one reuse corpus. The Quote Finder generates a CSV file of identified quotations found when comparing the reuse and original corpora.
#### Sentence-Level Quote Detection
-This component is a python program that identifies quote sentence-pairs for some input reuse and original sentence corpus. It will involve two core steps. First, the sentence embeddings will be constructed for the sentences in each corpus. These embeddings will be generated by a pretrained model available via SBERT or HuggingFace. Then, likely quotes will be identified using embedding similarity via an off-the-shelf approximate nearest neighbors library. Finally, the program will output a quote sentence-pair corpus CSV file.
+This component is a Python program that identifies quote sentence-pairs for a given set of original and reuse sentence corpora. First, sentence embeddings are generated for each sentence in the input corpora, using a pretrained model available via SBERT or HuggingFace. Then, likely quotes are identified using embedding similarity with an off-the-shelf approximate nearest neighbors library ([Voyager, from Spotify](https://spotify.github.io/voyager/)) and filtered by a cutoff score. The result is a quote sentence-pair corpus CSV file ([described below](#quote-finder-output)).
-This component expects the sentences to be in German, but other multilingual or monolingual methods could easily be swapped in. In development, we expect to save the sentence embeddings as a (pickled) numpy binary file. The final pipeline may save these embeddings as an additional intermediate file.
+The default embedding model is multilingual, but it has not been tested extensively against multilingual content. Other models could be added in the future.
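+
+The sketch below illustrates the general embed-and-query approach with `sentence-transformers` and Voyager; the model name, cutoff value, and variable names are assumptions for illustration and not necessarily the `remarx` defaults.
+
+```python
+# Illustrative sketch only: embed both corpora and look up likely quote pairs
+# with Voyager. Model name and cutoff are assumptions, not remarx defaults.
+from sentence_transformers import SentenceTransformer
+from voyager import Index, Space
+
+model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
+
+original_sentences = ["..."]  # text column of the original sentence corpus
+reuse_sentences = ["..."]     # text column of the reuse sentence corpus
+
+original_vecs = model.encode(original_sentences, convert_to_numpy=True)
+reuse_vecs = model.encode(reuse_sentences, convert_to_numpy=True)
+
+# Build an approximate nearest neighbor index over the original sentences.
+index = Index(Space.Cosine, num_dimensions=original_vecs.shape[1])
+index.add_items(original_vecs)
+
+# For each reuse sentence, find its closest original sentence and keep pairs
+# whose similarity clears a cutoff (cosine distance -> similarity).
+CUTOFF = 0.85  # assumed value for illustration
+neighbors, distances = index.query(reuse_vecs, k=1)
+for reuse_idx, (orig_idx, dist) in enumerate(zip(neighbors[:, 0], distances[:, 0])):
+    score = 1.0 - float(dist)
+    if score >= CUTOFF:
+        print(reuse_idx, int(orig_idx), round(score, 3))
+```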
-#### Quote Compilation
+The most compute-intensive part of this process is computing sentence embeddings, which may take some time depending on researcher hardware. For convenience, sentence embeddings are cached as a pickled numpy binary file adjacent to the input sentence corpus file; embeddings are regenerated when the input sentence corpus file is newer than the embeddings file.
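+
+A minimal sketch of this caching pattern, assuming the cache file uses the corpus filename with a `.npy` extension and a hypothetical `embed()` helper:
+
+```python
+# Illustrative sketch of the caching pattern: store embeddings next to the
+# corpus CSV and rebuild them only when the CSV is newer than the cache.
+# The cache filename and the embed() helper are hypothetical.
+from pathlib import Path
+from typing import Callable
+
+import numpy as np
+
+
+def load_or_build_embeddings(corpus_csv: Path, embed: Callable[[Path], np.ndarray]) -> np.ndarray:
+    cache_file = corpus_csv.with_suffix(".npy")
+    if cache_file.exists() and cache_file.stat().st_mtime >= corpus_csv.stat().st_mtime:
+        return np.load(cache_file, allow_pickle=True)
+    embeddings = embed(corpus_csv)   # recompute (the slow step)
+    np.save(cache_file, embeddings)  # cache adjacent to the input file
+    return embeddings
+```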
-This component is a python program that will transform the quote sentence-pairs corpus into a more general quote corpus. Sentence-pairs will be merged into a single multi-sentence quote when sentence pairs are sequential or near-sequential. Two sentence pairs are sequential when both the original and reuse sentence of one pair directly preceded the original and reuse sentence respectively of the other pair.
+#### Quote Consolidation
-Additionally, if the quote-sentence pair corpus does not include original and reuse sentence metadata, then these will be linked in via the sentence corpora produced by the Sentence Segmentation component.
+This is an optional Python method that consolidates the quote sentence-pair corpus into passages, where possible. Sentence pairs that are sequential in both the original and reuse source texts are merged into a single multi-sentence quote, with a field indicating the number of sentences that were consolidated. Future versions may refine the consolidation logic to support more cases, such as skipped sentences or slight variations in order, e.g. for quotations that re-order source content.
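+
+A simplified sketch of the sequential-merge rule, using pandas and the output column names described below; the grouping and aggregation details are illustrative rather than the exact `remarx` implementation.
+
+```python
+# Illustrative sketch: adjacent sentence pairs are merged into one quote when
+# both the reuse and original sentence indices increase by exactly one.
+import pandas as pd
+
+
+def consolidate(pairs: pd.DataFrame) -> pd.DataFrame:
+    pairs = pairs.sort_values(["reuse_file", "reuse_sent_index"]).reset_index(drop=True)
+    sequential = (
+        (pairs["reuse_file"] == pairs["reuse_file"].shift())
+        & (pairs["original_file"] == pairs["original_file"].shift())
+        & (pairs["reuse_sent_index"] == pairs["reuse_sent_index"].shift() + 1)
+        & (pairs["original_sent_index"] == pairs["original_sent_index"].shift() + 1)
+    )
+    groups = (~sequential).cumsum()
+    return (
+        pairs.groupby(groups)
+        .agg(
+            match_score=("match_score", "mean"),
+            reuse_file=("reuse_file", "first"),
+            reuse_text=("reuse_text", " ".join),
+            original_file=("original_file", "first"),
+            original_text=("original_text", " ".join),
+            num_sentences=("reuse_sent_index", "size"),
+        )
+        .reset_index(drop=True)
+    )
+```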
-The initial implementation will be based on sequential sentences in both corpora; this will be refined based on project team testing and feedback, as time and priorities allow. Potentially revisions include skipped sentences and alternate order within some context window (e.g., for quotations that re-order content from the same paragraph).
-
-#### Application Interface
-
-This component is a Marimo notebook designed to run in application mode, which will provide a graphical user interface to the `remarx` software.
-This interface will allow users to run both the sentence corpus builder and the core pipeline.
-This notebook will allow users to select and upload original and reuse texts and sentence corpora.
-The notebook will contain minimal code and primarily call methods within the `remarx` package.
-Users will have an option to save or download the output to a file location of their choice.
-
-Depending on whether this notebook is run locally or on della via an Open OnDemand app, the pipeline’s intermediate files will be stored on either the user’s local machine or in a scratch folder on della. The software will provide a simple way to configure paths for downloaded and generated models, with sensible defaults for common scenarios.
-
-The initial version of this application will not automatically clean up files generated by the processing, but will report locations and sizes (e.g., to allow manual cleanup). Future refinements may include clean up on user request or automated cleanup, depending on project team feedback.
-
-#### POST: Final Citation Corpus Compilation
-
-This component is outside of the core software pipeline. This will link the identified quotes to any additional required metadata from the original texts (Kapital and Communist Manifesto) and the reuse texts (DNZ). This could include the following metadata:
-
-- Both: Starting page number
-- Both: Content type (e.g., text, footnote)
-- Original: Marx title
-- DNZ: Journal name
-- DNZ: Volume issue
-- DNZ: Year of issue
-- DNZ: Article title
-- DNZ: Article author (optional, where possible)
-
-We aim for data compilation to be done without custom development effort, and will look for a sufficient solution for this phase of the project, which will empower research team members to review and work with the data prior to export for publication. We have requested a no-code database interface from PUL ([#princeton_ansible/6373](https://github.com/pulibrary/princeton_ansible/issues/6373)); other alternatives are Google Sheets, Excel template, or AirTable.
-
-## Data
-
-### Data Design
-
-*Description of the data structures used by the software, with expected or required fields*
-
-#### Original/Reuse Sentence Corpus
-
-A CSV file with each row corresponding to a single sentence in either an original or reuse text. This corpus may include additional data (e.g., document metadata for citation); this will be ignored by the system but preserved and passed through in the generated corpus.
-
-| **Field Name** | **Description** | **Type** | **Required / Optional** | **Reason** |
-| :------------- | :------------------------------------------------------------ | :------- | :---------------------- | :--------------------------------------------------------- |
-| id | Unique identifier for each sentence | String | Required | For tracking, reference |
-| text | Sentence text content | String | Required | For quote detection |
-| file | Corresponding document filename | String | Required | For tracking, reference, metadata linking |
-| sent_index | Sentence-level index within document (0-based indexing) | Integer | Required | For identifying sequential sentences for quote compilation |
-| section_type | What text section the sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging |
-| multi_page | Indicates whether the sentence spans multiple pages. | Boolean | Required | For reference and debugging |
-
-The sentence corpora produced by the Sentence Corpus Builder program must include one or more fields that link the sentence back to its location within the corresponding input document. This will vary by the input format (e.g., MEGAdigital: page and line number; plaintext: starting character index).
-
-#### Quote Sentence Pairs
-
-A CSV file with each row corresponding to an original-reuse sentence pair that has been identified as a quote.
-
-| **Field Name** | **Description** | **Type** | **Required / Optional** | **Reason** |
-| :------------------ | :-------------------------------------------- | :------- | :---------------------- | :--------------------------------------------------------- |
-| match_score | Some match quality score | Float | Required | For development, evaluation |
-| reuse_id | ID of the reuse text sentence | String | Required | For tracking, reference |
-| reuse_text | Text of the reuse sentence | String | Required | For reference |
-| reuse_file | Reuse document filename | String | Required | For tracking, reference |
-| reuse_sent_index | Sentence-level index within reuse document | Integer | Required | For identifying sequential sentences for quote compilation |
-| original_id | ID of the original sentence | String | Required | For tracking, reference, metadata linking |
-| original_text | Text of the original sentence | String | Required | For reference |
-| original_file | Original document filename | String | Required | For tracking, reference, metadata linking |
-| original_sent_index | Sentence-level index within original document | Integer | Required | For identifying sequential sentences for quote compilation |
-
-Additional metadata for the corresponding reuse and original sentences will be included depending on the contents of the input sentence corpora.
-Each corpus's additional metadata fields will follow its sentence index field (i.e., `reuse_sent_index`, `original_sent_index`).
-
-#### Quotes Corpus
+#### Quote Finder output
A CSV file with each row corresponding to a text passage containing a quotation. The granularity of these passages operate at the sentence level; in other words, a passage corresponds to one or more sentences.
-| **Field Name** | **Description** | **Type** | **Required / Optional** | **Reason** |
-| :-------------------- | :-------------------------------------------------------------------------------------------------------- | :------- | :---------------------- | :-------------------------------------- |
-| id | Unique identifier for quote. Starting sentence id for multi-sentence passages. | String | Required | For tracking, reference |
-| match_score | Some match quality score. For multi-sentence passages this will be the mean of the sentence-level scores. | Float | Required | For development, evaluation, reference. |
-| reuse_id | ID of the reuse sentence | String | Required | For tracking, reference |
-| reuse_doc | Filename of reuse document | String | Required | For tracking, reference |
-| reuse_text | Text of the quote as it appears in reuse document | String | Required | For reference, evaluation |
-| original_id | ID of the original sentence | String | Required | For tracking, reference |
-| original_doc | Filename of original document | String | Required | For tracking, reference |
-| original_text | Text of the quote as it appears in original document | String | Required | For reference, evaluation |
-| original_section_type | What text section the original sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging |
-| reuse_section_type | What text section the reuse sentence comes from (text vs. footnote) | enum | Optional | For reference and debugging |
-
-Any additional metadata included with the input sentences will be passed through and prefixed based on the input corpus (reuse/original).
-
-#### Final Citation Corpus
-
-This CSV file is a tailored version of the Quotes Corpus produced by the core software pipeline where relevant page- and document-level metadata has been added.
-
-| **Field Name** | **Description** | **Type** | **Required / Optional** |
-| :------------- | :----------------------------------------- | :------- | :---------------------- |
-| quote_id | Unique identifier for quote | String | Required |
-| marx_quote | Text of direct quote from Marx text | String | Required |
-| article_quote | Text of direct quote from the article text | String | Required |
-| marx_title | Title of Marx text | String | Required |
-| marx_page | Starting page of quote in Max text | String | Required |
-| article_title | Title of article | String | Required |
-| article_author | Name of article’s author | String | Optional |
-| journal | Title of journal | String | Required |
-| volume | Volume Issue | String | Required |
-| issue | Journal Issue | String | Required |
-| year | Year of article/journal publication | String | Required |
-| certainty | A score (0-1) of how quote certainty | Float | Required |
-
-Note: article and marx naming convention could be altered to something more general. Should have the research team confirm desired form and field names and check if there are any additional fields that should be added (e.g., page vs. footnote).
-
-### Work Plan for Additional Data Work (if needed)
-
-*Describe the data work that needs to be done and include proposed deadlines.*
-
-#### Kapital Texts
-
-The text of Kapital needs to be extracted from the MEGA TEI XML file. To properly handle sentence spanning page boundaries, the main text and footnotes will be split into separate text files.
-
-#### Communist Manifesto Text
-
-The text of the Communist Manifesto may need to be transformed into plaintext files. The scope of this effort is dependent on how the text is being acquired.
-
-#### DNZ Text
+| **Field** | **Description** | **Type** | **Required?** |
+| :-------------------- | :-------------------------------------------------------------------------------------------------------- | :------- | :----------------------- |
+| `match_score` | Match quality score for the sentence pair. For multi-sentence passages this is the average of the scores within the group. | Float | Yes |
+| `reuse_id` | ID of the reuse sentence | String | Yes |
+| `reuse_text` | Text of the quote as it appears in reuse document | String | Yes |
+| `reuse_file` | Filename of reuse document | String | Yes |
+| `reuse_sent_index` | Reuse sentence index | Integer | Yes |
+| `reuse_...` | Other reuse input fields included in input sentence corpus | Any | |
+| `original_id` | ID of the original sentence | String | Yes |
+| `original_text` | Text of the quote as it appears in original document | String | Yes |
+| `original_file` | Filename of original document | String | Yes |
+| `original_sent_index` | Original sentence index | Integer | Yes |
+| `original_...` | Other original input fields included in input sentence corpus | Any | |
+| `num_sentences` | Number of sentences included | Integer | Consolidated quotes only |
-The text of DNZ articles must be extracted from the ALTO-XML transcription files. For simplicity, each article should be separated into separate text files. If the article has footnotes, footnotes will be split into a separate text file.
+Any fields included with the input sentences will be passed through and prefixed based on the input corpus (either `reuse_` or `original_`). All reuse fields precede all original fields.
-#### Evaluation Dataset(s)
+### Notes
-A dataset of identified direct quote annotations from Kapital and/or the Communist Manifesto must be constructed. This will be used as the evaluation reference during development to measure the performance of the software pipeline, and to provide an estimate for the expected scale of the dataset.
+#### Out of Scope
-Derivative versions of the text reuse data will be created as needed to compare passages identified as text reuse with sentence level matches.
+For this phase of the project, the following features are out of scope:
-### Data Management Plan
-
-#### Datasets
-
-*Describe input and output data formats*
-
-| **Name** | **Description of how generated** | **Data type** | **Format** | **Stability** |
-| :----------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------ | :------------ | :--------- | :------------ |
-| Kapital XML | MEGAdigital’s MEGA II/5 digital copy. This text corresponds to the first edition of the first volume of *Das Kapital* (Berlin: Dietz-Verlag 1983) | Input | TEI XML | |
-| Communist Manifesto text | For this phase, we will use content from this [online edition](https://www.marxists.org/deutsch/archiv/marx-engels/1848/manifest/index.htm) | Input | HTML | |
-| DNZ transcriptions | Transcriptions of DNZ volumes with specialized text segment labels. These were created using eScriptorium and YALTAI. | Input | ALTO XML | |
-| Kapital sentences | Corpora of sentences from Kapital. At least two: main text, footnotes. | Input | CSV | |
-| Communist Manifesto sentences | Corpora of sentences from the Communist Manifesto. | Input | CSV | |
-| DNZ article sentences | Corpora of sentences from DNZ articles. Articles may correspond to multiple corpora. | Input | CSV | |
-| Kapital embeddings | Sentence embeddings for Kapital. For development only. | Intermediate | .npy | |
-| Communist Manifesto embeddings | Sentence embeddings for Communist Manifesto. For development. | Intermediate | .npy | |
-| DNZ embeddings | Sentence embeddings for DNZ articles. For development. | Intermediate | .npy | |
-| Quote Sentence Pairs | Identified quote sentence-pairs. Primarily for development. | Intermediate | CSV | |
-| Quotes Corpus | Identified quotes corpus. | Intermediate | CSV | |
-| Final Citation Corpus | The final corpus of identified citations. | Output | CSV | |
-
-#### Access
-
-*Where will data be stored; back up plan?*
-
-**GitHub.** If the data is small enough, a public GitHub repository will be used. This repository would be used to store the final dataset.
-
-**TigerData.** Copies of all datasets will be stored in TigerData. Assuming GitHub is an option, this will be a copy of the GitHub data repo. If the data is too large, TigerData will be the primary data source for development.
-Directories should be named meaningfully and include a brief readme documenting contents and how data was generated or acquired. We will use GitHub issues for light-weight review of data structuring and documentation.
-
-**GoogleDrive.** The initial text inputs will be stored in the project’s GoogleDrive folder. Any intermediate files that should be reviewed by the research team will also be uploaded to this folder.
-
-#### Archiving and Publication
-
-*What is the plan for publication or archiving of the data? What license will be applied?*
-
-The final dataset will be published on Zenodo and/or the Princeton Data Commons depending on the terms of use / copyright status of the data. If possible, an additional copy will be published on GitHub.
-
-Q: What license will be applied for the data?
-
-## Interface Functionality
-
-*Description of user-facing functionality directly accessible through the user interface and its components*
-
-As possible, there will be two types of interfaces for two different types of users.
-
-### Notebook Application
-
-The primary user interface is a marimo notebook. This interface will provide a graphical interface for selecting and uploading input text files and downloading the system’s output corpus file.
-
-### Command Line
-
-Technical users and the development team will have an alternative interface where the software can be run directly from the terminal.
-
-## Testing Plan
-
-### Unit Testing
-
-Unit testing will be used throughout the development process. There must be unit tests for all python source code excluding marimo notebooks. The target code coverage for python source code is 95%; the target coverage for python test code is 100%.
-
-### Staging
-
-During development, della will be used for a staging environment, as needed. This ensures a consistent shared environment as opposed to each developer’s local environment.
-
-### Ongoing Acceptance Testing
-
-Acceptance testing will be conducted throughout development. Each user-facing feature must have a user story and must undergo acceptance testing before an issue is considered complete. Acceptance testing should normally be done on features that have been implemented, unit tested, gone through code review, and merged to the develop branch. Acceptance testing is expected to occur every iteration where possible and otherwise as soon as possible so that software updates can be released each iteration.
-
-Application (i.e., notebook) testing will ideally run on the target environment. We will use [molab](https://molab.marimo.io/notebooks) initially if helpful while getting started with development. Then, depending on resource requirements, testing will shift either to the tester’s local machine or on della via Open OnDemand.
-
-## Development Conventions
-
-### Software Releases
-
-The goal is to complete a software release each iteration by either the project team meeting or by retro. This high frequency of releases will make the changelog more meaningful. The RSEs responsible for creating and reviewing each release will be assigned evenly across the team. Ideally, no RSE will be responsible for two consecutive releases.
-
-The tool [Repo Review](https://learn.scientific-python.org/development/guides/repo-review/) should be checked before making a release.
-
-### Software Publishing
-
-The software’s python source code should be packaged and published on PyPI iteratively. Ideally, with each release. We will add a GitHub workflow to automate publishing, so it will be triggered with each software release.
-
-### Python Development
-
-#### Package Management
-
-UV will be used for python package management. Any configuration settings will be added to the pyproject.toml.
-
-#### Type Annotations
-
-All python source code (excluding notebooks and test code) must use type annotations. This will be enforced by Ruff’s linter.
-
-#### Formatting
-
-We will use Ruff’s formatter to enforce a consistent code format. The formatter will be applied
-
-#### Linting
-
-We will use Ruff’s linter using the following rule sets:
-
-| **Rule(s)** | **Name** | **Reason** |
-| :---------- | :-------------------- | :---------------------------------------------------- |
-| F | pyflakes | Ruff default |
-| E4, E7, E9 | pycodestyle subset | Ruff default; subset compatible with Ruff’s formatter |
-| I | isort | Import sorting / organization |
-| ANN | flake8-annotations | Checks type annotations |
-| PTH | flake8-use-pathlib | Ensure pathlib usage instead of os.path |
-| B | flake8-bugbear | Flags likely bugs and design problems |
-| D | pydocstyle subset | Checks docstrings |
-| PERF | perflint | Checks for some performance anti-patterns |
-| SIM | flake8-simplify | Looks for code simplification |
-| C4 | flake8-comprehensions | Checks comprehension patterns |
-| RUF | ruff-specific rules | Potentially helpful? |
-| NPY | numpy-specific rules | For checking numpy usage |
-| UP | pyupgrade | Automatically update syntax to newer form |
-
-#### Notebook Development
-
-We will use marimo for any notebook development. The application notebook will be included in the source code package and made available as package command-line executable. If needed, analysis notebooks will be organized in a top-level “notebooks” folder. Any meaningful code (i.e., methods) for all notebooks should be located within the python source code *not* the notebook.
-
-Notebooks will be used both as the application interface and as needed for data analysis for development.
-
-The Marimo app interface notebook will be tested if possible; data analysis notebooks will be excluded from code coverage.
-
-### Documentation
-
-Documentation must be updated before a pull request is merged to the development branch. Documentations will be generated using mkdocs. The linter [prettier](https://github.com/prettier/prettier) will be used for linting all documentation markdown files.
-
-We will implement automated documentation checks on pull requests to develop and main branches (documentation coverage, if supported by mkdocs; otherwise, confirm documentation has been updated)
-
-### Precommit Hooks
-
-We will include precommit hooks for the following actions:
-
-- Run Ruff’s linter and formatter
-- Run [prettier](https://github.com/prettier/prettier) linter on markdown files
-- Run [codespell](https://github.com/codespell-project/codespell) to check for typos within any text
-
-### GitHub Actions
-
-Our repository will include the following GitHubActions:
-
-- Check all unit tests pass
-- 100% code coverage for test code
-- 95% coverage for source code
-- Ruff formatting check: to prevent the occasional commit with improper formatting
-- Check mkdoc documentation coverage
-- Check change log has been updated
-- Python package publication on PyPI (triggered by new release on GitHub)
-
-## Final Acceptance Criteria
-
-*Define the requirements the deliverable must meet to be considered complete and accepted. These criteria should be testable.*
-
-### Functionality
-
-- Using evaluation dataset as a reference. The software must identify at least 90% of the expected quotations (i.e., 90% recall).
- - Tolerance for false positives will be determined in conversation with the research team after reviewing preliminary results. If necessary, we will decrease the recall threshold to avoid too many false positives in the final dataset.
-- Faculty collaborator can successfully run software without assistance.
+- Non-German quote detection
+- Cross-lingual quote detection
+- Paraphrase / indirect quote detection
+- Attribution, citation, and named-reference detection
-## Sign Offs
+The tool is built with an eye towards future expansion to support other languages. Preliminary results do include some cross-lingual quotes in the output, but this has not been tested thoroughly, and sentence corpus creation currently assumes German-language input.
-Reviewed by: Jeri Wieringa
+#### Assumptions
-Signatures / Date
+- Assumes input content is written in German (support for other languages will be added in future versions)
+- Is optimized for the DNZ-Marx use case
+- Can be adapted to other monolingual and cross-lingual settings, but with no quality guarantees
+- Compatible with any _German_ original (e.g., other texts written by Marx) and reuse texts (e.g., earlier/later issues of DNZ, other journals)
+- Uses sentence embeddings as a core data representation
+- For sentence embedding construction, uses a pre-existing model available through `sentence-transformers` or `huggingface`
+- Developed and tested on Python 3.12
+- Can be installed and run on researcher machines (via installable python package, with support and documentation for use with `uv` and `uv run`)
+- Is OS and architecture agnostic (compatible with macOS, Linux, and Windows operating systems; runs on both x86 and ARM architectures).
+- Data compilation for a research dataset combining quotes can be done without custom development effort using spreadsheets, no-code database, or similar
## References
-[https://zenodo.org/records/14861082](https://zenodo.org/records/14861082)
+Naydan, Mary, Bennett Nagtegaal, Rebecca Sutton Koeser, and Edward Baring. “CDH Project Charter — Citing Marx 2025.” Center for Digital Humanities at Princeton, February 12, 2025. [https://doi.org/10.5281/zenodo.14861082](https://doi.org/10.5281/zenodo.14861082)