
The purpose of this code is to provide consistent Marc record parsing for deduplication, in order to compare how humans, a machine learning deduplication algorithm, and an implementation of the GoldRush algorithm deduplicate Marc records.

The intention is that the output of the current MarcRecord methods be human-readable and used for the machine learning deduplication algorithm, and the GoldRush methods be used to build a string for literal matching.

The implementation of the GoldRush algorithm is based on the Colorado Alliance MARC record match key generation, documented January 12, 2024.

Decisions

This application will provide two layers of normalization.

First layer of normalization - humans and machine learning algorithm

The first layer of normalization consists of selecting a subset of Marc fields and subfields for human and machine learning algorithm comparison.

This will include showing fields in the vernacular script when available. Since not everyone is familiar with different scripts, these fields will be presented with both the transliterated information and the vernacular script. The vernacular script is more likely to be accurately matched by both the machine learning algorithm and humans who are familiar with that script; the transliterated form is more likely to be accurately matched by humans who are not familiar with the vernacular script.
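As a rough illustration of the paired display described above, the sketch below renders a transliterated value next to its vernacular counterpart when one exists. In MARC, vernacular forms are typically carried in linked 880 fields; here the pairing is assumed to have been done already. `DisplayField` and `render` are hypothetical names for illustration and are not taken from this repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisplayField:
    """One field as shown to a human reviewer (hypothetical structure)."""

    label: str
    transliterated: str
    vernacular: Optional[str] = None  # None when the record has no parallel script


def render(field: DisplayField) -> str:
    """Show the vernacular form next to the transliteration when both exist."""
    if field.vernacular:
        return f"{field.label}: {field.transliterated} = {field.vernacular}"
    return f"{field.label}: {field.transliterated}"


title = DisplayField("title", "Voĭna i mir", "Война и мир")
print(render(title))
print(render(DisplayField("title", "War and peace")))
```

Keeping both forms in one line lets a reviewer who cannot read the vernacular script still compare the transliteration, while the machine learning comparison can use the vernacular string.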

Second layer of normalization - GoldRush algorithm

The second layer of normalization will be built on the first layer of normalization, and will be an interpretation of the GoldRush algorithm, intended for exact string matching.
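To make "exact string matching" concrete, here is a minimal sketch of grouping records whose key strings are identical. The record dictionaries, `group_by_key`, and `toy_key` are assumptions made for this example; the real match key follows the Colorado Alliance documentation and is considerably more detailed.

```python
from collections import defaultdict


def group_by_key(records, key_func):
    """Group records by their match key; keep only keys shared by 2+ records."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record)
    return {key: members for key, members in groups.items() if len(members) > 1}


def toy_key(record):
    """A deliberately tiny stand-in for a real match key (hypothetical)."""
    return record["title"].lower() + "|" + record["year"]


records = [
    {"id": "a1", "title": "War and Peace", "year": "1869"},
    {"id": "b7", "title": "war and peace", "year": "1869"},
    {"id": "c3", "title": "Anna Karenina", "year": "1878"},
]

duplicates = group_by_key(records, toy_key)
for key, members in duplicates.items():
    print(key, [member["id"] for member in members])
```

Because the comparison is literal, any difference that survives normalization (a stray space, a different diacritic) puts two records in different groups, which is why this layer normalizes strings so strictly.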

To this end, there will be much more strict string normalization in this layer. Only vernacular versions of fields will be preserved.

Some normalization strongly favors English-language texts - e.g.
- Replacing English-language articles at the beginnings of titles
  - This also seems like it duplicates the 245 second indicator for non-filing characters
- Replacing '&' with 'and'
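The English-language bias noted above can be seen in a small sketch. It assumes leading articles are dropped and that punctuation and case are flattened; these steps are an illustration, not the repository's implementation or the full GoldRush specification. All function names are hypothetical. The `strip_nonfiling` helper shows the language-neutral alternative that the 245 second indicator offers.

```python
import re
import unicodedata

ENGLISH_ARTICLES = ("the ", "an ", "a ")


def strip_nonfiling(title: str, second_indicator: str) -> str:
    """Drop the leading characters counted by the 245 second indicator."""
    skip = int(second_indicator) if second_indicator.isdigit() else 0
    return title[skip:]


def strip_english_article(title: str) -> str:
    """Drop a leading English article; articles in other languages are untouched."""
    lowered = title.lower()
    for article in ENGLISH_ARTICLES:
        if lowered.startswith(article):
            return title[len(article):]
    return title


def normalize_for_matching(title: str) -> str:
    """Apply the strict, English-leaning normalization sketched in the list above."""
    text = strip_english_article(title)
    text = text.replace("&", " and ")
    text = unicodedata.normalize("NFKC", text).lower()
    # Remove punctuation but keep letters and digits from any script.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace left behind by the earlier steps.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_for_matching("The Cat & the Hat!"))
print(strip_english_article("Der Zauberberg"))
print(strip_nonfiling("Der Zauberberg", "4"))
```

The article list only helps English titles: "Der Zauberberg" keeps its article under `strip_english_article`, while the indicator-based helper removes it regardless of language, provided the cataloger set the indicator.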

Using the code

Set up the environment, as described below
Call the main.py Python script with arguments for the two MarcXML files you want to compare. file1 and file2 are required; dir is optional.

python main.py --file1="tests/alma_marc_records_short.xml" --file2="tests/alma_marc_records.xml" --dir="experiments_files_and_output"

If you do not already have settings and training data, the script opens an interactive session in your terminal that asks whether you, as a human, think two records are duplicates, in order to train the machine learning algorithm. Follow the instructions in your terminal.
The script outputs a CSV of all the records you input, with three added columns:
  a. Cluster ID - all records that the algorithm thinks are matches of each other have the same Cluster ID. If a record does not have a Cluster ID, the machine learning algorithm does not think it has any duplicates.
  b. Link score - how confident the algorithm is that the record belongs to its cluster. The higher the number, the more likely the record is a true match.
  c. source file - which file the displayed record is from.
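Once the CSV exists, the clusters can be inspected with the standard library. The sketch below is a minimal example that assumes the added columns are literally named "Cluster ID", "Link score", and "source file", and that records without duplicates have an empty Cluster ID; check the header of your own output file, since these names and the sample data are assumptions. `clusters_from_csv` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up sample standing in for the script's output file.
SAMPLE = """id,title,Cluster ID,Link score,source file
1,War and peace,0,0.98,file1.xml
2,War and peace,0,0.97,file2.xml
3,Anna Karenina,,,file1.xml
"""


def clusters_from_csv(handle, min_score=0.0):
    """Group rows by Cluster ID, skipping rows the algorithm left unclustered."""
    clusters = defaultdict(list)
    for row in csv.DictReader(handle):
        cluster_id = row["Cluster ID"]
        if not cluster_id:
            continue  # no Cluster ID means no suspected duplicates
        if float(row["Link score"]) >= min_score:
            clusters[cluster_id].append(row)
    return dict(clusters)


clusters = clusters_from_csv(io.StringIO(SAMPLE))
for cluster_id, rows in clusters.items():
    print(cluster_id, [(row["id"], row["source file"]) for row in rows])
```

To read a real output file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample. Raising `min_score` keeps only the records the algorithm is most confident about.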

Developing this application

Set-up and install dependencies

Make a .venv

python3 -m venv .venv

Activate the environment

. .venv/bin/activate

Install dependencies

pip install -r requirements/[environment].txt

pip install -r requirements/development.txt

OR

pip install -r requirements/common.txt

Testing

pytest

Linting

ruff - fast

Formatter - the --check flag reports formatting problems without making changes. Run without the --check flag to fix them automatically.

ruff format . --check

Linter

ruff check .

pylint - slower, does more in-depth checks

Documentation checks (C0114, C0115, C0116: missing module, class, and function docstrings) are currently disabled. Remove these disables once the missing docstrings have been added.

pylint src tests main.py --disable=C0114,C0115,C0116

pulibrary/pymarc_dedupe

Folders and files

Latest commit

History

Repository files navigation

Decisions

First layer of normalization - humans and machine learning algorithm

Second layer of normalization - GoldRush algorithm

Using the code

Developing this application

Set-up and install dependencies

Testing

Linting

About

Resources

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages