Entity Resolution in Unstructured Data

and applications in the analysis of historical documents

by Benjamin van der Burgh

Entity resolution is the process of finding records in one or more datasets that refer to the same entity. This project, carried out in collaboration with The National Archives (UK), studied the problem in the context of historical documents, with the aim of linking personal information that is scattered across different documents.

This master's project studied the challenges of performing entity resolution in this specific context: usually only a name appears in a text, and that name may contain errors introduced during transcription. A context-free grammar was used to extract references to people, along with limited contextual information, from the text as tabular records. A classification algorithm, based on pairwise comparison of the individual fields of the two records in a candidate pair, was used to classify candidate pairs as matching or non-matching. An unsupervised feature extraction procedure, using Maximally k-Informative Itemsets, was used to extract topics from the documents in an attempt to improve the performance of the algorithm.
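To make the pairwise comparison step more concrete, here is a minimal Python sketch. It is not the implementation from the thesis: the field names (`first_name`, `last_name`, `place`), the use of `difflib.SequenceMatcher` as the string similarity, and the fixed threshold that stands in for a trained classifier are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field names; the record schema used in the thesis is not given here.
FIELDS = ("first_name", "last_name", "place")


def field_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1]; a missing value on either side scores 0.0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comparison_vector(rec_a: dict, rec_b: dict) -> list[float]:
    """Compare two records field by field, giving one similarity per field."""
    return [field_similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in FIELDS]


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Stand-in for a trained classifier: mean field similarity against a threshold."""
    vector = comparison_vector(rec_a, rec_b)
    return sum(vector) / len(vector) >= threshold
```

Comparing fuzzily per field lets a candidate pair still be recognised when a name carries a small transcription error, for example `Smith` against `Smyth`. In practice the comparison vector would be passed to a classifier trained on labelled pairs rather than averaged against a hand-picked threshold.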

Download

A copy of my thesis can be downloaded from here.

License

This work is licensed under a Creative Commons "Attribution-ShareAlike 4.0 International" license.

Folders and files

algorithms
documents
endmatter
fonts
front
frontmatter
graphs
images
plots
tables
xml
.gitignore
1_introduction.tex
2_record_linker.tex
3_feature_extraction.tex
4_experiments.tex
5_conclusions.tex
LICENSE.txt
README.md
beamerthemeuleiden.sty
main.pdf
main.tex
mybib.bib
mystyle.sty
pgflibrarypgfplots.colorbrewer.code.tex
presentation.pdf
presentation.tex
tikzlibrarycolorbrewer.code.tex

benjaminvdb/master_thesis
