benjaminvdb/master_thesis

Entity Resolution in Unstructured Data

and applications in the analysis of historical documents

by Benjamin van der Burgh

Entity resolution is the process of finding records in one or more datasets that relate to the same entity. In a project that was carried out in collaboration with The National Archives (UK), this problem was studied in the context of historical documents, with the aim of linking together personal information that is scattered throughout different documents.

This master's project studied the challenges of performing entity resolution in this specific context: usually only a name appears in a text, and that name may contain errors introduced during transcription. A context-free grammar was used to extract references to people, together with limited contextual information, from the text as tabular records. A classification algorithm, based on pairwise comparison of the individual fields of the two records in each candidate pair, was used to classify candidate pairs as matching or non-matching. An unsupervised feature extraction procedure, using Maximally k-Informative Itemsets, was used to extract topics from the documents in an attempt to improve the performance of the algorithm.
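
The pairwise comparison step can be illustrated with a minimal sketch. It is not the classifier from the thesis: the field names, the use of `difflib.SequenceMatcher` as the string similarity, and the fixed threshold on the mean score are all assumptions made for illustration. A real system would typically feed the comparison vector into a trained classifier instead.

```python
from difflib import SequenceMatcher

# Hypothetical record layout: these field names are illustrative only.
FIELDS = ("forename", "surname", "occupation")


def field_similarity(a, b):
    """Return a similarity in [0, 1]; 0.0 when either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(rec_a, rec_b):
    """Build the comparison vector: one similarity score per field."""
    return [field_similarity(rec_a.get(f), rec_b.get(f)) for f in FIELDS]


def is_match(rec_a, rec_b, threshold=0.8):
    """Stand-in classifier (assumption): mean field similarity vs. a threshold."""
    vector = compare(rec_a, rec_b)
    return sum(vector) / len(vector) >= threshold


# Two spellings of the same person (transcription errors) and an unrelated record.
a = {"forename": "William", "surname": "Marshall", "occupation": "clerk"}
b = {"forename": "Wiliam", "surname": "Marshal", "occupation": "clerk"}
c = {"forename": "Roger", "surname": "Bigod", "occupation": "earl"}

print(is_match(a, b))
print(is_match(a, c))
```

Using an approximate string similarity rather than exact equality is what lets the comparison tolerate the transcription errors mentioned above.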

Download

A copy of my thesis can be downloaded from here.

License

This work is licensed under a Creative Commons "Attribution-ShareAlike 4.0 International" license.
