Code and metadata used to build the Hungarian corpus in the ParlaMint II project of CLARIN ERIC.
The corpus itself will be made publicly available through concordancers after the ParlaMint project closes in March 2023. A sample of the data is already available on GitHub.
The code here is not presented as a pipeline for creating parliamentary corpora, because many manual corrections and validations were needed to ensure the quality of the corpus. We simply intend to make our code publicly available.
The Metadata folder contains all the metadata we collected about the speakers, the organisations, and the files themselves for the period covered by the project and our corpus (May 2014 – June 2022).
The Bin folder contains the Python scripts that were used to create the TEI XML version of the texts (following the Parla-CLARIN recommendations) from the raw txt files gathered from the website of the Hungarian National Assembly.
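
To give a rough idea of what a txt-to-TEI conversion step involves, here is a minimal sketch. It is not taken from the scripts in the Bin folder. It assumes that each speech is encoded as a TEI `u` (utterance) element with one `seg` per paragraph, and that paragraphs in the raw text are separated by blank lines. The function name `speech_to_u`, the identifiers, and the sample text are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialise TEI elements without a prefix (TEI as the default namespace).
ET.register_namespace("", TEI_NS)


def speech_to_u(speaker_id, utterance_id, raw_text):
    """Wrap one raw speech in a TEI <u> element (hypothetical helper).

    Assumption: paragraphs in raw_text are separated by blank lines;
    each non-empty paragraph becomes one <seg> child.
    """
    u = ET.Element(
        f"{{{TEI_NS}}}u",
        {"who": f"#{speaker_id}", f"{{{XML_NS}}}id": utterance_id},
    )
    # Normalise whitespace inside each paragraph and drop empty ones.
    paragraphs = [" ".join(p.split()) for p in raw_text.split("\n\n")]
    paragraphs = [p for p in paragraphs if p]
    for n, paragraph in enumerate(paragraphs, start=1):
        seg = ET.SubElement(
            u,
            f"{{{TEI_NS}}}seg",
            {f"{{{XML_NS}}}id": f"{utterance_id}.seg{n}"},
        )
        seg.text = paragraph
    return u


# Invented sample input, not taken from the corpus.
sample = "Honourable Assembly!\n\nThank you,   Mr Speaker."
u = speech_to_u("SpeakerA", "u1", sample)
print(ET.tostring(u, encoding="unicode"))
```

A real conversion also needs the corpus header, speaker and organisation metadata, and the manual corrections mentioned above, none of which this sketch attempts.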