- Converts Finlex legacy XML formats to ELI- and ECLI-compliant RDF datasets published in the Semantic Finlex Linked Data service
- Implemented using Scala and SBT
- Configure environment variables in .env
- Define the URL to download source data from and write it to the configuration file:
SF_SOURCE_DATA_URL=http://???
echo "sourcedata.url=\"$SF_SOURCE_DATA_URL\"" > src/main/resources/sourcedata.url.conf
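If the variable is unset, the echo command above still writes a configuration file, only with an empty URL. A minimal sketch of a guard against that follows. The function name `write_source_url_conf` and its optional second argument are hypothetical and not part of the project; the output line has the same form as the one produced by the echo command.

```shell
#!/bin/sh
# write_source_url_conf URL [FILE]
# Hypothetical helper (not part of the project): refuses an empty URL,
# then writes a line of the form sourcedata.url="URL" to FILE.
write_source_url_conf() {
  url="$1"
  # Default target is the configuration file named in the step above.
  conf="${2:-src/main/resources/sourcedata.url.conf}"
  if [ -z "$url" ]; then
    echo "SF_SOURCE_DATA_URL is empty; not writing $conf" >&2
    return 1
  fi
  printf 'sourcedata.url="%s"\n' "$url" > "$conf"
}
```

It could be called as `write_source_url_conf "$SF_SOURCE_DATA_URL"` once the variable has been set.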
- Make source and converted data directory structure
docker-compose run sf-converter makeDataDir
- Download source data
docker-compose run sf-converter downloadSourceData
- Run tests
docker-compose run sf-converter test
- Convert datasets
docker-compose run sf-converter "run --dataset kko-keywords-fi"
docker-compose run sf-converter "run --dataset kko-keywords-sv"
docker-compose run sf-converter "run --dataset kho-keywords-fi"
docker-compose run sf-converter "run --dataset kho-keywords-sv"
docker-compose run sf-converter "run --dataset finlex-keywords"
docker-compose run sf-converter "run --dataset schema"
docker-compose run sf-converter "run --dataset kko"
docker-compose run sf-converter "run --dataset kho"
docker-compose run sf-converter "run --dataset asd"
docker-compose run sf-converter "run --dataset sd"
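The conversion commands differ only in the dataset name, so they can also be run from a loop. The following is a sketch, not a script shipped with the project: `convert_all` and the `RUNNER` override are hypothetical names introduced here, and the dataset names and their order are copied from the commands above. The loop stops at the first failing conversion.

```shell
#!/bin/sh
# Dataset names in the same order as the individual commands.
DATASETS="kko-keywords-fi kko-keywords-sv kho-keywords-fi kho-keywords-sv finlex-keywords schema kko kho asd sd"

# RUNNER is a hypothetical override for dry runs, e.g. RUNNER=echo.
RUNNER="${RUNNER:-docker-compose run sf-converter}"

convert_all() {
  for dataset in $DATASETS; do
    echo "Converting $dataset" >&2
    # $RUNNER is left unquoted on purpose so it splits into command and arguments.
    $RUNNER "run --dataset $dataset" || return 1
  done
}
```

Calling `convert_all` runs the ten conversions in sequence. With `RUNNER=echo` it only prints the argument that would be passed to the converter for each dataset.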
- Load converted data into TDB and the Lucene text index
docker-compose run sf-converter loadDataToTripleStore
- Publish converted data in the service
docker-compose run sf-converter publishData
- To deploy the created database, mount $SF_DATA_DIR/www/jena to sf-www.
- Publish only the source XML files on the Semantic Finlex WWW service
docker-compose run sf-converter publishData
- Validation using ShEx
- Fix failed conversions:
- KKO
- invalid year (sv)
- kko19840136t.xml
- empty dcterms:date (fi, sv)
- kko20010079.xml
- kko20010079.xml
- empty dcterms:issued (fi, sv)
- kko20040110.xml
- invalid dcterms:issued date format (sv)
- kko19810091t.xml
- kko19800113t.xml
- invalid dcterms:date value (sv)
- kko20010066.xml
- kko20010071.xml
- kko20010060.xml
- kko19840211t.xml
- kko19830120t.xml
- invalid year (sv)
- KHO
- missing ECLI identifier (fi)
- kho201301846.xml
- kho199200535.xml
- missing ECLI identifier (sv)
- kho200200005.xml
- duplicate tags (fi)
- kho198801741.xml
- kho198800689.xml
- kho198802673.xml
- kho198804364.xml
- kho198802179.xml
- kho198700912.xml
- kho198700025.xml
- kho198702036.xml
- kho198700967.xml
- kho198702037.xml
- kho198700884.xml
- kho198705900.xml
- kho198700104.xml
- kho198700840.xml
- kho198902937.xml
- kho198903354.xml
- kho198900833.xml
- kho198902044.xml
- kho199301126.xml
- kho199300351.xml
- kho199400880.xml
- kho199402211.xml
- kho199400679.xml
- kho199503219.xml
- kho199504128.xml
- kho199503280.xml
- kho199200535.xml
- kho199002531.xml
- kho199002183.xml
- kho199003720.xml
- kho199100533.xml
- kho199603571.xml
- missing ECLI identifier (fi)
- sd/2016-02-02-010101/sv/2006/fs20060837
- sd/2016-02-02-010101/sv/fs/2006/fs20060837
- fs19360189
- fs19360189
- fs19840514
- fs19840514
- asd/2016-02-02-010101/sv/2006/asd20061412.xml
- asd/2016-02-02-010101/fi/1994/asd19940635.xml
- asd/2016-02-02-010101/fi/1991/asd19911526.xml
- asd/2016-02-02-010101/fi/1991/asd19911594.xml
- KKO
- check handling of the following XML tags
- sv, y, a, table, unref, ev, i, vt, pdf, mo, thead, spanspec, mjl, viite, nu, ete, oikte, oikhuom, vanha, uusi, oikaisu, pl, title, AlaindeksiTeksti, YlaindeksiTeksti, KuvaAsemointi, Kuva, LisatietoTeksti, ko, ValiotsikkoTeksti, UusiNimeke, Muu5Viite, Tyhja, Muu4Viite, Muu3Viite, Muu2Viite, Muu1Viite, AsiakirjaViitteet, LiiteOsa, tbody, colspec, tgroup, Allekirjoittaja, Pykalisto, SaadosOsa, NimekeTeksti, Nimeke, AsiakirjatyyppiKoodi, EduskuntaTunniste, AsiakirjatyyppiNimi, #PCDATA
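One way to start on the tag check is to find source files that actually contain a given tag, so they can serve as test input. The helper below is a rough sketch under assumptions: `files_with_tag` is a hypothetical name, the match is a plain text search rather than XML parsing, and `grep -r` with `--include` is a GNU/BSD extension rather than POSIX. `#PCDATA` in the list above is not an element name and cannot be searched this way.

```shell
#!/bin/sh
# files_with_tag TAG DIR
# Hypothetical helper: prints the *.xml files under DIR that contain
# an opening <TAG>, <TAG ...> or <TAG/> element (text match only).
files_with_tag() {
  tag="$1"
  dir="$2"
  # Require whitespace, "/", ">" or end of line after the name,
  # so that searching for "i" does not match "<item>".
  grep -rlE --include='*.xml' "<${tag}([[:space:]/>]|\$)" "$dir"
}
```

For example, `files_with_tag spanspec path/to/source-xml` would list candidate files for the `spanspec` tag, where `path/to/source-xml` is a placeholder for the downloaded source data directory.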