SemanticComputing/sf-converter

Semantic Finlex Converter

  • Converts Finlex legacy XML formats to ELI- and ECLI-compliant RDF datasets, which are published in the Semantic Finlex Linked Data service
  • Implemented using Scala and SBT

Run the conversion

  1. Configure environment variables in .env
  2. Define the URL from which to download the source data and write it to the configuration file:
    SF_SOURCE_DATA_URL=http://???
    echo "sourcedata.url=\"$SF_SOURCE_DATA_URL\"" > src/main/resources/sourcedata.url.conf
  3. Create the directory structure for source and converted data
    docker-compose run sf-converter makeDataDir
  4. Download source data
    docker-compose run sf-converter downloadSourceData
  5. Run tests
    docker-compose run sf-converter test
  6. Convert datasets
    docker-compose run sf-converter "run --dataset kko-keywords-fi"
    docker-compose run sf-converter "run --dataset kko-keywords-sv"
    docker-compose run sf-converter "run --dataset kho-keywords-fi"
    docker-compose run sf-converter "run --dataset kho-keywords-sv"
    docker-compose run sf-converter "run --dataset finlex-keywords"
    docker-compose run sf-converter "run --dataset schema"
    docker-compose run sf-converter "run --dataset kko"
    docker-compose run sf-converter "run --dataset kho"
    docker-compose run sf-converter "run --dataset asd"
    docker-compose run sf-converter "run --dataset sd"
  7. Load converted data into TDB and the Lucene text index
    docker-compose run sf-converter loadDataToTripleStore
  8. Publish converted data in the service
    docker-compose run sf-converter publishData
  9. To deploy the created database, mount $SF_DATA_DIR/www/jena to sf-www.
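
The ten conversion commands in step 6 differ only in the dataset name, so they can be wrapped in a loop. The following is a minimal sketch, not part of the repository: the function name `convert_all` and the `DRY_RUN` switch are hypothetical, and it assumes the datasets should be converted in the order listed above.

```shell
#!/bin/sh
# Dataset names copied from step 6, in the order they are listed there.
DATASETS="kko-keywords-fi kko-keywords-sv kho-keywords-fi kho-keywords-sv finlex-keywords schema kko kho asd sd"

# convert_all: run the converter once per dataset (hypothetical helper).
# With DRY_RUN=1 (a switch invented for this sketch) the commands are
# only printed, which is useful for checking the order before a long run.
convert_all() {
  for dataset in $DATASETS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'docker-compose run sf-converter "run --dataset %s"\n' "$dataset"
    else
      # Stop at the first failed conversion instead of continuing silently.
      docker-compose run sf-converter "run --dataset $dataset" || return 1
    fi
  done
}
```

After sourcing the file, `DRY_RUN=1 convert_all` prints the commands without running them, and plain `convert_all` runs the conversions and stops at the first one that fails.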

Publish only source data

Publish only the source XML files on the Semantic Finlex WWW service.

docker-compose run sf-converter publishData

@TODO

  • validation using ShEx
  • fix failed conversions:
    • KKO
      • invalid year (sv)
        • kko19840136t.xml
      • empty dcterms:date (fi, sv)
        • kko20010079.xml
        • kko20010079.xml
      • empty dcterms:issued (fi, sv)
        • kko20040110.xml
      • invalid dcterms:issued date format (sv)
        • kko19810091t.xml
        • kko19800113t.xml
      • invalid dcterms:date value (sv)
        • kko20010066.xml
        • kko20010071.xml
        • kko20010060.xml
        • kko19840211t.xml
        • kko19830120t.xml
    • KHO
      • missing ECLI identifier (fi)
        • kho201301846.xml
        • kho199200535.xml
      • missing ECLI identifier (sv)
        • kho200200005.xml
      • duplicate tags (fi)
        • kho198801741.xml
        • kho198800689.xml
        • kho198802673.xml
        • kho198804364.xml
        • kho198802179.xml
        • kho198700912.xml
        • kho198700025.xml
        • kho198702036.xml
        • kho198700967.xml
        • kho198702037.xml
        • kho198700884.xml
        • kho198705900.xml
        • kho198700104.xml
        • kho198700840.xml
        • kho198902937.xml
        • kho198903354.xml
        • kho198900833.xml
        • kho198902044.xml
        • kho199301126.xml
        • kho199300351.xml
        • kho199400880.xml
        • kho199402211.xml
        • kho199400679.xml
        • kho199503219.xml
        • kho199504128.xml
        • kho199503280.xml
        • kho199200535.xml
        • kho199002531.xml
        • kho199002183.xml
        • kho199003720.xml
        • kho199100533.xml
        • kho199603571.xml
    • sd/2016-02-02-010101/sv/2006/fs20060837
    • sd/2016-02-02-010101/sv/fs/2006/fs20060837
    • fs19360189
    • fs19360189
    • fs19840514
    • fs19840514
    • asd/2016-02-02-010101/sv/2006/asd20061412.xml
    • asd/2016-02-02-010101/fi/1994/asd19940635.xml
    • asd/2016-02-02-010101/fi/1991/asd19911526.xml
    • asd/2016-02-02-010101/fi/1991/asd19911594.xml
  • check handling of the following XML tags
    • sv, y, a, table, unref, ev, i, vt, pdf, mo, thead, spanspec, mjl, viite, nu, ete, oikte, oikhuom, vanha, uusi, oikaisu, pl, title, AlaindeksiTeksti, YlaindeksiTeksti, KuvaAsemointi, Kuva, LisatietoTeksti, ko, ValiotsikkoTeksti, UusiNimeke, Muu5Viite, Tyhja, Muu4Viite, Muu3Viite, Muu2Viite, Muu1Viite, AsiakirjaViitteet, LiiteOsa, tbody, colspec, tgroup, Allekirjoittaja, Pykalisto, SaadosOsa, NimekeTeksti, Nimeke, AsiakirjatyyppiKoodi, EduskuntaTunniste, AsiakirjatyyppiNimi, #PCDATA
