Extracts articles from Wikipedia database dumps for embedding into the mwm
map files created by the Organic Maps generator.
Extracted articles are identified by Wikipedia article titles in URL or text form (language-specific), and by Wikidata QIDs (language-agnostic).
OpenStreetMap (OSM) commonly stores these as wikipedia*= and wikidata= tags on objects.
article_processing_config.json is compiled with the program and should be updated when adding a new language.
It defines article sections that are not important for users and should be removed from the extracted HTML.
There are some tests for basic validation of the file; run them with cargo test.
Note
In production, wikiparser is run with the maps generator, which is somewhat involved to set up. See Usage with Maps Generator for more info.
To run the wikiparser for development and testing, see below.
First, install the Rust language tools.
Important
For best performance, use -r/--release with cargo build/run.
You can run the program from within this directory using cargo run --release --.
Alternatively, build it with cargo build --release, which places the binary in ./target/release/om-wikiparser.
Run the program with the --help flag to see all supported arguments.
$ cargo run -- --help
A set of tools to extract articles from Wikipedia Enterprise HTML dumps selected by OpenStreetMap tags.

Usage: om-wikiparser <COMMAND>

Commands:
  get-tags      Extract wikidata/wikipedia tags from an OpenStreetMap PBF dump
  check-tags    Attempt to parse extracted OSM tags and write errors to stdout in TSV format
  get-articles  Extract, filter, and simplify article HTML from Wikipedia Enterprise HTML dumps
  simplify      Apply html simplification to a single article
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help
          Print help (see a summary with '-h')
  -V, --version
          Print version
Note
Each subcommand has additional help.
The main work is done in the get-articles subcommand.
It takes as inputs:
- A Wikipedia Enterprise JSON dump, decompressed and connected to stdin.
- A directory to write the extracted articles to, as a CLI argument.
- Any number of filters for the articles:
  - Use --osm-tags if you have an OSM .pbf file and can use the get-tags subcommand or the osmconvert tool.
  - Use --wikidata-qids or --wikipedia-urls if you have a group of URLs or QIDs from another source.
To test a single language in a specific map region, first get the matching tags for the region with get-tags:
cargo run -r -- get-tags $REGION_EXTRACT.pbf > region-tags.tsv
Then write the articles to a directory with get-articles:
tar xzOf $dump | cargo run -r -- get-articles --osm-tags region-tags.tsv $OUTPUT_DIR
To extract Wikipedia tags with the get-tags subcommand, you need a file in the OSM .pbf format.
The "planet" file is available directly from OSM but is ~80GB in size; for testing you can try a smaller region's data (called "Extracts") from one of the many providers.
Enterprise HTML dumps, updated twice a month, are publicly accessible.
Warning
Each language's dump is tens of gigabytes in size, and much larger when decompressed. To avoid storing the decompressed data, pipe it directly into the wikiparser as described in Usage.
To test a small number of articles, you can also use the On-Demand API to download them, which has a free tier.
Wikimedia requests no more than 2 concurrent downloads, which the included download.sh script respects:
If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2. This will help to ensure that everyone can access the files with reasonable download times. Clients that try to evade these limits may be blocked. Our mirror sites do not have this cap.
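When fetching dump files by hand instead of through download.sh, the same limit can be kept with xargs. This is only a sketch of one way to do it, not a description of how download.sh works. The names download_all and DOWNLOAD_CMD are hypothetical, and the input is assumed to be a list with one URL per line.

```shell
# Hypothetical helper: read URLs from stdin and fetch them with
# at most 2 downloads running at the same time.
# DOWNLOAD_CMD is an assumed override, useful for dry runs (DOWNLOAD_CMD=echo).
download_all() {
    # Left unquoted on purpose so the command and its flags split into words.
    xargs -n 1 -P 2 ${DOWNLOAD_CMD:-wget --continue --no-verbose}
}
```

It would be called as download_all < urls.txt, where urls.txt is a file you prepare yourself.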
See the list of available mirrors for other options. Note that most of them do not include the Enterprise dumps; check that the other/enterprise_html/runs/ path includes subdirectories with files. The following two mirrors are known to include the Enterprise HTML dumps as of August 2023:
For the wikiparser you'll want the "NS0" "ENTERPRISE-HTML" .json.tar.gz files.
They are gzipped tar files containing a single file of newline-delimited JSON matching the Wikimedia Enterprise API schema.
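Because each archive holds a single newline-delimited JSON file, it can be inspected as a stream without unpacking it to disk. The following is a minimal sketch; count_articles and first_article are hypothetical helper names and are not part of this repository.

```shell
# Hypothetical helpers, not part of om-wikiparser.

# Print the number of newline-terminated lines (one article each) in a dump.
count_articles() {
    tar xzOf "$1" | wc -l | tr -d ' '
}

# Print the first article's JSON object, e.g. to pipe into a JSON viewer.
first_article() {
    tar xzOf "$1" | head -n 1
}
```

Note that count_articles has to decompress the whole archive, so on a full language dump it can take a long time.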
The included download.sh script handles downloading the latest set of dumps in specific languages.
It maintains a directory with the following layout:
<DUMP_DIR>/
├── latest -> 20230701/
├── 20230701/
│ ├── dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│ ├── enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│ ├── eswiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│ ...
├── 20230620/
│ ├── dewiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│ ├── enwiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│ ├── eswiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│ ...
...
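Given this layout, the language of a dump can be recovered from its file name. A small sketch, assuming the file naming shown in the tree; dump_lang and list_latest_langs are hypothetical helpers, not something download.sh provides, and DUMP_DIR is assumed to point at the directory described here.

```shell
# Hypothetical helper: print the language code of a dump file,
# e.g. "de" for .../dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
dump_lang() {
    basename "$1" | sed 's/wiki-NS0-.*$//'
}

# Hypothetical helper: list the languages present in the most recent download.
list_latest_langs() {
    for dump in "$DUMP_DIR"/latest/*-ENTERPRISE-HTML.json.tar.gz; do
        [ -e "$dump" ] || continue   # skip the unexpanded glob if nothing matches
        dump_lang "$dump"
    done
}
```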
To use with the maps generator, see the run.sh script and its own help documentation.
It handles extracting the tags, using multiple dumps, and re-running to convert titles to QIDs and extract them across languages.
As an example of manual usage with the maps generator:
- Make sure this program is installed on $PATH as om-wikiparser.
- Download the dumps in the desired languages (use the files with the format ${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz). Set DUMP_DOWNLOAD_DIR to the directory they are downloaded to.
- Run a maps build with descriptions enabled to generate the id_to_wikidata.csv and wiki_urls.txt files.
- Run the following from within the intermediate_data subdirectory of the maps build directory:
# Transform intermediate files from generator.
cut -f 2 id_to_wikidata.csv > wikidata_qids.txt
tail -n +2 wiki_urls.txt | cut -f 3 > wikipedia_urls.txt
# Enable backtraces in errors and panics.
export RUST_BACKTRACE=1
# Set log level to debug.
export RUST_LOG=om_wikiparser=debug
# Begin extraction.
for dump in "$DUMP_DOWNLOAD_DIR"/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf "$dump" | om-wikiparser get-articles \
    --wikidata-qids wikidata_qids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    --write-new-qids new_qids.txt \
    descriptions/
done
# Second pass: extract the QIDs discovered during the first pass.
for dump in "$DUMP_DOWNLOAD_DIR"/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf "$dump" | om-wikiparser get-articles \
    --wikidata-qids new_qids.txt \
    descriptions/
done
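Before starting a long extraction it can be worth checking that a generated QID list looks sane. The sketch below assumes the file holds one QID per line, written as the letter Q followed by digits; count_malformed_qids is a hypothetical helper, and this check may be stricter than what the program itself accepts. The check-tags subcommand remains the tool's own way to validate extracted OSM tags.

```shell
# Hypothetical helper: print how many lines of a file are not
# a bare Wikidata QID such as Q42. Empty lines count as malformed.
count_malformed_qids() {
    # grep exits non-zero when the count is 0, so ignore its status.
    grep -vcE '^Q[0-9]+$' "$1" || true
}
```

For example, count_malformed_qids wikidata_qids.txt printing anything other than 0 suggests the cut step picked the wrong column or the input contained a header line.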
Alternatively, extract the tags directly from a .osm.pbf file (referenced here as planet-latest.osm.pbf):
# Extract tags.
om-wikiparser get-tags planet-latest.osm.pbf > osm_tags.tsv
# Begin extraction.
for dump in "$DUMP_DOWNLOAD_DIR"/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf "$dump" | om-wikiparser get-articles \
    --osm-tags osm_tags.tsv \
    --write-new-qids new_qids.txt \
    descriptions/
done
# Second pass: extract the QIDs discovered during the first pass.
for dump in "$DUMP_DOWNLOAD_DIR"/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf "$dump" | om-wikiparser get-articles \
    --wikidata-qids new_qids.txt \
    descriptions/
done