This is a tool for processing Wikipedia articles and extracting important information from them as plaintext.
- Run `./download.sh` to download all of Wikipedia as a single XML file. This will likely take a long time.
- Install Rust on your system via these instructions.
- Install `just` with `cargo install just`.
- Run `just extract-links` to create the graph of Wikipedia.
- Run `just extract-contents` to parse the contents of the Wikipedia articles. This will also take a long time.
- Run `just extract-subgraph <root article> <degrees of separation>` to produce the list of all articles within the given degrees of separation of the root. For instance, to find all articles within 5 degrees of separation of the article for RNA, run `just extract-subgraph RNA 5`.
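
To illustrate what "within N degrees of separation" means, here is a minimal sketch of a breadth-first search over a link graph. It is not this tool's actual implementation: the function name `articles_within`, the in-memory `HashMap` adjacency list, and the choices to follow links in one direction only and to include the root in the result are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Returns every article reachable from `root` by following at most
/// `max_degrees` links, including the root itself.
///
/// `links` is a hypothetical adjacency list (article title -> linked titles);
/// the real tool may store its graph differently.
fn articles_within(
    links: &HashMap<String, Vec<String>>,
    root: &str,
    max_degrees: usize,
) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();

    seen.insert(root.to_string());
    queue.push_back((root.to_string(), 0));

    while let Some((title, depth)) = queue.pop_front() {
        // Articles already at the maximum distance are kept but not expanded.
        if depth == max_degrees {
            continue;
        }
        if let Some(neighbors) = links.get(&title) {
            for next in neighbors {
                // `insert` returns true only the first time a title is seen,
                // so each article is queued at its shortest distance.
                if seen.insert(next.clone()) {
                    queue.push_back((next.clone(), depth + 1));
                }
            }
        }
    }

    seen
}
```

Under these assumptions, calling `articles_within(&links, "RNA", 5)` would correspond to the `just extract-subgraph RNA 5` example above. Breadth-first order matters here because it guarantees each article is first reached along a shortest path, so the depth cutoff is applied correctly.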