Just replace 'arxiv' in the URL with 'markxiv' to get the Markdown:
https://arxiv.org/abs/1706.03762 → https://markxiv.org/abs/1706.03762
A minimal web service that mimics arXiv but serves Markdown instead of PDFs/HTML.
Given an arXiv ID, the server:
- Checks a local LRU cache for a converted result
- Fetches the paper’s LaTeX source from arXiv (if available)
- Extracts the archive, picks the main `.tex` file, and converts it to Markdown using pandoc
- Falls back to `pdftotext` when LaTeX sources are unavailable or pandoc conversion fails
- Returns `text/markdown; charset=utf-8`
If a paper is PDF-only (no source available) or pandoc conversion fails, the server falls back to pdftotext and returns the extracted Markdown/plain text when that succeeds.
Returned Markdown includes the paper title and abstract prepended at the top.
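
As a rough illustration of the prepending step, here is a minimal sketch. The function name `prepend_metadata` and the whitespace handling are assumptions made for this example, not the project's actual code; only the `# {title}` and `## Abstract` layout comes from this README.

```rust
/// Prepends `# {title}` and a `## Abstract` section to converted Markdown.
/// Hypothetical helper: the name and exact spacing are assumptions.
pub fn prepend_metadata(title: &str, abstract_text: &str, body: &str) -> String {
    // Titles from a feed may be wrapped across lines, so collapse whitespace runs.
    let title = title.split_whitespace().collect::<Vec<_>>().join(" ");
    format!(
        "# {}\n\n## Abstract\n\n{}\n\n{}",
        title,
        abstract_text.trim(),
        body.trim_start()
    )
}
```

Collapsing whitespace in the title keeps the heading on a single line even when the metadata source wraps long titles.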
- Rust toolchain (`cargo`, `rustc`) via rustup
- pandoc (for LaTeX → Markdown conversion)
- pdftotext (Poppler CLI, usually packaged as `poppler-utils`)
- tar (for extracting the arXiv source archive)
Most Linux/macOS environments already include tar. Windows 10+ includes bsdtar as tar.
Recommended: install via rustup.
- Linux/macOS:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# then restart your shell or `source $HOME/.cargo/env`
rustup update
cargo --version
- Windows (PowerShell):
  - Download and run: https://win.rustup.rs
  - After install: open a new terminal and run `rustup update` and `cargo --version`
Alternative (macOS):
brew install rustup
rustup-init

- macOS (Homebrew):
brew install pandoc poppler
- Debian/Ubuntu:
sudo apt-get update sudo apt-get install -y pandoc poppler-utils tar
- Fedora:
sudo dnf install -y pandoc poppler-utils tar
- Arch:
sudo pacman -S pandoc poppler tar
- Windows:
  - Chocolatey: `choco install pandoc poppler`
  - Scoop (main bucket): `scoop install pandoc poppler`
  - Manual binaries: https://blog.alivate.com.au/poppler-windows/
  - MSI installers: https://pandoc.org/installing.html
Verify:
pandoc --version
pdftotext -v

# from repo root
cargo build
cargo run
# server listens on 0.0.0.0:8080 by default

Environment variables:
- `PORT` (default `8080`)
- `MARKXIV_CACHE_CAP` (default `128`): number of cached papers
- `MARKXIV_INDEX_MD` (default `content/index.md`): path to landing page Markdown
- `MARKXIV_PANDOC_PATH` (default `pandoc`): path to pandoc binary
- `MARKXIV_CACHE_DIR` (default `./cache`): on-disk cache root directory
- `MARKXIV_DISK_CACHE_CAP_BYTES` (default `0`): on-disk cache size cap in bytes (`0` disables disk cache)
- `MARKXIV_SWEEP_INTERVAL_SECS` (default `600`): background sweeper interval in seconds
- `MARKXIV_LOG_PATH`: optional absolute or relative path to the log file; takes precedence over `MARKXIV_LOG_DIR`
- `MARKXIV_LOG_DIR` (default `./logs`): directory used when `MARKXIV_LOG_PATH` is unset; file name defaults to `markxiv.log`
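
To make the defaults concrete, the following sketch reads a subset of these variables using only the standard library. The `Config` struct, its field names, and the choice to fall back to the default when a value fails to parse are all assumptions for illustration; only the variable names and default values are taken from the list above.

```rust
use std::path::PathBuf;

/// Subset of the documented settings. Struct and field names are hypothetical.
#[derive(Debug, PartialEq)]
pub struct Config {
    pub port: u16,
    pub cache_cap: usize,
    pub cache_dir: PathBuf,
    pub disk_cache_cap_bytes: u64,
    pub sweep_interval_secs: u64,
}

/// Looks up `key` and parses it, falling back to `default` when the
/// variable is unset or does not parse (an assumed policy).
fn parse_or<T: std::str::FromStr>(
    lookup: &dyn Fn(&str) -> Option<String>,
    key: &str,
    default: T,
) -> T {
    lookup(key).and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

impl Config {
    /// Builds a config from any key lookup, so tests need not touch the
    /// real process environment.
    pub fn from_lookup(lookup: &dyn Fn(&str) -> Option<String>) -> Self {
        Config {
            port: parse_or(lookup, "PORT", 8080),
            cache_cap: parse_or(lookup, "MARKXIV_CACHE_CAP", 128),
            cache_dir: PathBuf::from(
                lookup("MARKXIV_CACHE_DIR").unwrap_or_else(|| "./cache".to_string()),
            ),
            disk_cache_cap_bytes: parse_or(lookup, "MARKXIV_DISK_CACHE_CAP_BYTES", 0),
            sweep_interval_secs: parse_or(lookup, "MARKXIV_SWEEP_INTERVAL_SECS", 600),
        }
    }

    /// Reads the real process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(&|key: &str| std::env::var(key).ok())
    }

    /// A cap of 0 disables the disk cache, as documented above.
    pub fn disk_cache_enabled(&self) -> bool {
        self.disk_cache_cap_bytes > 0
    }
}
```

Passing the lookup as a closure keeps the parsing logic testable; a stricter variant could reject unparsable values at startup instead of silently using the default.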
- `GET /` → serves landing page from Markdown file
  - Content negotiation: `Accept: text/html` renders Markdown to HTML; `Accept: text/markdown` returns raw Markdown
- `GET /health` → `200 OK`, body `ok`
- `GET /abs/:id[?refresh=1]` → `200 OK` with `text/markdown`
  - `:id` can be a base arXiv id (`1601.00001`) or versioned (`1601.00001v2`)
  - `?refresh=1` bypasses the cache and re-fetches/converts
  - Response is pure Markdown, prefixed by `# {title}` and a `## Abstract` section containing the abstract text
  - Two-tier caching: in-memory LRU first, then on-disk gzip store; cache populated on miss
- `GET /pdf/:id[?refresh=1]` → same response as `/abs/:id`, useful for links that expect the `/pdf/` prefix
  - Requests like `/pdf/:id.pdf` are normalized automatically
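
The id handling described for `/abs/:id` and `/pdf/:id.pdf` could look roughly like the sketch below. `normalize_id` is a hypothetical helper, and restricting input to the new-style `YYMM.NNNNN` shape is an assumption of this example; the real handlers may accept other forms.

```rust
/// Splits a path segment such as `1601.00001v2.pdf` into the base id and an
/// optional version. Hypothetical helper; only new-style ids are accepted here.
pub fn normalize_id(raw: &str) -> Option<(String, Option<u32>)> {
    let digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());

    // Drop a single trailing `.pdf`, as in `/pdf/:id.pdf`.
    let raw = raw.trim();
    let id = raw.strip_suffix(".pdf").unwrap_or(raw);

    // Split off a trailing `v<digits>` version marker, if present.
    let (base, version) = match id.rfind('v') {
        Some(pos) => {
            let suffix = &id[pos + 1..];
            if !digits(suffix) {
                return None;
            }
            (&id[..pos], Some(suffix.parse::<u32>().ok()?))
        }
        None => (id, None),
    };

    // Require four digits, a dot, then four or five digits.
    let (yymm, number) = base.split_once('.')?;
    let shape_ok = yymm.len() == 4
        && digits(yymm)
        && (4..=5).contains(&number.len())
        && digits(number);
    if shape_ok {
        Some((base.to_string(), version))
    } else {
        None
    }
}
```

Returning the base id separately from the version makes it straightforward to build cache keys for either form.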
Error mapping:
- `404 Not Found`: unknown arXiv id
- `422 Unprocessable Entity`: PDF only (no e-print source) and the `pdftotext` fallback also failed
- `502 Bad Gateway`: upstream/network error contacting arXiv
- `500 Internal Server Error`: conversion/extraction errors
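
One way to keep this mapping in a single place is an error enum with a status method. The enum and variant names below are hypothetical; only the status codes and their meanings come from the list above.

```rust
/// Failure categories from the error mapping. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchError {
    /// Unknown arXiv id.
    NotFound,
    /// No e-print source, and the pdftotext fallback also failed.
    PdfOnly,
    /// Upstream or network error while contacting arXiv.
    Upstream,
    /// Conversion or extraction failed.
    Conversion,
}

impl FetchError {
    /// Returns the HTTP status code and reason phrase for this error.
    pub fn status(self) -> (u16, &'static str) {
        match self {
            FetchError::NotFound => (404, "Not Found"),
            FetchError::PdfOnly => (422, "Unprocessable Entity"),
            FetchError::Upstream => (502, "Bad Gateway"),
            FetchError::Conversion => (500, "Internal Server Error"),
        }
    }
}
```

Because the `match` is exhaustive, adding a new variant forces a decision about its status code at compile time.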
Run tests (unit + route tests with mocks):
cargo test

Project layout:
- `src/main.rs`: server bootstrap
- `src/routes.rs`: handlers (`/`, `/health`, `/abs/:id`, `/pdf/:id`)
- `src/state.rs`: shared state (LRU cache + clients)
- `src/cache.rs`: thin wrapper around `lru::LruCache`
- `src/arxiv.rs`: arXiv client + metadata fetch via Atom API
- `src/convert.rs`: pandoc-based converter + sanitization
- `src/tex_main.rs`: heuristic for picking the main `.tex` file
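
This README does not spell out the heuristic in `src/tex_main.rs`. As an illustration only, a common approach is to score each `.tex` file by whether it contains `\documentclass` and `\begin{document}`, with a small bonus for conventional file names. Everything in the sketch below (function names, weights, the list of preferred names) is an assumption, not the project's implementation.

```rust
/// Scores one candidate; higher means more likely to be the main file.
/// The weights and preferred names are assumptions for this sketch.
fn score(name: &str, contents: &str) -> u32 {
    let mut s = 0;
    if contents.contains("\\documentclass") {
        s += 4;
    }
    if contents.contains("\\begin{document}") {
        s += 2;
    }
    let lower = name.to_ascii_lowercase();
    let base = lower.rsplit('/').next().unwrap_or("");
    if matches!(base, "main.tex" | "paper.tex" | "ms.tex") {
        s += 1;
    }
    s
}

/// Picks the most likely main `.tex` file from `(path, contents)` pairs.
/// Ties go to the earliest entry; returns `None` when no `.tex` file exists.
pub fn pick_main_tex<'a>(files: &[(&'a str, &str)]) -> Option<&'a str> {
    files
        .iter()
        .filter(|(name, _)| name.to_ascii_lowercase().ends_with(".tex"))
        .map(|(name, contents)| (score(name, contents), *name))
        // `min_by_key` keeps the first of several equal elements, so
        // reversing the score selects the first highest-scoring file.
        .min_by_key(|(s, _)| std::cmp::Reverse(*s))
        .map(|(_, name)| name)
}
```

A lone `.tex` file is still selected even when it scores zero, which covers sources that use plain TeX or unusual preambles.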
- Metadata (title, abstract): `https://export.arxiv.org/api/query?id_list=:id` (Atom feed), minimal parse of `<entry><title>` and `<summary>`.
- Source archive: `https://arxiv.org/e-print/:id` (tar/tar.gz). 400/403/404 → treated as PDF-only.
- Conversion: save archive to temp dir → extract with `tar` → pick main `.tex` → `pandoc -f latex -t gfm` → sanitize.
- Fallback: when LaTeX sources are unavailable or pandoc fails, download the PDF and shell out to `pdftotext -raw`.
- Sanitization: remove entire `<figure>...</figure>` blocks and strip all remaining HTML tags from the Markdown output.
- Caching: small in-memory LRU for hot entries, plus an on-disk gzip store with size cap and background sweeper that deletes oldest files when over cap.
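
The sanitization step can be sketched with plain string scanning. This is a simplified illustration under stated assumptions, not the code in `src/convert.rs`: the function names are hypothetical, `<figure>` blocks are assumed not to nest, and a `<` counts as a tag opener only when it is followed by a letter, `/`, or `!`, so that comparisons such as `1 < 2` survive.

```rust
/// Removes every `<figure ...>...</figure>` block (assumed non-nested).
fn remove_figures(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("<figure") {
        out.push_str(&rest[..start]);
        match rest[start..].find("</figure>") {
            // Skip past the closing tag and keep scanning.
            Some(end) => rest = &rest[start + end + "</figure>".len()..],
            // Unterminated block: keep the text and stop scanning.
            None => {
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

/// Strips anything that looks like an HTML tag, leaving a bare `<` alone.
fn strip_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(lt) = rest.find('<') {
        out.push_str(&rest[..lt]);
        let after = &rest[lt + 1..];
        let looks_like_tag =
            after.starts_with(|c: char| c.is_ascii_alphabetic() || c == '/' || c == '!');
        match (looks_like_tag, after.find('>')) {
            (true, Some(gt)) => rest = &after[gt + 1..],
            _ => {
                out.push('<');
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Applies both passes: figures first, then any remaining tags.
pub fn sanitize(markdown: &str) -> String {
    strip_tags(&remove_figures(markdown))
}
```

Removing figures before stripping tags matters: if the tags went first, the figure captions would be left behind as stray text. Inline math such as `$a <b$` can still be misread as a tag by this sketch.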
# health
curl -s http://localhost:8080/health
# landing page (HTML by default)
curl -sI http://localhost:8080/ | grep -i content-type
# landing page as raw Markdown
curl -sH 'Accept: text/markdown' http://localhost:8080/
# fetch a paper (replace with a source-available id)
curl -sH 'Accept: text/markdown' http://localhost:8080/abs/1601.00001
# force refresh (bypass cache)
curl -s 'http://localhost:8080/abs/1601.00001?refresh=1'
# enable disk cache with ~10 GB cap
MARKXIV_DISK_CACHE_CAP_BYTES=$((10*1024*1024*1024)) cargo run

- Conversion fidelity depends on pandoc and the paper’s LaTeX structure; complex macros/environments may not convert perfectly.
- Title and abstract are prepended to the Markdown as `# Title` and a `## Abstract` heading followed by the abstract.
- HTML is stripped from the final Markdown; embedded PDF figures are removed.
- Caching is in-memory, with an optional on-disk tier; a restart clears the in-memory cache but not the files on disk.
- For production use, consider timeouts, rate limiting, and persistent caching.
- By default the server writes structured request/error logs to `./logs/markxiv.log`; ensure the process user can create that directory.
- Override the location with `MARKXIV_LOG_PATH=/abs/path/to/markxiv.log` or `MARKXIV_LOG_DIR=/var/log/markxiv` (generates `/var/log/markxiv/markxiv.log`).
- If the file cannot be opened (e.g., missing write permissions), logging automatically falls back to stderr/journald.
- When running under systemd, confirm the service `User`/`Group` owns or can write to the configured log directory, e.g. `sudo chown -R markxiv:markxiv /var/log/markxiv`.
- Quick check: verify the service account can write to the cache (or any target directory) via `sudo -u markxiv test -w /mnt/markxiv-cache && echo writable`.
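
The precedence between `MARKXIV_LOG_PATH` and `MARKXIV_LOG_DIR` can be expressed as a small pure function. `resolve_log_path` is a hypothetical name used for this sketch; the precedence rule, the `./logs` default, and the `markxiv.log` file name are taken from the notes above.

```rust
use std::path::PathBuf;

/// Resolves the log file location from the two optional settings.
/// `log_path` wins when set; otherwise the file is `markxiv.log` inside
/// `log_dir`, which defaults to `./logs`. Hypothetical helper.
pub fn resolve_log_path(log_path: Option<&str>, log_dir: Option<&str>) -> PathBuf {
    match log_path {
        Some(path) => PathBuf::from(path),
        None => PathBuf::from(log_dir.unwrap_or("./logs")).join("markxiv.log"),
    }
}
```

A caller would typically pass `std::env::var("MARKXIV_LOG_PATH").ok().as_deref()` and the equivalent for `MARKXIV_LOG_DIR`, then fall back to stderr if opening the resolved file fails.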