This repository was archived by the owner on Sep 9, 2025. It is now read-only.
1 file changed, +38 -0 lines changed
1+ # Wiki document source
2+
3+ Fetching information from wikis is an essential
4+ feature for fine-tuning LLMs on public knowledge.
5+
6+ ## Interfaces
7+
8+ qna.yaml file, `document` section:
9+
10+ - Wiki Host: The base URL of a wiki host.
11+ - Page titles: The titles of the Wiki pages to fetch.
12+ - oldid: Revision IDs that pin pages to specific older versions.
13+
14+ The qna.yaml file can define a single host and multiple spaces and pages,
15+ each with an optional version.
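As a rough illustration of the shape described above, the `document` section could be modelled in Python as one host plus a list of pages, each with an optional `oldid`. This is only a sketch: the key names `wiki_host`, `pages`, `title`, and `oldid`, and the classes `WikiPage` and `WikiDocument`, are hypothetical and are not taken from the actual schema. Spaces are left out because their layout is not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WikiPage:
    """One page to fetch. Field names are hypothetical, not the real schema."""
    title: str
    oldid: Optional[int] = None  # optional pinned revision


@dataclass
class WikiDocument:
    """A single wiki host together with the pages to fetch from it."""
    wiki_host: str
    pages: List[WikiPage] = field(default_factory=list)


def parse_document(section: dict) -> WikiDocument:
    """Build a WikiDocument from an already-loaded `document` mapping."""
    pages = [
        WikiPage(title=p["title"], oldid=p.get("oldid"))
        for p in section.get("pages", [])
    ]
    return WikiDocument(wiki_host=section["wiki_host"], pages=pages)


# Hypothetical input, using the host and page from the fetch URL in this document.
doc = parse_document({
    "wiki_host": "https://en.wikipedia.org",
    "pages": [
        {"title": "IBM_Granite", "oldid": 1235007056},
        {"title": "IBM_Granite"},  # no version pinned
    ],
})
```

Keeping `oldid` optional mirrors the "optional version" wording: a page without it would simply be fetched at whatever revision the wiki serves.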
16+
17+ Example of a fetch URL:
18+
19+ - https://en.wikipedia.org/w/index.php?title=IBM_Granite&oldid=1235007056&action=raw
20+
21+ Note that oldid is sufficient to retrieve a page:
22+
23+ - https://en.wikipedia.org/w/index.php?oldid=1235007056&action=raw
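Both URL forms above can be produced by one small helper. The following is a minimal sketch using `urllib.parse.urlencode` from the standard library; the function name `build_raw_url` is hypothetical, and the `/w/index.php` path is assumed from the Wikipedia examples, so other wiki hosts may use a different script path.

```python
from typing import Optional
from urllib.parse import urlencode


def build_raw_url(wiki_host: str, title: Optional[str] = None,
                  oldid: Optional[int] = None) -> str:
    """Build a raw-wikitext fetch URL from a title, an oldid, or both."""
    if title is None and oldid is None:
        raise ValueError("need a title, an oldid, or both")
    params = {}
    if title is not None:
        params["title"] = title
    if oldid is not None:
        params["oldid"] = oldid
    params["action"] = "raw"  # ask for the page source instead of HTML
    # Assumption: the wiki serves index.php under /w/, as en.wikipedia.org does.
    return f"{wiki_host.rstrip('/')}/w/index.php?{urlencode(params)}"


with_title = build_raw_url("https://en.wikipedia.org", "IBM_Granite", 1235007056)
oldid_only = build_raw_url("https://en.wikipedia.org", oldid=1235007056)
```

Using `urlencode` rather than string concatenation means titles containing spaces or other reserved characters are escaped automatically.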
24+
25+ The page title is used for validation.
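How the validation is carried out is not specified here. One plausible reading is that the title given in qna.yaml is compared with the title the wiki reports for the requested `oldid`. The sketch below covers only the comparison step; `normalize_title` and `title_matches` are hypothetical names, and obtaining the reported title from the wiki is assumed to happen elsewhere. The normalization relies on MediaWiki treating underscores and spaces in titles as equivalent.

```python
def normalize_title(title: str) -> str:
    """Map underscores to spaces and collapse surrounding or repeated whitespace."""
    return " ".join(title.replace("_", " ").split())


def title_matches(expected: str, reported: str) -> bool:
    """Return True when both titles refer to the same page name.

    Case is compared as-is, because first-letter capitalization rules
    depend on the wiki's configuration.
    """
    return normalize_title(expected) == normalize_title(reported)


# Hypothetical check: title from qna.yaml vs. title reported by the wiki.
ok = title_matches("IBM_Granite", "IBM Granite")
mismatch = title_matches("IBM_Granite", "IBM Watson")
```

A mismatch would suggest that the `oldid` in qna.yaml belongs to a different page than the one the author intended.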
26+
27+ ## Changes across modules
28+
29+ - [Schema module](https://github.com/instructlab/schema) defines the structure and validation rules for
30+ the qna.yaml file.
31+ - [SDG taxonomy module](https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/taxonomy.py)
32+ fetches documents.
33+ - [SDG unit tests](https://github.com/instructlab/sdg/tree/main/tests)
34+
35+ ## Additional External Packages
36+
37+ - urllib
38+