These scripts use Python to search for terms across multiple documentation repositories. (Repositories are assumed to use the metadata formats for docs.microsoft.com.)
- Make sure you have Python 3 installed. Download it from https://www.python.org/downloads.
- Run `pip install -r requirements.txt` to install the needed libraries. (If you want to use a virtual environment instead of your global environment, run `python -m venv .env` then `.env\scripts\activate` before running `pip install`.)
Inventories are driven by a JSON configuration file. This repo contains a few example configurations in `config.json`, `config_python.json`, and `config_js.json`. You can create additional files as necessary.
- Specify the repos you want to search in the `content` collection of the config file. For each element:
  - `repo` is a name for the repo (by convention, we use the GitHub org/repo name).
  - `path` is the location of the cloned repo on your local computer. Leave `path` blank to skip the repo.
  - `url` is the base URL for the published articles of the docset. The `url` is used to auto-generate full URLs in the output files.
  - `exclude_folders` is a collection of folder names to omit from the inventory, such as `includes` folders and other folders that aren't actively maintained (such as `vs2015` in the Visual Studio repo).
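A `content` collection might look like the following. This is an illustrative sketch; the repo names, paths, and URLs are placeholders, and `config.json` in this repo shows the real shape.

```json
{
  "content": [
    {
      "repo": "MicrosoftDocs/visualstudio-docs",
      "path": "C:\\repos\\visualstudio-docs",
      "url": "https://docs.microsoft.com/visualstudio",
      "exclude_folders": ["includes", "vs2015"]
    },
    {
      "repo": "MicrosoftDocs/azure-docs",
      "path": "",
      "url": "https://docs.microsoft.com/azure",
      "exclude_folders": ["includes"]
    }
  ]
}
```

The second entry has a blank `path`, so that repo is skipped when the inventory runs.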
- In the `inventory` section, specify distinct inventories, each of which generates a separate set of inventory files.
  - `name` is a case-insensitive name for the inventory. NOTE: don't use spaces, hyphens, or any other character that's not allowed in a filename. We recommend using only letters and numbers.
  - `terms` is an array of Python regular expressions to use as search terms.
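An `inventory` section could then look like this (again an illustrative sketch; the names and terms are made up):

```json
{
  "inventory": [
    {
      "name": "Secrets1",
      "terms": ["password", "secret[_-]?key", "connection\\s+string"]
    },
    {
      "name": "Auth1",
      "terms": ["OAuth", "service principal"]
    }
  ]
}
```

Because the terms are Python regular expressions embedded in JSON strings, backslashes must be doubled: `\\s` in the JSON source becomes the regex `\s`.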
- By default, the script saves results in an `InventoryData` folder. You can customize this folder by setting the `INVENTORY_RESULTS_FOLDER` environment variable.
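Resolving the output folder presumably amounts to something like this (a sketch of the behavior described above, not the actual code in `take_inventory.py`):

```python
import os

# Fall back to the default "InventoryData" folder when the
# INVENTORY_RESULTS_FOLDER environment variable isn't set
# (an assumption about the script's behavior, based on this README).
results_folder = os.environ.get("INVENTORY_RESULTS_FOLDER", "InventoryData")
os.makedirs(results_folder, exist_ok=True)
```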
- At a command prompt, run `python take_inventory.py --config <config-file>`. If you omit `--config <config-file>`, the script defaults to `config.json`.
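The `--config` handling can be sketched with `argparse` as follows (a hypothetical reconstruction; see `take_inventory.py` for how the flag is actually parsed):

```python
import argparse

# Sketch: --config falls back to config.json when the flag is omitted.
parser = argparse.ArgumentParser(description="Take a documentation inventory.")
parser.add_argument("--config", default="config.json",
                    help="JSON configuration file to use.")

args = parser.parse_args([])  # no arguments given: the default applies
```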
- When the script is complete, you'll see four files in the results folder for each inventory in the config file:
  - `<name>_<date>_<sequential_int>.csv` contains one line per search term instance.
  - `<name>_<date>_<sequential_int>-metadata.csv`, generated by `extract_metadata.py` (run automatically from `take_inventory.py`), adds various metadata values extracted from the source files to the results.
  - `<name>_<date>_<sequential_int>-consolidated.csv`, generated by `consolidate.py` (also run automatically), collapses the output from `extract_metadata.py` into one line per file, with a count column for each term and a count column for each classification tag (where the term is found).
  - `<name>_<date>_<sequential_int>-scored.csv`, generated by `score.py` (also run automatically), applies a scoring algorithm to the output from `consolidate.py` (see `score.py` for details). The script adds a single "score" column to the new output file and automatically omits any file with a score of zero. The result is a file that contains the "articles of interest" for the inventory in question.
The `<sequential_int>` value starts at 0001 and is incremented each time you run the script on the same day, so subsequent runs on the same day produce distinct output.
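The naming scheme above can be sketched as follows. This is a hypothetical reimplementation, and the date format used here is an assumption; the real logic lives in `take_inventory.py`.

```python
import datetime
import os

def next_inventory_basename(name, folder="."):
    """Return "<name>_<date>_<sequential_int>" using the first unused
    four-digit sequence number for today's date."""
    date = datetime.date.today().strftime("%Y-%m-%d")  # assumed date format
    seq = 1
    # Bump the sequence number until we find a .csv name not already taken.
    while os.path.exists(os.path.join(folder, f"{name}_{date}_{seq:04}.csv")):
        seq += 1
    return f"{name}_{date}_{seq:04}"
```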