Importing repository materials

Much like the corpus texts, repository materials include a file with metadata in header tags as well as the plaintext of the repository material; they additionally include the original repository resource, in .pdf format. The filename of the .pdf must match the filename of the associated .txt file so that the importer can find the .pdf and import it.
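Because the importer relies on matching filenames, it can help to check the download for .txt files that lack a same-named .pdf before running the import. The following is a minimal sketch of such a check, not part of the Crow codebase. The function name `find_txt_without_pdf` is hypothetical, and the sketch assumes that a matching .pdf may sit anywhere under the same directory tree and that extensions are lowercase.

```python
from pathlib import Path
import sys


def find_txt_without_pdf(root):
    """Return .txt files under root that have no same-named .pdf under root."""
    root = Path(root)
    # Collect the base names (without extension) of every .pdf in the tree.
    # Assumption: the .pdf may be in any subdirectory, not only beside the .txt.
    pdf_stems = {p.stem for p in root.rglob("*.pdf")}
    # A .txt file is reported when no .pdf shares its base name.
    return sorted(p for p in root.rglob("*.txt") if p.stem not in pdf_stems)


if __name__ == "__main__":
    # Usage: python check_pdfs.py path/to/directory (defaults to the current directory)
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_txt_without_pdf(directory):
        print(f"No matching .pdf for {path}")
```

If the script prints nothing, every .txt file in the directory has a .pdf with the same base name.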

1. Check for updates to the metadata maps

Periodically, new assignment types, topic types, and college codes are added to the lists of these categories that Crow curates. Before importing new repository materials, check for any changes to these maps at https://drive.google.com/drive/folders/1B0giVT429xQdiY35M__om1lTo-5WuB2t

If there are changes, copy them manually into the mapping in code at profiles/corpus/modules/corpus_importer/src/ImporterMap.php

2. Download new repository materials

Currently we use a shared cloud file hosting service to distribute the prepared repository materials. Access to the files is provided by Shelley Staples. Download the files and place them in a directory anywhere under the codebase root.

2a. If necessary, delete existing repository materials.

If all repository materials need to be reimported (for example, if existing files' data has been updated or a change has been made to the schema), run the following command:

lando drush repository-wipe

3. Run the import script

The command used to import corpus materials also imports repository materials; the importer inspects the headers of each file and determines whether it is a repository material or a corpus text based on the presence of the File ID or Student ID header.

lando drush corpus-import path/to/directory

You should see output in the terminal similar to the following:

106_RR_AS_1299_UA
Importing original file New Repository files 3_13_20/filenames/ENGL106/Fall 2019/1025/Language_Awareness/106_RR_RU_1300_UA.pdf
106_RR_RU_1300_UA
Importing original file New Repository files 3_13_20/filenames/ENGL106/Fall 2018/1024/NA/106_RR_AS_1288_UA.pdf
106_RR_AS_1288_UA
Importing original file New Repository files 3_13_20/filenames/ENGL106/Fall 2018/1024/Peer_Review/106_RR_AC_1287_UA.pdf
...
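Since the importer decides between repository and corpus handling based on the File ID or Student ID header, it can be useful to see which of those headers a file carries before importing it. The sketch below is illustrative only and is not the importer's actual logic. It assumes headers are written one per line as `<Name: value>` tags, and the helper name `id_headers_present` is hypothetical. It reports which ID headers are present without deciding how the file will be classified.

```python
import re

# Assumed header syntax: one "<Name: value>" tag per line.
HEADER_PATTERN = re.compile(r"^<([^:<>]+):\s*([^<>]*)>\s*$", re.MULTILINE)

# The two header names mentioned in step 3.
ID_HEADERS = ("File ID", "Student ID")


def id_headers_present(text):
    """Return the ID header names found in text, in ID_HEADERS order."""
    # Gather every header name that appears in the text.
    names = {match.group(1).strip() for match in HEADER_PATTERN.finditer(text)}
    # Keep only the two ID headers the importer is described as checking.
    return [header for header in ID_HEADERS if header in names]
```

For example, calling `id_headers_present(open(path, encoding="utf-8").read())` on a prepared .txt file returns a list naming whichever of the two headers it contains, or an empty list if neither is found.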

If the importer was unable to find any of the .pdfs, a message will be printed as follows:

4. Index the texts

Go to /admin/config/search/search-api/index/resource_index or run lando drush sapi-i

5. Verify the imported data

Spin up the frontend locally: cd ~/Sites/crow_frontend && ng serve
Interact with the frontend and verify that the corpus numbers and metadata numbers look correct.

6. Deploy to the development server

Follow the steps outlined at https://github.com/writecrow/crow_backend/wiki/Deploying-to-the-server
