
Importing corpus texts

Mark Fullmer edited this page Dec 15, 2024 · 20 revisions

New texts can be added to an existing corpus database. Updated texts can also be imported, provided the import command is set to overwrite texts already in the system that have a matching filename; the filename serves as the corpus text's unique identifier.

However, updated texts sometimes arrive with new filenames, in which case an update import is not possible. Therefore, let's first outline the steps for doing a fresh import.
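Because overwriting depends on matching filenames, it can help to see which files in an updated batch carry names that did not exist before. The following is a minimal sketch, assuming the previously imported batch is still available on disk; the function names list_names and new_names and both directory arguments are hypothetical and not part of the project's tooling. It also assumes filenames contain no newline characters.

```shell
# Hypothetical helper: print the .txt base filenames under a directory,
# sorted and de-duplicated.
list_names() {
  find "$1" -type f -name '*.txt' | sed 's|.*/||' | sort -u
}

# Print filenames that exist in the new batch ($2) but not in the old batch ($1).
# Such files have no existing text to overwrite.
# The body runs in a subshell so the temporary variables stay local.
new_names() (
  old_list=$(mktemp)
  new_list=$(mktemp)
  list_names "$1" > "$old_list"
  list_names "$2" > "$new_list"
  # comm -13 keeps only the lines unique to the second (new) list.
  comm -13 "$old_list" "$new_list"
  rm -f "$old_list" "$new_list"
)
```

If new_names prints anything, the batch contains filenames with no counterpart in the earlier batch, which is the situation where a fresh import is the safer route.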

Overview of import steps

  1. Download the live database from the production server. After connecting with ssh <user>@<host>, run on the server: cd public_html/api && vendor/bin/drush cr && vendor/bin/drush sql-dump > YYYYMMDD.sql
  2. Import that database into the local development site: lando db-import <filename>
  3. Clear data before importing:
  • UA: lando drush cr && lando drush corpus-wipe --institution="147"
  • Purdue: lando drush cr && lando drush corpus-wipe --institution="7" && lando drush corpus-word-wipe && lando drush corpus-lemma-wipe
  • NAU: lando drush cr && lando drush corpus-wipe --institution="382" && lando drush corpus-word-wipe && lando drush corpus-lemma-wipe
  4. Create a database snapshot at this point (i.e., before the import) to save time retesting if anything goes wrong: lando db-export YYYYMMDD-pre-import.sql
  5. Locate all corpus text files to be imported in corpus_data
  6. Run count.sh to check for duplicate files
  7. Run find corpus_data -mindepth 1 -type f -name "*.txt" -printf x | wc -c to count the files and confirm that the number matches the expected import count
  8. Import the texts: lando drush corpus-import /app/corpus_data
  9. Check for duplicate texts: lando drush corpus-dedupe
  10. Check for missing imports: sh filelist.sh && mv corpus_data/provided_files.txt provided_files.txt && lando drush corpus-dedupe-provided
  11. After this has completed, check that the number of texts imported matches the expected number from the script output.
  12. Create a database snapshot at this point (i.e., before search indexing) to save time retesting if anything goes wrong: lando db-export YYYYMMDD-pre-deploy.sql
  13. Create search indices locally: lando drush cww && lando drush clw && lando drush cwc && lando drush clc
  14. Check the assignment map for updates
  15. Navigate to https://api.corporaproject.org/corpus_search and populate the JSON in the frontend to update the cached-for-performance base data
  16. Spin up the frontend locally: cd ~/Sites/crow_frontend && ng serve
  17. Interact with the frontend and verify that the corpus numbers and metadata numbers look correct.
  18. Export the database: lando db-export YYYYMMDD-deploy.sql (on the remote server, the import can alternatively be done via mysql -u <username> -D <databasename> < YYYYMMDD-deploy.sql)
  19. Transfer the export to the server via SFTP.
  20. On the server, import the database and clear the cache: drush sqlc < YYYYMMDD-deploy.sql && drush cr
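
The file-count and duplicate-filename checks that precede the import can be approximated with a short shell sketch. This is not the project's count.sh, whose contents are not shown on this page; the function names count_texts and duplicate_names are hypothetical. The duplicate check assumes filenames contain no newline characters.

```shell
# Hypothetical pre-import checks; not the project's count.sh.

# Count the .txt files under a directory.
# Emits one "x" per file and counts bytes, like the find command in the
# steps above, but without relying on GNU find's -printf.
count_texts() {
  find "$1" -mindepth 1 -type f -name '*.txt' -exec printf x \; | wc -c | tr -d ' '
}

# List base filenames that occur more than once anywhere under the directory.
# Since the filename is the text's unique identifier, two files with the same
# name in different subdirectories would refer to the same corpus text.
duplicate_names() {
  find "$1" -mindepth 1 -type f -name '*.txt' | sed 's|.*/||' | sort | uniq -d
}
```

For example, count_texts corpus_data prints the number of .txt files to compare against the expected import count, and an empty result from duplicate_names corpus_data suggests that no two files share a filename.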