Importing corpus texts

New texts can be added to an existing corpus database. Updated texts can also be imported, provided the import command is set to overwrite texts already in the system that have a matching filename; the filename serves as each corpus text's unique identifier.

However, updated texts sometimes arrive with new filenames, in which case they cannot be matched to existing texts and an update import is not possible. Therefore, let's first outline the steps for doing a fresh import.
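Because the filename is the identifier, one way to judge whether an update import could overwrite cleanly is to compare the filenames in the new batch against those in the previously imported batch. The following is a minimal sketch and not part of the project's tooling: the function name `unmatched_filenames` and both directory arguments are hypothetical, and it assumes that only the base filename (not the directory path) is used for matching.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project's scripts).
# Lists .txt filenames found under NEW_DIR that have no same-named file under OLD_DIR.
# Assumption: texts are matched on base filename only, not on their directory.
unmatched_filenames() {
  old_dir=$1
  new_dir=$2
  old_list=$(mktemp)
  new_list=$(mktemp)
  # Strip the directory part of each path, then sort so comm can compare the lists.
  find "$old_dir" -type f -name '*.txt' | sed 's|.*/||' | sort -u > "$old_list"
  find "$new_dir" -type f -name '*.txt' | sed 's|.*/||' | sort -u > "$new_list"
  # comm -13 prints only the lines unique to the second (new) list.
  comm -13 "$old_list" "$new_list"
  rm -f "$old_list" "$new_list"
}
```

If this prints any filenames, those texts would be added as new entries rather than overwriting existing ones, which is the situation where a fresh import is the safer route.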

Overview of import steps

Download the live database from the production server: ssh <user>@<host>, then on the server run cd public_html/api && vendor/bin/drush cr && vendor/bin/drush sql-dump > YYYYMMDD.sql
Import that database into the local development site: lando db-import <filename>
Clear data before importing:

UA: lando drush cr && lando drush corpus-wipe --institution="147"
Purdue: lando drush cr && lando drush corpus-wipe --institution="7" && lando drush corpus-word-wipe && lando drush corpus-lemma-wipe
NAU: lando drush cr && lando drush corpus-wipe --institution="382" && lando drush corpus-word-wipe && lando drush corpus-lemma-wipe

Create a database snapshot before doing the import, to save time retesting if anything goes wrong: lando db-export YYYYMMDD-pre-import.sql
Locate all corpus text files to be imported in corpus_data
Run count.sh to check for duplicate files
Check that the number of files matches the expected number of imports: find corpus_data -mindepth 1 -type f -name "*.txt" -printf x | wc -c
Import the texts: lando drush corpus-import /app/corpus_data
Check for duplicate texts: lando drush corpus-dedupe
Check for missing imports: sh filelist.sh && mv corpus_data/provided_files.txt provided_files.txt && lando drush corpus-dedupe-provided
After this has completed, check that the number of texts imported matches the expected number from the script output.
Create a database snapshot before doing search indexing, to save time retesting if anything goes wrong: lando db-export YYYYMMDD-pre-deploy.sql
Create search indices locally: lando drush cww && lando drush clw && lando drush cwc && lando drush clc
Check assignment map for updates
Navigate to https://api.corporaproject.org/corpus_search and populate the JSON in the frontend to update the cached-for-performance base data
Spin up the frontend locally: cd ~/Sites/crow_frontend && ng serve
Interact with the frontend and verify that the corpus numbers and metadata numbers look correct.
Export the database: lando db-export YYYYMMDD-deploy.sql (on the remote server, the exported file can be loaded via mysql -u <username> -D <databasename> < YYYYMMDD-deploy.sql)
Transfer the exported file to the server via SFTP.
Import and clear the cache: drush sqlc < YYYYMMDD-deploy.sql && drush cr
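The pre-import checks in the steps above (duplicate files and the expected file count) can be approximated with standard tools. This is a hedged sketch only: the contents of count.sh are not shown on this page, so `duplicate_names` is a hypothetical stand-in that assumes a "duplicate" means the same filename appearing more than once under corpus_data. `count_texts` is a variant of the find command from the steps that avoids -printf, which is a GNU find extension and is not available in BSD/macOS find.

```shell
#!/bin/sh
# Hypothetical stand-in for a duplicate check (NOT the project's count.sh):
# prints each .txt filename that occurs more than once under the given directory.
duplicate_names() {
  find "$1" -type f -name '*.txt' | sed 's|.*/||' | sort | uniq -d
}

# Counts .txt files under the given directory without relying on find -printf.
# Assumes no filename contains a newline character.
count_texts() {
  find "$1" -type f -name '*.txt' | wc -l | tr -d ' '
}
```

Under these assumptions, `duplicate_names corpus_data` should print nothing before the import, and `count_texts corpus_data` gives the number to compare against the import output.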
