Importing corpus texts
Mark Fullmer edited this page Dec 15, 2024 · 20 revisions
New texts can be added to an existing corpus database. Updated texts can also be imported, provided the import command is set to overwrite texts already in the system whose filename matches; the filename serves as the corpus text's unique identifier.
However, updated texts sometimes arrive with new filenames, in which case an overwriting import is not possible. The steps below therefore describe a fresh import:
- (`ssh <user>@<host> && cd public_html/api && vendor/bin/drush cr && vendor/bin/drush sql-dump > YYYYMMDD.sql`) Download the live database from the production server.
- `lando db-import <filename>` (Import that database into the local development site.)
- Clear data before importing:
  - UA: `lando drush cr && lando drush corpus-wipe --institution="147"`
  - Purdue: `lando drush cr && lando drush corpus-wipe --institution="7" && lando drush corpus-word-wipe && lando drush corpus-lemma-wipe`
  - NAU: `lando drush cr && lando drush corpus-wipe --institution="382" && lando drush corpus-word-wipe && lando drush corpus-lemma-wipe`
- `lando db-export YYYYMMDD-pre-import.sql` (Create a database snapshot at this point, i.e., before the import, to save time when retesting if anything goes wrong.)
- Locate all corpus text files to be imported in `corpus_data`.
- Run `count.sh` to check for duplicate files.
- Run `find corpus_data -mindepth 1 -type f -name "*.txt" -printf x | wc -c` to count the text files; this is the number of texts the import is expected to produce.
- Import the texts: `lando drush corpus-import /app/corpus_data`
- Check for duplicate text: `lando drush corpus-dedupe`
- Check for missing imports: `sh filelist.sh && mv corpus_data/provided_files.txt provided_files.txt && lando drush corpus-dedupe-provided`
- After this has completed, check that the number of texts imported matches the expected number from the script output.
- `lando db-export YYYYMMDD-pre-deploy.sql` (Create a database snapshot at this point, i.e., before search indexing, to save time when retesting if anything goes wrong.)
- Create search indices locally: `lando drush cww && lando drush clw && lando drush cwc && lando drush clc`
- Check assignment map for updates
- Navigate to https://api.corporaproject.org/corpus_search and populate the JSON in the frontend to update the cached-for-performance base data
- Spin up the frontend locally: `cd ~/Sites/crow_frontend && ng serve`
- Interact with the frontend and verify that the corpus numbers and metadata numbers look correct.
- Export the database: `lando db-export YYYYMMDD-deploy.sql` (once the file is on the remote server, it can be loaded via `mysql -u <username> -D <databasename> < YYYYMMDD-deploy.sql`)
- SFTP the exported file to the server.
- Import and clear the cache: `drush sqlc < YYYYMMDD-deploy.sql && drush cr`
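
The duplicate check and the file count in the steps above both follow from the filename being each text's unique identifier. The following is a minimal sketch of such checks, assuming GNU `find` (which the `-printf` step above already relies on). It is not the project's `count.sh`, whose contents are not shown on this page, and the function names `count_texts` and `duplicate_names` are hypothetical.

```shell
# Sketch only: hypothetical helpers, not the project's count.sh.

# Print the number of .txt files under the given directory
# (same approach as the find | wc -c step above).
count_texts() {
  find "$1" -mindepth 1 -type f -name '*.txt' -printf x | wc -c
}

# Print every filename (directory part stripped) that occurs more than once.
# Assumption: two files with the same name in different subdirectories would
# collide on import, because the filename is the unique identifier.
duplicate_names() {
  find "$1" -mindepth 1 -type f -name '*.txt' -printf '%f\n' | sort | uniq -d
}
```

After sourcing these functions, `count_texts corpus_data` prints the file count, and `duplicate_names corpus_data` prints nothing when every filename is unique.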