3 changes: 0 additions & 3 deletions .gitignore
@@ -2,9 +2,6 @@
 .bundle
 .idea
 *.iml
-spec/data/sample_data
-hathi_upd*
-hathi_full*
 .env
 .devenv
 archive
2 changes: 1 addition & 1 deletion Gemfile.lock
@@ -10,7 +10,7 @@ GIT
 PATH
   remote: .
   specs:
-    hathifiles_database (0.4.1)
+    hathifiles_database (0.5.0)
       date_named_file
       dotenv
       ettin
34 changes: 14 additions & 20 deletions README.md
@@ -50,29 +50,23 @@
 These are intended to be run under Docker for development purposes.
 
 ```
 exe
-├── catchup
-├── daily_run
-├── hathifiles_database_clear_everything_out
-├── hathifiles_database_convert
-├── hathifiles_database_full
-├── hathifiles_database_full_update
-├── hathifiles_database_update
-└── swap_production_and_reindex
+└── hathifiles_database_full_update
 ```
-These are exported by the `gemspec` as the gem's executables.
-- `catchup` _deprecated_ loads multiple `upd` files
-- `daily_run` _deprecated_ (contains hardcoded paths) loads today's `upd` file
-- `hathifiles_database_clear_everything_out` interactive script to reinitialize the database
-- `hathifiles_database_convert` _deprecated_ interactive script to dump the `hathifiles` database to tab-delimited files
-- `hathifiles_database_full` _deprecated_ loads a single `full` hathifile
+This is exported by the `gemspec` as the gem's executable.
+
 - `hathifiles_database_full_update` the preferred date-independent method for loading `full` and `upd` hathifiles
-- `hathifiles_database_update` _deprecated_ loads a single `upd` hathifile
-- `swap_production_and_reindex` _deprecated_ swaps tables between the `hathifiles` and `hathifiles_reindex` databases
 
-`swap_production_and_reindex` used to be part of the workflow for clearing and rebuilding the
-production database from an auxiliary database. With Argo Workflows we should no longer need to
-do this, as `hathifiles_database_full_update` should touch only the changed/deleted rows
-in the `full` monthly hathifile.
 
+## Environment Variables
+- Default Database Credentials -- override by passing keyword arguments to the `DB::Connection` initializer.
+  - `MARIADB_HATHIFILES_RW_USERNAME`
+  - `MARIADB_HATHIFILES_RW_PASSWORD`
+  - `MARIADB_HATHIFILES_RW_HOST`
+  - `MARIADB_HATHIFILES_RW_DATABASE`
+- Filesystem
+  - `HATHIFILES_DIR` path to hathifiles archive
+- Other
+  - `PUSHGATEWAY` Prometheus push gateway URL
+
 ## Pitfalls
 
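Editor's note on the credential override the new README section mentions: defaults can be overridden "by passing keyword arguments to the `DB::Connection` initializer". A minimal sketch of what that usage might look like — the keyword names here are illustrative assumptions, not confirmed by this diff:

```ruby
require "hathifiles_database"

# Default behavior: credentials come from the MARIADB_HATHIFILES_RW_* variables.
connection = HathifilesDatabase::DB::Connection.new

# Hypothetical override: the keyword names (user:, password:, host:, database:)
# are assumptions for illustration; values match this PR's docker-compose.yml.
connection = HathifilesDatabase::DB::Connection.new(
  user: "ht_rights",
  password: "ht_rights",
  host: "mariadb",
  database: "ht"
)
```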
12 changes: 4 additions & 8 deletions docker-compose.yml
@@ -13,14 +13,10 @@ services:
   test:
     build: .
     environment:
-      # Used by dumper.rb
-      HATHIFILES_MYSQL_USER: "ht_rights"
-      HATHIFILES_MYSQL_PASSWORD: "ht_rights"
-      HATHIFILES_MYSQL_HOST: "mariadb"
-      HATHIFILES_MYSQL_DATABASE: "ht"
-      # Used by connection.rb
-      # TODO: construct this based on the above variables
-      HATHIFILES_MYSQL_CONNECTION: "mysql2://ht_rights:ht_rights@mariadb/ht"
+      MARIADB_HATHIFILES_RW_USERNAME: "ht_rights"
+      MARIADB_HATHIFILES_RW_PASSWORD: "ht_rights"
+      MARIADB_HATHIFILES_RW_HOST: "mariadb"
+      MARIADB_HATHIFILES_RW_DATABASE: "ht"
       HATHIFILES_DIR: "/usr/src/app/spec/data"
       PUSHGATEWAY: http://pushgateway:9091
     volumes:
58 changes: 0 additions & 58 deletions exe/catchup

This file was deleted.

16 changes: 0 additions & 16 deletions exe/daily_run

This file was deleted.

22 changes: 0 additions & 22 deletions exe/hathifiles_database_clear_everything_out

This file was deleted.

28 changes: 0 additions & 28 deletions exe/hathifiles_database_convert

This file was deleted.

32 changes: 0 additions & 32 deletions exe/hathifiles_database_full

This file was deleted.

63 changes: 16 additions & 47 deletions exe/hathifiles_database_full_update
@@ -6,7 +6,6 @@
 
 $LOAD_PATH.unshift "../lib"
 
-require "cgi"
 require "dotenv"
 require "logger"
 require "pathname"
@@ -18,30 +17,7 @@ require "hathifiles_database"
 envfile = Pathname.new(__dir__).parent + ".env"
 Dotenv.load(envfile)
 
-# This is the "right" way to do the connection if there is a chance the password
-# will contain non-URI-safe characters (as is likely to be the case).
-# We are careful not to let the URI::InvalidURIError backtrace get logged since
-# it can disclose the password.
-# In future we should have a HathifilesDatabase::DB::Connection implementation that
-# passes the individual ENV bits to Sequel, then we can deprecate the use of a connection
-# string/URI.
-
-# See https://github.com/hathitrust/rights_database/blob/main/lib/rights_database/db.rb
-# for a representative implementation.
-
-mysql_user = ENV["HATHIFILES_MYSQL_USER"]
-mysql_password = CGI.escape ENV["HATHIFILES_MYSQL_PASSWORD"]
-mysql_host = ENV["HATHIFILES_MYSQL_HOST"]
-mysql_database = ENV["HATHIFILES_MYSQL_DATABASE"]
-connection_uri = "mysql2://#{mysql_user}:#{mysql_password}@#{mysql_host}/#{mysql_database}"
-
-begin
-  connection = HathifilesDatabase.new(connection_uri)
-rescue URI::InvalidURIError
-  Logger.new($stderr).fatal("invalid URI in database connection string")
-  exit 1
-end
-
+connection = HathifilesDatabase.new
 hathifiles = HathifilesDatabase::Hathifiles.new(
   hathifiles_directory: ENV["HATHIFILES_DIR"],
   connection: connection
@@ -54,19 +30,21 @@ tracker = PushMetrics.new(
   logger: connection.logger
 )
 
-Dir.mktmpdir do |tempdir|
-  # `missing_full_hathifiles` returns an Array with zero or one element
-  # since only the most recent monthly file (if any) is of interest.
-  #
-  # We always process the full file first, then any updates.
-  # Whether or not this is strictly necessary (the update released
-  # on the same day as the full file may be superfluous), this is how
-  # `hathitrust_catalog_indexer` does it.
-  connection.logger.info "full hathifiles: #{hathifiles.missing_full_hathifiles}"
-  if hathifiles.missing_full_hathifiles.any?
-    hathifile = File.join(ENV["HATHIFILES_DIR"], hathifiles.missing_full_hathifiles.first)
-    connection.logger.info "processing monthly #{hathifile}"
-    HathifilesDatabase::MonthlyUpdate.new(
+# `missing_full_hathifiles` returns an Array with zero or one element
+# since only the most recent monthly file (if any) is of interest.
+#
+# We always process the full file first, then any updates.
+# Whether or not this is strictly necessary (the update released
+# on the same day as the full file may be superfluous), this is how
+# `hathitrust_catalog_indexer` does it.
+missing_hathifiles = hathifiles.missing_full_hathifiles + hathifiles.missing_update_hathifiles
+
+connection.logger.info "hathifiles to process: #{missing_hathifiles}"
+missing_hathifiles.each do |hathifile|
+  Dir.mktmpdir do |tempdir|
+    hathifile = File.join(ENV["HATHIFILES_DIR"], hathifile)
+    connection.logger.info "processing #{hathifile}"
+    HathifilesDatabase::DeltaUpdate.new(
       connection: connection,
       hathifile: hathifile,
       output_directory: tempdir
@@ -75,14 +53,5 @@
       tracker.on_batch { |_t| connection.logger.info tracker.batch_line }
     end
   end
-  connection.logger.info "updates: #{hathifiles.missing_update_hathifiles}"
-  hathifiles.missing_update_hathifiles.each do |hathifile|
-    hathifile = File.join(ENV["HATHIFILES_DIR"], hathifile)
-    connection.logger.info "processing update #{hathifile}"
-    connection.update_from_file(hathifile) do |records_inserted|
-      tracker.increment records_inserted
-      tracker.on_batch { |_t| connection.logger.info tracker.batch_line }
-    end
-  end
 end
 tracker.log_final_line
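Editor's note: the comment block this diff deletes documented a real pitfall worth keeping in mind. Building a `mysql2://` URI by string interpolation breaks when the password contains non-URI-safe characters, and the `URI::InvalidURIError` rescue had to avoid logging the backtrace lest it disclose the password. Passing the parts to Sequel individually, as the removed comment itself proposed (and as the new zero-argument `HathifilesDatabase.new` presumably does internally), sidesteps both problems. A sketch of the contrast using Sequel directly — illustrative only, not the gem's actual `DB::Connection` code:

```ruby
require "cgi"
require "sequel"

# Fragile: the URI form only works if the password is escaped first,
# and a malformed URI raises URI::InvalidURIError with the password
# embedded in the failing string.
password = CGI.escape ENV["MARIADB_HATHIFILES_RW_PASSWORD"]
db = Sequel.connect(
  "mysql2://#{ENV["MARIADB_HATHIFILES_RW_USERNAME"]}:#{password}@" \
  "#{ENV["MARIADB_HATHIFILES_RW_HOST"]}/#{ENV["MARIADB_HATHIFILES_RW_DATABASE"]}"
)

# Safer: pass the parts as options so there is no escaping step and
# no URI parsing that could leak the password.
db = Sequel.connect(
  adapter: "mysql2",
  user: ENV["MARIADB_HATHIFILES_RW_USERNAME"],
  password: ENV["MARIADB_HATHIFILES_RW_PASSWORD"],
  host: ENV["MARIADB_HATHIFILES_RW_HOST"],
  database: ENV["MARIADB_HATHIFILES_RW_DATABASE"]
)
```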
29 changes: 0 additions & 29 deletions exe/hathifiles_database_update

This file was deleted.
