The scripts in this repository are currently deployed to a web server (on Amazon EC2), where they are run on a regular schedule to extract data from Versionista and load it into a running instance of web-monitoring-db.
This deployment is very simple and consists of:
- A git clone of this repository
- An Amazon EBS volume (essentially a virtual hard drive) for storing intermediate data before uploading it to S3/Google Cloud/web-monitoring-db. (This is not required, but it keeps data available for archival purposes even if the server is shut down.)
- A set of shell scripts that set up environment variables used to configure the scripts
- A cron script that uses the above environment scripts
- A crontab that runs the above cron script
On the server’s filesystem, this generally looks like:
```
/
├─┬ data/                                    # the mount point for the EBS volume
│ └── versionista/                           # a directory we have permission to read/write
└─┬ home/
  └─┬ ubuntu/                                # user home directory (doesn't have to be ubuntu)
    └─┬ web-monitoring-versionista-scraper/  # the repo clone
      ├── [checked out files]
      ├── versionista-archive-key.json       # Google Cloud key file for uploading to Cloud Storage
      ├── .env.versionista1                  # An environment script for the "versionista1" account
      ├── .env.versionista2                  # An environment script for the "versionista2" account
      └── cron-archive                       # A shell script that gets run by cron
```
Unlike Amazon’s services, Google Cloud requires a key file for managing credentials. This is the `versionista-archive-key.json` file in the file hierarchy above. To create one:
- Go to the Google Cloud console
- Select “IAM & Admin” → “Service Accounts” from the left-hand menu
- Click “Create Service Account” at the top of the screen
- Give your service account any name you like
- Under “Role,” select “Storage” → “Storage Object Admin”
- Check “Furnish a new private key” and select “JSON” for the key type
- Click “Create”
- A JSON file should automatically be downloaded
- Upload the JSON file you got from the above step to your server. You can rename it if you like.
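The console steps above can also be done with the `gcloud` CLI, if you prefer. This is a sketch (not part of the original deployment notes); the project ID and service-account name are placeholders:

```shell
# Create the service account (name and project ID are placeholders)
gcloud iam service-accounts create versionista-archive \
  --display-name "Versionista Archive"

# Grant it the Storage Object Admin role on your project
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member "serviceAccount:versionista-archive@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/storage.objectAdmin"

# Download a JSON key for it
gcloud iam service-accounts keys create versionista-archive-key.json \
  --iam-account "versionista-archive@YOUR_PROJECT_ID.iam.gserviceaccount.com"
```

Either way, you end up with a JSON key file to upload to the server.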
In our deployment, environment scripts are named `.env.[account name]`, e.g. `.env.versionista1`. You can name them anything you like, though. Each should be a copy of the `.env.sample` script in this repository, with all the values properly filled in. Make sure that the `GOOGLE_STORAGE_KEY_FILE` variable points to `versionista-archive-key.json` (or whatever you have named it).
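For instance, the relevant line in `.env.versionista1` might look like this (a sketch; only `GOOGLE_STORAGE_KEY_FILE` is named in these notes, and whether `.env.sample` uses `export` is an assumption):

```shell
# Point the scraper at the Google Cloud key file uploaded earlier
# (path assumes the layout shown in the file hierarchy above)
export GOOGLE_STORAGE_KEY_FILE="$HOME/web-monitoring-versionista-scraper/versionista-archive-key.json"
```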
In the example above, this is the `cron-archive` script. It exists merely to be run by cron: it loads the appropriate configuration environment script, then runs the `bin/scrape-versionista-and-upload` script. In our deployment, it looks something like:
```sh
#!/bin/bash
# Load configuration
source "$HOME/web-monitoring-versionista-scraper/.env.$1"

# Load the appropriate Node.js runtime via NVM
export NVM_DIR="$HOME/.nvm"
source "$NVM_DIR/nvm.sh"
nvm use 6 > /dev/null

# Run the scraper and upload the results
"$HOME/web-monitoring-versionista-scraper/bin/scrape-versionista-and-upload" --after "$2" --output "$3"
```
This script takes three arguments so that cron can run it with different configurations:
- The name of the configuration environment script to load, e.g. `versionista1`
- The number of hours to cover in the scraper run
- Where to store the scraped data on disk (this includes raw diffs, raw versions, and JSON files containing metadata about the versions and diffs)
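Before wiring it into cron, you can run the script by hand to verify your configuration. A sketch, using the paths from the example deployment above:

```shell
# Scrape the last 45 minutes (0.75 hours) for the "versionista1" account
/home/ubuntu/web-monitoring-versionista-scraper/cron-archive versionista1 0.75 /data/versionista
```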
Finally, set up cron to run the above script. Run `crontab -e` to configure cron. In production, we have a crontab that looks something like:
```
0,30 * * * * /home/ubuntu/web-monitoring-versionista-scraper/cron-archive versionista1 0.75 /data/versionista 2>> /var/log/cron-versionista.log
15,45 * * * * /home/ubuntu/web-monitoring-versionista-scraper/cron-archive versionista2 0.75 /data/versionista 2>> /var/log/cron-versionista.log
```
That runs the `cron-archive` script every 30 minutes for each account. The “versionista1” account runs on the hour and at half past; the “versionista2” account runs at a quarter past and a quarter to the hour.
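A quick sanity check on those numbers: each run covers 0.75 hours (45 minutes) of history but runs start every 30 minutes, so consecutive runs overlap by 15 minutes. (The overlap being an intentional safety margin is my inference; the arithmetic is not.)

```shell
# Each run covers 0.75 h of history; runs start every 30 min.
coverage_min=$(awk 'BEGIN { print 0.75 * 60 }')   # 45
interval_min=30
overlap_min=$((coverage_min - interval_min))
echo "covers ${coverage_min} min, every ${interval_min} min, overlap ${overlap_min} min"
```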
Updating the deployment is simple: just SSH into the server, go to the `web-monitoring-versionista-scraper` directory, and run `git pull`. This isn’t an ideal process (it’s not very secure), but it is what we currently have.
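In practice, that update can be a one-liner from your own machine. A sketch; the user and hostname are placeholders:

```shell
# Pull the latest code on the server (user/host are placeholders)
ssh ubuntu@your-server.example.com 'cd ~/web-monitoring-versionista-scraper && git pull'
```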