#SRPPM#

Tracking Scottish Rail Performance

##Introduction##

This project is my first 100 Days of Code project.

It is designed to do (and mostly does) the following:

1. spot the publication of the 4-weekly PDF performance reports (not yet automated; the scraper is currently run manually),
2. download new ones,
3. use the PDFTables API to convert them to CSV,
4. extract the headline data and four detailed measures for each Scottish railway station,
5. write all the data to a SQLite database,
6. make that publicly available.

1. You'll need [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup) installed.
2. You'll need an API key from PDFTables.
3. Rather than write your own wrapper for the API, install this package.
4. Create a `secrets.py` file to hold your API key, containing a line as follows: `pdftables_key = "xxxxxxxx"`
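
One thing to watch: a local `secrets.py` shares its name with the `secrets` module in Python's standard library. When the scraper is run as a script from the same directory, the local file is normally found first, so a plain `from secrets import pdftables_key` should work. As a minimal sketch of a more explicit alternative, the key can be loaded from the file by path. The helper name `load_pdftables_key` is hypothetical and not part of the scraper.

```python
import importlib.util


def load_pdftables_key(path="secrets.py"):
    """Return pdftables_key from the file at `path` (hypothetical helper)."""
    # Load the file under a private module name so it cannot be confused
    # with the standard library's `secrets` module.
    spec = importlib.util.spec_from_file_location("local_secrets", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.pdftables_key
```

Either way, keep `secrets.py` out of version control so the key is never published.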

##Progress so far##

Given the starting URL, the scraper finds the link to the current performance report in PDF.
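
As a rough illustration of this step, the sketch below collects every link ending in `.pdf` from a page's HTML. It is not the scraper's own code: to stay dependency-free it uses the standard library's `html.parser` rather than Beautiful Soup, and the names `PdfLinkFinder` and `find_pdf_links` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect absolute URLs of all <a href="...pdf"> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page's own URL.
            self.pdf_links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the PDF links found in `html`, in page order."""
    finder = PdfLinkFinder(base_url)
    finder.feed(html)
    return finder.pdf_links
```

The caller would fetch the starting page, pass its HTML and URL to `find_pdf_links`, and pick out the link whose file name looks like a performance report.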
The programme notes the file name (as it contains info on the year and period, which we use later) and downloads a copy. Update as of scraper_v1.4: it now deals with the shifting naming patterns which appeared in P10 of 2016-17.
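
The year and period can be pulled out of the file name with a regular expression that accepts either separator. This is a sketch, assuming the two naming patterns seen so far in this repository (`performance-display-p1617-08.pdf` and `performance_display_p1617_10.pdf`); `parse_report_name` and `output_name` are hypothetical helper names.

```python
import re

# "p", two 2-digit years, a "-" or "_" separator, then a 2-digit period.
REPORT_NAME = re.compile(r"p(\d{2})(\d{2})[-_](\d{2})\.pdf$", re.IGNORECASE)


def parse_report_name(filename):
    """Return (year, period) labels such as ("y1617", "p08")."""
    match = REPORT_NAME.search(filename)
    if match is None:
        raise ValueError(f"unrecognised report name: {filename!r}")
    start, end, period = match.groups()
    return f"y{start}{end}", f"p{period}"


def output_name(filename):
    """Build the name of the plain CSV output file for a report."""
    year, period = parse_report_name(filename)
    return f"{year}_{period}_output.csv"
```

Raising an error on an unrecognised name means a future change of naming pattern is noticed straight away rather than producing a mislabelled output file.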
The programme invokes the PDFTables API, sends it the PDF, and gets back a CSV file, which is saved under the same file name but with a .csv suffix.
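
A sketch of this step, assuming the wrapper is the `pdftables_api` package and that its `Client(key).csv(input_path, output_path)` call behaves as that package's documentation describes; check its README before relying on this. `csv_path_for` and `convert_report` are hypothetical names.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Same file name as the PDF, but with a .csv suffix."""
    return Path(pdf_path).with_suffix(".csv")


def convert_report(pdf_path, api_key):
    """Send the PDF to PDFTables and save the returned CSV next to it."""
    # Imported here so the rest of the sketch runs without the package.
    # Assumption: the wrapper package is pdftables_api.
    import pdftables_api

    csv_path = csv_path_for(pdf_path)
    pdftables_api.Client(api_key).csv(str(pdf_path), str(csv_path))
    return csv_path
```

Each conversion uses up some of the API allowance, so it is worth skipping PDFs whose CSV file already exists.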
The programme then parses the CSV, locates the necessary bits of data and writes them to nested lists, which it then sorts alphabetically by station before writing them to a plain text file as CSV. This is useful for those who cannot use the SQLite database.
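
The sorting and writing can be sketched as below. The row layout (station name first, then four measures) is an assumption for the example, the figures are made up, and `sort_by_station` and `write_rows` are hypothetical names.

```python
import csv
import io


def sort_by_station(rows):
    """Return the nested lists ordered alphabetically by station name."""
    # Assumption: the station name is the first item of each inner list.
    return sorted(rows, key=lambda row: row[0].casefold())


def write_rows(rows, out_file):
    """Write the nested lists to an open text file as CSV."""
    csv.writer(out_file, lineterminator="\n").writerows(rows)


# Made-up figures, purely to show the shape of the data.
rows = [
    ["Perth", "88.2", "93.4", "1.1", "0.4"],
    ["Aberdeen", "90.1", "95.0", "0.8", "0.2"],
]
buffer = io.StringIO()
write_rows(sort_by_station(rows), buffer)
```

In the real programme `out_file` would be a file opened with `newline=""`, as the `csv` module documentation recommends, instead of an in-memory buffer.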
It stores the data in three linked tables in a SQLite database.
I've moved the code which creates the tables, dropping them first if they exist, into a function and call that function from the main programme body, so that the call can easily be commented out to avoid deleting existing data.
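
As an illustration of that arrangement, here is a minimal sketch with three linked tables and a single function that drops and recreates them. The table and column names are hypothetical; the real schema in `train_perf.sqlite` may well differ.

```python
import sqlite3

# Hypothetical schema: the real table and column names may differ.
SCHEMA = """
CREATE TABLE periods (
    id INTEGER PRIMARY KEY,
    year TEXT NOT NULL,
    period TEXT NOT NULL
);
CREATE TABLE stations (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE station_measures (
    period_id INTEGER NOT NULL REFERENCES periods(id),
    station_id INTEGER NOT NULL REFERENCES stations(id),
    measure TEXT NOT NULL,
    value REAL
);
"""


def recreate_tables(conn):
    """Drop the three tables if they exist, then create them empty."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS station_measures;
        DROP TABLE IF EXISTS stations;
        DROP TABLE IF EXISTS periods;
        """
        + SCHEMA
    )


conn = sqlite3.connect(":memory:")  # in-memory database for the example
# With a database file, comment out the next line to keep existing data.
recreate_tables(conn)
```

The child table is dropped first so that nothing is left pointing at rows in a table that no longer exists.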
I've now recoded the main extraction process to work on tables with an extra blank column, which appeared in P9 and P10 of 2016-17.
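
One way to cope with a stray blank column is to drop any column that is empty in every row before extracting values. This is a sketch of that idea, not necessarily how `scraper_v1.4.py` does it; `drop_blank_columns` is a hypothetical name.

```python
def drop_blank_columns(rows):
    """Remove columns that are empty (or whitespace) in every row."""
    if not rows:
        return []
    # Pad short rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    # Keep only the column positions where at least one row has content.
    keep = [i for i in range(width) if any(row[i].strip() for row in padded)]
    return [[row[i] for i in keep] for row in padded]
```

After this clean-up the same column positions can be used for every period, whether or not the converted table had the extra column.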

I might also
