
A Python script to create reports from individual PDFs. Backchecks for missing reports (by ID) and recursively collates and reduces using Ghostscript.

celaxodon/PDF_collator

PDF_collator

The PDF_collator program was designed to programmatically locate, organize, and collate PDF files with numerical naming schemes. It began as a stopgap shell script to make up for the lack of PDF collating and printing in a Laboratory Information Management System (LIMS), but it grew too large, too quickly, to be easily maintained as a shell script.

The Python rewrite is intended to make the code easier to maintain and understand, and to improve the performance and error handling.

Features

As of June 2015, the script does the following:

  • Checks that necessary remote filesystems have been mounted.

  • Validates file names for COC and PDF report files.

    Valid Chain of Custody names are as follows:

      • '123456coc.pdf'
      • '123456acoc.pdf' ("a" can be swapped for "b", "c", or "d")
      • '123456-460coc.pdf'
      • '123456a-460acoc.pdf'
      • 'QC123-456coc.pdf' ("QC" can be swapped for "WP" or "SP")

    Valid PDF report names are as follows:

      • '123456pg1.pdf' ("pg1" can go as high as "pg9")
      • 'WP123-456pg1.pdf' (as above, "WP" can be swapped for "QC" or "SP")

  • Strips automatically generated prefixes from file names ("job_#### "), where "####" can be any number from 1 to 10000.

  • Searches for and collects Chain of Custody (COC) PDFs, which should match with various LIMS-produced PDFs.

  • Reverse matches for the full range of PDFs, as indicated by the COC naming scheme. If the full range is not found, a warning is given to the user, but the user has the option of proceeding with the collation anyway.

  • Uses Ghostscript to collate and compress reports.

  • Moves the source files to the user's Trash after successful collation, and moves the finished reports to a specified location.
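The name validation and prefix stripping described above could be sketched with regular expressions. This is only a minimal sketch: the patterns are inferred from the example names listed in this README, and the helper names (strip_job_prefix, is_valid_coc, is_valid_report, sort_report_names) are hypothetical, not the functions used in the script.

```python
import re

# Assumed patterns, inferred only from the example names in this README.
# The numeric COC pattern is slightly permissive: it lets each letter
# suffix (a-d) appear or be absent independently.
COC_PATTERN = re.compile(
    r"(?:\d{6}[a-d]?(?:-\d{3}[a-d]?)?|(?:QC|WP|SP)\d{3}-\d{3})coc\.pdf"
)
REPORT_PATTERN = re.compile(
    r"(?:\d{6}|(?:QC|WP|SP)\d{3}-\d{3})pg[1-9]\.pdf"
)
# "job_" followed by 1 to 5 digits and a single space.
JOB_PREFIX = re.compile(r"^job_\d{1,5} ")


def strip_job_prefix(name):
    """Remove a leading 'job_#### ' prefix, if one is present."""
    return JOB_PREFIX.sub("", name, count=1)


def is_valid_coc(name):
    """True if the whole name matches the assumed COC pattern."""
    return COC_PATTERN.fullmatch(name) is not None


def is_valid_report(name):
    """True if the whole name matches the assumed report pattern."""
    return REPORT_PATTERN.fullmatch(name) is not None


def sort_report_names(names):
    """Strip prefixes, then split names into (good, bad) lists."""
    good, bad = [], []
    for raw in names:
        name = strip_job_prefix(raw)
        (good if is_valid_report(name) else bad).append(name)
    return good, bad
```

For example, sort_report_names(["job_12 123456pg1.pdf", "notes.pdf"]) separates the sanitized report name from the name that does not fit either pattern.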

Design Notes

Ghostscript was chosen as an alternative to the previous solution, which used an Automator script. That script, part of Apple's Automator software, used PyPDF2 and was inefficient at reducing PDF size; at times it caused collated PDFs to balloon in size instead of shrinking. Ghostscript also has the advantage of being able to rotate PDFs automatically.

For reference, Automator's PDF script is located at /System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py

Limitations

Due to the automated, numerical nature of the reverse check, the program cannot detect a missing PDF when a report has multiple pages and one of them is missing; it can only flag numbers that it knows should exist but cannot find. For example, if 557000pg1.pdf is present but 557000pg2.pdf is missing, the program has no way of determining that a file is missing. Similarly, WP/SP/QC reports have multiple pages, but the reverse check is not aware of this and has no way of determining how many pages exist per number.
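The limitation is easier to see with a small sketch of a reverse check. It assumes that a COC name such as '123456-460coc.pdf' covers the IDs 123456 through 123460, with the three digits after the dash replacing the last three digits of the first ID. That reading of the naming scheme is an assumption, and expected_ids and missing_ids are hypothetical names, not functions from the script.

```python
import re

RANGE_COC = re.compile(r"(\d{6})[a-d]?-(\d{3})[a-d]?coc\.pdf")
REPORT = re.compile(r"(\d{6})pg[1-9]\.pdf")


def expected_ids(coc_name):
    """Expand a ranged COC name into the list of IDs it should cover."""
    match = RANGE_COC.fullmatch(coc_name)
    if match is None:
        return []
    first = int(match.group(1))
    # Assumption: the 3-digit suffix replaces the last three digits of `first`.
    last = (first // 1000) * 1000 + int(match.group(2))
    if last < first:
        return []  # ranges that roll over a thousand are not handled here
    return list(range(first, last + 1))


def missing_ids(coc_name, pdf_names):
    """Return expected IDs that have no report page at all."""
    present = set()
    for name in pdf_names:
        match = REPORT.fullmatch(name)
        if match:
            present.add(int(match.group(1)))
    return [i for i in expected_ids(coc_name) if i not in present]
```

A check like this flags 557001 when only 557000pg1.pdf and 557002pg1.pdf are found for '557000-002coc.pdf'. It reports nothing when 557000pg1.pdf is present and 557000pg2.pdf is not, because a single page is enough to mark an ID as present.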

Variables

GLOBAL VARIABLES:

  • FIN_REPORTS - Directory to deposit completed reports.
  • REVD_REPORTS - Directory where PDFs (parts of reports) are deposited.
  • AUS_COCS - Directory where Chains of Custody from Austin are kept.
  • CORP_COCS - Directory where Chains of Custody from Corpus are kept.
  • PT_COCS - Directory where PT sample chains are kept.
  • BILLINGS - Directory where copies of completed reports should be delivered. Reports in this directory signal the billings department to go ahead and bill for the work.
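Since several of these directories live on remote filesystems, a check that they are reachable could look roughly like the following. This is a sketch under assumptions: missing_directories is a hypothetical helper, the paths in the usage note are placeholders, and the script may perform its mount check differently.

```python
import os


def missing_directories(directories):
    """Return the names of configured directories that cannot be reached.

    `directories` maps a variable name (e.g. "FIN_REPORTS") to its path.
    An unmounted remote share normally shows up as a missing directory.
    """
    return [
        name for name, path in directories.items()
        if not os.path.isdir(path)
    ]
```

Calling missing_directories({"FIN_REPORTS": "/Volumes/example/reports"}) with a placeholder path returns ["FIN_REPORTS"] when that path is not available, which gives the user a chance to mount the share before collation starts.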

Return values and other variables:

  • bad_names - Variable returned by name_check() function. Used to inform the user that errors were found in the names of some Chains of Custody in one of the COC directories.

  • coc_list - Variable returned by name_check() function. Used by the find_coc() function to search for CoCs matching PDF names.

  • good_pdf_names - Variable returned by strip_chars() function. A sanitized list of PDF names that all match the expected patterns.

  • bad_pdf_names - Variable returned by strip_chars(). A list of PDF names that do not match the expected patterns.

  • pdf_stack - A copy of good_pdf_names, used as a stack. After each run of the collator() function, the stack is reduced in two cases:

    1. When a matching COC is found;
    2. When a back-check is performed from the COC's ranges, all successfully back-checked PDFs are removed from the stack.

  • coc_tuple - A tuple consisting of sets of COCs from each directory. Used by find_coc() function to return the full path of the matching COC.

  • missing_coc_list - A list of COCs that could not be found, but should be present, based on PDF names. Returned by aggregator() function.

  • report_dict - A dictionary keyed by report name. Each value holds the COC path for the report, a list of associated PDFs (generated by back-checking), and a list of missing PDFs, if any:

    'report_name': {'coc': '/path/to/coc.pdf',
                    'pdfs': ['1.pdf', '2.pdf', ...],
                    'missing_pdfs': ['3.pdf', ...]}
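To show how pdf_stack and report_dict relate, here is a minimal sketch of building one report_dict entry while reducing the stack. The function build_report_entry is hypothetical; the real collator() and aggregator() functions may divide this work differently.

```python
def build_report_entry(coc_path, expected_pdfs, pdf_stack):
    """Build one report_dict value and remove the found PDFs from the stack.

    `expected_pdfs` is the list produced by back-checking the COC's range.
    `pdf_stack` is modified in place.
    """
    found = [pdf for pdf in expected_pdfs if pdf in pdf_stack]
    missing = [pdf for pdf in expected_pdfs if pdf not in pdf_stack]
    for pdf in found:
        pdf_stack.remove(pdf)  # successfully back-checked, so pop it
    return {"coc": coc_path, "pdfs": found, "missing_pdfs": missing}


# Usage with placeholder names:
stack = ["1.pdf", "2.pdf", "9.pdf"]
report_dict = {
    "report_name": build_report_entry(
        "/path/to/coc.pdf", ["1.pdf", "2.pdf", "3.pdf"], stack
    )
}
```

After this call, stack holds only "9.pdf", and the entry lists "3.pdf" under missing_pdfs so the user can be warned before deciding whether to proceed.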

Ghostscript usage

Ghostscript is fast and accurate, but it is inflexible about how it takes input. Piping input into gs and passing an array of PDF titles were both attempted, and both failed. So far, the only syntax gs accepts for multiple PDF inputs is to list them on the command line, either explicitly or with a shell wildcard:

<gs script here> pdf1.pdf pdf2.pdf ...

or

<gs script> files*pdf

  • NOTE: Double check that output file locations can be specified.

  • Script from Bash version of PDF_collator:

    gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dAutoRotatePages=/PageByPage -sOutputFile="$FILENAME" ./*.pdf 2>/dev/null;

    • -dBATCH -- Exit after the last file, rather than entering interactive mode and reading PostScript commands.
    • -dNOPAUSE -- Do not pause after each page.
    • -q -- Quiet mode; suppress messages.
    • -sDEVICE=pdfwrite -- Selects the output device Ghostscript should use. Here, the output device is the PDF writer.
    • -dAutoRotatePages=/PageByPage -- Rotates each page individually, based on the predominant orientation of the text on that page.
    • -sOutputFile=$FILENAME -- Designates the file to write to. The -o option is a shorthand that sets the output file and also implies -dBATCH and -dNOPAUSE.
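From Python, the same command can be assembled as an argument list and run with subprocess. The sketch below mirrors the Bash invocation above; build_gs_command and collate are hypothetical names, and it assumes that a gs executable is on the PATH.

```python
import subprocess


def build_gs_command(output_file, input_files, extra_options=()):
    """Assemble the gs argument list for collating PDFs into one file."""
    if not input_files:
        raise ValueError("at least one input PDF is required")
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=pdfwrite",
        "-dAutoRotatePages=/PageByPage",
        *extra_options,
        f"-sOutputFile={output_file}",
        *input_files,  # inputs are listed explicitly, in order
    ]


def collate(output_file, input_files):
    """Run gs; raises CalledProcessError if gs exits with an error."""
    command = build_gs_command(output_file, input_files)
    subprocess.run(command, check=True, stderr=subprocess.DEVNULL)
```

Because the arguments are passed as a list, no shell is involved and a pattern such as ./*.pdf is not expanded. The caller would need to expand it first, for example with sorted(glob.glob("*.pdf")).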

The following options are compression-related, but haven't been tested yet:

  • -dEmbedAllFonts=true -- Ensures that the fonts used in creating the PDF are used by whoever views the PDF. A full copy of the entire character set is embedded (INCREASES SIZE).
  • -dSubsetFonts=false -- Controls font subsetting. When true, only the characters that are displayed in the PDF are embedded; setting it to false embeds the full fonts.
  • -dPDFSETTINGS -- Selects a quality preset:
    • /screen -- screen-view quality only (72 dpi)
    • /ebook -- low quality (150 dpi images)
    • /printer -- high quality (300 dpi images)
    • /prepress -- high quality (300 dpi images, preserves colors)
    • /default -- almost the same as /screen
  • -dOPTIMIZE=true --
  • -dCompatibilityLevel=x.x -- Adobe's PDF specification...
    • 1.4 -- for font embedding
    • 1.6 -- for OpenType font embedding
  • -dAutoFilterColorImages=false --
  • -dColorImageFilter=/FlateEncode -- lossless compression?

Usage

PDF_collator.py [OPTIONS] -- Collates reviewed data PDFs with their matching Chain of Custody files. Final reports are filed for both billing and delivery to clients.

Options still need to be implemented in the code and documented. Cleanup is no longer needed, because the hidden and temporary folders from the Bash version have been removed.

To do:

  • Logging

  • Cleanup of files after successful collation
    • How do we know it worked?
    • Move target files to the trash

  • Compression stats and report generation at end
    • Function should take a file and return its size (KB/MB/GB) -- both of the below return size in bytes

  • Testing
    • Fix tests (some fail due to temporary folder mechanics)
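For the compression stats item, a size helper might look like the sketch below. It is not part of the script: human_size and file_size are hypothetical names, and the choice of 1024-byte units and one decimal place is an assumption.

```python
import os


def human_size(num_bytes):
    """Format a byte count as B, KB, MB, or GB (1024-based units)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        # Stop at the first unit that fits, or at GB as the largest unit.
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024


def file_size(path):
    """Return the size of the file at `path` as a readable string."""
    return human_size(os.path.getsize(path))
```

Comparing file_size() for the inputs and for the collated output would give the before and after figures for a compression report.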
