Author: Michaela Macková ([email protected])
The developed applications implement a tool that checks for typographical errors that often occur in theses and analyzes their content for a quick evaluation of their length. Two applications were created: a web application and a command-line executable. Both were developed in Python and use the PyMuPDF library to process PDF documents.
This program was created as part of a bachelor's thesis and was later extended as Project Practice.
The developed web application is freely available at https://theseschecker.eu.pythonanywhere.com/.
The input of the web application is a PDF file containing the technical report that will be checked and analyzed. The output is the same file with graphical indications of any identified mistakes, together with information about the text and pictures of the document. The edited PDF is displayed directly on the web page, with errors marked using PDF annotations.
The input to the command-line application is one or more PDF files, each containing a technical report that will be checked. Using the available arguments, you can set which checks are performed and whether embedded PDF files (located inside the PDF documents) are treated as images during the checks. The output consists of the provided PDF files with any identified mistakes marked using PDF annotations, and a text file for each input PDF file containing information about its text and pictures.
If you want to learn more and know Czech, you can read my thesis at https://www.vut.cz/studenti/zav-prace/detail/144733.
Abstract:
The main goal of this work is to create an application that checks technical reports and marks all the found errors with PDF annotations. The technical documentation of this thesis breaks down the structure of a PDF file, commonly found mistakes in graduate theses, web development using the Django framework and discusses existing libraries for editing PDF documents. The resulting application is implemented in Python and is accessible as a web tool with the help of the Django framework. The developed solution recognizes six mostly typographical errors frequently found in graduate theses. The mistakes found are visually marked and the edited PDF file is then displayed directly on the web page. The resulting tool is freely available and helps students and supervisors to correct the technical reports the students create.
Before running the program, you must install all the packages on which the application depends. For easier installation, requirements files listing the packages and their versions are provided for use with the pip install command. These files are stored in the root folder. Both versions of Theses Checker (the web tool and the command-line program) were developed in Python 3.10. Other versions have not been tested.
To install dependencies for the web tool, use the following command:
> pip install -r requirements_web.txt
To install dependencies for the command-line executable, use the following command:
> pip install -r requirements.txt
After installing the dependencies, you’ll need to make a few adjustments before using either application for the first time.
- For the web application to work properly, the theses_checker.py, chapter_info.py, chapter_info_advanced.py and standard_pages.py files must be located in the src\web\theses_checker\bl\ folder (their original location).
- The next step is creating a .env file in the src\web\ folder. This file contains:
  - [required] The secret key, which should be set before this application is published. The file must contain a line starting with SECRET_KEY= followed by the newly generated secret key. The example below contains the base value of the secret key, but this value must be changed manually to maintain security. The secret key can be generated, for example, at Djecrety.
  - [required] The debug configuration, which should be set to False in production. The file must contain a line starting with DEBUG= followed by True or False. This variable specifies whether the application runs in development mode (True) or production mode (False). (Static files such as style.css and script.js may not function correctly in production mode on the local server.)
  - [required] The name of the operating system on which this tool is running. The file must contain a line starting with OPERATING_SYSTEM= followed by either Windows or Linux. Other values are not supported.
  - The maximum storage space available (in bytes) for the whole repository. The file must contain a line starting with MAX_STORAGE_SPACE= followed by a number. If it is not stated in the .env file, the maximum storage space is determined by the system. (WARNING: Linux only; ignored on Windows.)
- The tool creates and stores new PDF and JSON files. For the strategies developed to delete these files, see section "4. For web server with small storage space".
.env file examples:
SECRET_KEY=django-insecure-8%7#%6m22)=2**4c50n1h-&_!z_&3os6r+0g3_0eofna9mlkx*
DEBUG=False
OPERATING_SYSTEM=Windows

SECRET_KEY=django-insecure-8%7#%6m22)=2**4c50n1h-&_!z_&3os6r+0g3_0eofna9mlkx*
DEBUG=True
OPERATING_SYSTEM=Linux
MAX_STORAGE_SPACE=536870912000
Note: Do not enclose values in quotation marks in the .env file.
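The .env rules described above can be checked before starting the server. The following is a minimal sketch, not the tool's own settings loader: the function names parse_env and generate_secret_key are hypothetical, and the parser only assumes plain KEY=value lines without quotation marks, as the note above requires.

```python
import secrets

REQUIRED_KEYS = ("SECRET_KEY", "DEBUG", "OPERATING_SYSTEM")
SUPPORTED_SYSTEMS = ("Windows", "Linux")


def generate_secret_key(length: int = 50) -> str:
    """Return a random URL-safe string usable as a replacement secret key."""
    return secrets.token_urlsafe(length)


def parse_env(text: str) -> dict:
    """Parse KEY=value lines and validate them against the rules above."""
    raw = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, because a secret key may contain "=".
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Missing '=' in line: {line!r}")
        raw[key.strip()] = value.strip()

    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    if raw["DEBUG"] not in ("True", "False"):
        raise ValueError("DEBUG must be True or False")
    if raw["OPERATING_SYSTEM"] not in SUPPORTED_SYSTEMS:
        raise ValueError("OPERATING_SYSTEM must be Windows or Linux")

    config = {
        "SECRET_KEY": raw["SECRET_KEY"],
        "DEBUG": raw["DEBUG"] == "True",
        "OPERATING_SYSTEM": raw["OPERATING_SYSTEM"],
        "MAX_STORAGE_SPACE": None,  # None means "determined by the system"
    }
    if "MAX_STORAGE_SPACE" in raw:
        config["MAX_STORAGE_SPACE"] = int(raw["MAX_STORAGE_SPACE"])
    return config
```

Calling parse_env on the text of src\web\.env raises a ValueError for a missing required line or an unsupported value, and generate_secret_key() produces a value that can replace the base secret key from the examples.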
- In order for the command-line program to work properly, the theses_checker.py, chapter_info.py, chapter_info_advanced.py and standard_pages.py files must be located in the %CMD%\theses_checker_package\ folder.
- These files are originally located in the src\web\theses_checker\bl\ folder.
- For easier use (while following the originally set hierarchy), the copy_theses_checker_package.ps1 or copy_theses_checker_package.sh scripts can be used to copy these files. These scripts are located inside the src\cmd\ folder.
- Next, ensure that the check.py file is inside the %CMD%\ folder.
- The %CMD% path originally represents src\cmd\.
Expected file hierarchy:
%CMD%
├── theses_checker_package
│ ├── __init__.py
│ ├── chapter_info_advanced.py
│ ├── chapter_info.py
│ ├── standard_pages.py
│ └── theses_checker.py
├── __init__.py
├── check.py
├── copy_theses_checker_package.ps1 [optional]
└── copy_theses_checker_package.sh [optional]
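The copy step that produces this hierarchy can also be sketched in Python. This is an illustrative equivalent written from the description above, not a transcription of the provided .ps1 and .sh scripts; the function name copy_package_files is hypothetical.

```python
import shutil
from pathlib import Path

# The four modules named in the setup steps above.
PACKAGE_FILES = (
    "theses_checker.py",
    "chapter_info.py",
    "chapter_info_advanced.py",
    "standard_pages.py",
)


def copy_package_files(source_dir: Path, cmd_dir: Path) -> list:
    """Copy the checker modules into cmd_dir/theses_checker_package."""
    target = cmd_dir / "theses_checker_package"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in PACKAGE_FILES:
        source = source_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"Expected file is missing: {source}")
        # copy2 also preserves file metadata such as modification time.
        copied.append(Path(shutil.copy2(source, target / name)))
    return copied
```

With the original layout, the call would be copy_package_files(Path("src/web/theses_checker/bl"), Path("src/cmd")), run from the repository root.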
To start the server locally for the web tool, use this command (used primarily for debugging purposes):
> python manage.py runserver
To execute the command-line program, use the following command:
> python check.py [ARG]… in_file [in_file]…
Command description: Creates a new PDF file named '*_annotated.pdf' in the folder where this program is saved. If no check flag is given, all checks are performed.
Available arguments are:
- -h or --help
- --embedded_PDF - if used, embedded PDF files will be treated as part of the PDF; otherwise, they will be considered as images
- -o or --overflow - performs overflow check
- -i or --image_width - performs image width check
- -H or --Hyphen - performs hyphen check
- -t or --TOC - performs table of contents section check
- -s or --space_bracket - performs space before left bracket check
- -e or --empty_chapter - performs text between titles check
- -b or --bad_reference - performs bad reference check (finding '??' in text - usually found in PDFs exported from LaTeX)
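To give a feel for what two of these checks look for, here is a minimal sketch that works on plain text only. It is not the tool's implementation, which operates on PDF pages through PyMuPDF. The function name find_issues is hypothetical, and the rule used for the bracket check (a missing space directly before a left bracket) is an assumption about what the check flags.

```python
import re

# '??' is what LaTeX typically leaves in place of an unresolved reference.
BAD_REFERENCE = re.compile(r"\?\?")
# Assumed rule: a non-space, non-bracket character directly followed by "(".
MISSING_SPACE = re.compile(r"[^\s(]\(")


def find_issues(pages):
    """Return (page_number, kind, offset) tuples for a list of page texts."""
    issues = []
    for number, text in enumerate(pages, start=1):
        for match in BAD_REFERENCE.finditer(text):
            issues.append((number, "bad_reference", match.start()))
        for match in MISSING_SPACE.finditer(text):
            # match.start() + 1 is the position of the bracket itself.
            issues.append((number, "space_bracket", match.start() + 1))
    return issues
```

A real checker would need exceptions that this sketch ignores, for example function calls such as f(x) in formulas or code listings.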
The application can be used as follows:
> python check.py -h
> python check.py file.pdf
> python check.py file1.pdf file2.pdf file3.pdf
> python check.py -o -t file.pdf file2.pdf
> python check.py file.pdf -H -s -b
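The invocations above follow a conventional flag layout, which could be parsed with argparse as in the sketch below. This is an illustration built from the argument list in this README, not the contents of check.py. The helper names build_parser, selected_checks and annotated_name are hypothetical, and the output name assumes that '*' in '*_annotated.pdf' stands for the input file name without its extension.

```python
import argparse
from pathlib import Path

# (short flag, long flag, description) for every check listed above.
CHECK_FLAGS = (
    ("-o", "--overflow", "overflow check"),
    ("-i", "--image_width", "image width check"),
    ("-H", "--Hyphen", "hyphen check"),
    ("-t", "--TOC", "table of contents section check"),
    ("-s", "--space_bracket", "space before left bracket check"),
    ("-e", "--empty_chapter", "text between titles check"),
    ("-b", "--bad_reference", "bad reference check"),
)


def build_parser():
    """Build a parser mirroring the documented arguments."""
    # argparse adds -h/--help automatically.
    parser = argparse.ArgumentParser(prog="check.py")
    parser.add_argument("--embedded_PDF", action="store_true",
                        help="treat embedded PDF files as part of the PDF")
    for short, long, description in CHECK_FLAGS:
        parser.add_argument(short, long, action="store_true",
                            help=f"performs {description}")
    parser.add_argument("in_file", nargs="+", help="PDF file(s) to check")
    return parser


def selected_checks(args):
    """Return the requested checks, or all of them if no flag was given."""
    names = [long.lstrip("-") for _, long, _ in CHECK_FLAGS]
    chosen = [name for name in names if getattr(args, name)]
    return chosen or names


def annotated_name(in_file):
    """Derive the assumed output file name for one input file."""
    return f"{Path(in_file).stem}_annotated.pdf"
```

For example, build_parser().parse_args(["file.pdf", "-H", "-s", "-b"]) yields a namespace for which selected_checks returns only the hyphen, space before bracket and bad reference checks.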
If the server has limited storage space, two strategies were developed for deleting annotated PDF files:
Note: The periodic deletion script does not run by itself. For it to work, you need to schedule it as a job (for example, as a cron job).
The first strategy uses a script located in the src\web\ folder, named periodicDeleteFiles.sh for Linux systems or periodicDeleteFiles.ps1 for Windows systems. When this script is run, it deletes all PDF files located in the .\files\ or .\static\ folder and all JSON files located in the .\files\json folder that are older than the specified period (originally set to 12 h). To change this period, change the value (in seconds) of the Period variable.
This script can be run on Linux with this command:
$ bash periodicDeleteFiles.sh
This script can be run on Windows with this command:
> powershell '& periodicDeleteFiles.ps1'
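The clean-up logic described above can be sketched in Python as follows. This is an illustration of the behaviour, not a port of periodicDeleteFiles.sh or periodicDeleteFiles.ps1. The function name delete_old_files is hypothetical, and two details are assumptions: file age is taken from the modification time, and folders are not searched recursively.

```python
import time
from pathlib import Path
from typing import Optional

PERIOD = 12 * 60 * 60  # seconds; mirrors the 12 h default described above


def delete_old_files(root: Path, period: float = PERIOD,
                     now: Optional[float] = None) -> list:
    """Delete old PDFs in files/ and static/ and old JSON files in files/json/."""
    now = time.time() if now is None else now
    targets = [
        (root / "files", "*.pdf"),
        (root / "static", "*.pdf"),
        (root / "files" / "json", "*.json"),
    ]
    deleted = []
    for folder, pattern in targets:
        if not folder.is_dir():
            continue
        for path in folder.glob(pattern):
            # Only files older than the period are removed.
            if path.is_file() and now - path.stat().st_mtime > period:
                path.unlink()
                deleted.append(path)
    return deleted
```

Like the shell scripts, such a function does nothing on its own schedule and would still have to be called from a scheduled job, with root pointing at the src\web\ folder.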
Note: With this method, the web tool cannot be used on mobile devices, because the PDF is deleted before the user can download the resource from the server.
The second strategy relies on the user's web browser to store the PDF in temporary storage. The PDF file is sent as (part of) an HTTP response and then deleted from the server.
There are a few steps to set up if you want to use this option:
- in src\web\theses_checker\views.py, uncomment the method view_annotated (lines 67-81)
- in src\web\theses_checker\urls.py, uncomment/add a new path to the urlpatterns list: path('view/<str:pdf_name>', views.view_annotated, name='view_annotated')
- in src\web\templates\theses_checker\annotated.html, set the <iframe> source (src) to the following (as seen in the comment in the file): "{% url 'view_annotated' pdf_name=pdf_name %}". By using this source, there is no need to load static in the template anymore.
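The core of the send-then-delete idea is to read the PDF into memory, remove it from disk and return the bytes in the response. The sketch below shows that step without any framework code. It is not the view_annotated method from views.py; the function name pop_pdf_bytes and the file name check are assumptions made for illustration.

```python
from pathlib import Path


def pop_pdf_bytes(directory: Path, pdf_name: str) -> bytes:
    """Read an annotated PDF into memory, delete it from disk, return its bytes."""
    directory = directory.resolve()
    path = (directory / pdf_name).resolve()
    # Refuse names that escape the storage folder or are not PDF files.
    if path.parent != directory or path.suffix.lower() != ".pdf":
        raise ValueError(f"Invalid PDF name: {pdf_name!r}")
    data = path.read_bytes()  # raises FileNotFoundError if already deleted
    path.unlink()
    return data
```

In a Django view, the returned bytes could be wrapped as HttpResponse(data, content_type="application/pdf"). Because the file no longer exists after the first request, a second request for the same name fails, which matches the limitation for mobile devices noted above.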
Known issues:
- the overflow check doesn't work for two-sided papers (padding on odd pages is different from padding on even pages)
- some files (when the user leaves mid-request?) stay in the static folder
- when an error is thrown during file processing, files stay in the files folder
- in some cases, chapter titles are not recognized