Code review requested for scythe_refactor.py#7
Code review requested for scythe_refactor.py#7ojas-uls-dev wants to merge 23 commits intofirst-run-fixesfrom
Conversation
ctgraham
left a comment
There was a problem hiding this comment.
Here is the read-through of comments and change requests which we discussed earlier. I only looked at scythe_refactor.py and auth.py.
auth.py
Outdated
| page = httpx.get(conf.login_uri, headers=user_agent_headers) | ||
| content = page.text | ||
| if page.status_code == 403: | ||
| print("Most likely Cloudflare issue.") |
There was a problem hiding this comment.
For error messaging, let's print to STDERR or to a logger object.
| # namespaces={'r': et.xpath("namespace-uri()")})[0] | ||
| print(uname_form_name, passwd_form_name, form_action_uri) | ||
|
|
||
| post_ret = requests.post(requests.compat.urljoin(conf.login_uri,form_action_uri), data={uname_form_name: uname, |
There was a problem hiding this comment.
A bespoke POST here is highly risky. It doesn't use the same User Agent string as the GET, and it doesn't handle CSRF or form tokens. Using a tool like BeautifulSoup would be more robust.
There was a problem hiding this comment.
Re-read through Beautiful Soup documentation and code, and it does not seem to provide functionality to fetch urls. For complex interactions with a webpage, we might use Selenium which would basically run a headless browser instance.
There was a problem hiding this comment.
Yes, Beautiful Soup won't interact with the request/response, but it could provide a framework to interrogate the form to find the fields which will be part of the form submission, e.g.:
https://www.scaler.com/topics/cyber-security/attack-webforms-beautifulsoup-requests-python/
Selenium would provide a richer (and heavier) framework. A primary distinction would be the support for javascript. It may be shortsighted, but I would lean toward the lighter approach. Form authentication is a secondary feature of this tool, and adding all of Selenium feels like a heavy set of dependencies.
scythe_refactor.py
Outdated
| except (FileNotFoundError, ValueError): | ||
| print("State file not found or invalid format. " | ||
| "Defaulting to yesterday.") | ||
| last_run_date = date.today() |
There was a problem hiding this comment.
If no prior run was detected, the script should start from the beginning of the archive (omiting from= in the OAI-PMH request).
scythe_refactor.py
Outdated
| cookie -> a login form is submitted to a user provided website, | ||
| and response is set as header | ||
| """ | ||
| if config["login_type"] == "basic": |
There was a problem hiding this comment.
What if both authentications are needed? That is the state of a Hyku instance which is not yet public.
There was a problem hiding this comment.
It is also plausible that no authentication is needed for a public endpoint.
| """ | ||
| records = scythe_client.list_records( | ||
| metadata_prefix = metadata_format, | ||
| from_ = config["last_run"], until = config["today"]) |
There was a problem hiding this comment.
Is the underscore on from here deliberate? Perhaps to ignore parameter for testing?
There was a problem hiding this comment.
_ is to disambiguate from keyword from as in from lib import name. It is a part of the python library interacting with the oaipmh endpoint
scythe_refactor.py
Outdated
| pathjoin = os.path.join | ||
|
|
||
| identifier = record.header.identifier | ||
| identifier = identifier.replace(":", "_") |
There was a problem hiding this comment.
Is this replacement just for Windows compatibility?
| ) | ||
| metadata_file_path = pathjoin(record_path, | ||
| f"{identifier}.{metadata_format}") | ||
| if not os.path.exists(record_path): |
There was a problem hiding this comment.
Why clear the directory here, and introduce the update tracking, rather than just clearing the directory in main() at the point of moving from record to record?
This branch has been through a lot. But scythe_refactor.py represents a feature complete way to extract data from an OAI-PMH endpoint to Filesystem. Review requested for scythe_refactor.py and auth.py.