Skip to content

Code review requested for scythe_refactor.py#7

Open
ojas-uls-dev wants to merge 23 commits intofirst-run-fixesfrom
rewrite
Open

Code review requested for scythe_refactor.py#7
ojas-uls-dev wants to merge 23 commits intofirst-run-fixesfrom
rewrite

Conversation

@ojas-uls-dev
Copy link

@ojas-uls-dev ojas-uls-dev commented Feb 16, 2026

This branch has been through a lot. But scythe_refactor.py represents a feature complete way to extract data from an OAI-PMH endpoint to Filesystem. Review requested for scythe_refactor.py and auth.py.

Copy link
Member

@ctgraham ctgraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the read-through of comments and change requests which we discussed earlier. I only looked at scythe_refactor.py and auth.py.

auth.py Outdated
page = httpx.get(conf.login_uri, headers=user_agent_headers)
content = page.text
if page.status_code == 403:
print("Most likely Cloudflare issue.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For error messaging, let's print to STDERR or to a logger object.

# namespaces={'r': et.xpath("namespace-uri()")})[0]
print(uname_form_name, passwd_form_name, form_action_uri)

post_ret = requests.post(requests.compat.urljoin(conf.login_uri,form_action_uri), data={uname_form_name: uname,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bespoke POST here is highly risky. It doesn't use the same User Agent string as the GET, and it doesn't handle CSRF or form tokens. Using a tool like BeautifulSoup would be more robust.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-read through Beautiful Soup documentation and code, and it does not seem to provide functionality to fetch urls. For complex interactions with a webpage, we might use Selenium which would basically run a headless browser instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Beautiful Soup won't interact with the request/response, but it could provide a framework to interrogate the form to find the fields which will be part of the form submission, e.g.:
https://www.scaler.com/topics/cyber-security/attack-webforms-beautifulsoup-requests-python/

Selenium would provide a richer (and heavier) framework. A primary distinction would be the support for javascript. It may be shortsighted, but I would lean toward the lighter approach. Form authentication is a secondary feature of this tool, and adding all of Selenium feels like a heavy set of dependencies.

except (FileNotFoundError, ValueError):
print("State file not found or invalid format. "
"Defaulting to yesterday.")
last_run_date = date.today()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no prior run was detected, the script should start from the beginning of the archive (omiting from= in the OAI-PMH request).

cookie -> a login form is submitted to a user provided website,
and response is set as header
"""
if config["login_type"] == "basic":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if both authentications are needed? That is the state of a Hyku instance which is not yet public.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also plausible that no authentication is needed for a public endpoint.

"""
records = scythe_client.list_records(
metadata_prefix = metadata_format,
from_ = config["last_run"], until = config["today"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the underscore on from here deliberate? Perhaps to ignore parameter for testing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ is to disambiguate from keyword from as in from lib import name. It is a part of the python library interacting with the oaipmh endpoint

pathjoin = os.path.join

identifier = record.header.identifier
identifier = identifier.replace(":", "_")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this replacement just for Windows compatibility?

)
metadata_file_path = pathjoin(record_path,
f"{identifier}.{metadata_format}")
if not os.path.exists(record_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why clear the directory here, and introduce the update tracking, rather than just clearing the directory in main() at the point of moving from record to record?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants