Add User-Agent Header to Jsoup Connections in Transforms #1668
Description
Transforms currently fetch a source's URL without specifying a User-Agent header. This small PR adds .userAgent("Mozilla") to the line that fetches the Document for the URL through the Jsoup connection. I hardcoded the value because the same practice is used elsewhere in the codebase. This could be improved by allowing the user-agent to be specified in the transform's config.
Purpose
When fetching sources in transforms, some servers may reject the request (e.g. with 403 Forbidden) because the User-Agent header is missing. To fix this, the user-agent is set to "Mozilla" on the Jsoup connection before the page is fetched.
This allows roundabout loading to work for sources that block requests with a missing User-Agent header.*
*Assuming they accept "Mozilla" as a valid User-Agent value. The source I'm using does.
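As a rough illustration of the configurable user-agent idea from the description, the sketch below resolves the value from a transform's config and falls back to the "Mozilla" value this PR hardcodes. The class name UserAgentResolver, the config key "userAgent", and the use of a plain string map are all assumptions made for illustration; none of them exist in this PR or necessarily match the project's config schema.

```java
import java.util.Map;

// Hypothetical helper, not part of this PR: picks the User-Agent for a transform.
final class UserAgentResolver {
    // The same value this PR hardcodes.
    static final String DEFAULT_USER_AGENT = "Mozilla";
    // Assumed config key; the real transform config may use a different name.
    static final String CONFIG_KEY = "userAgent";

    private UserAgentResolver() {}

    // Returns the configured user-agent, or the default when it is missing or blank.
    static String resolve(Map<String, String> transformConfig) {
        String configured = transformConfig.get(CONFIG_KEY);
        if (configured == null || configured.isBlank()) {
            return DEFAULT_USER_AGENT;
        }
        return configured.trim();
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of()));
        System.out.println(resolve(Map.of(CONFIG_KEY, "MyCrawler/1.0")));
        // With Jsoup on the classpath, the resolved value would be passed as:
        // Document doc = Jsoup.connect(url).userAgent(resolve(config)).get();
    }
}
```

Falling back to the default keeps existing transforms behaving exactly as they do with the hardcoded value, while letting a source that rejects "Mozilla" be given a different string.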
Relevant Issue(s)
N/A (not sure if I should have created an issue first)