Skip to content

Add artworks parser #316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

mattseddon
Copy link

@mattseddon mattseddon commented Apr 6, 2025

This is my submission for the coding challenge.

Approach

I have used your nokolexbor parser and CSS classes to identify the artworks section within the HTML. As some images appear to be lazy loaded, I first search for them within the document (using script tags containing lazy loaded images based on regex) and generate a hash of {id => img}. I then parse the artworks out of the HTML and use the hash contents as a priority before falling back to the data-src attribute of each img. The rest of the required information seemed readily available in the HTML.

I took this approach instead of using a headless browser to load the images (i.e. run the embedded JavaScript) because I feel it is more efficient and it avoids adding another external dependency to the project. We would need a benchmark for the first assumption if this was production code (instead of a coding challenge).

The main drawback I see to this approach is that it may be brittle. It has been a long time since I have done any scraping. I am not sure how often CSS classes will change on these page types and/or how often the underlying JavaScript gets updated. Changes to either will break the parser. I would definitely want a test directly against Google running often to confirm that everything is still working.

Things that are missing/next steps:

  • pre-commit hooks
  • CI
  • make it work against real urls.

Additional info

This is the first time (ever) that I've written Ruby, I used rbenv and rubocop for the project. I took on board most of the "suggestions" from rubocop due to my lack of experience with the language. I contemplated adding sorbet but ultimately decided against it because I could not see any trace in nokolexbor.

I have left some comments inline.

Thanks for taking the time to review.

ARTWORK_CSS_CLASS = '.iELo6'
EXTENSION_CSS_CLASS = '.cxzHyb'
NAME_CSS_CLASS = '.pgNMRc'
IMG_CSS_CLASS = '.taFZJe'
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[C] I have these hoisted up here so that they are easy to identify and change.

end

def decode_hex(str)
str.gsub(/\\x([0-9A-Fa-f]{2})/) { [Regexp.last_match(1)].pack('H2') }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[C] There may be a language/built-in library feature that can do this more elegantly but I was unable to find it.

describe GoogleArtworksParser do
subject { described_class.parse }

fixtures.each do |fixture_name, fixture|
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[C] I leaned on the relevant tests here.

I'm not too sure about generating tests via a loop. Rather than second-guess myself, I put the PR up. We can discuss whether or not this is standard practice in Ruby.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. We are setting up our tests this way.

@mattseddon mattseddon force-pushed the seddon/add-google-artworks-parser branch from c07f258 to a9a6969 Compare April 6, 2025 04:43
@mattseddon mattseddon force-pushed the seddon/add-google-artworks-parser branch from a9a6969 to dcdaf5f Compare April 6, 2025 04:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants