Add artworks parser #316

mattseddon · 2025-04-06T00:57:17Z

This is my submission for the coding challenge.

Approach

I have used your nokolexbor parser and CSS classes to identify the artworks section within the HTML. As some images appear to be lazy loaded, I first search for them within the document (using script tags containing lazy loaded images based on regex) and generate a hash of {id => img}. I then parse the artworks out of the HTML and use the hash contents as a priority before falling back to the data-src attribute of each img. The rest of the required information seemed readily available in the HTML.

I took this approach instead of using a headless browser to load the images (i.e. run the embedded JavaScript) because I feel it is more efficient and it avoids adding another external dependency to the project. We would need a benchmark for the first assumption if this was production code (instead of a coding challenge).

The main drawback I see to this approach is that it may be brittle. It has been a long time since I have done any scraping. I am not sure how often CSS classes will change on these page types and/or how often the underlying JavaScript gets updated. Changes to either will break the parser. I would definitely want a test directly against Google running often to confirm that everything is still working.

Things that are missing/next steps:

pre-commit hooks
CI
make it work against real urls.

Additional info

This is the first time (ever) that I've written Ruby, I used rbenv and rubocop for the project. I took on board most of the "suggestions" from rubocop due to my lack of experience with the language. I contemplated adding sorbet but ultimately decided against it because I could not see any trace in nokolexbor.

I have left some comments inline.

Thanks for taking the time to review.

…an gogh html

mattseddon · 2025-04-06T00:58:24Z

lib/google_artworks_parser.rb

+    ARTWORK_CSS_CLASS = '.iELo6'
+    EXTENSION_CSS_CLASS = '.cxzHyb'
+    NAME_CSS_CLASS = '.pgNMRc'
+    IMG_CSS_CLASS = '.taFZJe'


[C] I have these hoisted up here so that they are easy to identify and change.

mattseddon · 2025-04-06T00:59:04Z

lib/google_artworks_parser.rb

+    end
+
+    def decode_hex(str)
+      str.gsub(/\\x([0-9A-Fa-f]{2})/) { [Regexp.last_match(1)].pack('H2') }


[C] There may be a language/built-in library feature that can do this more elegantly but I was unable to find it.

mattseddon · 2025-04-06T01:01:55Z

spec/google_artworks_parser_spec.rb

+describe GoogleArtworksParser do
+  subject { described_class.parse }
+
+  fixtures.each do |fixture_name, fixture|


[C] I leaned on the relevant tests here.

I'm not too sure about generating tests via a loop. Rather than second-guess myself, I put the PR up. We can discuss whether or not this is standard practice in Ruby.

No worries. We are setting up our tests this way.

implement google artworks parser that correctly parses the provided v…

4c32de2

…an gogh html

mattseddon commented Apr 6, 2025

View reviewed changes

mattseddon force-pushed the seddon/add-google-artworks-parser branch from c07f258 to a9a6969 Compare April 6, 2025 04:43

add extra fixtures and rework tests

dcdaf5f

mattseddon force-pushed the seddon/add-google-artworks-parser branch from a9a6969 to dcdaf5f Compare April 6, 2025 04:46

remove unused dependency from gemfile lock file

d3fc6de

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add artworks parser #316

Add artworks parser #316

mattseddon commented Apr 6, 2025 •

edited

Loading

mattseddon Apr 6, 2025

mattseddon Apr 6, 2025

mattseddon Apr 6, 2025

andypple83 Apr 18, 2025

Add artworks parser #316

Are you sure you want to change the base?

Add artworks parser #316

Conversation

mattseddon commented Apr 6, 2025 • edited Loading

Approach

Things that are missing/next steps:

Additional info

mattseddon Apr 6, 2025

Choose a reason for hiding this comment

mattseddon Apr 6, 2025

Choose a reason for hiding this comment

mattseddon Apr 6, 2025

Choose a reason for hiding this comment

andypple83 Apr 18, 2025

Choose a reason for hiding this comment

mattseddon commented Apr 6, 2025 •

edited

Loading