MVP Attribute Extractor with test file #323
Open
Summary
This PR completes (most of) the code challenge by extracting structured data from the provided HTML file using Nokogiri. The goal was to return an array of hashes containing artwork details from a specific section of the Google search results page. My solution has limitations, which I have listed below the example output section, as well as ways that it can be improved.
My primary goal was to return the correct array of structured data as requested while writing as little code and infrastructure as necessary. I intentionally avoided overengineering in order to keep the solution simple, focused, and easy to follow.
This meant:
If the problem were to be expanded or reused across multiple result types or inputs, I would consider refactoring for flexibility and testability. But for this case, I chose to keep it lean and direct.
What I Did
div containing artwork images and metadata
name: Combined title text
extension: Extracted from structured divs
link: Google search link from the a href tag
image: Base64 from the src attribute within the img tag
Example Usage
Notes
The image URLs provided in the img tags are truncated; further work is needed to find the full Base64 data elsewhere in the file.
The current solution works for the basic structure of the provided results page, but it is not tested against variations or alternate result layouts.
Some logic is currently hardcoded (e.g., targeting specific class names or content structure), and could be improved by dynamically parsing surrounding elements or using more flexible selectors.
Whitespace and text normalization were necessary due to the heavily indented source file; .text.strip.gsub(/\s+/, ' ') was used to clean values.
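To illustrate the normalization mentioned in the last note, here is a small sketch of .text.strip.gsub(/\s+/, ' ') applied to a plain string. The helper name normalize_text and the sample input are hypothetical.

```ruby
# Trim both ends, then collapse every run of whitespace (spaces, tabs,
# newlines) into a single space. `normalize_text` is a hypothetical name.
def normalize_text(raw)
  raw.strip.gsub(/\s+/, ' ')
end

# Sample input shaped like text pulled from a heavily indented HTML node
raw = "\n        The Starry\n            Night\n      "
puts normalize_text(raw)
# prints: The Starry Night
```

One caveat worth checking against the real file: in Ruby, \s and strip do not cover the non-breaking space (U+00A0) that HTML often contains, so /[[:space:]]+/ may be a safer pattern if such characters show up.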
Opportunities for Improvement
Dynamically detect section boundaries (e.g., using heading labels instead of fixed class traversal)
Improve robustness by generalizing selectors or adding structure detection
Find and extract the full image data from the file
Add tests for parsing variations and edge cases if more examples are available
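As a rough sketch of the full-image item above: if the complete data URI turns out to be assigned in an inline script next to the id of the img it belongs to, a regex pass over the script text could build an id-to-image lookup. The script shape shown here (var s='...';var ii=['...']), the setSrc call, the ids, and the method name full_images_by_id are all assumptions for illustration and have not been verified against the provided file.

```ruby
# Hypothetical inline script content. The `var s=...;var ii=[...]` shape
# and the ids are assumptions, not taken from the provided file.
SAMPLE_SCRIPT = <<~'JS'
  (function(){var s='data:image/jpeg;base64,AAAABBBBCCCC\x3d';var ii=['img_0'];setSrc(ii,s);})();
  (function(){var s='data:image/jpeg;base64,DDDDEEEE';var ii=['img_1'];setSrc(ii,s);})();
JS

# Builds a hash of { img id => full data URI } from script text.
def full_images_by_id(script_text)
  pattern = /var s='(data:image\/[^']+)';\s*var ii=\['([^']+)'\]/
  script_text.scan(pattern).each_with_object({}) do |(data_uri, id), map|
    # Undo JavaScript hex escapes such as \x3d (an escaped "=")
    map[id] = data_uri.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
  end
end

images = full_images_by_id(SAMPLE_SCRIPT)
puts images.keys.inspect
```

With a lookup like this, the image field could be filled by the img element's id rather than its truncated src, falling back to src when no match exists.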