MVP Attribute Extractor with test file #323

mary-hopkins · 2025-04-23T17:48:17Z

Summary

This PR completes (most of) the code challenge by extracting structured data from the provided HTML file using Nokogiri. The goal was to return an array of hashes containing artwork details from a specific section of the Google search results page. My solution has limitations, which I have listed below the example output section, as well as ways that it can be improved.

My primary goal was to return the correct array of structured data as requested while writing as little code and infrastructure as possible. I intentionally avoided overengineering in order to keep the solution simple, focused, and easy to follow.

This meant:

- Using a single-pass parser without building a large class structure
- Relying on clear, minimal logic rather than general-purpose utilities or DSLs
- Prioritizing readability and correctness for the specific challenge scope

If the problem were to be expanded or reused across multiple result types or inputs, I would consider refactoring for flexibility and testability. But for this case, I chose to keep it lean and direct.

What I Did

- Parsed the provided HTML file using Nokogiri
- Located the div containing artwork images and metadata
- Extracted and normalized the following fields for each artwork:
  - name: Combined title text
  - extensions: Extracted from structured divs
  - link: Google search link from the href attribute of the a tag
  - image: Base64 data URI from the src attribute of the img tag
- Cleaned and standardized whitespace and formatting in text fields
- Returned a final array of hashes, matching the expected output shape

Example Usage

# Pull the project and run bundle install to get Nokogiri and RSpec working. 

# Make sure the necessary files are referenced correctly (JSON.parse needs the json library):
require 'json'
html_file = File.open(File.expand_path('../../files/van-gogh-paintings.html', __FILE__))
expected_array = JSON.parse(File.read(File.expand_path('../../files/expected-array.json', __FILE__)), symbolize_names: true)

# Initialize the Extractor with the HTML file and then call the method to get the array
extractor = AttributeExtractor.new(html_file: html_file)
json_of_artworks = extractor.get_artworks_attributes_hash

# Expected Output
json_of_artworks = [
  {
    :name=>"The Starry Night", 
    :extensions=>["1889"],
    :link=>"/search?sca_esv=c2e426814f4d07e9&gl=us&hl=en&q=The+Starry+Night&stick=H4sIAAAAAAAAAONgFuLQz9U3MI_PNVLiBLFMzC3jC7WUspOt9Msyi0sTc-ITi0qQmJnFJVbl-UXZxYtYBUIyUhWCSxKLiioV_DLTM0oAdKX0-E4AAAA&sa=X&ved=2ahUKEwjK-K-JwLWKAxXcQTABHePpOFoQtq8DegQIMxAD", 
    :image=>"data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
  },
  # ...
]

Notes

- The image data URIs in the provided img tags are truncated; further work is needed to find the full Base64 data elsewhere in the file.
- The current solution works for the basic structure of the provided results page, but it is not tested against variations or alternate result layouts.
- Some logic is currently hardcoded (e.g., targeting specific class names or content structure) and could be improved by dynamically parsing surrounding elements or using more flexible selectors.
- Whitespace and text normalization were necessary due to the heavily indented source file; .text.strip.gsub(/\s+/, ' ') was used to clean values.
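
To make the normalization step concrete, here is a small sketch of the strip and gsub chain applied to a plain string. The helper names normalize_text and normalize_unicode_text are hypothetical. The second helper is an assumption on my part rather than something the current solution does: Ruby's \s only matches ASCII whitespace, so if non-breaking spaces ever appear in the source, a [[:space:]] variant may be needed.

```ruby
# Same chain as used in the extractor: trim the ends, then collapse runs of
# ASCII whitespace (spaces, tabs, newlines) into a single space.
def normalize_text(raw)
  raw.strip.gsub(/\s+/, ' ')
end

# Hypothetical variant: [[:space:]] also matches Unicode whitespace such as
# the non-breaking space (U+00A0), which \s and String#strip leave alone.
def normalize_unicode_text(raw)
  raw.gsub(/[[:space:]]+/, ' ').strip
end

raw_title = "\n        The Starry\n          Night\n      "
puts normalize_text(raw_title)

raw_artist = "\u00A0Vincent\u00A0van  Gogh\n"
puts normalize_unicode_text(raw_artist)
```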

Opportunities for Improvement

- Dynamically detect section boundaries (e.g., using heading labels instead of fixed class traversal)
- Improve robustness by generalizing selectors or adding structure detection
- Find and extract the full Base64 image data from the file
- Add tests for parsing variations and edge cases if more examples are available
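
One possible direction for recovering the full image data is sketched below. Everything in it is an assumption: I have not confirmed where the full Base64 strings live in the provided file. The sketch supposes they sit in inline script text as quoted data URIs paired with an image id, with = written as the JavaScript escape \x3d. The sample script, the regex, and the helper name full_images_by_id are all hypothetical.

```ruby
# Hypothetical inline script text; variable names and layout are assumptions.
SAMPLE_SCRIPT = <<~'JS'
  (function(){var s='data:image/jpeg;base64,/9j/4AAQSkZJRg\x3d\x3d';var ii=['img_1'];_setImagesSrc(ii,s);})();
  (function(){var s='data:image/jpeg;base64,/9j/2wBDAAYEBQ\x3d';var ii=['img_2'];_setImagesSrc(ii,s);})();
JS

# Captures the quoted data URI and the image id that follows it.
DATA_URI_BY_ID = /var s='(data:image\/[a-z]+;base64,[^']+)';var ii=\['([^']+)'\]/

# Returns a hash mapping image id => full data URI.
def full_images_by_id(script_text)
  script_text.scan(DATA_URI_BY_ID).to_h do |data_uri, image_id|
    # Decode JavaScript-style \xNN escapes (e.g. \x3d becomes "=").
    decoded = data_uri.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
    [image_id, decoded]
  end
end

images = full_images_by_id(SAMPLE_SCRIPT)
puts images.keys.inspect
```

If the real file follows a similar pattern, the id on each img tag could be used to look up its full data URI in this hash; if it does not, the regex would need to be adapted to whatever structure the file actually uses.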

semi working solution with test

16009c7

andypple83 closed this Apr 28, 2025
