MVP Attribute Extractor with test file #323

Open
wants to merge 1 commit into
base: master

mary-hopkins

Summary

This PR completes (most of) the code challenge by extracting structured data from the provided HTML file using Nokogiri. The goal was to return an array of hashes containing artwork details from a specific section of the Google search results page. My solution has limitations, which I have listed below the example output section, as well as ways that it can be improved.

My primary goal was to return the correct array of structured data as requested while writing as little code and infrastructure as possible. I intentionally avoided overengineering to keep the solution simple, focused, and easy to follow.

This meant:

  • Using a single-pass parser without building a large class structure
  • Relying on clear, minimal logic rather than general-purpose utilities or DSLs
  • Prioritizing readability and correctness for the specific challenge scope

If the problem were to be expanded or reused across multiple result types or inputs, I would consider refactoring for flexibility and testability. But for this case, I chose to keep it lean and direct.

What I Did

  • Parsed the provided HTML file using Nokogiri
  • Located the div containing artwork images and metadata
  • Extracted and normalized the following fields for each artwork:
    • name: Combined title text
    • extensions: Array of strings (e.g., the year) extracted from structured divs
    • link: Google search link from the href attribute of the a tag
    • image: Base64 data from the src attribute of the img tag
  • Cleaned and standardized whitespace and formatting in text fields
  • Returned a final array of hashes, matching the expected output shape

Example Usage

# Pull the project and run `bundle install` to install Nokogiri and RSpec.

require 'json'

# Make sure the necessary files are referenced correctly:
html_file = File.open(File.expand_path('../../files/van-gogh-paintings.html', __FILE__))
expected_array = JSON.parse(File.read(File.expand_path('../../files/expected-array.json', __FILE__)), symbolize_names: true)

# Initialize the Extractor with the HTML file and then call the method to get the array
extractor = AttributeExtractor.new(html_file: html_file)
json_of_artworks = extractor.get_artworks_attributes_hash

# Expected Output
json_of_artworks = [
  {
    :name=>"The Starry Night", 
    :extensions=>["1889"],
    :link=>"/search?sca_esv=c2e426814f4d07e9&gl=us&hl=en&q=The+Starry+Night&stick=H4sIAAAAAAAAAONgFuLQz9U3MI_PNVLiBLFMzC3jC7WUspOt9Msyi0sTc-ITi0qQmJnFJVbl-UXZxYtYBUIyUhWCSxKLiioV_DLTM0oAdKX0-E4AAAA&sa=X&ved=2ahUKEwjK-K-JwLWKAxXcQTABHePpOFoQtq8DegQIMxAD", 
    :image=>""
  },
  ...
] 
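
One way to see where the output diverges from `expected_array` is a small field-by-field comparison. This is only a sketch: `diff_artworks` is a hypothetical helper that is not part of this PR, and the sample hashes below are made-up stand-ins for the real data.

```ruby
# Returns human-readable differences between two arrays of hashes.
def diff_artworks(actual, expected)
  problems = []
  if actual.length != expected.length
    problems << "expected #{expected.length} entries, got #{actual.length}"
  end

  actual.zip(expected).each_with_index do |(got, want), index|
    # Skip positions that exist on only one side; the length check covers them.
    next if got.nil? || want.nil?

    (got.keys | want.keys).each do |key|
      next if got[key] == want[key]

      problems << "entry #{index}: #{key} was #{got[key].inspect}, expected #{want[key].inspect}"
    end
  end
  problems
end

# Made-up sample data for illustration only.
expected_sample = [
  { name: 'The Starry Night', extensions: ['1889'], link: '/search?q=a', image: 'data:image/jpeg;base64,AAAA' }
]
actual_sample = [
  { name: 'The Starry Night', extensions: ['1889'], link: '/search?q=a', image: '' }
]

problems = diff_artworks(actual_sample, expected_sample)
puts problems
```

A report like this makes it easy to confirm that the remaining mismatches are limited to the `image` field.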

Notes

  • The image data in the src attributes of the img tags is truncated; further work is needed to find the full Base64 data elsewhere in the file.

  • The current solution works for the basic structure of the provided results page, but it is not tested against variations or alternate result layouts.

  • Some logic is currently hardcoded (e.g., targeting specific class names or content structure), and could be improved by dynamically parsing surrounding elements or using more flexible selectors.

  • Whitespace and text normalization were necessary due to the heavily indented source file; .text.strip.gsub(/\s+/, ' ') was used to clean values.
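
For reference, the whitespace cleanup mentioned in the last note behaves as sketched below. The method names `normalize_whitespace` and `normalize_unicode_whitespace` are invented for this example; the second variant is an optional extension, not something the PR does.

```ruby
# Trim both ends, then collapse every internal run of whitespace
# (spaces, tabs, newlines) into a single space.
def normalize_whitespace(text)
  text.strip.gsub(/\s+/, ' ')
end

# Variant that also handles non-breaking spaces (U+00A0), which Ruby's
# \s and String#strip do not treat as whitespace.
def normalize_unicode_whitespace(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

raw = "\n      The Starry\n          Night   \n"
normalized = normalize_whitespace(raw)
nbsp_normalized = normalize_unicode_whitespace("\u00A0Irises\u00A0")
```

The Unicode variant may be worth considering if scraped text turns out to contain `&nbsp;` entities, since those survive the simpler version.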

Opportunities for Improvement

  • Dynamically detect section boundaries (e.g., using heading labels instead of fixed class traversal)

  • Improve robustness by generalizing selectors or adding structure detection

  • Find and extract the full Base64 image data from the file

  • Add tests for parsing variations and edge cases if more examples are available
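
For the full-image item, one possible direction is sketched below. It assumes the complete data URIs are assigned inside inline scripts and keyed by the image's element id; that layout, the `var s=...;var ii=[...]` pattern, the `setSrc` call, and the names `IMAGE_ASSIGNMENT` and `full_images_by_id` are all assumptions made for this example and have not been verified against the provided file.

```ruby
# Hypothetical script text: the assignment layout is an assumption.
SAMPLE_SCRIPT = <<~'JS'
  (function(){var s='data:image/jpeg;base64,/9j/4AAQSkZJRg\x3d\x3d';var ii=['img_1'];setSrc(ii,s);})();
  (function(){var s='data:image/jpeg;base64,AAAABBBB';var ii=['img_2'];setSrc(ii,s);})();
JS

# Captures the data URI and the element id it is assigned to.
IMAGE_ASSIGNMENT = /var s='(data:image[^']+)';var ii=\['([^']+)'\]/

# Builds a hash of element id => full data URI from raw script text.
def full_images_by_id(script_text)
  script_text.scan(IMAGE_ASSIGNMENT).each_with_object({}) do |(data_uri, image_id), images|
    # Decode JavaScript hex escapes such as \x3d ("=") inside the literal.
    images[image_id] = data_uri.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
  end
end

images = full_images_by_id(SAMPLE_SCRIPT)
```

If the assumption holds, each artwork's `image` value could then be looked up by the `id` attribute of its img tag instead of reading the truncated `src`.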
