I wrote this up to answer somebody's question in the past year, but I can't remember who or where. I think it's a good example of moving from a short but specific solution to a longer but general one, and I hope it teaches folks about custom XPath handlers and XPath queries along the way.
```ruby
#! /usr/bin/env ruby
#
#  TODO: put this in the nokogiri.org tutorials
#

require "nokogiri"

html = <<~EOF
  <html>
    <body>
      <h1>My Fakepedia Page</h1>
      <div id="bodyContent">
        <div id="mw-content-text">
          <div class="mw-parser-output">
            <div id="toc">...</div>
            <h2>Background</h2>
            <p>This is uninteresting content and you don't want to scrape it.</p>
            <h2>Good Stuff</h2>
            <p>This is the good stuff.</p>
            <p>You really want to scrape just this section.</p>
            <h2>Unrelated Stuff</h2>
            <p>This is where the author has gone off on a tangent.</p>
            <h2>References</h2>
            <p>Snoozapalooza.</p>
          </div>
        </div>
      </div>
EOF

doc = Nokogiri::HTML(html)

#
#  solution 1 - simple XPath, process results in Ruby
#
#  i think you will agree, this is ugly code and makes a lot of
#  implicit assumptions about the structure of the document.
#
#  don't do this. better alternatives are provided below.
#
node_set = doc.css("div.mw-parser-output").children

# look forward until we get to the h2 that we want
start_index = 0
while !(node_set[start_index].name == "h2" && node_set[start_index].content == "Good Stuff")
  start_index += 1
end
start_index += 1

# look forward until we get to the next h2
end_index = start_index
while node_set[end_index + 1].name != "h2"
  end_index += 1
end

# slice the node set
puts node_set[start_index..end_index]

puts "-----"

#
#  solution 2 - using an XPath function to perform set intersection
#
#  this is much cleaner code, but still makes an assumption about the
#  structure of the document.
#
#  a better alternative is provided below
#
class XPathIntersection
  def self.intersection(set1, set2)
    set1 & set2 # in ruby, return the intersection of the NodeSets
  end
end

xpath_query = <<~EOX
  intersection(//h2[text()='Good Stuff']/following-sibling::*,
               //h2[text()='Unrelated Stuff']/preceding-sibling::*)
EOX
puts doc.xpath(xpath_query, XPathIntersection)

puts "-----"

#
#  solution 3 - write a method to introspect on the document and use
#  more XPath queries to find the section boundary and return only the
#  nodes within the section.
#
#  note that it works:
#  - for any header level (h1, h2, h3, et al)
#  - even if the header is the last one in the section
#  - only requires knowing the text of the header you care about
#
#  it uses:
#  - Node#path which returns an XPath query that points just to this node
#  - Node#name which returns the tag of the node (e.g., "h2", "div")
#
class XPathHeaderSection
  def self.header_section(node_set)
    document = node_set.document
    header = node_set.first

    # grab siblings that follow the target header
    following_siblings_query = "#{header.path}/following-sibling::*"
    following_siblings = document.xpath(following_siblings_query)

    # check if there's a next header of the same type that's a sibling
    next_header_query = "#{header.path}/following-sibling::#{header.name}"
    next_header = document.at_xpath(next_header_query)

    if next_header
      preceding_siblings_query = "#{next_header.path}/preceding-sibling::*"
      preceding_siblings = document.xpath(preceding_siblings_query)

      following_siblings & preceding_siblings # xpath intersection
    else
      following_siblings
    end
  end
end

puts XPathHeaderSection.header_section(doc.xpath("//h2[text()='Good Stuff']"))

# note that you can also call this method as an XPath function
puts doc.xpath("header_section(//h2[text()='Good Stuff'])", XPathHeaderSection)
```
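To see why intersecting "everything after the target header" with "everything before the next header" yields exactly one section, here is a minimal plain-Ruby sketch of the same idea. It does not use Nokogiri: the siblings are modelled as `[tag_name, text]` pairs, and `SIBLINGS` and `section_after` are hypothetical names invented for this illustration. It intersects sibling positions rather than the pairs themselves, because `Array#&` would otherwise collapse two identical paragraphs into one.

```ruby
# A plain-Ruby model of the sibling-intersection idea (no Nokogiri needed).
# Assumption: each sibling element is simplified to a [tag_name, text] pair.
SIBLINGS = [
  ["div", "..."],
  ["h2", "Background"],
  ["p", "This is uninteresting content and you don't want to scrape it."],
  ["h2", "Good Stuff"],
  ["p", "This is the good stuff."],
  ["p", "You really want to scrape just this section."],
  ["h2", "Unrelated Stuff"],
  ["p", "This is where the author has gone off on a tangent."],
  ["h2", "References"],
  ["p", "Snoozapalooza."],
].freeze

# Hypothetical helper: returns the siblings between the header whose text is
# header_text and the next header with the same tag name (or the end of the
# list when there is no such header).
def section_after(siblings, header_text)
  start = siblings.index { |name, text| name.match?(/\Ah[1-6]\z/) && text == header_text }
  return [] if start.nil?

  header_name = siblings[start][0]

  # positions after the target header (models following-sibling::*)
  following = ((start + 1)...siblings.length).to_a

  # position of the next header of the same level, if any
  next_header = following.find { |i| siblings[i][0] == header_name }
  return following.map { |i| siblings[i] } if next_header.nil?

  # positions before the next header (models preceding-sibling::*)
  preceding = (0...next_header).to_a

  # set intersection of the two position lists
  (following & preceding).map { |i| siblings[i] }
end

section_after(SIBLINGS, "Good Stuff").each do |name, text|
  puts "<#{name}>#{text}</#{name}>"
end
```

The last header ("References") has no later `h2` sibling, so the helper falls back to returning everything that follows it, which mirrors the `else` branch of solution 3.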