-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NUTCH-1749 Optionally exclude title from content field #285
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steeveb972 Thanks for your contribution!
I've added some comments about your implementation that we should address before merging. Also, we need to fix the conflicts with the master branch.
@@ -1002,7 +1002,7 @@ | |||
|
|||
<!-- target: ant-eclipse-download =================================== --> | |||
<target name="ant-eclipse-download" description="--> downloads the ant-eclipse binary."> | |||
<get src="http://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2" | |||
<get src="http://freefr.dl.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a good reason to change this URL.
* append all the content text found beneath the DOM node to the | ||
* <code>StringBuffer</code>. | ||
* | ||
* <code>StringBuffer</code> without the mentioned element names in the <code>Set</code>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We return the textual content, not the element names. It should be something:
without the text from the excluded elements
boolean abortOnNestedAnchors) { | ||
if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) { | ||
private boolean getText(StringBuffer sb, Node node, | ||
boolean abortOnNestedAnchors, Set<String> excludedElementNames) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting
boolean abort = false; | ||
NodeWalker walker = new NodeWalker(node); | ||
Set<String> lcExcludedElementNames = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid duplicating the exclusion set. This method is executed many times. We could use a TreeSet
when it is invoked, delegating the comparison to the Set
implementation.
* append all the content text found beneath the DOM node to the | ||
* <code>StringBuffer</code>. | ||
* | ||
* <code>StringBuffer</code> without the mentioned element names in the <code>Set</code>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the previous comment.
Hi @steeveb972 this seems like a really useful patch. Are you able to update it and we try to get it into master? |
Hi, it has been a while ;-) I will have a look at it this week-end |
Hello,
Following the description and the comment provided in this jira, I propose a patch to add the ability to exclude some tags in HTML and Tika parsers.
Regards,
Steeve