Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NUTCH-1749 Optionally exclude title from content field #285

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

steeveb972
Copy link

Hello,

Following the description and the comment provided in this jira, I propose a patch to add the ability to exclude some tags in HTML and Tika parsers.

Regards,

Steeve

Copy link
Member

@jorgelbg jorgelbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steeveb972 Thanks for your contribution!
I've added some comments about your implementation that we should address before merging. Also, we need to fix the conflicts with the master branch.

@@ -1002,7 +1002,7 @@

<!-- target: ant-eclipse-download =================================== -->
<target name="ant-eclipse-download" description="--> downloads the ant-eclipse binary.">
<get src="http://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2"
<get src="http://freefr.dl.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a good reason to change this URL.

* append all the content text found beneath the DOM node to the
* <code>StringBuffer</code>.
*
* <code>StringBuffer</code> without the mentioned element names in the <code>Set</code>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We return the textual content, not the element names. It should be something:

without the text from the excluded elements

boolean abortOnNestedAnchors) {
if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) {
private boolean getText(StringBuffer sb, Node node,
boolean abortOnNestedAnchors, Set<String> excludedElementNames) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting

boolean abort = false;
NodeWalker walker = new NodeWalker(node);
Set<String> lcExcludedElementNames = new HashSet<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid duplicating the exclusion set. This method is executed many times. We could use a TreeSet when it is invoked, delegating the comparison to the Set implementation.

* append all the content text found beneath the DOM node to the
* <code>StringBuffer</code>.
*
* <code>StringBuffer</code> without the mentioned element names in the <code>Set</code>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the previous comment.

@lewismc
Copy link
Member

lewismc commented Jan 8, 2021

Hi @steeveb972 this seems like a really useful patch. Are you able to update it and we try to get it into master?

@steeveb972
Copy link
Author

Hi, it has been a while ;-) I will have a look at it this week-end

@lewismc lewismc changed the title fix for NUTCH-1749 contributed by steeveb972 NUTCH-1749 Optionally exclude title from content field Jan 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants