de.l3s.boilerpipe.filters.english
Class KeepLargestFulltextBlockFilter

java.lang.Object
  extended by de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
All Implemented Interfaces:
BoilerpipeFilter

public final class KeepLargestFulltextBlockFilter
extends java.lang.Object
implements BoilerpipeFilter

Keeps only the largest TextBlock (by number of words). If more than one block has the same number of words, the first block is chosen. All discarded blocks are marked "not content" and flagged as DefaultLabels.MIGHT_BE_CONTENT. As opposed to KeepLargestBlockFilter, the number of words is computed using HeuristicFilterBase.getNumFullTextWords(TextBlock), which only counts words that occur in text elements with at least 9 words and are thus believed to be full text. NOTE: Without language-specific fine-tuning (i.e., when running the default instance), this filter may lead to suboptimal results. Consider using KeepLargestBlockFilter instead, which works at the level of number-of-words instead of text densities.
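The selection rule described above can be illustrated with a minimal, self-contained sketch. It does not use the boilerpipe classes: Block, its fields, and keepLargest are hypothetical stand-ins, and the full-text word count is supplied directly rather than computed by HeuristicFilterBase.getNumFullTextWords(TextBlock). The sketch also assumes that only blocks currently marked as content are candidates and that fewer than two blocks means nothing changes; neither detail is stated on this page.

```java
import java.util.Arrays;
import java.util.List;

public class KeepLargestSketch {

    /** Hypothetical stand-in for a text block; NOT the boilerpipe TextBlock class. */
    static final class Block {
        final int numFullTextWords; // assumed to be precomputed by the caller
        boolean isContent;
        boolean mightBeContent;     // stands in for the MIGHT_BE_CONTENT label

        Block(int numFullTextWords, boolean isContent) {
            this.numFullTextWords = numFullTextWords;
            this.isContent = isContent;
        }
    }

    /**
     * Keeps the block with the most full-text words and discards the rest.
     * Returns true if a block was selected and the others were discarded.
     */
    static boolean keepLargest(List<Block> blocks) {
        if (blocks.size() < 2) {
            return false; // assumption: nothing to discard
        }
        Block largest = null;
        int max = -1;
        for (Block b : blocks) {
            // Assumption: only blocks marked as content are candidates.
            // The strict '>' keeps the first block when counts are equal.
            if (b.isContent && b.numFullTextWords > max) {
                max = b.numFullTextWords;
                largest = b;
            }
        }
        if (largest == null) {
            return false;
        }
        for (Block b : blocks) {
            if (b == largest) {
                b.isContent = true;
            } else {
                b.isContent = false;     // marked "not content"
                b.mightBeContent = true; // flagged for possible later recovery
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(12, true), new Block(40, true),
                new Block(40, true), new Block(7, true));
        boolean changed = keepLargest(blocks);
        System.out.println("changed: " + changed);
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println(i + " isContent=" + blocks.get(i).isContent);
        }
    }
}
```

With the real library, the equivalent step would be a call to process(TextDocument) on the INSTANCE field documented below.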

Author:
Christian Kohlschütter

Field Summary
static KeepLargestFulltextBlockFilter INSTANCE
           
 
Constructor Summary
KeepLargestFulltextBlockFilter()
           
 
Method Summary
protected static int getNumFullTextWords(TextBlock tb)
           
protected static int getNumFullTextWords(TextBlock tb, float minTextDensity)
           
 boolean process(TextDocument doc)
          Processes the given document doc.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INSTANCE

public static final KeepLargestFulltextBlockFilter INSTANCE
Constructor Detail

KeepLargestFulltextBlockFilter

public KeepLargestFulltextBlockFilter()
Method Detail

process

public boolean process(TextDocument doc)
                throws BoilerpipeProcessingException
Description copied from interface: BoilerpipeFilter
Processes the given document doc.

Specified by:
process in interface BoilerpipeFilter
Parameters:
doc - The TextDocument that is to be processed.
Returns:
true if changes have been made to the TextDocument.
Throws:
BoilerpipeProcessingException

getNumFullTextWords

protected static int getNumFullTextWords(TextBlock tb)

getNumFullTextWords

protected static int getNumFullTextWords(TextBlock tb,
                                         float minTextDensity)