Class WikipediaTokenizerImpl
- java.lang.Object
-
- org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl
-
class WikipediaTokenizerImpl extends java.lang.Object
JFlex-generated tokenizer that is aware of Wikipedia syntax.
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
static int
ACRONYM
static int
ALPHANUM
static int
APOSTROPHE
static int
BOLD
static int
BOLD_ITALICS
static int
CATEGORY
static int
CATEGORY_STATE
static int
CITATION
static int
CJ
static int
COMPANY
private int
currentTokType
static int
DOUBLE_BRACE_STATE
static int
DOUBLE_EQUALS_STATE
static int
EMAIL
static int
EXTERNAL_LINK
static int
EXTERNAL_LINK_STATE
static int
EXTERNAL_LINK_URL
static int
FIVE_SINGLE_QUOTES_STATE
static int
HEADING
static int
HOST
static int
INTERNAL_LINK
static int
INTERNAL_LINK_STATE
static int
ITALICS
static int
NUM
private int
numBalanced
private int
numLinkToks
private int
numWikiTokensSeen
private int
positionInc
static int
STRING
static int
SUB_HEADING
static int
THREE_SINGLE_QUOTES_STATE
static java.lang.String[]
TOKEN_TYPES
static int
TWO_SINGLE_QUOTES_STATE
private int
yychar
the number of characters up to the start of the matched text
private int
yycolumn
the number of characters from the last newline up to the start of the matched text
static int
YYEOF
This character denotes the end of file
static int
YYINITIAL
lexical states
private int
yyline
number of newlines encountered up to the start of the matched text
private static int[]
ZZ_ACTION
Translates DFA states to action switch labels.
private static java.lang.String
ZZ_ACTION_PACKED_0
private static int[]
ZZ_ATTRIBUTE
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
private static java.lang.String
ZZ_ATTRIBUTE_PACKED_0
private static int
ZZ_BUFFERSIZE
initial size of the lookahead buffer
private static char[]
ZZ_CMAP
Translates characters to character classes
private static java.lang.String
ZZ_CMAP_PACKED
Translates characters to character classes
private static java.lang.String[]
ZZ_ERROR_MSG
private static int[]
ZZ_LEXSTATE
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
private static int
ZZ_NO_MATCH
private static int
ZZ_PUSHBACK_2BIG
private static int[]
ZZ_ROWMAP
Translates a state to a row index in the transition table
private static java.lang.String
ZZ_ROWMAP_PACKED_0
private static int[]
ZZ_TRANS
The transition table of the DFA
private static java.lang.String
ZZ_TRANS_PACKED_0
private static int
ZZ_UNKNOWN_ERROR
private boolean
zzAtBOL
zzAtBOL == true iff the scanner is currently at the beginning of a line
private boolean
zzAtEOF
zzAtEOF == true iff the scanner is at the EOF
private char[]
zzBuffer
this buffer contains the current text to be matched and is the source of the yytext() string
private int
zzCurrentPos
the current text position in the buffer
private int
zzEndRead
endRead marks the last character in the buffer that has been read from input
private boolean
zzEOFDone
denotes if the user-EOF-code has already been executed
private int
zzFinalHighSurrogate
The number of occupied positions in zzBuffer beyond zzEndRead.
private int
zzLexicalState
the current lexical state
private int
zzMarkedPos
the text position at the last accepting state
private java.io.Reader
zzReader
the input device
private int
zzStartRead
startRead marks the beginning of the yytext() string in the buffer
private int
zzState
the current state of the DFA
-
Constructor Summary
Constructors (Constructor, Description)
WikipediaTokenizerImpl(java.io.Reader in)
Creates a new scanner
-
Method Summary
All Methods (Modifier and Type, Method, Description)
int
getNextToken()
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
int
getNumWikiTokensSeen()
Returns the number of tokens seen inside a category or link, etc.
int
getPositionIncrement()
(package private) void
getText(CharTermAttribute t)
Fills Lucene token with the current token text.
(package private) void
reset()
(package private) int
setText(java.lang.StringBuilder buffer)
void
yybegin(int newState)
Enters a new lexical state
int
yychar()
char
yycharat(int pos)
Returns the character at position pos from the matched text.
void
yyclose()
Closes the input stream.
int
yylength()
Returns the length of the matched text region.
void
yypushback(int number)
Pushes the specified number of characters back into the input stream.
void
yyreset(java.io.Reader reader)
Resets the scanner to read from a new input stream.
int
yystate()
Returns the current lexical state.
java.lang.String
yytext()
Returns the text matched by the current regular expression.
private boolean
zzRefill()
Refills the input buffer.
private void
zzScanError(int errorCode)
Reports an error that occurred while scanning.
private static int[]
zzUnpackAction()
private static int
zzUnpackAction(java.lang.String packed, int offset, int[] result)
private static int[]
zzUnpackAttribute()
private static int
zzUnpackAttribute(java.lang.String packed, int offset, int[] result)
private static char[]
zzUnpackCMap(java.lang.String packed)
Unpacks the compressed character translation table.
private static int[]
zzUnpackRowMap()
private static int
zzUnpackRowMap(java.lang.String packed, int offset, int[] result)
private static int[]
zzUnpackTrans()
private static int
zzUnpackTrans(java.lang.String packed, int offset, int[] result)
-
-
-
Field Detail
-
YYEOF
public static final int YYEOF
This character denotes the end of file
- See Also:
- Constant Field Values
-
ZZ_BUFFERSIZE
private static final int ZZ_BUFFERSIZE
initial size of the lookahead buffer
- See Also:
- Constant Field Values
-
YYINITIAL
public static final int YYINITIAL
lexical states
- See Also:
- Constant Field Values
-
CATEGORY_STATE
public static final int CATEGORY_STATE
- See Also:
- Constant Field Values
-
INTERNAL_LINK_STATE
public static final int INTERNAL_LINK_STATE
- See Also:
- Constant Field Values
-
EXTERNAL_LINK_STATE
public static final int EXTERNAL_LINK_STATE
- See Also:
- Constant Field Values
-
TWO_SINGLE_QUOTES_STATE
public static final int TWO_SINGLE_QUOTES_STATE
- See Also:
- Constant Field Values
-
THREE_SINGLE_QUOTES_STATE
public static final int THREE_SINGLE_QUOTES_STATE
- See Also:
- Constant Field Values
-
FIVE_SINGLE_QUOTES_STATE
public static final int FIVE_SINGLE_QUOTES_STATE
- See Also:
- Constant Field Values
-
DOUBLE_EQUALS_STATE
public static final int DOUBLE_EQUALS_STATE
- See Also:
- Constant Field Values
-
DOUBLE_BRACE_STATE
public static final int DOUBLE_BRACE_STATE
- See Also:
- Constant Field Values
-
STRING
public static final int STRING
- See Also:
- Constant Field Values
-
ZZ_LEXSTATE
private static final int[] ZZ_LEXSTATE
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
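The indexing rule above can be illustrated with a minimal sketch. The table contents, the state constants, and the helper name dfaStartState are hypothetical and chosen only to make the l / l+1 layout visible; they are not the values used by this class.

```java
// Sketch only: hypothetical table and state ids, not the generated data.
final class LexStateLookupSketch {
    // Lexical state ids have the form 2*k, so each one owns two table slots.
    static final int INITIAL = 0;     // k = 0
    static final int LINK_STATE = 2;  // k = 1

    // Slot l:   DFA start state for lexical state l.
    // Slot l+1: DFA start state for lexical state l at the beginning of a line.
    static final int[] LEXSTATE = {0, 5, 1, 6};

    /** Picks the DFA start state for a lexical state, honoring beginning-of-line. */
    static int dfaStartState(int lexicalState, boolean atBeginningOfLine) {
        return LEXSTATE[lexicalState + (atBeginningOfLine ? 1 : 0)];
    }

    public static void main(String[] args) {
        System.out.println(dfaStartState(LINK_STATE, false));
        System.out.println(dfaStartState(LINK_STATE, true));
    }
}
```

Because every lexical state id is even, adding 1 for the beginning-of-line case never collides with another state's slot.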
-
ZZ_CMAP_PACKED
private static final java.lang.String ZZ_CMAP_PACKED
Translates characters to character classes
- See Also:
- Constant Field Values
-
ZZ_CMAP
private static final char[] ZZ_CMAP
Translates characters to character classes
-
ZZ_ACTION
private static final int[] ZZ_ACTION
Translates DFA states to action switch labels.
-
ZZ_ACTION_PACKED_0
private static final java.lang.String ZZ_ACTION_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_ROWMAP
private static final int[] ZZ_ROWMAP
Translates a state to a row index in the transition table
-
ZZ_ROWMAP_PACKED_0
private static final java.lang.String ZZ_ROWMAP_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_TRANS
private static final int[] ZZ_TRANS
The transition table of the DFA
-
ZZ_TRANS_PACKED_0
private static final java.lang.String ZZ_TRANS_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_UNKNOWN_ERROR
private static final int ZZ_UNKNOWN_ERROR
- See Also:
- Constant Field Values
-
ZZ_NO_MATCH
private static final int ZZ_NO_MATCH
- See Also:
- Constant Field Values
-
ZZ_PUSHBACK_2BIG
private static final int ZZ_PUSHBACK_2BIG
- See Also:
- Constant Field Values
-
ZZ_ERROR_MSG
private static final java.lang.String[] ZZ_ERROR_MSG
-
ZZ_ATTRIBUTE
private static final int[] ZZ_ATTRIBUTE
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
-
ZZ_ATTRIBUTE_PACKED_0
private static final java.lang.String ZZ_ATTRIBUTE_PACKED_0
- See Also:
- Constant Field Values
-
zzReader
private java.io.Reader zzReader
the input device
-
zzState
private int zzState
the current state of the DFA
-
zzLexicalState
private int zzLexicalState
the current lexical state
-
zzBuffer
private char[] zzBuffer
this buffer contains the current text to be matched and is the source of the yytext() string
-
zzMarkedPos
private int zzMarkedPos
the text position at the last accepting state
-
zzCurrentPos
private int zzCurrentPos
the current text position in the buffer
-
zzStartRead
private int zzStartRead
startRead marks the beginning of the yytext() string in the buffer
-
zzEndRead
private int zzEndRead
endRead marks the last character in the buffer that has been read from input
-
yyline
private int yyline
number of newlines encountered up to the start of the matched text
-
yychar
private int yychar
the number of characters up to the start of the matched text
-
yycolumn
private int yycolumn
the number of characters from the last newline up to the start of the matched text
-
zzAtBOL
private boolean zzAtBOL
zzAtBOL == true iff the scanner is currently at the beginning of a line
-
zzAtEOF
private boolean zzAtEOF
zzAtEOF == true iff the scanner is at the EOF
-
zzEOFDone
private boolean zzEOFDone
denotes if the user-EOF-code has already been executed
-
zzFinalHighSurrogate
private int zzFinalHighSurrogate
The number of occupied positions in zzBuffer beyond zzEndRead. When a lead/high surrogate has been read from the input stream into the final zzBuffer position, this will have a value of 1; otherwise, it will have a value of 0.
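A minimal sketch of the hold-back rule described above, assuming the buffer has just been filled up to some index. The class and method names are hypothetical, and the real refill logic may apply additional conditions that are not shown here.

```java
// Sketch only: illustrates the 0-or-1 value described for zzFinalHighSurrogate.
final class HighSurrogateHoldbackSketch {

    /**
     * Returns 1 when the last filled slot (index filled - 1) holds a lead/high
     * surrogate whose trailing half has not been read yet, otherwise 0.
     */
    static int finalHighSurrogate(char[] buffer, int filled) {
        return filled > 0 && Character.isHighSurrogate(buffer[filled - 1]) ? 1 : 0;
    }

    /** End index the scanner may consume up to once the lone surrogate is held back. */
    static int usableEnd(char[] buffer, int filled) {
        return filled - finalHighSurrogate(buffer, filled);
    }

    public static void main(String[] args) {
        char[] split = {'a', '\uD83D'};          // pair cut in half by the buffer edge
        char[] whole = {'\uD83D', '\uDE00'};     // complete surrogate pair
        System.out.println(usableEnd(split, 2)); // the lone high surrogate is excluded
        System.out.println(usableEnd(whole, 2)); // nothing is held back
    }
}
```

Holding the lone surrogate back keeps the scanner from matching half of a supplementary code point before its second half arrives.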
-
ALPHANUM
public static final int ALPHANUM
- See Also:
- Constant Field Values
-
APOSTROPHE
public static final int APOSTROPHE
- See Also:
- Constant Field Values
-
ACRONYM
public static final int ACRONYM
- See Also:
- Constant Field Values
-
COMPANY
public static final int COMPANY
- See Also:
- Constant Field Values
-
EMAIL
public static final int EMAIL
- See Also:
- Constant Field Values
-
HOST
public static final int HOST
- See Also:
- Constant Field Values
-
NUM
public static final int NUM
- See Also:
- Constant Field Values
-
CJ
public static final int CJ
- See Also:
- Constant Field Values
-
INTERNAL_LINK
public static final int INTERNAL_LINK
- See Also:
- Constant Field Values
-
EXTERNAL_LINK
public static final int EXTERNAL_LINK
- See Also:
- Constant Field Values
-
CITATION
public static final int CITATION
- See Also:
- Constant Field Values
-
CATEGORY
public static final int CATEGORY
- See Also:
- Constant Field Values
-
BOLD
public static final int BOLD
- See Also:
- Constant Field Values
-
ITALICS
public static final int ITALICS
- See Also:
- Constant Field Values
-
BOLD_ITALICS
public static final int BOLD_ITALICS
- See Also:
- Constant Field Values
-
HEADING
public static final int HEADING
- See Also:
- Constant Field Values
-
SUB_HEADING
public static final int SUB_HEADING
- See Also:
- Constant Field Values
-
EXTERNAL_LINK_URL
public static final int EXTERNAL_LINK_URL
- See Also:
- Constant Field Values
-
currentTokType
private int currentTokType
-
numBalanced
private int numBalanced
-
positionInc
private int positionInc
-
numLinkToks
private int numLinkToks
-
numWikiTokensSeen
private int numWikiTokensSeen
-
TOKEN_TYPES
public static final java.lang.String[] TOKEN_TYPES
-
-
Method Detail
-
zzUnpackAction
private static int[] zzUnpackAction()
-
zzUnpackAction
private static int zzUnpackAction(java.lang.String packed, int offset, int[] result)
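This page does not document the packed format. As a rough illustration, JFlex-generated scanners commonly store such tables as run-length encoded strings of (count, value) character pairs. The sketch below assumes that scheme; the class name and the sample data are hypothetical, and the real ZZ_ACTION_PACKED_0 contents are not reproduced.

```java
import java.util.Arrays;

// Sketch only: assumes a (count, value) run-length layout for the packed string.
final class RunLengthUnpackSketch {

    /**
     * Expands (count, value) character pairs from packed into result,
     * starting at offset, and returns the index after the last written slot.
     */
    static int unpack(String packed, int offset, int[] result) {
        int i = 0;       // index into the packed string
        int j = offset;  // index into the unpacked array
        while (i + 1 < packed.length()) {
            int count = packed.charAt(i++);
            int value = packed.charAt(i++);
            do {
                result[j++] = value;
            } while (--count > 0);
        }
        return j;
    }

    public static void main(String[] args) {
        // Hypothetical packed data: three 0s followed by two 7s.
        String packed = new String(new char[] {3, 0, 2, 7});
        int[] table = new int[5];
        int end = unpack(packed, 0, table);
        System.out.println(end + " " + Arrays.toString(table));
    }
}
```

Returning the next free index lets a caller unpack several packed strings into one array by passing the previous result as the next offset, which matches the (packed, offset, result) shape of the signature above.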
-
zzUnpackRowMap
private static int[] zzUnpackRowMap()
-
zzUnpackRowMap
private static int zzUnpackRowMap(java.lang.String packed, int offset, int[] result)
-
zzUnpackTrans
private static int[] zzUnpackTrans()
-
zzUnpackTrans
private static int zzUnpackTrans(java.lang.String packed, int offset, int[] result)
-
zzUnpackAttribute
private static int[] zzUnpackAttribute()
-
zzUnpackAttribute
private static int zzUnpackAttribute(java.lang.String packed, int offset, int[] result)
-
getNumWikiTokensSeen
public final int getNumWikiTokensSeen()
Returns the number of tokens seen inside a category or link, etc.
- Returns:
- the number of tokens seen inside the context of wiki syntax.
-
yychar
public final int yychar()
-
getPositionIncrement
public final int getPositionIncrement()
-
getText
final void getText(CharTermAttribute t)
Fills Lucene token with the current token text.
-
setText
final int setText(java.lang.StringBuilder buffer)
-
reset
final void reset()
-
zzUnpackCMap
private static char[] zzUnpackCMap(java.lang.String packed)
Unpacks the compressed character translation table.
- Parameters:
packed - the packed character translation table
- Returns:
- the unpacked character translation table
-
zzRefill
private boolean zzRefill() throws java.io.IOException
Refills the input buffer.
- Returns:
- false, iff there was new input.
- Throws:
java.io.IOException - if any I/O error occurs
-
yyclose
public final void yyclose() throws java.io.IOException
Closes the input stream.
- Throws:
java.io.IOException
-
yyreset
public final void yyreset(java.io.Reader reader)
Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). Lexical state is set to YYINITIAL. The internal scan buffer is resized down to its initial length if it has grown.
- Parameters:
reader - the new input stream
-
yystate
public final int yystate()
Returns the current lexical state.
-
yybegin
public final void yybegin(int newState)
Enters a new lexical state
- Parameters:
newState - the new lexical state
-
yytext
public final java.lang.String yytext()
Returns the text matched by the current regular expression.
-
yycharat
public final char yycharat(int pos)
Returns the character at position pos from the matched text. It is equivalent to yytext().charAt(pos), but faster.
- Parameters:
pos - the position of the character to fetch. A value from 0 to yylength()-1.
- Returns:
- the character at position pos
-
yylength
public final int yylength()
Returns the length of the matched text region.
-
zzScanError
private void zzScanError(int errorCode)
Reports an error that occurred while scanning. In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner). Usual syntax/scanner level error handling should be done in error fallback rules.
- Parameters:
errorCode - the code of the error message to display
-
yypushback
public void yypushback(int number)
Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
- Parameters:
number - the number of characters to be read again. This number must not be greater than yylength().
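To make the relationship between the matched text and a pushback concrete, here is a minimal sketch. It assumes the match is the buffer window between a start index and a marked end index, in the spirit of the zzStartRead and zzMarkedPos fields. The class, its constructor, and remaining() are hypothetical, and the sketch throws IllegalArgumentException where the generated scanner would report a scan error.

```java
// Sketch only: a fixed buffer with a [startRead, markedPos) match window.
final class MatchWindowSketch {
    private final char[] buffer;
    private final int startRead; // start of the matched text
    private int markedPos;       // one past the end of the matched text

    MatchWindowSketch(String input, int startRead, int markedPos) {
        this.buffer = input.toCharArray();
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    int yylength() { return markedPos - startRead; }

    String yytext() { return new String(buffer, startRead, markedPos - startRead); }

    /** Same result as yytext().charAt(pos), without building a String. */
    char yycharat(int pos) { return buffer[startRead + pos]; }

    /** Shrinks the match from the right; the dropped characters will be scanned again. */
    void yypushback(int number) {
        if (number < 0 || number > yylength()) {
            throw new IllegalArgumentException("pushback exceeds matched length");
        }
        markedPos -= number;
    }

    /** Text the next scan would start from (illustration only). */
    String remaining() { return new String(buffer, markedPos, buffer.length - markedPos); }

    public static void main(String[] args) {
        MatchWindowSketch m = new MatchWindowSketch("[[link]] rest", 0, 8);
        m.yypushback(2);
        System.out.println(m.yytext() + " | " + m.remaining());
    }
}
```

In this model a pushback never copies characters; it only moves the end of the window, which is why the argument cannot exceed yylength().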
-
getNextToken
public int getNextToken() throws java.io.IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
- Returns:
- the next token
- Throws:
java.io.IOException - if any I/O error occurs
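A caller typically invokes getNextToken() in a loop until the end-of-file value comes back. The sketch below shows that pattern against a hypothetical TokenScanner interface and a toy whitespace scanner, since the real class is package-private and needs Lucene on the classpath. The value -1 for YYEOF is an assumption based on the usual JFlex convention; the actual value is listed under Constant Field Values.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch only: stand-in types, not the Lucene classes.
final class ScanLoopSketch {
    static final int YYEOF = -1; // assumed value, see lead-in

    /** Hypothetical subset of the scanner methods used by the loop. */
    interface TokenScanner {
        int getNextToken() throws IOException;
        String yytext();
    }

    /** Pulls tokens until YYEOF and records "type:text" for each one. */
    static List<String> drain(TokenScanner scanner) throws IOException {
        List<String> out = new ArrayList<>();
        int type;
        while ((type = scanner.getNextToken()) != YYEOF) {
            out.add(type + ":" + scanner.yytext());
        }
        return out;
    }

    /** Toy scanner over whitespace-separated words; every token has type 0. */
    static TokenScanner wordScanner(String input) {
        String trimmed = input.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        return new TokenScanner() {
            private int next = 0;
            private String text = "";

            @Override
            public int getNextToken() {
                if (next >= words.length) {
                    return YYEOF;
                }
                text = words[next++];
                return 0;
            }

            @Override
            public String yytext() {
                return text;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        System.out.println(drain(wordScanner("foo  bar")));
    }
}
```

The returned int is a token type rather than the token text, so the text has to be fetched separately after each successful call.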
-
-