public class HyphenationCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.
Fields inherited from class CompoundWordTokenFilterBase: DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, onlyLongestMatch, tokens
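The decomposition described in the class overview can be illustrated with a minimal, Lucene-free sketch. It is not the filter's actual implementation: the class `DecompositionSketch`, the method `decompose`, and the hand-picked break offsets are all hypothetical. A real hyphenator would compute the break points from a hyphenation grammar. The sketch only shows the idea of testing the segments between candidate break points against a lower-case word dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DecompositionSketch {

    // Returns every dictionary word that spans two candidate break points.
    // breakPoints are ascending character offsets into the word, including
    // 0 and word.length(). They are supplied by the caller here (assumption);
    // a hyphenation grammar would normally produce them.
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        // The dictionary is assumed to hold lower-case entries only.
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                String candidate = lower.substring(breakPoints[i], breakPoints[j]);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // Hand-picked offsets for Do|nau|dampf|schiff (assumed, not computed).
        int[] breakPoints = {0, 2, 5, 10, 16};
        // prints [donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", breakPoints, dictionary));
    }
}
```

Because "schiff" is emitted as its own subword, a search for "schiff" can match a document that only contains "Donaudampfschiff".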
Constructor and Description |
---|
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, java.util.Set dictionary) |
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, java.util.Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) |
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, java.lang.String[] dictionary) |
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, java.lang.String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) |
Modifier and Type | Method and Description |
---|---|
protected void | decomposeInternal(org.apache.lucene.analysis.Token token) |
static HyphenationTree | getHyphenationTree(java.io.File hyphenationFile): Create a hyphenator tree |
static HyphenationTree | getHyphenationTree(org.xml.sax.InputSource hyphenationSource): Create a hyphenator tree |
static HyphenationTree | getHyphenationTree(java.io.Reader hyphenationReader): Create a hyphenator tree |
static HyphenationTree | getHyphenationTree(java.lang.String hyphenationFilename): Create a hyphenator tree |
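Methods inherited from class CompoundWordTokenFilterBase: addAllLowerCase, createToken, decompose, incrementToken, makeDictionary, makeLowerCaseCopy, next, next, reset
Methods inherited from class org.apache.lucene.analysis.TokenStream: getOnlyUseNewAPI, setOnlyUseNewAPI
Methods inherited from class org.apache.lucene.util.AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString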
public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, java.lang.String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, java.lang.String[] dictionary)
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, java.util.Set dictionary)
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lower case strings.
public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, java.util.Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lower case strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
public static HyphenationTree getHyphenationTree(java.lang.String hyphenationFilename) throws java.lang.Exception
hyphenationFilename - the filename of the XML grammar to load
Throws: java.lang.Exception
public static HyphenationTree getHyphenationTree(java.io.File hyphenationFile) throws java.lang.Exception
hyphenationFile - the file of the XML grammar to load
Throws: java.lang.Exception
public static HyphenationTree getHyphenationTree(java.io.Reader hyphenationReader) throws java.lang.Exception
hyphenationReader - the reader of the XML grammar to load from
Throws: java.lang.Exception
public static HyphenationTree getHyphenationTree(org.xml.sax.InputSource hyphenationSource) throws java.lang.Exception
hyphenationSource - the InputSource pointing to the XML grammar
Throws: java.lang.Exception
protected void decomposeInternal(org.apache.lucene.analysis.Token token)
Specified by: decomposeInternal in class CompoundWordTokenFilterBase
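The effect of the minWordSize, minSubwordSize, maxSubwordSize and onlyLongestMatch constructor parameters on decomposition can be sketched without Lucene. This is an illustration under stated assumptions, not the body of decomposeInternal: the class `SubwordLimitsSketch`, its `decompose` method, the hand-picked break offsets, and the choice to treat all size bounds as inclusive are hypothetical. The values 5, 2 and 15 are picked for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SubwordLimitsSketch {

    // breakPoints are ascending offsets into the word, including 0 and
    // word.length(); they stand in for hyphenation points (assumption).
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize, int maxSubwordSize,
                                  boolean onlyLongestMatch) {
        List<String> parts = new ArrayList<>();
        // Assumption: words shorter than minWordSize are not decomposed.
        if (word.length() < minWordSize) {
            return parts;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i < breakPoints.length; i++) {
            String longest = null; // longest match starting at breakPoints[i]
            for (int j = i + 1; j < breakPoints.length; j++) {
                int length = breakPoints[j] - breakPoints[i];
                if (length > maxSubwordSize) {
                    break; // all later candidates from this start are longer still
                }
                if (length < minSubwordSize) {
                    continue; // too short, try the next break point
                }
                String candidate = lower.substring(breakPoints[i], breakPoints[j]);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (!onlyLongestMatch) {
                    parts.add(candidate);
                } else if (longest == null || candidate.length() > longest.length()) {
                    longest = candidate;
                }
            }
            if (longest != null) {
                parts.add(longest);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("dampf", "schiff", "dampfschiff"));
        int[] breakPoints = {0, 5, 11}; // Dampf|schiff, hand-picked
        // prints [dampf, dampfschiff, schiff]
        System.out.println(decompose("Dampfschiff", breakPoints, dictionary, 5, 2, 15, false));
        // prints [dampfschiff, schiff]
        System.out.println(decompose("Dampfschiff", breakPoints, dictionary, 5, 2, 15, true));
        // prints [dampf, schiff]
        System.out.println(decompose("Dampfschiff", breakPoints, dictionary, 5, 2, 6, false));
    }
}
```

In this sketch, onlyLongestMatch keeps one subword per starting position, while a smaller maxSubwordSize drops the long entry "dampfschiff" and leaves only its shorter parts.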
Copyright © 2000-2018 Apache Software Foundation. All Rights Reserved.