Package org.apache.tika.parser.txt
Class Icu4jEncodingDetector
- java.lang.Object
-
- org.apache.tika.parser.txt.Icu4jEncodingDetector
-
- All Implemented Interfaces:
Serializable, org.apache.tika.detect.EncodingDetector
public class Icu4jEncodingDetector extends Object implements org.apache.tika.detect.EncodingDetector
- See Also:
- Serialized Form
-
-
Constructor Summary
Constructor                Description
Icu4jEncodingDetector()
-
Method Summary
Modifier and Type    Method    Description
Charset    detect(InputStream input, org.apache.tika.metadata.Metadata metadata)
List<String>    getIgnoreCharsets()
int    getMarkLimit()
int    getMarkLimt()
boolean    isStripMarkup()
void    setIgnoreCharsets(List<String> charsetsToIgnore)
void    setMarkLimit(int markLimit)    How far into the stream to read for charset detection.
void    setStripMarkup(boolean stripMarkup)    Whether or not to attempt to strip html-ish markup from the stream before sending it to the underlying detector.
-
-
-
Method Detail
-
detect
public Charset detect(InputStream input, org.apache.tika.metadata.Metadata metadata) throws IOException
- Specified by:
detect in interface org.apache.tika.detect.EncodingDetector
- Throws:
IOException
-
isStripMarkup
public boolean isStripMarkup()
-
setStripMarkup
@Field public void setStripMarkup(boolean stripMarkup)
Whether or not to attempt to strip html-ish markup from the stream before sending it to the underlying detector. The underlying detector may still apply its own stripping if this is set to false.
- Parameters:
stripMarkup - whether or not to attempt to strip markup before sending the stream to the underlying detector
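As a rough illustration of what stripping markup before detection means, the sketch below removes anything between '<' and '>' from a byte sample so that only the text content would reach a charset detector. It is not this class's implementation; the class StripMarkupSketch and the method stripTags are hypothetical names, and real markup handling (comments, scripts, entities) is more involved.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: a naive tag stripper, NOT the logic used by Icu4jEncodingDetector.
class StripMarkupSketch {

    // Hypothetical helper: drops every byte from '<' through the matching '>'.
    static byte[] stripTags(byte[] sample) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(sample.length);
        boolean inTag = false;
        for (byte b : sample) {
            if (b == '<') {
                inTag = true;            // start skipping at an opening bracket
            } else if (b == '>' && inTag) {
                inTag = false;           // stop skipping after the closing bracket
            } else if (!inTag) {
                out.write(b);            // keep bytes that are outside any tag
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] text = stripTags(raw);
        System.out.println(text.length + " bytes left after stripping");
    }
}
```

Removing tags matters because ASCII-only markup can outweigh the few non-ASCII bytes that actually distinguish one encoding from another.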
-
getMarkLimit
public int getMarkLimit()
-
setMarkLimit
@Field public void setMarkLimit(int markLimit)
How far into the stream to read for charset detection. Default is 12000.
- Parameters:
markLimit - how far into the stream to read for charset detection
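The effect of a mark limit can be sketched with the standard mark/reset contract of java.io.InputStream: a bounded prefix is read for inspection, then the stream is rewound so later consumers see it from the start. This is a minimal sketch, not this class's implementation; it assumes the limit counts bytes, and the class MarkLimitSketch and the method peek are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: shows how a mark limit bounds the bytes read for detection.
class MarkLimitSketch {

    // Hypothetical helper: reads at most markLimit bytes, then resets the stream.
    static byte[] peek(InputStream input, int markLimit) {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            input.mark(markLimit);                 // remember the current position
            byte[] buffer = new byte[markLimit];
            int total = 0;
            while (total < markLimit) {
                int n = input.read(buffer, total, markLimit - total);
                if (n < 0) {
                    break;                         // stream ended before the limit
                }
                total += n;
            }
            input.reset();                         // rewind to the marked position
            return Arrays.copyOf(buffer, total);
        } catch (IOException e) {
            // Wrapped only to keep the sketch short; real code would propagate it.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "hello, world".getBytes(StandardCharsets.UTF_8));
        byte[] sample = peek(in, 5);
        System.out.println(new String(sample, StandardCharsets.UTF_8));
    }
}
```

A larger limit gives a detector more evidence at the cost of more buffering, and a stream passed in without mark support would typically be wrapped in a java.io.BufferedInputStream first.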
-
getMarkLimt
public int getMarkLimt()
-
-