java.lang.Object
  de.dfki.lt.tools.tokenizer.JTok
JTok is a low-level tokenizer tool that recognizes paragraphs, sentences, tokens, punctuation, numbers, abbreviations, etc.
Nested Class Summary | |
class |
JTok.OpenClosePunctFlag
This inner class is used as a wrapper for a boolean primitive value to allow call-by-reference with it. |
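Java passes primitives by value, so a method cannot change a caller's boolean directly; wrapping the value in a small mutable object is the usual workaround. The following is a minimal sketch of that idea only. BooleanFlag, FlagDemo and scanQuotes are hypothetical names chosen for illustration; the actual members of JTok.OpenClosePunctFlag are not documented on this page.

```java
// Hypothetical stand-in for a wrapper such as JTok.OpenClosePunctFlag.
// The real class's fields and methods are not shown in this documentation.
final class BooleanFlag {
    private boolean value;

    BooleanFlag(boolean initial) {
        this.value = initial;
    }

    boolean get() {
        return value;
    }

    void set(boolean newValue) {
        this.value = newValue;
    }
}

final class FlagDemo {
    // Toggles the shared flag each time a straight double quote is seen,
    // so the caller can observe the state after the method returns.
    static void scanQuotes(String text, BooleanFlag open) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                open.set(!open.get());
            }
        }
    }

    public static void main(String[] args) {
        BooleanFlag open = new BooleanFlag(false);
        scanQuotes("He said \"hello", open);
        System.out.println(open.get()); // prints "true": one unmatched quote
    }
}
```

Because the caller and the callee share the same BooleanFlag object, the state set inside scanQuotes survives the call, which a plain boolean parameter would not allow.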
Field Summary | |
static java.lang.String |
BORDER_ANNO
This is the annotation key for sentence and paragraph borders. |
static java.lang.String |
CLASS_ANNO
This is the annotation key for the token class. |
static java.lang.String |
P_BORDER
This is the annotation value for paragraph borders. |
static java.lang.String |
TU_BORDER
This is the annotation value for text unit borders. |
Constructor Summary | |
JTok(java.util.Properties configProps)
This creates a new instance of JTok using the properties in configProps. |
Method Summary | |
LanguageResource |
getLanguageResource(java.lang.String aLanguage)
This returns the LanguageResource for the given language
if available |
boolean |
isAncestor(java.lang.String tag1,
java.lang.String tag2,
java.lang.String aLanguage)
This checks whether the class of a token with tag tag1 is an ancestor, in the token class hierarchy for aLanguage, of the class of a token with tag tag2, or whether the two token classes are equal. |
static void |
main(java.lang.String[] args)
This main method must be used with two or three arguments:
- a file name for the document to tokenize
- the language of the document
- an optional encoding to use (default is ISO-8859-1)
Supported languages are: de, en, it |
AnnotatedString |
tokenize(java.lang.String anInputText,
java.lang.String aLanguage)
This takes a String that contains the text to tokenize and parses it for aLanguage. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
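The "ancestor or equal" test described for isAncestor can be illustrated with a simple parent-pointer walk. This is a sketch under stated assumptions, not JTok's implementation: ClassHierarchySketch, addClass, isAncestorOrEqual and the tag names are all hypothetical, the hierarchy is assumed to be acyclic, and unknown tags simply yield false rather than raising a ProcessingException.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a token class hierarchy: each class tag maps to
// its parent tag; a root tag has no entry. Assumes there are no cycles.
final class ClassHierarchySketch {
    private final Map<String, String> parentOf = new HashMap<>();

    void addClass(String child, String parent) {
        parentOf.put(child, parent);
    }

    // True if tag1 equals tag2 or lies on the path from tag2 to the root.
    boolean isAncestorOrEqual(String tag1, String tag2) {
        String current = tag2;
        while (current != null) {
            if (current.equals(tag1)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        ClassHierarchySketch h = new ClassHierarchySketch();
        // Illustrative tags only; they are not taken from JTok's resources.
        h.addClass("NUMBER", "TOKEN");
        h.addClass("ORDINAL", "NUMBER");
        System.out.println(h.isAncestorOrEqual("TOKEN", "ORDINAL"));
        System.out.println(h.isAncestorOrEqual("ORDINAL", "TOKEN"));
    }
}
```

The check is deliberately asymmetric: a class counts as matching its own descendants, but a descendant does not match its ancestor.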
Field Detail |
public static final java.lang.String CLASS_ANNO
public static final java.lang.String BORDER_ANNO
public static final java.lang.String TU_BORDER
public static final java.lang.String P_BORDER
Constructor Detail |
public JTok(java.util.Properties configProps)
This creates a new instance of JTok using the properties in configProps.
Parameters:
configProps - a Properties object that contains data about the supported languages
Throws:
InitializationException - if initialization fails
Method Detail |
public LanguageResource getLanguageResource(java.lang.String aLanguage) throws LanguageNotSupportedException
This returns the LanguageResource for the given language if available.
Parameters:
aLanguage - a String with the language
Returns:
a LanguageResource
Throws:
LanguageNotSupportedException - if no language resource is available for this language
public AnnotatedString tokenize(java.lang.String anInputText, java.lang.String aLanguage)
This takes a String that contains the text to tokenize and parses it for aLanguage. It returns an instance of AnnotatedString that contains the identified paragraphs with their text units and tokens.
Parameters:
anInputText - a String with the text to analyse
aLanguage - a String with the language to use
Returns:
an AnnotatedString
Throws:
ProcessingException - if input data causes an error, e.g. if the language is not supported
public boolean isAncestor(java.lang.String tag1, java.lang.String tag2, java.lang.String aLanguage) throws ProcessingException
This checks whether the class of a token with tag tag1 is an ancestor, in the token class hierarchy for aLanguage, of the class of a token with tag tag2, or whether the two token classes are equal.
Parameters:
tag1 - a String with a token class tag
tag2 - a String with a token class tag
aLanguage - a String with the language
Returns:
a boolean
Throws:
ProcessingException - if tags cannot be mapped to a token class
public static void main(java.lang.String[] args)
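This main method must be used with two or three arguments: a file name for the document to tokenize, the language of the document, and an optional encoding to use (default is ISO-8859-1). Supported languages are: de, en, it.
Parameters:
args - an array of Strings with the arguments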
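The argument convention for main (a file name, a language, and an optional encoding that defaults to ISO-8859-1) could be handled along the following lines. This is only a sketch of the convention as documented; CliArgsSketch and parse are hypothetical names, and how JTok itself validates or reports bad arguments is not stated on this page.

```java
// Hypothetical helper that mirrors the documented argument convention.
// It is not part of JTok.
final class CliArgsSketch {
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Returns {fileName, language, encoding}; the encoding falls back to
    // the documented default when only two arguments are given.
    static String[] parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException(
                "expected: <file> <language> [encoding]");
        }
        String encoding = (args.length == 3) ? args[2] : DEFAULT_ENCODING;
        return new String[] { args[0], args[1], encoding };
    }

    public static void main(String[] args) {
        // Sample arguments for illustration; "input.txt" is a made-up name.
        String[] parsed = parse(new String[] { "input.txt", "de" });
        System.out.println(parsed[0] + " " + parsed[1] + " " + parsed[2]);
    }
}
```

Rejecting any argument count other than two or three up front keeps the default-encoding rule in one place.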