de.dfki.lt.tools.tokenizer
Class JTok

java.lang.Object
  extended by de.dfki.lt.tools.tokenizer.JTok

public class JTok
extends java.lang.Object

JTok is a low-level tokenizer that recognizes paragraphs, sentences, tokens, punctuation, numbers, abbreviations, etc.

Version:
$Id: JTok.java,v 1.6 2005/04/12 08:47:37 steffen Exp $
Author:
Joerg Steffen, DFKI

Nested Class Summary
 class JTok.OpenClosePunctFlag
          This inner class wraps a primitive boolean value so that it can be passed by reference.
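
Java passes primitives by value, so a method cannot change a caller's boolean directly. The idea behind a flag wrapper like this one can be sketched as follows. This is only an illustration: BooleanFlag, BooleanFlagDemo, classifyQuote and the returned labels are hypothetical names, and the actual members of JTok.OpenClosePunctFlag are not shown in this documentation.

```java
// Hypothetical stand-in for a flag wrapper such as JTok.OpenClosePunctFlag.
// The real class's fields and methods may differ.
final class BooleanFlag {
    private boolean value;

    BooleanFlag(boolean initial) {
        this.value = initial;
    }

    boolean get() {
        return value;
    }

    void set(boolean newValue) {
        this.value = newValue;
    }
}

class BooleanFlagDemo {
    // Classifies a double quote as opening or closing and flips the shared
    // flag, so the caller sees the updated state after the call returns.
    static String classifyQuote(char c, BooleanFlag open) {
        if (c != '"') {
            return "OTHER";
        }
        String result = open.get() ? "CLOSE_PUNCT" : "OPEN_PUNCT";
        open.set(!open.get());
        return result;
    }

    public static void main(String[] args) {
        BooleanFlag open = new BooleanFlag(false);
        System.out.println(classifyQuote('"', open)); // prints OPEN_PUNCT
        System.out.println(classifyQuote('a', open)); // prints OTHER
        System.out.println(classifyQuote('"', open)); // prints CLOSE_PUNCT
    }
}
```

Because the wrapper object is shared between caller and callee, the state change made inside classifyQuote survives the call, which a plain boolean parameter could not achieve.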
 
Field Summary
static java.lang.String BORDER_ANNO
          This is the annotation key for sentence and paragraph borders.
static java.lang.String CLASS_ANNO
          This is the annotation key for the token class.
static java.lang.String P_BORDER
          This is the annotation value for paragraph borders.
static java.lang.String TU_BORDER
          This is the annotation value for text unit borders.
 
Constructor Summary
JTok(java.util.Properties configProps)
          This creates a new instance of JTok using the properties in configProps.
 
Method Summary
 LanguageResource getLanguageResource(java.lang.String aLanguage)
          This returns the LanguageResource for the given language, if available.
 boolean isAncestor(java.lang.String tag1, java.lang.String tag2, java.lang.String aLanguage)
          This checks whether the token class with tag tag1 is an ancestor of, or equal to, the token class with tag tag2 in the token class hierarchy for aLanguage.
static void main(java.lang.String[] args)
          This main method must be called with two or three arguments: a file name for the document to tokenize, the language of the document, and an optional encoding to use (default is ISO-8859-1). Supported languages are: de, en, it.
 AnnotatedString tokenize(java.lang.String anInputText, java.lang.String aLanguage)
          This takes a String that contains the text to tokenize and parses it for aLanguage.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CLASS_ANNO

public static final java.lang.String CLASS_ANNO
This is the annotation key for the token class.

See Also:
Constant Field Values

BORDER_ANNO

public static final java.lang.String BORDER_ANNO
This is the annotation key for sentence and paragraph borders.

See Also:
Constant Field Values

TU_BORDER

public static final java.lang.String TU_BORDER
This is the annotation value for text unit borders.

See Also:
Constant Field Values

P_BORDER

public static final java.lang.String P_BORDER
This is the annotation value for paragraph borders.

See Also:
Constant Field Values

Constructor Detail

JTok

public JTok(java.util.Properties configProps)
This creates a new instance of JTok using the properties in configProps.

Parameters:
configProps - a Properties object that contains data about the supported languages
Throws:
InitializationException - if initialization fails

Method Detail

getLanguageResource

public LanguageResource getLanguageResource(java.lang.String aLanguage)
                                     throws LanguageNotSupportedException
This returns the LanguageResource for the given language, if available.

Parameters:
aLanguage - a String with the language
Returns:
a LanguageResource
Throws:
LanguageNotSupportedException - if no language resource is available for this language

tokenize

public AnnotatedString tokenize(java.lang.String anInputText,
                                java.lang.String aLanguage)
This takes a String that contains the text to tokenize and parses it for aLanguage. It returns an instance of AnnotatedString that contains the identified paragraphs with their text units and tokens.
This method is thread-safe.

Parameters:
anInputText - a String with the text to analyse
aLanguage - a String with the language to use
Returns:
an AnnotatedString
Throws:
ProcessingException - if the input data causes an error, e.g. if the language is not supported

isAncestor

public boolean isAncestor(java.lang.String tag1,
                          java.lang.String tag2,
                          java.lang.String aLanguage)
                   throws ProcessingException
This checks whether the token class with tag tag1 is an ancestor of, or equal to, the token class with tag tag2 in the token class hierarchy for aLanguage.

Parameters:
tag1 - a String with a token class tag
tag2 - a String with a token class tag
aLanguage - a String with the language
Returns:
a boolean
Throws:
ProcessingException - if tags cannot be mapped to a token class
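
The ancestor-or-equal relation described above can be sketched with a simple parent map. This is a minimal illustration under stated assumptions, not JTok's implementation: the class TokenClassHierarchySketch, its methods, and the tags TOKEN, NUMBER, ORDINAL and PUNCT are all hypothetical, and the sketch returns false for unknown tags where the real method throws a ProcessingException.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical token class hierarchy. The tag names and the parent-map
// representation are assumptions, not JTok's actual data structures.
class TokenClassHierarchySketch {
    // Maps each class tag to its direct parent tag; the root has no entry.
    private final Map<String, String> parentOf = new HashMap<>();

    void addClass(String tag, String parentTag) {
        parentOf.put(tag, parentTag);
    }

    // Returns true if tag1 equals tag2 or lies on tag2's path to the root.
    boolean isAncestorOrEqual(String tag1, String tag2) {
        String current = tag2;
        while (current != null) {
            if (current.equals(tag1)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        TokenClassHierarchySketch hierarchy = new TokenClassHierarchySketch();
        hierarchy.addClass("NUMBER", "TOKEN");
        hierarchy.addClass("ORDINAL", "NUMBER");
        hierarchy.addClass("PUNCT", "TOKEN");

        System.out.println(hierarchy.isAncestorOrEqual("NUMBER", "ORDINAL")); // prints true
        System.out.println(hierarchy.isAncestorOrEqual("ORDINAL", "NUMBER")); // prints false
        System.out.println(hierarchy.isAncestorOrEqual("NUMBER", "NUMBER"));  // prints true
    }
}
```

Note that the relation is not symmetric: a more general class is an ancestor of a more specific one, but not the other way round.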

main

public static void main(java.lang.String[] args)
This main method must be called with two or three arguments:
- a file name for the document to tokenize
- the language of the document
- an optional encoding to use (default is ISO-8859-1)
Supported languages are: de, en, it.

Parameters:
args - an array of Strings with the arguments
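
The argument convention described above (two required arguments, an optional encoding that defaults to ISO-8859-1) could be handled roughly as in the following sketch. JTokArgsSketch and its members are hypothetical, and JTok's own main method may parse and validate its arguments differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for the command-line arguments described above.
// This is not part of JTok.
class JTokArgsSketch {
    // Language codes listed as supported in the description above.
    static final List<String> SUPPORTED = Arrays.asList("de", "en", "it");

    final String fileName;
    final String language;
    final Charset encoding;

    private JTokArgsSketch(String fileName, String language, Charset encoding) {
        this.fileName = fileName;
        this.language = language;
        this.encoding = encoding;
    }

    // Accepts exactly two or three arguments; the third selects the encoding
    // and falls back to ISO-8859-1 when it is missing.
    static JTokArgsSketch parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("usage: <file> <language> [encoding]");
        }
        if (!SUPPORTED.contains(args[1])) {
            throw new IllegalArgumentException("unsupported language: " + args[1]);
        }
        Charset encoding = args.length == 3
                ? Charset.forName(args[2])
                : StandardCharsets.ISO_8859_1;
        return new JTokArgsSketch(args[0], args[1], encoding);
    }

    public static void main(String[] args) {
        JTokArgsSketch parsed = parse(new String[] {"document.txt", "de"});
        System.out.println(parsed.fileName + " " + parsed.language + " " + parsed.encoding.name());
    }
}
```

Passing the encoding explicitly matters for documents that are not Latin-1 encoded, since reading, for example, a UTF-8 file with the default would garble non-ASCII characters before tokenization.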