XmlTokenizer (TotalCrossSDK 6.1.0 API)

java.lang.Object
- totalcross.xml.XmlTokenizer

Direct Known Subclasses: DumpXml, XmlReader

public class XmlTokenizer
extends java.lang.Object

A tokenizer for XML input. In non-strict mode (the default), it also recognizes HTML constructs such as unquoted attribute values and unterminated references.

Four "tokenize" methods are provided: one takes a byte[] array; one takes a byte[] array with an offset and a count; one takes a (byte) Stream; and one takes an already buffered Stream, for an HTML document embedded within an HTTP stream.

Tokenization events are reported via overridable methods:

foundStartOfInput
foundStartTagName
foundEndTagName
foundEndEmptyTag
foundCharacterData
foundCharacter
foundAttributeName
foundAttributeValue
foundComment
foundProcessingInstruction
foundDeclaration
foundReference
foundEndOfInput

Some of these methods receive parameters describing the tokenized event, such as a tag name or an attribute name and value. These values are valid only while the event is being reported: never assume that the reported information is still available after a "foundXxx" method returns. A persistent value is, however, available through the getAbsoluteOffset() method, which returns the absolute offset of the parameters of the foundXxx method currently being reported.
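
Because the reported values are valid only during the callback, a handler that needs them later has to copy them before returning. The following is a minimal, library-independent sketch of that pattern. `SliceCopySketch`, `copySlice` and `onStartTagName` are hypothetical names (the last one stands in for an overridden foundStartTagName), and ISO-8859-1 decoding is an assumption made to keep the example deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shows why event data must be copied inside the callback.
final class SliceCopySketch {
    // Values copied out of the reported buffer; safe to keep after the callback returns.
    final List<String> tagNames = new ArrayList<>();

    // Copies the reported byte range into an independent String.
    static String copySlice(byte[] buffer, int offset, int count) {
        return new String(buffer, offset, count, StandardCharsets.ISO_8859_1);
    }

    // Stands in for an overridden foundStartTagName(byte[], int, int).
    void onStartTagName(byte[] buffer, int offset, int count) {
        tagNames.add(copySlice(buffer, offset, count)); // copy now, not later
    }

    public static void main(String[] args) {
        byte[] buffer = "<p>".getBytes(StandardCharsets.ISO_8859_1);
        SliceCopySketch handler = new SliceCopySketch();
        handler.onStartTagName(buffer, 1, 1);
        buffer[1] = (byte) 'q'; // simulate the buffer being reused after the event
        System.out.println(handler.tagNames); // the stored copy is unaffected
    }
}
```

Storing the `byte[]` reference together with the offset and count, instead of copying, would leave the handler reading whatever the buffer holds later.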

Typical invocation

```
import totalcross.sys.Vm;
import totalcross.xml.SyntaxException;
import totalcross.xml.XmlTokenizer;

class XmlTokenizerTest
{
   // Tokenizer subclass that logs every event it receives.
   static class MyXmlTokenizer extends XmlTokenizer
   {
      public void foundStartOfInput(byte[] buffer, int offset, int count)
      {
         Vm.debug("Start: " + new String(buffer, offset, count));
      }

      public void foundStartTagName(byte[] buffer, int offset, int count)
      {
         Vm.debug("StartTagName: " + new String(buffer, offset, count));
      }

      public void foundEndTagName(byte[] buffer, int offset, int count)
      {
         Vm.debug("EndTagName: " + new String(buffer, offset, count));
      }

      public void foundEndEmptyTag()
      {
         Vm.debug("EndEmptyTag");
      }

      public void foundCharacterData(byte[] buffer, int offset, int count)
      {
         Vm.debug("Content: " + new String(buffer, offset, count));
      }

      // Called for each resolved character reference.
      public void foundCharacter(char charFound)
      {
         Vm.debug("Content Ref  |" + charFound + '|');
      }

      public void foundAttributeName(byte[] buffer, int offset, int count)
      {
         Vm.debug("AttributeName: " + new String(buffer, offset, count));
      }

      // dlm is the quote character that delimited the value, or '\0' if none.
      public void foundAttributeValue(byte[] buffer, int offset, int count, byte dlm)
      {
         Vm.debug("AttributeValue: " + new String(buffer, offset, count));
      }

      public void foundEndOfInput(int count)
      {
         Vm.debug("Ended: " + count + " bytes parsed.");
      }
   }

   public static void testMe()
   {
      String input = "<p>Hello<i>World!</i></p>";
      MyXmlTokenizer xtk = new MyXmlTokenizer();
      try
      {
         // Tokenize the whole byte array; events are reported to the overrides above.
         xtk.tokenize(input.getBytes());
      }
      catch (SyntaxException ex)
      {
         Vm.debug(ex.getMessage());
      }
   }
}
```

Note: A Tokenizer is not a Parser. The correctness of the tag structure (stack) is not examined.
Ex: the dangling markup "<foo><bar>opop</foo>" is syntactically valid.
As a result, a Tokenizer can work on document fragments.
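
Since the tokenizer does not examine the tag structure, a caller that needs well-nested markup has to check it itself. A minimal sketch of such a check, driven by start-tag and end-tag events, might look like the following. `TagBalanceSketch` and its methods are hypothetical names, and the tag names are assumed to have already been copied into Strings by the event handlers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical nesting check layered on top of tokenizer events.
final class TagBalanceSketch {
    private final Deque<String> open = new ArrayDeque<>();
    private boolean mismatch = false;

    // Call from the start-tag event.
    void onStartTag(String name) {
        open.push(name);
    }

    // Call from the end-tag event; the closed tag must be the innermost open one.
    void onEndTag(String name) {
        if (open.isEmpty() || !open.pop().equals(name)) {
            mismatch = true;
        }
    }

    // True only if every end tag matched and nothing is left open.
    boolean isWellNested() {
        return !mismatch && open.isEmpty();
    }

    public static void main(String[] args) {
        // Events for "<foo><bar>opop</foo>"
        TagBalanceSketch check = new TagBalanceSketch();
        check.onStartTag("foo");
        check.onStartTag("bar");
        check.onEndTag("foo");
        System.out.println(check.isWellNested());
    }
}
```

This sketch ignores empty tags; a fuller version would also pop the stack when the end-empty-tag event is reported.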

Constructor Summary

Constructors
Modifier	Constructor and Description
`protected`	`XmlTokenizer()`

Method Summary

Modifier and Type	Method and Description
`void`	`disableReferenceResolution(boolean disable)` Turn off or on the automatic resolution of references.
`protected void`	`foundAttributeName(byte[] input, int offset, int count)` Method called when an attribute name has been found.
`protected void`	`foundAttributeValue(byte[] input, int offset, int count, byte dlm)` Method called when an attribute value has been found.
`protected void`	`foundCharacter(char charFound)` Method called when a character has been found in the contents, which is resulting from a character reference resolution.
`protected void`	`foundCharacterData(byte[] input, int offset, int count)` Method called when a character data content has been found.
`protected void`	`foundComment(byte[] input, int offset, int count)` Method called when a comment has been found.
`protected void`	`foundDeclaration(byte[] input, int offset, int count)` Method called when a declaration has been found.
`protected void`	`foundEndEmptyTag()` Method called when an empty-tag has been found.
`protected void`	`foundEndOfInput(int count)` Method called when the end of the input was found, and the tokenization is about to end.
`protected void`	`foundEndTagName(byte[] input, int offset, int count)` Method called when an end-tag has been found.
`protected void`	`foundInvalidData(byte[] input, int offset, int count)` Method called when invalid data was found.
`protected void`	`foundProcessingInstruction(byte[] input, int offset, int count)` Method called when a processing instruction has been found.
`protected void`	`foundReference(byte[] input, int offset, int count)` Method called when a reference has been found in content.
`protected void`	`foundStartOfInput(byte[] input, int offset, int count)` Method called before tokenizing starts.
`protected void`	`foundStartTagName(byte[] input, int offset, int count)` Method called when a start-tag has been found.
`int`	`getAbsoluteOffset()` Get the absolute offset of the data parameters of the currently reported event.
`int`	`hashCode(byte[] input, int offset, int count)` Returns the hashcode of the given bytes.
`boolean`	`isDataCDATA()` Tell if the data which is currently reported by foundCharacterData is `CDATA` versus `PCDATA`.
`static char`	`resolveCharacterReference(byte[] input, int offset, int count)` Resolve a numeric or named character reference.
`protected void`	`setCdataContents(byte[] input, int offset, int count)` Declare the input to be CDATA, until the end tag of the named element is found.
`void`	`setStrictlyXml(boolean toSet)` Set or unset the strict XML mode of the parser.
`void`	`tokenize(byte[] input)` Tokenize an array of bytes.
`void`	`tokenize(byte[] input, int offset, int count)` Tokenize an array of bytes.
`void`	`tokenize(Stream input)` Tokenize a stream.
`void`	`tokenize(Stream input, byte[] buffer, int start, int end, int pos)` Tokenize an already buffered Stream.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail
- XmlTokenizer
```
protected XmlTokenizer()
```

Method Detail

tokenize
```
public final void tokenize(byte[] input,
                           int offset,
                           int count)
                    throws SyntaxException
```
Tokenize an array of bytes.

Parameters:

input - byte array to tokenize

offset - position of the first byte in the array

count - number of bytes to tokenize

Throws:

SyntaxException

tokenize
```
public final void tokenize(byte[] input)
                    throws SyntaxException
```
Tokenize an array of bytes.

Parameters:

input - byte array to tokenize

Throws:

SyntaxException

tokenize

```
public final void tokenize(Stream input)
                    throws SyntaxException,
                           IOException
```
Tokenize a stream.

Parameters:

input - stream to tokenize

Throws:

SyntaxException

IOException

tokenize
```
public final void tokenize(Stream input,
                           byte[] buffer,
                           int start,
                           int end,
                           int pos)
                    throws SyntaxException,
                           IOException
```
Tokenize an already buffered Stream.
Unlike the general method above, this tokenize method requires more arguments. Use it when the HTML document is embedded within an HTTP stream.

Parameters:

input - stream to tokenize

buffer - buffer already filled with bytes read from the input stream

start - starting position in the buffer

end - ending position in the buffer

pos - read position of the byte at offset 0 in the buffer

Throws:

SyntaxException

IOException
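
When part of an HTTP response has already been read into a buffer, the caller needs to know where the document begins before handing the buffer over. The sketch below locates the end of the header block, assuming the headers end with an empty line (CR LF CR LF). `HttpBodySketch.findBodyStart` is a hypothetical helper that is not part of TotalCross, and how its result maps onto the `start`, `end` and `pos` arguments should be checked against the SDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: finds where the HTTP body starts in an already filled buffer.
final class HttpBodySketch {
    // Returns the index of the first body byte within buffer[0..end),
    // or -1 if the CR LF CR LF terminator has not been read yet.
    static int findBodyStart(byte[] buffer, int end) {
        for (int i = 0; i + 3 < end; i++) {
            if (buffer[i] == '\r' && buffer[i + 1] == '\n'
                    && buffer[i + 2] == '\r' && buffer[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>";
        byte[] buffer = response.getBytes(StandardCharsets.ISO_8859_1);
        int start = findBodyStart(buffer, buffer.length);
        System.out.println(response.substring(start));
    }
}
```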

resolveCharacterReference
```
public static final char resolveCharacterReference(byte[] input,
                                                   int offset,
                                                   int count)
```
Resolve a numeric or named character reference. See XML Predefined Entities.

Parameters:

input - byte array which describes the reference

offset - position of the first byte in the array

count - number of bytes of the reference

Returns:

the resulting character, or '\uffff' (not a Unicode character) if the conversion could not be done
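
To illustrate the contract described above, here is a standalone sketch of resolving numeric references and the five XML predefined entities. It is not the TotalCross implementation. `CharRefSketch.resolve` is a hypothetical method, and it assumes that the byte range holds the reference name without the leading `&` and trailing `;` (for example `#65`, `#x41` or `amp`).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical re-creation of the documented behavior, not the library code.
final class CharRefSketch {
    static final char INVALID = '\uffff';

    static char resolve(byte[] input, int offset, int count) {
        if (count <= 0) {
            return INVALID;
        }
        String name = new String(input, offset, count, StandardCharsets.US_ASCII);
        if (name.charAt(0) == '#') {
            // Numeric reference: decimal (#65) or hexadecimal (#x41).
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int code = hex ? Integer.parseInt(name.substring(2), 16)
                               : Integer.parseInt(name.substring(1), 10);
                return (code >= 0 && code < 0xFFFF) ? (char) code : INVALID;
            } catch (NumberFormatException e) {
                return INVALID;
            }
        }
        // Named reference: only the five XML predefined entities are handled here.
        switch (name) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "apos": return '\'';
            case "quot": return '"';
            default:     return INVALID;
        }
    }

    public static void main(String[] args) {
        byte[] ref = "#x41".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve(ref, 0, ref.length));
    }
}
```

The real method may accept a larger set of names (HTML entities, for instance); the sketch only mirrors the '\uffff' failure convention.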

getAbsoluteOffset
```
public final int getAbsoluteOffset()
```
Get the absolute offset of the data parameters of the currently reported event.

Returns:

the absolute offset of the data parameters of the currently reported event.
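
One way to use this value is to record (offset, count) pairs during the events and slice the input later, instead of copying data in every callback. The following is a minimal sketch under two assumptions: the whole input is still available as one byte array when the spans are read back, and the absolute offset is counted from the start of that input. `SpanRecorderSketch` is a hypothetical class, and its `absoluteOffset` argument stands in for the value returned by getAbsoluteOffset() while an event is being reported.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical recorder of event positions for deferred processing.
final class SpanRecorderSketch {
    private final List<int[]> spans = new ArrayList<>();

    // Call during an event, passing the absolute offset and the event's count.
    void record(int absoluteOffset, int count) {
        spans.add(new int[] {absoluteOffset, count});
    }

    // After tokenization, turn the recorded spans back into text.
    List<String> materialize(byte[] wholeInput) {
        List<String> out = new ArrayList<>();
        for (int[] span : spans) {
            out.add(new String(wholeInput, span[0], span[1], StandardCharsets.ISO_8859_1));
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "<p>Hello<i>World!</i></p>";
        SpanRecorderSketch recorder = new SpanRecorderSketch();
        recorder.record(doc.indexOf("Hello"), 5);
        recorder.record(doc.indexOf("World!"), 6);
        System.out.println(recorder.materialize(doc.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```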

setCdataContents
```
protected final void setCdataContents(byte[] input,
                                      int offset,
                                      int count)
```
Declare the input to be CDATA until the end tag of the named element is found.
This setting makes it possible to handle raw character data. For example, when the <Script> tag is reported, the derived class calls this method with the name "SCRIPT" before returning. From this point on, all input is reported as data until </SCRIPT> is found.
Note: The Tokenizer is a low-level class and does not remember the tag name. Therefore, this method must be called each time the caller wants to suppress markup recognition until the end tag is found.

Parameters:

input - byte array containing the name of the element the end tag of which ends the character data

offset - position of the first character in the array

count - number of relevant bytes
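
Conceptually, once CDATA mode is declared, everything up to the matching end tag is plain data. The standalone sketch below illustrates that search; it is not the tokenizer's implementation. `CdataScanSketch.findEndTag` is a hypothetical helper, and the ASCII-only, case-insensitive matching is an assumption suggested by the <Script> ... </SCRIPT> example above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of "report everything as data until the end tag".
final class CdataScanSketch {
    // Returns the index of "</name" (ASCII, case-insensitive) at or after 'from',
    // or -1 if it does not occur. The character following the name is not checked.
    static int findEndTag(byte[] input, int from, String name) {
        int n = name.length();
        for (int i = from; i + 2 + n <= input.length; i++) {
            if (input[i] != '<' || input[i + 1] != '/') {
                continue;
            }
            boolean match = true;
            for (int k = 0; k < n && match; k++) {
                char a = Character.toLowerCase((char) (input[i + 2 + k] & 0xff));
                char b = Character.toLowerCase(name.charAt(k));
                match = (a == b);
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String doc = "<script>if (a<b) x();</SCRIPT>rest";
        byte[] bytes = doc.getBytes(StandardCharsets.ISO_8859_1);
        int end = findEndTag(bytes, 8, "script");
        System.out.println(doc.substring(8, end)); // the raw script text
    }
}
```

Note that the `<` inside `a<b` is skipped because it is not followed by `/`, which is the point of suppressing markup recognition.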

isDataCDATA
```
public final boolean isDataCDATA()
```
Tell if the data which is currently reported by foundCharacterData is CDATA versus PCDATA.
In ISO 8879 (SGML) terminology, CDATA describes "non displayable" data, such as the contents of a SCRIPT element. It differs from "regular data", such as the contents of a P element, which is named PCDATA (Parsed Character Data).

setStrictlyXml
```
public final void setStrictlyXml(boolean toSet)
```
Set or unset the strict XML mode of the parser.
By default, the parser will allow most commonly used HTML constructs.

Parameters:

toSet - if true, set the strict XML mode; if false, allow HTML constructs.

disableReferenceResolution
```
public final void disableReferenceResolution(boolean disable)
```
Turn off or on the automatic resolution of references.
References are normally resolved and reported via foundCharacter(char). When automatic resolution is turned off, foundReference(byte[],int,int) is called instead. By default, automatic resolution of references is on, and foundReference(byte[],int,int) is not called.
This option should be set before starting the tokenization. See foundReference(byte[],int,int) for more details.

Parameters:

disable - boolean: if true automatic resolution of references is turned off, otherwise, it is turned on.

foundStartOfInput
```
protected void foundStartOfInput(byte[] input,
                                 int offset,
                                 int count)
```
Method called before tokenizing starts.
A derived class may override this method to do whatever housekeeping is appropriate (sniffing the encoding, etc.).

Parameters:

input - byte array containing the first bytes of the input about to be tokenized

offset - position of the first byte to be tokenized

count - number of bytes to be tokenized

foundStartTagName
```
protected void foundStartTagName(byte[] input,
                                 int offset,
                                 int count)
```
Method called when a start-tag has been found.
Derived class may override this method.

Parameters:

input - byte array containing the name of the tag that started

offset - position of the first character of the tag name in the array

count - number of bytes the tag name is made of

foundEndTagName
```
protected void foundEndTagName(byte[] input,
                               int offset,
                               int count)
```
Method called when an end-tag has been found.
Derived class may override this method.

Parameters:

input - byte array containing the name of the tag that ended

offset - position of the first character of the tag name in the array

count - number of bytes the tag name is made of

foundEndEmptyTag
```
protected void foundEndEmptyTag()
```
Method called when an empty-tag has been found.
This method is called just after all events related to the starting tag have been reported. The implied tag name is that of the starting tag (i.e., the most recently reported start tag).
Derived class may override this method.
Example:
```
 
   <FOO A="B"/> generates:
   - foundStartTagName("FOO");
   - foundAttributeName("A");
   - foundAttributeValue("B");
   - foundEndEmptyTag();
 
```

foundCharacterData
```
protected void foundCharacterData(byte[] input,
                                  int offset,
                                  int count)
```
Method called when a character data content has been found.
Derived class may override this method.

Parameters:

input - byte array containing the character data that was found

offset - position of the first character data in the array

count - number of bytes the character data content is made of

foundCharacter
```
protected void foundCharacter(char charFound)
```
Method called when a character has been found in the contents, which is resulting from a character reference resolution.
Derived class may override this method.

Parameters:

charFound - resolved character; if the character is invalid, this value is set to '\uffff', which is not a Unicode character.

See Also:

foundReference(byte[],int,int)

foundAttributeName
```
protected void foundAttributeName(byte[] input,
                                  int offset,
                                  int count)
```
Method called when an attribute name has been found.
Derived class may override this method.

Parameters:

input - byte array containing the attribute name

offset - position of the first character of the attribute name in the array

count - number of bytes the attribute name is made of

foundAttributeValue
```
protected void foundAttributeValue(byte[] input,
                                   int offset,
                                   int count,
                                   byte dlm)
```
Method called when an attribute value has been found.
Derived class may override this method.

Parameters:

input - byte array containing the attribute value

offset - position of the first character of the attribute value in the array

count - number of bytes the attribute value is made of

dlm - delimiter that started the attribute value (' or "). '\0' if none

foundComment
```
protected void foundComment(byte[] input,
                            int offset,
                            int count)
```
Method called when a comment has been found.
Derived class may override this method.

Parameters:

input - byte array containing the comment (without the <!-- and --> delimiters)

offset - position of the first character of the comment in the array

count - number of bytes the comment is made of

foundProcessingInstruction
```
protected void foundProcessingInstruction(byte[] input,
                                          int offset,
                                          int count)
```
Method called when a processing instruction has been found.
Derived class may override this method.

Parameters:

input - byte array containing the processing instruction (without the <? and ?> delimiters)

offset - position of the first character of the processing instruction in the array

count - number of bytes the processing instruction is made of

foundDeclaration
```
protected void foundDeclaration(byte[] input,
                                int offset,
                                int count)
```
Method called when a declaration has been found.
Derived class may override this method.

Parameters:

input - byte array containing the declaration (without the <! and > delimiters)

offset - position of the first character of the declaration in the array

count - number of bytes the declaration is made of

foundReference

```
protected void foundReference(byte[] input,
                              int offset,
                              int count)
```
Method called when a reference has been found in content.

It can be either a named or numeric character reference, or an entity reference. Given the several syntaxes of references, no verification is made a priori on the validity of the "name" of the reference.

For convenience, the static method resolveCharacterReference(byte[],int,int) converts a character reference into its UCS-2 encoded value.

Note: foundReference is called only if disableReferenceResolution(boolean disable) has been called first, with disable set to true. If not, foundReference is never called, and foundCharacter(char) is called instead. This design makes it easy to handle both simple XML documents (only predefined named character entities and numeric character entities) and documents that have user-defined internal or external entities, as explained below.

When working with a set of externally defined entities, call disableReferenceResolution(true) to turn off automatic reference resolution. Your code in foundReference can then make a quick check to see whether the found reference is numeric. If it is numeric (it starts with a # character), call resolveCharacterReference. If it is not numeric, check whether the reference belongs to the list of entities defined for the parsed document. If it does, do the substitution; if not, call resolveCharacterReference, because it could be one of the XML Predefined Entities.
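
The decision sequence described in the previous paragraph can be sketched independently of the tokenizer. In this minimal example, `ReferenceDispatchSketch` and `expand` are hypothetical names, the document's entities are assumed to be held in a `Map`, and the `charResolver` function stands in for resolveCharacterReference (it receives the reference name and returns '\uffff' on failure).

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher for the logic a foundReference override might apply.
final class ReferenceDispatchSketch {
    private final Map<String, String> documentEntities;
    private final Function<String, Character> charResolver;

    ReferenceDispatchSketch(Map<String, String> documentEntities,
                            Function<String, Character> charResolver) {
        this.documentEntities = documentEntities;
        this.charResolver = charResolver;
    }

    // Returns the replacement text for a reference name such as "#65" or "amp".
    String expand(String name) {
        // 1. Numeric references go straight to the character resolver.
        if (name.startsWith("#")) {
            return String.valueOf(charResolver.apply(name).charValue());
        }
        // 2. Entities defined for this document are substituted.
        String replacement = documentEntities.get(name);
        if (replacement != null) {
            return replacement;
        }
        // 3. Otherwise it may be a predefined entity; '\uffff' signals failure.
        return String.valueOf(charResolver.apply(name).charValue());
    }
}
```

A caller would build the map from the document's entity declarations and check the result of `expand` for '\uffff' before using it.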

By default, each character reference is naturally reported via foundCharacter(char), which, again, supersedes the foundReference notification.

Derived class may override this method.

Parameters:

input - byte array containing the reference name

offset - position of the first character of the reference name in the array

count - number of bytes the reference name is made of

See Also:

setStrictlyXml(boolean toSet)

foundInvalidData
```
protected void foundInvalidData(byte[] input,
                                int offset,
                                int count)
```
Method called when invalid data was found. This is often due to a bad tag syntax.
Derived class may override this method.

Parameters:

input - byte array containing the invalid data

offset - position of the first character of the invalid data in the array

count - number of bytes the invalid data is made of

foundEndOfInput
```
protected void foundEndOfInput(int count)
```
Method called when the end of the input was found, and the tokenization is about to end.
Derived class may override this method.

Parameters:

count - number of bytes parsed

hashCode

```
public int hashCode(byte[] input,
                    int offset,
                    int count)
```
Returns the hashcode of the given bytes.

Since:

TotalCross 1.25
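
A hash over a byte range lets a handler compare tag or attribute names against precomputed constants without allocating a String for each event. The algorithm used by this method is not stated here, so the sketch below is only an illustration: `ByteHashSketch.hash` is a hypothetical function that applies the same polynomial as `java.lang.String.hashCode`, which is an assumption and not necessarily what TotalCross computes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical byte-range hash; matches String.hashCode() for ASCII input.
final class ByteHashSketch {
    static int hash(byte[] input, int offset, int count) {
        int h = 0;
        for (int i = offset; i < offset + count; i++) {
            h = 31 * h + (input[i] & 0xff); // treat each byte as an unsigned code unit
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] doc = "<html>".getBytes(StandardCharsets.US_ASCII);
        // Compare the tag name in place against a constant computed from a String.
        System.out.println(hash(doc, 1, 4) == "html".hashCode());
    }
}
```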
