public class XmlTokenizer
extends java.lang.Object
Four "tokenize" methods are provided: one takes a byte[] array; another takes a byte[] array with an offset and a count; another handles an HTML document embedded within an HTTP stream; and the last takes a (byte) Stream.
Tokenization events are reported via the overridable "foundXxx" methods listed in the method summary.
Some of these methods pass parameters pertinent to the kind of event tokenized: tag name, attribute name and value, and so on. These values are valid only while the event is being reported. Never assume that the reported information is still available after a "foundXxx" method returns! A persistent value is nevertheless available through the "getAbsoluteOffset()" method, which returns the absolute offset of the data parameters of the current "foundXxx" method.
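Because the reported values are valid only during the event, a derived class that needs them later has to copy them before returning. The following is a minimal sketch of that idea. `TokenCopy` and its `copy` method are hypothetical and not part of XmlTokenizer, and the ISO-8859-1 decoding is an assumption, since the encoding of the input is not stated here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of XmlTokenizer: it only illustrates
// copying event data out of a buffer that may be reused afterwards.
class TokenCopy {
    /** Decodes count bytes starting at offset into a new, independent String. */
    static String copy(byte[] buffer, int offset, int count) {
        // The String owns its own storage, so later changes to buffer
        // do not affect the returned value. ISO-8859-1 is an assumption.
        return new String(buffer, offset, count, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] buffer = "<p>Hello</p>".getBytes(StandardCharsets.ISO_8859_1);
        String saved = copy(buffer, 3, 5);   // the five bytes of "Hello"
        Arrays.fill(buffer, (byte) 'x');     // simulate the buffer being reused
        System.out.println(saved);           // prints "Hello"
    }
}
```

Inside a real "foundXxx" override, the same call would be made on the buffer, offset and count passed to the method.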
Typical invocation
    class XmlTokenizerTest {
        static class MyXmlTokenizer extends XmlTokenizer {
            public void foundStartOfInput(byte[] buffer, int offset, int count) {
                Vm.debug("Start: " + new String(buffer, offset, count));
            }
            public void foundStartTagName(byte[] buffer, int offset, int count) {
                Vm.debug("StartTagName: " + new String(buffer, offset, count));
            }
            public void foundEndTagName(byte[] buffer, int offset, int count) {
                Vm.debug("EndTagName: " + new String(buffer, offset, count));
            }
            public void foundEndEmptyTag() {
                Vm.debug("EndEmptyTag");
            }
            public void foundCharacterData(byte[] buffer, int offset, int count) {
                Vm.debug("Content: " + new String(buffer, offset, count));
            }
            public void foundCharacter(char charFound) {
                Vm.debug("Content Ref |" + charFound + '|');
            }
            public void foundAttributeName(byte[] buffer, int offset, int count) {
                Vm.debug("AttributeName: " + new String(buffer, offset, count));
            }
            public void foundAttributeValue(byte[] buffer, int offset, int count, byte dlm) {
                Vm.debug("AttributeValue: " + new String(buffer, offset, count));
            }
            public void foundEndOfInput(int count) {
                Vm.debug("Ended: " + count + " bytes parsed.");
            }
        }

        public static void testMe() {
            String input = "<p>Hello<i>World!</i></p>";
            MyXmlTokenizer xtk = new MyXmlTokenizer();
            try {
                xtk.tokenize(input.getBytes());
            } catch (SyntaxException ex) {
                Vm.debug(ex.getMessage());
            }
        }
    }
Note: A Tokenizer is not a Parser. The correctness of the tag structure (stack) is not examined. For example, the dangling markup "<foo><bar>opop</foo>" is syntactically valid. As a result, a Tokenizer can work on document fragments.
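Since the tag stack is not examined, a caller that needs well-nested markup has to track it on its own. The sketch below shows one possible way to do so. `TagStackChecker` is a hypothetical helper, not part of this API; a derived class could feed it from foundStartTagName, foundEndTagName and foundEndEmptyTag after converting the reported bytes to a String.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of XmlTokenizer.
class TagStackChecker {
    private final Deque<String> open = new ArrayDeque<>();
    private boolean wellNested = true;

    void startTag(String name) {
        open.push(name);
    }

    /** An empty tag closes the most recently reported start tag. */
    void emptyTag() {
        if (open.isEmpty()) wellNested = false;
        else open.pop();
    }

    void endTag(String name) {
        // The end tag must match the innermost element still open.
        if (open.isEmpty() || !open.pop().equals(name)) wellNested = false;
    }

    /** True if every tag seen so far was properly nested and closed. */
    boolean isBalanced() {
        return wellNested && open.isEmpty();
    }

    public static void main(String[] args) {
        TagStackChecker checker = new TagStackChecker();
        // Events a tokenizer would report for "<foo><bar>opop</foo>"
        checker.startTag("foo");
        checker.startTag("bar");
        checker.endTag("foo");
        System.out.println(checker.isBalanced()); // prints "false"
    }
}
```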
Modifier | Constructor and Description |
---|---|
protected | XmlTokenizer() |
Modifier and Type | Method and Description |
---|---|
void | disableReferenceResolution(boolean disable) Turn off or on the automatic resolution of references. |
protected void | foundAttributeName(byte[] input, int offset, int count) Method called when an attribute name has been found. |
protected void | foundAttributeValue(byte[] input, int offset, int count, byte dlm) Method called when an attribute value has been found. |
protected void | foundCharacter(char charFound) Method called when a character resulting from a character reference resolution has been found in the contents. |
protected void | foundCharacterData(byte[] input, int offset, int count) Method called when character data content has been found. |
protected void | foundComment(byte[] input, int offset, int count) Method called when a comment has been found. |
protected void | foundDeclaration(byte[] input, int offset, int count) Method called when a declaration has been found. |
protected void | foundEndEmptyTag() Method called when an empty-tag has been found. |
protected void | foundEndOfInput(int count) Method called when the end of the input was found, and the tokenization is about to end. |
protected void | foundEndTagName(byte[] input, int offset, int count) Method called when an end-tag has been found. |
protected void | foundInvalidData(byte[] input, int offset, int count) Method called when invalid data was found. |
protected void | foundProcessingInstruction(byte[] input, int offset, int count) Method called when a processing instruction has been found. |
protected void | foundReference(byte[] input, int offset, int count) Method called when a reference has been found in content. |
protected void | foundStartOfInput(byte[] input, int offset, int count) Method called before tokenizing starts. |
protected void | foundStartTagName(byte[] input, int offset, int count) Method called when a start-tag has been found. |
int | getAbsoluteOffset() Get the absolute offset of the data parameters of the currently reported event. |
int | hashCode(byte[] input, int offset, int count) Returns the hashcode of the given bytes. |
boolean | isDataCDATA() Tells whether the data currently reported by foundCharacterData is CDATA rather than PCDATA. |
static char | resolveCharacterReference(byte[] input, int offset, int count) Resolve a numeric or named character reference. |
protected void | setCdataContents(byte[] input, int offset, int count) Declare the input to be CDATA, until the end tag of the element tagName is found. |
void | setStrictlyXml(boolean toSet) Set or unset the strict XML mode of the parser. |
void | tokenize(byte[] input) Tokenize an array of bytes. |
void | tokenize(byte[] input, int offset, int count) Tokenize an array of bytes. |
void | tokenize(Stream input) Tokenize a stream. |
void | tokenize(Stream input, byte[] buffer, int start, int end, int pos) Tokenize an already buffered Stream. |
public final void tokenize(byte[] input, int offset, int count) throws SyntaxException
input
- byte array to tokenize
offset
- position of the first byte in the array
count
- number of bytes to tokenize
SyntaxException
public final void tokenize(byte[] input) throws SyntaxException
input
- byte array to tokenize
SyntaxException
public final void tokenize(Stream input) throws SyntaxException, IOException
input
- stream to tokenize
SyntaxException
IOException
public final void tokenize(Stream input, byte[] buffer, int start, int end, int pos) throws SyntaxException, IOException
Compared with the general method above, this tokenize method requires more arguments. Use it when the HTML document is embedded within an HTTP stream.
input
- stream to tokenize
buffer
- buffer already filled with bytes read from the input stream
start
- starting position in the buffer
end
- ending position in the buffer
pos
- read position of the byte at offset 0 in the buffer
SyntaxException
IOException
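As an illustration of where the start argument might come from, the sketch below locates the end of the HTTP headers in a buffer that has already been filled from the stream. This is only one plausible way to derive start. `HttpBodyLocator` and `findBodyStart` are hypothetical names, and the sketch assumes the headers end with the usual CR LF CR LF sequence.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XmlTokenizer.
class HttpBodyLocator {
    /**
     * Returns the index just past the first CR LF CR LF found in
     * buffer[0..end), or -1 if the header terminator is not there yet.
     */
    static int findBodyStart(byte[] buffer, int end) {
        for (int i = 0; i + 3 < end; i++) {
            if (buffer[i] == '\r' && buffer[i + 1] == '\n'
                    && buffer[i + 2] == '\r' && buffer[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] buffer = "HTTP/1.1 200 OK\r\n\r\n<p>Hi</p>"
                .getBytes(StandardCharsets.US_ASCII);
        int start = findBodyStart(buffer, buffer.length);
        // Everything from start up to the end of the buffer is document content.
        System.out.println(new String(buffer, start, buffer.length - start,
                StandardCharsets.US_ASCII)); // prints "<p>Hi</p>"
    }
}
```

The value found this way would be passed as start, with end marking the last buffered byte; how pos is computed depends on how the caller read the stream and is not shown.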
public static final char resolveCharacterReference(byte[] input, int offset, int count)
input
- byte array which describes the reference
offset
- position of the first byte in the array
count
- number of bytes of the reference
public final int getAbsoluteOffset()
protected final void setCdataContents(byte[] input, int offset, int count)
Declare the input to be CDATA, until the end tag of the element tagName is found.
This setting makes it possible to handle character data. For example, when the <Script> tag is reported, the derived class calls this method with the bytes of the tag name:
setCdataContents(input, offset, count);
before returning. From this point, all input is reported as data until </SCRIPT> is found.
Note: The Tokenizer is a low-level class and does not register the tag name. Therefore, this method must be called each time the caller wants to suppress markup recognition until the end tag is found.
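Before calling this method, the derived class has to recognize the tag name in the reported bytes. A minimal sketch of such a check follows. `TagNames.matches` is a hypothetical helper, and the ASCII case-insensitive comparison is an assumption suggested by the <Script> and </SCRIPT> spelling in the example above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XmlTokenizer.
class TagNames {
    /** Maps ASCII lowercase letters to uppercase; other values are unchanged. */
    private static int upper(int c) {
        return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
    }

    /**
     * Compares buffer[offset..offset+count) with name, ignoring ASCII case,
     * without allocating a String for the reported bytes.
     */
    static boolean matches(byte[] buffer, int offset, int count, String name) {
        if (count != name.length()) return false;
        for (int i = 0; i < count; i++) {
            if (upper(buffer[offset + i] & 0xFF) != upper(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] buffer = "<Script>".getBytes(StandardCharsets.US_ASCII);
        // The tag name occupies 6 bytes starting at offset 1.
        System.out.println(matches(buffer, 1, 6, "SCRIPT")); // prints "true"
    }
}
```

In a foundStartTagName override, a true result for the reported input, offset and count would be the point at which to declare the following input as CDATA.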
input
- byte array containing the name of the element whose end tag
ends the character data
offset
- position of the first character in the array
count
- number of relevant bytes
public final boolean isDataCDATA()
Tells whether the data currently reported by foundCharacterData is CDATA rather than PCDATA.
In ISO 8879 (SGML) terminology, CDATA describes "non displayable" data such as the contents of a SCRIPT element. It differs from "regular data", such as the contents of a P element, which is named PCDATA (Parsed Character Data).
public final void setStrictlyXml(boolean toSet)
By default, the parser will allow most commonly used HTML constructs.
toSet
- if true, set the strict XML mode; if false, allow HTML constructs.
public final void disableReferenceResolution(boolean disable)
References are normally resolved and reported via foundCharacter(char). When automatic resolution is turned off, foundReference(byte[],int,int) is called instead. By default, automatic resolution of references is on, and foundReference(byte[],int,int) is not called.
This option should be set before starting the tokenization. See foundReference(byte[],int,int) for more details.
disable
- if true, automatic resolution of references is turned off; otherwise, it is turned on.
protected void foundStartOfInput(byte[] input, int offset, int count)
Derived class may override this method to do whatever housekeeping is appropriate (sniffing the encoding, etc.).
input
- byte array containing the first bytes of the input about to
be tokenized
offset
- position of the first byte to be tokenized
count
- number of bytes to be tokenized
protected void foundStartTagName(byte[] input, int offset, int count)
Derived class may override this method.
input
- byte array containing the name of the tag that started
offset
- position of the first character of the tag name in the array
count
- number of bytes the tag name is made of
protected void foundEndTagName(byte[] input, int offset, int count)
Derived class may override this method.
input
- byte array containing the name of the tag that ended
offset
- position of the first character of the tag name in the array
count
- number of bytes the tag name is made of
protected void foundEndEmptyTag()
This method is called just after all events related to the starting tag have been reported. The implied tag name is that of the starting tag (i.e., the most recently reported start tag).
Derived class may override this method.
Example:
<FOO A=B> generates:
- foundStartTagName("FOO");
- foundAttributeName("A");
- foundAttributeValue("B");
- foundEndEmptyTag();
protected void foundCharacterData(byte[] input, int offset, int count)
Derived class may override this method.
input
- byte array containing the character data that was found
offset
- position of the first character data in the array
count
- number of bytes the character data content is made of
protected void foundCharacter(char charFound)
Derived class may override this method.
charFound
- resolved character; if the character is invalid, this value
is set to '\uffff', which is not a Unicode character.
See also: foundReference(byte[],int,int)
protected void foundAttributeName(byte[] input, int offset, int count)
Derived class may override this method.
input
- byte array containing the attribute name
offset
- position of the first character of the attribute name in the
array
count
- number of bytes the attribute name is made of
protected void foundAttributeValue(byte[] input, int offset, int count, byte dlm)
Derived class may override this method.
input
- byte array containing the attribute value
offset
- position of the first character of the attribute value in the
array
count
- number of bytes the attribute value is made of
dlm
- delimiter that started the attribute value (' or "), or '\0' if
none
protected void foundComment(byte[] input, int offset, int count)
Derived class may override this method.
input
- byte array containing the comment (without the <!-- and --> delimiters)
offset
- position of the first character of the comment in the array
count
- number of bytes the comment is made of
protected void foundProcessingInstruction(byte[] input, int offset, int count)
Derived class may override this method.
input
- byte array containing the processing instruction (without the <? and ?> delimiters)
offset
- position of the first character of the processing instruction
in the array
count
- number of bytes the processing instruction is made of
protected void foundDeclaration(byte[] input, int offset, int count)
Derived class may override this method.
input
- byte array containing the declaration (without the <! and > delimiters)
offset
- position of the first character of the declaration in the
array
count
- number of bytes the declaration is made of
protected void foundReference(byte[] input, int offset, int count)
The reference can be either a named or numeric character reference, or an entity reference. Given the several syntaxes of references, no verification is made a priori on the validity of the "name" of the reference.
For convenience, the static method resolveCharacterReference(byte[],int,int) converts a character reference into its UCS-2 encoded value.
Note: foundReference is called only if disableReferenceResolution(boolean disable) has been called first, with disable set to true. If not, then foundReference is never called, and foundCharacter(char) is called instead. This design makes it easy to handle both simple XML documents (only predefined named character entities and numeric character entities) and documents which have user-defined internal/external entities. This is explained below.
When working with a set of externally defined entities, issue disableReferenceResolution(true) to turn off automatic reference resolution. Then, your code in foundReference can make a quick check to see whether the found reference is numeric. If it is numeric (it starts with a # character), call resolveCharacterReference. If it is not a numeric reference, check whether the reference belongs to the known list of defined entities for the parsed document. If it does, do the substitution; if not, call resolveCharacterReference, because it could be one of the XML Predefined Entities.
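The order of checks described above can be sketched as follows. `ReferenceSketch` is a hypothetical, standalone illustration: its `resolveCharacterReference(String)` is only a stand-in for the real static method, whose exact behavior is not specified here. Returning '\uffff' for an invalid reference is an assumption based on the foundCharacter documentation, and the reference name is assumed to arrive without the surrounding & and ; characters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of XmlTokenizer.
class ReferenceSketch {
    static final char INVALID = '\uffff';
    private final Map<String, String> documentEntities = new HashMap<>();

    /** Registers an entity defined for the parsed document. */
    void define(String name, String replacement) {
        documentEntities.put(name, replacement);
    }

    /** Stand-in for the real resolveCharacterReference (assumed behavior). */
    static char resolveCharacterReference(String name) {
        if (name.startsWith("#")) {
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int code = hex ? Integer.parseInt(name.substring(2), 16)
                               : Integer.parseInt(name.substring(1), 10);
                return (code >= 0 && code < 0xFFFF) ? (char) code : INVALID;
            } catch (NumberFormatException e) {
                return INVALID;
            }
        }
        switch (name) {            // the five XML predefined entities
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "apos": return '\'';
            case "quot": return '"';
            default:     return INVALID;
        }
    }

    /** Numeric references first, then document entities, then predefined ones. */
    String resolve(String name) {
        if (name.startsWith("#")) {
            return String.valueOf(resolveCharacterReference(name));
        }
        String replacement = documentEntities.get(name);
        if (replacement != null) {
            return replacement;
        }
        return String.valueOf(resolveCharacterReference(name));
    }

    public static void main(String[] args) {
        ReferenceSketch sketch = new ReferenceSketch();
        sketch.define("company", "ACME");
        System.out.println(sketch.resolve("company") + sketch.resolve("#65")
                + sketch.resolve("lt")); // prints "ACMEA<"
    }
}
```

In a real foundReference override, the name would first be built from the reported input, offset and count.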
By default, each character reference is reported via foundCharacter(char), which, again, supersedes the foundReference notification.
Derived class may override this method.
input
- byte array containing the reference name
offset
- position of the first character of the reference name in the
array
count
- number of bytes the reference name is made of
See also: setStrictlyXml(boolean toSet)
protected void foundInvalidData(byte[] input, int offset, int count)
Derived class may override this method.
input
- byte array containing the invalid data
offset
- position of the first character of the invalid data in the
array
count
- number of bytes the invalid data is made of
protected void foundEndOfInput(int count)
Derived class may override this method.
count
- number of bytes parsed
public int hashCode(byte[] input, int offset, int count)