txt2xml

Overview

txt2xml is a simple Java library for parsing arbitrarily structured text input into well-formed XML output, delivered as SAX events, as a DOM or JDOM document, or as text written to an OutputStream. The project was inspired by "Using SAX to Read Other Formats" by Claude Duguay in XML Magazine, March 2002.

txt2xml is useful in integration problems in which a variety of text formats need to be handled in a uniform manner: XML is a reasonable common ground since it is well supported by APIs and tooling.

The strategy in txt2xml is to let "Processors" parse the input, writing XML as they go, and to pass matched fragments to sub-processors for further processing. A Processor can be configured to repeat across the entire input text, or to match only once before passing control to a subsequent Processor.
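The nesting idea can be illustrated in plain Java. This is a minimal sketch, not the txt2xml API: the class NestedSplitSketch and its Splitter helper are hypothetical names invented for illustration, and only java.util.regex is used. Each splitter cuts its input at a delimiter regex, wraps every fragment in an element, and hands the fragment to a child splitter if one is configured.

```java
import java.util.regex.Pattern;

// Illustrative only: NOT the txt2xml API, just the processor/sub-processor idea.
class NestedSplitSketch {

    /** Hypothetical stand-in for a processor: element name, delimiter regex, optional child. */
    static final class Splitter {
        final String element;
        final Pattern delimiter;
        final Splitter child; // null when fragments are written out as text

        Splitter(String element, String regex, Splitter child) {
            this.element = element;
            this.delimiter = Pattern.compile(regex);
            this.child = child;
        }

        /** Splits the text at the delimiter and writes one element per fragment. */
        void process(String text, StringBuilder out) {
            for (String fragment : delimiter.split(text)) {
                out.append('<').append(element).append('>');
                if (child != null) {
                    child.process(fragment, out); // hand the fragment to the sub-processor
                } else {
                    out.append(escape(fragment));
                }
                out.append("</").append(element).append('>');
            }
        }
    }

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Splits into lines, then splits each line at commas followed by optional whitespace. */
    static String toXml(String text) {
        Splitter fields = new Splitter("field", ",\\s*", null);
        Splitter lines = new Splitter("line", "\n", fields);
        StringBuilder out = new StringBuilder("<txt2xml>");
        lines.process(text, out);
        return out.append("</txt2xml>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("1, 2, 3\n5, 6, 7"));
    }
}
```

Running main prints a single unindented line in which each input line becomes a line element containing one field element per value.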

A conversion is configured in a simple XML document: see the example below.

Drivers handle output of the resulting XML in a number of ways, including creating a DOM or JDOM Document, driving a SAX ContentHandler, or writing text to an OutputStream.
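To show why SAX events make a convenient common denominator for these outputs, here is a minimal sketch using only standard JAXP and SAX classes from the JDK. It does not use txt2xml's own Driver classes, and the class name SaxOutputSketch and its methods are hypothetical. The same emitLine method could feed any ContentHandler; here it feeds the JDK's identity TransformerHandler, which serializes the events to an OutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

// Illustrative only: standard JAXP/SAX calls, not txt2xml's Driver classes.
class SaxOutputSketch {

    /** Emits one line element with a field child per value, as SAX events. */
    static void emitLine(ContentHandler handler, String[] fields) throws Exception {
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "line", "line", noAttrs);
        for (String field : fields) {
            handler.startElement("", "field", "field", noAttrs);
            handler.characters(field.toCharArray(), 0, field.length());
            handler.endElement("", "field", "field");
        }
        handler.endElement("", "line", "line");
        handler.endDocument();
    }

    /** Serializes the events to an OutputStream through an identity transformer. */
    static void writeLine(String[] fields, OutputStream out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = factory.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.setResult(new StreamResult(out));
        emitLine(serializer, fields);
    }

    /** Convenience wrapper that returns the serialized XML as a string. */
    static String render(String... fields) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeLine(fields, out);
            return out.toString("UTF-8");
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize SAX events", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("1", "2", "3"));
    }
}
```

Swapping the TransformerHandler for a different ContentHandler changes the destination of the XML without touching the code that generates the events.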

A simple Swing GUI is provided that allows interactive use of txt2xml to define a configuration and see the resulting conversion of a source text to XML. This GUI uses my Scope framework to provide an MVC implementation.

A simple command line application is also provided:

> java org.txt2xml.cli.Batch config.xml sample.txt
> type sample.txt.xml
<?xml version="1.0" encoding="UTF-8"?>
<txt2xml>
    <line>
        <field>1</field>
        <field>2</field>
        <field>3</field>
    </line>
    <line>
        <field>5</field>
        <field>6</field>
        <field>7</field>
    </line>
</txt2xml>
>

Example

To turn a "comma separated values" file into XML, configure txt2xml as follows:

<txt2xml>

    <!-- Processor to split into lines -->
    <processor type="RegexDelimited">
        <element>line</element>
        <regex>\n</regex>

        <!-- Sub-processor to process each line by splitting at commas -->
        <processor type="RegexDelimited">
            <element>field</element>
            <regex>,\s*</regex>
        </processor>

    </processor>

</txt2xml>

This will act on the following comma separated values text:

1, 2, 3
5, 6, 7

To produce the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<txt2xml>
    <line>
        <field>1</field>
        <field>2</field>
        <field>3</field>
    </line>
    <line>
        <field>5</field>
        <field>6</field>
        <field>7</field>
    </line>
</txt2xml>
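Once the text is in this form, any standard XML API can read it. The following is a minimal sketch, assuming the generated XML is available as a string; it uses only the JDK's DOM parser, and the class name ReadOutputSketch and method readRows are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: reads line/field XML with the JDK's DOM parser.
class ReadOutputSketch {

    /** Returns one list of field values per line element in the document. */
    static List<List<String>> readRows(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<List<String>> rows = new ArrayList<>();
            NodeList lines = doc.getElementsByTagName("line");
            for (int i = 0; i < lines.getLength(); i++) {
                NodeList fields = ((Element) lines.item(i)).getElementsByTagName("field");
                List<String> row = new ArrayList<>();
                for (int j = 0; j < fields.getLength(); j++) {
                    row.add(fields.item(j).getTextContent());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<txt2xml><line><field>1</field><field>2</field><field>3</field></line>"
                + "<line><field>5</field><field>6</field><field>7</field></line></txt2xml>";
        System.out.println(readRows(xml));
    }
}
```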

License

The source for txt2xml is released under the BSD Open Source License, which allows commercial development projects built with txt2xml to be distributed under non-Open Source licenses.

Feedback

Feedback on any aspect of txt2xml is welcome. Offers of help on any aspect of txt2xml are even more welcome! I'm Steve Meyfroidt, based in London, UK. My email address is: smeyfroi@users.sourceforge.net.

Copyright (C) 2000-2002 Steve Meyfroidt