Sunday, March 8, 2009

Text serialization

I'm working on adding more internal interfaces to the IPort hierarchy, which is used by OutputPort and InputPort (and hence IOEnvironment).
I'm adding a StAX interface in preparation for a true binary event-based pipe.
As I add more interface options, the problems of both interop and serialization get worse. I'll leave interop out for now.
But for serialization ... I'd like a common way of specifying text serialization options. Right now it's pretty much left up to the individual commands, but that's just not right.
For example, xpwd, xls, xcat, xquery etc. may use different internal interfaces (SAX, Saxon, DOM, StAX etc.). If their output ends up in a text file or on stdout (text), the serialization may differ: they may or may not omit the xml declaration, use indentation, do namespace fixups etc. It's all pretty hog-wild.
I've tried to standardize on a common serialization format, but it's not manageable in the long run. Users really need a way to force the serialization options explicitly, both per command and globally.

When I first started xmlsh I imagined some kind of common "output filter",
either explicitly as a pipe like
command | xformat -options

then maybe implicitly as a common filter on the output of all commands.
Lately I've been thinking about building this in deeper, as a set of properties that are inherited via environment variables: a "serialization" property set.
xproc has a well-defined set of these parameters, borrowed from xquery.
There's a lot to these, and the problem is tied to the multiple interfaces:
not all interfaces and APIs support all serialization parameters.
A simple example: StAX doesn't have a property to avoid writing out the xml declaration. You can fake it by not calling writeStartDocument(), but there's no global way to set it. Similarly, StAX doesn't have an "indentation" property, although you can fake it with a filter. SAX, DOM and Saxon all support different sets of properties (Saxon being the richest, or at least the most closely aligned with xquery and xslt).
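To make the StAX workaround concrete, here is a minimal sketch using only the standard javax.xml.stream API. The class name, the serialize() helper, its omitXmlDeclaration flag and the sample element are made up for illustration; none of this is xmlsh code. The point is that every StAX writer has to carry a check like this itself, because there is no property to set on the factory.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class StaxOmitDeclaration {

    // Hypothetical helper: writes one small document, with or without the
    // xml declaration. The flag is our own; StAX itself offers no such option.
    static String serialize(boolean omitXmlDeclaration) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);

            // The "fake": the declaration only appears if writeStartDocument()
            // is called, so skipping the call omits it.
            if (!omitXmlDeclaration) {
                writer.writeStartDocument("1.0");
            }

            writer.writeStartElement("dir");
            writer.writeCharacters("/home/user");
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(true));   // no declaration
        System.out.println(serialize(false));  // declaration first
    }
}
```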

So how to implement this?
At the user level I think an XSERIALIZATION variable makes sense; it can be inherited and overridden for child processes/commands.
But internally ... I have yet to figure out a way to apply this property consistently. It may be that I have to filter all output through a serialization pipe/stream ... which adds unnecessary overhead in the cases where it's not needed.
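One way to picture the "serialization pipe" idea is an identity transform from the standard javax.xml.transform API that re-serializes whatever a command produced, driven by a single property set. This is only a sketch under assumptions: the SerializationPipe class and reserialize() method are hypothetical names, and the property set is a plain java.util.Properties using the standard OutputKeys names, not an actual XSERIALIZATION format. It also shows the cost: the output is parsed and written a second time.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SerializationPipe {

    // Hypothetical filter: takes already-serialized XML and writes it again
    // using the given serialization options.
    static String reserialize(String xml, Properties options) {
        try {
            // newTransformer() with no stylesheet is an identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperties(options);

            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)),
                               new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for an inherited "serialization" property set.
        Properties options = new Properties();
        options.setProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        options.setProperty(OutputKeys.INDENT, "yes");

        System.out.println(reserialize("<a><b>text</b></a>", options));
    }
}
```

A child command could copy the parent's Properties and override individual keys, which is roughly the inheritance behaviour an XSERIALIZATION variable would need.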

Ideas very welcome!

4 comments:

  1. I ended up implementing global serialization as shell options, controlled with set. For example:

    set -indent

    turns on output indentation

    set +indent

    turns off output indentation

  2. Way to buck the backwards-sign ksh option trend. ;)

  3. In shell it's all backwards:
    - means turn on
    + means turn off
    0 is true
    !=0 is false

    maybe if I just make true mean false it will all make sense !

  4. Hm, that does fit...

    ostream << expression

    command >> outputfile

    Touché.

