Tuesday, January 13, 2009

Forking Input

I'm about to embark on a significant feature enhancement/change to xmlsh to support xproc. Xproc requires that streams (pipes) be able to "fork". That is, the input to a step (command) may have to be copied and sent to multiple places, including expressions used for argument construction. This could be done by reading the input into an XML variable and then passing that around to the various places that need it, but I'd like to keep the generated xmlsh script as "natural" as possible and, wherever possible, preserve the ability to stream. In xproc, "natural" means pretty much everything has access to (potentially a copy of) the input stream. But in xmlsh, streams are read much as they are in the Unix shells, where any reader of the input consumes it.

I'm debating whether I should add explicit syntax to cause a stream fork, or possibly fork implicitly. The reason that Unix shells consume input is somewhat dependent on the Unix OS pipe and file semantics. It's not entirely clear whether preserving this notion is important in xmlsh.

For example, suppose I wanted to run both an xquery and an xpath on the standard input.

xquery '//foo'
xpath '//foo'

The first command (currently) consumes all the input and the second command fails. Is that best? Maybe xproc has a good point. Suppose the above commands didn't consume the input. Then both the xquery and the xpath could read a "copy" of the input stream and produce results. Would this be more useful?
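To make the current behavior concrete, here is a minimal sketch in Python (not xmlsh). The `run_reader` function is a hypothetical stand-in for a command like xquery or xpath that reads its standard input to the end; it does not evaluate any query.

```python
import io

def run_reader(name, stream):
    """Hypothetical stand-in for a command: reads its input stream to EOF."""
    return name, stream.read()

# One shared input stream, as with a shell's standard input.
stdin = io.StringIO("<root><foo>1</foo></root>")

first = run_reader("xquery", stdin)   # consumes the whole stream
second = run_reader("xpath", stdin)   # stream is already at EOF

print(first)
print(second)
```

The second reader sees an empty input, which mirrors the second command failing above.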

An alternative is to provide an explicit syntax for forking. For example:

| xquery '//foo'
xpath '//foo'

In this case I invent the "|" syntax with no leading command to mean "fork the input". That way it's explicit that xquery gets a copy of the input and xpath then consumes it.
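The intended semantics of an explicit fork can be sketched like this, again in Python rather than xmlsh. The `fork` helper is hypothetical, and for simplicity it buffers the whole input in memory, so it shows the "who gets what" behavior but not streaming.

```python
import io

def fork(stream):
    """Hypothetical explicit fork: returns (copy, rest).

    Assumption: the whole input is buffered in memory, which is the
    simple approach and does not preserve streaming.
    """
    data = stream.read()
    return io.StringIO(data), io.StringIO(data)

def run_reader(stream):
    """Stand-in for a command that reads its input to EOF."""
    return stream.read()

stdin = io.StringIO("<root><foo>1</foo></root>")

copy, stdin = fork(stdin)          # like the leading "|": fork the input
xquery_input = run_reader(copy)    # xquery reads the copy
xpath_input = run_reader(stdin)    # xpath then consumes the input itself
```

Both readers see the full document, and only the command that was explicitly forked pays for a copy.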

A similar problem comes up with command substitution, such as

echo $(xpath '//foo')

Right now the xpath gets a null input stream because otherwise it would consume all the input. But suppose that command substitution also got forked copies of the input?
Something like this is actually required by xproc (the with-option tags must be able to read from the standard input). Again, I could implement all of this by reading the entire input into a variable at the beginning and then echo'ing it all over the place, but it's a compelling idea to natively support stream forking. Not only would there be some convenience in script authoring, but some optimizations could be done which would be hard if the forking were explicit.
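One way to picture forking without reading everything into a variable first is Python's `itertools.tee`, which splits one iterator into independent ones and buffers only the items that one reader has seen and the other has not. This is just an analogy under my own assumptions, not how xmlsh does it; the `chunks` helper and the variable names are made up for illustration.

```python
import itertools

def chunks(text, size):
    """Yield the input in small pieces, standing in for a streamed document."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

document = "<root><foo>1</foo><foo>2</foo></root>"
source = chunks(document, 8)

# Fork the stream: one branch for the command substitution,
# one for the command's own standard input.
for_substitution, for_command = itertools.tee(source, 2)

argument = "".join(for_substitution)   # what $(xpath '//foo') would read
body = "".join(for_command)            # what the outer command would read
```

Note that when one branch is drained completely before the other starts, as here, the buffer ends up holding the entire input, so the memory saving only appears when the readers advance at roughly the same pace.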

4 comments:

  1. Just realized that if I don't add explicit syntax to cause forking, then I have to do static analysis on an entire block (script or component of a pipe) to determine when NOT to fork or copy ... yuck.

  2. I think that the most reasonable approach would be to have the blocks branch streams explicitly. The implementation would be fairly straightforward, although I'm thinking of how it might be optimized for memory usage given a certain buffer size + a few vastly varying stream positions... maybe just temporary files?

    Anyways, why not have a StreamFork command, and be able to use ksh-like file handle redirection to point stdin at your favorite copy of the input stream?

    Of course I suppose the scripts themselves could take care of this by explicitly using their own temporary files.

    Keep me up to date on how you solve this!

  3. Good question about the "StreamFork" command. I was thinking of a command like "fork" that would work in some ways like the Unix "tee" command. But to avoid having to fork to files (because at some point these streams may be XML documents, not byte streams) I need to invent a named pipe. I think I need to do that anyway, but I didn't want to have to tackle two big things at once.
    And this still doesn't solve the problem of multiple readers.

    e.g.

    fork >(named-pipe)
    xquery '//foo' <(named-pipe)
    xpath '//foo' <(named-pipe) # Does this get a copy ?

  4. 1. Temporary file-backed solution (hmmmm, be careful that you are guaranteed an EOF somewhere along the line!)

    Sure, just rewind() the stream each time it's redirected to something. Or, xpath just gets stdin that's already at EOF if you don't branch it.

    2. Automagic analysis of the block and branching accordingly

    named-pipe and named-pipe point to different file descriptors, and there's a thread keeping its buffers full. Again, sounds like it could be ugly if it's a big file, and like a crash if it's an endless stream/enormous file. Maybe it's not a common case, but it sounds like an interesting DoS...

