Thursday, April 30, 2009

Preparing for Alpha 2

I am preparing a snapshot release which will become "a2", aka release 0.0.2.0.
I believe I am largely done mucking around with the syntax and core feature set of xmlsh and am getting nearer to the refinement and enhancement stages. The Alpha 2 release will more formally announce that I'm close to freezing the language specs. Alpha 2 will focus on tying up the last loose ends in the feature and language core and preparing for Beta. Beta 1 will formally freeze the language specs, begin freezing the core command set, and focus on bug fixes and stability for a true release.

ETA ... as long as it takes :)

Wednesday, April 15, 2009

XML Event Pipeline - Initial Results

At long last I got the code in shape so that I could implement "native" piping in xmlsh using a binary event queue instead of serializing to text. From the beginning this has been my goal, but it's been more difficult than I thought. A primary complication is that not all commands input or output XML. Unlike, say, XProc, which can only have XML in the pipes, xmlsh requires working with text streams as well.
For example "echo foo | cat" should work just as well as "xcat foo | xcat".

I finally got it to work. Truly event driven pipes that can stream both text and XML events (StAX events). All the tests finally passed and I was feeling really good, until I ran performance tests. Ugh. The event pipes are about 2x slower than text pipes! And that is with the text pipes paying the overhead of serializing and parsing the text format! This is totally shocking to me, as I had always presumed that the majority of the overhead would be in serializing and parsing XML text. But nope, it turns out that creating the event objects to stick in the pipe is less efficient than serializing to text and then parsing the text on the other end. Certainly this is a consequence of a particular implementation, not a general statement. But still, it's not a result I expected, and performance analysis doesn't show any huge smoking guns. The biggest issue seems to be in the StAX event creation. I didn't fully realize until now that the XMLEvent model is asymmetric. That is, when writing events you don't write the same events as you receive. You write out StartElement, Namespace, Attribute, but when you read events they are consolidated into one StartElement event. The overhead during write of consolidating these events turns out to be much bigger than serializing them to text and re-parsing the text.
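
To make the asymmetry concrete, here is a minimal sketch using only the standard javax.xml.stream API. It is not the xmlsh pipe code. It assumes the write side looks like the cursor-style XMLStreamWriter (start element, namespace and attribute as three separate calls), and the class name, element name and namespace URI are made up for illustration. Reading the result back through an XMLEventReader shows the attribute and namespace arriving inside the StartElement event rather than as events of their own.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Iterator;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of xmlsh.
class StaxAsymmetry {

    // Write side: the element, its namespace and its attribute are three separate calls.
    static String writeElement() {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("p", "doc", "urn:example");
            w.writeNamespace("p", "urn:example");
            w.writeAttribute("id", "1");
            w.writeEndElement();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Read side: returns {StartElement events seen, attributes on them, namespaces on them}.
    static int[] readBack(String xml) {
        try {
            XMLEventReader r = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            int starts = 0, attrs = 0, namespaces = 0;
            while (r.hasNext()) {
                XMLEvent e = r.nextEvent();
                if (e.isStartElement()) {
                    starts++;
                    StartElement se = e.asStartElement();
                    // Attributes and namespaces hang off the StartElement itself.
                    for (Iterator<?> it = se.getAttributes(); it.hasNext(); it.next()) attrs++;
                    for (Iterator<?> it = se.getNamespaces(); it.hasNext(); it.next()) namespaces++;
                }
            }
            return new int[] { starts, attrs, namespaces };
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] c = readBack(writeElement());
        System.out.println(c[0] + " StartElement event(s), carrying "
                + c[1] + " attribute(s) and " + c[2] + " namespace(s)");
    }
}
```

Any pipe that takes write-side calls in and hands read-side events out has to do that consolidation somewhere, which is where the cost described above would come from.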

I'm going to put this aside for a while, whilst I contemplate the error of my thinking. For now, if you want to experiment with the XML pipeline, the code is checked in, and you can enable the event pipes by setting the "XPIPE" environment variable. I will probably change that in the future.

Wednesday, April 8, 2009

MarkLogic integration

I'm working on a simple integration to MarkLogic servers in xmlsh. I now have the basics working, which, given how ML is structured, should be enough to do almost anything. The following commands are now supported (in my dev environment):
list - list all documents in the DB
put - put a document to the DB
get - get a document from the DB
del - delete a document from the DB
query - run an 'ad hoc' xquery on the ML server
invoke - run a stored xquery on the ML server

Interestingly, list, del and get are implemented as trivial xsh scripts calling query.

I plan on publishing this shortly and would love any input on how best to integrate with MarkLogic services. I am new to the ML server, so I don't know the "best practices".

Sunday, April 5, 2009

Named ports

The latest version of xmlsh has support for "Named Ports".
The original incentive was to be able to map more closely to xproc port semantics, even though I may end up not using them for xproc. I decided to include them anyway because I think this is one area where the unix shells are deficient, or too tied to legacy decisions.

In unix, processes access environment supplied streams by file number (0,1,2 ...).
There are by default 3 standard ones (stdin,stdout,stderr) corresponding to 0,1,2.
The unix shells give access to these via the standard IO redirection: cmd > file, cmd < file, etc. If you want access to anything except the first 2, you have to use a numeric modifier, e.g. cmd 2>err . You can do this with any file descriptor, like cmd 30>file 40<file2 . I've always thought that this was a bit of a hack. It's certainly very closely tied to the OS. Ported versions of shells often can't support this syntax for anything but the predefined 3. Java runtimes have no direct portable way to do this either. Even if they did, it wouldn't work in xmlsh because shell instances are not run as separate subprocesses. It could be done in xmlsh by keeping a mapping table of the ports that the shell has redirected so that internal commands could access them.

Instead I chose to go with named ports like xproc uses. I've completed the plumbing to support this but am not yet happy with the syntax. The part I'm happy with is this. For any command you can augment a redirection with (port). For example:
cmd (output)>file1 (alternate)>file2 (input)<file3

Inside cmd, the XCommand (via XEnvironment) exposes these as named ports, and you can get at them via getInput(name) or getOutput(name).
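
As a rough illustration of the idea (not the actual XEnvironment code), a named-port table can be as simple as two maps from port name to stream. Everything in this sketch is hypothetical: the NamedPorts class, the bindInput/bindOutput methods and the error handling are assumptions, and only the getInput/getOutput names echo the accessors mentioned above. The demo binds in-memory streams where the shell would bind files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the environment a command sees; not xmlsh's real XEnvironment.
class NamedPorts {
    private final Map<String, InputStream> inputs = new HashMap<>();
    private final Map<String, OutputStream> outputs = new HashMap<>();

    // Roughly what the shell would do when it parses "(name)<source".
    void bindInput(String name, InputStream in) { inputs.put(name, in); }

    // Roughly what the shell would do when it parses "(name)>target".
    void bindOutput(String name, OutputStream out) { outputs.put(name, out); }

    // What a command would call to reach a port by name.
    InputStream getInput(String name) { return lookup(inputs, name); }
    OutputStream getOutput(String name) { return lookup(outputs, name); }

    private static <T> T lookup(Map<String, T> table, String name) {
        T stream = table.get(name);
        if (stream == null) throw new IllegalArgumentException("no such port: " + name);
        return stream;
    }

    // Toy equivalent of "cmd (output)>file1 (input)<file3", with in-memory streams as the files.
    static String demo() {
        NamedPorts env = new NamedPorts();
        ByteArrayOutputStream file1 = new ByteArrayOutputStream();
        env.bindInput("input", new ByteArrayInputStream("<doc/>".getBytes(StandardCharsets.UTF_8)));
        env.bindOutput("output", file1);
        try {
            // The "command" body: copy the named input port to the named output port.
            InputStream in = env.getInput("input");
            OutputStream out = env.getOutput("output");
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(file1.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the table lives inside the shell process, nothing here depends on OS file descriptors, which is the point of using names instead of numbers.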

But suppose you want to specify the name of the port to the command and not have it confused with the name of a file. As an experiment I've added general support for port naming so that any command which wants a stream can pass in the "filename" and it can access files, URLs, variables, ports and expressions interchangeably. For example (and largely for testing) I have implemented xidentity to support optional filenames so you can do

xidentity file
xidentity http://url
xidentity {variable}
xidentity $xml_variable
xidentity <[ <doc/> ]>


and now with ports you can do

xidentity "(in)" (in)<file

Note the "(in)" is quoted. By the time xidentity gets it, it is simply the string (in), but to get it through the parser you have to quote it. I find this less than ideal. But if I stop treating ( as a magic token, it breaks all sorts of things, including the ability to distinguish between arguments and IO redirection.
(I.e., in "cmd (in)<file", does cmd get 1 arg or 0?)

However, the ability to pass in port names in the same context as filenames or URIs is very compelling. Alternative suggestions welcome. Maybe an extended scheme, like port:name? E.g.

xidentity port:in (in)<file
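
To show how a command might tell these forms apart once the argument reaches it as a plain string, here is a small sketch. It is purely hypothetical: the ArgSource class, its Kind enum and the matching rules are my assumptions for illustration, not how xmlsh resolves arguments, and it covers only ports, URLs and files (variables and expressions are left out).

```java
// Hypothetical classifier for a command argument; not xmlsh's real resolution logic.
class ArgSource {
    enum Kind { PORT, URL, FILE }

    // Decide what a "filename" argument refers to.
    static Kind classify(String arg) {
        if (portName(arg) != null) return Kind.PORT;
        if (arg.contains("://")) return Kind.URL;   // e.g. http://url
        return Kind.FILE;                           // anything else is treated as a path
    }

    // Returns the port name for "(name)" or "port:name", or null if arg is not a port.
    static String portName(String arg) {
        if (arg.length() > 2 && arg.startsWith("(") && arg.endsWith(")")) {
            return arg.substring(1, arg.length() - 1);   // quoted form: "(in)"
        }
        if (arg.startsWith("port:") && arg.length() > "port:".length()) {
            return arg.substring("port:".length());      // proposed scheme form: port:in
        }
        return null;
    }

    public static void main(String[] args) {
        for (String a : new String[] { "file", "http://url", "(in)", "port:in" }) {
            System.out.println(a + " -> " + classify(a));
        }
    }
}
```

One trade-off of the scheme form is that a file literally named port:in would need escaping, just as (in) needs quoting today.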

Suggestions and comments welcome