## Wednesday, December 2, 2009

### Version 1.0.0.0 released

Better late than never. Over two years after its conception, I've released what I consider a "release"-quality version, so I've named it 1.0.0.0.

This also marks the point where I am going to try extra hard not to introduce any backwards-incompatible changes, and ideally to keep syntax changes to a bare minimum.

I now consider xmlsh ready for production use. It has no 'known' bugs, although of course there are bugs; I just don't know about them yet. It has been used in production systems for over a year.

The latest changes include:

* updated to Saxon 9.2 (HE)
* a tie command that ties an xquery expression to variable expansion
* much improved serialization options, applied uniformly to all XML commands
* J2EE servlet code (tested in Tomcat)
* support for embedding xmlsh in your own code
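As a rough sketch of what the new tie command enables (the invocation below is only my guess at its shape, modeled on the xbind idea in the November 27 post; the real syntax may differ):

```
# hypothetical usage of tie: bind a variable to an xquery expression
props=$<(xproperties -in file.properties)
tie props '<[ /properties/entry[@key=$key]/string() ]>'
xecho ${props:a}    # would run the tied query with key "a"
```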

## Friday, November 27, 2009

### Looking for a "Map" or "tie" syntax

I'm trying really hard, but failing, NOT to add more syntax to xmlsh.
The problem I'm trying to solve is the clumsiness of accessing properties, or name/value pairs,
in particular when they are serialized to a properties-style file.

With the advent of the xproperties command, you can now read a standard Java properties file (in either text or xml form) and assign it to a variable.

Suppose you have a simple properties file

```
a=b
```

This parses into XML (using the standard Java Properties API) as

```
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties version="1.0">
<entry key="a">b</entry>
</properties>
```

```
props=$<(xproperties -in file.properties)
```

You can access the value of key "a" with the expression:

```
<[ $props/properties/entry[@key="a"]/string() ]>
```

Am I the only one who finds this extremely cumbersome ?

I would like to do something like one of these:

```
$props.a
$props[a]
$props["a"]
${props:"a"}
```

But any of these syntactic sugars requires the shell to "know" the structure of $props and do something magic with it. I really don't like the idea of extending the core syntax to handle a particular schema, even one as common as Java Properties. Suppose, for example, I wanted a Map instead so I could store non-string values ... Properties won't cut it. MarkLogic has an xquery function for that (map:map()) and various map:* methods to get at the values. I'd like something like this, but built into the shell and extendable to arbitrary XML schemas. What to do?

The latest thought I had was to borrow something from Perl (gasp): the tie function. "Bind" a variable to an expression so that you could define your own shortcuts with map- or array-like syntax. Suppose, for example, I could do something like

```
xbind props '<[ //entry[@key=$key]/string() ]>'
```

and then magically make $props["a"] (or maybe ${props:a} or ${props[a]}) invoke the bound expression and produce "b". You could then use this mechanism to create array- or map-like notation out of ANY schema.

Is this worth the effort? When does adding syntax actually start to detract from a language instead of adding to it?

## Thursday, November 26, 2009

### New release of xmlsh and marklogic

Yesterday I released a new version of xmlsh and the marklogic extensions. The most notable new feature in xmlsh is the addition of serialization options to almost all commands. For example, xcat, xquery, xslt etc. can now change the output method for a single invocation without changing it globally, for example:

```
xcat -output-method html
```

I have spent many laborious hours updating the xmlsh documentation wiki (www.xmlsh.org) to reflect these updates, as well as trying to standardize on a markup for options. My previous attempt at formatting options was horrid; now options are in a tabular format.

Improvements to the MarkLogic extension are coming slowly. As I am using MarkLogic more myself, I keep finding useful new features which I want to become commands. Notable are:

* an option to ml:list to list only the contents of a particular directory
* a new ml:listdir to list directories (invisible to ml:list)
* a new ml:deldir to delete directories

## Friday, November 20, 2009

### xslt1.0 and servlets

Two major accomplishments this week. I've had to start processing "SPL" files from the FDA. Turns out the XSLT the FDA publishes is (how to say this nicely) ... "not the highest quality". It won't process at all with Saxon 9 due to errors. The errors exist in Saxon 6 as well, but there they are warnings, so they were ignored. There were so many that I couldn't easily fix the XSL file, so I needed to process it with Saxon 6. I was able to implement an "xslt1" command which uses the Saxon 6 jar in the same JVM as Saxon 9! Quite a feat ... so now xmlsh has xslt 1.0 and 2.0 commands built in.

In addition I needed to run XSLT from MarkLogic.
Since it's not natively supported, the suggested workaround is to use a servlet and http-post the xml to the servlet. Xmlsh to the rescue! I wrote a simple Tomcat servlet for xmlsh. There were some weird problems with it taking over stdin/stdout, so I had to improve on the assumption about taking over System.in and System.out in a container environment. I also had to buffer the input and output, or else strange things happened occasionally depending on the document size, but now it's working! Expect to see an xmlsh servlet, either as a separate package or built in to the core distro; I haven't decided yet.

## Saturday, November 14, 2009

### Embedded commands or roll your own ?

Some commands benefit from being integrated into xmlsh, even if they are 'fairly easy' to do *with* xmlsh without being embedded. An example is schematron. Schematron is easily implemented as a simple 4-line xmlsh script. But the fact is you have to figure that out. You need to download the schematron xsl files and figure out how to call them, in what order, and how to pass the temporary files around. It took me an hour to figure it out. For that reason I included schematron as a single command in xmlsh. Even though it's not 'magic', it makes life easier.

But what about something that's truly trivial to do without it being "embedded"? For example, a user asked for an "html to xml" command using something like tidy or tagsoup. I just downloaded tagsoup.jar and discovered that it runs perfectly 'out of the box' with a single xmlsh command. Assuming you have the jar file downloaded, this command runs tagsoup:

```
jcall -cp tagsoup-1.2.jar org.ccil.cowan.tagsoup.CommandLine file
```

You (or I) could wrap this into a 1-line script "tagsoup" that does this. Is it worth embedding this into xmlsh? The advantage is that you don't have to find this jar file, put it somewhere, and reference it.
The disadvantage is that *I* have to include this jar file in the distribution just for one command which, however useful, is 'just another java call'. Does this warrant 'first class citizenship'? Where do I stop? I could pass the buck and make it an 'extension module', but in fact that's about as hard on the user as just getting the jar file. Plus there's documentation. When I embed a command I need to document it. That means copying the docs from tagsoup (or referencing them). Is this good or bad?

I want xmlsh to include all the necessary, and even useful, tools for common xml processing. On the other hand I don't want it to be the entire universe of software in one bundle. How to draw the line? Inquiring minds want to know!

### xmlsh 0.1.0.6 released with xproperties

Thanks to a suggestion by an xmlsh user, I implemented xproperties today and released a new version of xmlsh with this new command and some bug fixes. Also released an update to the MarkLogic extension with support for rename. Now that I'm experimenting more with MarkLogic, expect more improvements in this extension module.

## Thursday, October 29, 2009

### xmlsh 0.1.0.5 released with ant task

Released xmlsh 0.1.0.5 today to sourceforge. A major enhancement (which took minor work!) was ant support. There is now an XmlshTask ant task which can be used to call xmlsh from within ant using the same JVM. I also cleaned up a problem where System.in/out/err was being closed on exit, which prevented xmlsh from being used nicely when embedded in other java programs (like ant!).

## Wednesday, September 30, 2009

### xmlsh 0.1.0.3 released with try/catch/finally

I have just released (posted to sourceforge) xmlsh release 0.1.0.3. This release implements a java-like try/catch/finally/throw syntax.
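A sketch of what the java-like syntax allows (the exact shape below is my guess from the release note; consult the xmlsh docs for the real grammar):

```
# hypothetical example of the new try/catch/finally/throw syntax
try {
    xquery -f query.xq < input.xml
} catch ( e ) {
    echo query failed
    throw $e
} finally {
    echo cleanup runs either way
}
```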
## Monday, August 31, 2009

### Upcoming - xmlsh docs in docbook

Thanks to some excellent mentoring, great patience, and help from Dave Pawson (including a start at a Wiki-docbook converter), I've decided to embark on converting the xmlsh docs to docbook format. I'm going to start with the hard one, the command "man page" documents, and go from there. Advice and volunteers welcome!

## Friday, August 28, 2009

### sourceforge ... why ?

When I first started xmlsh I chose Sourceforge for hosting. Why? I admit one of the main reasons was a sense of 'legitimacy'. It seemed to me all the "Good" Open Source projects were on sourceforge (of course so are a lot of "Bad" ones) ... but by picking Sourceforge I felt people might "trust" the open source nature of the project more than if I hosted it on my own server. Until recently, for example, SF refused to let you delete files. That's a good thing, right? You can't retract your project once it's published. You're making a commitment to Open Source by being on Sourceforge. Right.

I also thought I might use some of SF's collaborative features such as the wiki, bug tracking, groups etc. Well, two years later and I'm less enchanted. SF is seriously thrashing. For the last two weeks I've only intermittently been able to check in to svn. The analytics are spotty; sometimes they work, sometimes they don't. I don't use the bug tracking (and neither do my users). No one's subscribed to the mailing list or used the forums. And ... it's getting slower and more ad-laden every day. If you click to download you're forced to view a 20+ second ad before you can download. Not that I'm complaining ... for a "free" service what do I expect? But this last trouble with SVN is bothering me. I found several other open tickets with the same error report over 3 weeks old. It seems SF is collapsing under its own weight, and maybe I'd better abandon ship before the last rat beats the captain. Or maybe "really soon now" it's going to turn into the awesome site it once was ...
Do I hold my breath or jump ship? Do people really think, as I thought, that SF adds a sense of "legitimacy" to an OS project? Is it worth it given that it doesn't work and their tech support isn't supporting? Curious minds want to know.

## Thursday, August 27, 2009

### xmlsh extension functions in xpath/xquery/xslt

Thanks to Kurt Cagle for the suggestion, I've implemented extension functions in xpath/xquery/xslt and the builtin <[ ... ]> notation. The one function so far is "eval". This allows you to call into xmlsh from within any xpath expression and return the result (stdout) of the command back. Right now I have set it up so you have to manually declare the namespace, saxon style. Examples:

```
declare namespace xmlsh=java:org.xmlsh.xpath.XPathFunctions
xecho <[ xmlsh:eval("xecho $*" , ("foo" , <bar/>, 1 ) ) ]>
xquery -n 'xmlsh:eval("xls")'
```

### Expanding sequences into positional parameters.

I'm debating a syntax to convert a sequence into positional parameters.

The problem is this. Suppose I have a sequence expression, say a variable:

```
a=(a b c d e)
```

and then want to pass that sequence as separate parameters to a command. Using $a does NOT expand the sequence. For example

```
command $a
```

passes 1 argument of type sequence (unless command is an external program in which case the sequence is converted to args).

This is typically desirable. Sequence expressions should be passable to commands without "flattening". This is particularly useful if you want to pass multiple sequences, for example as xquery or xslt parameters:

```
xquery -q ... -v value1 $a -v value2 $b
```

a and b can be sequence values. But suppose I actually want a sequence to be flattened? I haven't figured out a clean way to do it in the current syntax. Posix shells don't help with this, as they don't have sequences. Bash has arrays, which are similar, but you can't pass an array around to a command.

Eval can do this if the sequence contains simple unquoted strings like the above:

```
eval set $a
```

sets $1,$2,$3,... to a,b,c,... BUT

```
a=<[ <foo/> , 2 , <bar>spam</bar> ]>
eval set $a
```

is a syntax error. I believe this works, but I haven't tried it in all cases:

```
set --
for arg in $a ; do
    set -- "$@" $arg
done
```

Ugly though!

I'm considering the "all array" syntax ${a[*]} to do this. Instead of returning all elements of the sequence as a single sequence value, it would return them as multiple positional values, similar to $*.
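To make the proposed behavior concrete (the ${a[*]} form here is speculative, not implemented syntax):

```
a=(1 2 3)
command $a        # today: one argument holding the sequence (1 2 3)
command ${a[*]}   # proposed: three separate positional arguments: 1 2 3
```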

That might do the trick, but I'm not sure I like it. The odd thing is that this feature is only really useful in the context of command argument expansion. E.g.

```
b=${a[*]}
```

would do nothing different than b=$a.

This is a bit subtle, but the difference is that single variables can only hold one level of sequence, whereas positional parameters hold two levels. That is, "$*" is a List of Values where each value can be a sequence.

I'm also considering both a prefix and a suffix modifier, such as ${+a} or ${a:+}. Of course none of these syntaxes would be at all obvious until you learn them!

## Wednesday, June 10, 2009

### Limited background execution

I'm preparing test cases for the upcoming Balisage 2009 conference, and I'm re-encountering an age-old scripting problem/hack/idea. Typical scripting languages (say unix shell scripts or CMD or other scripting languages), if they support "background execution" at all, support a very limited form of it. Take the unix shells for example. You can run a program "in the background" (as a non-blocking process) with "&". On Windows (cmd.exe) you can use "start". Xmlsh is the same as the unix shells: you use "&".

That's all great when you're doing simple things. & + wait == job control. Great. But a common problem is that if you're running N background processes and N is large or unknown, it's hard to control. I've written shell scripts that can handle "N" (say 10) background processes and block if you try to do N+1, but it's very clumsy scripting, not something I recommend, and a very special use case. But in today's multi-core world, being able to run N background processes (or threads) is very useful, *extremely* useful if you can arrange for N not to become "huge".

Take a simple XML example. Suppose I have a directory of files and I want to run xslt on them all. xmlsh script:

```
for i in *.xml ; do
    xslt -s script.xslt < $i > output_dir/$i
done
```

But suppose I'm running on a 4 or 8 core CPU and this is a CPU-bound process. I'd like to run these with some kind of parallelism. This will work ...

```
for i in *.xml ; do
    xslt -s script.xslt < $i > output_dir/$i &
done
wait
```

But if the number of files becomes large (say > 10) then this thrashes the system so badly that not only does it slow down, it risks eating up system memory such that you can't guarantee completion. If you were writing an enterprise-type product you could use a Thread Pool or Worker Thread or Worker Process model, with a queuing system: send requests into the queue and have, say, 10 worker threads/processes doing the work from the queue. This may actually be implementable in the ongoing experimental http server module in xmlsh, but suppose I wanted something less elegant but nearly as useful. Imagine a syntax that says "start a background thread, but ONLY if there are < N outstanding background threads". I can imagine a very simple syntax, maybe like this:

```
for i in *.xml ; do
    xslt -s script.xslt < $i > output_dir/$i &[10]
done
wait
```

This would mean "start up to 10 concurrent xslt threads but no more". It may not be quite as efficient as a worker pool, but the syntax and implementation could be very simple. My estimation is that the end result would be nearly as efficient.

## Thursday, May 28, 2009

### "World's Simplest Web Server"

Ok, that's a bold claim, and I'm not going to do the research to prove it. But to me this is awesome. I'm working on http client and server support for xmlsh so that I can use xmlsh to prototype a content server. It's not quite complete yet, but as an example, this code implements a full HTTP server serving content from the current directory tree. I tested it out by cd'ing to the JDK docs/api directory and launching IE with "http://localhost/index.html" - works!

```
get () { cat ${PWD}$1 ; }
httpserver -port 80 -get get start
```

That's it! Every GET request gets executed by the local "get" function, which cats out the file.

Speaking of "cat", though, I've decided that I need to implement basic unix commands natively in xmlsh.
Originally I didn't want to do this because it was "reinventing the wheel", but the above example is a great case for it. In the above, the get() function actually has to spawn 3 threads and a subprocess just to cat the file. There is no native syntax to stream from input to output without running a command. If these were all xml files then I could use "xcat", which would be very efficient, or even "xread a < file ; xecho $a ;", but for text files there is no builtin "cat" command, so it's subprocess/thread time. Yuck!

The same goes for some of the basic unix commands like touch, mv, cp, cat, rm, mkdir, ls.
All of these currently require a unix subsystem (like a real unix OS or cygwin).
It would be very nice if these basics were 'built in'. I'm thinking of truly building them in as internal commands (with a very simplified option set) or as an "extension module". Comments appreciated.
The advantage would be not only performance of the basic commands, but also usability: a pure xmlsh script could depend on these basic commands existing.
For example, the test cases already check the environment for these commands; it would be nice if they could simply be relied on.

## Friday, May 22, 2009

### Wiki Spam

I've finally had to go authoritarian and turn off self-registration on the wiki (http://www.xmlsh.org).

At first I noticed some new posts and was excited that someone was helping to add content to the site. Then I read closely and discovered it was subtle spam for a paid web service unrelated to xmlsh or even xml. I edited out that part, leaving in the good part which was added. Then today I discovered pure graffiti, a plain web link added to the top of the page.

This "community authoring" model might work for something heavily trafficked like wikipedia, but its not working for me. I beg my friends to add content, but instead strangers login and spam the site. I don't have the time to keep up after it so I've turned it off.

If you would like to add non-spam content or even correct bad spelling, please let me know and I'll gladly register you and send you the login. If you want to add spam you'll have to try a little harder now. Sorry.

## Wednesday, May 13, 2009

### Working on Documentation

As part of moving into Alpha2 and preparing for Beta, I'm slowly working on the documentation.  This is in wiki format on the main site (http://www.xmlsh.org)

Any comments or suggestions on how to improve the documentation are greatly welcome.
Any volunteers to help with the documentation are even more welcome!

There's a dual purpose to documentation. The first, of course, is to document things so people can use them (even for me; I actually just stumbled on a feature I forgot I implemented).

But the other purpose is to flush out problems that are not obvious in the test cases or the code, but become obvious when documenting. For example, I just realized (and fixed) that the only named port was "error"; I hadn't actually implemented the implicit stdin/stdout as named ports. I only discovered this while cleaning up the port redirection page.

The problem with the second purpose is that it's way too easy to get sidetracked and start working on implementation ... 1 minute of documentation can easily lead to hours of implementation ... and hence that's why the docs are in such bad shape :(

Suggestions on what to focus on and how to avoid getting caught up in implementation welcome.

## Friday, May 8, 2009

### Alpha 2 released

I think this is a major milestone. I released Alpha 2 today (0.0.2.0).
While these version numbers are somewhat arbitrary they are a mental guide.
With Alpha 2 I have semi-formally "frozen" the syntax and will focus on stability, minor feature enhancements and command enhancements, while attempting not to change the syntax, or at least not to change it in an incompatible way. It would be foolish to promise I won't add new syntax (thinking of try/catch, for example ...), but I am going to try to keep any syntax changes minimal and compatible.

I believe this release is ready for production environments, in controlled situations. At my day job, xmlsh has been running in production for about 9 months so you can be assured that I'm not going to do anything that would break that.

## Thursday, April 30, 2009

### Preparing for Alpha 2

I am preparing a snapshot release which will become "a2", aka release 0.0.2.0.
I believe I am largely done mucking around with the syntax and core feature set of xmlsh and am getting nearer to the refinement and enhancement stages. The Alpha 2 release will more formally announce that I'm close to freezing the language specs. Alpha 2 will focus on tying up the last loose ends in the feature and language core and preparing for Beta. Beta 1 will formally freeze the language specs, begin freezing the core command set, and focus on bug fixes and stability for a true release.

ETA ... as long as it takes :)

## Wednesday, April 15, 2009

### XML Event Pipeline - Initial Results

At long last I got the code in shape so that I could implement "native" piping in xmlsh using a binary event queue instead of serializing to text. From the beginning this has been my goal, but it's been more difficult than I thought. A primary complication is that not all commands input or output XML; unlike, say, XProc, which can only have XML in its pipes, xmlsh requires working with text streams as well.
For example "echo foo | cat" should work just as well as "xcat foo | xcat".

I finally got it to work. Truly event-driven pipes that can stream both text and XML events (StAX events). All the tests finally passed and I was feeling really good, until I ran performance tests. Ugh. The event pipes are about 2x slower than text pipes! Including the overhead of serializing and parsing the text format! This is totally shocking to me, as I had always presumed that the majority of the overhead would be in serializing and parsing XML text. But nope: it turns out that creating the event objects to put in the pipe is less efficient than serializing to text and then parsing the text on the other end. Certainly this is a consequence of a particular implementation, not a general statement. But still, it's not a result I expected, and performance analysis doesn't show any huge smoking guns. The biggest issue seems to be in the StAX event creation. I didn't fully realize until now that the XMLEvent model is non-symmetric. That is, when writing events you don't write the same events as you receive. You write out StartElement, Namespace and Attribute events, but when you read, they are consolidated into one StartElement event. The overhead during write of consolidating these events turns out to be much bigger than serializing them to text and re-parsing the text.

I'm going to put this aside for a while, whilst I contemplate the error of my thinking. For now, if you want to experiment with the XML pipelines, the code is checked in, and you can enable them by setting the "XPIPE" environment variable. I will probably change that in the future.
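For example (a sketch; the post says the XPIPE switch is experimental and will probably change, and the value checked is my assumption):

```
# enable the experimental XML event pipes, then mix text and XML streaming
XPIPE=1
echo foo | cat          # text events flow through the same pipe machinery
xcat doc.xml | xcat     # StAX events, with no text serialization in between
```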

## Wednesday, April 8, 2009

### MarkLogic integration

I'm working on a simple integration with MarkLogic servers in xmlsh. I now have the basics working, which, given how ML is structured, should be enough to do almost anything. The following commands are now supported (in my dev environment):

* list - list all documents in the DB
* put - put a document to the DB
* get - get a document from the DB
* del - delete a document from the DB
* query - run an 'ad hoc' xquery on the ML server
* invoke - run a stored xquery on the ML server

Interestingly, list, del and get are implemented as trivial xsh scripts calling query.
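Conceptually, a list-like wrapper is just a one-liner (a sketch, not the actual script; it assumes the query command is exposed as ml:query, and relies on MarkLogic's behavior that fn:doc() with no arguments returns every document):

```
# sketch: list all document URIs by delegating to an ad-hoc query
ml:query 'for $d in fn:doc() return xdmp:node-uri($d)'
```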

I plan on publishing this shortly and would love any input on how best to integrate with marklogic services; I am new to the ML server, so I don't know the "best practices".

## Sunday, April 5, 2009

### Named ports

The original incentive was to be able to more closely map to xproc port semantics, even though I may end up not using them for xproc. I decided to include them anyway because I think this is one area that the unix shells are deficient, or too tied to legacy decisions.

In unix, processes access environment-supplied streams by file number (0, 1, 2 ...).
There are by default 3 standard ones (stdin, stdout, stderr) corresponding to 0, 1, 2.
The unix shells give access to these via the standard IO redirections (cmd > file, cmd < file etc.). If you want access to anything except the first two you have to use a numeric modifier, e.g. cmd 2>err. You can do this with any file descriptor, like cmd 30>file 40<file2. I've always thought this was a bit of a hack. It's certainly very closely tied to the OS. Ported versions of shells often can't support this syntax for anything but the predefined 3. Java runtimes have no direct portable way to do this either. Even if they did, it wouldn't work in xmlsh, because shell instances are not run as separate sub-processes. It could be done in xmlsh by keeping a mapping table of ports that the shell redirected, so that internal commands could access them.

Instead I chose to go with named ports as xproc uses. I've completed the plumbing to support this but am not yet happy with the syntax. The part I'm happy with is this: for any command you can augment a redirection with (port). For example:

```
cmd (output)>file1 (alternate)>file2 (input)<file3
```

Inside cmd, the XCommand (via XEnvironment) exposes these ports as named ports and you can get at them via getInput(name) or getOutput(port).

But suppose you want to specify the name of the port to the command and not have it confused with the name of a file. As an experiment I've added general support for port naming, so that any command which wants a stream can be passed a "filename" and can access Files, URLs, Variables, Ports and expressions interchangeably. For example (and largely for testing) I have implemented xidentity to support optional filenames, so you can do

```
xidentity file
xidentity http://url
xidentity {variable}
xidentity xml_variable
xidentity <[ <doc/> ]>
```

and now with ports you can do

```
xidentity "(in)" (in)<file
```

Note the "(in)" is quoted. By the time xidentity gets it, it's simply the string (in), but to get it through the parser you have to quote it. I find this less than ideal. But if I remove ( from being a magic token then it breaks all sorts of things, including the ability to distinguish between arguments and IO redirection. (I.e. in "cmd (in)<file", does cmd get 1 arg or 0?) However, the ability to pass in port names in the same context as filenames or URIs is very compelling. Alternative suggestions welcome. Maybe an extended scheme? Like port:name, e.g.

```
xidentity port:in (in)<file
```

Suggestions and comments welcome.

## Friday, March 27, 2009

### GUI tool for discovering class relationships idea

I work a lot with libraries of code which I did not write myself. I also have this problem with my own code, but more often with sets of "foreign" code. The problem is that I can end up with an 'impedance mismatch' between types and need to figure out how to match them up. Say I end up with class A but need a class B. I'm hoping there is some way to convert A to B. As a real example, recently I had an XdmNode and wanted an XMLStreamReader. In a simple case I would look in A and see if there is a method that returns B, or look in B and see if there is a constructor that takes A. But in the real world it's rarely that simple; there are frequently dependency issues and also chains of transformations. E.g. to convert A to B might require a C, or might require going through an intermediary path like A->D->B. In the above real example, the path ended up being

```
XdmValue -> ValueRepresentation -> { Value | NodeInfo } -> SequenceIterator -> PullFromIterator -> PullToStax -> XMLStreamReader
```

That's 6 levels of transformations! I have been working in that codebase for several years and still it took some hints from the author to figure it out, and then lots of luck. This got me thinking ...
Wouldn't it be nice if there was a tool for this task? There are lots of java class browsers (I use the Package Explorer in Eclipse), but none that I know of do this job: given 2 classes, show all the paths between the classes, as well as all the dependencies along the paths. Now, to be really useful, the routes need to be appropriate as well as existent. For example, in the above case I first found, and used, a different path from XdmValue to XMLStreamReader which turned out to use a class that was not entirely functional (it lost Location information). There's no way I can imagine a tool knowing this short of AI. Which leads to a whole different set of ideas: an AI that can understand software. Something you could ask "What's the best way to create an XMLStream from an XdmNode in this context?". Or maybe "Will this code actually work right?", "Please generate unit tests for this set of code, run them, and show me what broke and why". This seems an awful lot like the "Theorem Prover" software that was being worked on aggressively in the 70's and 80's, but which I haven't heard much of since.

## Sunday, March 8, 2009

### Text serialization

I'm working at adding more internal interfaces to the IPort hierarchy, which is used by OutputPort and InputPort (and hence IOEnvironment). I'm adding a StAX interface in preparation for a true binary event based pipe. As I add more interface options, the problems of both interop and serialization get worse. I'll leave interop out for now. But for serialization ... I'd like a common way of specifying text serialization options. Right now it's pretty much left up to the individual commands, but that's just not right. For example xpwd, xls, xcat, xquery etc. may use different internal interfaces (SAX, Saxon, DOM, StAX etc.). If the results end up in a text file or stdout (text) then the serialization may differ. For example, they may or may not omit the xml declaration, use indentation, do namespace fixups etc. It's all pretty hog-wild.
I've tried to standardize on a common serialization format, but that's not manageable in the long run. Users really need a way to force the serialization options explicitly, both on a per-command basis and globally. When I first started xmlsh I imagined some kind of common "output filter", either explicitly as a pipe, like

```
command | xformat -options
```

or maybe implicitly as a common filter on the output of all commands. Lately I've been thinking about building this in deeper, as a set of properties inherited via the environment variables: a "serialization" property set. Xproc has a well-defined set of these parameters, borrowed from xquery. There's a lot to them. The problem is related to the multiple interfaces: not all interfaces and APIs support all serialization parameters. A simple example: StAX doesn't have a property to avoid writing out the xml declaration; you can fake it by avoiding writeStartDocument(), but there's no global way to set it. Similarly StAX doesn't have an "indentation" property, although you can fake it with a filter. SAX, DOM and Saxon all have different sets of properties they can support (Saxon being the richest, or at least the closest aligned with xquery and xslt). So how to implement this? At the user level I think an XSERIALIZATION variable makes sense; this can be inherited and overwritten for child processes/commands. But internally ... I have yet to figure out a way to consistently apply this property. It may be that I have to filter all output through a serialization pipe/stream ... which adds unnecessary performance overhead in the cases where it's unneeded. Ideas very welcome!

## Wednesday, February 11, 2009

### Block Quotes & CDATA

A lot of what I'm adding to xmlsh lately is to support the xproc project. Xproc poses some very unique challenges, which is good for exposing weaknesses in xmlsh, and weaknesses in my brain :) One problem is quoting (see the "Quoting is Hard!" post).
In order to implement passing arguments to xproc steps I have to be very careful with quotes in the code generation, as the strings have to be passed unchanged through xmlsh on to the underlying commands or sub-commands such as xpath expressions. One problem with XML and XPath (and by inference xproc) is the frequent use of both types of quotes, single and double, interchangeably. The unix shells have a very particular interpretation of quoting, and of quoting quotes, which is largely incompatible. Say I need to pass "foo" unchanged: then I need to quote it as '"foo"'. But if I need to pass 'foo' unchanged it needs to get quoted as "'foo'". And if I need to pass through '$foo' I have to use something more complex like "'"'$foo'"'" ... mortal brains were not intended for this !!! So I thought "how does XML handle this ?" The closest thing is CDATA sections ... which pretty well solve the problem ... but CDATA is one hell of a verbose syntax. The above would be <![CDATA['$foo']]>

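Those quoting gymnastics can be sanity-checked in any bourne-style shell; a plain bash sketch (nothing xmlsh-specific here):

```shell
echo '"foo"'        # -> "foo"   single quotes pass double quotes through
echo "'foo'"        # -> 'foo'   double quotes pass single quotes through

# passing '$foo' through needs alternating quote styles: the single
# quotes must survive AND the $ must not be expanded
echo "'"'$foo'"'"   # -> '$foo'
```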
UG ... is there something simpler ? How about XML comments ? <!--'$foo'--> This is certainly prettier ... but it's semantically misleading. XML comments are supposed to be comments, not quotes, so if xmlsh used XML comment syntax as quotes it would be very confusing. The solution ? I made something up, of course !!! I'm not fully committed to this syntax yet, but it passes some basic tests. It needs to be a multi-line quote syntax, needs to use strings not common in either shell or XML expressions, shouldn't be "too ugly", and should be somewhat similar to other shell expressions. I first tried <{ block quote }>, then discovered (duh !) it conflicts with the port syntax <{port}. I went with <{{block quote}}>; the '$foo' example is expressed as
<{{'$foo'}}>

This seems not too bad and wasn't a huge difficulty to get into the grammar.

## Tuesday, January 20, 2009

### Quoting is hard !

After all these years I never really appreciated how hard quoting is. I don't mean simple echo "foo bar" quoting, but all the nuances of backslash, single quote, variable expansion, wild card expansion etc. I'm getting close in xmlsh but not quite there. I think to go the last 10% I'm going to have to rewrite the entire word expansion module. A complication is that backslash quotes have to be recognized, and sometimes stripped out, but their effect is long-lasting. Consider this:

echo \*

Seemingly simple ... but when this is split up into the many types of expansion, wild card expansion actually occurs last, AFTER removing backslashes and then doing variable substitution. So this can work:

a=*
echo $a

variable substitution has to occur first ... which means backslash recognition has to occur before that ... so THIS can work

a=*
echo \$a

It gets a lot harder than that. The difference between "foo\bar", "foo\\bar" and "foo\\\bar" is strange enough ... but add in '' and $ and then XML expressions and wildcards and it really gets strange.

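The expansion ordering described above is easy to verify in a conventional shell. A bash sketch (the scratch directory and file names are made up for the demonstration):

```shell
cd "$(mktemp -d)"            # empty scratch directory
touch one.txt two.txt

a=*          # assignments do no glob expansion: a holds the literal *
echo $a      # -> one.txt two.txt   (substitute first, THEN glob)
echo "$a"    # -> *                 (quotes suppress the glob)
echo \$a     # -> $a                (backslash seen before substitution)
```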
I think I'm going to have to shift over to a "color" model ... that is, during expansion, keep track of the quote "color" on a character-by-character basis. By color I mean \ vs " vs ': any given character could be colored by one or more of these attributes simultaneously, and that affects further processing ... even after the offending quote chars are removed.

## Saturday, January 17, 2009

### URI for CWD ?

A few weeks ago I added a feature where URIs can be used in place of filenames everywhere that filenames are used for input. This works for all internal and builtin commands as well as in IO redirection.

For example this works because IO redirection is done within xmlsh

cat < http://test.xmlsh.org/data/books.xml

This works because xcat is an xmlsh command

xcat http://test.xmlsh.org/data/books.xml

But this does not work (because cat is not an xmlsh program)

cat http://test.xmlsh.org/data/books.xml

I added this both to support easy access to web data and to be able to track the base URI, mainly for xproc support. Base URI support is useful not only for xproc but also for expanded entities, such as in the following case

xcat <>

XML-oriented commands (xquery, xed, xslt) can work correctly with a default namespace. But what about a default base URI ?
So I could do something like

declare base-uri http://test.xmlsh.org/data
xcat books.xml

But now there is a conflict between the base-uri and the current directory.
How does the shell know to pull books.xml from the web and not from the filesystem ? Once you set a base URI you can't get at files anymore.
This got me thinking further ... what is the base-uri if not a kind of current directory ? What if they were the same thing ? Then you could "cd" to a web address, for example

cd http://test.xmlsh.org/data
cat books.xml

ftp could work too
cd ftp://test.xmlsh.org/data

This would actually be pretty easy to implement. And maybe useful ?
But the side effects could be weird. Questions arise if I did this :

What would * expand to ? ( echo *)
What do I set as the current directory for external programs ?
How would xls work ? (I experimented with ftp directories and they may be parsable,
but most http directories are not.)

## Friday, January 16, 2009

### which pipeline is "this" shell ?

A long time ago ... in a dark college basement, I discovered that /bin/sh did a weird thing with pipelines. In the pipeline
a | b | c

it is the LAST segment ("c") which is run in the current shell; "a" and "b" run in forked processes. I always found this somewhat strange until I realized it makes this syntax actually work:

$ echo foo | read a

Because "read a" is executed in "this" shell,

$ echo $a
foo

Try this in bash or other "modern" shells and it doesn't work.
I wonder why no one noticed ? I just tried in a modern Linux FC8 ksh and voila ! That wonderful legacy behaviour works !

[dave@home ~]$ksh$ echo foo | read a
$echo$a
foo

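A side note of mine, not from the ksh transcript above: bash can actually opt back into this behaviour. Since version 4.2 the lastpipe shell option runs the last pipeline segment in the current shell, provided job control is off (i.e. in scripts):

```shell
#!/bin/bash
shopt -s lastpipe    # run the last pipeline segment in this shell

echo foo | read a    # "read a" now executes in the current shell
echo "$a"            # -> foo
```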
But unfortunately xmlsh is different from both bash AND ksh: it runs all pipeline commands in a sub thread/shell. Does anyone have opinions on how useful or important this somewhat arcane behavior is ? It definitely has some use cases, but I wonder why the bash authors didn't deem it important enough to preserve.

In xmlsh it may be even more useful, consider this in a script

xslt ... | xquery ... | xread DOC

This can't be implemented easily otherwise. You might try

DOC=$<(xslt ... | xquery ... )

but since the $<( ) syntax doesn't read from stdin, you'd have to do

DOC=$( echo $DOC1 | xslt ... | xquery ... )

I guess this is back to the forking question ...

## Tuesday, January 13, 2009

### Forking Input

I'm about to embark on a significant feature enhancement/change to xmlsh to support xproc. Xproc requires that streams (pipes) be able to "fork": the input to a step (command) may have to be copied and sent to multiple places, including expressions used for argument construction. This could be done by reading the input into an XML variable and then passing that around to the various places that need it, but I'd like to keep the generated xmlsh script as "natural" as possible and, wherever possible, preserve the ability to stream. In xproc, "natural" means pretty much everything has access to (potentially a copy of) the input stream. But in xmlsh, streams are read as in the unix shells, where any reader of the input consumes it.

I'm debating whether I should add explicit syntax to cause a stream fork, or possibly fork implicitly. The reason that unix shells consume input is somewhat dependent on the unix OS pipe and file semantics; it's not entirely clear whether preserving this notion is important in xmlsh.

For example suppose I wanted to run both an xquery and xpath on the standard input.

xquery '//foo'
xpath '//foo'

The first command (currently) consumes all the input and the second command fails. Is that best ? Maybe xproc has a good point. Suppose the above commands didn't consume the input: both the xquery and the xpath could read a 'copy' of the input stream and produce results. Would this be more useful ?

An alternative is to provide an explicit syntax for forking. For example

| xquery '//foo'
xpath '//foo'

In this case I invent the "|" syntax with no leading command to mean "fork the input". That way it's explicit that xquery gets a copy of the input and xpath then consumes it.

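None of this is xmlsh syntax, but for intuition: at the unix level the closest existing analogue to a stream fork is tee, which copies its input to a file (or, via process substitution, to another command) while still passing it down the pipe. A bash sketch with made-up file names:

```shell
printf '<foo>1</foo>\n' > in.xml

# tee "forks" the stream: one copy lands in copy.xml,
# the other continues down the pipeline
cat in.xml | tee copy.xml | grep -c foo    # -> 1

# with process substitution the copy can feed another command:
# cat in.xml | tee >(cmd1 > out1) | cmd2
```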
A similar problem comes up with command substitution, such as

echo $(xpath '//foo')

Right now the xpath gets a null input stream, because otherwise it would consume all the input. But suppose that command substitution also got a forked copy of the input ? Something like this is actually required by xproc (the with-option tags must be able to read from the standard input). Again, I could implement this all by reading the entire input into a variable at the beginning and then echo'ing it all over the place, but it's a compelling idea to natively support stream forking. Not only would there be some convenience in script authoring, but some optimizations could be done which would be hard if the forking was explicit.

### Command substitution file parameters

I just finished implementing the file option to command substitution. This is the same syntax that bash/ksh use. In 0.0.1.4 you can use the expression $(<file)

a=$(<file.txt)

then words are not expanded, so $a is a single string. But when used in an argument list such as

set $(<file.txt)

words ARE expanded. The expansion can be suppressed with quotes, as in set "$(<file.txt)"

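The same semantics can be checked in bash, from which the syntax is borrowed (a quick verification of mine, file name made up):

```shell
printf 'one two\n' > file.txt

a=$(<file.txt)         # assignment context: no word splitting
echo "$a"              # -> one two   (a single string)

set -- $(<file.txt)    # argument list: words ARE expanded
echo $#                # -> 2

set -- "$(<file.txt)"  # quotes force a single word
echo $#                # -> 1
```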
The next release will include the similar, but admittedly clumsy, syntax $<(<file.xml).

## Saturday, January 3, 2009

### Welcome !

This blog was created to document, in a less formal way, the experience of creating xmlsh. My goal is that others may learn from my experiences, both successes and failures.

I started xmlsh over a year ago, in Dec 2007. At this point (Jan 2009) it is in an "Alpha 1" state. That is, it is functional and already used in production, but still subject to core changes before I'd recommend it for general production use.

Current work in progress is an experiment to create an "XProc" implementation which converts xproc pipelines to xmlsh scripts.