At long last (years) I knuckled down and made a simple xmlsh GUI. This will be in the next release.
Why ? I have been resisting this for a long time for many reasons, the least of which is that I don't really like writing GUIs. But I got tired of the limited editing capabilities of DOS command shells, and I enjoy the very simple BSH GUI ...
I think what stopped me for so long is the slippery slope. Once you start with a GUI, where do you stop ? Xmlsh is a command-line shell and an embedded API, not a WYSIWYG tool.
But alas ... a simple GUI is useful sometimes. I played around with various toolkits and settled on plain AWT and Swing. I found that Eclipse WindowBuilder supports simple AWT apps. Quite a nice tool. I tried SWT, but while it is much more full-featured, it is very much tied to Eclipse and required a dozen more jar files to run even a basic window. With AWT I was able to do a functional GUI in 200 lines of code. This will likely expand to 20,000 ... as feature creep sets in, but it's a start.
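Just to show the scale involved, here is a minimal Swing sketch of the kind of window I mean. It is purely illustrative, not the actual xmlshui source; the class name and the command hookup are made up, and a real version would hand the command line to the xmlsh interpreter and echo its output:

import java.awt.BorderLayout;
import javax.swing.*;

// Illustrative sketch only, not the actual xmlshui code: a window with a
// scrolling output area and a one-line command field.
public class TinyShellFrame {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("xmlsh");
            JTextArea output = new JTextArea(25, 80);
            output.setEditable(false);
            JTextField input = new JTextField();
            input.addActionListener(e -> {
                output.append("$ " + input.getText() + "\n");
                // a real version would run the command through the shell
                // and append its stdout/stderr here
                input.setText("");
            });
            frame.add(new JScrollPane(output), BorderLayout.CENTER);
            frame.add(input, BorderLayout.SOUTH);
            frame.pack();
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}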
Here is a screenshot of a sample session of "xmlshui":
Thursday, May 31, 2012
Wednesday, May 16, 2012
Streaming Streaming Streaming ...
Now that I work for MarkLogic I am dealing with more and more "Big" "Data" ... and as usual xmlsh + MarkLogic is a huge win. But as I start ramping up my use of large datasets, especially large numbers of small documents (millions, hundreds of millions ...), the old tricks don't work quite so well.
For example, recently I needed to upload 3 million XML files to a MarkLogic server from a relational DB.
My first pass used my favorite tools for this ... xsql + xsplit + ml:put
Since I like to debug stuff as I build it ... the simple way is to do this:
xsql ... > bigfile.xml
xsplit -o xml bigfile.xml
cd xml
ml:put -baseuri /xxx/ -m 100 -maxthreads 4 *.xml
On my big beefy server box this worked, although a bit slowly. OK, so now I wanted to transfer this data to an EC2 instance. It's "only" 10 GB of data, so I did this:
tar -cvzf xml.tar.gz xml
then transferred the now-compressed file to the EC2 machine.
Then on the EC2 machine I tried to replicate the above steps.
tar -xzf xml.tar.gz
I waited ... waited ... waited ... 3 DAYS and it wasn't done yet. Admittedly this was an EC2 medium instance, but it should have handled this. The problem seemed to be that the system was stuck at 90% system time.
My guess is it's the age-old problem of lots of files in one directory. Especially over EBS ... it just doesn't perform well. Adding files to a directory gets dramatically slower as the directory grows ... particularly nasty when the files are small, so the overhead of simply creating a file entry is much bigger than the file I/O itself.
So what to do ... I did two things. First, I restarted the EC2 instance as an m1.xlarge ... ($$$$ ka-ching).
Second, instead of pre-extracting all the XML to a directory up front, I used a new feature I recently added to ml:put ...
tar -xvzf xml.tar.gz | ml:put -baseuri /xxx/ -m 100 -maxthreads 4 -f - -delete
This still lets tar extract the files, but it also lists their names to stdout as it goes.
From there ml:put reads the list of files as they are extracted, batches them up, sends them to MarkLogic, then deletes them. The end result is that there are only about 500 or so files in the xml directory at any one time. This completed in about half an hour ... about 2000 docs/sec ... much better.
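The consumer side of this is conceptually just a batching loop over the file names arriving on stdin. Here is a rough Java sketch of that pattern; it is not the real ml:put code, and sendBatch() is a made-up placeholder for the actual upload (e.g. via the MarkLogic XCC API):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Pattern sketch only: read file names as they appear on stdin,
// process them in batches, then delete them so the directory stays small.
public class BatchingConsumer {
    static final int BATCH_SIZE = 100;

    public static void main(String[] args) throws Exception {
        List<Path> batch = new ArrayList<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            batch.add(Paths.get(line.trim()));
            if (batch.size() >= BATCH_SIZE)
                flush(batch);
        }
        flush(batch); // send any remainder
    }

    static void flush(List<Path> batch) throws Exception {
        if (batch.isEmpty())
            return;
        sendBatch(batch);             // placeholder for the real upload
        for (Path p : batch)
            Files.deleteIfExists(p);  // keep the working directory small
        batch.clear();
    }

    static void sendBatch(List<Path> batch) {
        // hypothetical: insert this batch of documents into the server
    }
}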
Of course this speedup was due to the larger instance as well as the technique ...
But this gets me thinking ... Why do I need to write to a temp directory at all ? It's still adding significant unnecessary overhead. I should be able to send a bunch of XML files to ml:put in a stream and use no temporary files. In fact I should be able to do a full pipeline with no intermediate files at all, like:
xquery 'for $i in 1 to 10000000 return document { ... }' | ml:put ...
or perhaps
xsql 'select * from table' | xsplit -stream | ml:put ....
The core problem here is the lack of a streaming interface for XDM. In order to send a bunch of XML files (or XDM values) through a stream (or to a file and back) they need to be packaged in something: typically wrapped in a root element, or maybe zipped or tar'd.
Zip is really lousy for this because its table of contents (the central directory) is at the end, so you can't stream-unpack a zip file. Tar is good because each file entry is contiguous, so you can unpack entries one at a time as they stream by (a quick sketch of that follows).
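To make that concrete, here is a rough Java sketch of reading tar entries as they arrive. It assumes Apache Commons Compress is on the classpath and is purely illustrative, not part of xmlsh:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

// Sketch: because each tar entry is a contiguous header plus body, entries
// can be handled one at a time as they arrive, with no seeking. A zip
// central directory at the end of the file rules that out.
public class StreamTarEntries {
    public static void main(String[] args) throws Exception {
        try (InputStream raw = new BufferedInputStream(new FileInputStream(args[0]));
             TarArchiveInputStream tar = new TarArchiveInputStream(new GZIPInputStream(raw))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile())
                    continue;
                // 'tar' is now positioned at this entry's bytes; a consumer
                // could parse or upload the document here before moving on.
                System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
            }
        }
    }
}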
But what about cases where I just want to dynamically create (or transform) XML and spit it out, like the first example ?
xquery 'for $i in 1 to 10000000 return document { ... }' | ml:put ...
If I wrap this in a single document it becomes hard to stream. ml:put *could* have xsplit built in ... but to keep to the tools approach I'd rather split the functionality. So say I put xsplit into the pipeline like the second example. How is xsplit to produce *multiple* documents on its output stream in a way that is readable ? We're back to a serialization format for XDM (http://xml.calldei.com/XDMSerialize)
This is a fundamental problem in traditional XML toolchains. There is simply no standard and efficient way to stream sequences. So what to do ?
I'm considering a 3-phase approach:
1) Implement an enhancement to xmlsh commands and pipes such that they can request, produce, and consume sequences through ports. So for example "xsplit -stream" could output the split documents all to stdout. But what would this look like ? How to implement it ?
2) For pipes, implement an optional XDM stream pipe. This would allow XDM values (including sequences of documents) to be streamed directly through the pipe with no serialization at all. This does mean that the pipe might get large if the documents are large ... I may have to limit the pipe to a small number of values. (There is a rough sketch of this idea after this list.)
3) Implement some kind of text serialization for sequences. Essentially back to http://xml.calldei.com/XDMSerialize ... although I am not sure I like my proposal so much in the face of this use case. The original proposal did not consider streaming as the major use case, although the use cases it was designed for should overlap with streaming. I'm not even sure I need to support most of XDM ... falling back to what XProc does (streams of documents) may be sufficient, although I abhor the restriction on purely theoretical grounds. But the fact is any text serialization of XDM will be lossy. It is just a matter of drawing the line somewhere, and maybe the most valuable place to draw the line is at documents.
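Here is the rough sketch of the phase 2 idea promised above: a bounded in-memory pipe of Saxon s9api XdmItem values between a producer command and a consumer command. It is a thought experiment, not the xmlsh implementation; the class and its methods are made up for illustration:

import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import net.sf.saxon.s9api.XdmItem;

// Thought-experiment sketch of an "XDM pipe": values pass between commands
// with no serialization. The bounded capacity limits how much sits in the
// pipe at once, since individual documents may be large.
public class XdmPipe {
    private final BlockingQueue<Optional<XdmItem>> queue;

    public XdmPipe(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: the upstream command writes items as it creates them,
    // blocking when the pipe is full.
    public void write(XdmItem item) throws InterruptedException {
        queue.put(Optional.of(item));
    }

    // Producer signals end-of-sequence.
    public void close() throws InterruptedException {
        queue.put(Optional.empty());
    }

    // Consumer side: blocks until the next item arrives; returns null at
    // end-of-sequence.
    public XdmItem read() throws InterruptedException {
        Optional<XdmItem> next = queue.take();
        return next.orElse(null);
    }
}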
Well, back to the drawing board. I'd like to implement this, but there are still so many open issues !!!
Comments welcome.