Wednesday, June 10, 2009

Limited background execution

I'm preparing test cases for the upcoming Balisage 2009 conference.

I'm re-encountering an age-old scripting problem/hack/idea.
Typical scripting languages (Unix shells, CMD, and the like) support "background execution", if at all, only in a very limited form.
Take the Unix shells, for example. You can run a program "in the background" (as a non-blocking process) with "&". On Windows (cmd.exe) you can use "start".
Xmlsh works the same way as the Unix shells: you use "&". That's all great when you're doing simple things; & + wait == job control. Great. But a common problem is that if you're running N background processes, and N is large or unknown, it's hard to control. I've written shell scripts that can handle "N" (say 10) background processes and block if you try to start N+1, but it's very clumsy scripting, not something I recommend, and a very special use case. Yet in today's multi-core world, being able to run N background processes (or threads) is very useful, and *extremely* useful if you can arrange for N not to become huge.
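To make the "clumsy" version concrete, here is roughly what that kind of script looks like in a plain shell: start jobs in batches of N and wait for each whole batch. The sleep/echo pair is just a stand-in for real work. It does cap concurrency, but each batch blocks on its slowest member, so cores sit idle near batch boundaries:

```shell
# Batch-of-N throttling: start N jobs, wait for the whole batch, repeat.
# The sleep/echo pair stands in for real work (e.g. an xslt invocation).
N=4
out=$(
    count=0
    for i in 1 2 3 4 5 6 7 8 9 10; do
        ( sleep 0.1; echo "task $i" ) &
        count=$((count + 1))
        if [ "$count" -ge "$N" ]; then
            wait        # block until the whole batch has finished
            count=0
        fi
    done
    wait                # catch the final partial batch
)
echo "$out"
```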

Take a simple XML example. Suppose I have a directory of files and I want to run xslt on them all.

xmlsh script:

for i in *.xml ; do
    xslt -s script.xslt < $i > output_dir/$i
done

But suppose I'm running on a 4- or 8-core CPU and this is a CPU-bound process. I'd like to run these with some kind of parallelism.
This will work ...

for i in *.xml ; do
    xslt -s script.xslt < $i > output_dir/$i &
done
wait

But if the number of files becomes large (say > 10), this thrashes the system so badly that not only does it slow down, it also risks eating up system memory to the point where you can't guarantee completion.

If you were writing an enterprise-type product you could use a Thread Pool or Worker Thread or Worker Process model: use a queuing system, send requests into the queue, and have, say, 10 worker threads/processes doing the work from the queue. This may actually be implementable in the ongoing experimental http server module in xmlsh, but suppose I wanted something less elegant yet nearly as useful.

Imagine a syntax that says "Start a background thread but ONLY if there are < N outstanding background threads".

I can imagine a very simple syntax, maybe like this:

for i in *.xml ; do
    xslt -s script.xslt < $i > output_dir/$i &[10]
done
wait

This would mean "start up to 10 concurrent xslt threads but no more"

It may not be quite as efficient as a worker pool but the syntax and implementation could be very simple. My estimation is that the end result would be nearly as efficient.
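Nothing like &[10] exists in today's shells, but the effect can be approximated outside the shell proper with xargs: its -P flag (a GNU/BSD extension, not POSIX) caps the number of concurrent workers. A minimal sketch, with echo standing in for the xslt call:

```shell
# At most 4 workers run at once; each input line becomes one job.
# The echo is a stand-in for: xslt -s script.xslt < $i > output_dir/$i
out=$(printf '%s\n' one two three four five |
      xargs -n 1 -P 4 sh -c 'echo "transformed $0"')
echo "$out"
```

Note that output order is not guaranteed with -P, since the workers run concurrently.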

4 comments:

  1. A couple things come to my mind when I read this.

    1. Make's -j option. For our purposes this lets us tell an application how many of the tasks we've given it can run in parallel, along with limitations on what can and can't be parallelized.

    2. The thread pool concept you mentioned above, multiplied by readability concerns.

    Why limit it to single lines? Why not define a list, kind of like a makefile, in a complex variable type that represents a job queue?

    queue=??? #declare new job queue using magic.

    for i in *.xml; do
    echo "Found $i, adding to queue."
    job append to queue (
    echo "Now processing $i"
    xslt -s script.xslt < $i > outdir/$i
    gzip -9 outdir/$i
    ) # that looks sort of familiar, missing some curlies and a 'new Thread' or somesuch.
    done

    job process -j8 $queue

    This way you can block (wait) by simply not backgrounding the job processor... which is starting to look like it COULD be prototyped in terms of make(1) and a temporary makefile somewhere. If you were so perverse.
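For what it's worth, that perverse make(1) prototype might look roughly like this: generate a throwaway makefile with one target per input file, then let make -j supply the bounded parallelism. A sketch only, with tr standing in for the xslt step so it is self-contained:

```shell
# Hypothetical make(1)-based job queue: one target per input file;
# make -j8 runs at most 8 recipes concurrently.
# tr stands in for the real xslt step so the sketch is runnable.
tmp=$(mktemp -d) && cd "$tmp"
printf 'hello' > a.xml
printf 'world' > b.xml
mkdir outdir
{
    printf 'all: outdir/a.xml outdir/b.xml\n'
    printf 'outdir/%%.xml: %%.xml\n'
    printf '\ttr "[:lower:]" "[:upper:]" < $< > $@\n'
} > queue.mk
make -f queue.mk -j8
```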

    And it seems to allow for a clearer idea of what's going on when reading the code too, or so I might think...

    This is all just rambling, hopefully a fraction of it is useful :)

    ReplyDelete
  2. Interesting ideas. Yes, I was thinking of make -j.
    One quick note: "line" is not the same thing as "statement". The proposed syntax would work perfectly well on any block of code (yes, even nested ones) as well as scripts, like this:

    for i in *.xml ; do
        ( stmt1 ;
          stmt2 ;
          stmt3 ;
          stmt4 ) &[10]
    done

    The single "line" is just a degenerate case.
    And this doesn't require the invention of a "queue", although that idea has some merit.

    My initial goal is to avoid complicated syntax and new commands to manage queues, jobs, etc.
    But it definitely needs more thought. I haven't yet looked at what the new parallel languages are doing (I think Microsoft has something in the new C++ or C# compilers that can mark blocks of code for parallelism).

    Functional languages like XQuery, XSLT and XProc can in theory be parallelized, but so far it seems to be a lot harder to do than one imagines (10 years on, it hasn't actually been implemented yet). So I'm thinking: what can be done *simply* that has a lot of the benefits of a more sophisticated approach but isn't a 5-year research project to implement?

    The http server stuff I'm working on could be used for this as well. However, a problem with thread queues, server-oriented requests, etc. is that you have to start breaking the problem out in terms of messages, subtasks and encapsulated persistent state. It ends up being quite painful to get it all specified and working right. This is why the Unix "fork" call was (and still is) so elegant. You don't have to gather up all the state information to send to a new process; it just inherits a copy of the current one.

    ReplyDelete
  3. Ah! I thought of an even less intrusive change which would do the core of what I was thinking about without any new syntax or new commands.
    Simply add a new flag to "wait": besides "wait for all jobs to exit" (wait) and "wait for job X to exit" (wait X), add one that means "wait until there are at most N jobs", something like
    wait -atmost 10

    Then you can start things in the background (using &) as much as you want, and to limit how many are outstanding, toss in a "wait -atmost N" to keep things from thrashing.

    This could actually be implemented as a user function right now, but having it built into wait may be cleaner.
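A sketch of that user-function version, assuming bash (jobs -pr lists the pids of still-running background jobs; the -atmost flag on the wait builtin itself is the proposal, not existing syntax):

```shell
# "wait -atmost N" as a user function: poll the job table until no more
# than N background jobs remain, then return. The sleep/echo pair is a
# stand-in for real work.
wait_atmost() {
    while [ "$(jobs -pr | wc -l)" -gt "$1" ]; do
        sleep 0.1
    done
}
out=$(
    for i in 1 2 3 4 5 6; do
        ( sleep 0.2; echo "job $i" ) &
        wait_atmost 3       # keep at most 3 jobs in flight
    done
    wait
)
echo "$out"
```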

    ReplyDelete
  4. wow, that only took 10 minutes to implement.
    It will be in version 0.0.2.3

    ReplyDelete

Due to comment spam, moderation is turned on. I will approve all non-spam comments.