While hunting down broken images in the pdf import, I came across a very weird behaviour, that was seemingly caused by plain C stream IO: every once in a while, streams that were explicitely opnened in binary mode converted 0×0A into 0×0D. One-on-one substitution, not the usual 0×0A to 0×0A 0×0D lineend conversion, notably. I wasted about a day trying to track that down, suspecting everything from crazy filesystem flags, to symbol hijacking during link time (i.e. somebody sneaking in own fwrite/putc/whatever into the binary). Laboriously ruled out one after the other.

Finally, I did what I should have done in the first place: I used a different hex editor. And lo and behold, the trusty midnight commander showed all the files with correct 0×0A. Caught in the act and proven guilty: emacs’ hexl-mode.

Well, kind of. It turned out to be a rather unlucky combination of things going wrong: emacs with its buffer encoding scheme, which tries to be smart and detect the actual encoding - if there’s a 0×0A in the file before a 0×0D, emacs assumes raw-text-unix and leaves line ends alone. If there’s a 0×0D in the file before any 0×0A, emancs assumes raw-text-mac, and converts all assumed lineends to 0×0D. And since hexl-mode seems to work on top of an existing buffer (and never loads the file directly from disk), this resulted in the described mess.

So, to save yourself some trouble: don’t trust your tools. And make sure to have files you want to work on with hexl-mode match something in file-coding-system-alist’s content, that enforces binary reads (elc, png, jpeg, ar, gz etc. for my setup).

Being honoured to mentor Shane Matthews on his GSoC 2007 project Impress OpenGL rendered transitions, I today received the mentor gift - thank you Google for the tshirt!

On related news, the resulting CWS got recently merged to HEAD, so those of the adventurous kind can now build this as an optional feature: just issue “build ENABLE_OPENGL=TRUE” in the slideshow module to be blessed with an ogltrans shared lib. And thanks to Rene, after CWS configure22 is integrated, one can even give “–enable-opengl” on the configure line. After that, register the ogltrans lib into OOo via unopkg, and use this presentation file to see what others are missing.

This could even be shipped as a separate extension after 2.4 is out…

After an impromptu dead parrot sketch at the OOoCon (”the beamer works” - “no, it’s dead” - “bah, it’s not dead, it just needs a reboot”… (you might look forward to the still-missing video)), I had a rather rough ride with the proverbial last 10 percent of the browser functionality - I’m still not 100% happy with the representation in OOo’s Calc, and the xcs schema parser leaves something to desire. But surely the thing is now ready for config hackers to play with it - grab your copy here:

09e22fc6d8d9f9bfd3dc7cf838ce6590 ooconfig.oxt

(for those of you who don’t know what I’m talking about: OoConfig is a Python extension, providing a tool quite similar in spirit to Mozilla’s about:config page - a way of tweaking the complete (even hidden) set of configuration items. See also the wiki for further information)

Just found this. Pretty cool way of packaging an application to be completely relocatable (though without desktop integration, barring further hacks). Something for OOo Portable?

Via Russel Coker of Debian fame, the Boston Globe reports that 55 percent of the world’s installed photovoltaic power is in: Germany. Yay! With my solar roof contributing, if even only the tiniest amount…

Seems that Intel has set their TBB lib free - which solves a few of the basic issues outlined here (I’ve commented a bit in the context of c++).

Okay, what does all of this
really buy us, right now?
Lets assume we want to speed up Calc’s way of parsing and
calculating cell values. Given the following dependency tree (cell 1
refers to the result of cell4, cell 6 to the result of cell2, and so
forth):

    parse_cell_4                                parse_cell_5
         |                                           |
    parse_cell_1        parse_cell_2            parse_cell_3
         |                   |                       |
         -------------- parse_cell_6 -----------------

The partial ordering of cell evaluation equivalent to this tree is as follows:

parse_cell_4<parse_cell_1
parse_cell_5<parse_cell_3
parse_cell_1<parse_cell_6
parse_cell_2<parse_cell_6
parse_cell_3<parse_cell_6

Wanting to employ what we’ve talked about earlier, to parallelize this
calculation, one would need a static expression (i.e. one fixed at
compile time):

as_func(
    as_func(
        parse,
        as_func(
            parse,
            4),
        1),
    as_func(
        parse,
        2),
    as_func(
        parse,
        as_func(
            parse,
            5),
        3))

(with as_func denoting that the argument should be
evaluated lazily, and parse being a Calc cell parsing unary function,
expecting the cell number as its sole parameter).

Now, having the formula dependencies fixed in the code isn’t of much
value for a spreadsheet, thus we need to handle this a bit more
dynamically. Futures and Actions in
principle provide the functionality that we need here, only that
there’s one slight catch: the pool of threads processing the Actions or
Futures might actually be too small to have all cells resolved, that
in turn are referenced from other ones (with circular cell references
being the extreme example. But see below). This is because forcing a
Future or Action to evaluate will block the thread requesting that
value, which will eventually lead to starvation, unless there’s at
least one more thread active to process cell values other Actions are
waiting for.

Of course, one could remedy this by using N threads when
dealing with N cells. But that would get out of hand
pretty quickly. Alternatively, the cell parsing function can be split
into two parts: the first generates a parse tree, thus extracting the
referenced cells (depending on the heap allocation overhead, one could
also throw away the parser outcome, except for the cell
references. But OTOH, with a decent thread-local allocator, the full
parsing could happen concurrently. YMMV). Given the list of references
for each cell, one again gets the partial ordering over the value
calculation:

vector< vector >  intermediate;
parallel_transform( cells.begin(), cell.end(),
                    intermediate.begin(),
                    &parse_step_1 );

For each cell, this step yields a vector of preconditions (other
cells, that need to be calculated before). Pushing the actual cell
value calculation functions into a job queue, and handing it the
dependency graph (represented by the individual cell’s references)
generated above, we arrive at a fully dynamic version of the
concurrent cell parsing example:

int              i=0;
job_queue        queue;
vector results(intermediate.size(),0.0);
transform( intermediate.begin(),
           intermediate.end(),
           results.begin(),
           bind( &job_queue::add_job,
                 ref(queue),
                 bind( &parse_step_2,
                       ref(results),
                       ref(cells),
                       var(i)++ ),
                _1 ));
queue.run();

This looks weird, but peeking at the prototypes of the involved
functions might help clear things up:

/** adds a job functor

    @param functor
    Functor to call when job is to be executed

    @param prerequisites
    Vector of indices into the job queue, that must be processed
    strictly before this job. Permitted to use not-yet-existing
    indices.
 */
template job_queue::add_job( Func               functor,
                                            vector const& prerequisites );

and

/** calculates cell value

    @param io_results
    In/out result vector. Receives resulting cell value, and is used
    to read referenced cell's input values.

    @param content
    Cell content

    @param cell_index
    Index of cell to process
 */
parse_step_2( vector& io_results,
              string const&   content,
              int             cell_num );

This job queue can then decide, either globally or via policies, what
to do in various situations:

    Circular dependencies: either refuse working on such a job, or
    use a default value for an arbitrary member of the cycle (to
    break it)

    Whether to execute jobs in parallel or not. Depending on the number of cores
    available (both physically and load-wise), the queue could decide
    to stay single-threaded, if the number of jobs is low, or
    multi-threaded for a larger number of jobs. Note that
    this decision might be highly influenced by the amount of work a
    single job involves, and therefore external hints to the queue
    might be necessary. Kudos to mhu for the hint, that it’s wise to
    parallelize ten jobs that take one hour each, but not so for
    twenty jobs that only take a few microseconds to complete.

At any rate, fine-tuning this to various hardware, operating systems
and deployment settings is much easier than for manual thread
creations. Plus, given a few (falsifiable) attributes of the functions
called, it’s also much safer.

I’ve already talked a bit
about status quo of threading in OOo, and listed some largely
useful stuff that others have been doing to remedy the problems
shared-state multi-threading poses.

Why not going further? If there’s a way to know that a function is pure, or that an
object is thread-safe, then it would be nice to have
automatic parallelization employed. The underlying issue that
needs to be controlled here are races -
prohibiting unserialized access to shared state. So, either a function
is pure, i.e. does not depend on shared state at all, or the called
subsystem takes care of itself against concurrent modification (this
is most easily achieved by the environment concept of the UNO
threading
framework: the UNO runtime implicitely serializes
external access to thread-unsafe appartements).

Although C++ provides no portable ways to express those concepts on a
language level, for pure functions, there’s a way of using a limited
subset of lambda
expressions
, that inhibit access to shared state on the syntax
level. And it’s perfectly possible to mark objects (or even subsets of
a class’ methods) to be thread-safe. One straight-forward way to do
this are specializations of UNO interface references, i.e. ones that
denote thread-safe
components
.

Given all of this, we can form statements that contain:

    unsafe method calls

    thread-safe method calls

    pure function calls

    impure function calls

So, in fact, a runtime engine could reason about which subexpressions
can be executed concurrently, and which must be serialized. If you
treat method calls as what they are, e.g. implicitely carrying a
this pointer argument, a possible data flow graph might
look like this:

new object1                      new object2
 |                                |
 +->object1::methodA              +->object2::methodA
            |                               |
            +------------------------------>+->object2::methodB(object1)
                                                         |
                                                         v
                                                 object1::methodC

new object3
 |
 +->object3::methodA

That is, the this pointer is carried along as a target
for modifications, and as soon as two methods have access to the same
object, they need to be called sequentially. This does not apply for
UNO interface calls or objects that are tagged as thread-safe, of course. To be specific, a forest of data flow trees can be generated, which
defines a partial ordering over the subexpressions. If neither
exp1<exp2 nor exp1>exp2 can be deduced
from this ordering, those two subexpressions can be executed in
parallel. Really basic stuff, that compiler optimizers do, as well -
only that plain C/C++ doesn’t provide that many clues to safely
parallelize. From the example above, it is obvious that
object3::methodA can be executed concurrently to all other
methods, that object1::methodC must be execute strictly
after object2::methodA, and that
object1::methodA and object2::methodA can
also be executed concurrently.

Okay, this is largely crack-smoking. But there is something to be made
of it. Stay tuned.

This starts where Stephan’s OOoCon
talk

left off - by taking his problem statement and maybe a gist of the
proposed concepts into OOo’s core (which happens to be C++ instead of
Haskell. Which some folks might bemoan, but still, ~90% of our code
consists of that language).

To recap, the free lunch
is over
for us application software developers, i.e. either OOo
gets seriously multi-threaded soon, or we stop adding features and
only optimize for speed, or OOo will get significantly slower within
the next few years. But as you, the gurus and yours
faithfully have noticed by now, becoming concurrent with osl::MutexGuard,
osl::Thread
& friends is tedious, error-prone, and neither race nor dead-lock
free for a project of OOo’s size. With the expressive power of a
general programming language, it is even theoretically impossible to
verify against these properties (for a sufficiently large project. OOo
probably satisfies this constraint).

So, Stephan already showed what’s possible in other
languages
. To boot, the trick is to raise the level of abstraction
above digging-in-the-dirt, hand-made, explicit, and ugly locking or thread
fiddling. Whereever possible, parallelism should be detected either
implicitely, or at least denoted declaratively (instead of manually
telling the machine what to do). John Ousterhout has a nice writeup about
that. It’s basically another instance of the lack of transparency and
ease of use that is so ingrained with day-to-day tasks in C++. For
stuff like object lifetime management, serialization and so on, boost provides transparent
solutions. For large-scale concurrency management, the UNO
threading framework
does. For local, statement-level
parallelization, there ain’t much - or is there?

In fact, there actually are several attempts to extend C++ in that
direction. To name a few:

    OpenMP, a C/C++ language
    extension, which provides things like parallel loops, work
    sharing or master and synchronization constructs. Drawbacks:
    controlled via pragmas, expressiveness limited, C origins
    (e.g. there’s no proper way to propagate exceptions out of OpenMP threads).

    Boost.Act, a template
    framework that, together with Boost.Threads, provides
    functionality on the same conceptual level as OpenMP (it
    actually builds upon OpenMP internally). Furthermore,
    Boost.Act has so-called
    active
    objects
    , which provide their results lazily - one can
    construct a whole tree of calculations using active objects,
    which internally enqueue those, run them asynchronously, and
    force a wait on the result only if a non-active object
    actually requests it.

    Concur,
    featuring active objects, messages,
    parallel STL algorithms and futures (with futures being values
    that are possibly unevaluated at the point of instantiation,
    and get their actual result only eventually - forcing futures
    evaluation might block the requester until the value becomes
    available). Probably served as an inspiration for
    Boost.Act.

    TBB, or Intel
    ThreadBuildingBlocks, another (closed-source) C++ template
    framework with similar scope & features as Concur and
    Boost.Act.

    Two proposals for extending the C++ standard: Multithreading
    API for C++0X
    (and related: Transporting
    Values and Exceptions between Threads
    )

    KDE’s
    ThreadWeaver
    , a threaded job queue that
    offers some nice concepts, such as explicit job dependencies.

From a coarse inspection, Boost.Act seems appealing (TBB really
isn’t an option for OOo, no?) - boost is high-quality anyway, and I
think that the Action
concept is a richer one than Futures.

Basically, the advanced approaches listed above employ declarative
semantics, i.e. one no longer needs to create threads by hand, assign
a thread function, and manually serialize access to shared
state. Instead, one simply states that a loop body shall be executed
in parallel:

parallel_transform( src.begin(), src.end(), dest.begin(), bind(_1 * 42) );

This multiplies each element in src with 42, and stores
the result in dest. How many threads are created, and
where they are taken from is completely up to the implementation - a
typical threading runtime on a machine with N cores will
hold a static pool of N threads, each of them receiving a
chunk of the parallel_transform workload from above. In
the old way of doing multi-threading, this behaviour would potentially
be spread over the whole code base (making it a cross-cutting
concern, just like explicit method guarding is today - with thousands
of osl::MutexGuard(m_aMutex) invocations all over the
place).

One of the crucial points of doing stuff like the above in a safe way
is to avoid hidden access to shared state. Access to the container can
be serialized by parallel_transform, but if the called
functor modifies shared objects concurrently, all bets are off. Thus,
the listed frameworks usually have some safe-guards in place, most of
them using a way of explicitely annotating free-standing functions and
classes to be thread-safe (note that this tagging of thread-safeness
can easily be achieved by means of the UNO
threading framework
as well - Kay already has different
uno::Reference types on his list, to distinguish between
thread-safe and thread-unsafe components during compile-time). Another
important concept noteworthy here are pure functions,
i.e. functions that, given the same input parameters, always return
the same result, no matter where or when called (related, but not
exactly the same are referentially
transparent
functions, which could also be exploited by an
optimizing parallel execution runtime). It’s always safe to call pure
functions on disjunct parameters concurrently with no extra
synchronization effort. A safe (because syntactically limited) way of
defining pure functions is via boost’s bind or
lambda library. There, expressions that modify shareable state could
simply be banned from the available syntax (they aren’t - but it’s
possible to do that).

Need an example? Good, take the standard one, that so naturally asks
for parallelization, that it’s usually been sped up via SIMD in the past (you
might have noticed that those parallel_something
algorithms principally apply the same instructions (the functor to
run) in parallel to all container elements - that’s SIMD), namely operating on image
pixel:

// color-convert an image from RGB to CMYK color space
parallel_transform(src_img.begin(),
                   src_img.end(),
                   dst_img.begin(),
                   bind(&rgb2cmyk,_1,cref(conversion_parms));

This is the archetype of inherent parallelism, having a huge amount of data (image pixel), that all need to have the same operation applied to. Disregarding memory bandwidth for the moment, this snippet scales linearly with the number of available cores.

Note the conversion_parms parameter to the bound
function, which can pose problems if rgb2cmyk silently
const_casts it to something mutable. This is the dark
corner of C/C++’s non-enforcable constness, and therefore, a minimal
amount of developer discipline is necessary: if objects or functions
employ magic behind their public interfaces (on-demand creation of
data, buffering, caching, copy-on-write), this still needs to be made
thread-safe explicitely. Fortunately, for each of the aforementioned
items, there’s a thread-safe helper available in OOo (use rtl::Static
for late-initialization of static data, boost::shared_ptr
for lazy initialization of class members, caching and buffering, and
cow_wrapper
for copy-on-write).

To be continued.

After much prep work (kudos to you, Armin!), things start to look like it’s working:



(if you inspect this carefully, you’ll see the difference between the preview image on the left side, which is still rendered the old way, and Draw’s edit pane content to the right, which now gets displayed via cairocanvas).

Gory details, live demo, and plans how to move on at this year’s OOoCon. If you can’t wait: OOoWiki has a brief hacker’s guide.