Sunday, February 9, 2014

Rewrites of the STM core model -- again

Hi all,

A quick note about the Software Transactional Memory (STM) front.

Since the previous post, we believe we have made a lot of progress by discovering an alternative core model for software transactions. Why do I say "believe"? Because it means, once again, that we have to rewrite from scratch the C library that handles STM. This is currently work in progress. Once it is done, we should be able to adapt the existing pypy-stm to run on top of it without much rewriting effort; in fact, it should simplify the difficult issues we ran into with the JIT. So while this is basically yet another restart, similar to last June's, the difference is that the work we have already put into the PyPy part (as opposed to the C library) remains.

You can read about the basic ideas of this new C library here. It is still STM-only, not HTM, but because it doesn't constantly move objects around in memory, it would be easier to adapt it into an HTM version. There are even potential ideas about a hybrid TM, like using HTM but only to speed up the commits. It is based on a Linux-only system call, remap_file_pages() (poll: who has heard of it before? :-). As before, the work is done by Remi Meier and myself.
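
For readers who answer "no" to the poll, here is a minimal sketch, in Python rather than C, of the property that makes such a call interesting: one page of a file can be visible at more than one virtual address at the same time. It deliberately uses two plain mmap() calls instead of remap_file_pages() (which the Python standard library does not expose), so it only illustrates the idea and is not taken from the new C library; the function name is made up for this example.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def two_views_of_one_page():
    """Map the first page of a temporary file twice and show that a write
    through one mapping is visible through the other (hypothetical helper)."""
    with tempfile.TemporaryFile() as f:
        f.truncate(2 * PAGE)  # give the file two pages of zeros
        # On Unix, mmap.mmap() defaults to a shared mapping (MAP_SHARED),
        # so both views are backed by the same page of the file.
        view_a = mmap.mmap(f.fileno(), PAGE, offset=0)
        view_b = mmap.mmap(f.fileno(), PAGE, offset=0)
        try:
            view_a[0:5] = b"hello"      # write through the first mapping
            return bytes(view_b[0:5])   # read through the second mapping
        finally:
            view_a.close()
            view_b.close()


if __name__ == "__main__":
    print(two_views_of_one_page())
```

remap_file_pages() goes one step further than this sketch: inside a single existing shared mapping, it changes which page of the file appears at which address, without creating a new mapping for every page.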

Currently, the C library is incomplete, but early experiments show good results in running duhton, the interpreter for a minimal language created for the purpose of testing STM. "Good results" means we brought the slow-down from 60-80% (previous version) down to around 15% (current version). This number measures the slow-down from the non-STM-enabled to the STM-enabled version, on one CPU core; of course, the idea is that the STM version scales up when using more than one core.

This means that we are looking forward to a result that is much better than originally predicted. pypy-stm has a chance of running at a single-thread speed that is only "n%" slower than the regular pypy-jit, for a value of "n" that is optimistically 15, but more likely somewhere between 25 and 50. This is seriously better than the original estimate, which was "between 2x and 5x". It would mean that using pypy-stm is quite worthwhile even with just two cores.
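
To make the "two cores" remark concrete, here is a back-of-the-envelope sketch. It rests on two assumptions that are not measurements: "n% slower" is read as "takes (1 + n/100) times as long on one core", and the STM version is assumed to scale perfectly linearly with the number of cores, which is the best case. The function name is made up for this example.

```python
def relative_throughput(overhead_percent, cores):
    """Throughput of the STM build relative to the non-STM build on one core.

    Assumes (optimistically) perfect linear scaling across cores, and reads
    "n% slower" as a single-core run time of (1 + n/100) times the baseline.
    """
    single_core_speed = 1.0 / (1.0 + overhead_percent / 100.0)
    return cores * single_core_speed


if __name__ == "__main__":
    scenarios = [
        ("optimistic, 15% slower", 15),
        ("more likely, 25% slower", 25),
        ("more likely, 50% slower", 50),
        ("original estimate, 2x slower", 100),
        ("original estimate, 5x slower", 400),
    ]
    for label, overhead in scenarios:
        speed = relative_throughput(overhead, cores=2)
        print(f"{label}: {speed:.2f}x the non-STM speed on two cores")
```

Under these assumptions, two cores give roughly 1.74x at 15% and 1.33x at 50%, whereas the original "2x" estimate would only break even on two cores and "5x" would need more than five.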

More updates later...



Anonymous said...

Did you consider existing STM libraries in your implementation? It might be worthwhile to take a look at stasis, which has a pretty complete set of features.

Armin Rigo said...

Stasis is not really applicable here: it's a Transactional Storage system, which, despite the attempt of this paper to generalize it, is not going to apply successfully in the context of PyPy.

Armin Rigo said...

More comments on Hacker News.

Dima Tisnek said...

poll response: I've heard of remap_file_pages! :)

I was wondering how to use this call when I learnt of it, but couldn't figure anything out except possibly database applications (similar) and sort algorithms (too limited). I think this call may be used when manipulating framebuffer too, there was something about having multiple mappings [to hardware] some readonly, some not.

I would like to [possibly] disagree with your statement in c7 README "Most probably, this comes with no overhead once the change is done..."

TLB cache is a limited resource and may easily be contended on large systems. Regular mmap could [in theory] use huge TLB pages, remapped individual pages cannot.

In addition there is a small penalty during first access to the remapped page, though you may consider it amortized depending on remap/reuse ratio.

Granted it's still small stuff.

Reserving one register is a cool trick, and one I find quite acceptable. It too has a small penalty, but the benefits surely outweigh it!

Armin Rigo said...

@Dima: Thanks for the feedback! Note that "%gs" is a special register that is usually not used: there is no direct way to read/write its actual value. It needs to be done with a syscall, at least before very recent CPUs. It can only be used in addressing instructions as an additional offset.

Arne Babenhauserheide said...

just 15% slower sounds wonderful!