Tuesday, April 15, 2014

NumPy on PyPy - Status Update

Work on NumPy on PyPy continued in March, though at a lighter pace than in the previous few months. Progress was made on both the compatibility and speed fronts. Several behavioral issues reported to the bug tracker were resolved. The most significant of these was probably the correction of casting to built-in Python types. Previously, int/long conversions of numpy scalars such as inf/nan/1e100 would return bogus results. Now they either raise an exception or return a correct value, as appropriate.
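
For illustration, here is the behavior these conversions should now match (a small sketch of standard CPython+NumPy semantics, not code taken from the test suite):

import numpy as np

print(int(np.float64(1e100)))      # a correct, very large Python long

try:
    int(np.float64(np.inf))
except OverflowError:
    print("inf does not convert")  # raised, as with CPython+NumPy

try:
    int(np.float64(np.nan))
except ValueError:
    print("nan does not convert")  # raised, as with CPython+NumPy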

On the speed front, enhancements to the PyPy JIT were made to support virtualizing the raw_store/raw_load memory operations used in numpy arrays. Further work remains in virtualizing alloc_raw_storage as well, which will allow scalars to have backing storage while still being virtualized in loops when possible.

Aside from continued work on compatibility/speed of existing code, we also hope to begin implementing the C-level components of other numpy modules such as mtrand, nditer, linalg, and so on. Several approaches could be taken to get C-level code in these modules working, ranging from reimplementing in RPython to interfacing with existing code with CFFI, if possible. The appropriate approach depends on many factors and will probably vary from module to module.

To try out PyPy + NumPy, grab a nightly PyPy and install our NumPy fork. Feel free to report comments/issues to IRC, our mailing list, or bug tracker. Thanks to the contributors to the NumPy on PyPy proposal for supporting this work.

Wednesday, April 9, 2014

STM results and Second Call for Donations

Hi all,

We now have a preliminary version of PyPy-STM with the JIT, available from the new STM documentation page. This PyPy-STM is still not quite useful: it falls short of the performance of a regular PyPy by a small margin on most benchmarks, but it's definitely getting there :-) The overheads with the JIT are still a bit too high. (I had been tracking an obscure bug for days. It turned out to be a simple buffer overflow. But if anybody has a clue about why a hardware watchpoint in gdb, set on one of the garbled memory locations, fails to trigger while the memory ends up being modified anyway... and, it turns out, by just a regular pointer write... ideas welcome.)

But I digress :-) The main point of this post is to announce the second Call for Donations about STM. We achieved most of the goals laid out in the first call. We even largely overachieved them in terms of raw performance, even if there are still many cases that are unreasonably slow for now. So, with the research phase successful, we are launching a second proposal about the development part of the project:

  1. Polish PyPy-STM to get a consistently reasonable speed, 25%-40% slower than a regular JITted PyPy when running single-threaded code. Of course it is supposed to scale nicely as long as there are no user-visible conflicts.

  2. Focus on developing the Python-facing interface: both internal things (e.g. do dictionaries need to be more TM-friendly in general?) as well as directly visible things (e.g. some profiler-like interface to explore common conflicts in a program).

  3. Regular multithreaded code should benefit out of the box, but the final goal is to explore and tweak some existing non-multithreaded frameworks and improve their TM-friendliness. So existing programs using Twisted or Stackless, for example, should run on multiple cores without any major change.

See the full call for more details! I'd like to thank Remi Meier for getting involved. And a big thank you to everybody who contributed money on the first call. It took more time than anticipated, but the result is there, in good if rough shape. Now it needs a lot of polishing :-)

Armin

Wednesday, March 26, 2014

pygame_cffi: pygame on PyPy

The Raspberry Pi aims to be a low-cost educational tool that anyone can use to learn about electronics and programming. Python and pygame are included in the Pi's programming toolkit. And since last year, thanks in part to sponsorship from the Raspberry Pi Foundation, PyPy also works on the Pi (read more here).

With PyPy working on the Pi, game logic written in Python stands to gain an awesome performance boost. However, the original pygame is a Python C extension. This means it performs poorly on PyPy and negates any speedup in the Python parts of the game code.

One solution to making pygame games run faster on PyPy, and eventually on the Raspberry Pi, comes in the form of pygame_cffi. pygame_cffi uses CFFI to wrap the underlying SDL library instead of a C extension. A few months ago, the Raspberry Pi Foundation sponsored a Cape Town Python User Group hackathon to build a proof-of-concept pygame using CFFI. This hackathon was a success and it produced an early working version of pygame_cffi.
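
To give a flavor of the approach, here is a minimal sketch of wrapping an SDL call with CFFI. This is not pygame_cffi's actual code; the declarations, library name, and constant are assumptions for illustration only:

from cffi import FFI

ffi = FFI()
ffi.cdef("""
    int SDL_Init(uint32_t flags);
    void SDL_Quit(void);
""")
sdl = ffi.dlopen("SDL")      # assumes a libSDL that the loader can find

SDL_INIT_VIDEO = 0x00000020  # value copied from SDL.h

if sdl.SDL_Init(SDL_INIT_VIDEO) != 0:
    raise RuntimeError("SDL_Init failed")
sdl.SDL_Quit()

Because the wrapper is plain Python calling C through CFFI, PyPy's JIT can trace through it, which is what makes this approach so much faster on PyPy than a C extension.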

So for the last 5 weeks Raspberry Pi has been funding work on pygame_cffi. The goal was a complete implementation of the core modules. We also wanted benchmarks to illuminate performance differences between pygame_cffi on PyPy and pygame on CPython. We are happy to report that those goals were met. So without further ado, here's a rundown of what works.

Current functionality

[Screenshots: Invention and Mutable Mamba running on pygame_cffi]

With the implemented functionality in place, we could get 10+ of the pygame examples to work, as well as a number of PyWeek games. At the time of writing, if a game doesn't work it is most likely due to an unimplemented transform or draw function. That will be remedied soon.

Performance

In terms of performance, pygame_cffi on PyPy is showing a lot of promise. It beats pygame on CPython by a significant margin in our events processing and collision detection benchmarks, while blit and fill benchmarks perform similarly. The pygame examples we checked also perform better.

However, there is still work to be done to identify and eliminate bottlenecks. On the Raspberry Pi performance is markedly worse compared to pygame (barring collision detection). The PyWeek games we tested also performed slightly worse. Fortunately there is room for improvement in various places.

[Benchmark charts: Invention & Mutable Mamba (x86), and standard pygame examples (Raspberry Pi)]

Here's a summary of some of the benchmarks. Relative speed refers to the frame rate obtained in pygame_cffi on PyPy relative to pygame on CPython.

Benchmark                                      Relative speed (PyPy speedup)
Events (x86)                                   1.41
Events (Pi)                                    0.58
N² collision detection on 100 sprites (x86)    4.14
N² collision detection on 100 sprites (Pi)     1.01
Blit 100 surfaces (x86)                        1.06
Blit 100 surfaces (Pi)                         0.60
Invention (x86)                                0.95
Mutable Mamba (x86)                            0.72
stars example (x86)                            1.95
stars example (Pi)                             0.84

OpenGL

Some not-so-great news is that PyOpenGL performs poorly on PyPy since PyOpenGL uses ctypes. This translates into a nasty reduction in frame rate for games that use OpenGL surfaces. It might be worthwhile creating a CFFI-powered version of PyOpenGL as well.

Where to now?

Work on pygame_cffi is ongoing. Here are some things that are in the pipeline:

  • Get pygame_cffi on PyPy to a place where it is consistently faster than pygame on CPython.
  • Implement the remaining modules and functions, starting with draw and transform.
  • Improve test coverage.
  • Reduce the time it takes for CFFI to parse the cdef. This makes the initial pygame import slow.

If you want to contribute, you can find pygame_cffi on GitHub. Feel free to find us on #pypy on freenode or post issues on GitHub.

Cheers,
Rizmari Versfeld


Saturday, March 15, 2014

STMGC-C7 with PyPy

Hi all,

Here is one of the first full PyPy builds (edit: it was r69967+, but the general list of versions is currently here) compiled with the new STMGC-C7 library. It has no JIT so far, but it runs some small single-threaded benchmarks taking around 40% more time than a corresponding non-STM, no-JIT version of PyPy. It scales --- up to two threads only, which is the hard-coded maximum so far in the c7 code. But the scaling looks perfect in these small benchmarks without conflicts: starting two threads, each running a copy of the benchmark, takes almost exactly the same total time, simply using two cores.

Feel free to try it! It is not actually useful so far, because it is limited to two cores and CPython is something like 2.5x faster. One of the important next steps is to re-enable the JIT. Based on our current understanding of the "40%" figure, we can probably reduce it with enough effort; but also, the JIT should be able to easily produce machine code that suffers a bit less than the interpreter from these effects. This seems to mean that we're looking at 20%-ish slow-downs for the future PyPy-STM-JIT.

Interesting times :-)

For reference, this is what you get by downloading the PyPy binary linked above: a Linux 64 binary (Ubuntu 12.04) that should behave mostly like a regular PyPy. (One main missing feature is that destructors are never called.) It uses two cores, but obviously only if the Python program you run is multithreaded. The only new built-in feature is with __pypy__.thread.atomic: this gives you a way to enforce that a block of code runs "atomically", which means without any operation from any other thread randomly interleaved.
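
As a sketch of what the atomic feature looks like in use (the surrounding program is illustrative only, not from the post):

import thread
import __pypy__

def add_many(counter):
    for i in range(100000):
        with __pypy__.thread.atomic:
            # this whole block executes with no operation from any
            # other thread interleaved
            counter[0] += 1

counter = [0]
thread.start_new_thread(add_many, (counter,))
add_many(counter)   # the increments from the two threads never race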

If you want to translate it yourself, you need a trunk version of clang with three patches applied. That's the number of bugs that we couldn't find workarounds for, not the total number of bugs we found by (ab)using the address_space feature...

Stay tuned for more!

Armin & Remi

Monday, March 10, 2014

PyPy on uWSGI

Hello everyone

There is an interview with Roberto De Ioris (of uWSGI fame) about embedding PyPy in uWSGI. It covers the recent addition of a PyPy embedding interface using cffi and the experience of using it. Read the full interview

Cheers
fijal

Friday, March 7, 2014

NumPy on PyPy - Progress in February

More progress was made on the NumPy front in the past month. On the compatibility front, we now pass ~130 more tests from NumPy's suite since the end of January. Currently, we pass 2336 tests out of 3265 tests run, with many of the failures representing portions of NumPy that we don't plan to implement in the near future (object dtypes, unicode, etc). There are still some failures that do represent issues, such as special indexing cases and failures to respect subclassed ndarrays in return values, which we do plan to resolve. There are also some unimplemented components and ufuncs remaining which we hope to implement, such as nditer and mtrand. Overall, the most common array functionality should be working.

Additionally, I began to take a look at some of the loops generated by our code. One widely used loop is dot, and we were running about 5x slower than NumPy's C version. I was able to optimize the dot loop and also the general array iterator to get us to ~1.5x NumPy C time on dot operations of various sizes. Further progress in this area could be made by using CFFI to tie into BLAS libraries, when available. Also, work remains in examining traces generated for our other loops and checking for potential optimizations.
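
As a rough sketch of what "using CFFI to tie into BLAS libraries" could look like (hypothetical: the real dot loop lives in RPython micronumpy, and the library name here is an assumption):

from cffi import FFI

ffi = FFI()
ffi.cdef("double cblas_ddot(int n, const double *x, int incx,"
         "                  const double *y, int incy);")
blas = ffi.dlopen("openblas")   # assumes an OpenBLAS build with CBLAS symbols

x = ffi.new("double[]", [1.0, 2.0, 3.0])
y = ffi.new("double[]", [4.0, 5.0, 6.0])
print(blas.cblas_ddot(3, x, 1, y, 1))   # 1*4 + 2*5 + 3*6 = 32.0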

To try out PyPy + NumPy, grab a nightly PyPy and install our NumPy fork. Feel free to report comments/issues to IRC, our mailing list, or bug tracker. Thanks to the contributors to the NumPy on PyPy proposal for supporting this work.

Cheers,
Brian

Tuesday, February 18, 2014

Py3k status update #13

This is the 13th status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.

We're just finishing up a cleanup of the int/long types. This work helps the
py3k branch unify these types into the Python 3 int and restore JIT
compilation of machine-sized integers.

This cleanup also removes multimethods from these types. PyPy has
historically used a clever implementation of multimethod dispatch for declaring
methods of the __builtin__ types in RPython.

This multimethod scheme provides some convenient features for doing this;
however, we've come to the conclusion that it may be more trouble than it's
worth. A major problem of multimethods is that they generate a large number of
stub methods, which burden the already lengthy and memory-hungry RPython
translation process. Also, their implementation and behavior can be somewhat
complicated/obscure.

The alternative to multimethods involves doing the work of the type checking
and dispatching rules in a more verbose, manual way. It's a little more work in
the end but less magical.
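
Here is a toy illustration of that manual style (not PyPy's actual code):
every type combination is spelled out explicitly instead of being generated
by the multimethod machinery.

class W_Int(object):
    def __init__(self, value):
        self.value = value

class W_Float(object):
    def __init__(self, value):
        self.value = value

def add(w_a, w_b):
    # manual, verbose dispatch over the operand types
    if isinstance(w_a, W_Int) and isinstance(w_b, W_Int):
        return W_Int(w_a.value + w_b.value)
    if isinstance(w_a, (W_Int, W_Float)) and isinstance(w_b, (W_Int, W_Float)):
        return W_Float(float(w_a.value) + float(w_b.value))
    raise TypeError("unsupported operand types")

print(add(W_Int(2), W_Int(3)).value)      # 5
print(add(W_Int(2), W_Float(0.5)).value)  # 2.5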

Recently, Manuel Jacob finished a large cleanup effort of the
unicode/string/bytearray types that also removed their multimethods. This work
also benefits the py3k branch: it'll help with future PEP 393 (or PEP 393
alternative) work. This effort was partly sponsored by Google's Summer of
Code: thanks Manuel and Google!

Now there are only a couple of major pieces left in the multimethod removal
(the float/complex types and special marshaling code) and a few minor pieces
that should be relatively easy.

In conclusion, there's been some good progress made on py3k and multimethod
removal this winter, albeit a bit slower than we would have liked.

cheers,
Phil

Sunday, February 9, 2014

Rewrites of the STM core model -- again

Hi all,

A quick note about the Software Transactional Memory (STM) front.

Since the previous post, we believe we have progressed a lot by discovering an alternative core model for software transactions. Why do I say "believe"? Because it means that we have to rewrite the C library handling STM from scratch, again. This is currently work in progress. Once it is done, we should be able to adapt the existing pypy-stm to run on top of it without much rewriting effort; in fact it should simplify the difficult issues we ran into for the JIT. So while this is basically yet another restart similar to last June's, the difference is that the work we have already put into the PyPy part (as opposed to the C library) remains.

You can read about the basic ideas of this new C library here. It is still STM-only, not HTM, but because it doesn't constantly move objects around in memory, it would be easier to adapt an HTM version. There are even potential ideas about a hybrid TM, like using HTM but only to speed up the commits. It is based on a Linux-only system call, remap_file_pages() (poll: who heard about it before? :-). As previously, the work is done by Remi Meier and myself.

Currently, the C library is incomplete, but early experiments show good results in running duhton, the interpreter for a minimal language created for the purpose of testing STM. "Good results" means that we brought the slow-down down from 60-80% (previous version) to around 15% (current version). This number measures the slow-down from the non-STM-enabled to the STM-enabled version, on one CPU core; of course, the idea is that the STM version scales up when using more than one core.

This means that we are looking forward to a result much better than originally predicted. pypy-stm now has a chance of running at a one-thread speed that is only "n%" slower than the regular pypy-jit, for a value of "n" that is optimistically 15 --- but more likely somewhere around 25 or 50. This is seriously better than the original estimate, which was "between 2x and 5x". It would mean that using pypy-stm is quite worthwhile even with just two cores.

More updates later...

Armin

Thursday, February 6, 2014

NumPy Status Update - December/January

Work continued on the NumPy + PyPy front steadily in December and more lightly in January. The continued focus was compatibility, targeting incorrect or unimplemented features that appeared in multiple NumPy test suite failures. We now pass ~2/3 of the NumPy test suite. The biggest improvements were made in these areas:

- Fixed bugs in conversions of arrays/scalars to/from native types
- Fixed cases where we would choose incorrect dtypes when initializing or computing results
- Improved handling of subclasses of ndarray through computations
- Added support for some optional arguments for array methods that are used in the pure-python part of NumPy
- Added support for additional attributes in arrays, array.flags, and dtypes
- Fixed some indexing corner cases that arise in NumPy testing
- Implemented parts of numpy.fft (cffti and cfftf)

Looking forward, we plan to continue improving the correctness of the existing implemented NumPy functionality, while also beginning to look at performance. The initial focus for performance will be to look at areas where we are significantly worse than CPython+NumPy. Those interested in trying these improvements out will need a PyPy nightly, and an install of the PyPy NumPy fork. Thanks again to the NumPy on PyPy donors for funding this work.

Tuesday, December 10, 2013

NumPy Status Update - November

Since the PyPy 2.2 release last month, more progress has been made on the NumPy compatibility front. Initial work has been directed by running the NumPy test suite and targeting failures that appear most frequently, along with fixing the few bugs reported on the bug tracker.

Improvements were made in these areas:
- Many missing or broken scalar functionalities were added or fixed; the scalar API should match up more closely with arrays now
- Some missing dtype functionality was added (newbyteorder, hasobject, descr, etc)
- Support for optional arguments (axis, order) was added to some ndarray functions
- Some corner cases for string/record types were fixed

Most of these improvements went onto trunk after 2.2 was split, so if you're interested in trying them out or running into problems on 2.2, try the nightly.

Thanks again to the NumPy on PyPy donors who make this continued progress possible.

Cheers,
Brian

Monday, December 9, 2013

PyGame CFFI

One of the Raspberry Pi's goals is to be a fun toolkit for school children (and adults!) to learn programming and electronics with. Python and pygame are part of this toolkit. Recently the Raspberry Pi Foundation funded part of the effort of porting PyPy to the Pi -- making Python programs on the Pi faster!

Unfortunately pygame is written as a Python C extension that wraps SDL, which means its performance under PyPy remains mediocre. To fix this, pygame needs to be rewritten using cffi to wrap SDL instead.

The Raspberry Pi Foundation sponsored a CTPUG (Cape Town Python User Group) hackathon to put together a proof-of-concept pygame-cffi. The day was quite successful - we got a basic version of the bub'n'bros client working on pygame-cffi (and on PyPy). The results can be found on github with contributions from the five people present at the sprint.

While far from complete, the proof of concept does show that there are no major obstacles to porting pygame to cffi and that cffi is a great way to bind your Python package to C libraries.

Amazingly, we managed to have machines running all three major platforms (OS X, Linux and Windows) at the hackathon so the code runs on all of them!

We would like to thank the Praekelt Foundation for providing the venue and the Raspberry Pi Foundation for providing food and drinks!

Cheers,
Simon Cross, Jeremy Thurgood, Neil Muller, David Sharpe and fijal.


Saturday, November 30, 2013

PyPy Leysin Winter Sprint (11-19 January 2014)

The next PyPy sprint will be in Leysin, Switzerland, for the ninth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

Goals and topics of the sprint

  • Py3k: work towards supporting Python 3 in PyPy
  • NumPyPy: work towards supporting the numpy module in PyPy
  • STM: work towards supporting Software Transactional Memory
  • And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for ski.

Exact times

For a change, and as an attempt to simplify things, I specified the dates as 11-19 January 2014, where 11 and 19 are travel days. We will work full days between the 12th and the 18th. You are of course welcome to show up for only part of that time, too.

Location & Accommodation

Leysin, Switzerland, "same place as before". Let me refresh your memory: both the sprint venue and the lodging will be in a very spacious pair of chalets built specifically for bed & breakfast: http://www.ermina.ch/. The place has a good ADSL Internet connection with wireless installed. You can of course arrange your own lodging anywhere (as long as you are in Leysin, you cannot be more than a 15-minute walk away from the sprint venue), but I definitely recommend lodging there too -- you won't find a better view anywhere else (though you probably won't easily get a much worse one, either :-)

Please confirm that you are coming so that we can adjust the reservations as appropriate. The rate so far has been around 60 CHF a night all included in 2-person rooms, with breakfast. There are larger rooms too (less expensive per person) and maybe the possibility to get a single room if you really want to.

Please register via Mercurial:

https://bitbucket.org/pypy/extradoc/
https://bitbucket.org/pypy/extradoc/raw/extradoc/sprintinfo/leysin-winter-2014

or on the pypy-dev mailing list if you do not yet have check-in rights:

http://mail.python.org/mailman/listinfo/pypy-dev

You need a Swiss-to-(insert country here) power adapter. There will be some Swiss-to-EU adapters around -- bring an EU-format power strip if you have one.

Wednesday, November 27, 2013

PyPy 2.2.1 - Incrementalism.1

We're pleased to announce PyPy 2.2.1, which targets version 2.7.3 of the Python language. This is a bugfix release over 2.2.

You can download the PyPy 2.2.1 release here:

http://pypy.org/download.html

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.2 and cpython 2.7.2 performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows 32, or ARM (ARMv6 or ARMv7, with VFPv3).

Work on the native Windows 64 port is still stalled; we would welcome a volunteer to handle that.

Highlights

This is a bugfix release. The most important bugs fixed are:

  • an issue in sockets' reference counting emulation, showing up notably when using the ssl module and calling makefile().
  • Tkinter support on Windows.
  • If sys.maxunicode==65535 (on Windows and maybe OS X), the json decoder incorrectly decoded surrogate pairs.
  • some FreeBSD fixes.

Note that CFFI 0.8.1 was released. Both versions 0.8 and 0.8.1 are compatible with both PyPy 2.2 and 2.2.1.

Cheers, Armin Rigo & everybody

Sunday, November 17, 2013

CFFI 0.8

Hi all,

CFFI 0.8 for CPython (2.6-3.x) has been released.

Quick download: pip install cffi --upgrade
Documentation: https://cffi.readthedocs.org/en/release-0.8/

What's new: a number of small fixes; ffi.getwinerror(); integrated support for C99 variable-sized structures; multi-thread safety.

--- Armin

Update: CFFI 0.8.1, with fixes for Python 3 on OS X, and some FreeBSD fixes (thanks Tobias).

Friday, November 15, 2013

NumPy status update

Here is what has been happening with NumPy in PyPy in October thanks to the people who donated to the NumPyPy proposal:

The biggest change is that we shifted to using an external fork of numpy rather than a minimal numpypy module. The idea is that we will be able to reuse most of the upstream pure-python numpy components, replacing the C modules with appropriate RPython micronumpy pieces at the correct places in the module namespace.

The numpy fork should work just as well as the old numpypy for functionality that existed previously, and also include much new functionality from the pure-python numpy pieces that simply hadn't been imported yet in numpypy. However, this new functionality will not have been "hand picked" to only include pieces that work, so you may run into functionality that relies on unimplemented components (which should fail with user-level exceptions).

This setup also allows us to run the entire numpy test suite, which will help in directing future compatibility development. The recent PyPy release includes these changes, so download it and let us know how it works! And if you want to live on the edge, the nightly includes even more numpy progress made in November.

To install the fork, download the latest release, and then install numpy either separately with a virtualenv: pip install git+https://bitbucket.org/pypy/numpy.git; or directly: git clone https://bitbucket.org/pypy/numpy.git; cd numpy; pypy setup.py install.

EDIT: if you install numpy as root, you may need to also import it once as root before it works: sudo pypy -c 'import numpy'

Along with this change, progress was made in fixing internal micronumpy bugs and increasing compatibility:
  • Fixed a bug with strings in record dtypes
  • Fixed a bug where the multiplication of an ndarray with a Python int or float resulted in loss of the array's dtype
  • Fixed several segfaults encountered in the numpy test suite (suite should run now without segfaulting)

We also began working on __array_prepare__ and __array_wrap__, which are necessary pieces for a working matplotlib module.
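
These hooks are the standard NumPy mechanism that lets ndarray subclasses survive ufunc calls; here is a minimal example of the behavior micronumpy needs to reproduce (standard NumPy API, simplified):

import numpy as np

class MyArray(np.ndarray):
    def __array_wrap__(self, out_arr, context=None):
        # called after a ufunc computes out_arr, so the result keeps our type
        return np.ndarray.__array_wrap__(self, out_arr, context)

a = np.arange(3).view(MyArray)
print(type(a + 1))   # <class '__main__.MyArray'>, not plain ndarray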

Cheers,
Romain and Brian

Thursday, November 14, 2013

PyPy 2.2 - Incrementalism

We're pleased to announce PyPy 2.2, which targets version 2.7.3 of the Python language. This release's main highlight is the introduction of the incremental garbage collector, sponsored by the Raspberry Pi Foundation. It also contains several bugfixes and performance improvements.
You can download the PyPy 2.2 release here:
http://pypy.org/download.html
We would like to thank our donors for the continued support of the PyPy project. We showed quite a bit of progress on all three projects (see below) and we're slowly running out of funds. Please consider donating more so we can finish those projects! The three projects are:
  • Py3k (supporting Python 3.x): the release PyPy3 2.2 is imminent.
  • STM (software transactional memory): a preview will be released as soon as we fix a few remaining bugs
  • NumPy: the work done is included in the PyPy 2.2 release. More details below.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.2 and cpython 2.7.2 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows 32, or ARM (ARMv6 or ARMv7, with VFPv3).
Work on the native Windows 64 port is still stalled; we would welcome a volunteer to handle that.

Highlights

  • Our Garbage Collector is now "incremental". It should avoid almost all pauses due to a major collection taking place. Previously, it would pause the program (rarely) to walk all live objects, which could take arbitrarily long if your process is using a whole lot of RAM. Now the same work is done in steps. This should make PyPy more responsive, e.g. in games. There are still other pauses, from the GC and the JIT, but they should be on the order of 5 milliseconds each.
  • The JIT counters for hot code were never reset, which meant that a process running for long enough would eventually JIT-compile more and more rarely executed code. Not only is it useless to compile such code, but as more compiled code means more memory used, this gives the impression of a memory leak. This has been tentatively fixed by decreasing the counters from time to time.
  • NumPy has been split: now PyPy only contains the core module, called _numpypy. The numpy module itself has been moved to https://bitbucket.org/pypy/numpy and numpypy disappeared. You need to install NumPy separately with a virtualenv: pip install git+https://bitbucket.org/pypy/numpy.git; or directly: git clone https://bitbucket.org/pypy/numpy.git; cd numpy; pypy setup.py install.
  • non-inlined calls have less overhead
  • Things that use sys.settrace are now JITted (like coverage)
  • JSON decoding is now very fast (JSON encoding was already very fast)
  • various buffer copying methods experience speedups (like list-of-ints to int[] buffer from cffi)
  • We finally wrote (hopefully) all the missing os.xxx() functions, including os.startfile() on Windows and a handful of rare ones on Posix.
  • numpy has a rudimentary C API that cooperates with cpyext
Cheers,
Armin Rigo and Maciej Fijalkowski

Wednesday, November 13, 2013

Py3k status update #12

This is the 12th status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.

Here's an update on the recent progress:

  • Thank you to everyone who has provided initial feedback on the PyPy3 2.1
    beta 1 release. We've gotten a number of bug reports, most of which have
    been fixed.
  • As usual, we're continually keeping up with changes from the default
    branch. Oftentimes these merges come at a cost (conflicts and/or
    reintegration of py3k changes) but occasionally we get goodies for free,
    such as the recent JIT optimizations and incremental garbage collection.
  • We've been focusing on re-optimizing Python 2's machine-sized integers:

We have a couple of known, notable speed regressions in the PyPy3 beta release
vs regular PyPy. The major one involves Python 2.x int-sized (machine-sized)
integers.

Python 3 drops the distinction between the int and long types. CPython 3.x
accomplishes this by removing the old int type entirely and renaming the long
type to int. Initially we did the same for PyPy3, for the sake of simplicity
and getting everything working.
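
Concretely, the distinction being unified (a quick illustration in Python 2
terms):

# Python 2:
print(type(1))         # <type 'int'>  - machine-sized, JIT-friendly
print(type(2 ** 100))  # <type 'long'> - arbitrary precision
# Python 3 has only `int`; PyPy3 must recover the machine-sized fast
# path internally whenever values fit in a machine word.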

However PyPy's JIT is capable of heavily optimizing these machine sized integer
operations, so this came with a regression in performance in this area.

We're now in the process of solving this. Part of this work also involves some
house cleaning on these numeric types which also benefits the default branch.

cheers,
Phil

Saturday, October 26, 2013

Making coverage.py faster under PyPy

If you've ever tried to run your programs with coverage.py under PyPy,
you've probably experienced some incredible slowness. Take this simple
program:

def f():
    return 1


def main():
    i = 10000000
    while i:
        i -= f()

main()

Running time coverage.py run test.py five times, and looking at the best
run, here's how PyPy 2.1 stacks up against CPython 2.7.5:

Python           Time      Normalized to CPython
CPython 2.7.5    3.879s    1.0x
PyPy 2.1         53.330s   13.7x slower

Totally ridiculous. I got turned onto this problem because on one of my
projects CPython takes about 1.5 minutes to run our test suite on the build
bot, but PyPy takes 8-10 minutes.

So I sat down to address it. And the results:

Python           Time      Normalized to CPython
CPython 2.7.5    3.879s    1.0x
PyPy 2.1         53.330s   13.7x slower
PyPy head        1.433s    2.7x faster

Not bad.

Technical details

So how'd we do it? Previously, using sys.settrace() (which coverage.py
uses under the hood) disabled the JIT. Except it didn't just disable the JIT,
it did it in a particularly insidious way — the JIT had no idea it was being
disabled!

Instead, every time PyPy discovered that one of your functions was a hotspot,
it would start tracing to observe what the program was doing, and right when
it was about to finish, coverage would run and cause the JIT to abort. Tracing
is a slow process; it normally makes up for that by generating fast machine
code at the end, but here we never actually got to the "generate fast machine
code" stage. We'd pay all the cost of tracing, then abort, and reap none of
the benefits.

To fix this, we adjusted some of the heuristics in the JIT, to better show it
how sys.settrace(<tracefunc>) works. Previously the JIT saw it as an opaque
function which gets the frame object, and couldn't tell whether or not it
messed with the frame object. Now we let the JIT look inside the
<tracefunc> function, so it's able to see that coverage.py isn't
messing with the frame in any weird ways; it's just reading the line number
and file path out of it.
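
For a concrete picture, here is a minimal sketch of such a trace function
(illustrative only; coverage.py's real tracer is more elaborate). Note that it
only reads data out of the frame:

import sys

executed = set()

def tracer(frame, event, arg):
    # reads the line number and file path out of the frame; never mutates it
    if event == "line":
        executed.add((frame.f_code.co_filename, frame.f_lineno))
    return tracer

def f():
    return 1

sys.settrace(tracer)
f()
sys.settrace(None)
print(sorted(executed))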

I asked several friends in the VM implementation and research field if they
were aware of any other research into making VMs stay fast when debugging tools
like coverage.py are running. No one I spoke to was aware of any (but I
didn't do a particularly exhaustive review of the literature, I just tweeted at
a few people), so I'm pleased to say that PyPy is quite possibly the first VM
to work on optimizing code in debugging mode! This is possible because of our
years spent investing in meta-tracing research.

Happy testing,
Alex

Wednesday, October 16, 2013

Update on STM

Hi all,

The sprint in London was a lot of fun and very fruitful. In the last update on STM, Armin was working on improving and specializing the automatic barrier placement. There is still a lot to do in that area, but that work is merged now. Specializing and improving barrier placement is still to be done for the JIT.

But that is not all. Right after the sprint, we were able to squeeze out the last obvious bugs in the STM-JIT combination. However, the performance was nowhere near what we want. So far, we have fixed some of the most obvious issues. Many come from RPython erring on the side of caution and e.g. making a transaction inevitable even when that is not strictly necessary, thereby limiting parallelism. Another problem came from increasing counters every time a guard fails, which caused transactions to conflict on these counter updates. Since these counters do not have to be completely accurate, we now update them non-transactionally, accepting a chance of small errors.

There are still many such performance issues of various complexity left to tackle: we are nowhere near done. So stay tuned or contribute :)

Performance

Now, since the JIT is all about performance, we want to at least show you some numbers that are indicative of things to come. Our set of STM benchmarks is very small unfortunately (something you can help us out with), so this is not representative of real-world performance. We tried to minimize the effect of JIT warm-up in the benchmark results.

The machine these benchmarks were executed on has 4 physical cores with Hyper-Threading (8 hardware threads).

Raytracer from stm-benchmarks: Render times in seconds for a 1024x1024 image:

Interpreter    Base time (1 thread)    8 threads (speedup)
PyPy-2.1       2.47                    2.56 (0.96x)
CPython        81.1                    73.4 (1.1x)
PyPy-STM       50.2                    10.8 (4.6x)

For comparison, disabling the JIT gives 148s on PyPy-2.1 and 87s on PyPy-STM (with 8 threads).

Richards from PyPy repository on the stmgc-c4 branch: Average time per iteration in milliseconds:

Interpreter    Base time (1 thread)    8 threads (speedup)
PyPy-2.1       15.6                    15.4 (1.01x)
CPython        239                     237 (1.01x)
PyPy-STM       371                     116 (3.2x)

For comparison, disabling the JIT gives 492ms on PyPy-2.1 and 538ms on PyPy-STM.

Try it!

All this can be found in the PyPy repository on the stmgc-c4 branch. Try it for yourself, but keep in mind that this is still experimental with a lot of things yet to come. Only Linux x64 is supported right now, but contributions are welcome.

You can download a prebuilt binary from here: https://bitbucket.org/pypy/pypy/downloads/pypy-oct13-stm.tar.bz2 (Linux x64 Ubuntu >= 12.04). This was made at revision bafcb0cdff48.

Summary

What the numbers tell us is that PyPy-STM is, as expected, the only one of the three interpreters where multithreading gives a large improvement in speed. What they also tell us is that, obviously, the result is not good enough yet: it still takes longer on an 8-threaded PyPy-STM than on a regular single-threaded PyPy-2.1. However, as you should know by now, we are good at promising speed and delivering it... years later :-)

But it has been two years already since PyPy-STM started, and this is our first preview of the JIT integration. Expect major improvements soon: with STM, the JIT generates code that is completely suboptimal in many cases (barriers, allocation, and more). Once we improve this, the performance of the STM-JITted code should come much closer to PyPy 2.1.

Cheers

Remi & Armin

Tuesday, October 15, 2013

Incremental Garbage Collector in PyPy

Hello everyone.

We're pleased to announce that as of today, the default PyPy comes with a GC that has much smaller pauses than yesterday.

Let's start with explaining roughly what GC pauses are. In CPython each object has a reference count, which is incremented each time we create references and decremented each time we forget them. This means that objects are freed each time they become unreachable. That is only half of the story though. First note that when the last reference to a large tree of objects goes away, you have a pause: all the objects are freed. Your program is not progressing at all during this pause, and this pause's duration can be arbitrarily large. This occurs at deterministic times, though. But consider code like this:

class A(object):
    pass

a = A()
b = A()
a.item = b
b.item = a
del a
del b

This creates a reference cycle. It means that while we deleted references to a and b from the current scope, they still have a reference count of 1, because they point to each other, even though the whole group has no references from the outside. CPython employs a cyclic garbage collector which is used to find such cycles. It walks over all objects in memory, starting from some known roots, such as type objects, variables on the stack, etc. This solves the problem, but can create noticeable, nondeterministic GC pauses as the heap becomes large and convoluted.
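
Continuing the example, the cyclic collector can be invoked explicitly on CPython to reclaim that cycle:

import gc

collected = gc.collect()  # walks objects from the known roots
print(collected)          # a small number: the two instances and their dicts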

PyPy essentially has only the cycle finder - it does not bother with reference counting; instead it walks live objects every now and then (this is a big simplification, PyPy's GC is much more complex than this). Although this might sound like a missing feature, it is really one of the reasons why PyPy is so fast, because at the end of the day the total time spent managing the memory is lower in PyPy than in CPython. However, as a result, PyPy also has the problem of GC pauses.

To alleviate this problem, which is essential for applications like games, we started to work on an incremental GC, which spreads the walking of objects and cleaning them across the execution time in smaller intervals. The work was sponsored by the Raspberry Pi Foundation, started by Andrew Chambers and finished by Armin Rigo and Maciej Fijałkowski.

Benchmarks

Everyone loves benchmarks. We did not measure any significant speed difference on our quite extensive benchmark suite on speed.pypy.org. The main benchmark that we used for other comparisons was translating the topaz ruby interpreter using various versions of PyPy and CPython. The exact command was python <pypy-checkout>/bin/rpython -O2 --rtype targettopaz.py. Versions:

  • topaz - dce3eef7b1910fc5600a4cd0afd6220543104823
  • pypy source - defb5119e3c6
  • pypy compiled with minimark (non-incremental GC) - d1a0c07b6586
  • pypy compiled with incminimark (new, incremental GC) - 417a7117f8d7
  • CPython - 2.7.3

The memory usage of CPython, PyPy with minimark and PyPy with incminimark is shown here. Note that this benchmark is quite bad for PyPy in general: the memory usage is higher and the amount of time taken is longer. This is due to the JIT warmup being both memory hungry and inefficient (see below). The important point, though, is that the new GC is no worse than the old one.

EDIT: Red line is CPython, blue is incminimark (new), green is minimark (old).

The image was obtained by graphing the output of memusage.py.

However, the GC pauses are significantly smaller. For PyPy, the way to get GC pauses is to measure the time between start and stop while running things with PYPYLOG=gc-collect:log pypy program.py; for CPython, the magic incantation is gc.set_debug(gc.DEBUG_STATS) and parsing the output. For what it's worth, the average and total for CPython, as well as the total number of events, are not directly comparable, since they only cover the cyclic collector, not the reference counting. The only comparable thing is the number of long pauses and their duration. In the tables below, pause durations are sorted into 8 buckets, each meaning "below or equal to that threshold". The output was generated using the gcanalyze tool.

CPython:

150.1ms   300.2ms   450.3ms   600.5ms   750.6ms   900.7ms   1050.8ms   1200.9ms
5417      5         3         2         1         1         0          1

PyPy minimark (non-incremental GC):

216.4ms   432.8ms   649.2ms   865.6ms   1082.0ms   1298.4ms   1514.8ms   1731.2ms
27        14        6         4         6          5          3          3

PyPy incminimark (new incremental GC):

15.7ms    31.4ms    47.1ms    62.8ms    78.6ms    94.3ms    110.0ms    125.7ms
25512     122       4         1         0          0          0          2

As we can see, while there is still work to be done (the 100ms ones could be split among several steps), we did improve the situation quite drastically without any actual performance difference.

Note about the benchmark - we know it's a pretty extreme case of JIT warmup, we know we suck on it, we're working on it and we're not afraid of showing PyPy is not always the best ;-)

Nitty gritty details

Here are some nitty gritty details for people really interested in Garbage Collection. This was done as a patch to "minimark", our current GC, and called "incminimark" for now. The former is a generational stop-the-world GC. New objects are allocated "young", which means that they initially live in the "nursery", a special zone of a few MB of memory. When the nursery is full, a "minor collection" step moves the surviving objects out of the nursery. This can be done quickly (a few milliseconds) because we only need to walk through the young objects that survive --- usually a small fraction of all young objects; and also by far not all objects that are alive at this point, but only the young ones. However, from time to time this minor collection is followed by a "major collection": in that step, we really need to walk all objects to classify which ones are still alive and which ones are now dead ("marking") and free the memory occupied by the dead ones ("sweeping"). You can read more details here.

This "major collection" is what gives the long GC pauses. To fix this problem we made the GC incremental: instead of running one complete major collection, we split its work into a variable number of pieces and run each piece after every minor collection for a while, until there are no more pieces. The pieces are each doing a fraction of marking, or a fraction of sweeping. It adds some few milliseconds after each of these minor collections, rather than requiring hundreds of milliseconds in one go.

The main issue is that splitting the major collections means that the main program is actually running between the pieces, and so it can change the pointers in the objects to point to other objects. This is not a problem for sweeping: dead objects will remain dead whatever the main program does. However, it is a problem for marking. Let us see why.

In terms of the incremental GC literature, objects are either "white", "gray" or "black". This is called tri-color marking. See for example this blog post about Rubinius, or this page about LuaJIT or the wikipedia description. The objects start as "white" at the beginning of marking; become "gray" when they are found to be alive; and become "black" when they have been fully traversed. Marking proceeds by scanning grey objects for pointers to white objects. The white objects found are turned grey, and the grey objects scanned are turned black. When there are no more grey objects, the marking phase is complete: all remaining white objects are truly unreachable and can be freed (by the following sweeping phase).
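
A toy sketch of the marking loop just described (illustrative Python, not PyPy's RPython code):

WHITE, GRAY, BLACK = "white", "gray", "black"

class Obj(object):
    def __init__(self):
        self.color = WHITE
        self.refs = []          # outgoing pointers

def mark(roots):
    gray_list = []
    for obj in roots:           # the roots start out gray
        obj.color = GRAY
        gray_list.append(obj)
    while gray_list:
        obj = gray_list.pop()
        for child in obj.refs:  # scan a gray object for white children
            if child.color == WHITE:
                child.color = GRAY
                gray_list.append(child)
        obj.color = BLACK       # fully traversed
    # anything still WHITE here is unreachable; sweeping can free it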

In this model, the important part is that a black object can never point to a white object: if the latter remains white until the end, it will be freed, which is incorrect because the black object itself can still be reached. How do we ensure that the main program, running in the middle of marking, will not try to write a pointer to a white object into a black object? This requires a "write barrier", i.e. a piece of code that runs every time we set a pointer into an object or array. This piece of code checks if some (hopefully rare) condition is met, and calls a function if that is the case.

The trick we used in PyPy is to consider minor collections as part of the whole, rather than focus only on major collections. The existing minimark GC had always used a write barrier of its own to do its job, like any generational GC. This existing write barrier is used to detect when an old object (outside the nursery) is modified to point to a young object (inside the nursery), which is essential information for minor collections. Actually, although this was the goal, the actual write barrier code is simpler: it just records all old objects into which we write any pointer --- to a young or old object. As we found out over time, doing so is not actually slower, and might actually be a performance improvement: for example, if the main program does a lot of writes into the same old object, we don't need to check over and over again if the written pointer points to a young object or not. We just record the old object in some list the first time, and that's it.
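
A sketch of that write barrier's logic (again illustrative, not the actual RPython code):

remembered = []   # old objects into which we wrote any pointer

class GCObject(object):
    def __init__(self, is_old):
        self.is_old = is_old
        self.recorded = False
        self.fields = {}

def write_barrier(obj, name, value):
    # record an old object the first time anything is written into it;
    # later writes to the same object skip straight to the cheap path
    if obj.is_old and not obj.recorded:
        obj.recorded = True
        remembered.append(obj)
    obj.fields[name] = value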

The trick is that this unmodified write barrier works for incminimark too. Imagine that we are in the middle of the marking phase, running the main program. The write barrier will record all old objects that are being modified. Then at the next minor collection, all surviving young objects will be moved out of the nursery. At this point, as we're about to continue running the major collection's marking phase, we simply add to the list of pending gray objects all the objects that we just considered --- both the objects listed as "old objects that are being modified", and the objects that we just moved out of the nursery. A fraction of the former list were black objects, so this means that they are turned back from black to gray. This technique implements nicely, if indirectly, what is called a "backward write barrier" in the literature. The backwardness is about the color that needs to be changed in the opposite of the usual direction "white -> gray -> black", thus making more work for the GC. (This is as opposed to a "forward write barrier", where we would also detect "black -> white" writes but turn the white object gray.)

In summary, I realize that this description is less about how we turned minimark into incminimark, and more about how we differ from the standard way of making a GC incremental. What we really had to do to make incminimark was to write logic that says "if the major collection is in the middle of the marking phase, then add this object to the list of gray objects", and put it at a few places throughout minor collection. Then we simply split a major collection into increments, doing marking or sweeping of some (relatively arbitrary) number of objects before returning. That's why, after we found that the existing write barrier would do, it was not much actual work, and could be done without major changes. For example, not a single line from the JIT needed adaptation. All in all it was relatively painless work. ;-)

Cheers,
armin and fijal