
Addicted to peak performance

Sooner or later the day comes when new software is benchmarked against existing software. We believe that we achieve very good performance by C++ standards, but we are still noticeably slower than hand-tuned machine-language codes. We addressed this issue with a strategy similar to Python's.

Python solves the problem of lower performance by not solving it. Instead, an interface to C/C++ named SWIG was established, and people now write the performance-critical core components in C/C++ and use them from Python. This way they benefit from the expressiveness of Python with run-time behavior comparable to C/C++.

Similarly, we stopped trying to reach peak performance at any price. The meticulously arranged register choreography of some numeric kernels implemented in assembly language cannot be generated as efficiently by most compilers.

In numbers: while many tuned BLAS libraries reach over 90 per cent of peak performance in dense matrix multiplication, we typically achieve 60 to 70 per cent of peak. That said, we stopped pushing C++ programs further into areas that today's compilers are not capable of supporting.

If tuned BLAS libraries reach such high performance (after a lot of hard work, though), why not use them? Following the old piece of wisdom: "If you can't beat them, join them."

So, we use the tuned libraries internally (we hesitate to say automagically). That usage remains transparent to the user, and this way we can provide BLAS performance with a more elegant programming style.
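
For illustration, here is a minimal sketch of what such application code might look like. It assumes MTL4's dense2D matrix type with operator[] element access and a build configured to forward dense products to a tuned BLAS; the sizes and fill values are arbitrary:

    #include <boost/numeric/mtl/mtl.hpp>

    int main()
    {
        const unsigned n = 500;
        mtl::dense2D<double> A(n, n), B(n, n), C(n, n);

        // Fill B and C with arbitrary values
        for (unsigned i = 0; i < n; ++i)
            for (unsigned j = 0; j < n; ++j) {
                B[i][j] = double(i + j);
                C[i][j] = double(i) - double(j);
            }

        // The user writes a plain product; internally the library can
        // dispatch this operation to a tuned BLAS routine (e.g., dgemm)
        // when one is available.
        A = B * C;

        return 0;
    }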

In addition, our library is limited neither to certain types nor to operations whose arguments all have the same type. We are able to handle mixed operations, e.g., multiplying float matrices with double vectors. And of course, we support matrices and vectors of all suitable built-in and user-defined types. In both cases, we provide decent performance.
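
As a sketch of such a mixed operation, assuming MTL4's dense2D and dense_vector types (the element types shown are illustrative; the library computes the product in a suitable result type):

    #include <iostream>
    #include <boost/numeric/mtl/mtl.hpp>

    int main()
    {
        const unsigned n = 4;
        mtl::dense2D<float>       A(n, n);    // single-precision matrix
        mtl::dense_vector<double> x(n), y(n); // double-precision vectors

        for (unsigned i = 0; i < n; ++i) {
            x[i] = 1.0 / double(i + 1);
            for (unsigned j = 0; j < n; ++j)
                A[i][j] = float(i + j + 1);
        }

        // Mixed-type product: float matrix times double vector
        y = A * x;

        std::cout << "y = " << y << '\n';
        return 0;
    }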

To summarize: assembly libraries allow for maximal speed on a rather limited number of types, whereas advanced template programming establishes almost competitive performance on an unlimited set of types while enabling the assembly performance where it is available. Thus, one can write applications with matrices and vectors of built-in or user-defined types and enjoy the maximal available speed. And we dare to bore the reader with the repetition of the fact that applications only contain code like A = B * C and the library chooses the optimal implementation. So, what do you have to lose, except that your programs look nicer?


