(Fixed precision) floating point performance: Best datatype?


Offline marcm200

  • *
  • Fractal Frankfurter
  • *
  • Posts: 564
« on: September 05, 2019, 07:00:10 PM »
I am currently looking for a faster way to achieve higher precision than long double (in C++, 63-bit mantissa). GCC's __float128 is very convenient for that purpose, but it's 20 times slower on my computer (checked with two very different level-14 Julia sets).

So I was wondering: Does anyone use a fast higher precision floating point datatype?

For now I'd be happy with 128 or 256 bits, or arbitrary precision if I can specify the initial length of compiler-generated temporary results up front to avoid costly memory allocations (I need a quadrillion calculations or so). It doesn't need to provide much functionality: I only need the four basic operations (and even division is not necessary for now) and some rounding/truncation.

Information on the Internet is quite contradictory as to which type/library works fastest/best/with the lowest overhead. Any first-hand fractal experience?

EDIT (December 17, 2019): removed the question prefix, as I now have several working high-precision data types to choose from.


Linkback: https://fractalforums.org/programming/11/fixed-precision-floating-point-performance-best-datatype/3048/
« Last Edit: December 17, 2019, 07:07:51 PM by marcm200, Reason: Prefix changed »

Offline claude

  • *
  • 3f
  • ******
  • Posts: 1344
    • mathr.co.uk
« Reply #1 on: September 05, 2019, 10:16:28 PM »
There is double-double arithmetic, using two doubles as an unevaluated sum. Inline C++ should have the lowest overhead when using the 'libqd' library, which also has quad-double (four doubles). They provide ~106 and ~212 bits of mantissa respectively. I'd be interested in your timings with these types, as I'm currently trying to figure out how to implement triple-double (which would be ~159 bits) for a speedup over quad-double. The technique applies just as well to single-precision float; sometimes this use case (on GPUs) is called "emulated doubles". For Haskell, the 'compensated' package has pure-Haskell implementations of the basic circuits, though it's missing the transcendental functions.
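
To make the unevaluated-sum idea concrete, here is a minimal C++ sketch of the basic circuits (this illustrates the general technique, not libqd's actual API; the names dd, two_sum and dd_add are made up for the sketch):
Code: [Select]
// A double-double value is stored as an unevaluated sum hi + lo,
// where |lo| is at most half an ulp of hi.
struct dd { double hi, lo; };

// Knuth's two_sum: returns s with s + err == a + b exactly.
static inline double two_sum(double a, double b, double &err) {
    double s  = a + b;
    double bb = s - a;                 // the part of b absorbed into s
    err = (a - (s - bb)) + (b - bb);   // exact rounding error of a + b
    return s;
}

// Simple ("sloppy") double-double addition built from two_sum.
static inline dd dd_add(dd a, dd b) {
    double e;
    double s = two_sum(a.hi, b.hi, e); // exact sum of the high parts
    e += a.lo + b.lo;                  // fold in the low parts
    double hi = s + e;                 // renormalize back to (hi, lo)
    double lo = e - (hi - s);
    return { hi, lo };
}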

Also worth trying: Boost's 'multiprecision' wrapper for MPFR. It has a stack-allocated, compile-time-fixed-precision variant for when you don't need fully arbitrary (runtime-chosen) precision, which should be competitive with low-level MPFR functions called from C (though I haven't benchmarked it to be sure).
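
If I remember the Boost.Multiprecision interface correctly, the fixed-precision stack-allocated variant looks something like this (the digit count and typedef name are arbitrary; check the Boost docs before relying on it):
Code: [Select]
#include <boost/multiprecision/mpfr.hpp>
namespace mp = boost::multiprecision;

// ~50 decimal digits (~166 bits), fixed at compile time, no heap allocation
typedef mp::number<mp::mpfr_float_backend<50, mp::allocate_stack> > flt50;

int main() {
    flt50 z = 0;
    flt50 c("-0.7499"); // construct from a string to avoid double rounding
    for (int i = 0; i < 1000; ++i)
        z = z * z + c;
    return 0;
}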

Offline hobold

  • *
  • Fractal Feline
  • **
  • Posts: 154
« Reply #2 on: September 05, 2019, 11:25:43 PM »
Going beyond what the hardware natively supports, i.e. double precision, will necessarily sacrifice a lot of performance. A factor of twenty for IEEE 128-bit float is actually not bad at all.

I think the fastest extended precision I've heard about was "double-double", on machines with hardware support for fused multiply-add. In that case you might lose only a factor of ten in performance.

(The fused multiply-add uses higher precision internally and does not round between the multiplication and the addition. With a bit of care, that internal higher precision can be exposed quite efficiently.)

Offline claude

  • *
  • 3f
  • ******
  • Posts: 1344
    • mathr.co.uk
« Reply #3 on: September 06, 2019, 01:05:48 AM »
Here's (untested!) GLSL code showing the huge difference a precise FMA makes to computing f*f → (f,f), i.e. a product together with its exact rounding error:
Code: [Select]
vec2 split(float a)
{
  // Veltkamp split: 4097 = 2^12 + 1 halves a 24-bit significand
  float t = 4097.0 * a;
  float ahi = t - (t - a);
  float alo = a - ahi;
  return vec2(ahi, alo);
}

// ff is the (hi, lo) float pair type, e.g. #define ff vec2
ff two_prod(float a, float b)
{
  float p = a * b;
#ifdef HAVE_FMA
  // a single fused operation recovers the exact rounding error of a*b
  float e = fma(a, b, -p);
#else
  // Dekker's product: reconstruct the error from the split halves
  vec2 a_ = split(a);
  vec2 b_ = split(b);
  float e = ((a_.x * b_.x - p) + a_.x * b_.y + a_.y * b_.x) + a_.y * b_.y;
#endif
  return ff(p, e);
}

Offline marcm200

  • *
  • Fractal Frankfurter
  • *
  • Posts: 564
« Reply #4 on: September 06, 2019, 10:15:24 AM »
Thanks for the answers!

@claude: I'll try libqd and Boost, which sound very promising (if I can figure out how to install them easily and safely).

@hobold: A factor of 20 not being bad - I wasn't aware that these calculations are so slow when emulated.

Maybe I should follow claude's approach and construct my own datatype. After all, the "input" numbers are of the form p/2^25, so it might work to enlarge everything to a denominator of (2^25)^n for f(z)=z^n+A*z+c and then calculate with the numerators as large integers, basically storing numbers in base 2^32 (see the sketch below). I have to think about that.
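
A minimal sketch of that base-2^32 idea in C++ (the limb count, names and truncating carry handling are all placeholders, just to illustrate the representation):
Code: [Select]
#include <cstdint>

const int N = 4;  // number of 32-bit limbs, e.g. 128 bits of numerator

// c = a * b, keeping only the low N limbs (a fixed-denominator scheme
// would instead shift the full 2N-limb product by the denominator width).
// Limbs are least-significant first; each 32x32-bit partial product fits
// in a uint64_t, and the carry chain cannot overflow 64 bits.
void mul(const uint32_t a[N], const uint32_t b[N], uint32_t c[N])
{
    uint32_t r[N] = {0};
    for (int i = 0; i < N; ++i) {
        uint64_t carry = 0;
        for (int j = 0; i + j < N; ++j) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
    }
    for (int k = 0; k < N; ++k) c[k] = r[k];
}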


Offline ker2x

  • *
  • Fractal Fanatic
  • ***
  • Posts: 21
« Reply #5 on: September 13, 2019, 07:58:15 AM »
I'll try libqd and Boost, which sound very promising (if I can figure out how to install them easily and safely).

Bwahahaha, good luck with that! Boost is a tremendous pain to manage on Windows :))
(but easy with Linux)

Offline marcm200

  • *
  • Fractal Frankfurter
  • *
  • Posts: 564
« Reply #6 on: September 13, 2019, 08:44:31 AM »
@ker2x: Agreed. Finding a tutorial was quite a task in itself, so for now I'll defer installing Boost or libqd (since I'd then need to look for the dependencies as well). I've come to like GCC's float128 more and more due to its convenient usage.

I'll go with claude's suggestion of a tailor-made data type.


Offline marcm200

  • *
  • Fractal Frankfurter
  • *
  • Posts: 564
« Reply #7 on: December 17, 2019, 07:04:21 PM »
I'd be interested in your timings with these types, as I'm currently trying to figure out how to implement triple-double (which would be ~159 bits) for speedup over quad-double.
I just came across this site: https://mrob.com/pub/math/f161.html#source with source code for extended precision (double-double, triple-double and quad-double, with 107-, 161- and 215-bit mantissa precision).

I just started using the f107 version - the first images are computed much faster than with GNU quadmath's __float128.

MPFR is about 2 times slower than float128, and my own custom-made fixed-precision numerical type (32 bits to the left of the radix point, 96 bits to the right) is about 25% faster (un-optimized, still experimental).


Offline claude

  • *
  • 3f
  • ******
  • Posts: 1344
    • mathr.co.uk
« Reply #8 on: December 17, 2019, 07:13:59 PM »
nice find, thanks for the link.  EDIT: shame there is no license information in the 3x and 4x header files (though the 2x precision version is GPLv2+)
« Last Edit: December 17, 2019, 07:39:17 PM by claude, Reason: license »

Offline marcm200

  • *
  • Fractal Frankfurter
  • *
  • Posts: 564
« Reply #9 on: December 18, 2019, 12:14:59 PM »
The f107 is about 2.5x as fast as float128 for my computations, so that's a win (although I have an issue with the printing functions: with my compiler settings they "forget" digits, but the values are there, e.g. 2^1 + 2^-60).

For f107 or float128: they have to test whether, and in which direction, rounding has to occur. Is this test computationally expensive?

Since I am working with dyadic fractions and use a datatype with enough precision that out-of-range values cannot occur during the calculation: would it be worth the time to delve into the code and the underlying theory to remove that test? Or is the speed gain only minor, if measurable at all?

Offline claude

  • *
  • 3f
  • ******
  • Posts: 1344
    • mathr.co.uk
« Reply #10 on: December 18, 2019, 12:36:35 PM »
I ported Edward Kmett's "compensated" from Haskell to C for f107 and f215, and also Robert Munafo's f107, f161 and f215 from C++ to C (targeting C because of OpenCL later). I also added libqd ports from C++ to C, and I used my earlier fixed-point work too. Benchmarks on an AMD Ryzen 7 2700X eight-core processor (16 threads; CPU time normalized to the native double implementation):

Code: [Select]
bits    time  implementation
 53     1.00  native double
 53*    1.56  native double with perturbation (without reference or glitch detection)
 53*    1.70  native double with perturbation with "Pauldelbrot" glitch detection (without reference)
107     7.87  mrob approximate double-double with fma
107     8.06  qd approximate double-double with fma
 32    14.15  fixed 1.1x32 little endian
107    14.71  qd accurate double-double with fma
 32    18.96  fixed 1.1x32 big endian
107    25.50  mrob accurate double-double with fma
161    27.29  mrob approximate triple-double with fma
107    31.03  ekmett accurate double-double
 64    34.59  fixed 1.2x32 big endian
 64    36.43  fixed 1.2x32 little endian
 96    48.55  fixed 1.3x32 big endian
161    54.47  mrob accurate triple-double with fma
215    56.69  mrob approximate quad-double with fma
161    58.16  mrob accurate triple-double without fma
 96    60.06  fixed 1.3x32 little endian
215    64.67  mrob approximate quad-double without fma with accurate lower precisions
128    80.07  fixed 1.4x32 little endian
128    81.47  fixed 1.4x32 big endian
215    81.56  qd approximate quad-double with fma
160    95.18  fixed 1.5x32 big endian
160   108.22  fixed 1.5x32 little endian
192   126.16  fixed 1.6x32 big endian
215   137.22  qd accurate quad-double with fma
192   138.61  fixed 1.6x32 little endian
224   165.93  fixed 1.7x32 big endian
224   166.25  fixed 1.7x32 little endian
215  1443.62  ekmett accurate quad-double

These timings are after I made everything "static inline", for up to a 36% speed boost depending on the algorithm.

The test is a zoomed-out Mandelbrot set. I've not tested for accuracy yet, just rudimentary correctness ("it looks ok").

The data points marked * are for perturbed iterations, which require reference orbit data. I'm using 0+0i as the reference orbit, which means no glitches will occur according to Pauldelbrot's criterion.
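
For readers unfamiliar with the perturbation rows: the idea is to iterate only the small difference dz from a high-precision reference orbit Z, so the hot loop runs in plain double. A rough C++ sketch of one such step (my notation, not the benchmark code):
Code: [Select]
#include <complex>
typedef std::complex<double> cplx;

// One perturbed Mandelbrot step for z = Z + dz at parameter c = C + dc:
//   z' = z^2 + c  becomes  dz' = 2*Z*dz + dz*dz + dc,
// where Z comes from a precomputed high-precision reference orbit.
// With the reference point C = 0+0i, Z_n stays 0 for all n and the
// recurrence degenerates to the ordinary iteration dz' = dz*dz + dc.
cplx perturb_step(cplx Z, cplx dz, cplx dc)
{
    return 2.0 * Z * dz + dz * dz + dc;
}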
« Last Edit: December 29, 2019, 12:43:41 AM by claude, Reason: added perturbed iterations »

Offline marcm200

  • *
  • Fractal Frankfurter
  • *
  • Posts: 564
« Reply #11 on: December 18, 2019, 07:04:23 PM »
Some timings for my test image (a quartic Julia set with slow dynamics w.r.t. the cell-mapping/IA algorithm by Figueiredo).

Looking at claude's list, I think I should give the qd library a longer try, since its double-double is even faster than the f107.

I still have to try more images and other iteration formulas, but for now I'm quite happy with Robert Munafo's implementation.

Abbreviations below: fpa = my 32.96 fixed-point implementation; "optimized" means the calculation path for the bounding box was optimized (passing of references, temporary variables, operator avoidance and the like; see the sketch after the timings).

Code: [Select]
  5 sec   optimized mrob accurate double double without fma
 15 sec   un-optimized mrob accurate double double without fma
 19 sec   optimized fpa
 41 sec   un-optimized quadmath __float128
 73 sec   un-optimized fpa
(I couldn't test the triple- or quad-double datatypes, as I need a floor function for the computation.)
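
To illustrate what "operator avoidance" buys: overloaded operators on a heavyweight type return a fresh temporary per operation, while in-place helpers write straight into caller storage. A hypothetical illustration (the hp type and its componentwise payload are placeholders, not the real fpa code; a real type would renormalize):
Code: [Select]
struct hp { double hi, lo; };  // stand-in for an extended-precision type

// operator form: every call materializes and copies a temporary
hp operator+(const hp &a, const hp &b) {
    return { a.hi + b.hi, a.lo + b.lo };
}

// in-place form: the result goes straight into caller-provided storage,
// so the hot loop allocates and copies nothing
void add(hp &out, const hp &a, const hp &b) {
    out.hi = a.hi + b.hi;
    out.lo = a.lo + b.lo;
}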

Offline Adam Majewski

  • *
  • Fractal Furball
  • ***
  • Posts: 224
« Reply #12 on: December 19, 2019, 03:59:05 PM »
I think I should give the qd library a longer try since it is even faster for double double than the f107.

https://pl.wikibooks.org/wiki/Programowanie_w_systemie_UNIX/qd

Offline marcm200

  • *
  • Fractal Frankfurter
  • *
  • Posts: 564
« Reply #13 on: December 19, 2019, 04:22:15 PM »
Thanks, but I'm working on a Windows 10 computer. f107 is "installed" by just including a cpp file in my source code, which is very convenient and error-free (and it can be un-installed with no trouble). I'll check whether qd can be used that way too.

Offline claude

  • *
  • 3f
  • ******
  • Posts: 1344
    • mathr.co.uk
« Reply #14 on: December 28, 2019, 09:57:15 PM »
https://gmplib.org/list-archives/gmp-devel/2019-October/005600.html
Quote
   N. Fabiano and J. Muller and J. Picot
   Algorithms for Triple-Word Arithmetic
   IEEE Transactions on Computers 68(11) 1573--1583 November 2019
   https://doi.org/10.1109/TC.2019.2918451


clip
Effect of floating point precision on Mandelbrot calculations

Started by timhume on Fractal Mathematics And New Theories

5 Replies
335 Views
Last post February 27, 2018, 01:35:08 AM
by claude
xx
bounded floating-point format

Started by quaz0r on Off Topic

0 Replies
190 Views
Last post January 18, 2018, 07:10:30 PM
by quaz0r
xx
The floating Gardens of Borduria

Started by Frakkie on Fractal Image Gallery

0 Replies
94 Views
Last post November 19, 2018, 06:41:01 PM
by Frakkie
clip
KF performance benchmarks

Started by claude on Kalles Fraktaler

15 Replies
702 Views
Last post June 04, 2018, 05:36:42 PM
by claude
clip
Organic pseudoFractals UPDATED: fixed frag w/presetw in FP

Started by M Benesi on Fractal Image Gallery

8 Replies
527 Views
Last post February 01, 2018, 02:52:08 AM
by 3DickUlus