Basic Mandelbrot render optimization


Offline Dinkydau

  • Fractal Friar
  • Posts: 149
    • DeviantART gallery
« on: May 25, 2018, 03:46:37 AM »
For a while I've had my own program that can render Mandelbrot-like fractals, so that I can easily experiment with various things such as inflections and strange formulas. I posted the latest working version and the C++ code here:
https://fractalforums.org/other/55/explore-fractals-inflection-tool/777
It's slow and I don't understand what needs to be done to make it faster.

In the posted version I used the double _Complex datatype from C++, giving a little less than 64 bits of precision and no way to get more. This was already slow, much, much slower than Fractal eXtreme and Mandel Machine for example, but acceptable for experiments.

A few days ago I figured out how to replace the doubles by the mpf_class datatype from the GMP library. People on the internet say it's a highly efficient library for arbitrary precision, which I wanted. Some bugs aside, it works... I mean, the results are correct, but an unzoomed regular M-set at 1200×1200 resolution with a limit of 5000 iterations now takes 18 seconds (yes, 18 seconds) to render. In theory I can now zoom infinitely deep, but the render time makes doing anything almost impossible.

I'm already doing Mariani-Silver subdivision with multiple threads, and using C++ and GMP as recommended on the internet, so other than also writing assembly code I have no idea what I can do to make this faster, but it's so slow that I think I'm missing something.

This should be the most time-consuming loop:
while (zrsqr + zisqr <= 4.0 && iterationCount < maxIters) {
    zi = zr * zi;
    zi += zi;                   // zi = 2*zr*zi
    zi += ci;                   // zi = 2*zr*zi + ci
    zr = zrsqr - zisqr + cr;
    zrsqr = zr * zr;            // cache the squares for the next
    zisqr = zi * zi;            // iteration and the escape test
    iterationCount++;
}
done exactly as Bruce Dawson describes on his blog: https://randomascii.wordpress.com/2011/08/13/faster-fractals-through-algebra/
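For context, this is roughly the full per-pixel routine around that loop (a sketch; the function name and the way cr/ci/maxIters arrive are my assumptions about the surrounding code, not the posted program):

```cpp
// Sketch of the per-pixel iteration in plain doubles, Dawson-style:
// only 3 multiplications per iteration, squares cached for the escape test.
// cr/ci are the pixel's coordinates in the complex plane (assumed inputs).
int iterate(double cr, double ci, int maxIters) {
    double zr = 0.0, zi = 0.0;
    double zrsqr = 0.0, zisqr = 0.0;
    int iterationCount = 0;
    while (zrsqr + zisqr <= 4.0 && iterationCount < maxIters) {
        zi = zr * zi;
        zi += zi;                 // zi = 2*zr*zi
        zi += ci;
        zr = zrsqr - zisqr + cr;
        zrsqr = zr * zr;
        zisqr = zi * zi;
        iterationCount++;
    }
    return iterationCount;
}
```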
« Last Edit: May 25, 2018, 04:01:07 AM by Dinkydau »

Offline claude

  • Fractal Frankfurter
  • Posts: 587
    • mathr.co.uk
« Reply #1 on: May 25, 2018, 12:57:28 PM »
Perhaps restructure your code to use SIMD.  https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html is one way; read the docs for your compiler.  I had some test code (inaccessible at the moment, sorry) where I grouped 4x4 pixels into vectors; using -march=native I got a 2x speed boost when recompiling the Athlon II X4 640 (SSE) version on a Ryzen 2700X (AVX).  To get the most benefit from the SIMD I used a trick: doing 8 iterations between escape checks (>4) of "did any of these 16 pixels escape"; if any escaped, I rolled back the last 8 iterations and iterated the pixels individually.

In my tests I also found that separate real/imag doubles were faster than double _Complex; possibly _Complex handles some edge cases regarding NaN or Inf specially.

Also, GMP's MPF is buggy; I recommend using MPFR.  Switching to MPFR fixed some blank-screen issues in KF.

EDIT: another thing, using MPFR via the C API may be significantly faster than via the C++ wrappers, as you have more control over memory allocation (don't repeatedly allocate/free memory for temporaries inside inner loops)
« Last Edit: May 25, 2018, 01:20:52 PM by claude, Reason: wrapper »

Offline Dinkydau

  • Fractal Friar
  • Posts: 149
    • DeviantART gallery
« Reply #2 on: May 25, 2018, 07:34:58 PM »
Thanks. I will try your ideas. Using the C API (still GMP, not MPFR yet) reduces the time to 9 seconds so far. For MPFR it depends on whether I can get it to work; GMP was already a nightmare.

Offline superheal

  • Fractal Friend
  • Posts: 17
« Reply #3 on: May 27, 2018, 12:20:28 PM »
I have tested both Mariani-Silver subdivision and a boundary tracing algorithm, and I get better performance with boundary tracing. I don't think this will make or break your code, though. Just for reference, the Mariani-Silver algorithm had to calculate 10.56% of the image, while boundary tracing calculated 4.5%. Both algorithms ran in 4 threads.

Offline Dinkydau

  • Fractal Friar
  • Posts: 149
    • DeviantART gallery
« Reply #4 on: June 02, 2018, 12:55:25 AM »
For some reason MPFR increases the render time by about a factor of 4. It's the same code, except that I replaced the "mpf" part of the function and datatype names with "mpfr" and added the MPFR_RNDN rounding argument to the functions as required. Maybe I did something wrong. At first it wouldn't compile, but when I included mpfr.h before gmp.h and gmpxx.h it compiled, with this bad result.

I will consider boundary tracing if I decide to change the Mariani-Silver functions. I haven't tried SIMD yet. I wonder whether it has any effect on the GMP function calls, though: those calls are not regular sequential additions and multiplications that can be combined into a single instruction.

Offline quaz0r

  • Fractal Phenom
  • Posts: 53
« Reply #5 on: June 02, 2018, 02:14:52 AM »
Quote from: Dinkydau
In the posted version I used the double _Complex datatype from C++

_Complex is a keyword from C (C99), not C++.  as is often the case with new C++ programmers, you are doing C, not C++   8)  cppreference.com is your best friend for learning modern and proper C++, for instance  http://en.cppreference.com/w/cpp/numeric/complex

Quote from: Dinkydau
A few days ago I figured out how to replace the doubles by the mpf_class datatype from the GMP library. People on the internet say it's a highly efficient library for arbitrary precision, which I wanted.  In theory I can now zoom infinitely deep, but the render time makes doing anything almost impossible.

yeah, there's literally nothing you can do about that.  people on the internet are right, it is a highly efficient library, but arbitrary-precision full iterations are always going to be unusably slow.  that's why series approximation and perturbation are such a breakthrough.

Quote from: Dinkydau
I haven't tried SIMD yet. I wonder whether it has any effect on the GMP function calls, though: those calls are not regular sequential additions and multiplications that can be combined into a single instruction.

just for kicks, take a peek at this sometime  http://www.holoborodko.com/pavel/mpfr/  and see how much easier the magic of C++ can make your life  :D

Quote from: claude
Perhaps restructure your code to use SIMD.  https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html is one way; read the docs for your compiler.  I had some test code (inaccessible at the moment, sorry) where I grouped 4x4 pixels into vectors; using -march=native I got a 2x speed boost when recompiling the Athlon II X4 640 (SSE) version on a Ryzen 2700X (AVX).

hoping to tease a bit of auto-vectorization out of your compiler is never going to be as rewarding as you want it to be, though.  if you are using C++ you may as well take a look at a SIMD library like Vc  https://github.com/VcDevel/Vc  which gives you a zero-overhead abstraction for writing explicitly vectorized code.  for instance, instead of double you would then use double_v, a vector of doubles, and it all compiles to whatever architecture you compile for.  it's really quite slick..

Quote from: claude
To get the most benefit from the SIMD I used a trick: doing 8 iterations between escape checks (>4) of "did any of these 16 pixels escape"; if any escaped, I rolled back the last 8 iterations and iterated the pixels individually.

i do the exact same thing.. i've never seen any discussion of this anywhere, but the speedup is actually quite remarkable.  one thing i wasn't sure of was exactly how many consecutive iterations you can get away with.  i found that much more than 4 was too much, so i left it at 4 and haven't revisited it yet.  i guess you just need to determine how many times you can iterate past the >4 escape check before you overflow whichever type you are currently using..

Quote from: claude
In my tests I also found that separate real/imag doubles were faster than double _Complex; possibly _Complex handles some edge cases regarding NaN or Inf specially.

the implementation of the complex type in the standard is no good for this sort of thing unless you can successfully tell your compiler to turn off all the edge-case checks and misc garbage it does, which makes it super slow.  i think -ffast-math is the gcc option that turns all the garbage off and makes performance acceptable, and last i looked at this, clang refused to turn all the garbage off no matter what compiler flags you use.  this guy at cppcon gave an interesting presentation on the matter a few years ago  https://youtu.be/he-XVt1xIE0  basically either figure out how to sweet-talk your compiler into not tanking your performance, or use your own simple little complex type to avoid the issue altogether.
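a minimal hand-rolled complex type along those lines might look like this (a sketch; the operator set is trimmed to what the inner loop needs, and the iterate function is just an illustration):

```cpp
// Tiny complex type with no NaN/Inf edge-case handling, as suggested above.
struct cplx {
    double re, im;
};

static inline cplx operator+(cplx a, cplx b) {
    return {a.re + b.re, a.im + b.im};
}
static inline cplx operator*(cplx a, cplx b) {
    // plain (a+bi)(c+di), no special-case fixups for infinities
    return {a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}
static inline double norm2(cplx a) {        // |z|^2 without the sqrt
    return a.re * a.re + a.im * a.im;
}

// Example escape-time loop using the type (parameters are illustrative)
int iterate(cplx c, int maxIters) {
    cplx z{0.0, 0.0};
    int n = 0;
    while (norm2(z) <= 4.0 && n < maxIters) {
        z = z * z + c;
        ++n;
    }
    return n;
}
```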


Offline hobold

  • Fractal Fanatic
  • Posts: 36
« Reply #6 on: June 02, 2018, 08:54:51 PM »
Quote from: Dinkydau
For some reason MPFR increases the render time by about a factor of 4. It's the same code, except that I replaced the "mpf" part of the function and datatype names with "mpfr" and added the MPFR_RNDN rounding argument to the functions as required. Maybe I did something wrong.
You probably did not do anything wrong. MPFR is absolutely perfectionist about rounding that conforms to the IEEE standard for floating point computations. MPF (without the R) is not. MPF isn't generally imprecise or otherwise bad, it merely doesn't put any effort into controlling the inevitable rounding errors. That means, for example, that MPF delivers slightly different actual precision on different platforms, while MPFR is meant to deliver bit-exact identical results everywhere.

There are corner cases where portable exactness is very difficult to reach. The authors of MPFR might have gone the easy route and always used a "widening multiply", followed by an explicit rounding step which basically discards the lower half of all computed bits.

I don't know what actually happens in MPFR code, but losing a factor of four speed still sounds plausible.

Offline claude

  • Fractal Frankfurter
  • Posts: 587
    • mathr.co.uk
« Reply #7 on: June 02, 2018, 09:12:13 PM »
In my experience MPFR's mpfr_t is usually only a few percent slower than GMP's mpf_t for the basic +-* operations: I think you must be doing something wrong to get a 4x slowdown (perhaps MPFR was compiled without optimizations, or in debug mode, or something).

Offline claude

  • Fractal Frankfurter
  • Posts: 587
    • mathr.co.uk
« Reply #8 on: June 02, 2018, 09:17:38 PM »
Quote from: quaz0r
for instance, instead of double you would then use double_v, a vector of doubles, and it all compiles to whatever architecture you compile for.  it's really quite slick..

That's similar to the approach described on the page I linked: you define a vector type (with awkward __attribute__ syntax), and can then do +-*/ on it, among other operations.  If your target doesn't support wide enough vectors, they are synthesized from smaller ones.

Offline claude

  • Fractal Frankfurter
  • Posts: 587
    • mathr.co.uk
« Reply #9 on: June 06, 2018, 03:52:39 PM »
Here's my current optimized code (SIMD with 16 doubles per vector):
Code:
// gcc -std=c11 -Wall -Wextra -pedantic -O3 -march=native -fopenmp -o simd16 simd16.c -lm
// time ./simd16 > out.pgm

#include <complex.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef int32_t v16i32 __attribute__ ((vector_size (64)));
typedef int64_t v16i64 __attribute__ ((vector_size (128)));
typedef double  v16f64 __attribute__ ((vector_size (128)));

static inline int64_t and(v16i64 a)
{
  return
    a[0] & a[1] & a[2] & a[3] &
    a[4] & a[5] & a[6] & a[7] &
    a[8] & a[9] & a[10] & a[11] &
    a[12] & a[13] & a[14] & a[15];
}

static v16f64 m_compute_pixel
  ( const int32_t N
  , const double R
  , const v16f64 cx
  , const v16f64 cy
  )
{
  const double R2 = R * R;
  v16f64 x = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
  v16f64 y = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
  int32_t n = 0;
  v16f64 x_old = x;
  v16f64 y_old = y;
  int n_old = n;
  v16f64 x2 = x * x;
  v16f64 y2 = y * y;
  while (n < N && and(x2 + y2 <= 4.0))
  {
    x_old = x;
    y_old = y;
    n_old = n;
    for (int i = 0; i < 8; ++i)
    {
      y = 2.0 * x * y + cy;
      x = x2 - y2 + cx;
      x2 = x * x;
      y2 = y * y;
    }
    n = n + 8;
  }
  x = x_old;
  y = y_old;
  n = n_old;
  x2 = x * x;
  y2 = y * y;
  v16f64 s;
  for (int k = 0; k < 16; ++k)
  {
    int m = n;
    while (m < N && x2[k] + y2[k] <= R2)
    {
      y[k] = 2.0 * x[k] * y[k] + cy[k];
      x[k] = x2[k] - y2[k] + cx[k];
      x2[k] = x[k] * x[k];
      y2[k] = y[k] * y[k];
      m = m + 1;
    }
    s[k] = m < N ? m + 1 - log2(0.5 * log(x2[k] + y2[k])) : -N;
  }
  return s;
}

int main()
{
  const int S = 1 << 10;
  const int N = 1 << 20;
  const double R = 1 << 18;
  unsigned char *image = malloc(S * S);
  #pragma omp parallel for schedule(static, 1)
  for (int j = 0; j < S >> 2; ++j)
  {
    for (int i = 0; i < S >> 2; ++i)
    {
      v16f64 cx, cy;
      int k = 0;
      for (int dj = 0; dj < 4; ++dj)
        for (int di = 0; di < 4; ++di)
        {
          cx[k] = ((i << 2) + di + 0.5) / S * 4 - 2;
          cy[k] = ((j << 2) + dj + 0.5) / S * 4 - 2;
          k = k + 1;
        }
      const v16f64 mu = m_compute_pixel(N, R, cx, cy);
      k = 0;
      for (int dj = 0; dj < 4; ++dj)
        for (int di = 0; di < 4; ++di)
        {
          image[((j << 2) + dj) * S + (i << 2) + di] = 16 * mu[k];
          k = k + 1;
        }
    }
  }
  printf("P5\n%d %d\n255\n", S, S);
  fwrite(image, S * S, 1, stdout);
  free(image);
  return 0;
}

Offline Dinkydau

  • Fractal Friar
  • Posts: 149
    • DeviantART gallery
« Reply #10 on: June 15, 2018, 02:10:18 AM »
Thanks quaz0r, hobold and claude, and especially for the example code. I still haven't gotten around to trying to write code with vectors. As it requires practice first it seems like a big hurdle, but I want to do it.

Quote from: claude
To get the most benefit from the SIMD I used a trick: doing 8 iterations between escape checks (>4) of "did any of these 16 pixels escape"; if any escaped, I rolled back the last 8 iterations and iterated the pixels individually.
Why does it help to do this specifically to get benefit from SIMD? It sounds like it only saves checking for escape, which can be done regardless of SIMD.

Quote from: quaz0r
_Complex is a keyword from C (C99), not C++.  as is often the case with new C++ programmers, you are doing C, not C++   8)  cppreference.com is your best friend for learning modern and proper C++, for instance  http://en.cppreference.com/w/cpp/numeric/complex

[...]

the implementation of the complex type in the standard is no good for this sort of thing unless you can successfully tell your compiler to turn off all the edge-case checks and misc garbage it does, which makes it super slow.  i think -ffast-math is the gcc option that turns all the garbage off and makes performance acceptable, and last i looked at this, clang refused to turn all the garbage off no matter what compiler flags you use.  this guy at cppcon gave an interesting presentation on the matter a few years ago  https://youtu.be/he-XVt1xIE0  basically either figure out how to sweet-talk your compiler into not tanking your performance, or use your own simple little complex type to avoid the issue altogether.
As an experiment I changed the program to use the C++ complex type instead. It's equally fast. -ffast-math doesn't make it faster, but it does introduce problems with high powers: the edges of iteration bands are black, which is incorrect. The image shows z -> z^256 + c. It gets worse as the power increases. I imagine this also happens with low powers, but the effect is too small to notice.

Offline claude

  • Fractal Frankfurter
  • Posts: 587
    • mathr.co.uk
« Reply #11 on: June 15, 2018, 03:29:31 PM »
Quote from: Dinkydau
Thanks quaz0r, hobold and claude, and especially for the example code. I still haven't gotten around to trying to write code with vectors. As it requires practice first it seems like a big hurdle, but I want to do it.
Why does it help to do this specifically to get benefit from SIMD? It sounds like it only saves checking for escape, which can be done regardless of SIMD.

Sure, but the benefit with SIMD is much bigger.

Quote from: Dinkydau
As an experiment I changed the program to use the C++ complex type instead. It's equally fast. -ffast-math doesn't make it faster, but it does introduce problems with high powers: the edges of iteration bands are black, which is incorrect. The image shows z -> z^256 + c. It gets worse as the power increases. I imagine this also happens with low powers, but the effect is too small to notice.
I'm guessing the black bands are from overflow to infinity/NaN.  If you increase the power, you have to decrease the loop unrolling factor.

Offline Dinkydau

  • Fractal Friar
  • Posts: 149
    • DeviantART gallery
« Reply #12 on: Yesterday at 02:13:53 AM »
I have tried using vectors in a different way, using AVX intrinsics. Including the file immintrin.h allows the use of many function-like intrinsics that give almost direct access to vector instructions. I found this nice introduction:
https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX

This was very interesting to try, and it improves speed, but it doesn't help with achieving higher precision. By now I think the whole problem with GMP is indeed that it uses function calls so much, and no SIMD tricks can be used to improve that. It's a useful optimization, but for a different purpose than making an existing arbitrary-precision library faster. I guess the only way to achieve high performance and high precision together is assembly language, because that's the only way to avoid function calls. It makes more sense to me now that the most efficient existing programs are based on that. This is one of the situations where it really makes a big difference.

Offline claude

  • Fractal Frankfurter
  • Posts: 587
    • mathr.co.uk
« Reply #13 on: Yesterday at 02:28:24 AM »
The implementation of GMP uses many assembly language functions optimized for different processors.  The choice is made at compile time, so if you can compile a GMP optimized for your architecture it may run faster.  MPFR is built on top of GMP, so optimizing GMP helps MPFR run faster.

Though I tried on my AMD Ryzen 7 2700X CPU, and a -march=native build was no faster than a baseline x86_64 build with no -march flag (I only benchmarked at a very high precision; maybe the asm makes more of a difference at lower precisions).  I tried both gcc-7 and gcc-8; no difference.

The function call overhead is tiny compared to the cost of the operations, for all but the smallest precisions.

But: one potential issue is with C++ (or other language) wrappers: if they do memory allocation for temporary high-precision variables inside your inner loops, then performance will be miserable.  In my 'et' project I have a mini-compiler that "floats out" temporary variables from inner loops, so they are only allocated/freed once instead of once per iteration.  But 'et' is a bit dormant these days, because I discovered that GHC Haskell's garbage collection scales very poorly with large core counts; I need to find some GC option tuning guide, or go back to having the GUI program written in C or C++ and only the 'et' compiler in Haskell.

Offline Dinkydau

  • Fractal Friar
  • Posts: 149
    • DeviantART gallery
« Reply #14 on: Yesterday at 03:15:16 AM »
I believe it's great for very high precisions but...
Quote from: claude
The function call overhead is tiny compared to the cost of the operations, for all but the smallest precisions.
That's precisely the problem. I started doing this because 64 bits was just not enough; 128 would already help a lot for the purpose of experiments. What I wanted to do was explore extremely high-power Mandelbrot sets. Because the rotational symmetry of those sets is high, the bulbs are small, and a lot of zooming is already required to reach the first visible details, which would be the starting point of actual exploration.

Maybe there's another library that uses inline functions (as I think they are called), like those AVX intrinsics, to avoid the problem. (A Google search doesn't quickly give good results.) I have come across some messages about a 128-bit integer type specific to gcc:
https://stackoverflow.com/questions/16088282/is-there-a-128-bit-integer-in-gcc
I'm not sure if this, or something like it, can be used. It surprises me how hard it is to get just a little more precision than normal without losing a ton of performance. I would expect this to be something quite commonly wished for.
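One classic technique that fits this description is double-double arithmetic: pairs of doubles giving roughly 106 bits of significand, using only inline floating-point operations. A sketch of the core error-free building blocks (standard 2Sum/2Prod algorithms; two_prod here assumes a hardware FMA via std::fma):

```cpp
#include <cmath>

// Double-double sketch: each number is an unevaluated sum hi + lo,
// where lo holds the rounding error that a plain double would discard.
struct dd { double hi, lo; };

static dd two_sum(double a, double b) {
    double s = a + b;
    double bb = s - a;
    double err = (a - (s - bb)) + (b - bb);  // exact rounding error of a+b
    return {s, err};
}

static dd two_prod(double a, double b) {
    double p = a * b;
    double err = std::fma(a, b, -p);         // exact rounding error of a*b
    return {p, err};
}

// ~106-bit add and multiply (simplified; no Inf/NaN special cases)
static dd dd_add(dd a, dd b) {
    dd s = two_sum(a.hi, b.hi);
    double lo = s.lo + a.lo + b.lo;
    return two_sum(s.hi, lo);                // renormalize
}

static dd dd_mul(dd a, dd b) {
    dd p = two_prod(a.hi, b.hi);
    double lo = p.lo + a.hi * b.lo + a.lo * b.hi;
    return two_sum(p.hi, lo);
}
```

For example, 1 + 1e-30 is indistinguishable from 1 in plain double, but dd_add keeps the small part in .lo. Each operation is a handful of inlineable FPU instructions, no library calls.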

Quote from: claude
But: one potential issue is with C++ (or other language) wrappers: if they do memory allocation for temporary high-precision variables inside your inner loops, then performance will be miserable.  In my 'et' project I have a mini-compiler that "floats out" temporary variables from inner loops, so they are only allocated/freed once instead of once per iteration.
I'm not sure what GMP does internally, but I don't create new GMP-type numbers inside an inner loop; they're re-used. So I hope that's enough to avoid the problem you're describing.

By the way, I use the -march=native flag as well.

