KF performance benchmarks

  • 15 Replies
  • 380 Views

0 Members and 1 Guest are viewing this topic.

Offline claude

  • *
  • Fractal Freak
  • **
  • Posts: 760
    • mathr.co.uk
« on: March 13, 2018, 06:41:36 PM »
There are some older benchmarks and some analysis here:
https://mathr.co.uk/blog/2017-10-23_deep_zoom_rendering_cost.html

This week I redid some of the benchmarks, results attached.  This time I benchmarked with two settings, here are the differences between them:

"best":
Code: [Select]
GlitchLowTolerance: 1
ApproxLowTolerance: 1
Guessing: 0
IsolatedGlitchNeighbourhood: 0

"fast":
Code: [Select]
GlitchLowTolerance: 0
ApproxLowTolerance: 0
Guessing: 1
IsolatedGlitchNeighbourhood: 4

Both settings had "MaxReferences: 10000" to avoid premature finish.

On the graph you can see that the "fast" settings are consistently improving (downward slope to the end), while the "best" settings have performance issues (upward slope, indicating that tiled rendering may be faster than rendering large in one go).  This seems to be because large images have many single-pixel glitches, which the "fast" settings smudge with a neighbouring pixel, while the "best" settings end up with a reference in each one (expensive!).

The graph doesn't show correctness, though: the "fast" settings give bad output for Olbaid-ST's 1e1086 location.

Offline Dinkydau

  • *
  • Fractal Feline
  • **
  • Posts: 176
    • DeviantART gallery
« Reply #1 on: March 13, 2018, 07:39:14 PM »
This comfirms the factor 100 difference in speed that the program feels like as well. For the Mandelbrot set, using the best settings to render at depths greater than 2^10000,  where the more efficient Mandel machine can't go, is usually not actually an option. It would take too long.

Offline gerrit

  • *
  • 3f
  • ******
  • Posts: 1419
« Reply #2 on: March 13, 2018, 07:46:31 PM »
Interesting. What does "guessing" do again? (Does not appear in "user manual".)

Tiled rendering is easier said than done, I guess you need to write some program (with arbitrary precision support) to split a kfr file into tiles, render, then merge the kfb's, then do the "Examine zoom sequence" trick to get an image.

May be worth it at more realistic sizes like 20000^2 as some of the upward sloping curves seem to be ramping up (>0 2nd derivative).

Offline claude

  • *
  • Fractal Freak
  • **
  • Posts: 760
    • mathr.co.uk
« Reply #3 on: March 13, 2018, 08:32:17 PM »
What does "guessing" do

Say you have 3 pixels in a row (in any direction), and the outer two have been calculated, and found to have the same integer interation count: then "guessing" interpolates the smoothed values linearly to fill the middle pixel without calculating it directly.  Guessing can be a big speedup with large interior regions, but for flat outer regions it can lead to wonky appearance with some colouring schemes.

Offline claude

  • *
  • Fractal Freak
  • **
  • Posts: 760
    • mathr.co.uk
« Reply #4 on: March 13, 2018, 08:35:48 PM »
Tiled rendering is easier said than done

You can use Ctrl and cursor keys in kf to navigate a tile at a time, but doing this manually would be a major hassle for all but trivial tile sizes.

Offline claude

  • *
  • Fractal Freak
  • **
  • Posts: 760
    • mathr.co.uk
« Reply #5 on: May 22, 2018, 01:27:36 AM »
Benchmarked new AMD Ryzen 2700x (8 cores, 16 threads) vs old AMD Athlon II X4 640 (4 cores/threads), running kf-2.12.11 on both.

On the new CPU the 'fast' settings benchmark run completes in
Code: [Select]
real 25m7.410s
user 154m17.011s
sys  0m40.126s

On the old CPU the same run completed in ~2 hours real time, so the new CPU is about 4.8x faster in this test.

Offline claude

  • *
  • Fractal Freak
  • **
  • Posts: 760
    • mathr.co.uk
« Reply #6 on: May 22, 2018, 04:10:22 AM »
kf-2.13.3 uses 2.25x as much CPU time on Ryzen 2700x as kf-2.12.11 but only takes a little longer for the 'fast' benchmark - presumably because the less-well-threaded reference computations take up more of the wall-clock time with many idle cores.
Code: [Select]
real 28m42.393s
user 346m18.383s
sys 0m21.380s

Offline saka

  • *
  • Fractal Freshman
  • *
  • Posts: 9
« Reply #7 on: May 23, 2018, 07:19:34 AM »
Quote
kf-2.13.3 uses 2.25x as much CPU time on Ryzen 2700x as kf-2.12.11 but only takes a little longer for the 'fast' benchmark - presumably because the less-well-threaded reference computations take up more of the wall-clock time with many idle cores.

I'm running a 16-core AMD, and the reference computation has the CPU at less than 10% usually. 

I'm not an expert on the process, and haven't looked at the source code.  But, is there some way that this CPU time could be used? 

Offline claude

  • *
  • Fractal Freak
  • **
  • Posts: 760
    • mathr.co.uk
« Reply #8 on: May 23, 2018, 05:11:03 PM »
Unfortunately the reference computation code is hard to parallelize very much.  For low zoom levels the overhead (synchronization cost) of parallelization outweighs the benefits, even.  For the deepest zooms it uses 3 threads, each doing 1 real squaring and 1 or 2 real additions.  The trouble is that iterating z <- z^2 + c is inherently a serial task, so the threads have to synchronize after each iteration.  (Note that only this quadratic Mandelbrot formula is multi-threaded, the others could be I suppose but that means writing and maintaining more code...)

https://code.mathr.co.uk/kalles-fraktaler-2/blob/HEAD:/fraktal_sft/floatexp_reference.cpp#l73 is the 3-thread code, with 2 barrier waits (synchronization points) in each iteration

Offline hobold

  • *
  • Fractal Phenom
  • ****
  • Posts: 46
« Reply #9 on: May 23, 2018, 09:53:59 PM »
Unfortunately the reference computation code is hard to parallelize very much.  For low zoom levels the overhead (synchronization cost) of parallelization outweighs the benefits, even.  For the deepest zooms it uses 3 threads, each doing 1 real squaring and 1 or 2 real additions.  The trouble is that iterating z <- z^2 + c is inherently a serial task, so the threads have to synchronize after each iteration.  (Note that only this quadratic Mandelbrot formula is multi-threaded, the others could be I suppose but that means writing and maintaining more code...)

https://code.mathr.co.uk/kalles-fraktaler-2/blob/HEAD:/fraktal_sft/floatexp_reference.cpp#l73 is the 3-thread code, with 2 barrier waits (synchronization points) in each iteration
If you expect to do many iterations, and if you can justify quite a bit of coding effort, you could get more parallelism the following way:

1. Determine explicit formulas for two iterations, for four iterations, for eight iterations, and so on.
That means compute explicit polynomials of degree four, degree eight, degree sixteen, and so on. You can precompute higher powers of the constant C as well.

2. Those higher order polynomials effectively compute several fractal iterations in one evaluation. This does not inherently lead to any savings (beyond re-using the precomputed powers of C), but the larger polynomial opens up opportunities for parallelization:
https://en.wikipedia.org/wiki/Estrin%27s_scheme

3. Eventually one of the multi-iteration evaluations will bail out. So you need to keep track of the previous iteration, because you need to restart from there to determine the exact number of iterations until bailout. This can be done one by one, or by recursively using polynomials of half degree. (Or not at all, of you compute all points of the orbit anyway; see below.)


4. The above is a fairly parallel way of generating every fourth (or eight, or 16th, ...) point of the orbit. The intermediate orbit points can be filled in in parallel, too, using the closest lower iteration point to start from (again recursively with half degree polynomials).


Cautionary note: this is a lot of programming work, and it is also somewhat wasteful (you might do a few redundant computations). But it can scale to a pretty high number of threads with good load balancing (at least in phase 4), so it should be a net win for the modern >= 6 core processors.

Offline claude

  • *
  • Fractal Freak
  • **
  • Posts: 760
    • mathr.co.uk
« Reply #10 on: May 23, 2018, 10:57:41 PM »
Interesting idea, but i think it will be too costly, both in terms of total CPU time, and more importantly developer time.

Offline hobold

  • *
  • Fractal Phenom
  • ****
  • Posts: 46
« Reply #11 on: May 24, 2018, 07:08:57 PM »
Interesting idea, but i think it will be too costly, both in terms of total CPU time, and more importantly developer time.
Yeah, it's probably one of these "we got sqrt(N) speedup on a supercomputer with N processors, and it only took us exp(N) time to write and debug" wild ideas. :-)

Offline claude

  • *
  • Fractal Freak
  • **
  • Posts: 760
    • mathr.co.uk
« Reply #12 on: May 24, 2018, 07:37:50 PM »
For 3 iterations I think it works out as:
\[ ((c^4 + 2c^3 + c^2 + c) + (4c^3 + 4c^2) z^2) + ((6 c^2 + 2c) + (4 c) z^2) z^4 + z^8 \]
So 3 serial squarings to get z^2, z^4, z^8, and 3 serial additions to add the terms, plus multiplications and additions that could be parallelized, can't see any improvement over (((z^2 + c)^2 + c)^2 + c) which I think is the most efficient way to do it (N squarings for a degree 2^N result).

Offline hobold

  • *
  • Fractal Phenom
  • ****
  • Posts: 46
« Reply #13 on: May 28, 2018, 11:07:23 AM »
Hmm, I think Estrin's scheme, considering it evaluates a polynomial in the order of a binary tree, cannot reduce the length of the critical path (the chain of dependent operations) below something proportional to log2(degree) (i.e. the height of the binary tree).

But iterating f(x) := x*x + c doubles the polynomial degree with each iteration. Thus Estrin's scheme is of no help here. :-/

Sorry, I have caused a false alarm. For some reason I thought each Mandelbrot iteration would only increase polynomial degree by two.

Offline gerrit

  • *
  • 3f
  • ******
  • Posts: 1419
« Reply #14 on: June 04, 2018, 06:33:20 AM »
Unfortunately the reference computation code is hard to parallelize very much.  For low zoom levels the overhead (synchronization cost) of parallelization outweighs the benefits, even.  For the deepest zooms it uses 3 threads, each doing 1 real squaring and 1 or 2 real additions.  The trouble is that iterating z <- z^2 + c is inherently a serial task, so the threads have to synchronize after each iteration.  (Note that only this quadratic Mandelbrot formula is multi-threaded, the others could be I suppose but that means writing and maintaining more code...)

https://code.mathr.co.uk/kalles-fraktaler-2/blob/HEAD:/fraktal_sft/floatexp_reference.cpp#l73 is the 3-thread code, with 2 barrier waits (synchronization points) in each iteration
Can't you interleave the reference calculation with the rest? Use the idle processors to start computing first the SA and when that's done the image, while the reference orbit gets calculated? This way there is no waste of resources.


xx
benchmarks

Started by claude on Mandelbulber

0 Replies
132 Views
Last post May 22, 2018, 01:37:38 AM
by claude
xx
curious for GPU benchmarks of "Ryzen with Vega"

Started by hobold on Programming

0 Replies
236 Views
Last post February 12, 2018, 09:31:39 PM
by hobold