Fractalforums
Fractal Software => Programming => Topic started by: claude on August 14, 2020, 02:42:52 AM

It seems traditional Mset (etc) perturbation rendering has horrible cache behaviour: each pixel goes through the whole reference orbit, which probably doesn't fit in the cache, so it will need to be reloaded from main memory each time.
Maybe there is a way to improve it by iterating in chunks/tiles? So have a buffer of M pixels, each advanced by a chunk of N reference steps at a time. If the memory needed (O(M + N)) fits in the cache, it should be faster.
Or is the computation time vs reference memory access already such that the computation time masks the memory fetch?
Has anyone investigated this kind of stuff already?

Hmm. Good catch.
Cache latency should not be a problem. I am assuming the reference orbit is an array of values which are getting accessed from index 0 to N in sequence. Then the automatic prefetchers will catch on to the access pattern and start bringing in successive array elements early enough.
But bandwidth of level 2 cache and outwards could easily be a limiter.
So I guess you'd want to try some kind of "cache blocking" / "strip mining" / "data tiling" based on the typical size of the level 1 data cache. A reasonable assumption on x86 processors these days is an L1D size of 32KB. Because of simultaneous multithreading ("hyperthreading" is Intel's marketing name), this capacity is shared between up to two threads (unless you are running on exotic Xeon Phi hardware).
While both simultaneously active threads could be using the same reference orbit, they are unlikely to iterate simultaneously over the same indexes. So effective L1D size per thread is halved to 16KB. And we will have to account for other data also competing for cache space. As a rule of thumb, I'd suggest a block size of 8KB for the reference orbits.
How many complex arbitrary precision values can be fit into 8KB at all? If individual arbitrary precision numbers are kilobytes in size, we better start from the level 2 cache to work out a reasonable block size.

the reference needs to be computed in arbitrary precision, but storing it in low precision for perturbation computations is fine. KF uses 3x low precision values, caching the constant*z^2 for glitch detection as well as the real and imaginary parts. so 24-48 bytes per iteration (still need to fully investigate whether the extra range of long double or floatexp is needed for low precision reference orbits, or if double precision should always be ok)
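If the record really is three plain doubles, the sizes are easy to check; a sketch with illustrative field names (not KF's actual ones):

```c
#include <stddef.h>

typedef struct {
    double re;  /* Re(Z_n), rounded down from arbitrary precision */
    double im;  /* Im(Z_n)                                        */
    double z2;  /* cached constant*z^2 value for glitch detection */
} RefEntry;     /* 3 x 8 = 24 bytes per iteration                 */

/* How many reference iterations fit in a cache block of the given
 * size, e.g. an 8KB working-set budget. */
static size_t ref_entries_per_block(size_t block_bytes)
{
    return block_bytes / sizeof(RefEntry);
}
```

At 24 bytes per iteration, an 8KB block holds about 341 iterations of reference orbit; with 16-byte long doubles instead it would be 48 bytes and about 170.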

> the reference needs to be computed in arbitrary precision, but storing it in low precision for perturbation computations is fine. KF uses 3x low precision values, caching the constant*z^2 for glitch detection as well as the real and imaginary parts. so 24-48 bytes per iteration (still need to fully investigate whether the extra range of long double or floatexp is needed for low precision reference orbits, or if double precision should always be ok)
Extra range is needed on a few specific iterations, if zooming past e300 and doing flybys past minibrots below e300.
Nanoscope stores a similar array to KF, 3x double with re, im, and a mag value used for glitch detection. It also stores a pointer that is either null or points to a wide-exponent copy of those three values. If the double precision values underflow (denormal or zero), the wide-exponent copies get set for that iteration during reference orbit computation. Otherwise the pointer is null. During iteration, if the pointer is not null, the next iteration is done using 52-bit mantissa wide-exponent calculations (significantly slower, but for only one iteration).
The circumstances that trigger this look like the following. Say there's a period-73174 mini at about e380. If the zoom goes very close to that mini, then for images near and deeper than that mini, every 73174th reference orbit iteration is within 1e-380 of zero or so, so it underflows in a double. So wide-exponent calculations must be done on those iterations, and those reference orbit entries must be stored with a wide exponent.
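A hedged C sketch of the data layout just described; WideFloat stands in for a 52-bit-mantissa wide-exponent type, and all names are hypothetical rather than Nanoscope's own:

```c
#include <math.h>
#include <stdlib.h>

typedef struct {
    double mantissa;  /* 52-bit mantissa, kept near 1.0            */
    long   exponent;  /* separately stored wide exponent (base 2)  */
} WideFloat;

typedef struct {
    double re, im, mag;  /* low-precision orbit and glitch values   */
    WideFloat *wide;     /* null unless this iteration underflowed  */
} NanoRefEntry;

/* An underflowed double (zero or denormal) signals that this
 * iteration needs wide-exponent copies. */
static int needs_wide(double x)
{
    int c = fpclassify(x);
    return c == FP_ZERO || c == FP_SUBNORMAL;
}

/* Called once per reference iteration: attach wide-exponent copies
 * (which would be computed from the arbitrary-precision orbit,
 * omitted here) only when the doubles underflowed; otherwise leave
 * the pointer null so normal iterations pay nothing extra. */
static void finish_entry(NanoRefEntry *e)
{
    if (needs_wide(e->re) || needs_wide(e->im) || needs_wide(e->mag))
        e->wide = calloc(3, sizeof(WideFloat));
    else
        e->wide = NULL;
}
```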
For every other iteration below e300, Nanoscope does the calculations with rescaled values that don't need a wide exponent. The rescaling is changed every 500 or so iterations; this works because, until a point is on the verge of escape, its orbit dynamics are dominated by the effects of repelling points in the Julia set nearby, and that typically means its magnitude doubles each iteration. The exponent width of doubles is 11 bits, so from about -1000 to about +1000, representing powers of 2 (not 10), which easily accommodates 500 doublings with plenty of margin for error. Rescaling is also redone after every iteration that needed a wide-exponent reference point, because the orbit has jumped much closer to 0 again on such iterations.
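A quick sanity check on that headroom claim, assuming values are rescaled to around 1.0 between rescales:

```c
#include <math.h>

/* Count how many times x can be doubled before the double overflows
 * to infinity.  Starting from 1.0 this exhausts the binary exponent
 * range, so the result is the available doubling headroom. */
static int doublings_before_overflow(double x)
{
    int n = 0;
    while (isfinite(x * 2.0)) {
        x *= 2.0;
        ++n;
    }
    return n;
}
```

Starting from 1.0 this gives 1023 doublings before overflow, so a rescale interval of 500 leaves roughly half the exponent range spare.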
On "final approach" an escape-bound critical orbit moves faster, with the magnitude squaring each iteration, but by the time this happens the unscaled magnitude is above the e300 threshold and Nanoscope has switched to bog-standard perturbation calculations without any rescaling or other sneaky tricks. And escape is usually within the next 500 iterations anyway.

Okay, so we do not have to store arbitrary precision numbers as references; at most there is a need for an extra int to store an extended exponent? Do real and imaginary part each have their own extended exponent?
Anyway, at those sizes, 8KB tiles can store a few hundred iterations worth of reference orbit. That should easily be enough to amortize the overhead of tiling in the first place.
The tiling overhead can still be substantial, though, because now you write code that has more than one pixel in flight. After iterating one pixel over one tile's worth of reference, you store the state of that pixel somewhere. Then you take some other pixel and iterate that over the same tile of reference orbit. This is where reuse of cached data is supposed to help. However, if you have too many in-flight, partially iterated pixels, then their state information will eventually spill out of the cache.
Sounds like quite a bit of work. Maybe one should do a bit more research beforehand. Say, measuring actual L1D cache hit rates with the hardware performance counters. Note that an ordinary Mandelbrot innermost loop (using hardware FP, no coloring decisions) has approximately zero L1D misses. So even if the measurement says there is a low single digit percentage cache miss rate, the optimization can already be worth doing. The misses in that scenario could be undercounted due to automatic prefetching, but still be rather costly (e.g. because the whole reference orbit spills out of an L2 cache shared with threads on more than one processor core).
(All this ancient & obscure trivia suddenly feels useful for a change :).)