
Author Topic: (Question) memory bandwidth trade-offs for perturbation rendering  (Read 368 times)


Offline claude

  • 3f
  • ******
  • Posts: 1833
    • mathr.co.uk
(Question) memory bandwidth trade-offs for perturbation rendering
« on: August 14, 2020, 02:42:52 AM »
It seems traditional M-set (etc.) perturbation rendering has horrible cache behaviour: each pixel goes through the whole reference orbit, which probably doesn't fit in the cache, so it needs to be reloaded from main memory for every pixel.

Maybe there is a way to improve it by iterating in chunks/tiles?  So have a buffer of M pixels, each of which is advanced through a chunk of N reference steps at a time.  If the memory needed (O(M + N)) fits in the cache, it should be faster.

Or is the computation time vs reference memory access already such that the computation time masks the memory fetch?

Has anyone investigated this kind of stuff already?
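
For illustration, here is a minimal sketch of the loop order the question suggests (all names and types are mine, not from any existing renderer; the inner step is the usual perturbed Mandelbrot iteration, with rebasing and glitch detection omitted):

Code:
#include <complex.h>
#include <stddef.h>

/* Reference orbit Z_0 .. Z_{K-1}, stored in low precision. */
typedef struct { size_t length; double complex *Z; } ref_orbit;

/* Per-pixel state that stays live across chunks. */
typedef struct {
    double complex dc;  /* pixel's offset from the reference parameter */
    double complex dz;  /* current delta z_n - Z_n */
    size_t n;           /* current iteration count */
    int escaped;
} pixel_state;

/* Advance a block of M pixels over one chunk of N reference iterations.
   Working set is O(M + N): one chunk of the reference plus the pixel
   states, instead of streaming the whole orbit once per pixel. */
static void iterate_chunk(const ref_orbit *ref, size_t chunk_start, size_t N,
                          pixel_state *px, size_t M)
{
    size_t chunk_end = chunk_start + N;
    if (chunk_end > ref->length) chunk_end = ref->length;
    for (size_t i = 0; i < M; ++i) {
        pixel_state *p = &px[i];
        if (p->escaped) continue;
        for (size_t n = chunk_start; n < chunk_end; ++n) {
            double complex Z = ref->Z[n];
            /* escape test on the full value z_n = Z_n + dz_n */
            if (cabs(Z + p->dz) > 2.0) { p->escaped = 1; break; }
            /* perturbed Mandelbrot step: dz_{n+1} = 2*Z_n*dz_n + dz_n^2 + dc */
            p->dz = 2.0 * Z * p->dz + p->dz * p->dz + p->dc;
            p->n = n + 1;
        }
    }
}

/* The driver loops over chunks in the outer loop and calls iterate_chunk
   for each, so each reference chunk is read from memory once and then
   reused M times from cache. */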


Linkback: https://fractalforums.org/index.php?topic=3717.0

Offline hobold

  • Fractal Frogurt
  • ******
  • Posts: 419
Re: memory bandwidth trade-offs for perturbation rendering
« Reply #1 on: August 14, 2020, 07:20:26 AM »
Hmm. Good catch.

Cache latency should not be a problem. I am assuming the reference orbit is an array of values which are getting accessed from index 0 to N in sequence. Then the automatic prefetchers will catch on to the access pattern and start bringing in successive array elements early enough.

But bandwidth of level 2 cache and outwards could easily be a limiter.

So I guess you'd want to try some kind of "cache blocking" / "strip mining" / "data tiling" based on the typical size of the level 1 data cache. A reasonable assumption on x86 processors these days is an L1D size of 32KB. Because of simultaneous multithreading ("hyperthreading" is Intel's marketing name), this capacity is shared between up to two threads (unless you are running on exotic Xeon Phi hardware).

While both simultaneously active threads could be using the same reference orbit, they are unlikely to iterate simultaneously over the same indexes. So effective L1D size per thread is halved to 16KB. And we will have to account for other data also competing for cache space. As a rule of thumb, I'd suggest a block size of 8KB for the reference orbits.

How many complex arbitrary-precision values can fit into 8KB at all? If individual arbitrary-precision numbers are kilobytes in size, we had better start from the level 2 cache to work out a reasonable block size.

Offline claude

  • 3f
  • ******
  • Posts: 1833
    • mathr.co.uk
Re: memory bandwidth trade-offs for perturbation rendering
« Reply #2 on: August 14, 2020, 07:34:31 AM »
The reference needs to be computed in arbitrary precision, but storing it in low precision for the perturbation computations is fine.  KF uses 3 low-precision values per iteration, caching constant*|z|^2 for glitch detection as well as the real and imaginary parts, so 24-48 bytes per iteration.  (I still need to fully investigate whether the extra range of long double or floatexp is needed for low-precision reference orbits, or whether double precision is always OK.)
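
For concreteness, a sketch of such a per-iteration record (field names are illustrative, not taken from KF's source), with the capacity arithmetic for an 8KB block in the comments:

Code:
#include <stddef.h>

/* One reference-orbit entry as described above: the reference point Z_n
   in low precision plus a precomputed glitch-detection threshold. */
typedef struct {
    double re;   /* Re(Z_n) */
    double im;   /* Im(Z_n) */
    double tol;  /* constant * |Z_n|^2, compared per pixel for glitches */
} ref_entry;     /* 24 bytes per iteration */

/* With long double or a double+exponent "floatexp" type, the entry grows
   to roughly 48 bytes including padding.
   Capacity of an 8KB block:
     8192 / 24 ~ 341 iterations with plain doubles,
     8192 / 48 ~ 170 iterations with the wider types. */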

Offline pauldelbrot

  • 3f
  • ******
  • Posts: 2711
Re: memory bandwidth trade-offs for perturbation rendering
« Reply #3 on: August 14, 2020, 08:29:45 AM »
Quote from: claude on August 14, 2020, 07:34:31 AM

The reference needs to be computed in arbitrary precision, but storing it in low precision for the perturbation computations is fine.  KF uses 3 low-precision values per iteration, caching constant*|z|^2 for glitch detection as well as the real and imaginary parts, so 24-48 bytes per iteration.  (I still need to fully investigate whether the extra range of long double or floatexp is needed for low-precision reference orbits, or whether double precision is always OK.)

Extra range is needed on a few specific iterations, if zooming past e300 and doing flybys past minibrots below e300.

Nanoscope stores a similar array to KF, 3x double with re, im, and a mag value used for glitch detection. It also stores a pointer that is either null or points to a wide-exponent copy of those three values. If the double-precision values underflow (to denormal or zero) during reference orbit computation, the wide-exponent copy is created and the pointer is set for that iteration; otherwise the pointer is null. During iteration, if the pointer is not null, the next iteration is done using 52-bit-mantissa wide-exponent calculations (significantly slower, but for only one iteration).
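
A sketch of that layout (the names and the underflow test are mine, not Nanoscope's actual source):

Code:
#include <math.h>
#include <stddef.h>

/* A wide-exponent real: 52-bit double mantissa plus a separate exponent
   that is not limited to the 11 bits of a double. */
typedef struct { double mantissa; long exponent; } wide_real;
typedef struct { wide_real re, im, mag; } wide_entry;

/* One reference-orbit entry: plain doubles, plus an optional pointer to a
   wide-exponent copy for the few iterations where the doubles underflow. */
typedef struct {
    double re, im;      /* low-precision reference point Z_n */
    double mag;         /* value used for glitch detection */
    wide_entry *wide;   /* NULL, or wide copy when re/im underflowed */
} ref_entry;

/* During reference computation: the arbitrary-precision value is known to
   be nonzero, so a zero or subnormal double means it underflowed. */
static int underflowed(double x)
{
    return x == 0.0 || fpclassify(x) == FP_SUBNORMAL;
}

/* During per-pixel iteration:
     if (ref[n].wide)  do this one step with wide-exponent arithmetic;
     else              do the ordinary double-precision perturbation step. */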

The circumstances that trigger this are like the following. Say there's a period-73174 mini at about e380. If the zoom goes very close to that mini, then for images near and deeper than that mini, every 73174th reference orbit iteration is within roughly 1e-380 of zero, so it underflows in a double. So wide-exponent calculations must be done on those iterations, and those reference orbit entries must be stored with a wide exponent.

For every other iteration below e300, Nanoscope does the calculations with rescaled values that don't need a wide exponent. The rescaling is changed every 500 or so iterations -- this works because until a point is on the verge of escape, its orbit dynamics are dominated by the effects of repelling points in the Julia set nearby, and that typically means its magnitude doubles each iteration. The exponent field of a double is 11 bits, covering roughly -1000 to +1000 in powers of 2 (not 10), so it easily accommodates 500 doublings with plenty of margin for error. Re-rescaling is also done after every iteration that needed a wide-exponent reference point, because the orbit has jumped much closer to 0 again on such iterations.

On "final approach" an escape-bound critical orbit moves faster, with the magnitude squaring each iteration, but by the time this happens the unscaled magnitude is above the e-300 threshold and Nanoscope has switched to bog-standard perturbation calculations without any rescaling or other sneaky tricks. And escape is usually within the next 500 iterations anyway.
« Last Edit: August 14, 2020, 08:45:22 AM by pauldelbrot »

Offline hobold

  • Fractal Frogurt
  • ******
  • Posts: 419
Re: memory bandwidth trade-offs for perturbation rendering
« Reply #4 on: August 14, 2020, 09:24:25 AM »
Okay, so we do not have to store arbitrary-precision numbers as references; at most there is a need for an extra int to store an extended exponent? Do the real and imaginary parts each have their own extended exponent?

Anyway, at those sizes, 8KB tiles can store a few hundred iterations worth of reference orbit. That should easily be enough to amortize the overhead of tiling in the first place.

The tiling overhead can still be substantial, though, because now you write code that has more than one pixel in flight. After iterating one pixel over one tile's worth of reference, you store the state of that pixel somewhere. Then you take some other pixel and iterate that over the same tile of reference orbit. This is where re-use of cached data is supposed to help. However, if you have too many in-flight partially iterated pixels, then their state information will eventually spill out of the cache.
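
As a rough budget under the assumptions above (all numbers illustrative): the in-flight state per pixel is roughly two complex doubles (the pixel offset dc and the current delta dz) plus an iteration counter and flags, so around 40-48 bytes. With a 16KB effective L1D share and an 8KB reference tile, the remaining ~8KB holds on the order of 170-200 pixel states before they start spilling to L2 -- and spilling to L2 is still far cheaper than re-streaming the whole reference orbit from main memory for every pixel.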

Sounds like quite a bit of work. Maybe one should do a bit more research beforehand. Say, measuring actual L1D cache hit rates with the hardware performance counters. Note that an ordinary Mandelbrot innermost loop (using hardware FP, no coloring decisions) has approximately zero L1D misses. So even if the measurement says there is a low single digit percentage cache miss rate, the optimization can already be worth doing. The misses in that scenario could be undercounted due to automatic prefetching, but still be rather costly (e.g. because the whole reference orbit spills out of an L2 cache shared with threads on more than one processor core).

(All this ancient & obscure trivia suddenly feels useful for a change :).)

