Fractal Software > Programming

 fast extended range types

(1/2) > >>

fast is easy, correct is easy, fast and correct at the same time is hard

the problem: very deep escape time zooms with perturbation need more dynamic range than hardware (long) double provides

two solutions:

floatexp : use a double normalized to be near 1, together with an int64_t containing the "real" exponent
- sloppy version implemented in KF works almost all of the time, fixing it to be 100% correct gives a huge slowdown
  - align renormalizes the double and propagates changes to the exponent (for correctness it needs to do special things for 0, inf, nan, and on under/overflow of the separate exponent: these checks are slow)
  - setExp used in addition, essentially an implementation of >> for double, used only with a limited set of values so easy to be both fast and correct for those

softfloat : emulate floating point using a pair of int64_t
  - align renormalizes the mantissa int64_t to have 0 leading redundant sign bits; gcc has __builtin_clrsb() which can count them; exponent needs to be modified correspondingly with a saturating subtraction
  - add and subtract need to downshift the smaller value, and maybe both values by one bit to avoid overflow; then align afterwards
  - multiplication needs to calculate only the high 64 bits of the 128bit result of the multiply, exponents are added with saturation, then realigned

question: which is faster? (result may vary depending on hardware)


Are there any better functions or routines to use for floatexp? currently using frexp/ldexp.

I think the main performance issues arising from these types come from addition/multiplication operations; for an addition between two floatexp numbers it requires a branch to determine which has the larger exponent (could or is there a branchless version of this) followed by the relative normalisation of the smaller number


--- Quote ---Are there any better functions or routines to use for floatexp? currently using frexp/ldexp.
--- End quote ---

bit twiddling is faster than those standard library implementations, but is itself undefined by the c++ standard.

--- Quote ---could or is there a branchless version
--- End quote ---

i actually wrote a branchless implementation using binary logic as a ridiculous experiment  :skeptical:  just in case there might be a performance win to be had.  there didnt seem to be a win there when i tested it.  i think claude wrote a gpu implementation of floatexp, i wonder if he used binary logic for that?

i also wrote a simd version, which itself necessitates its own sort of binary logic ish control flow, which is what gave me the idea to try binary logic for the non simd version.  simd isnt as massive of a win here as you hope for.  in synthetic benchmarks i get like maybe 2.5x performance with a 4-wide simd vector for instance.  realworld application tends to be even less, but also tends to still be a win.  it seems to tend toward being a definite win with vector size > 4.

fun story: when looking at the code gen from different compilers, clang actually took my binary logic implementation, saw what i was doing, and wrote its own branching code to replace it.  the result was like 10x slower than other compilers.  that gave me a good laugh

A shortcut is available here for z2 + c. There, the perturbation iteration is new-delta equals 2*delta*refpt + delta2, which has the property that if delta is smaller than 1e-300, which is when you'd start needing floatexp, delta2 is too tiny to matter and you can simply use 2*delta*refpt as the update function. But then you can simply rescale delta and the output of that will be correspondingly rescaled. So there's no need to realign after every iteration. Instead you store an exponent somewhere, rescale delta by 10 to that exponent, calculate with delta for a few hundred iterations, and then check if delta is getting close to 1e+300. If it is, then it's necessary to realign. The order of magnitude of delta tends to double every iteration when delta is very small, as refpt is O(1) for an orbit that either doesn't escape, or doesn't escape soon, which is why this works.

There are two caveats. One, you need to anticipate when the true value of delta will be getting above 1e-300 and switch back to normal double at that time, and start using the delta2 term. Two, when calculating the reference orbit you need to build a list of [iter full-precision-refpt] pairs for each iteration where the refpt itself is very close to zero (below 1e-300 in modulus) -- this happens if there are flybys of deep minis (past 1e300 magnification). When iterating, when the next iter in that list is hit, that one iteration must be done at full precision and delta2 must be used, as 2*delta*refpt is likely to be small enough not to swamp that term this one time. After that "special" iteration, returning to floatexp with a fresh realignment should be done.

The iterations with tiny refpts will be a mere handful, so all told this allows running an inner loop that stops after either several hundred further iterations, the iter that is projected to bring the magnitude of delta above 1e-300, or the next iter in the list of iters with tiny refpts, whichever comes first; and inside that inner loop the exponent shift is fixed, and the calculations can be done with simple rescaled doubles and no delta2 term. An outer loop determines the iter to run the inner loop 'til and then does so, then folds the exponent into the delta value as a high precision value and if it's over 1e-300 turns it to a normal double and runs the rest of the way until escape. Otherwise if the iter is in the list with tiny refpts it runs one iteration at full precision, and, regardless, returns to floatexp with a new exponent chosen to make the float part near 1e-300 in magnitude before looping the outer loop again.

The only other thing in the inner loop should be glitch check. The final, magnitude-over-1e-300 iteration loop, done using normal double, needs also periodicity detection. This can be done by saving delta in a variable on the first, 2nd, 4th, 8th etc. loop and checking for delta to differ from the saved delta by less than a threshold on every other loop. If the actual iterate enters a cycle and the reference doesn't escape, the reference should be in a cycle as well, and true-delta will be in a cycle whose period divides the product of the periods of the reference and the iterate.

The deeper-than-1e-300 iterations don't need this, because attracting cycles in the Mandelbrot set always are scattered about the radius-2 disk and never tightly clustered, aside from fixed points. Fixed points occur only in the cardioid, which is generally not in the field of view of zooms deep enough to use perturbation. Otherwise, once approaching the cycle the orbit has large delta magnitudes if the reference orbit isn't trapped in essentially the same cycle. So the only cycles that might be caught earlier (and might prevent delta even leaving the sub-1e-300 domain) are when the iterated point is in the same minibrot as the reference orbit. This case can be detected by detecting if the reference orbit tends to a cycle, once when the reference orbit is calculated, and noting the iteration where that occurs. The outer loop can halt the inner loop at that iteration, save a copy of delta, and thereafter halt the inner loop every some multiple of the reference orbit cycle's period and check for delta to be the same as the saved value, to within a small margin, on subsequent exits of the inner loop, to detect if the iterated point has entered a cycle of the same period. The saved copy might be updated after 2 halts of the inner loop, then 4, then 8 etc. to help catch cases where the iterated point converges somewhat more slowly than the reference orbit did, and perhaps to a longer cycle (is in a bulb with ref in the cardioid, e.g.).

Finally, the reference orbit can be stored at merely double precision (though it must be calculated at arbitrary precision), save for the short list of [iter refpt] pairs at iterations where the refpt is smaller than 1e-300 in modulus, which must store the affected refpts at full precision. This saves a huge amount of memory.

The applicability of the above to non-quadratic-Mandelbrot iterations is unknown. It would at least need adapting. Dissimilar cycles will generally produce deltas of magnitude of O(1) allowing the periodicity checking logic to work more or less the same way, I expect. Anything where the perturbation term is (linear term in delta and refpt) + (terms with higher powers of delta) should allow the shortcuts below 1e-300 to be taken for all iterations whose refpt is not very tiny, though. The maximum number of iterations that can be done in the inner loop without risking overflowing delta's exponent will differ, depending on both the typical order of magnitude of the refpt for that fractal and the constant coefficient. For regular Mandelbrot the coefficient was 2 and trapped refpts are in a radius 2 disk so the maximum growth of delta in one iteration (except if about to escape, and in particular when still below 1e-300 in magnitude) is a factor of four. A power-of-2 exponent range of 2048 for double allows up to 1024 quadruplings of delta before it goes from about 1e-300 to about 1e+300; in Nanoscope I chose 500 as a safe maximum with a fat safety margin. (Speedup from doing more inner loop iters per outer loop iter shows steeply diminishing returns past that point anyway.) With other fractals these numbers will vary. Cubic Mandelbrot likely has the linear term have coefficient 3, but the refpts are still in the radius 2 disk, so the maximum jump in delta then becomes a factor of 6, and the maximum safe iters drops from 1024 to about 600.

Disclaimer: I have yet to try to implement perturbation for any other fractal than quadratic Mandelbrot, so that last paragraph is educated guesswork without empirical testing.

Have you tried a square-only implementation for z^2+c in place of general multiplication?

You could use the binomial (a+b)^2=a^2+2*a*b+b^2 (a=real part of z, b=imaginary). to get the first half of the new imaginary part 2*a*b. Maybe this gives a speed-up? (No sign considerations necessary, overflow can be predicted beforehand by looking at one exponent only, new exponent is then mainly a bit-shifted old one). However, at the cost of some additions and the problem of vastly differing a,b values, rendering one "not present" in that addition. But the latter ocurrs for adding the seed anyways.
(I recently read an old article about early 80's computers and a special sqr-function for BASIC. But maybe that's no longer an issue nowadays.)


[0] Message Index

[#] Next page

Go to full version
Powered by SMFPacks Rates (Facepunch) Mod
Powered by SMFPacks SEO Pro Mod | Sitemap
Powered by Advanced Topic Prefix Pro
Powered by SMFPacks Advanced Attachments Uploader Mod