Unfortunately the reference computation code is hard to parallelize much. For low zoom levels the overhead (synchronization cost) of parallelization even outweighs the benefits. For the deepest zooms it uses 3 threads, each doing 1 real squaring and 1 or 2 real additions. The trouble is that iterating z <- z^2 + c is inherently a serial task, so the threads have to synchronize after each iteration. (Note that only this quadratic Mandelbrot formula is multi-threaded; the others could be too, I suppose, but that would mean writing and maintaining more code...)

https://code.mathr.co.uk/kalles-fraktaler-2/blob/HEAD:/fraktal_sft/floatexp_reference.cpp#l73 is the 3-thread code, with 2 barrier waits (synchronization points) in each iteration

If you expect to do many iterations, and if you can justify quite a bit of coding effort, you could get more parallelism the following way:

1. Determine explicit formulas for two iterations, for four iterations, for eight iterations, and so on.

That means computing explicit polynomials of degree four, degree sixteen, degree 256, and so on (each iteration doubles the degree). You can precompute higher powers of the constant C as well.

2. Those higher order polynomials effectively compute several fractal iterations in one evaluation. This does not inherently lead to any savings (beyond re-using the precomputed powers of C), but the larger polynomial opens up opportunities for parallelization:

https://en.wikipedia.org/wiki/Estrin%27s_scheme

3. Eventually one of the multi-iteration evaluations will bail out. So you need to keep track of the previous iteration point, because you need to restart from there to determine the exact number of iterations until bailout. This can be done one iteration at a time, or by recursively using the polynomials for half as many iterations. (Or not at all, if you compute all points of the orbit anyway; see below.)

4. The above is a fairly parallel way of generating every fourth (or eighth, or 16th, ...) point of the orbit. The intermediate orbit points can be filled in in parallel, too, using the closest earlier computed point to start from (again recursively with the polynomials for half as many iterations).

Cautionary note: this is a lot of programming work, and it is also somewhat wasteful (you might do a few redundant computations). But it can scale to a pretty high number of threads with good load balancing (at least in phase 4), so it should be a net win on modern >= 6-core processors.