If you can explain interval arithmetic, I can explain how to implement it on a GPU (if it is even parallelizable, that is).

To arrive at a reliable answer for an interval seed [a..b]+[c..d]*i (whether the origin escapes for every individual value in it), I need to account for the rounding error during squaring and the other operations. This can be done by choosing the rounding mode so that adding two intervals adds the lower endpoints with downward rounding and the upper endpoints with upward rounding. Similar definitions exist for the other basic operations such as multiplication; for my purpose I only need addition, subtraction and multiplication.
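For concreteness, here is a minimal sketch of such a directed-rounding addition in C++. The names are mine, not the author's library, and it assumes `std::fesetround` actually steers the FPU, which with GCC/Clang generally requires `-frounding-math` (or `#pragma STDC FENV_ACCESS ON` where supported) so the optimizer does not fold the arithmetic at compile time:

```cpp
#include <cfenv>

// Minimal interval type; endpoints are doubles. (Sketch names, not the
// author's CustInterval library.)
struct CustIntervalSketch { double lo, hi; };

// [a..b] + [c..d]: lower endpoints added rounding toward -inf, upper
// endpoints rounding toward +inf, so the true real sum is always enclosed
// despite rounding error in each endpoint addition.
CustIntervalSketch addIA(CustIntervalSketch A, CustIntervalSketch B) {
    CustIntervalSketch r;
    std::fesetround(FE_DOWNWARD);
    r.lo = A.lo + B.lo;
    std::fesetround(FE_UPWARD);
    r.hi = A.hi + B.hi;
    std::fesetround(FE_TONEAREST);   // restore the default mode
    return r;
}
```

Switching the mode per operation is expensive on most CPUs (it flushes the pipeline), which is one motivation for the negation trick mentioned further down.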

I tried OpenMP to iterate several different seed intervals in parallel, but that actually led to a slowdown.

My main loop looks something like this (pseudocode):

```cpp
// ... set seeds into an array of complex intervals with double endpoints

for (int32_t it = 0; it < maxit; it++) {

    ...

    #pragma omp parallel for
    for (int32_t p = 0; p < MAXPARALLEL; p++) {
        iteration(z_current[p], seed[p], z_new[p]);
        z_current[p] = z_new[p];
        // ... some tests (inf, nan etc.)
    }

    ...
}
```

The principal function *iteration* performs the actual IA operations. Using my own implementation of a simple IA mini library, it looks like this (I prefer sequential, reference-argument-taking functions over operator overloading when looking for speed):

```cpp
int32_t iterationCustIA_z2c(CustComplex& zn, CustComplex& C, CustComplex& erg) {
    CustInterval x2, y2, xy;
    int32_t error = 0;

    error += custMul_ZAB(x2, zn.re, zn.re);
    error += custMul_ZAB(y2, zn.im, zn.im);
    error += custMul_ZAB(xy, zn.re, zn.im);

    CustInterval tmp1;
    error += custSub_ZAB(tmp1, x2, y2);
    error += custAdd_ZAB(erg.re, tmp1, C.re);

    CustInterval tmp2;
    error += custAdd_ZAB(tmp2, xy, xy);
    error += custAdd_ZAB(erg.im, tmp2, C.im);

    return error;
}
```

Currently my algorithm spends about 20% of its time in iteration and 80% elsewhere. So my idea was: if I could optimize the 80% and arrive at a more balanced split like 50/50, pushing the iteration to the GPU would give the algorithm another boost, and the CPU might then do some of the earlier 80% work in parallel as well.

Now I was wondering: would it be possible to perform the iteration function on a graphics card? The data to transfer is only a few references and the endpoints, but there are quite a number of temporary objects. Maybe with a similarly simple statement like

*#pragma do-this-on-GPU in 64 threads*.

Are there GPUs that directly support double precision? I'm currently approaching the limit of single (although I always use double, since float is no faster with my C++ compiler). If not, I might additionally need to implement a double-single datatype like Claude mentioned.
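The classic building block for such a double-single (or double-double) type is Knuth's error-free TwoSum. A sketch in plain C++; the same code works with floats to emulate roughly double precision on a single-precision GPU. It relies on exact IEEE arithmetic, so it breaks under `-ffast-math`-style reassociation:

```cpp
// Knuth's TwoSum: computes s = fl(a + b) together with the exact rounding
// error err, so that s + err == a + b holds exactly in real arithmetic.
// Chaining such (value, error) pairs yields a double-single/double-double
// number type with roughly twice the significand width.
void twoSum(double a, double b, double& s, double& err) {
    s = a + b;
    double bVirtual = s - a;          // the part of b that made it into s
    double aVirtual = s - bVirtual;   // the part of a that made it into s
    double bRound = b - bVirtual;     // what was rounded away from b
    double aRound = a - aVirtual;     // what was rounded away from a
    err = aRound + bRound;            // exact error of the addition
}
```

Note, though, that TwoSum gives extra *precision*, not directed rounding; for rigorous interval bounds the error term still has to be folded into the endpoints conservatively.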

Since OpenGL does not specify rounding modes, stating only that "The fraction 0.5 will round in a direction chosen by the implementation, presumably the direction that is fastest.", it may be fast but would give less accurate bounds.

That was another question I had. If I can't specify the rounding mode, is it possible to retrieve which one the GPU is currently using? And is that stable, in the sense that the GPU does not change it by itself at some point? And just to be safe: do GPUs adhere to the IEEE 754 standard, so that virtually the infinite-precision operation is performed and the result then rounded, with the benefit that some operations are exact?

It is, however, possible to simulate directed rounding under any rounding mode (by adding or subtracting some 2^-52*value; I remove subnormals and inf by testing beforehand), although it makes things a lot more complicated.
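A related trick in the same spirit, sketched below with my own naming, needs no mode switch and no manual 2^-52 scaling: compute in round-to-nearest and then step one ulp outward with `std::nextafter`. Since the nearest-rounded result is within half an ulp of the true value, the stepped result is a guaranteed bound, at the cost of intervals up to one ulp wider per operation:

```cpp
#include <cmath>
#include <limits>

// Guaranteed lower bound for a + b without touching the rounding mode:
// the round-to-nearest result is within half an ulp of the true sum, so
// one representable step toward -infinity is always <= the true sum.
double addRoundedDown(double a, double b) {
    return std::nextafter(a + b, -std::numeric_limits<double>::infinity());
}

// Symmetric guaranteed upper bound (one step toward +infinity).
double addRoundedUp(double a, double b) {
    return std::nextafter(a + b, std::numeric_limits<double>::infinity());
}
```

As in the author's 2^-52 scheme, infinities and subnormal inputs need separate handling before trusting the stepped bounds.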

My winner is the repetitive negation method, like -((-a) + (-b)) for downward-rounding addition of a and b when the rounding mode is upward. That was a very nice idea I read about.
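A sketch of that trick, assuming the FPU has been pinned to upward rounding once at startup (the same `std::fesetround` caveats as above apply). It works because negating an IEEE 754 double is exact, and rounding -x toward +infinity is the same as rounding x toward -infinity:

```cpp
#include <cfenv>

// With the rounding mode fixed to FE_UPWARD once:
//   upward-rounded addition is simply a + b,
//   downward-rounded addition is -((-a) + (-b)),
// because negation is exact and flips the rounding direction. No costly
// per-operation mode switch is needed.
double addUpward(double a, double b)   { return a + b; }
double addDownward(double a, double b) { return -((-a) + (-b)); }
```

The same pattern carries over to subtraction and multiplication by negating one operand, which covers all three operations the iteration needs.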