Vector.BlockInit Method ()

Initializes block processing.

Pascal

procedure BlockInit; overload;

Initializes block processing. Because the size of the CPU cache is limited, significant performance gains can be obtained by splitting long vectors into a series of short ones, each of which fits entirely in the CPU cache. The BlockInit method is used together with the BlockNext and BlockEnd methods to set up a block processing while loop.

BlockInit obtains a subrange of the data in TVec. The length of the subranged vector is determined by the global Math387.MtxVecBlockSize variable. The default value of MtxVecBlockSize is 800 vector elements for double precision and 1600 elements for single precision. BlockInit supports nested calls: from within a blocked while loop you can call procedures and functions that are themselves blocked.

With block processing, typical performance gains range from 2 to a maximum of 6 times. Block processing cannot be used, or is difficult to apply, when vector elements are not independent of each other. The block processing while loop must be written like this:

a.BlockInit;
while not a.BlockEnd do
begin
  // .... user defined function
  a.BlockNext;
end;
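To make the subrange arithmetic concrete, here is a minimal sketch in plain Pascal that does not use MtxVec at all. It only mimics how a vector of N elements could be walked in blocks. The constant BlockSize = 800 mirrors the documented double precision default, and the names N, BlockStart, BlockLen and BlockCount are hypothetical. Whether MtxVec treats the final, shorter block exactly this way is an assumption.

```pascal
program BlockRanges;
{$APPTYPE CONSOLE}

const
  BlockSize = 800;   // mirrors the documented double precision default
  N = 10000;         // hypothetical vector length

var
  BlockStart, BlockLen, BlockCount: Integer;
begin
  BlockCount := 0;
  BlockStart := 0;
  while BlockStart < N do
  begin
    // The last block may be shorter than BlockSize (assumed behaviour).
    BlockLen := N - BlockStart;
    if BlockLen > BlockSize then
      BlockLen := BlockSize;
    Inc(BlockCount);
    WriteLn('block ', BlockCount, ': offset ', BlockStart, ', length ', BlockLen);
    BlockStart := BlockStart + BlockLen;
  end;
end.
```

With these values the loop visits 13 blocks: 12 full blocks of 800 elements and a final block of 400.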

See Also

BlockNext, BlockEnd

Example

Normal vectorized procedure:

procedure ParetoPDF(const X: Vector; a, b: double; var Res: Vector); overload;
begin
  Res.Size(X);
  Res.Power(X, -(a+1));
  Res.Mul(Power(b,a)*a);
end;

Vectorized and blocked version of the Pareto probability distribution procedure:

procedure ParetoPDF(const X: Vector; a, b: double; var Res: Vector); overload;
begin
  Res.Size(X);
  Res.BlockInit;
  X.BlockInit;
  while not X.BlockEnd do
  begin
    Res.Power(X, -(a+1));
    Res.Mul(Power(b,a)*a);
    Res.BlockNext;
    X.BlockNext;
  end;
end;

The blocked version of ParetoPDF will execute faster than the non-blocked version when X contains 5000-10000 elements or more (double precision). Below that size the two versions perform about the same, except for very short vectors (below 50 elements), where the non-blocked version has a slight advantage because it avoids the overhead of the block processing methods.

The time is saved between the calls to Res.Power(X,-(a+1)) and Res.Mul(Power(b,a)*a), where the same memory (stored in the Res vector) is accessed in two consecutive calls. That memory is loaded into the CPU cache on the first call, provided the Length of the Res vector is short enough to fit. As an exercise, you can also compare the performance of the vectorized and blocked version of the function with the single value version (ParetoPDF(X: double; a, b: double; Res: double)) and measure the execution time of both versions for long vectors (100 000 elements) and short vectors (10 elements).

The differences with block processing will be more noticeable on old CPUs without support for SSE2/SSE3.
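One possible way to try the suggested comparison without MtxVec is sketched below in plain Pascal. It is not the library's implementation: TDoubleArray, ParetoPlain and ParetoBlockwise are hypothetical names, the arrays are ordinary dynamic arrays, and BlockSize = 800 simply mirrors the documented double precision default. The timer based on Now is coarse, so repeating each call several times gives more stable figures.

```pascal
program ParetoBlocked;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Math;

type
  TDoubleArray = array of Double;   // hypothetical stand-in for a vector

const
  BlockSize = 800;  // mirrors the documented double precision default

// Two full passes over Res: power first, then scaling.
procedure ParetoPlain(const X: TDoubleArray; a, b: Double; var Res: TDoubleArray);
var
  i: Integer;
  Scale: Double;
begin
  SetLength(Res, Length(X));
  Scale := Power(b, a) * a;
  for i := 0 to High(X) do
    Res[i] := Power(X[i], -(a + 1));
  for i := 0 to High(X) do
    Res[i] := Res[i] * Scale;
end;

// The same two passes, applied block by block, so that each block of Res
// is likely still in the cache when the second pass touches it.
procedure ParetoBlockwise(const X: TDoubleArray; a, b: Double; var Res: TDoubleArray);
var
  i, BlockFirst, BlockLast: Integer;
  Scale: Double;
begin
  SetLength(Res, Length(X));
  Scale := Power(b, a) * a;
  BlockFirst := 0;
  while BlockFirst < Length(X) do
  begin
    // Clamp the final block to the end of the array.
    BlockLast := Min(BlockFirst + BlockSize, Length(X)) - 1;
    for i := BlockFirst to BlockLast do
      Res[i] := Power(X[i], -(a + 1));
    for i := BlockFirst to BlockLast do
      Res[i] := Res[i] * Scale;
    BlockFirst := BlockLast + 1;
  end;
end;

var
  X, R1, R2: TDoubleArray;
  i: Integer;
  T0: TDateTime;
  MaxDiff: Double;
begin
  SetLength(X, 100000);
  for i := 0 to High(X) do
    X[i] := 1.0 + i * 0.001;

  // TDateTime counts days, so multiply by 86400000 to get milliseconds.
  T0 := Now;
  ParetoPlain(X, 3.0, 1.0, R1);
  WriteLn('plain:     ', (Now - T0) * 86400000.0:0:3, ' ms');

  T0 := Now;
  ParetoBlockwise(X, 3.0, 1.0, R2);
  WriteLn('blockwise: ', (Now - T0) * 86400000.0:0:3, ' ms');

  // Both versions perform the same operations per element,
  // so the results are expected to agree.
  MaxDiff := 0.0;
  for i := 0 to High(R1) do
    if Abs(R1[i] - R2[i]) > MaxDiff then
      MaxDiff := Abs(R1[i] - R2[i]);
  WriteLn('max abs difference: ', MaxDiff);
end.
```

Changing the array length to 10 elements, or BlockSize to other values, shows how the relative cost of the extra loop bookkeeping shifts. The actual timings depend on the CPU and compiler settings, so none are quoted here.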
