
Answer by chtz for SIMD transpose when row size is greater than vector width

If all your matrix dimensions are a multiple of your packet-size you can do the operation block-wise and swap the blocks as needed. Example for 4x4 double matrix using SSE2:

#include <emmintrin.h> // SSE2 intrinsics

// transpose the 2x2 block held in vectors i0 and i1 and store the result to addresses r0 and r1
void transpose2x2(double *r0, double* r1, __m128d i0, __m128d i1)
{
    __m128d t0 = _mm_unpacklo_pd(i0,i1);
    __m128d t1 = _mm_unpackhi_pd(i0,i1);
    _mm_storeu_pd(r0, t0);
    _mm_storeu_pd(r1, t1);
}


void transpose(double mat[4][4])
{
    // transpose [00]-block in-place
    transpose2x2(mat[0]+0, mat[1]+0,_mm_loadu_pd(mat[0]+0),_mm_loadu_pd(mat[1]+0));

    // load [20]-block
    __m128d t20 = _mm_loadu_pd(mat[2]+0), t30 = _mm_loadu_pd(mat[3]+0);
    // transpose [02]-block and store it to [20] position
    transpose2x2(mat[2]+0,mat[3]+0, _mm_loadu_pd(mat[0]+2),_mm_loadu_pd(mat[1]+2));
    // transpose temp-block and store it to [02] position
    transpose2x2(mat[0]+2,mat[1]+2, t20, t30);

    // transpose [22]-block in-place
    transpose2x2(mat[2]+2, mat[3]+2,_mm_loadu_pd(mat[2]+2),_mm_loadu_pd(mat[3]+2));
}

This should be relatively easy to extend to other square matrices, other scalar types, and other architectures. Matrices whose dimensions are not a multiple of the packet size are more complicated (if they are large enough, it will probably be worth doing most of the work with vectorization and handling the last rows/columns manually).
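As an example of a different scalar type, here is a sketch for 4x4 float: an SSE register now holds an entire row, so all four rows fit in registers and SSE's `_MM_TRANSPOSE4_PS` helper macro does the transpose in one step (the function name `transpose4x4f` is my own, not from the original answer):

    #include <xmmintrin.h> // SSE; provides _MM_TRANSPOSE4_PS

    // transpose a 4x4 float matrix in-place; each __m128 holds one full row
    void transpose4x4f(float mat[4][4])
    {
        __m128 r0 = _mm_loadu_ps(mat[0]);
        __m128 r1 = _mm_loadu_ps(mat[1]);
        __m128 r2 = _mm_loadu_ps(mat[2]);
        __m128 r3 = _mm_loadu_ps(mat[3]);
        _MM_TRANSPOSE4_PS(r0, r1, r2, r3); // unpack/shuffle macro from <xmmintrin.h>
        _mm_storeu_ps(mat[0], r0);
        _mm_storeu_ps(mat[1], r1);
        _mm_storeu_ps(mat[2], r2);
        _mm_storeu_ps(mat[3], r3);
    }
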

For some sizes, e.g. 3x4 or 3x8 matrices, there are special algorithms [1] -- if you have a 1003x1003 matrix, you could exploit them for the last rows/columns (and there are probably algorithms for other odd sizes as well).

With some effort you could also write this for rectangular matrices (some thought is needed on how to avoid caching more than one block at a time, but it is possible).
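The out-of-place rectangular case is the easy variant, since blocks never overwrite unread input; a minimal sketch using the same 2x2 unpack idiom (the name `transpose_rect` and the row-major pointer layout are my own assumptions, and `rows`/`cols` must be multiples of 2):

    #include <emmintrin.h> // SSE2 intrinsics

    // out-of-place transpose of a row-major rows x cols double matrix;
    // an in-place rectangular transpose would additionally need
    // cycle-following or block buffering.
    void transpose_rect(const double *in, double *out, int rows, int cols)
    {
        for (int i = 0; i < rows; i += 2)
            for (int j = 0; j < cols; j += 2) {
                __m128d a = _mm_loadu_pd(in + (i + 0) * cols + j);
                __m128d b = _mm_loadu_pd(in + (i + 1) * cols + j);
                // unpacklo gathers column j, unpackhi gathers column j+1
                _mm_storeu_pd(out + (j + 0) * rows + i, _mm_unpacklo_pd(a, b));
                _mm_storeu_pd(out + (j + 1) * rows + i, _mm_unpackhi_pd(a, b));
            }
    }
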

Godbolt demo: https://godbolt.org/z/tVk_Bc

[1] https://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx
