I am trying to implement GEMM using AMD-APP-SDK 2.4 on an ATI HD 6990 card (Cayman architecture).
One of the optimization techniques is blocking/tiling.
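To make clear what I mean by blocking, here is a minimal CPU-side sketch in plain C (not OpenCL). The function name `sgemm_tiled` and the `TILE` value are only illustrative assumptions, and the sketch assumes square row-major matrices whose size is a multiple of the tile edge.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge, kept a multiple of 4 to match 128-bit accesses. */
#define TILE 4

/* Computes C = A * B for row-major n x n single-precision matrices.
 * Assumption: n is a multiple of TILE. */
void sgemm_tiled(size_t n, const float *A, const float *B, float *C)
{
    assert(n % TILE == 0);

    for (size_t i = 0; i < n * n; ++i)
        C[i] = 0.0f;

    /* Walk over TILE x TILE sub-matrices of C, A and B. */
    for (size_t i0 = 0; i0 < n; i0 += TILE)
        for (size_t j0 = 0; j0 < n; j0 += TILE)
            for (size_t k0 = 0; k0 < n; k0 += TILE) {
                /* On the GPU, the A tile (i0, k0) and the B tile (k0, j0)
                 * are the sub-matrices that would either be staged in local
                 * memory or be read through the texture cache. */
                for (size_t i = i0; i < i0 + TILE; ++i)
                    for (size_t k = k0; k < k0 + TILE; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = j0; j < j0 + TILE; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The innermost three loops touch only one tile of A and one tile of B, which is the data reuse the question below is about.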
When implementing it, is it faster to store the sub-matrices in shared local memory, or is it faster to read them through the texture cache? If possible, please also explain why.
Please also suggest which is easier to implement.
P.S. I want it for single precision only, if it matters!
Note: The size of the sub-matrix is not an issue; my feeling is that the larger it is, the better. The only constraint to take into consideration is that if the unit of memory access is 128 bits (4 single-precision values), then the block size should be a multiple of 4.
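The block-size constraint can be written down as a small sketch in C. The helper names below are hypothetical, and the only assumption is the one stated above: one memory unit holds four 32-bit floats.

```c
#include <assert.h>
#include <stddef.h>

/* 128 bits / 32 bits per single-precision value. */
#define FLOATS_PER_VEC 4

/* Rounds a candidate block edge up to the next multiple of 4. */
size_t round_up_block(size_t block)
{
    return (block + FLOATS_PER_VEC - 1) / FLOATS_PER_VEC * FLOATS_PER_VEC;
}

/* Number of 128-bit units needed for one row of a block. */
size_t vecs_per_block_row(size_t block)
{
    assert(block % FLOATS_PER_VEC == 0);
    return block / FLOATS_PER_VEC;
}

/* Bytes needed to hold one block of A and one block of B at once. */
size_t block_pair_bytes(size_t block)
{
    return 2 * block * block * sizeof(float);
}
```

For example, a candidate edge of 13 would be rounded up to 16, giving 4 units of 128 bits per block row.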