We study the case of mapping threads efficiently onto triangular domain problems and propose a block-space linear map based on the properties of the lower triangular matrix that reduces the number of unnecessary threads from O(n^2) to O(n). Performance results for global memory accesses show an improvement of up to 18% with respect to the bounding-box approach. For shared memory scenarios the proposed map was the fastest approach achieving 7% of performance improvement while preserving thread locality.