HLSL 5.0 assembly shows tons of ieq / and / or instructions for array indexing in a loop

August 5, 2015

First of all, I apologize for the long code samples; their exact content is not that important. Besides, they are really simple, and I tried to comment them as much as possible.

I’m working on a pixel shader that does deferred lighting of the frame (Direct3D 11.0, HLSL 5.0). I noticed some huge FPS drops when I add light sources to the scene, and the more sources I add, the bigger the performance impact. After many hours of trying to find the problem, I decided to check the ASM generated from my HLSL code.

After some testing I found that the FPS drops appear when the compiled shader has too many instructions. I won’t post all the code here, of course, but here’s a simplified part of the point lighting, so you can see the structure of the lighting algorithm:

// Total amount of color that pixel receives from all the point light sources.
float3 totalPointLightColor = { 0.0f, 0.0f, 0.0f };

// Loop through the point light sources.
// SF_MAX_POINT_LIGHTS is defined as 64.
[loop] for (uint i = 0; i < SF_MAX_POINT_LIGHTS; i++)
{
    // All the remaining lights are inactive - break the loop.
    if (g_pointLights[i].brightness == 0.0f)
        break;

    // Vector from light source to the pixel.
    float3 fromLightToPixel = worldPos - g_pointLights[i].position.xyz;
    const float distance = length(fromLightToPixel);

    // Check max light distance.
    if (distance > g_pointLights[i].farZ)
        continue;

    fromLightToPixel = normalize(fromLightToPixel);

    // Angle between the pixel normal and light direction.
    float lightIntensity = saturate(dot(normal, -fromLightToPixel));
    // Check angle and skip if the light is on the back of the pixel.
    if (lightIntensity <= 0.0f)
        continue;

    lightIntensity *= g_pointLights[i].brightness / (distance * distance);

    // Process shadow map and get amount of light the pixel receives, if the light has shadow map.
    // THIS LINE IS MENTIONED BELOW IN QUESTION.
    // Here was shadow map check, but when I remove it, nothing really changes.
    // if (g_pointLights[i].shadowMapIndex >= 0) { }

    // If the pixel is in shadow, skip the light source.
    if (lightIntensity <= 0.0f)
        continue;

    totalPointLightColor += lightIntensity * g_pointLights[i].color.rgb;
}

I compile shaders with the D3DCompileFromFile() function, using the following flags:

shaderFlags = D3DCOMPILE_ENABLE_STRICTNESS | D3DCOMPILE_DEBUG
    | D3DCOMPILE_SKIP_OPTIMIZATION | D3DCOMPILE_PREFER_FLOW_CONTROL;

I also tried compiling in the release configuration with the following flags:

shaderFlags = D3DCOMPILE_ENABLE_STRICTNESS | D3DCOMPILE_OPTIMIZATION_LEVEL3;

But nothing changes significantly: only a few fewer ASM instructions here and there. I should add that I have little to no knowledge of ASM.

So, when I compile my whole deferred shader, I get 1757 instructions in the ASM. If I uncomment just the one line marked with the all-caps comment in the code above (a line that actually does nothing), I get 2296 instructions. But when I inspect the output ASM, I can clearly see that this line is not the only thing that changed: the whole loop shown above becomes bigger once translated to ASM.

For example, with that line commented out, the loop above looks like the following in ASM:

loop
    uge ...
    breakc_nz ...

    # Instructions for the first check of the loop.
    # cb6 - constant buffer that stores point lights.
    imul null, r6.w, r5.w, l(3)
    eq r7.w, l(0.000000), cb6[r6.w + 0].w
    if_nz r7.w
        break 
    endif 

    # Instructions for fromLightToPixel, distance.
    add r10.xyz, r3.xyzx, -cb6[r6.w + 1].xyzx
    dp3 r7.w, r10.xyzx, r10.xyzx
    sqrt r8.w, r7.w

    # 6 instructions for the next check.
    # etc.
endloop
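
For reference, the addressing in this short version can be mimicked in plain C++ (not shader code). This is only a sketch under assumptions read off the listing: each light seems to occupy 3 consecutive registers of cb6 (the multiply by l(3)), brightness seems to sit in .w of the first register, and position in .xyz of the second. The names Float4, ConstantBuffer, baseRegister, brightness and position are made up for the illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One 4-component constant-buffer register, like cb6[n] in the listing.
struct Float4 { float x, y, z, w; };

// Assumption read off the listing: 3 registers per light (imul ... l(3)).
constexpr std::size_t kRegistersPerLight = 3;
// SF_MAX_POINT_LIGHTS from the HLSL code.
constexpr std::size_t kMaxPointLights = 64;

using ConstantBuffer = std::array<Float4, kRegistersPerLight * kMaxPointLights>;

// Mirrors "imul null, r6.w, r5.w, l(3)": first register of light i.
std::size_t baseRegister(std::size_t lightIndex) {
    assert(lightIndex < kMaxPointLights);
    return lightIndex * kRegistersPerLight;
}

// Mirrors "cb6[r6.w + 0].w", the value compared against 0.0 in the first check.
float brightness(const ConstantBuffer& cb, std::size_t lightIndex) {
    return cb[baseRegister(lightIndex) + 0].w;
}

// Mirrors "cb6[r6.w + 1].xyzx"; only x, y, z are meaningful here.
Float4 position(const ConstantBuffer& cb, std::size_t lightIndex) {
    return cb[baseRegister(lightIndex) + 1];
}
```

The point is that every access is one computed offset plus a constant, which is why this version of the loop stays so short.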

When I uncomment that 1 line, the ASM grows heavily, and this is approximately what it starts to look like:

loop
    uge ...
    breakc_nz ...

    # Here comes some new code...
    ieq r10.xyzw, r8.wwww, l(0, 1, 2, 3)
    ieq r11.xyzw, r8.wwww, l(4, 5, 6, 7)
    # Another 14 lines of such stuff.
    # Numbers go up to 63, so this correlates with the total number of loop iterations.
    # Another new piece of code.
    and r9.w, r10.x, cb6[0].w
    and r26.x, r10.y, cb6[3].w
    or r9.w, r9.w, r26.x
    and r26.x, r10.z, cb6[6].w
    or r9.w, r9.w, r26.x
    # 100+ more lines of such and / or pairs.
    # I guess it is related to array indexing?

    # Instructions for the first check of the loop - finally!
    eq r26.x, r9.w, l(0.000000)
    if_nz r26.x
        break 
    endif 

    # Here we go again...
    ieq r26.xyzw, r8.wwww, l(0, 32, 1, 33)
    ieq r27.xyzw, r8.wwww, l(2, 34, 3, 35)
    # Another 14 lines of that.
    and r42.xyz, r26.xxxx, cb6[1].xyzx
    and r43.xyz, r26.zzzz, cb6[4].xyzx
    or r42.xyz, r42.xyzx, r43.xyzx
    and r43.xyz, r27.xxxx, cb6[7].xyzx
    or r42.xyz, r42.xyzx, r43.xyzx
    # 100+ more lines of such and / or pairs.

    # 3 instructions for fromLightToPixel, distance - finally!
    add r42.xyz, r3.xyzx, -r42.xyzx
    dp3 r42.w, r42.xyzx, r42.xyzx
    sqrt r43.x, r42.w

    # And so on for every case where I access array via index, as far as I can tell.
endloop
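
To make the repeated pattern easier to read, here is a minimal C++ sketch (again, not shader code) of what the ieq / and / or sequence appears to compute. It assumes that ieq yields all bits set when the operands are equal and zero otherwise, and it treats the register contents as raw 32-bit patterns. The names ieq, selectByCompareChain and selectByIndex are made up for the illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of array elements; 64 matches SF_MAX_POINT_LIGHTS.
constexpr std::size_t kCount = 64;

using Bits = std::array<std::uint32_t, kCount>;

// Models the ieq instruction: all bits set if equal, zero otherwise.
std::uint32_t ieq(std::uint32_t a, std::uint32_t b) {
    return a == b ? 0xFFFFFFFFu : 0u;
}

// Models the long chain: compare the loop index with every constant
// 0..63, AND each mask with the matching element, OR everything together.
// Exactly one mask is non-zero, so the result is the selected element.
std::uint32_t selectByCompareChain(const Bits& values, std::uint32_t index) {
    std::uint32_t result = 0;
    for (std::uint32_t k = 0; k < kCount; ++k) {
        const std::uint32_t mask = ieq(index, k);  // ieq ..., l(k)
        result |= mask & values[k];                // and + or pair
    }
    return result;
}

// The same read done as a single dynamic index, as in the short listing.
std::uint32_t selectByIndex(const Bits& values, std::uint32_t index) {
    assert(index < kCount);
    return values[index];
}
```

Both functions return the same value for every index from 0 to 63. The difference is that the chain touches all 64 elements on every read, which would match the instruction count growing with the array size.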

I tried saving the current array element to a temporary variable at the beginning of the loop, but it makes no difference: only a few more instructions are added to copy the data.

I just can’t understand why, in the first case, the loop is translated into such small and logical code, while in the second one it expands into such an enormous bunch of instructions. And I change only 1 line of code, nothing more (and that line does nothing useful, actually)! But every array access seems to be affected by this change.

Can someone please explain that to me? Are there some tricks with array indexing or loops in HLSL? I tried to find something about it on the Internet, to no avail. It seems I am missing something very simple and obvious, but I don’t get it. Big thanks in advance for any tips or explanations!
