This page documents improvements to the i965 driver that we would like to make in the future (time permitting). To see things that we don't intend to fix (e.g. known hardware bugs), see I965Errata.

Gen7 (Ivybridge) and newer

Combine GS_OPCODE_END_THREAD with earlier URB writes (easy)

Currently, we always use a separate URB write message to end geometry shader threads. If there's an immediately preceding URB write message, we can simply set the EOT bit on that, and drop the extra message.

Note that because EmitVertex() may be called from a loop, there might not always be an immediately preceding URB write.
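
A peephole along these lines could run late over the instruction list. The struct, opcode names, and eot field in this sketch are simplified stand-ins for the real vec4 GS backend IR, not actual Mesa types:

/* Hypothetical sketch, not the real backend IR: fold a thread-end
 * message into an immediately preceding URB write by setting its EOT
 * bit in the message descriptor. */
#include <stdbool.h>
#include <stddef.h>

enum opcode { OP_GS_URB_WRITE, OP_GS_THREAD_END, OP_OTHER };

struct inst {
   enum opcode op;
   bool eot;                  /* end-of-thread bit in the message descriptor */
   struct inst *prev, *next;
};

/* Returns true if the thread-end message could be folded away. */
static bool
fold_thread_end(struct inst *end)
{
   struct inst *prev = end->prev;

   /* Only safe when the instruction right before the thread-end is
    * itself a URB write; if EmitVertex() was called from a loop, there
    * may be no such write. */
   if (!prev || prev->op != OP_GS_URB_WRITE)
      return false;

   prev->eot = true;

   /* Unlink the now-redundant thread-end message. */
   prev->next = end->next;
   if (end->next)
      end->next->prev = prev;
   return true;
}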

Combine GS URB write messages (moderate)

Every GS EmitVertex() call generates its own URB write messages. When geometry shaders emit multiple vertices in a row, this produces several messages in a row. We could coalesce these into a single, longer message. We would need to verify that the offsets/lengths describe a contiguous block of URB space and that the combined message obeys the message length limits.

It may be easier to recognize this case if we combine the message header setup, GS_OPCODE_SET_WRITE_OFFSET, and GS_OPCODE_URB_WRITE into a single, logical message.
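
A first step might be a helper that decides whether two adjacent writes can be merged. The urb_write struct and the MAX_URB_WRITE_MLEN limit below are assumptions for illustration, not the real backend fields:

/* Hypothetical sketch: decide whether two back-to-back URB writes could
 * be emitted as one longer message. */
#include <stdbool.h>

#define MAX_URB_WRITE_MLEN 15   /* assumed message length limit, in GRFs */

struct urb_write {
   unsigned offset;   /* destination offset in the URB, in vec4 slots */
   unsigned len;      /* number of vec4 slots written */
   unsigned mlen;     /* message length in GRFs, including the header */
};

static bool
can_coalesce(const struct urb_write *a, const struct urb_write *b)
{
   /* The second write must start exactly where the first one ends, so
    * the merged message covers a contiguous block of URB space. */
   if (a->offset + a->len != b->offset)
      return false;

   /* The combined payload shares one header (subtract it once) and must
    * still fit under the message length limit. */
   return a->mlen + b->mlen - 1 <= MAX_URB_WRITE_MLEN;
}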

Optimize CS local ID push constant register usage (moderate)

See comment above gen7_cs_state.c:brw_cs_prog_local_id_payload_dwords.

Gen6 (Sandybridge) and newer

Use SSA form for the scalar backend (hard)

At some point, we want to use SSA form for the scalar backend. Some thoughts for that have been collected at I965ScalarSSA.

Improve performance of ARB_shader_atomic_counters

In the fragment shader, if all channels performing an atomic add target the same address, then doing a single atomic add of the number of active channels and manually producing each channel's result from the returned value should be more efficient than asking the hardware to perform the atomic operation per channel (even though the per-channel version is already only a single SEND instruction).

See nir_opt_uniform_atomics for code which could help with this.
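
A minimal, self-contained C sketch of the idea follows; the 8-channel width, exec_mask, and all names here are hypothetical stand-ins for the SIMD8 execution mask and generated code, not driver code:

/* One atomic add covers every active channel, and each channel
 * reconstructs its own return value from the shared base. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define WIDTH 8                       /* stands in for SIMD8 */

static atomic_uint counter;           /* the shared atomic counter */

static void
per_subgroup_atomic_add(uint32_t exec_mask, uint32_t result[WIDTH])
{
   /* A single atomic add of the number of active channels... */
   unsigned base = atomic_fetch_add(&counter, __builtin_popcount(exec_mask));

   /* ...and each active channel's result is the base plus the number of
    * active channels below it (a prefix popcount). */
   for (unsigned ch = 0; ch < WIDTH; ch++) {
      if (exec_mask & (1u << ch))
         result[ch] = base + __builtin_popcount(exec_mask & ((1u << ch) - 1));
   }
}

int
main(void)
{
   uint32_t result[WIDTH] = {0};

   per_subgroup_atomic_add(0xb5, result);   /* channels 0, 2, 4, 5, 7 active */
   for (unsigned ch = 0; ch < WIDTH; ch++)
      printf("channel %u -> %u\n", ch, result[ch]);
   return 0;
}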

Improve code generation for if statement conditions (easy)

Something like "if (a == b && c == d)" produces:

cmp.e.f0(8)     g32<1>D         g6<8,8,1>F      g31<8,8,1>F     { align1 WE_normal 1Q };
cmp.e.f0(8)     g34<1>D         g5<8,8,1>F      g33<8,8,1>F     { align1 WE_normal 1Q };
and(8)          g35<1>D         g32<8,8,1>D     g34<8,8,1>D     { align1 WE_normal 1Q };
and.ne.f0(8)    null            g35<8,8,1>D     1D              { align1 WE_normal 1Q };
(+f0) if(8) 0 0                 null            0x00000000UD    { align1 WE_normal 1Q switch };

when it would be better to produce something like:

cmp.e.f0(8)           g32<1>D         g6<8,8,1>F      g31<8,8,1>F     { align1 WE_normal 1Q };
(+f0) cmp.e.f0(8)     g34<1>D         g5<8,8,1>F      g33<8,8,1>F     { align1 WE_normal 1Q };
(+f0) if(8) 0 0                       null            0x00000000UD    { align1 WE_normal 1Q switch };

Return-from-main using HALT (easy)

Right now, when there's a "return" in the main() function, we lower all later assignments to be conditional moves. But, using the HALT instruction, we can tell the hardware to stop execution for some channels until a certain IP is reached. We use this for discards to have subspans stop executing once they're discarded (for efficiency), and we could do basically the same thing on a channel-wise basis for return-from-main. Take a look at FS_OPCODE_DISCARD_JUMP, FS_OPCODE_PLACEHOLDER_HALT, and patch_discard_jumps_to_fb_writes().

Loop invariant code motion (hard)

When there's a for loop like

for (int i = 0; i < max; i++) {
    result += texture2D(sampler0, offsets[i]) * texture2D(sampler1, vec2(0.0));
}

It would be nice to recognize that texture2D(sampler1, vec2(0.0)) doesn't depend on the loop iteration, and pull it outside of the loop. This is a standard compiler optimization that we lack.
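
As a hypothetical C analogue of the hoisted loop, with invariant_sample() and per_iteration() standing in for the two texture2D() calls above:

float invariant_sample(void);   /* stands in for texture2D(sampler1, vec2(0.0)) */
float per_iteration(int i);     /* stands in for texture2D(sampler0, offsets[i]) */

float
run_loop(int max)
{
   float result = 0.0f;
   /* Computed once, because it does not depend on the loop counter. */
   float hoisted = invariant_sample();

   for (int i = 0; i < max; i++)
      result += per_iteration(i) * hoisted;

   return result;
}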

Experiment with VFCOMP_NO_SRC in vertex fetch

Right now, if a VS takes two vec2 inputs (say, a 2D position and 2D texcoord), they will be put in the VUE as two vec4s, each formatted as (x, y, 0.0, 1.0).

The VUE could be shrunk if we noticed that and packed the two into one VUE slot, using VFCOMP_NO_SRC in the VERTEX_ELEMENT_STATE to write half of the VUE slot with each vec2 input. This assumes VFCOMP_NO_SRC works the way we hope it does (the "no holes" comments are concerning).

[Note that on ILK+ the destination offset in the VUE is no longer controllable, so only things which can share a VERTEX_ELEMENT_STATE can be packed. -chrisf]

Use vertex/fragment shaders in meta.c (easy)

This is partially done now, but using fragment and vertex shaders for metaops lets us push/pop less state and reduces the cost for Mesa and the 965 driver of calculating the resulting state updates.

Fully accelerated glBitmap() support (moderate)

You'd take the bitmap, upload it to a texture, and put it in a spare surface slot in the brw_wm_surface_state.c-related code. Use meta.c to generate a primitive covering the area to be rasterized by glBitmap(). Set a flag in the driver across the meta.c calls indicating that we're doing glBitmap(); then, in brw_fs_visitor.c, when the flag is set, prepend the shader with a texture sample from the bitmap and a discard.

Fully accelerated glDrawPixels() support (moderate)

Like the glBitmap() above, except you're replacing the incoming color instead of doing a discard.

Fully accelerated glAccum() support (easy)

Using FBOs in meta.c, this should be fairly easy, except that we don't have tests.

Fully accelerated glRenderMode(GL_SELECT) support (moderate)

This seems doable using meta.c and FBO rendering to pick the result out.

Fully accelerated glRenderMode(GL_FEEDBACK) support (hard)

This would involve transform feedback in some way.

Trim down memory allocation

Right now, running a minimal shader program takes up 24MB of memory. There's a big 16MB allocation for swrast spans, then some more allocations of roughly 1MB for TNL, 1.5MB for the register allocator, and then a bunch of noise.

On a core/GLES3 context, we skip the swrast and tnl allocations, but most apps aren't core apps. If we could delay the swrast/tnl allocations until needed, that would save people a ton of memory. The bitmap/drawpixels/rendermode tasks above are motivated by making it possible to not initialize swrast at all.

Gen4-5 (G965, G45, Ironlake)

Better clipper thread code (hard)

On older hardware, clipping is done via programmable shaders rather than fast fixed-function hardware. Looking at the output of INTEL_DEBUG=clip, one can easily see that the assembly we generate is quite inefficient. On Gen4, only two clipper threads can be active at a time, so we believe this could be a real bottleneck.

Writing a more efficient clipper could help improve performance. Making it work in terms of clip distances and following the algorithms employed by the Gen6+ hardware implementation might be a good idea.

Backport replicated-data color clears (easy but seems useless)

On Gen6+, performing clears using the SIMD16 repdata FB write message is significantly faster than using normal FB write messages. The shorter message length doesn't adequately explain the speed-ups, so we believe it's because the repdata message bypasses the color calculator/output merger.

We ported this back to Ironlake and saw no performance benefit. It's possible that it doesn't bypass the color calculator on earlier hardware.

The code is available in Ken's ilk-fast-clear branch.

Backport GL_ARB_blend_func_extended (easy)

While we don't use it in our 2D driver or cairo-gl yet (no glamor support), it should be a significant win when we do.

Backport GL_EXT_transform_feedback (hard)

You have to run the geometry shader and have it write the vertices out to a surface, like the gen6 code does. [You also have to do the accounting yourself, as the SVBI hardware support only exists on gen6+ -chrisf]

Backport HiZ support to Ironlake

On Sandybridge, this was worth a 10-20% performance boost on most apps. However, there are rumors that Gen5 hardware limitations may make HiZ not actually beneficial at all. Expect a lot of work in getting piglit fbo-depthstencil tests working.

Gen4 hardware does not support HiZ at all.

Use transposed URB reads on g4x+ (moderate)

This would cut the URB size needed between the SF and WM stages, allowing more concurrency. See the g45-transposed-read branch of ~anholt/mesa.