Performance tuning on the Intel graphics driver

The Intel graphics driver has several tools available to help OpenGL developers and driver developers track down where performance bottlenecks lie in the driver stack.

Step 1: Known issues

The first step is to run your application with the environment variable INTEL_DEBUG=perf set. This reports debug information for many known causes of performance problems to the console. Not all of them will produce visible performance improvements when fixed, but it's a good first step to see what might be going wrong.

Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your application, the CPU is busy in the GL driver, the GPU is waiting for the CPU, or the CPU is waiting for the GPU. Ideally, you get to the point where the CPU is waiting for the GPU infrequently but for a significant amount of time (however long it takes the GPU to draw a frame).

There are two initial tools: top and intel_gpu_top, which have related readouts. Pull up top while your application is running. Is the CPU usage around 90% or more? If so, then our performance analysis will be with sysprof.

sysprof

If the CPU is totally busy and the GPU isn't terribly busy, there is an excellent tool for debugging: sysprof. Just install it, run it as root (so you can get system-wide profiling), hit play, and later hit stop. The top-left area shows the flat profile sorted by the total time of each symbol plus its descendants. The top few entries are generally uninteresting (main() and its descendants consuming a lot), but eventually you get down to something interesting. Click it, and on the right you get the callchains to its descendants -- where all that time actually went. On the other hand, the lower left shows callers -- double-clicking one of those selects it as the symbol to view instead.

Note that you need debug symbols for sysprof's callgraphs to work, which is where most of its value lies. Most distributions offer debug symbol packages from their builds which can be installed separately, and sysprof will find them.

intel_gpu_top

If the CPU is mostly idle, then run intel_gpu_top from the intel-gpu-tools package. The important number is "ring busy". An app that keeps the GPU busy will see this number fluctuate between 90% and 100%.

There is a bunch of other data on this screen as well. Not much of it is of a lot of use -- the mysterious acronyms on the left come from the spec for the register we read that information from, and we don't even know what some of those acronyms mean (much less what those units being busy implies for an application or driver developer).

Identifying expensive GPU tasks

If intel_gpu_top says the GPU is busy, there are a few tools available for figuring out where all that GPU time is going.

One is, on gen7 (Ivybridge) and newer, the INTEL_DEBUG=shader_time environment variable. This inserts timestamp reads at the start and end of every shader, which are accumulated through atomic operations. It has a significant runtime performance cost, and there are some inaccuracies (some instructions don't end up counted, and only time in the shaders is counted, not the fixed-function hardware invoking those shaders), but it can quickly and accurately point to which stage of which shader might be the bottleneck.

If you don't have Ivybridge, or if shader_time is failing for some other reason, there is apitrace's -pgpu option. Capture a trace of your application using apitrace trace appname. This will produce appname.trace in the current directory. You can then use the qapitrace GUI's Profile option on it (which produces a timeline of which shaders were used in which draw calls and for how long), or you can summarize blame per shader program at the console with apitrace replay -pgpu appname.trace > appname.pgpu; profileshader.py < appname.pgpu, where profileshader.py comes from the apitrace tree's scripts directory. This tool has some inaccuracies as well, compared to shader_time: while it includes fixed-function overheads, it doesn't necessarily profile all calls that might result in time being spent on the GPU, and it has no way to pinpoint where within a call the time was spent (vertex shader, fragment shader, or other).

There is also the GL_ARB_timer_query extension. This is what apitrace's profiling is built on, and you can use it to instrument your application and track where GPU time is spent across large blocks of rendering.
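As a rough sketch of that kind of instrumentation (assuming a context that exposes ARB_timer_query and that the entry points are resolved through your GL loader; draw_scene() is a hypothetical stand-in for the rendering you want to measure):

    #include <stdio.h>
    #include <GL/gl.h>
    #include <GL/glext.h>

    /* Hypothetical stand-in for the rendering you want to measure. */
    extern void draw_scene(void);

    void time_draw_scene(void)
    {
        GLuint query;
        GLuint64 elapsed_ns;

        glGenQueries(1, &query);

        glBeginQuery(GL_TIME_ELAPSED, query);
        draw_scene();
        glEndQuery(GL_TIME_ELAPSED);

        /* Fetching the result immediately forces a CPU/GPU sync; a real
         * application would poll GL_QUERY_RESULT_AVAILABLE and read the
         * result a frame or two later instead. */
        glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_ns);
        printf("draw_scene: %llu ns on the GPU\n",
               (unsigned long long)elapsed_ns);

        glDeleteQueries(1, &query);
    }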

If you have identified a shader that is the bottleneck for your application, INTEL_DEBUG=fs in the environment will dump the generated fragment shader assembly on the 965, and INTEL_DEBUG=vs will dump the vertex shader assembly.

On the other hand, if intel_gpu_top doesn't show the GPU very busy, then you're probably stalling on it.

perf for GPU stalls

Not every stall waiting on the GPU is reported by INTEL_DEBUG=perf. To see all cases where the CPU waits on the GPU, use the perf tool from the Linux kernel (note: unrelated to INTEL_DEBUG=perf).

sudo perf record -f -g -e i915:i915_gem_request_wait_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time (very useful!), use the -a flag instead of a particular command name. Just ^C when you're done capturing data.

At exit, you'll have perf.data in the current directory. You can print out the results with perf report | less:

   100.00%  openarena  [vdso]                          [.] 0x000000ffffe424
                |          
                |--95.96%-- drm_intel_bo_subdata
                |          0xace13d53
                |          _mesa_BufferSubDataARB
                |          _mesa_meta_clear
                |          0xace14f86
                |          _mesa_Clear
                |          0x815d875
                |          
                 --4.04%-- drm_intel_gem_bo_wait_rendering
                           drm_intel_bo_wait_rendering
                           |          
                           |--54.67%-- 0xace167e8
                           |          _mesa_Flush
                           |          glXSwapBuffers
                           |          0xb774e115
                           |          SDL_GL_SwapBuffers
                           |          0x81a4afd
                           |          

This shows that all of the wait events occurred in openarena, and of those, 96% were in drm_intel_bo_subdata triggered by _mesa_Clear (many _mesa_Whatever functions are the implementation of glWhatever -- the actual glWhatever function is a runtime-generated stub so it doesn't get labeled in profiles). It also reveals a driver bug -- the _mesa_meta_clear() function is calling _mesa_BufferSubDataARB() on a buffer currently in use by the GPU, so the CPU blocks waiting for the GPU to finish.

The stalls under glXSwapBuffers in that trace are desired and expected: you don't want your application to run ahead and queue up 1000 frames of the same moment in the game when it will take a minute to render them all, so the driver stalls you when a whole frame is already queued up for rendering and hasn't started rendering yet.

If you want to see exactly where in the compiled code a wait occurred, you can use perf annotate:

perf annotate _mesa_meta_clear

Some common problems to find

A lot of CPU time spent in _mesa_ReadPixels() or _mesa_GetTexImage()

glReadPixels() and glGetTexImage() are slow. Do almost anything you can to avoid them. Before gen6 (Sandybridge), these reads happen at between 10MB/sec and 40MB/sec. On LLC-coherent hardware like Sandybridge and later, we can get them up to about 1GB/sec, but they still imply synchronization with the GPU.

PBOs, in limited cases, may be a win if you can delay mapping the PBO until later. Note that if you're memcpying out of the PBO, an extra copy has happened compared to a non-PBO readpixels (since we have to blit to the PBO to snapshot the data and to untile it). To get the blit path, the requested format has to match the framebuffer's format -- see _mesa_format_matches_format_and_type() for the actual test, and GL_OES_read_format for a way to query what the right format ought to be.
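As a rough sketch of that approach (the 640x480 size and the GL_BGRA / GL_UNSIGNED_BYTE format are assumptions for the example -- query GL_IMPLEMENTATION_COLOR_READ_FORMAT and GL_IMPLEMENTATION_COLOR_READ_TYPE for what your framebuffer actually wants -- and readback_start()/readback_finish() are hypothetical names):

    #include <string.h>
    #include <GL/gl.h>
    #include <GL/glext.h>

    #define WIDTH  640   /* assumed window size for the example */
    #define HEIGHT 480

    static GLuint pbo;

    /* Kick off the readback: with a pixel-pack PBO bound, glReadPixels()
     * queues a copy into the PBO and returns without waiting on the GPU. */
    void readback_start(void)
    {
        glGenBuffers(1, &pbo);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        glBufferData(GL_PIXEL_PACK_BUFFER, WIDTH * HEIGHT * 4, NULL,
                     GL_STREAM_READ);
        glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, 0);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    /* Map as late as you can (ideally a frame or two later) so the GPU
     * has finished the copy and the map doesn't stall. */
    void readback_finish(void *dst)
    {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        void *src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (src) {
            memcpy(dst, src, WIDTH * HEIGHT * 4);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        glDeleteBuffers(1, &pbo);
    }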

A lot of CPU time spent in brw_validate_state()

This is the translation of GL state into hardware state on Gen4 hardware. On previous generations, as each OpenGL call was made we would compute the updated hardware state and queue it to be emitted. Now, we queue up all state updates and only calculate the updated hardware state when we draw. This helps avoid extra computation for common OpenGL idioms like functions popping their state on exit. However, the flagging of what hardware state needs to be recomputed is handled by a limited number of state flags -- the strips and fans unit says that it needs to be recomputed for any change of viewport, scissor, or drawbuffers state, for example. Some stages may indicate that they need recalculation when another stage changes. There are plenty of places here for innocuous GL or driver state changes to cascade into recomputation.

There's a tool for looking into this: run your program with INTEL_DEBUG=state in the environment. You'll get a dump of how many times each state flag was set when drawing occurred. This can help narrow down why CPU time was spent in state validation. Note that for every BRW_NEW_CONTEXT we re-flag all state, and BRW_NEW_BATCH triggers a significant subset of those as well, so things triggering pipeline flushes (glFlush, glFinish, glReadPixels, or most anything reporting access of busy BOs under INTEL_DEBUG=perf) will cause a lot of recomputation to occur.

A lot of CPU time is spent in drm_clflush_pages()

This usually means that buffers are getting evicted from the aperture -- it shows up as a descendant of evict_something(), or as buffers being brought back into the aperture at i915_gem_execbuffer() time. Try tuning the graphics memory usage of the application at this point. Be sure to free unused GL objects -- freed textures and buffer objects get put back in the BO cache so that their memory never has to be evicted before being reused.
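A minimal sketch, assuming tex and vbo are hypothetical handles for a texture and a buffer object the application has finished with -- deleting them promptly lets the driver recycle their storage through the BO cache instead of evicting other buffers to make room:

    /* tex and vbo are hypothetical handles we no longer need; their backing
     * buffers go back into the BO cache for the next allocation to reuse. */
    glDeleteTextures(1, &tex);
    glDeleteBuffers(1, &vbo);
    tex = 0;
    vbo = 0;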

A lot of CPU time is spent in drm_intel_bo_exec()

If the caller of drm_intel_bo_exec is require_space, then you've just accumulated enough state updates that it's time to flush the batchbuffer. Generally, for busy applications this can take up 10% or so of the time, as the kernel has to do a bunch of validation and preparation to get the batchbuffer ready for submission. If it's taking more than that, the time is usually in a descendant such as drm_clflush_pages(), covered above.