Performance tuning on the Intel graphics driver
The Intel graphics driver now has several tools that help OpenGL developers and driver developers track down performance bottlenecks in the driver stack.
The primary question is figuring out whether the CPU is busy in your application, the CPU is busy in the GL driver, the GPU is waiting for the CPU, or the CPU is waiting for the GPU. Ideally, you get to the point where the CPU is waiting for the GPU infrequently but for a significant amount of time (however long it takes the GPU to draw a frame).
There are two initial tools: top and intel_gpu_top. Their readouts are related. Pull up top while your application is running. Is the CPU usage around 90% or higher? If so, the tool to measure with is sysprof, covered further down.
If the CPU is mostly idle, then run intel_gpu_top from the intel-gpu-tools package. The important number is "ring idle". An app that keeps the GPU busy will see this number fluctuate between 0% and 10%. If so, then the place to start tuning will be in your shaders or vertex data load. Setting INTEL_DEBUG=wm in the environment will dump out the fragment shader compilation on the 965, and INTEL_DEBUG=vs will dump out the vertex shader compilation. This shows what the OpenGL shaders end up mapping to in terms of GPU assembly.
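As a minimal sketch of collecting such a dump, the helper below launches a program with INTEL_DEBUG set and captures everything it prints. The function name run_with_intel_debug and the ./my_gl_app path are hypothetical, and since this page does not say which stream the driver writes the dump to, the sketch simply captures both standard output and standard error.

```python
import os
import subprocess


def run_with_intel_debug(command, flags):
    """Run `command` with INTEL_DEBUG=<flags> and return its combined output.

    Hypothetical helper: stdout and stderr are merged because the stream
    the driver uses for its dump is not specified here.
    """
    env = dict(os.environ)          # keep the caller's environment intact
    env["INTEL_DEBUG"] = flags      # e.g. "wm" or "vs", as described above
    result = subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold stderr into the captured text
        text=True,
        check=False,                # the application's exit code is not our concern
    )
    return result.stdout


# Example use (the application path is a placeholder):
# dump = run_with_intel_debug(["./my_gl_app"], "wm")
# open("wm-dump.txt", "w").write(dump)
```

Saving the output to a file makes it easier to compare the generated GPU assembly before and after a shader change.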
If the CPU is idle and intel_gpu_top showed that the GPU was also significantly idle, the next step is to use perf to find where in the driver we ended up blocking on the GPU. This requires kernel 2.6.32 (or an RC of it) with the following options enabled:
CONFIG_EVENT_TRACING=y
CONFIG_PERF_EVENTS=y
CONFIG_TRACING=y
CONFIG_TRACING_SUPPORT=y
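One way to check a kernel for these options is to scan its config file. The sketch below is an illustration only: the function name missing_options is made up for this example, and where the config text comes from is an assumption that varies by distribution (commonly a file under /boot named after the kernel release, or /proc/config.gz where the kernel exposes it).

```python
REQUIRED_OPTIONS = (
    "CONFIG_EVENT_TRACING",
    "CONFIG_PERF_EVENTS",
    "CONFIG_TRACING",
    "CONFIG_TRACING_SUPPORT",
)


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


# Example use (the path is an assumption; adjust for your system):
# with open("/boot/config-2.6.32") as f:
#     print(missing_options(f.read()))
```

An empty list means all four options are built in; anything listed needs to be enabled before the i915 tracepoints are usable from perf.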
Then, run your application under perf, which is found in linux-2.6/tools/perf:
perf record -f -g -e i915:i915_gem_request_wait_begin -c 1 openarena
At exit, you'll have perf.data in the current directory. You can print out the results with perf report:
    100.00%  openarena  [vdso]  [.] 0x000000ffffe424
             |
             |--95.96%-- drm_intel_bo_subdata
             |           0xace13d53
             |           _mesa_BufferSubDataARB
             |           _mesa_meta_clear
             |           0xace14f86
             |           _mesa_Clear
             |           0x815d875
             |
              --4.04%-- drm_intel_gem_bo_wait_rendering
                        drm_intel_bo_wait_rendering
                        |
                        |--54.67%-- 0xace167e8
                        |           _mesa_Flush
                        |           glXSwapBuffers
                        |           0xb774e115
                        |           SDL_GL_SwapBuffers
                        |           0x81a4afd
                        |
So, all of the wait events occurred in openarena, and of those, 96% were in drm_intel_bo_subdata triggered by _mesa_Clear (many _mesa_Whatever functions are the implementation of glWhatever -- the actual glWhatever function is a runtime-generated stub so it doesn't get labeled in profiles). This shows a driver bug -- the _mesa_meta_clear() function is calling _mesa_BufferSubDataARB() on a buffer currently in use by the GPU, so the CPU blocks waiting for the GPU to finish.
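The nested percentages in the report above are easy to misread. Assuming each nested figure is relative to its parent branch (which fits the two top-level branches summing to 100%), the share of all wait events that came through the glXSwapBuffers path is the product of the percentages along that path. A small worked example, with a made-up helper name:

```python
def absolute_share(*relative_percentages):
    """Multiply percentages along one call-graph path.

    Assumes every figure is relative to its parent branch; returns the
    resulting share of the whole profile, in percent.
    """
    share = 1.0
    for pct in relative_percentages:
        share *= pct / 100.0
    return share * 100.0


# drm_intel_bo_subdata via _mesa_Clear: 100% of 95.96%
clear_path = absolute_share(100.00, 95.96)

# _mesa_Flush via glXSwapBuffers: 54.67% of 4.04% of 100%
swap_path = absolute_share(100.00, 4.04, 54.67)

print(f"clear path: {clear_path:.2f}%  swap path: {swap_path:.2f}%")
```

Under that assumption the swap-buffers wait accounts for only about 2.2% of the recorded events, which is why the _mesa_Clear path is the one worth chasing.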
If you want to see exactly where in the compiled code the wait occurred, you can use perf annotate (though I've had some issues with getting results from this tool):
perf annotate _mesa_meta_clear
Now, if instead of CPU idle and GPU idle, you have CPU busy and GPU idle, you can use the same tool you use for other CPU consumption issues: sysprof. As of kernel 2.6.32, sysprof is built on top of the same perf framework as used above. Just install it, run it as root (so you can get system-wide profiling), hit play, and later hit stop. The top-left area shows the flat profile, sorted by the total for each symbol plus its descendants. The top few are generally uninteresting (main() and its descendants consuming a lot), but eventually you get down to something interesting. Click it, and to the right you get the callchains to its descendants, showing where all that time actually went. The lower left shows the callers; double-clicking one of those selects it as the symbol to view instead.
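The triage described so far can be summarized in a few lines. This is only a sketch: the function name, the returned strings, and the exact cut-offs are illustrative, with the defaults taken from the rough figures above (CPU usage around 90% or more counts as busy, "ring idle" of 10% or less counts as a busy GPU).

```python
def next_step(cpu_usage_pct, ring_idle_pct,
              cpu_busy_threshold=90.0, gpu_busy_idle_threshold=10.0):
    """Suggest which tool to reach for, given readings from top and intel_gpu_top.

    The thresholds are rough guides, not hard limits.
    """
    cpu_busy = cpu_usage_pct >= cpu_busy_threshold
    gpu_busy = ring_idle_pct <= gpu_busy_idle_threshold

    if cpu_busy:
        # CPU-bound: profile the application and driver with sysprof.
        return "sysprof"
    if gpu_busy:
        # GPU-bound: look at shaders and vertex data (INTEL_DEBUG=wm / vs).
        return "shaders"
    # Both mostly idle: find where the CPU blocks on the GPU with perf.
    return "perf"


print(next_step(cpu_usage_pct=97.0, ring_idle_pct=60.0))  # CPU busy, GPU idle
print(next_step(cpu_usage_pct=20.0, ring_idle_pct=5.0))   # CPU idle, GPU busy
print(next_step(cpu_usage_pct=20.0, ring_idle_pct=55.0))  # both idle
```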
Some common CPU usage problems to find:
A lot of time spent in _mesa_ReadPixels() -> span access.
glReadPixels is slow. Do almost anything you can to avoid it, short of glGetTexImage (which is just as bad, and probably costs you more to do). PBOs are not currently a win for glReadPixels.
A lot of time spent in brw_validate_state()
This is the translation of GL state into hardware state on Gen4 hardware. On previous generations, as each OpenGL call was made we would compute the updated hardware state and queue it to be emitted. Now, we queue up all hardware updates and only calculate the updated hardware state when we do drawing. This helps avoid extra computation for common OpenGL idioms like functions popping their state on exit. However, the flagging of what hardware state needs to be recomputed is handled by a limited number of state flags -- the strips and fans unit says that it needs to be recomputed for any change of viewport, scissor, or drawbuffers state, for example. Some stages may indicate that they need recalculation when another stage changes. There are plenty of places here for innocuous GL or driver state changes to cascade recomputation. There's a tool for looking into this: run your program with INTEL_DEBUG=state in the environment. You'll get a dump of how many times each state flag was set when drawing occurred. This can help narrow down why CPU was used in state validation. Note that for every BRW_NEW_CONTEXT we re-flag all state, so things that trigger pipeline flushes (glFlush, glReadPixels, glTexSubImage, glGetBufferSubData, etc.) will cause a lot of recomputation to occur.
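The deferred scheme can be illustrated with a toy model. Nothing below is driver code: the flag names, the unit table, and the ToyContext class are invented for the example, and only the idea (state changes set cheap dirty bits, and units are recomputed at draw time if any of their bits are set) comes from the description above.

```python
from collections import Counter

# Hypothetical flag bits; the real driver defines its own set.
NEW_VIEWPORT = 1 << 0
NEW_SCISSOR = 1 << 1
NEW_DRAWBUFFERS = 1 << 2
NEW_TEXTURE = 1 << 3
ALL_FLAGS = NEW_VIEWPORT | NEW_SCISSOR | NEW_DRAWBUFFERS | NEW_TEXTURE

# Each unit lists the flags that force it to be recomputed.
UNITS = {
    "strips_and_fans": NEW_VIEWPORT | NEW_SCISSOR | NEW_DRAWBUFFERS,
    "sampler": NEW_TEXTURE,  # invented second unit for contrast
}


class ToyContext:
    def __init__(self):
        self.dirty = ALL_FLAGS        # a new context flags everything
        self.recomputed = Counter()   # how often each unit was rebuilt

    def flag(self, bits):
        self.dirty |= bits            # cheap: nothing is recomputed yet

    def new_context(self):
        self.dirty = ALL_FLAGS        # models re-flagging all state

    def draw(self):
        for unit, mask in UNITS.items():
            if self.dirty & mask:     # rebuild only units with a dirty input
                self.recomputed[unit] += 1
        self.dirty = 0


ctx = ToyContext()
ctx.draw()                  # first draw rebuilds every unit
ctx.flag(NEW_VIEWPORT)      # push state...
ctx.flag(NEW_VIEWPORT)      # ...and pop it again before drawing
ctx.draw()                  # strips_and_fans rebuilt once, sampler untouched
ctx.new_context()           # e.g. after something that flushes the pipeline
ctx.draw()                  # everything rebuilt again
print(dict(ctx.recomputed))
```

The model also shows the cost of coarse flags: the viewport ends up unchanged after the push and pop, yet the unit is still rebuilt because the flag only records that something touched it.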
A lot of time is spent in drm_clflush_pages()
This usually means that buffers are being evicted from the aperture (in which case it shows up as a descendant of evict_something()) or being brought back into the aperture at i915_gem_execbuffer() time. Try tuning the graphics memory usage of the application at this point. Be sure to free unused GL objects -- freed textures and buffer objects get put back in the BO cache so that their memory never has to be evicted before being reused.
A lot of time is spent in drm_intel_bo_exec()
If the caller of drm_intel_bo_exec is require_space, then you've just accumulated enough state updates that it's time to flush the batchbuffer. For busy applications this can generally take up 10% or so of the time, as the kernel has to do a bunch of validation and preparation to get the batchbuffer ready for submission. If it's taking more than that, one of its descendants is usually drm_clflush_pages(), covered above.


