How does one write a new Mesa driver?
There are two basic aspects to writing a new driver.
First, define the public OpenGL / window system API. In the case of GLX, these are the glX-prefixed functions. For OSMesa these are the OSMesa-prefixed functions declared in include/GL/osmesa.h. You'll basically need functions for specifying frame buffer formats (bits per RGB, bits for Z, bits for stencil, etc.), functions for creating/destroying contexts, binding contexts to windows, etc.
Second, implement the internal functions needed by the "DD" (device driver) interface. Look at the osmesa.c file and grep for assignments to "ctx->Driver." members. This is where the driver hooks itself into the core of Mesa. In many cases we hook in fall-back functions (like the swrast ones).
This isn't simple (or even as straightforward as it used to be), but the system is designed for efficiency, flexibility and modularity. If the device driver interface were made for simplicity above all else there would probably only be two driver functions: dd_function_table::() and dd_function_table::().
The OSMesa driver is pretty simple. The only complexity comes from supporting all the different frame buffer formats like RGB, RGBA, BGRA, ABGR, etc. I think the Windows driver is in pretty good shape too. The XMesa driver (upon which Mesa's GLX is layered) is rather large because it supports lots of frame buffer formats and has optimized point/line/triangle rendering functions.
Mesa 4.x Implementation Notes
The big changes in Mesa were made between Mesa 3.4.x and Mesa 3.5. That's when Keith Whitwell re-modularized the source code into separate modules for T&L, s/w rasterization, etc.
This document is an overview of the internal structure of Mesa and is meant for those who are interested in modifying or enhancing Mesa, or just curious.
Note: Based on the original Mesa Implementation Notes and corrections by Brian Paul.
Library State and Contexts
OpenGL uses the notion of a state machine. Almost all OpenGL state is contained in one large structure, GLcontext.
The GLcontextRec structure actually contains a number of substructures which exactly correspond to OpenGL's attribute groups. This organization made glPushAttrib and glPopAttrib trivial to implement and proved to be a good way of organizing the state variables.
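The following is a minimal sketch, not Mesa's code, that makes the attribute-group point concrete. It splits a context's state into per-group substructures, so glPushAttrib/glPopAttrib reduce to struct copies. The names (struct context, DEPTH_BIT, FOG_BIT, push_attrib, pop_attrib) are hypothetical, and only two groups are modeled.

```c
/* Hypothetical, heavily simplified context. Each attribute group is its
 * own substructure, so saving or restoring a group is a struct copy. */
#define DEPTH_BIT 0x1u                 /* stand-ins for GL_DEPTH_BUFFER_BIT etc. */
#define FOG_BIT   0x2u
#define MAX_ATTRIB_STACK_DEPTH 16

struct depth_attrib { int Test; unsigned Func; int Mask; };
struct fog_attrib   { int Enabled; float Density; float Color[4]; };

struct attrib_node {                   /* one saved glPushAttrib() entry */
    unsigned mask;
    struct depth_attrib Depth;
    struct fog_attrib Fog;
};

struct context {
    struct depth_attrib Depth;         /* the depth-buffer attribute group */
    struct fog_attrib Fog;             /* the fog attribute group */
    struct attrib_node AttribStack[MAX_ATTRIB_STACK_DEPTH];
    int AttribStackDepth;
};

/* glPushAttrib(): copy each requested group wholesale.
 * Returns 0 where GL would raise GL_STACK_OVERFLOW. */
int push_attrib(struct context *ctx, unsigned mask)
{
    struct attrib_node *n;
    if (ctx->AttribStackDepth >= MAX_ATTRIB_STACK_DEPTH)
        return 0;
    n = &ctx->AttribStack[ctx->AttribStackDepth++];
    n->mask = mask;
    if (mask & DEPTH_BIT) n->Depth = ctx->Depth;
    if (mask & FOG_BIT)   n->Fog = ctx->Fog;
    return 1;
}

/* glPopAttrib(): restore exactly the groups that were saved.
 * Returns 0 where GL would raise GL_STACK_UNDERFLOW. */
int pop_attrib(struct context *ctx)
{
    const struct attrib_node *n;
    if (ctx->AttribStackDepth == 0)
        return 0;
    n = &ctx->AttribStack[--ctx->AttribStackDepth];
    if (n->mask & DEPTH_BIT) ctx->Depth = n->Depth;
    if (n->mask & FOG_BIT)   ctx->Fog = n->Fog;
    return 1;
}
```

Real Mesa also has to tell the driver about restored state after a pop. The sketch leaves that out.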
The immediate structure represents everything that can take place between glBegin and glEnd, and it can represent multiple glBegin/glEnd pairs. It can also be used to losslessly encode this information in display lists. See t_context.h and t_imm_api.c.
When either the vertex buffer becomes filled or a state change outside the glBegin/glEnd is made, we must flush the buffer. That is, we apply the vertex transformations, compute lighting, fog, texture coordinates etc. The various vertex transformations are implemented as software pipeline stages by the t_pipeline.c and
When we're outside of a glBegin/glEnd pair the information in this structure is retained pending either of the flushing events described above.
Note: Originally, Mesa didn't accumulate vertices in this way. Instead, glVertex transformed, lit and then buffered each vertex as it was received. When enough vertices to draw the primitive (1 for points, 2 for lines, 3 or more for polygons) had accumulated, the primitive was drawn and the buffer cleared.
The new approach of buffering many vertices and then transforming, lighting and clip testing is faster because it's done in a "vectorized" manner. See gl_transform_points in m_xform.c for an example.
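A vectorized transform in the spirit of gl_transform_points might look like the following sketch. It is not the m_xform.c code. It only shows the idea of transforming a whole buffer of vertices with one matrix in one tight loop, instead of handling one vertex per glVertex call. The function name and the packed-array layout are assumptions.

```c
#include <stddef.h>

/* Transform `count` object-space points (x, y, z, with w = 1) by a 4x4
 * matrix stored column-major, as OpenGL stores it (m[col * 4 + row]).
 * Input is packed xyz triples; output is packed xyzw clip coordinates. */
void transform_points3(const float m[16], const float *in, float *out,
                       size_t count)
{
    for (size_t i = 0; i < count; i++) {
        const float x = in[3 * i], y = in[3 * i + 1], z = in[3 * i + 2];
        float *o = &out[4 * i];
        /* w is implicitly 1, so the last column (translation) is added as-is. */
        o[0] = m[0] * x + m[4] * y + m[8]  * z + m[12];
        o[1] = m[1] * x + m[5] * y + m[9]  * z + m[13];
        o[2] = m[2] * x + m[6] * y + m[10] * z + m[14];
        o[3] = m[3] * x + m[7] * y + m[11] * z + m[15];
    }
}
```

The matrix is fetched once per buffer rather than once per glVertex call, and the loop body has no branches. That is what makes the buffered approach amenable to optimized (e.g. assembly) versions.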
For best performance Mesa clients should try to maximize the number of vertices between glBegin/glEnd pairs and use connected primitives when possible.
The point, line and polygon rasterizers are called via the Point, Line, and Triangle function pointers in the SWcontext structure in s_context.h. Whenever the library state is changed in a significant way, the context flag is raised. When glBegin is called, this flag is checked; if it is set, we re-evaluate the state to determine which rasterizers to use. Special-purpose rasterizers are selected according to the status of certain state variables such as flat vs. smooth shading, depth-buffered vs. non-depth-buffered, etc. The swrast_choose* functions do this analysis. It's up to the device driver to choose optimized or accelerated rasterization functions to replace those in the general software rasterizer.
In general, typical states (depth-buffered & smooth-shading) result in optimized rasterizers being selected. Non-typical states (stenciling, blending, stippling) result in slower, general purpose rasterizers being selected.
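The selection logic described above can be sketched as follows. The sketch assumes a hypothetical, much reduced state structure and stub rasterizers. The real swrast_choose* functions look at far more state.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the swrast state and hooks. */
struct vertex { float x, y, z, rgba[4]; };
typedef void (*triangle_func)(const struct vertex *v0,
                              const struct vertex *v1,
                              const struct vertex *v2);

struct sw_state {
    bool depth_test, smooth_shade;         /* "typical" state             */
    bool stencil, blend, poly_stipple;     /* forces the general path     */
    bool new_state;                        /* raised on state changes     */
    triangle_func Triangle;                /* currently chosen rasterizer */
};

enum raster_path { PATH_NONE, PATH_GENERAL, PATH_SMOOTH_DEPTH, PATH_FLAT_NODEPTH };
static enum raster_path last_path = PATH_NONE;   /* records which stub ran */

/* Stub rasterizers; real ones would walk edges and emit spans. */
static void tri_general(const struct vertex *a, const struct vertex *b,
                        const struct vertex *c)
{ (void)a; (void)b; (void)c; last_path = PATH_GENERAL; }
static void tri_smooth_depth(const struct vertex *a, const struct vertex *b,
                             const struct vertex *c)
{ (void)a; (void)b; (void)c; last_path = PATH_SMOOTH_DEPTH; }
static void tri_flat_nodepth(const struct vertex *a, const struct vertex *b,
                             const struct vertex *c)
{ (void)a; (void)b; (void)c; last_path = PATH_FLAT_NODEPTH; }

/* In the spirit of the swrast_choose* functions: pick a rasterizer from state. */
static void choose_triangle(struct sw_state *s)
{
    if (s->stencil || s->blend || s->poly_stipple)
        s->Triangle = tri_general;          /* non-typical: slow, general */
    else if (s->depth_test && s->smooth_shade)
        s->Triangle = tri_smooth_depth;     /* the common case, optimized */
    else if (!s->depth_test && !s->smooth_shade)
        s->Triangle = tri_flat_nodepth;
    else
        s->Triangle = tri_general;
}

/* Called from glBegin(): re-evaluate only if something changed. */
void validate_on_begin(struct sw_state *s)
{
    if (s->new_state) {
        choose_triangle(s);
        s->new_state = false;
    }
}
```

The cost of the analysis is paid once per significant state change, not once per primitive.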
- Point, Line, Triangle, glDrawPixels, glCopyPixels and glBitmap all use the sw_span structure and the functions in s_span.c to generate horizontal runs of pixels called spans. Processing includes window clipping, depth testing, stenciling, texturing, etc. After processing, the span is written to the frame buffer by calling a device driver function. The goal is to maximize the number of pixels processed inside loops and to minimize the number of function calls. Note: Pixel buffers are no longer present in the latest Mesa code (4.1). All fragment (pixel plus color, depth, texture coordinates) processing is done via the span functions in s_span.c.
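Here is a minimal sketch of span processing. It assumes a hypothetical span structure and driver hook rather than Mesa's actual sw_span. Every stage works on the whole run in its own loop and updates the per-fragment mask, and the driver is called once per span.

```c
#include <stdint.h>

#define MAX_WIDTH 2048          /* hypothetical maximum span length */

/* Hypothetical, trimmed-down span: a horizontal run of fragments. */
struct span {
    int x, y, end;              /* start position and fragment count */
    uint32_t z[MAX_WIDTH];      /* per-fragment depth */
    uint32_t rgba[MAX_WIDTH];   /* per-fragment packed color */
    uint8_t mask[MAX_WIDTH];    /* 1 = fragment still alive */
};

/* Hypothetical driver hook: write n pixels starting at (x, y), honoring mask. */
typedef void (*write_rgba_span_func)(void *drv, int n, int x, int y,
                                     const uint32_t rgba[],
                                     const uint8_t mask[]);

/* Window-clip and depth-test (GL_LESS, depth writes on) a whole span in
 * tight loops, then hand it to the driver in a single call.
 * zrow is the depth-buffer row for s->y. Returns surviving fragments. */
int process_span(struct span *s, uint32_t *zrow, int width, int height,
                 void *drv, write_rgba_span_func write_span)
{
    int i, alive = 0;
    if (s->y < 0 || s->y >= height)
        return 0;                            /* entirely outside the window */
    for (i = 0; i < s->end; i++) {           /* window clipping in x */
        const int x = s->x + i;
        s->mask[i] = (x >= 0 && x < width);
    }
    for (i = 0; i < s->end; i++) {           /* depth test + depth write */
        if (!s->mask[i])
            continue;
        if (s->z[i] < zrow[s->x + i]) {
            zrow[s->x + i] = s->z[i];
            alive++;
        } else {
            s->mask[i] = 0;
        }
    }
    if (alive)
        write_span(drv, s->end, s->x, s->y, s->rgba, s->mask);
    return alive;
}

/* A toy "driver" whose frame buffer is one row of 32-bit pixels. */
void write_span_to_row(void *drv, int n, int x, int y,
                       const uint32_t rgba[], const uint8_t mask[])
{
    uint32_t *row = drv;
    (void)y;
    for (int i = 0; i < n; i++)
        if (mask[i])
            row[x + i] = rgba[i];
}
```

Stencil, texturing, blending and the other stages would slot in as further loops over the same arrays, each clearing mask entries for fragments it rejects.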
There are three Mesa data types which are meant to be used by device drivers:
- GLcontext: this contains the Mesa rendering state
- GLvisual: this describes the color buffer (RGB vs. CI), whether or not there's a depth buffer, stencil buffer, etc.
- GLframebuffer: contains pointers to the depth buffer, stencil buffer, accum buffer and alpha buffers.
These types should be encapsulated by corresponding device driver data types. See xmesa.h and xmesaP.h for an example.
In OOP terms, GLcontext, GLvisual, and GLframebuffer are base classes which the device driver must derive from.
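One common way to do this derivation in C is sketched below, using hypothetical names. The driver's context holds a pointer to the core context. The core context keeps an opaque back-pointer, so driver callbacks, which receive the core context, can find their own data. See xmesa.h/xmesaP.h for the real arrangement; this is only an illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the core context (the "base class"). */
struct core_context {
    int depth_bits;            /* ...core rendering state... */
    void *DriverCtx;           /* back-pointer to the driver's wrapper */
};

/* Hypothetical driver context that wraps ("derives from") the core one. */
struct mydrv_context {
    struct core_context *gl_ctx;   /* the core part */
    int hw_fd;                     /* driver-private data */
};

struct mydrv_context *mydrv_create_context(int depth_bits, int hw_fd)
{
    struct mydrv_context *drv = calloc(1, sizeof *drv);
    if (!drv)
        return NULL;
    drv->gl_ctx = calloc(1, sizeof *drv->gl_ctx);
    if (!drv->gl_ctx) {
        free(drv);
        return NULL;
    }
    drv->gl_ctx->depth_bits = depth_bits;
    drv->gl_ctx->DriverCtx = drv;  /* lets callbacks find the wrapper */
    drv->hw_fd = hw_fd;
    return drv;
}

/* Driver callbacks receive the core context and "downcast" through it. */
static struct mydrv_context *mydrv_from_core(struct core_context *ctx)
{
    return ctx->DriverCtx;
}

int mydrv_get_fd(struct core_context *ctx)
{
    return mydrv_from_core(ctx)->hw_fd;
}

void mydrv_destroy_context(struct mydrv_context *drv)
{
    if (drv) {
        free(drv->gl_ctx);
        free(drv);
    }
}
```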
The structure dd_function_table, seen in dd.h, defines the device driver functions [^1]. By using a table of pointers, the device driver can be changed dynamically at runtime. For example, the X/Mesa and OS/Mesa (Off-Screen rendering) device drivers can co-exist in one library and be selected at runtime.
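The following is a stripped-down illustration of the idea. It assumes a hypothetical two-entry table; the real dd_function_table has many more members with different signatures. Because the core only calls through the table, picking a driver is just a matter of which table gets copied into the context.

```c
/* Hypothetical, tiny driver table in the spirit of dd_function_table. */
struct dd_table {
    const char *(*GetString)(void);
    void (*GetBufferSize)(void *drv, unsigned *width, unsigned *height);
};

struct fake_drawable { unsigned w, h; };

static const char *x_get_string(void)  { return "X/Mesa (sketch)"; }
static const char *os_get_string(void) { return "OSMesa (sketch)"; }

static void get_buffer_size(void *drv, unsigned *width, unsigned *height)
{
    const struct fake_drawable *d = drv;
    *width = d->w;
    *height = d->h;
}

static const struct dd_table xmesa_table  = { x_get_string,  get_buffer_size };
static const struct dd_table osmesa_table = { os_get_string, get_buffer_size };

/* The core only ever calls through ctx->Driver, so the driver can be
 * chosen (or swapped) at runtime without relinking. */
struct core_ctx {
    struct dd_table Driver;
    void *DriverCtx;
};

void core_bind_driver(struct core_ctx *ctx, int offscreen, void *drv)
{
    ctx->Driver = offscreen ? osmesa_table : xmesa_table;
    ctx->DriverCtx = drv;
}
```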
In addition to the device driver table functions, each Mesa driver has its own set of unique interface functions. For example, the X/Mesa driver has the XMesaCreateContext, XMesaBindWindow, and XMesaSwapBuffers functions while the Windows/Mesa interface has WMesaCreateContext, WMesaPaletteChange and WMesaSwapBuffers. New Mesa drivers need to both implement the dd_function_table functions and define a set of unique window system or operating system-specific interface functions.
The device driver functions can roughly be divided into three groups:
- Pixel span functions: read or write horizontal runs of RGB or color-index pixels. Each function takes an array of mask flags which indicate whether or not to plot each pixel in the span.
- Miscellaneous functions: window clearing, setting the current drawing color, enabling/disabling dithering, returning the current frame buffer size, specifying the window clear color, synchronization, etc. Most of these functions directly correspond to higher-level OpenGL functions.
- Point, line and polygon functions: if your graphics hardware or operating system provides accelerated point, line and polygon rendering operations, they can be utilized through the corresponding driver functions. Mesa calls these functions to "ask" the device driver for accelerated functions. If the device driver can provide an appropriate renderer, given the current Mesa state, then a pointer to that function can be returned. Otherwise the function pointers can just be set to NULL. Even if hardware-accelerated renderers aren't available, the device driver may implement tuned, special-purpose code for common kinds of points, lines or polygons. The X/Mesa device driver does this for a number of lines and polygons. See the xm_line.c and xm_tri.c files.
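The "ask the driver, fall back on NULL" protocol can be sketched like this. The state fields, the hook name mydrv_choose_triangle and the core_pick_triangle helper are hypothetical and are not Mesa's API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of rendering state and a triangle hook that takes
 * three vertex-buffer indices. */
struct render_state { bool smooth, depth_test, texturing; };
typedef void (*tri_func)(int v0, int v1, int v2);

static int sw_tris, hw_tris;   /* count which path actually ran */

/* Always-available software fallback (stub). */
static void sw_triangle(int a, int b, int c) { (void)a; (void)b; (void)c; sw_tris++; }

/* Pretend accelerated path for untextured, smooth, depth-tested triangles. */
static void hw_triangle(int a, int b, int c) { (void)a; (void)b; (void)c; hw_tris++; }

/* Driver side: offer an accelerated rasterizer for this state, or NULL. */
static tri_func mydrv_choose_triangle(const struct render_state *s)
{
    if (s->smooth && s->depth_test && !s->texturing)
        return hw_triangle;
    return NULL;                            /* "I can't do this one" */
}

/* Core side: ask the driver first, fall back to software on NULL. */
tri_func core_pick_triangle(const struct render_state *s,
                            tri_func (*ask)(const struct render_state *))
{
    tri_func f = ask ? ask(s) : NULL;
    return f ? f : sw_triangle;
}
```

The driver only has to recognize the states it handles well. Everything else falls through to the general software code.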
The overall relation of the core Mesa library, X device driver/interface, toolkits and application programs is shown in this diagram:
    +-----------------------------------------------------------+
    |                                                           |
    |                   Application Programs                    |
    |                                                           |
    |     +- glu.h -+------ glut.h -------+                     |
    |     |         |                     |                     |
    |     |   GLU   |        GLUT         |                     |
    |     |         |      toolkits       |                     |
    |     |         |                     |                     |
    +---------- gl.h ------------+-------- glx.h ----+          |
    |                            |                   |          |
    |         Mesa core          |   GLX functions   |          |
    |                            |                   |          |
    +---------- dd.h ------------+------------- xmesa.h --------+
    |                                                           |
    |             XMesa* and device driver functions            |
    |                                                           |
    +-----------------------------------------------------------+
    |                 Hardware/OS/Window System                 |
    +-----------------------------------------------------------+
The work starts in t_pipeline.c, where a driver-configurable pipeline is run in response to either the vertex buffer filling up or a state change.
The pipeline stages operate on context variables (such as vertex coordinates, colors, normals, texture coordinates, etc.), applying the necessary operations of the OpenGL pipeline (such as coordinate transformation, lighting, etc.).
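Here is a rough sketch of such a pipeline. It assumes a hypothetical stage structure with a name, an active flag and a run callback that returns false to stop the pipeline. It uses toy transform, lighting and render stages; the real t_pipeline.c stages carry more bookkeeping than this.

```c
#include <stdbool.h>
#include <stddef.h>

#define VB_SIZE 8   /* hypothetical vertex buffer capacity */

/* Hypothetical vertex buffer: only positions and colors. */
struct vertex_buffer {
    size_t count;
    float obj[VB_SIZE][4];     /* object coordinates */
    float clip[VB_SIZE][4];    /* after transformation */
    float color[VB_SIZE][4];   /* after lighting */
    size_t rendered;           /* vertices handed to the rasterizer */
};

/* A stage runs over the whole buffer; returning false stops the pipeline. */
struct pipeline_stage {
    const char *name;
    bool active;               /* e.g. lighting stage inactive when disabled */
    bool (*run)(struct vertex_buffer *vb);
};

static bool transform_stage(struct vertex_buffer *vb)
{
    for (size_t i = 0; i < vb->count; i++) {
        /* Toy "modelview-projection": translate x by 1. */
        vb->clip[i][0] = vb->obj[i][0] + 1.0f;
        vb->clip[i][1] = vb->obj[i][1];
        vb->clip[i][2] = vb->obj[i][2];
        vb->clip[i][3] = vb->obj[i][3];
    }
    return true;
}

static bool lighting_stage(struct vertex_buffer *vb)
{
    for (size_t i = 0; i < vb->count; i++)
        for (int k = 0; k < 4; k++)
            vb->color[i][k] = 1.0f;  /* toy lighting: everything white */
    return true;
}

static bool render_stage(struct vertex_buffer *vb)
{
    vb->rendered += vb->count;       /* a driver would emit primitives here */
    return false;                    /* last stage: nothing runs after it */
}

void run_pipeline(const struct pipeline_stage *stages, size_t nstages,
                  struct vertex_buffer *vb)
{
    for (size_t i = 0; i < nstages; i++) {
        if (!stages[i].active)
            continue;
        if (!stages[i].run(vb))
            break;
    }
}
```

Making the render stage the last entry is what lets a driver replace it with its own hardware-specific version.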
The last stage, rendering, calls into the driver's *_vb.c, which applies the viewport transformation, perspective divide and data type conversion, and packs the vertex data in the context (in the arrays tnl->vb->Ptr->data) into a driver-dependent buffer holding just the information relevant to the current OpenGL state (e.g., with/without texture, fog, etc.). The template t_dd_vbtmp.h does this using a Direct3D-like vertex structure format.
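As an illustration of what such an emit function does, here is a sketch with several assumptions: a hypothetical D3D-style vertex layout, clip coordinates and colors in packed float arrays, and a viewport given as scale/translate pairs. It performs the perspective divide and the viewport transform, converts colors to bytes, and drops the texture coordinates when texturing is off. Being a template, t_dd_vbtmp.h produces specialized variants rather than testing a flag per vertex; the runtime flag here just keeps the sketch short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical Direct3D-style hardware vertex: window x/y/z, 1/w, a packed
 * color, and a texture coordinate pair that is only sent when texturing. */
struct hw_vertex {
    float x, y, z, rhw;
    unsigned char rgba[4];
    float s, t;
};

struct viewport { float sx, sy, sz, tx, ty, tz; };   /* scale, translate */

/* Convert n clip-space vertices (packed xyzw) with float colors (packed rgba)
 * and optional texcoords (packed st) into hardware vertices in dst.
 * Returns the per-vertex stride used, which shrinks when texturing is off. */
size_t emit_vertices(const float *clip, const float *color, const float *tex,
                     size_t n, const struct viewport *vp, bool texturing,
                     unsigned char *dst)
{
    const size_t stride = texturing ? sizeof(struct hw_vertex)
                                    : offsetof(struct hw_vertex, s);
    for (size_t i = 0; i < n; i++) {
        const float *c = &clip[4 * i];
        const float oow = 1.0f / c[3];          /* perspective divide */
        struct hw_vertex v;
        v.x = c[0] * oow * vp->sx + vp->tx;     /* viewport transform */
        v.y = c[1] * oow * vp->sy + vp->ty;
        v.z = c[2] * oow * vp->sz + vp->tz;
        v.rhw = oow;
        for (int k = 0; k < 4; k++) {           /* float [0,1] -> ubyte */
            float f = color[4 * i + k];
            f = f < 0.0f ? 0.0f : (f > 1.0f ? 1.0f : f);
            v.rgba[k] = (unsigned char)(f * 255.0f + 0.5f);
        }
        if (texturing) {
            v.s = tex[2 * i];
            v.t = tex[2 * i + 1];
        }
        memcpy(dst + i * stride, &v, stride);   /* only the bytes in use */
    }
    return stride;
}
```

Clipping is assumed to have happened earlier in the pipeline, so w is non-zero here.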
For instance, if we needed to premultiply the texture coordinates, as is done in the tdfx and mach64 drivers, we would need to make a customized version of t_dd_vbtmp.h for that purpose, or change it and supply a configuration parameter to control that behavior.
This buffer is then used to render the primitives in *_tris.c. On most chips with DMA, this vertex data is intended to be copied almost verbatim into DMA buffers, with a header command.
But in the case of Mach64, where the commands are interleaved with each of the vertex data elements, it will be necessary to use a different structure of *Vertex to do the same, and probably to come up with a rather different implementation of t_dd_vbtmp.h as well.
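To contrast the two layouts, here is a sketch with made-up command encodings. CMD_DRAW_TRIS and REG_VERTEX_BASE are hypothetical, not real Mach64 or other chip values. The first routine writes one header and block-copies the vertices; the second interleaves a register-address word before every vertex element.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command encodings, for illustration only. */
#define CMD_DRAW_TRIS(nverts) (0x80000000u | (uint32_t)(nverts))
#define REG_VERTEX_BASE       0x0100u

/* Style 1 (most DMA chips): one header command, then the packed vertices
 * copied verbatim. Returns the number of dwords written. */
size_t emit_tris_verbatim(uint32_t *dma, const uint32_t *verts,
                          size_t nverts, size_t vert_dwords)
{
    dma[0] = CMD_DRAW_TRIS(nverts);
    memcpy(&dma[1], verts, nverts * vert_dwords * sizeof *verts);
    return 1 + nverts * vert_dwords;
}

/* Style 2 (Mach64-like): every vertex element is preceded by a command word
 * naming the register it goes to, so vertices cannot be block-copied. */
size_t emit_tris_interleaved(uint32_t *dma, const uint32_t *verts,
                             size_t nverts, size_t vert_dwords)
{
    size_t n = 0;
    for (size_t v = 0; v < nverts; v++)
        for (size_t e = 0; e < vert_dwords; e++) {
            dma[n++] = REG_VERTEX_BASE + (uint32_t)e;   /* register address */
            dma[n++] = verts[v * vert_dwords + e];      /* the data itself */
        }
    return n;
}
```

The interleaved form doubles the DMA traffic and rules out a single memcpy. That is why a D3D-style vertex layout buys less on such a chip.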
Indeed, if the chip expects something quite different from the D3D vertices, one will certainly want to look at this. In the meantime, it may be simplest to go with a "normal-looking" *_vb.c and do some extra work in the triangle/line/point functions. The ffb and glint drivers are a bit like this, I think.
All this mechanism is controlled with function pointers in the context, which are re-chosen whenever the OpenGL state changes enough. These function pointers can also be overwritten with those in the sw_* modules to fall back to software rendering.
How about the main X drawing surface? Are 2 extra "window sized" buffers allocated for primary and secondary buffers in a page-flipping configuration?
Right now, we don't do page flipping at all. Everything is a blit from back to front. The biggest problem with page flipping is detecting when you're in full screen mode, since OpenGL doesn't really have a concept of full screen mode. We want a solution that works for existing games. So we've been designing a solution for it. It should get implemented fairly soon since we need it for antialiasing on the V5.
In the current implementation the X front buffer is the 3D front buffer. When we do page flipping we'll continue to do the same thing. Since you have an X window that covers the screen it is safe for us to use the X surface's memory. Then we'll do page flipping. The only issue will be falling back to blitting if the window is ever moved from covering the whole screen.
This section explains several concepts associated with clipping.
-- Leif Delgass
The scissors are register settings that determine a hardware clipping rect in window coords. Any part of a primitive or other drawing operation that extends beyond the scissors is not drawn. The scissors can be set through GL commands. This has nothing to do with perspective clipping in the pipeline, just the final window coordinates.
Cliprects are used to determine what parts of the context/window should be redrawn to handle overlapping windows. The more overlapping windows, the more cliprects you have. These need to be passed to the DRM, which does a clear or swap for each cliprect. Again, these are for 2D clipping after rasterization and are not part of the pipeline. Things get a bit complicated by the fact that there can be separate cliprects for the front and back buffers.
The cliprects are stored in device-independent structures, hence the code is abstracted out of the individual drivers.
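Here is a sketch of per-cliprect clearing, assuming half-open rectangles and a hypothetical clear callback. glClear honors the scissor, so when scissoring is enabled each cliprect is first intersected with the scissor rectangle. A buffer swap would loop over the cliprects the same way but ignore the scissor.

```c
#include <stddef.h>

/* Half-open rectangle [x1, x2) x [y1, y2) in window coordinates. */
struct rect { int x1, y1, x2, y2; };

typedef void (*clear_rect_func)(void *hw, const struct rect *r);

static int rect_intersect(struct rect *out, const struct rect *a,
                          const struct rect *b)
{
    out->x1 = a->x1 > b->x1 ? a->x1 : b->x1;
    out->y1 = a->y1 > b->y1 ? a->y1 : b->y1;
    out->x2 = a->x2 < b->x2 ? a->x2 : b->x2;
    out->y2 = a->y2 < b->y2 ? a->y2 : b->y2;
    return out->x1 < out->x2 && out->y1 < out->y2;
}

/* Clear once per visible cliprect, further restricted by the scissor when
 * it is enabled (scissor == NULL means disabled). Returns the number of
 * hardware clears issued. */
int clear_with_cliprects(void *hw, const struct rect *cliprects, int nrects,
                         const struct rect *scissor, clear_rect_func clear)
{
    int issued = 0;
    for (int i = 0; i < nrects; i++) {
        struct rect r = cliprects[i];
        if (scissor && !rect_intersect(&r, &cliprects[i], scissor))
            continue;                       /* nothing visible here */
        clear(hw, &r);
        issued++;
    }
    return issued;
}

/* Toy backend: accumulate the cleared area into a long. */
void count_area(void *hw, const struct rect *r)
{
    *(long *)hw += (long)(r->x2 - r->x1) * (r->y2 - r->y1);
}
```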
The viewport array holds values to determine how to translate transformed, clipped, and projected vertex coordinates into window coordinates. This is the last stage of the pipeline. The values are based on the size and position of the "drawable", also known as the drawing area of the window for the context.
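Here is a sketch of how those values might be computed. It uses the usual glViewport/glDepthRange mapping (scale = w/2, translate = x + w/2, and similarly for y and z). It adds two driver-side adjustments as assumptions: adding the drawable's on-screen origin, and flipping y for hardware whose y axis points down. The struct and function names are hypothetical, and pixel-center conventions are ignored.

```c
/* Scale/translate that maps normalized device coordinates to window (here,
 * screen) coordinates: win = ndc * scale + translate. */
struct viewport_xform { float sx, sy, sz, tx, ty, tz; };

/* Build the viewport values from glViewport(x, y, w, h), glDepthRange(n, f)
 * and the drawable's position (dx, dy) and height dh on screen. Assumes the
 * hardware's y axis points down, as X's does, so y is flipped. */
struct viewport_xform compute_viewport(int x, int y, int w, int h,
                                       double n, double f, float depth_max,
                                       int dx, int dy, int dh)
{
    struct viewport_xform v;
    v.sx = w / 2.0f;
    v.tx = (float)x + w / 2.0f + (float)dx;
    v.sy = -(h / 2.0f);                          /* GL y up -> screen y down */
    v.ty = (float)dh - ((float)y + h / 2.0f) + (float)dy;
    v.sz = depth_max * (float)((f - n) / 2.0);
    v.tz = depth_max * (float)((f - n) / 2.0 + n);
    return v;
}
```

When the window moves, only dx and dy change. That is why the driver recomputes this array whenever the drawable's position or size is updated.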
What follows is a description of the texture management system in the DRI. This is all based on the Radeon driver in the 11-Feb-2002 CVS of the mesa-4-0-branch. While it is based on the Radeon code, all drivers except gamma and tdfx seem to use the same scheme (and virtually identical code).
- Excluding the texture backing store, which is managed by Mesa, texture data is tracked in two places. The per-screen (card) SAREA divides each type of texturable memory (on-card, AGP, etc.) into an array of fixed-size chunks (RADEONSAREAPriv.texList in programs/Xserver/hw/xfree86/drivers/ati/radeon_sarea.h). The number of these chunks is a compile-time constant, and it cannot be changed without destroying the universe. That is, any change here will present major compatibility issues. Currently, this constant is 64. So, for the new 128MB Radeon 8500 cards, each block of memory will likely be 1MB or more. This is not as bad as it may first seem; see below. The usage of each type of memory is also tracked per-context.
- The per-context memory tracking is done using a memHeap_t. Allocations from the memHeap_t (see lib/GL/mesa/src/drv/common/mm.c) are done at byte granularity. When a context needs a block of texture memory, it is allocated from the memHeap_t. This results in very little memory fragmentation (within a context). After the allocation is made, the map of allocated memory in the SAREA is updated (radeonUpdateTexLRU in lib/GL/mesa/src/drv/radeon/radeon_texmem.c). Basically, each block of memory in texList that corresponds to an allocated region (in the per-context memHeap_t) is marked as allocated.
- The texList isn't just an array of blocks. It's also a priority queue (linked list). As each texture is accessed, the blocks that it occupies are moved to the head of the queue. In the Radeon code, each time a texture is uploaded or bound to a texture unit, the blocks of memory (in AGP space or on-card) are moved to the head of the texList queue.
- If an allocation (via the memHeap_t) fails when texture space is allocated (radeonUploadTexImages in lib/GL/mesa/src/drv/radeon/radeon_texmem.c), blocks at the end of the texList queue are freed until the allocation can succeed. This may be an area where the algorithm could be improved. For example, it might be better to find the largest free block (in the memHeap_t) and release memory around that block in LRU or least-often-used fashion until the allocation can succeed. This may be too difficult to get right, or too slow when done right. Someone would have to try it and see.
- Each time a direct-rendering client detects that another client has held the per-screen lock, radeonGetLock is called. This synchronizes the per-context view of the hardware state. Part of this synchronization is synchronizing the view of texture memory. In addition to the texList, the SAREA holds a texAge array. This array stores the generation number of each of the texture heaps. If a client detects in radeonGetLock that the generation number of a heap has changed, it calls radeonAgeTextures for that heap. radeonAgeTextures runs through the texList looking for blocks with a more recent generation number. Each block that has a newer generation is passed to radeonTexturesGone.
- radeonTexturesGone searches the per-context memHeap_t for an allocated region matching the block with the updated generation. When a matching region is found, it is freed, and if the region was for a texture in the local context, the local state of that texture is updated. If the updated block (from the global context) is "in-use" (i.e., some other context has stolen that block from the current context), the block is re-allocated and marked as in-use by another context.
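The shared LRU list can be sketched as below. This is a simplification with several assumptions, not the radeon code. Nodes are linked by small indices, with a sentinel at index 64. Eviction simply releases whole blocks from the tail until enough are free; the real code frees the textures owning those blocks and uses the memHeap_t for the actual allocation. Every change stamps the block with an age so other contexts can find what changed.

```c
#define NR_TEX_REGIONS 64              /* the per-heap block count above */
#define LRU_HEAD       NR_TEX_REGIONS  /* index of the list's sentinel node */

/* One fixed-size block of a texture heap, as a node in a circular LRU list
 * kept in shared memory. Links are small indices, not pointers, because
 * each process maps the shared area at a different address. */
struct tex_region {
    unsigned char next, prev;
    unsigned char in_use;
    unsigned int age;                  /* generation of the last change */
};

/* list must have NR_TEX_REGIONS + 1 entries (the last is the sentinel). */
void lru_init(struct tex_region *list)
{
    for (int i = 0; i <= NR_TEX_REGIONS; i++) {
        list[i].next = (unsigned char)((i + 1) % (NR_TEX_REGIONS + 1));
        list[i].prev = (unsigned char)((i + NR_TEX_REGIONS) % (NR_TEX_REGIONS + 1));
        list[i].in_use = 0;
        list[i].age = 0;
    }
}

/* A texture using block i was uploaded or bound: move it to the head. */
void lru_touch(struct tex_region *list, int i, unsigned int age)
{
    list[list[i].prev].next = list[i].next;      /* unlink */
    list[list[i].next].prev = list[i].prev;
    list[i].prev = LRU_HEAD;                     /* insert after sentinel */
    list[i].next = list[LRU_HEAD].next;
    list[list[LRU_HEAD].next].prev = (unsigned char)i;
    list[LRU_HEAD].next = (unsigned char)i;
    list[i].in_use = 1;
    list[i].age = age;
}

/* Simplified eviction: walk from the tail (least recently used), releasing
 * in-use blocks until at least `needed` blocks are free. Released blocks get
 * a new age so other contexts notice. Returns blocks evicted, or -1. */
int lru_evict(struct tex_region *list, int needed, unsigned int age)
{
    int free_blocks = 0, evicted = 0;
    for (int i = 0; i < NR_TEX_REGIONS; i++)
        free_blocks += !list[i].in_use;
    for (int i = list[LRU_HEAD].prev; i != LRU_HEAD && free_blocks < needed;
         i = list[i].prev) {
        if (list[i].in_use) {
            list[i].in_use = 0;
            list[i].age = age;
            free_blocks++;
            evicted++;
        }
    }
    return free_blocks >= needed ? evicted : -1;
}

/* Another context, after noticing the heap's age changed: count blocks
 * changed since it last looked (these would go to a "textures gone" step). */
int lru_count_newer(const struct tex_region *list, unsigned int last_seen)
{
    int n = 0;
    for (int i = 0; i < NR_TEX_REGIONS; i++)
        n += list[i].age > last_seen;
    return n;
}
```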
It seems that about two years ago a few people (judging from the CVS log) took a stab at factoring the common code out into lib/GL/mesa/src/drv/common, but it isn't used and hasn't been touched (at least not in CVS) since 4-April-2000.
Note: the tdfx texture memory management code is different because:
- It was originally developed before the scheme Keith Whitwell implemented for the i810 and mga drivers (and later used for the R128 and Radeon).
- There are some idiosyncrasies with the Voodoo3, such as two separate banks of TRAM and the need to store alternate mipmap levels in alternate banks.
- We did everything through the Glide interface, rather than working directly with the hardware.
-- Ian Romanick
How often are checks done to see if things need to be clipped/redrawn/redisplayed?
The locking system is designed to be highly efficient. It is based on a two-tiered lock. Basically it works like this:
The client wants the lock, so it uses CAS (compare and swap; I was corrected on the instruction's name, though I knew that was the functionality). If the client was the last application to hold the lock, you're done and you move on.
If it wasn't the last one, then we use an IOCTL to the kernel to arbitrate the lock.
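Here is a sketch of the two-tier scheme using C11 atomics. It assumes a hypothetical lock-word layout: the last holder's id plus a HELD bit. The kernel_lock function is only a stand-in for the ioctl: it spins, whereas the kernel would block the caller. The return value tells the caller whether it has to check for lost state, which is what the next paragraph describes.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define LOCK_HELD 0x80000000u   /* hypothetical "held" bit in the lock word */

/* Shared lock word: the id of the last holder, plus LOCK_HELD while held. */
struct hw_lock { _Atomic unsigned int lock; };

/* Stand-in for the kernel ioctl that arbitrates a contended lock. It just
 * spins here; a real kernel would put the caller to sleep until woken. */
static void kernel_lock(struct hw_lock *l, unsigned int ctx)
{
    for (;;) {
        unsigned int cur = atomic_load(&l->lock);
        if (!(cur & LOCK_HELD) &&
            atomic_compare_exchange_weak(&l->lock, &cur, ctx | LOCK_HELD))
            return;
    }
}

/* Take the lock. Returns true if some other context held it since we
 * last did, i.e. hardware state may need to be revalidated. */
bool get_hw_lock(struct hw_lock *l, unsigned int ctx)
{
    unsigned int expected = ctx;               /* free, and we held it last */
    if (atomic_compare_exchange_strong(&l->lock, &expected, ctx | LOCK_HELD))
        return false;                          /* fast path: one CAS */
    kernel_lock(l, ctx);                       /* slow path: ask the kernel */
    return true;
}

/* Release the lock, leaving our id behind so our next fast path works. */
void put_hw_lock(struct hw_lock *l, unsigned int ctx)
{
    atomic_store(&l->lock, ctx);
}
```

In the common case of one active 3D client, every lock/unlock pair is a single CAS plus a store, with no system call.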
In this case some or all of the state on the card may have changed. The shared memory carries a stamp number for the X server. When the X server does a window operation it increments the stamp. If the client sees that the stamp has changed, it uses a DRI X protocol request to get new window location and clip rects. This only happens on a window move. Assuming your clip rects/window position hasn't changed, the redisplay happens entirely in the client.
The client may have other state to restore as well. In the case of the tdfx driver we have three more flags for command fifo invalid, 3D state invalid, textures invalid. If those are set the corresponding state is restored.
So, if the X server wakes up to process input, it currently grabs the lock but doesn't invalidate any state. I'm actually fixing this now so that it doesn't grab the lock for input processing.
If the X server draws, it grabs the lock and invalidates the command FIFO.
If the X server moves a window, it grabs the lock, updates the stamp, and invalidates the command FIFO.
If another 3D app runs, it grabs the lock, invalidates the command FIFO, invalidates the 3D state and possibly invalidates the texture state.
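The bookkeeping above can be sketched as follows. As an assumption, each "invalid" flag is modeled as the id of whoever last owned that piece of state, so several clients can each notice independently. The field names and SERVER_ID are hypothetical, and the tdfx driver's real shared-area layout may differ. Input processing by the X server is not modeled because, as noted, it invalidates nothing.

```c
#include <stdbool.h>

/* Hypothetical shared-area fields. Each "invalid" flag is modeled as the id
 * of whoever last owned that piece of hardware state, so every client can
 * tell independently whether it must restore it. */
#define SERVER_ID 0
struct sarea {
    unsigned int stamp;                     /* bumped by X on window moves */
    int fifo_owner, ctx_owner, tex_owner;
};

#define NEED_CLIPRECTS 0x1u
#define NEED_FIFO      0x2u
#define NEED_3D_STATE  0x4u
#define NEED_TEXTURES  0x8u

/* X server side, each done while holding the lock. */
void server_draws(struct sarea *sa)        { sa->fifo_owner = SERVER_ID; }
void server_moves_window(struct sarea *sa) { sa->stamp++; sa->fifo_owner = SERVER_ID; }

/* Another 3D client ran; it may or may not have touched texture memory. */
void other_client_runs(struct sarea *sa, int id, bool used_textures)
{
    sa->fifo_owner = id;
    sa->ctx_owner = id;
    if (used_textures)
        sa->tex_owner = id;
}

/* Client side, after a contended lock: decide what must be refreshed and
 * take ownership again. Returns a mask of NEED_* bits. */
unsigned int client_revalidate(struct sarea *sa, int id, unsigned int *last_stamp)
{
    unsigned int need = 0;
    if (sa->stamp != *last_stamp) {         /* window moved: new cliprects */
        need |= NEED_CLIPRECTS;
        *last_stamp = sa->stamp;
    }
    if (sa->fifo_owner != id) { need |= NEED_FIFO;     sa->fifo_owner = id; }
    if (sa->ctx_owner != id)  { need |= NEED_3D_STATE; sa->ctx_owner = id; }
    if (sa->tex_owner != id)  { need |= NEED_TEXTURES; sa->tex_owner = id; }
    return need;
}
```

Only a stamp change costs a round trip to the X server. The other cases are restored entirely in the client, matching the description above.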
[^1] Many of the functions which used to be in the dd_function_table are now moved into the tnl or swrast modules.