Accelerated Graphics Port (AGP)

AGP is a dedicated high-speed bus that allows the graphics controller to fetch large amounts of data directly from system memory. It uses a Graphics Address Re-Mapping Table (GART) to provide a physically-contiguous view of scattered pages in system memory for DMA transfers.

With AGP, main memory is specifically used for advanced three-dimensional features, such as textures, alpha buffers, and ZBuffer``s.

There are two primary AGP usage models for 3D rendering that have to do with how data are partitioned and accessed, and the resultant interface data flow characteristics.
In the DMA model, the primary graphics memory is the local memory associated with the accelerator, referred to as the local frame buffer. 3D structures are stored in system memory, but are not used (or executed) directly from this memory; rather they are copied to primary (local) memory (the DMA operation) to which the rendering engine's address generator makes its references. This implies that the traffic on the AGP tends to be long, sequential transfers, serving the purpose of bulk data transport from system memory to primary graphics (local) memory. This sort of access model is amenable to a linked list of physical addresses provided by software (similar to the operation of a disk or network I/O device), and is generally not sensitive to a non-contiguous view of the memory space.
In the execute model, the accelerator uses both the local memory and the system memory as primary graphics memory. From the accelerator's perspective, the two memory systems are logically equivalent; any data structure may be allocated in either memory, with performance optimization as the only criterion for selection. In general, structures in system memory space are not copied into the local memory prior to use by the accelerator, but are executed in place. This implies that the traffic on the AGP tends to be short, random accesses, which are not amenable to an access model based on software resolved lists of physical addresses. Because the accelerator generates direct references into system memory, a contiguous view of that space is essential; however, since system memory is dynamically allocated in random 4K pages, it is necessary in the execute model to provide an address mapping mechanism that maps random 4K pages into a single contiguous, physical address space.

Note: The AGP supports both the DMA and the execute model. However, since a primary motivation of the AGP is to reduce growth pressure on local memory, the execute model is the design focus.

AGP also allows to issue several access requests in a pipelined fashion while waiting for the data transfers to occur. Such pipelining of access requests results in having several read and/or write requests outstanding in the corelogic's request queue at any point in time.


Frequently Asked Questions

Why not use the existing XFree86 AGP manipulation calls?

You have to understand that the DRI functions have a different purpose than the ones in XFree86. The DRM has to know about AGP, so it talks to the AGP kernel module itself. It has to be able to protect certain regions of AGP memory from the client side 3D drivers, yet it has to export some regions of it as well. While most of this functionality (most, not all) can be accomplished with the /dev/agpgart interface, it makes sense to use the DRM's current authentication mechanism. This means that there is less complexity on the client side. If we used /dev/agpgart, then the client would have to open two devices, authenticate to both of them, and make half a dozen calls to agpgart, and only then care about the DRM device.

Note: As a side note, the XFree86 calls were written after the DRM functions.

Also to answer a previous question about not using XFree86 calls for memory mapping, you have to understand that under most OS's (probably Solaris as well), XFree86's functions will only work for root privileged processes. The whole point of the DRI is to allow processes that can connect to the X server to do some form of direct to hardware rendering. If we limited ourselves to using XFree86's functionality, we would not be able to do this. We don't want everyone to be root.

How do I use AGP?

You can also use this test program as a bit more documentation as to how agpgart is used.

How to allocate AGP memory?

Generally programs do the following:

  1. open /dev/agpgart
  2. ioctl(ACQUIRE)
  3. ioctl(INFO) to determine amount of memory for AGP
  4. mmap the device
  5. ioctl(SETUP) to set the AGP mode
  6. ioctl(ALLOCATE) a chunk o memory, specifying offset in aperture
  7. ioctl(BIND) that same chunk o memory Every time you update the GART, you have to flush the cache and/or TLB's. This is expensive. Therefore, you allocate and bind the pages you'll use, and mmap() just returns the right pages when needed.

Then you need to have a remap of the AGP aperture in the kernel which you can access. Use ioremap to do that.

After that you have access to the AGP memory. You probably want to make sure that there is a write-combining MTRR over the aperture. There is code in mga_drv.c in our kernel directory that shows you how to do that.

If one has to insert pages in order to check for -EBUSY errors and loop through the entire GART, wouldn't it be better if the driver filled up ''pg_start'' of the ''agp_bind'' structure instead of the user filling it up?

All this allocation should be done by only one process. If you need memory in the GART you should be asking the X server for it (or whatever your controlling process is). Things are implemented this way so that the controlling process can know intimate details of how memory is laid out. This is very important for the I810, since you want to set tiled memory on certain regions of the aperture. If you made the kernel do the layout, then you would have to create device specific code in the kernel to make sure that the backbuffer/dcache are aligned for tiled memory. This adds complexity to the kernel that doesn't need to be there, and imposes restrictions on what you can do with AGP memory. Also, the current X server implementation (4.0) actually locks out other applications from adding to the GART. While the X server is active, the X server is the only one who can add memory. Only the controlling process may add things to the GART, and while a controlling process is active, no other application can be the controlling process.

Microsoft's VGART does things like you are describing I believe. I think it's bad design. It enforces a policy on whoever uses it, and is not flexible. When you are designing low level system routines I think it is very important to make sure your design has the minimum of policy. Otherwise when you want to do something different you have to change the interface, or create custom drivers for each application that needs to do things differently.

How does the DMA transfer mechanism works?

Here's a proposal for an zero-ioctl (best case) DMA transfer mechanism.

Let's call it 'kernel ringbuffers'. The premise is to replace the calls to the 'fire-vertex-buffer' ioctl with code to write to a client-private mapping shared by the kernel (like the current SAREA, but for each client).

Starting from the beginning:

  • Each client has a private piece of AGP memory, into which it will put secure commands (typically vertices and texture data). The client may expand or shrink this region according to load.
  • Each client has a shared user/kernel region of cached memory. (Per-context SAREA). This is managed like a ring, with head and tail pointers.
  • The client emits vertices to AGP memory (as it currently does with DMA buffers).
  • When a state change, clear, swap, flush, or other event occurs, the client:Grabs the hardware lock.Re-emits any invalidated state to the head of the ring.Emits a command to fire the portion of AGP space as vertices.Updates the head pointer in the ring.Releases the lock.
  • The kernel is responsible for processing all of the rings. Several events might cause the kernel to examine active rings for commands to be dispatched: * A flush ioctl. (Called by impatient clients) * A periodic timer. (If this is low overhead?) * An interrupt previously emitted by the kernel. (If timers don't work) Additionally, for those who've been paying attention, you'll notice that some of the assumptions that we currently use to manage hardware state between multiple active contexts are broken if client commands to hardware aren't executed serially in an order which is knowable to the clients. Otherwise, a client that grabs the heavy lock doesn't know what state has been invalidated or textures swapped out by other clients.

This could be solved by keeping per-context state in the kernel and implementing a proper texture manager. That's something we need to do anyway, but it's not a requirement for this mechanism to work.

Instead, force the kernel to fire all outstanding commands on client ringbuffers whenever the heavyweight lock changes hands. This provides the same serialized semantics as the current mechanism, and also simplifies the kernel's task as it knows that only a single context has an active ring buffer (the one last to hold the lock).

An additional mechanism is required to allow clients to know which pieces of their AGP buffer is pending execution by the hardware, and which pieces of the buffer are available to be reused. This is also exactly what NVvertex_array_range_ requires.

Kernel patches

Kernel patch to get AGP 8X working on a VIA KT400. The patch is for a 2.4.21pre5 but patches, compiles and runs perfectly in a 2.4.21. This patch is needed for this chipset because it works only in 8X when you plug a 8X card.

CategoryGlossary, CategoryFaq