20130703

Thinking About a Better Graphics API

Public documentation on AMD's GCN architecture:
AMD's Public Southern Islands Instruction Set Architecture
src/gallium/drivers/radeonsi (Linux Open Source Driver Source)

At home I've been tempted to take a Linux box with an AMD GCN GPU and just go direct to the hardware, since AMD has opened up a bunch of the required documentation (ISA and driver source). The ultimate graphics API is virtually no API at all, with no CPU work to draw anything. The engine would simply leverage the 64-bit virtual address space of the GPU, give every possible resource a unique virtual address, then for any given GPU target pre-compile (meaning at author time) not just the shaders, but also the command buffer chunks required to draw a resource into any of the render targets in the game's pipeline. For each render target in the rendering pipeline, the engine would maintain a GPU-side array of 64-bit pointers to the resources to test for visibility. Then, for GPU-side command buffer generation, walk the GPU-sorted array of visible resources and copy the command buffer chunk required to render each resource (or just copy a pointer to the chunk if the GPU supports hierarchical command buffers). Command and uniform buffers could be compiled double-buffered if required, for easy constant updates. Loading and rendering a resource becomes as easy as streaming into physical pages, updating page tables on the GPU and/or CPU (collected and done once per frame), then adding 64-bit pointer(s) to the buffer(s) used for visibility testing.
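
A minimal sketch of the data layout this implies, written as plain C++ for clarity; every struct and function name here is a hypothetical illustration, not a real driver or hardware interface, and in the real design the copy loop would be a compute shader writing into GPU memory:

#include <cstdint>
#include <cstring>

// A command buffer chunk pre-compiled at author time: raw packets that draw
// one resource into one specific render target of the game's pipeline.
struct CmdChunk {
    const uint32_t* packets;  // chunk contents (CPU-visible mapping assumed)
    uint32_t        sizeDw;   // size in dwords
    uint64_t        gpuVA;    // same chunk in the GPU's virtual address space
};

// Per-render-target list of resources to visibility test: just 64-bit
// pointers (GPU virtual addresses) living in a GPU-side array.
struct VisibilityList {
    uint64_t* resourceVA;
    uint32_t  count;
};

// GPU-side command buffer generation: for each visible resource, append its
// pre-compiled chunk to the frame's command buffer (or append just a pointer
// to it, if the hardware supports hierarchical/indirect command buffers).
uint32_t* appendVisibleChunks(uint32_t* dst,
                              const CmdChunk* chunks,
                              const uint32_t* visibleIndex,
                              uint32_t visibleCount)
{
    for (uint32_t i = 0; i < visibleCount; ++i) {
        const CmdChunk& c = chunks[visibleIndex[i]];
        std::memcpy(dst, c.packets, c.sizeDw * sizeof(uint32_t));
        dst += c.sizeDw;
    }
    return dst;
}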

Constants in the virtual address space would be mapped to write-combined memory on the CPU, so either the GPU or the CPU could update constants at run time. All the CPU overhead of generating command buffers (and resource constants on GCN) goes away. This removes the power and work typically consumed by two or more CPU threads and greatly increases the maximum number of draw calls possible per frame.
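
A sketch of how double-buffered constants in write-combined memory might look; the mapping itself is assumed to come from whatever allocator exposes the GPU virtual address space, and all names here are hypothetical:

#include <cstdint>

struct ObjectConstants {
    float worldMatrix[16];
    float tint[4];
};

struct ConstantSlot {
    ObjectConstants* cpuWC;  // write-combined CPU mapping of the constants
    uint64_t         gpuVA;  // same memory in the GPU's virtual address space
};

// Two copies per object so the CPU can write frame N+1 while the GPU reads
// frame N; the pre-compiled command chunk for each frame parity already
// points at the matching gpuVA, so no command buffer patching is needed.
void updateConstants(ConstantSlot slots[2], uint32_t frameIndex,
                     const ObjectConstants& src)
{
    ConstantSlot& slot = slots[frameIndex & 1];
    // Sequential writes only: write-combined memory is very slow to read back.
    *slot.cpuWC = src;
}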

This same concept can easily be extended to the CPU side as well. Why allocate memory at runtime? Just set up resources in virtual memory space large enough to cover the worst case, and use physical memory backing at runtime to switch between common backing pages for "non-resident" resources and resident, physically backed pages. Lay everything in the data flow network out into linear streams which are trivially prefetched by the CPU. Duplicate data if required, and use compression on the packed source data sitting on disk, network, or solid-state storage.
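
One possible sketch of the CPU-side reserve-then-back idea, using POSIX mmap/mprotect/madvise as the mechanism (VirtualAlloc MEM_RESERVE/MEM_COMMIT would be the Windows analog); offsets and sizes are assumed page-aligned and error handling is omitted:

#include <sys/mman.h>
#include <cstddef>

// Reserve worst-case virtual address space up front; no physical pages are
// committed, and PROT_NONE keeps stray accesses from faulting pages in.
void* reserveWorstCase(size_t bytes)
{
    return mmap(nullptr, bytes, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

// Back a sub-range with physical pages once the resource streams in
// (pages are actually allocated on first write fault).
void makeResident(void* base, size_t offset, size_t bytes)
{
    mprotect(static_cast<char*>(base) + offset, bytes,
             PROT_READ | PROT_WRITE);
}

// Release the physical backing for a now non-resident resource and protect
// the range again, keeping its virtual addresses stable.
void makeNonResident(void* base, size_t offset, size_t bytes)
{
    madvise(static_cast<char*>(base) + offset, bytes, MADV_DONTNEED);
    mprotect(static_cast<char*>(base) + offset, bytes, PROT_NONE);
}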

6 comments:

  1. "pre-compile (meaning at author time) not just the shaders, but the command buffer chunks"

    But something like this requires binary compatibility of the GPUs on the shader side as well as on the command processor side. The more low-level the API gets, the more restricted the GPU design becomes. This kind of API works well on consoles, where the hardware is constant for a couple of years, but desktop hardware is a moving target.

    I agree that we could get a more flexible and faster direct API compared to e.g. OpenGL, but we have to find a tradeoff between the current state of APIs and this ideal of direct access to a fixed hardware design. More direct state access, exposing the 64-bit GPU resource pointers, fully shader-controlled sampling, etc. would be a great start. Also: a shader intermediate format for OpenGL, and fetching and replaying command streams at runtime (similar to the idea of display lists)...

    Replies
    1. If the aim were portability, one could build a translator that converts the cooked data from the original target GPU chipset to another. In the age of digital download of content, given the user's platform, just stream the cooked data matching that GPU chipset. Think of this as a cook-time emulator for different platforms.

  2. As a long-time console developer and current mobile developer, I 100% support any effort towards a lean driver/shim layer on top of the HW, instead of the slow and heavy drivers we have right now that consume unnecessary CPU and bus resources for no real reason. Please let me know if you need contributors :)

    Although HW compatibility is always what's behind this layering process, I believe there are universal concepts that somehow didn't make it into (or were removed from) the API, and that could make for light CPU processing: command buffers (with execution logic for complex calls, loops, and similar), GPU-only synchronization and fences that DO NOT use CPU interrupts, precompiled command buffers and binaries (pre-swizzled textures, pre-built shaders, etc.), as well as device caps (which at least can help developers switch to generic APIs for unsupported HW).

    Good luck!

  3. Having worked on drivers in the past with the mantra "Get out of the application's way", I would find it useful to read about specific examples of "slow and heavy" drivers today.

    What were conditions under which you saw high overhead?
    How were you measuring the driver's overhead?
    What GPUs were being used?

    Appreciate any details folks can share!

  4. This comment has been removed by the author.

  5. As I see it, the overhead comes from the need to pass everything back and forth between the GPU and the CPU, and between kernel mode and user mode within the OS. These issues are not derived from poorly written drivers, but from the driver model, API designs, and OS architecture.
    I think that, ultimately, we need to design an entirely new system architecture (OS, driver model, API) that takes modern use cases and hardware capabilities into account from the kernel level up. Some abstraction is necessary unless you want to rewrite code for every new piece of hardware, or stick with what you have until it dies, but the real issue is that people keep adding new abstraction layers, new virtual machine layers, new API layers, etc. That is what needs fixing: the belief that another layer won't hurt because computers are fast enough to handle it.
