20140530

Re: Joshua Barczak's "OpenGL Is Broken"

Reply to Joshua Barczak's "OpenGL Is Broken".

Driver Quality
Driver quality and performance problems with DX are largely hidden because hardware vendors get pre-release builds of major games, actively fix things against those builds, and then put out driver updates when those games ship. This same process could work equally well for GL.

Valve's VOGL project is going to be a key enabler of this process in that developers will be able to send GL captures of their application directly to hardware vendors. Vendors won't need to get builds, and problems are way easier to track down and debug from a trace.

Shader Compiling
Both the DX and GL solutions to shader compiling have advantages and disadvantages:

(a.) DX byte code lacks support for features in {AMD,Intel,NVIDIA} GPUs.
(b.) GLSL supports those features through extensions.
(c.) DX has no built-in extension support.
(d.) DX byte code is a "vector" ISA, and modern GPUs are "scalar".
(e.) DX compile step removes important information for the JIT optimization step.
(f.) GLSL maintains all information useful for JIT optimization.
(g.) DX does not require parsing.
(h.) GLSL offline compiled to minimal GLSL does not require expensive parsing.

The following could greatly help the situation for GL:

First remove or reduce the draw state based shader recompile problem: add explicit GLSL layout qualifiers for the draw state which would otherwise result in run-time shader recompiles (advantage GLSL). Then move shader compiling and any reflection to an external process (not the standard GL interface). The external interface would spit out {deviceAndDriverUID, binaryBlob} which the app can send to GL. This opens up explicit application-side parallel compiling and caching of shaders, and the ability to ship at launch with binaries for the most important GPUs and drivers. In the short term, developers use GLSL to GLSL offline pre-compiling to minimize run-time parsing for the external compiler interface. Then longer term, switch to an extensible portable scalar IR.
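
The application-side caching half of this proposal can be sketched as follows, in Python for brevity. This is only an illustration under assumptions: no such external compiler exists, so `fake_external_compile` is a hypothetical stand-in that returns tagged bytes, and `ShaderBinaryCache` and the UID strings are made-up names. It shows binaries keyed by {deviceAndDriverUID, source hash}, cache misses compiled in parallel on application threads, and shipped binaries preloaded.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def fake_external_compile(device_and_driver_uid, glsl_source):
    """Hypothetical stand-in for the external compiler process.

    Returns a (deviceAndDriverUID, binaryBlob) pair. The "binary" is just
    tagged bytes, not real GPU code.
    """
    blob = (device_and_driver_uid + "\0" + glsl_source).encode("utf-8")
    return device_and_driver_uid, blob

class ShaderBinaryCache:
    """Application-owned cache keyed by (deviceAndDriverUID, source hash)."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.blobs = {}         # (uid, source_hash) -> binaryBlob
        self.compile_count = 0  # number of compiler invocations so far

    @staticmethod
    def key(uid, source):
        return uid, hashlib.sha256(source.encode("utf-8")).hexdigest()

    def preload(self, uid, source, blob):
        """Install a binary that shipped with the application."""
        self.blobs[self.key(uid, source)] = blob

    def get_many(self, uid, sources):
        """Return blobs for all sources, compiling only the misses, in parallel."""
        missing = [s for s in dict.fromkeys(sources)
                   if self.key(uid, s) not in self.blobs]
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda s: self.compile_fn(uid, s), missing))
        for source, (_, blob) in zip(missing, results):
            self.blobs[self.key(uid, source)] = blob
        self.compile_count += len(missing)
        return [self.blobs[self.key(uid, s)] for s in sources]

cache = ShaderBinaryCache(fake_external_compile)
uid = "vendorX-gpuY-driver1"  # made-up identifier
sources = ["void main(){}", "void main(){gl_Position=vec4(0.0);}"]
first = cache.get_many(uid, sources)   # two compiles
second = cache.get_many(uid, sources)  # served from the cache
other = cache.get_many("vendorX-gpuY-driver2", sources[:1])  # new driver, new binary
```

Keying on the device and driver UID means a driver update simply misses the cache and recompiles, while binaries shipped at launch go in through `preload`.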

Threading
Breaking this down into core sub-problems:
(a.) Streaming using DMA engines to transform from linear to tiled.
(b.) Streaming using GPU or DMA to copy long-living static data to VRAM.
(c.) Submission of state changes, draws and compute dispatches.

Neither (a.) nor (b.) requires any CPU load; both are more of a synchronization issue. Explicit application memory management, resource lifetime management, and synchronization can solve those issues. Problem (c.) is the core usage case for threading. Starting with various aspects of command buffer generation:

(d.) Building constants. First do not use global uniforms, use uniform buffers only. Leverage ARB_buffer_storage to set up persistently mapped memory and carve it up for building constant buffers. Now the driver is decoupled: constant buffers can be built in parallel on any thread with zero driver involvement.

(e.) Binding of textures/samplers. Go bindless. Place texture handles into constant buffers directly. So all texture state changes can be done in parallel from another thread simply by writing into a constant buffer.

(f.) Vertex attribute fetch. Switch to explicit manual attribute fetch in the vertex shader. No more binding fixed function vertex buffers. Again can use another thread to write bindless handles to vertex data into a constant buffer.

(g.) The application can optionally cache these bits of constant buffer; things which do not change each frame do not need to get updated.

(h.) Index buffer binding. Switch to using a giant buffer for index data, carve up manually for meshes. No need to bind new index buffer per draw. No state changes.

(i.) The majority of what is left is the following sequence {change shader(s), bind constant buffer(s), draw} when a material changes, or {bind constant buffer(s), draw} when drawing a new mesh for the same material. The binding of constant buffers uses the same buffer each time with a different offset, a case which is trivial for a driver to optimize for.
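
Points (d.), (h.), and (i.) can be sketched together. This is a CPU-only model in Python, not GL code: `LinearArena` is a hypothetical stand-in for a persistently mapped buffer, the 256-byte alignment is an assumed value (a real application would query GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT), and the constants are a single made-up color per draw.

```python
import struct

UBO_OFFSET_ALIGNMENT = 256  # assumed value; query the driver in a real application

def align_up(value, alignment):
    return (value + alignment - 1) // alignment * alignment

class LinearArena:
    """Carves sub-ranges out of one large buffer (stand-in for mapped memory)."""

    def __init__(self, size_bytes):
        self.memory = bytearray(size_bytes)
        self.head = 0

    def alloc(self, size_bytes, alignment):
        offset = align_up(self.head, alignment)
        if offset + size_bytes > len(self.memory):
            raise MemoryError("arena exhausted")
        self.head = offset + size_bytes
        return offset

def upload_indices(index_arena, indices):
    """(h.) Place 32-bit indices in the shared buffer; return (first_index, count)."""
    data = struct.pack("<%dI" % len(indices), *indices)
    offset = index_arena.alloc(len(data), 4)
    index_arena.memory[offset:offset + len(data)] = data
    return offset // 4, len(indices)

def write_constants(constants_arena, color):
    """(d.) Write one draw's constants into mapped memory; return the bind offset."""
    data = struct.pack("<4f", *color)
    offset = constants_arena.alloc(len(data), UBO_OFFSET_ALIGNMENT)
    constants_arena.memory[offset:offset + len(data)] = data
    return offset

index_arena = LinearArena(1 << 16)
constants_arena = LinearArena(1 << 16)

# Done once at load time: both meshes share one index buffer.
quad = upload_indices(index_arena, [0, 1, 2, 2, 1, 3])
tri = upload_indices(index_arena, [0, 1, 2])

# (i.) Per draw: same constant buffer, new offset, then a sub-range of indices.
draws = [
    (write_constants(constants_arena, (1.0, 0.0, 0.0, 1.0)),) + quad,
    (write_constants(constants_arena, (0.0, 1.0, 0.0, 1.0)),) + tri,
]
# draws is a list of (constant_offset, first_index, index_count)
```

The arena itself is not thread safe; in this sketch each worker thread would own its own arena or a pre-assigned range of one. Each tuple in `draws` would map onto something like a glBindBufferRange call followed by a draw call, which is the only part left for the submission thread.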

So with the exception of (i.), GL already supports "threading", and note there is no suggestion of using the existing multi-draw or uber-shaders. On that topic, one possibility for future GL or hardware would be a version of multi-draw which supports binding new constant buffers per draw in the multi-draw. I highly doubt these 50K draws/frame at 33ms cases involve 50K shader changes, because the GPU would be bound on context changes. Instead it is a lot of meshes, so that version of multi-draw would cover this case well.

Another issue to think about with "threading" is the possibility of a latency difference between single threaded issue, and parallel generation plus kick. In the single thread issue case, the driver in theory can kick often simply by writing the next memory address which the GPU front-end is set to block on if it runs out of work. Parallel generation likely involves batching commands then syncing CPU threads, then kicking off chunks of commands. This could in theory have higher latency if the single thread approach was able to saturate the GPU frontend on its own (think in terms of granularity before a kick).
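
A toy model makes the granularity argument concrete. Everything here is an assumption for illustration: the function names, the per-command cost, the kick size, and the sync cost are made up, and real drivers are far more complicated. It only compares when the GPU front-end first receives work in each scheme.

```python
def first_kick_single(cost_per_command_us, commands_per_kick):
    """Single-thread issue: the first kick happens after one small chunk is written."""
    return cost_per_command_us * commands_per_kick

def first_kick_parallel(total_commands, threads, cost_per_command_us, sync_us):
    """Parallel generation: each thread builds its batch, threads sync, then kick."""
    per_thread = -(-total_commands // threads)  # ceiling division
    return per_thread * cost_per_command_us + sync_us

def total_time_single(total_commands, cost_per_command_us):
    """Time for one thread to generate every command."""
    return total_commands * cost_per_command_us

# Made-up numbers: 10000 commands at 1 us each.
single_first = first_kick_single(cost_per_command_us=1, commands_per_kick=100)
parallel_first = first_kick_parallel(total_commands=10000, threads=4,
                                     cost_per_command_us=1, sync_us=50)
single_total = total_time_single(total_commands=10000, cost_per_command_us=1)
```

With these numbers the parallel path finishes generating sooner (2550 us against 10000 us), but the GPU sees its first work at 2550 us instead of 100 us. Whether that matters depends on whether the single thread could keep the front-end fed on its own.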

Not attempting to argue against threading here; in fact it might make sense to be able to do both immediate command buffer generation with low latency, and larger granularity parallel command generation. However I strongly believe this issue requires very careful consideration. If it is possible to complete a given amount of work on one thread in the same wall clock time it takes another API to complete in many threads, the best answer might be to stick to one thread and take the work-efficient approach.

Texture And Sampler State Are Orthogonal
EDITED: Only min LOD is needed for streaming, removed some of the prior points which had false logic.

In many ways combined {texture,sampler} has advantages for optimized pipelines. Here is one such example which would require a few changes to GL's bindless interface (such as moving to 32-bit handles).

The expected future usage case is GPU-side handles referencing combined {texture,sampler} with a huge (effectively limitless) descriptor table which holds the physical descriptors. The descriptor table would get updated at fence sync points. Applications would update the descriptor table relatively infrequently, recycling entries as texture streaming moves resources in and out. Some entries would get updated each frame. For instance after texture stream-in and before stream-out, the min LOD would get fractionally changed to avoid a visible pop.

This setup has some key advantages for typical engines which do texture streaming. The GPU handle would remain constant for a loaded-in resource in the descriptor table, even under min LOD updates and streaming in or out of mip levels (which might change the underlying virtual memory address). An engine can effectively fill in the 32-bit handles in constant buffer memory for the resources once and reuse them (texture streaming has no effect on GPU handles).
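
The stable-handle idea can be modeled on the CPU in a few lines. This is a sketch under assumptions, not an existing GL interface: `DescriptorTable`, its method names, the addresses, and the sampler strings are all hypothetical. A handle is just an index into the table, and updates are queued and applied only at a fence sync point.

```python
class DescriptorTable:
    """Toy descriptor table: a handle is an index that stays fixed while the entry changes."""

    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.free = list(range(capacity - 1, -1, -1))  # recyclable handles
        self.pending = []  # updates waiting for the next fence sync point

    def allocate(self, address, min_lod, sampler):
        """Claim a free entry for a combined {texture, sampler} descriptor."""
        handle = self.free.pop()
        self.entries[handle] = {"address": address, "min_lod": min_lod,
                                "sampler": sampler}
        return handle

    def queue_update(self, handle, **fields):
        """Record a change; nothing is visible until fence_sync()."""
        self.pending.append((handle, fields))

    def fence_sync(self):
        """Apply queued updates at a point where the GPU is not reading the table."""
        for handle, fields in self.pending:
            self.entries[handle].update(fields)
        self.pending.clear()

    def release(self, handle):
        """Recycle the entry once the resource is streamed out."""
        self.entries[handle] = None
        self.free.append(handle)

table = DescriptorTable(capacity=4)
albedo = table.allocate(address=0x1000, min_lod=3.0, sampler="aniso8_wrap")
constant_buffer = [albedo]  # the handle is written into constant memory once

# Stream in finer mips: the backing address may move, and min LOD fades
# down over a couple of frames to avoid a visible pop.
table.queue_update(albedo, address=0x8000)
for min_lod in (2.5, 2.0):
    table.queue_update(albedo, min_lod=min_lod)
    table.fence_sync()
```

After streaming, the entry points at the new address with the lower min LOD, while `constant_buffer` was never touched.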

Engines utilizing clustered forward shading might end up using a lot of different samplers in a given shader. For instance different anisotropy on textures of a given material. Or tiling (material textures) vs non-tiling (non-surface textures, like lightmaps or decals).

Resource descriptors might be effectively random accessed with respect to the locations accessed at runtime. However these are block loads and well cached. So random access is not expected to be a problem.

Combined {texture, sampler} minimizes the number of block loads, but increases the total amount loaded (a little more constant cache utilization). Each texture descriptor is paired in aligned memory with the sampler descriptor. One block load for both.

Separate {texture}, {sampler} maximizes the number of block loads, but can decrease the total amount loaded. Using 8 textures with 3 samplers for instance requires an extra 3 block loads compared to the combined case. Also if the compiler repurposes sampler descriptor scalars for constants at runtime, that sampler might need to get reloaded later (maybe the same sampler is used at both the beginning and end of the shader), which could increase the example case beyond 3 extra block loads. Also separate texture and sampler requires loading more handles from constant buffers. However this is likely only a fractional change because proper setup of constants will leverage block loads to get the amortized cost to a fractional scalar block load per constant.
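
The trade-off in the 8 textures with 3 samplers example is simple arithmetic, spelled out below. The descriptor sizes (32 bytes per texture descriptor, 16 bytes per sampler descriptor) are assumptions for illustration, and the function names are made up; only the load counts come from the text above.

```python
def block_loads_combined(texture_count):
    """One aligned {texture, sampler} pair per texture: one block load each."""
    return texture_count

def block_loads_separate(texture_count, sampler_count, sampler_reloads=0):
    """One load per texture descriptor, one per distinct sampler, plus any reloads."""
    return texture_count + sampler_count + sampler_reloads

def bytes_combined(texture_count, texture_bytes, sampler_bytes):
    """Every texture drags its own copy of a sampler descriptor along."""
    return texture_count * (texture_bytes + sampler_bytes)

def bytes_separate(texture_count, sampler_count, texture_bytes, sampler_bytes):
    """Each distinct sampler descriptor is loaded once."""
    return texture_count * texture_bytes + sampler_count * sampler_bytes

TEXTURE_BYTES = 32  # assumed descriptor size
SAMPLER_BYTES = 16  # assumed descriptor size

extra_loads = block_loads_separate(8, 3) - block_loads_combined(8)
extra_bytes = (bytes_combined(8, TEXTURE_BYTES, SAMPLER_BYTES)
               - bytes_separate(8, 3, TEXTURE_BYTES, SAMPLER_BYTES))
```

Under these assumed sizes the separate layout costs 3 more block loads (11 against 8) while the combined layout loads 80 more bytes (384 against 304), which is the "fewer loads, a little more constant cache" trade described above.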

Lowering the number of block loads could be an advantage in that it would in theory reduce the number of scalar operations. On GCN, when live wavefront occupancy gets low, the SIMD units cannot multi-issue different classes of instructions (scalar, vector, vector memory, etc) as efficiently. Scalar ops can start to take away from ALU throughput when using lots of registers/wave.

6 comments:

  1. "DX byte code is a 'vector' ISA, and modern GPUs are 'scalar'." I thought modern GPUs were essentially clusters of SIMD units. Can you explain what you mean by that, please?

    1. Invocations of shaders see "scalar" registers; instructions do one floating point or integer operation at a time. In the PS3 and 360 days, invocations of shaders saw "vector" registers, and operations did more than one floating point operation at a time (vectorized). In both cases (new and old hardware) many invocations of shaders run in parallel using the underlying SIMD vector registers on the hardware.

  2. I don't think people usually adjust max LOD for streaming, do they? The way I've always seen it done, you keep the coarsest levels around all the time (they take only a tiny amount of memory) and just stream the finest level or two. So you only need to adjust min LOD: decreasing it after stream-in, and increasing it before stream-out.

    1. Thanks, yeah, you are right here. Edited the post.

  3. The section on threading is interesting, but I need some clarification on point (f). Are you suggesting writing your own handles/indices (not GL handles) into a constant buffer, or does GL actually support hooking up vertex attribute binding through a constant buffer?

    1. Suggestion was to stop using vertex fetch of interleaved attributes and just manually fetch using texture buffers of non-interleaved attributes. This is what I've been doing personally on desktop GPUs. Not sure this is sound advice yet for things like mobile because it might decrease the GPU's ability to hide latency.
