20140602

Metal

Apple Metal API Docs | Apple Metal Shading Language Docs

Took a few minutes this late evening to see what Apple's Metal API actually looks like. Some notes from a quick first look (I could be missing things):

Single queue for the GPU.
Queue changes between three modes {gfx, compute, copy}.
API offers parallel command buffer generation.
Command buffers get enqueued first, which fixes their ordering.
Then committed later, which submits the finished buffer for execution.
No command buffer reuse (must regenerate each frame).
Completion is only trackable at command buffer boundaries.
Ability to create views of textures.
Ability to create texture views of buffers.
Standard numbered binding slots for resources, but all stages share same binding table.
Cached state objects.
State objects: {texture, buffer, sampler, shader library, compute, framebuffer, pipeline, depth/stencil}.
Pipeline state: {vertex fetch, shaders, blend state, framebuffer format}.
Ability to create the expensive pipeline state objects asynchronously.
Framebuffer attachment options at start of rendering: {clear, load, don't care}.
Framebuffer attachment options at end: {store, MSAA resolve, don't care}.
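The command buffer lifecycle and attachment options above can be sketched roughly as follows. This is a minimal sketch in modern Swift spelling (the original notes predate the Swift bindings), with the render target and draw calls omitted:

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice(),
      let queue  = device.makeCommandQueue(),
      let cb     = queue.makeCommandBuffer() else { fatalError("no Metal device") }

cb.enqueue()  // reserve this buffer's slot in the queue's execution order

// Framebuffer attachment options: load action at start, store action at end.
let pass = MTLRenderPassDescriptor()
pass.colorAttachments[0].loadAction  = .clear    // {clear, load, dontCare}
pass.colorAttachments[0].clearColor  = MTLClearColor(red: 0, green: 0, blue: 0, alpha: 1)
pass.colorAttachments[0].storeAction = .store    // {store, multisampleResolve, dontCare}
// pass.colorAttachments[0].texture = ...        // render target omitted in this sketch

// ... create an encoder against `pass`, record draws, call endEncoding() ...

cb.addCompletedHandler { _ in
    // the only trackable completion point: the whole command buffer
}
cb.commit()  // encoding finished; submit for execution
```

Note the split between enqueue (ordering) and commit (submission): it is what lets multiple threads generate command buffers in parallel while the frame's execution order stays fixed.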

Thoughts
Feels a lot like the design of DX11 with the desktop GPU features removed, but with working parallel command buffer generation. The pipeline state object seems to cover all the graphics state that would require a run-time recompile or patch of shader binaries, or a complex reconfiguration of the graphics pipeline. I'm guessing the pipeline state is where they get a lot of the perf win compared to ES. I'm assuming this hardware has no fixed-function blending, so I'm not sure why they did not just remove blend state and simply require shaders to use framebuffer fetch and framebuffer store. The primary advantages of this API over their existing ES API look to be: pre-compiled shaders, super aggressive caching of state objects, working parallel command buffer generation, compute, and finally a really optimized driver.
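The "expensive state baked up front" idea can be sketched like this: the pipeline descriptor bundles shaders, blend state, and framebuffer format, and the compile can be kicked off asynchronously. The shader function names ("vs_main"/"fs_main") are placeholders, not anything from the docs:

```swift
import Metal

guard let device  = MTLCreateSystemDefaultDevice(),
      let library = device.makeDefaultLibrary() else { fatalError("no Metal device") }

let desc = MTLRenderPipelineDescriptor()
desc.vertexFunction   = library.makeFunction(name: "vs_main")  // pre-compiled shaders
desc.fragmentFunction = library.makeFunction(name: "fs_main")
desc.colorAttachments[0].pixelFormat       = .bgra8Unorm       // framebuffer format
desc.colorAttachments[0].isBlendingEnabled = true              // blend state baked in

// Async creation: any shader recompile/patch happens off the critical path,
// and the resulting state object can be cached and reused across frames.
device.makeRenderPipelineState(descriptor: desc) { state, error in
    // stash `state` for use at draw time
}
```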

3 comments:

  1. Note that there is absolutely no ordering guarantee between different encoders in the same command buffer (the API makes it very clear that the only ordering guarantees exist at command buffer granularity).

    There's a reason there's absolutely no synchronization at finer granularity than CB level: PowerVR HW doesn't provide any - it's a tile-based deferred renderer; there's no way to e.g. schedule a blit between two draw calls, since rendering proceeds in two phases (producing triangles, then rasterizing+shading them), and the latter half proceeds tile by tile. So there's really no "between" batch 3 and 4, since their processing might well be interleaved. For that reason, processing of blits (and compute too) is really not ordered in any useful way relative to rendering. It's probably more useful to think of it as three separate queues (gfx, compute, blit) that happen to all get fed from the same command buffer, not one queue that switches modes; if this is like other PVR HW I've worked with, blits submitted from a blit encoder can (and often will) finish well before the first batch sent from a rendering encoder that *precedes* the blit in the same CB.

    1. And I've just been told that this is all "entirely incorrect". Never mind then. :)

    2. Okay, Gokhan Avkarogullari at Apple just corrected my misconceptions and cleared the matter up. To wit:

      There *are* ordering guarantees, because apparently Metal does keep track of resource hazards between encoder switches (there can only be one encoder type active at a time, which I didn't mention in my previous post) - so if I, say, render to a texture then blit from it, or follow up with another rendering job that uses said texture, the appropriate synchronization will get inserted.

      Consequently, from the perspective of the app, events always happen in the order they are encoded, in the sense that all observable sequential dependencies are respected. However, if there is no such sequential dependency between two separate "encoder blocks" (my term), the GPU HW may overlap their execution (so it might be faster than the strictly sequential execution implied by the API model).
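      The render-then-blit dependency described here can be sketched like so (a hypothetical example, not from the docs): two encoders in one command buffer, where the blit reads the texture the render encoder just wrote, and the driver tracks the hazard across the encoder switch.

      ```swift
      import Metal

      guard let device = MTLCreateSystemDefaultDevice(),
            let queue  = device.makeCommandQueue(),
            let cb     = queue.makeCommandBuffer() else { fatalError("no Metal device") }

      let texDesc = MTLTextureDescriptor.texture2DDescriptor(
          pixelFormat: .bgra8Unorm, width: 256, height: 256, mipmapped: true)
      texDesc.usage = [.renderTarget, .shaderRead]
      let target = device.makeTexture(descriptor: texDesc)!

      // Encoder 1: render into `target` (only one encoder is active at a time).
      let pass = MTLRenderPassDescriptor()
      pass.colorAttachments[0].texture     = target
      pass.colorAttachments[0].loadAction  = .clear
      pass.colorAttachments[0].storeAction = .store
      let render = cb.makeRenderCommandEncoder(descriptor: pass)!
      // ... draw calls ...
      render.endEncoding()

      // Encoder 2: a blit that reads `target`; the driver inserts the sync,
      // so the mip generation observes the completed rendering.
      let blit = cb.makeBlitCommandEncoder()!
      blit.generateMipmaps(for: target)
      blit.endEncoding()

      cb.commit()
      ```

      Independent encoder blocks with no such dependency are free to overlap on the GPU, as described above.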
