20160721

Why Motion Blur Trivially Breaks Tone-Mapping

Some food for visual thought ...

You have a scene with a bunch of High Dynamic Range (HDR) light sources, or bright secondary reflections. Average Picture Level (APL) of the scene as displayed is relatively low, which is standard practice and expected behavior for displayed HDR scenes. Because the display cannot output the full dynamic range of the scene, the scene is tone-mapped before display, and thus the groups of pixels representing highlights appear not as bright as they should.

Now the camera moves, and the scene gets motion blur applied in engine prior to tone-mapping. All of a sudden the scene gets brighter. The APL increases. In fact the larger the motion, the brighter the scene gets.

You, yes you the reader, have seen this before in many games. And technically speaking the game engine isn't doing anything wrong.

What is happening is that those small bright areas of the scene get blurred, distributing their energy over more pixels. Each of these blurred pixels has less intensity than the original source. As the intensity lowers, it moves further away from the aggressive region of highlight compression, approaching tonality which can be reproduced by the display. So the APL increases, because less of the scene's output energy is getting limited by the tone-mapper.
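A minimal numeric sketch of this effect. The post does not name a tone-mapper, so a simple Reinhard-style curve x/(1+x) is assumed here, and a 1D box blur stands in for motion blur. The function names are made up for illustration.

```python
def tonemap(x):
    # Reinhard-style curve (an assumption): compresses highlights toward 1.0
    return x / (1.0 + x)

def box_blur(row, width):
    # Spread each pixel's energy evenly over `width` pixels (wrapping at the
    # edge), so the total scene energy is conserved
    n = len(row)
    out = [0.0] * n
    for i, v in enumerate(row):
        for k in range(width):
            out[(i + k) % n] += v / width
    return out

def displayed_apl(row):
    # Average picture level after tone-mapping
    return sum(tonemap(v) for v in row) / len(row)

# One HDR highlight (intensity 64) in an otherwise black 64-pixel row
scene = [0.0] * 64
scene[10] = 64.0

apl_static = displayed_apl(scene)
apl_blur8 = displayed_apl(box_blur(scene, 8))
apl_blur32 = displayed_apl(box_blur(scene, 32))
print(apl_static, apl_blur8, apl_blur32)
```

With these numbers the displayed APL goes from about 1/65 with no blur, to 1/9 with an 8-pixel blur, to 1/3 with a 32-pixel blur, while the pre-tone-map energy stays at 64 in all three cases.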

The irony in this situation is that as motion and motion blur increase, the scene is actually getting closer to its correct energy-conserving visual representation as displayed.

Note the same effect applies as more of the scene gets out of focus during application of Depth of Field.

Re Twitter: Thoughts on Vulkan Command Buffers

Because twitter is too short ...

API Review
In Vulkan the application is free to create multiple VkCommandPools each of which can be used to allocate multiple VkCommandBuffers. However effectively only one VkCommandBuffer per VkCommandPool can be actively recording commands at a given time. The intent of this design is to avoid having a mutex when command buffer recording needs to allocate new CPU|GPU memory.

Usage/Problem Case
The following hypothetical situation is my best understanding of the usage case as presented in fragmented max 140 character messages on twitter. Say one had a 16-core CPU, where each core did a variable amount of command buffer recording. The application will need at a minimum 16 VkCommandPools in order to have 16 instances of command buffer recording going in parallel (one per core). Say the application has a peak of 256 command buffers generated per frame, and cores pull a job to write a command buffer from some central queue. Now given CPU threading and preemption is effectively random, it is possible in the worst case that only one thread on the machine has to generate all 256 command buffers. In Vulkan there are two obvious methods one could attempt to manage this situation,

(1.) Could pre-allocate 256 VkCommandBuffers on each of the 16 VkCommandPools, resulting in needing 16 * 256 = 4096 VkCommandBuffer objects total. Unfortunately AMD's Vulkan driver currently has higher than desired minimum allocated memory for each VkCommandBuffer. On the plus side there is an active bug, number 98777 (if you want to reference this in an email to AMD), for resolving this issue.

(2.) Could alternatively allocate then free VkCommandBuffers at run-time each frame.

Once bug 98777 is resolved with a driver fix, option (1.) would be the preferred solution from the above two options.

Digging Deeper
Part of what concerns me personally about this usage case is that it implies building an engine where VkCommandPool is effectively pinned to a specific CPU thread, and then randomly asymmetrically loading each VkCommandPool! For example, say in the typical case each CPU thread builds on average the same amount of command buffers in terms of CPU and GPU memory consumption. In this mostly symmetrical load pattern, the total memory utilization of each VkCommandPool will be relatively balanced. Now say at some frequency one of the threads, chosen randomly, and its associated VkCommandPool, is loaded with 50% of the frame's command buffers in terms of memory utilization. If VkCommandPools "pool" memory and keep it, then over time each VkCommandPool would end up "pooling" 50% of the memory required for all the frame's command buffers. With 16 pools that is 16 * 50%, or roughly 8 times what is required.
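A rough sketch of that arithmetic, under two simplifying assumptions which are mine: the spike happens every frame, and a pool never hands memory back. Memory is measured in units of "one frame's worth of command buffers", and pooled_memory is a made-up name.

```python
import random

def pooled_memory(num_pools, frames, spike_share, seed=0):
    # Track each pool's high-water mark, assuming (hypothetically) that a
    # pool keeps whatever it has ever grown to
    rng = random.Random(seed)
    high_water = [0.0] * num_pools
    even_share = (1.0 - spike_share) / (num_pools - 1)
    for _ in range(frames):
        spiked = rng.randrange(num_pools)  # the randomly overloaded thread/pool
        for p in range(num_pools):
            used = spike_share if p == spiked else even_share
            high_water[p] = max(high_water[p], used)
    return sum(high_water)

total = pooled_memory(num_pools=16, frames=10_000, spike_share=0.5)
print(total)  # total pooled memory, in frames' worth of command buffers
```

Once every pool has taken the spike at least once, the total settles at 16 * 0.5 = 8 frames' worth of memory.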

This problem isn't really Vulkan specific. It is a fundamental problem for anything which does deferred freeing of a resource. The amount of over-subscription under random asymmetrical load is a function of the delay before the deferred free. This ultimately becomes a balancing act between the run-time or synchronization cost of dynamic allocation and the extra memory required.

Possible Better Solution?
Might be better to un-pin VkCommandPool from CPU thread. Then instead use a few more VkCommandPools than CPU threads, and have each CPU thread grab exclusive access to a random VkCommandPool at run-time to use to build command buffers for jobs until a set timeout, at which point it releases the given VkCommandPool, and then chooses the next free one to start work again. Note there is no mutex in here for acquire/release pool, but rather a lock-free atomic access to a bit array in say a 64-bit word.
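A sketch of the acquire/release logic on such a bit array. Python has no user-level compare-and-swap, so the hypothetical AtomicWord class below emulates one with a small lock purely to model the semantics. In C or C++ that would be a single atomic compare-exchange on a 64-bit word, with no mutex. This version takes the lowest free pool rather than a random one, to keep it short.

```python
import threading

class AtomicWord:
    # Stand-in for a 64-bit atomic word. The lock only emulates the atomicity
    # of one hardware compare-exchange; it is not part of the scheme itself.
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_exchange(self, expected, desired):
        with self._guard:
            if self._value == expected:
                self._value = desired
                return True
            return False

def acquire_pool(word, num_pools):
    # Claim a free pool by setting its bit; returns the index, or None if
    # every pool is currently owned by some thread
    full = (1 << num_pools) - 1
    while True:
        seen = word.load()
        free = ~seen & full
        if free == 0:
            return None
        bit = free & -free  # isolate the lowest free bit
        if word.compare_exchange(seen, seen | bit):
            return bit.bit_length() - 1
        # Another thread changed the word first: retry with a fresh load

def release_pool(word, index):
    # Clear the pool's bit, retrying if another thread raced us
    while True:
        seen = word.load()
        if word.compare_exchange(seen, seen & ~(1 << index)):
            return

owned = AtomicWord()
first = acquire_pool(owned, 3)
second = acquire_pool(owned, 3)
third = acquire_pool(owned, 3)
none_left = acquire_pool(owned, 3)
release_pool(owned, second)
again = acquire_pool(owned, 3)
```

A worker would hold the returned index until its timeout, record into that VkCommandPool only, then release and acquire again.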

In this situation, assuming CPU/GPU memory overhead for a command buffer scales roughly with CPU load of filling said command buffer, regardless of how asymmetrical the mapping is of jobs to CPU threads, the VkCommandPools get loaded relatively symmetrically.

Another thing about CPU threading which is rather important IMO, is that the OS will preempt CPU threads randomly after they have taken a job, which can cause random pipeline bubbles. As long as this is a problem, it might be desirable to preempt the OS's preemption and instead manually yield execution to another CPU thread at a point which ensures no pipeline bubbles (i.e. after finishing a job and releasing a lock on a queue, etc). The idea being to transform the OS's perception of the thread from being a "compute-bound" thread (something which always runs until preemption) to something which looks like an interactive "IO-bound" thread (something which ends in self blocking). Maybe it is possible to do this by having more worker threads than physical/virtual CPU threads, and waking another worker, then blocking until woken again. Something to think about...

Transferring Command Buffers Across Pools?
I'll admit here I've been so Vulkan focused that I'm currently out of touch with how exactly DX12 works. Seems like the twitter claim is that the Vulkan design is fundamentally flawed because VkCommandBuffer is locked to a VkCommandPool at allocation-time, instead of being set at begin-recording-time like DX12. This sounds to me the same as (2.) at the top of this post, effectively making "Allocate" and "Free" very fast for command buffers in a given pool, just "Allocate" is now effectively "Begin Recording" in the DX12 model. Meaning just shuffling work around to different API entry points. Assigning the Pool at "Begin Recording" time does not do anything to solve the asymmetric Pool loading problem caused by the desire to have Pools pinned to CPU threads for this usage case.

Baking Command Buffers - And Replaying
As the number of command buffers increases, one is effectively factoring out the sorting/predication of commands which would otherwise be baked into one command buffer, and deferring that sorting/predication until batch command buffer submit time. As command buffer size gets smaller, it can cross the threshold where it becomes more expensive to generate the tiny command buffers, than to cache them and just place them into the batch submit. So if say one had roughly 256 command buffers in effectively everything outside of shadow generation and drawing, meaning everything from compute based lighting through post processing, it is likely better to just cache baked command buffers instead of always regenerating them.

My personal preference is effectively "compute-generated-graphics", rendering with compute only, mixed with fully baked command buffer replay (no command buffer generation after init time), and indirect dispatch to manage adjusting amount of work to run per frame ...

20160715

LED Displays

Gathering information to attempt to understand what is required to drive indoor LED sign based displays...

Target
256x128 2:1 letter box display (NES was 256 pixels wide).

How do LED Modules Work?
Adafruit provides one description of how to drive a 32x16 LED module. Attempting a rough translation. LEDs are either on or off. The 32x16 panel can only drive 64 LEDs at one time, organized as two 32x1 lines 8 rows apart. Scanning starts with lines {0,8}, then {1,9}, then {2,10}, and so on.
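A tiny sketch of that scan order, taking "8 rows apart" literally with 0-based row numbers. scan_pairs is a made-up helper name.

```python
def scan_pairs(rows=16):
    # Each scan step lights one line in the top half and the line
    # rows/2 below it in the bottom half
    half = rows // 2
    return [(r, r + half) for r in range(half)]

pairs = scan_pairs()
leds_lit_per_step = 2 * 32  # two 32-pixel lines at a time
print(pairs)
```

For the 32x16 panel this gives 8 scan steps, each lighting 64 of the 512 LEDs.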

Panels are designed to be chained, driven by a 16-pin connector which provides 2 pixels per clock (one pixel for the top and one for the bottom scan-line). Looks like some other grouped LED panels go up to 128x128, driven as 4 row chunks of 128x32, each built from two chained 64x32 panels. Seems like the 64x32 panels are driven with 2 lines of 64 pixels (based on the addition of one extra address bit). Could not find a good description of chaining yet.

Seems like the 64x32 panels have roughly a 1/16 duty cycle (meaning only 1/16 of the LEDs are active at any one time). LED displays are low-persistence high-frame-rate displays with binary pixels. Based on this thread they can drive one cable at 40 MHz. At 2 pixels per clock that is 80M pixels per second per cable, so a 128x128 panel with 4 cables (each feeding a 128x32 chunk) would be roughly 80M pixels/s / (128*32 pixels/frame) = 19.5 thousand frames per second.
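The same estimate as a few lines of arithmetic. The helper name is made up, and the 40 MHz clock and 2 pixels per clock are the figures quoted above, not measured values.

```python
def binary_frames_per_second(clock_hz, pixels_per_clock, pixels_per_cable):
    # Each cable refreshes its own chunk of the panel independently, so the
    # rate is set by how fast one cable can shift out one chunk
    return clock_hz * pixels_per_clock / pixels_per_cable

fps = binary_frames_per_second(40e6, 2, 128 * 32)
print(fps)
```

This lands at a little over 19.5 thousand binary frames per second.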

The basic Pulse Width Modulation (PWM) to modulate brightness would transform this low-persistence display into something effectively scan-and-hold, just with a lot of micro-strobed sub-frames doing PWM across the effective "scan-and-hold" period. Getting something truly low-persistence is more of a challenge. These displays can be over 1500 nits (even with a 1/16 duty cycle). So one option for lower persistence is to actually insert black frames between frames, dropping the scan-and-hold time.

A 120 Hz frame rate provides 8.333 ms of frame time. Switching to half black frames would drop the hold time to 4.167 ms (which isn't yet low persistence IMO), and would reduce to a 750 nit display (half the contrast), leaving roughly 80 or so sub-frames for PWM.

A 240 Hz frame rate at half black frames could be at the right compromise between lost contrast and low persistence. A 480 Hz frame rate with no black frames might be able to provide full contrast, and low enough persistence, but likely would need some seriously good temporal dithering.
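A small sketch tabulating these options, assuming the roughly 19.5 thousand binary frames per second estimated above and a 1500 nit peak. Both are ballpark figures, and budget is a made-up helper.

```python
BINARY_FPS = 80e6 / (128 * 32)  # binary sub-frames per second (estimate)
PEAK_NITS = 1500.0              # assumed peak brightness with no black frames

def budget(frame_hz, lit_fraction):
    # lit_fraction is the share of each frame that is not black
    frame_ms = 1000.0 / frame_hz
    hold_ms = frame_ms * lit_fraction
    nits = PEAK_NITS * lit_fraction
    sub_frames = int(BINARY_FPS / frame_hz * lit_fraction)
    return hold_ms, nits, sub_frames

for hz, lit in [(120, 1.0), (120, 0.5), (240, 0.5), (480, 1.0)]:
    hold_ms, nits, sub_frames = budget(hz, lit)
    print(f"{hz:3d} Hz, {lit:.0%} lit: hold {hold_ms:.3f} ms, "
          f"{nits:.0f} nits, {sub_frames} PWM sub-frames")
```

Note that 240 Hz with half black frames and 480 Hz with none end up with the same hold time and the same number of PWM sub-frames (about 40), differing only in brightness, which is why the 480 Hz case leans harder on temporal dithering.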

20160706

Low Cost Branching to Factoring Out Loop Exit Check

Threading to hide pipeline depth combined with an ISA which makes branching cheap is one goal. Specifically absolute branching with immediate destination in the opcode word (single word branch/call, no adder), and instructions which include a return flag (make returns free). Enables easy computed branches, both for jump tables, and loops. Can factor out loop check into hierarchical call tree,

Do4: Unroll work four times; Return;
Do16: Call Do4; Call Do4; Call Do4; Jump Do4;
Do64: Call Do16; Call Do16; Call Do16; Jump Do16;
... etc ...


Can use a computed branch to jump into the tree for other loop counts.
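A software model of the call tree, just to check the iteration counts. The slicing with `skip` is my stand-in for a computed branch that enters a body part-way through. On the real machine each body would be straight-line Call/Jump instructions with no loop at all, so the Python `for` is only a modeling device.

```python
def build_tree(work):
    # Each level is four copies of the level below; `skip` drops that many
    # leading copies, modeling a jump into the middle of the body
    def do4(skip=0):
        for step in (work, work, work, work)[skip:]:
            step()

    def do16(skip=0):
        for step in (do4, do4, do4, do4)[skip:]:
            step()

    def do64(skip=0):
        for step in (do16, do16, do16, do16)[skip:]:
            step()

    return do4, do16, do64

calls = []
do4, do16, do64 = build_tree(lambda: calls.append(1))

do64()          # 64 iterations with no loop counter or exit check
do16(skip=1)    # enter Do16 at its second Call: 12 iterations
print(len(calls))
```

Other counts fall out of combining entry points, for example 12 + 2 by entering Do16 at its second Call and then Do4 at its third unrolled copy.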

20160705

CPU Threading to Hide Pipelining

If a CPU has a 4 stage pipeline, would be nice to have 4 CPU threads round robin scheduled to ensure pipeline delays for {memory, alu, branches, etc} do not have to be programmer visible, and to avoid complexities such as forwarding.

According to docs, Xilinx 7-series DSPs need a 3 stage pipeline for full speed MAC, and BRAMs (Block RAMs) need 1 cycle delay for reads.

Working from high level constraints, I'm planning on using the following for the CPU-side of the project,

16-bit or 18-bit machine
1 DSP
X BRAMs of Instruction RAM (2 ports, read or write for either)
Y BRAMs of Data RAM (2 ports, read or write for either)


Which suggests the following 4 stage pipeline (with 4 CPU threads running in parallel, one on each pipeline stage),

[0]
Instruction BRAM Read -> Instruction BRAM Registers
DSP MUL -> DSP {M,C} Registers (from prior instruction)

[1]
Instruction Decode
DSP ALU -> DSP {P} Registers (from prior instruction)

[2]
Data BRAM Write(s) (results from prior instruction)
Data BRAM Read(s) -> Data BRAM Registers

[3]
DSP Input -> DSP {A,B,D} Registers
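A small model of the round-robin schedule, assuming thread t enters stage [0] on cycles where cycle mod 4 equals t. The function names are made up.

```python
STAGES = 4  # pipeline stages, and also the number of hardware threads

def thread_in_stage(cycle, stage):
    # A thread that entered stage 0 on cycle c sits in stage s on cycle c + s
    return (cycle - stage) % STAGES

def occupancy(cycle):
    # Which thread occupies stages [0], [1], [2], [3] on this cycle
    return [thread_in_stage(cycle, s) for s in range(STAGES)]

for cycle in range(8):
    print(cycle, occupancy(cycle))
```

Every cycle each stage holds a different thread, and any one thread issues a new instruction only every fourth cycle, so from the thread's point of view the stage-to-stage register delays are not visible.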


With an ISA which can do something as complex as the following (below) in one instruction. A focus on instruction forms which can leverage both ports on the Instruction BRAMs (opcode and separate optional immediate), as well as both ports on the Data BRAMs. Using dedicated address registers to provide windows into the Data BRAMs for immediate access instead of a conventional register file, and leveraging a special high-bit-width accumulator to maintain precision of intermediate fixed-point operations.

[addressRegister[2bitImmediate]^nbitImmediate] = accumulator;
accumulator = [addressRegister[2bitImmediate]^nbitImmediate] OP 18bitImmediate;

Relative Addressing With XOR - Removing an Adder

Could be an interesting compromise: use XOR instead of ADD for relative addressing to remove an adder. Specifically,

Address = AddressRegister XOR Immediate

Forces the programmer to keep the address register on some power of two alignment. With caches and/or parallel access to banked memory, this would be bad. But likely fine for a core with a private memory, and code written in assembly.
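A quick check of why the alignment matters. The 16-word window and the addresses are arbitrary values picked for illustration.

```python
def xor_address(address_register, immediate):
    # Relative addressing with XOR instead of ADD: no carry chain needed
    return address_register ^ immediate

WINDOW = 16  # hypothetical window size, a power of two

# Aligned base: the low bits are zero, so XOR and ADD give the same address
aligned = 0x1230
same_as_add = all(xor_address(aligned, i) == aligned + i for i in range(WINDOW))

# Unaligned base: XOR still stays inside the same 16-word window, but it
# visits the words in a permuted order rather than base, base+1, base+2, ...
unaligned = 0x1234
visited = sorted(xor_address(unaligned, i) for i in range(WINDOW))
print(same_as_add, hex(xor_address(unaligned, 4)), hex(unaligned + 4))
```

With the unaligned base, an immediate of 4 lands on 0x1230 instead of 0x1238, which is the kind of surprise the alignment rule avoids.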

20160704

SpartanMC

SpartanMC - An FPGA soft core with an 18-bit word size, with a SPARC like sliding register window.

"Forth" of July Reboot

Used the nuclear option on the blog, starting over, synchronizing with an internal reboot, an attempt to completely refocus personal hobby time on FPGA based hardware design. This blog serving as a place to collect thoughts and ideas as I stumble towards something to synthesize...