Black Friday vs Display Technology

My wife and I are in the Toronto area in Canada visiting family for the US Thanksgiving holiday. Naturally food and shopping resulted in a trip to Costco and the mall, where I ran across an Apple store, and managed to entertain myself by getting back up to speed again with the state of 4K...

I'm Lost, Where is the 4K?
In some Markham Apple store, I tried one of these 27" 5K iMacs. Was very confused at first: looking at the desktop photo backgrounds, everything is up-sampled. But the rendered text in the UI, that was sharp, actually too sharp. Had to find out what GPU was driving this 5K display. Opened Safari, went to www.apple.com, and found the page with iMac tech specs. Couldn't read the text. Solved this problem by literally sticking my face to the screen. AMD GPU, awesome.

Second, Costco: everything is 4K, and yet nothing shows 4K content. Seems like out of desperation they were attempting to show some time-lapse videos from DSLR still frames (easier to get high resolution from stills than from video). Except even those looked massively over-processed and up-sampled (looked like less than Canon 5D Mark I source resolution). All the Blu-ray content looks like someone accidentally set the median filter strength to 1000%. All the live digital TV content looks like "larger-than-life" video compression artifacts post processed by some haloing sharpening up-sampling filter. There is a tremendous opportunity for improvement here driving the 4K TV from the PC, and doing better image processing.

OK So Let's Face Reality, Consumers Have Never Seen Real 4K Video Content
Well except for IMAX in the theater. Scratch that, consumers haven't even seen "real" 2K content (again other than at the movie theater). Artifact-free 2K content simply doesn't exist at the bit-rate people stream content. Down-sample to 540p and things look massively better.

What About Photographs
Highest-end "prosumer" Canon 5DS's 8688x5792 output isn't truly 8K. It is under 4K by the time it is down-sampled enough to remove artifacts. At a minimum, the Bayer sampling pattern technically yields something around half of 8K in pixel area, but typical Bayer reconstruction introduces serious chroma aliasing artifacts. Go see for yourself: zoom into the white on black text pixels on dpreview's studio scene web app. But that is just ideal lighting, at sharpest ideal aperture, on studio stills. Try keeping pixel-perfect focus everywhere in hard lighting in a real scene with an 8K 35mm full-frame camera... Anyway most consumers have relatively low resolution cameras in comparison. Even the Canon 5D Mark III is only 5760x3840, which is well under 4K after down-sampling to reach pixel perfection. Ultimately a Canon 5D Mark III still looks better full screen on a 2K panel, because the down-sampling removes all the artifacts which the 4K panel makes crystal clear.

4K Artifact Amplification Machine Problem
Judging by actual results on Black Friday, no one is doing a good job of real-time video super-resolution in 4K TVs. It is like trying to squeeze blood from a stone, more so given that the source is typically suffering from severe compression artifacts. The HDTV industry needs its form of the "Organic" food trend: please bring back some non-synthetic image processing. The technology to do a good job of image enlargement was actually perfected a long time ago, it is called the slide projector. Hint: use a DOF filter to up-sample.

4K and Rendered Content
What would sell 4K as a TV to me: a Pixar movie rendered at 4K natively, played back from losslessly compressed video. That is how I define "real" content.

What about games? I've been too busy to try Battlefront yet except the beta, and I'm on vacation now without access to my machine (1st world problems). But there are some perf reviews for 4K online. Looks like Fury X is the top single-GPU score at 45 fps. Given there are no 45 Hz 4K panels, that at best translates to 30 Hz without tearing. Seems like variable refresh is an even more important feature for 4K than it was at 1080p; this is something IMO TV OEMs should get on board with. Personally speaking, 60 Hz is the minimum fps I'm willing to accept for an FPS, so I'm probably only going to play at 1080p down-sampled to 540p (for the anti-aliasing) on a CRT scanning out around 85 Hz (to provide a window to avoid v-sync misses).

Up-Sampling Quality and Panel Resolution
The display industry seems to have adopted a one-highest-resolution-fits-all model, which is quite scary, because for gamers like myself, FPS and anti-aliasing quality are the most important metrics. Panel resolution beyond the capacity to drive perfect pixels is actually what destroys quality, because it is impossible to up-sample with good quality.

CRT can handle variable resolutions because the reconstruction filtering is at an infinite resolution. The beam has a nice filtered falloff which blends scan-lines perfectly. 1080p up-sampled on a 4K panel will never look great in comparison, because filtering quality is limited by alignment to two square display pixels. 4K however probably will provide a better quality 540p. Stylized up-sampling is only going to get more important as resolutions increase. Which reminds me, I need to release the optimized versions of all my CRT and stylized up-sampling algorithms...

CRT vs LCD or OLED in the Context of HDR
Perhaps it might be possible to say that HDR would finally bring forward something superior to CRTs. Except there is one problem: display persistence. A typical 400 nit PC LCD driven at 120 Hz has an 8.33 ms refresh. Dropping this to a 1 ms low persistence strobed frame would result in roughly 1/8 the brightness (aka it would act like a 50 nit panel). With LCDs there is also the problem of strobe to scanning pattern (that changes the pixel value) mismatch, resulting in ghosting towards the top and bottom of the screen. Add on top, large displays like TV panels have power per screen area problems resulting in global dimming. So I highly doubt LCD displaces the CRT any time soon, even in the realm of 1000 nit panels.
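The persistence math above is simple duty-cycle arithmetic; a minimal sketch (the helper name is mine, not from any display spec):

```c
// Average luminance of a strobed display scales with the duty cycle:
// light is only emitted for strobe_ms out of each refresh_ms window.
static float strobed_nits(float full_persistence_nits,
                          float refresh_ms, float strobe_ms) {
    return full_persistence_nits * (strobe_ms / refresh_ms);
}
```

Plugging in the numbers above, 400 nits at a 1 ms strobe out of an 8.33 ms refresh lands around 48 nits, matching the roughly 50 nit figure.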

OLED is a wild card. Works great for low persistence in VR, but what about large panels? Seems like the WRGB LG 4K OLED panel is the one option right now (it is apparently in the Panasonic as well). Based on displaymate.com results, OLED is not there yet. Blacks should be awesome, and better than CRTs, but according to hdtvtest.co.uk's review of the latest LG OLED, there is a serious problem with near black vertical banding after 200+ hours of use. Also with only around 400-some nits peak, and more global dimming issues than typical LCDs, looks like large-panel low-persistence is going to be a problem for now. Hopefully this tech gets improved and they eventually release a 1080p panel which does low persistence at as low as 80 Hz.

Reading tech specs of these 4K HDR TVs paints a clear picture of what HDR means for Black Friday consumers. Around a 400-500 nit peak, just like current PC panels, but PC panels don't have global dimming problems. Newest TV LCD panels seem to have a 1-stop ANSI contrast advantage, perhaps less back-light leakage with larger pixels. Screen reflectance has been dropping (this is an important step in getting to better blacks). TV LCD panels are approaching P3 gamut. PC panels have been approaching Adobe RGB gamut. Both are similar in area. Adobe RGB mostly adds green area to sRGB/Rec709 primaries, where P3 adds less green, but more red. So ultimately if you grab a Black Friday OLED and don't get the near black 200+ hour problem, HDR translates to literally "better shadows".

Rec 2020 Gamuts and Metamerism
The "Observer Variability in Color Image Matching on a LCD monitor and a Laser Projector" paper is a great read. The laser projector is a Microvision, the same tech in a Pico Pro, and has a huge Rec 2020-like gamut. The route to this gamut is via really narrow band primaries. As the primaries get narrow and move towards the extremes of the visible spectrum, the human perception of the color generated by the RGB signal becomes quite divergent, see Figure 6. Note it is impossible to use a measurement tool to calibrate this problem away. The only way to fix it is via manual user adjustment done visually: probably via selecting between 2 or 3 stages of 3x3 swatches on the screen. And note that manual "calibration" would only be good for one user... anyone else looking at the screen is probably seeing a really strangely tinted image.

HDR and Resolution and Anti-Aliasing
While it will still take maybe 5 to 10 years before the industry realizes this, HDR + high resolution has already killed off the current screen-grid aligned shading. Let's start with the concept of "Nyquist frequency" in the context of still and moving images. For a set display resolution, stills can have 2x the spatial resolution of video if they either align features to pixels (aka text rendering) or accept that pixel sized features not aligned to pixel center will disappear as they align with a pixel border. LCDs adopted "square pixels" and amplified the spatial sharpness of pixel-aligned text, and this works to the disadvantage of moving video. Video without temporal aliasing can only safely resolve a 2 pixel wide edge at full contrast, as one pixel edges under proper filtering would disappear as they move towards a pixel border (causing temporal aliasing). So contrast, as a function of frequency of detail, needs to approach zero as features approach 1 pixel in width. HDR can greatly amplify temporal aliasing, making this filtering many times more important.

Screen-grid aligned shading via the raster pipe requires N samples to represent N gradations between pixels. LDR white on black requires roughly 64 samples/pixel to avoid visible temporal aliasing, with the worst aliasing in this case being around 1/64 of white (possible to mask the remaining temporal aliasing in film grain). With HDR the number of samples scales by the contrast between the lightest and darkest sample. To improve this situation requires factoring in the fractional coverage of the shaded sample. And simply counting post-z sample coverage won't work (not enough samples for HDR). Maybe using Barycentric distance of triangle edges to compute a high precision coverage might be able to improve things...

The other aspect of graphics pipelines which needs to evolve is the gap between the high frequency MSAA resolve filter and low frequency bloom. The MSAA resolve filter cannot afford to get wide enough to properly resolve an HDR signal. The more the contrast, the larger the resolve filter kernel must be. For an MSAA resolve to avoid temporal aliasing with LDR requires a 2 pixel window. With HDR a 1/luma weighting bias is typically used, which produces an incorrect image. The correct way is to factor the larger than 2 pixel window into a bloom filter which starts at pixel frequency (instead of say half pixel).
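For reference, the luma-biased resolve being criticized looks roughly like this (a sketch; exact weight functions vary by implementation, and the name is mine):

```c
// Luma-biased resolve: weighting each sample by 1/(1+luma) suppresses
// HDR fireflies inside a small resolve window, but biases the filtered
// result darker than the true mean -- the "wrong image" in question.
static float resolve_luma_biased(const float *samples, int n) {
    float sum = 0.0f, weight_sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        float w = 1.0f / (1.0f + samples[i]);
        sum += samples[i] * w;
        weight_sum += w;
    }
    return sum / weight_sum;
}
```

With samples {0, 0, 0, 100} the true mean is 25, but the biased resolve returns well under 1, which is why the energy belongs in a wider bloom kernel instead.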

But these are really just bandages, shading screen aligned samples doesn't scale. Switching to object/texture space shading with sub-pixel precision reconstruction is the only way to decouple resolution from the problem. And after reaching this point, the video contrast rule of approaching zero contrast around 1 pixel wide features starts to work to a major advantage, as it reduces the number of shaded samples required across the image...


CRT Inventory

Makvision M2929 ----------------- 29", _4:3, _800x600__, 30-40_ KHz, 47-90_ Hz, 90___ MHz, slot mask_____, _____0.73 mm pitch, VGA_
HP "1024" D2813 ----------------- 14", _4:3, 1024x768__, 30-49_ KHz, 50-100 Hz, _____ MHz, shadow mask___, ____________ pitch, VGA_
Sony Wega KV-30HS420 ------------ 30", 16:9, _853x1080i, ______ KHz, ______ Hz, _____ MHz, aperture grill, ____________ pitch, HDMI
ViewSonic G75f ------------------ 17", _4:3, 1600x1200_, 30-86_ KHz, 50-180 Hz, 135__ MHz, shadow mask___, 0.21-0.25 mm pitch, VGA_
ViewSonic PS790 ----------------- 19", _4:3, 1600x1200_, 30-95_ KHz, 50-180 Hz, 202.5 MHz, shadow mask___, _____0.25 mm pitch, VGA_
Dell Ultrascan 1600HS D1626HT --- 21", _4:3, 1600x1200_, 30-107 KHz, 48-160 Hz, _____ MHz, aperture grill, 0.25-0.27 mm pitch, VGA_
Dell Ultrascan 20TX D2026T-HS --- 20", _4:3, 1600x1200_, 31-96_ KHz, 50-100 Hz, _____ MHz, aperture grill, _____0.26 mm pitch, VGA_

Cross-Invocation Data Sharing Portability

A general look at the possibility of portability for dGPUs with regards to cross-invocation data sharing (aka what to go after next, after ARB_shader_ballot, which starts exposing useful SIMD-level programming constructs). As always I'd like any feedback anyone has on this topic, feel free to write comments or contact me directly. Warning: this was typed up fast to collect ideas, might be some errors in here...

References: NV_shader_thread_group | NV_shader_thread_shuffle | AMD GCN3 ISA Docs

NV Quad Swizzle (supported on Fermi and beyond)
shuffledData = quadSwizzle{mode}NV({type} data, [{type} operand])
(1.) "mode" is {0,1,2,3,X,Y}
(2.) "type" must be a floating point type (implies possible NaN issues with integers)
(3.) "operand" is an optional extra unshuffled operand which can be added to the result
The "mode" is either a direct index into the 2x2 fragment quad, or a swap in the X or Y directions.

swizzledData = quadSwizzleAMD({type} data, mode)
(1.) "mode" is a bit array, can be any permutation (not limited to just what NVIDIA exposes)
(2.) "type" can be integer or floating point

Possible Portable Swizzle Interface
bool allQuad(bool value) // returns true if all invocations in quad are true
bool anyQuad(bool value) // returns true for entire quad if any invocations are true
swizzledData = quadSwizzleFloat{mode}({type} data)
swizzledData = quadSwizzle{mode}({type} data)
(1.) "mode" is the portable subset {0,1,2,3,X,Y} (same as NV)
(2.) "type" is limited to float based types only for quadSwizzleFloat()
This is the direct union of common functionality from both dGPU vendors. NV returns 0 for "swizzledData" if any invocation in the quad is inactive, according to the GL extension. AMD returns 0 for "swizzledData" only for inactive invocations. So the portable spec would have undefined results for "swizzledData" if any invocation in the fragment quad is inactive. This is a perfectly acceptable compromise IMO. Would work on all AMD GCN GPUs, and any NVIDIA GPU since Fermi for quadSwizzleFloat(), and since Maxwell for quadSwizzle() (using shuffle, see below); this implies two extensions. Quads in non-fragment shaders are defined by directly splitting the SIMD vector into aligned groups of 4 invocations.
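A CPU reference model of the portable quad swizzle semantics might look like the following (a sketch only, assuming the usual lane layout of 0,1 on the top row and 2,3 on the bottom; the function name is mine):

```c
// Portable quad swizzle over one 2x2 quad. "mode" is 0..3 for a direct
// broadcast of that lane, 'X' for a horizontal swap (0<->1, 2<->3),
// 'Y' for a vertical swap (0<->2, 1<->3).
static void quad_swizzle(const float quad[4], float out[4], int mode) {
    for (int lane = 0; lane < 4; ++lane) {
        int src;
        if (mode == 'X')      src = lane ^ 1;  // flip x bit of lane index
        else if (mode == 'Y') src = lane ^ 2;  // flip y bit of lane index
        else                  src = mode;      // broadcast quad[mode]
        out[lane] = quad[src];
    }
}
```

Note how both swap modes reduce to a single XOR of the lane index, which is why they map so cheaply onto both vendors' hardware.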

NV Shuffle (supported starting with Maxwell)
shuffledData = shuffle{mode}NV({type} data, uint index, uint width, [out bool valid])
(1.) "mode" is one of {up, down, xor, indexed}
(2.) "data" is what to shuffle
(3.) "index" is an invocation index in the SIMD vector (0 to 31 on NV GPUs)
(4.) "width" is {2,4,8,16, or 32}, divides the SIMD vector into equal sized segments
(5.) "valid" is optional return which is false if the shuffle was out-of-segment
Below the "startOfSegmentIndex" is the invocation index of where the segment starts in the SIMD vector. The "selfIndex" is the invocation's own index in the SIMD vector. Each invocation computes a "shuffleIndex" of another invocation to read "data" from, then returns the read "data". Out-of-segment means that "shuffleIndex" is out of the local segment defined by "width". Out-of-segment shuffles result in "valid = false" and sets "shuffleIndex = selfIndex" (to return un-shuffled "data"). The computation of "shuffleIndex" before the out-of-segment check depends on "mode".
(indexed) shuffleIndex = startOfSegmentIndex + index
(_____up) shuffleIndex = selfIndex - index
(___down) shuffleIndex = selfIndex + index
(____xor) shuffleIndex = selfIndex ^ index
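The index math and the out-of-segment rule above can be modeled on the CPU as follows (a sketch; the enum and names are mine):

```c
enum { MODE_INDEXED, MODE_UP, MODE_DOWN, MODE_XOR };

// Returns the lane actually read from. *valid goes false when the
// computed shuffleIndex falls outside the segment, in which case
// selfIndex is substituted (returning un-shuffled data).
static int shuffle_index(int mode, int self_index, int index,
                         int width, int *valid) {
    int seg_start = self_index & ~(width - 1);  // width is a power of 2
    int shuffle;
    switch (mode) {
        case MODE_INDEXED: shuffle = seg_start + index;  break;
        case MODE_UP:      shuffle = self_index - index; break;
        case MODE_DOWN:    shuffle = self_index + index; break;
        default:           shuffle = self_index ^ index; break;  // xor
    }
    *valid = (shuffle >= seg_start) && (shuffle < seg_start + width);
    return *valid ? shuffle : self_index;
}
```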

AMD's DS_SWIZZLE_B32 (all GCN) can also do a swizzle across segments of 32 invocations using the following math.
and_mask = offset[4:0];
or_mask = offset[9:5];
xor_mask = offset[14:10];
for (i = 0; i < 32; i++) {
    j = ((i & and_mask) | or_mask) ^ xor_mask;
    thread_out[i] = thread_valid[j] ? thread_in[j] : 0;
}

The "_mask" values are compile time immediate values encoded into the instruction.

AMD VOP_DPP (starts with GCN3: Tonga, Fiji, etc)
DPP can do many things,
For a segment size of 4, can do full permutation by immediate operand.
For a segment size of 16, can shift invocations left by an immediate operand count.
For a segment size of 16, can shift invocations right by an immediate operand count.
For a segment size of 16, can rotate invocations right by an immediate operand count.
For a segment size of 64, can shift or rotate, left or right, by 1 invocation.
For a segment size of 16, can reverse the order of invocations.
For a segment size of 8, can reverse the order of invocations.
For a segment size of 16, can broadcast the 15th segment invocation to fill the next segment.
Can broadcast invocation 31 to all invocations after 31.
Has option of either using "selfIndex" on out-of-segment, or forcing return of zero.
Has option to force on invocations for the operation.

AMD DS_PERMUTE_B32 / DS_BPERMUTE_B32 (starts with GCN3: Tonga, Fiji, etc)
Supports something like this (where "temp" is in hardware),
bpermute(data, uint index) { temp[selfIndex] = data; return temp[index]; }
permute(data, uint index) { temp[index] = data; return temp[selfIndex]; }
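A runnable CPU model of the pair, with "temp" folded away and a 64 invocation vector standing in for the wave (array names are mine):

```c
#define LANES 64

// Backward permute: every lane reads data from lane index[self].
static void bpermute64(const int data[LANES], const int index[LANES],
                       int out[LANES]) {
    for (int self = 0; self < LANES; ++self)
        out[self] = data[index[self] & (LANES - 1)];
}

// Forward permute: every lane writes its data to lane index[self].
static void permute64(const int data[LANES], const int index[LANES],
                      int out[LANES]) {
    for (int self = 0; self < LANES; ++self)
        out[index[self] & (LANES - 1)] = data[self];
}
```

The backward form is the generally more useful one (a gather), since the forward form can have write conflicts when two lanes target the same index.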

Possible Portable Shuffle Interface : AMD GCN + NV Maxwell
This is just a start of ideas, have not had time to fully explore the options, feedback welcomed...
SIMD width would be different for each platform so developer would need to build shader permutations for different platform SIMD width in some cases.

butterflyData = butterfly{width}({type} data)
Where "width" is {2,4,8,16,32}. This is "xor" mode for shuffle on NV, and DS_SWIZZLE_B32 on AMD (with and_mask = ~0, and or_mask = 0) with possible DPP optimizations on GCN3 for "width"={2 or 4}. The XOR "mask" field for both NV and AMD is "width>>1". This can be used to implement a bitonic sort (see slide 19 here).
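A CPU sketch of the butterfly semantics (SIMD width fixed at 32 here to match NV; the names are mine):

```c
#define SIMD_WIDTH 32

// butterfly{width}: each invocation exchanges data with the lane whose
// index differs in bit (width >> 1) -- the XOR mask field for both the
// NV xor-mode shuffle and AMD DS_SWIZZLE_B32.
static void butterfly(const int in[SIMD_WIDTH], int out[SIMD_WIDTH],
                      int width) {
    int xor_mask = width >> 1;
    for (int lane = 0; lane < SIMD_WIDTH; ++lane)
        out[lane] = in[lane ^ xor_mask];
}
```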

TODO: Weather is nice outside, will write up later...

reducedData = reduce{op}{width}({type} data)
(1.) "op" specifies the operation to use in the reduction (add, min, max, and, ... etc)
(2.) "width" specifies the segment width
At the end of this operation only the largest indexed invocation in each segment has the result, the values for all other invocations in the segment are undefined. This enables both NV and AMD to have optimal paths. This uses "up" or "xor" mode on NV for log2("width") operations. Implementation on AMD GCN uses DS_SWIZZLE_B32 as follows,
32 to 16 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=16
16 to 8 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=8
8 to 4 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=4
4 to 2 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=2
2 to 1 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=1
64 from finalized 32 => V_READFIRSTLANE_B32 to grab invocation 0 to apply to all invocations

Implementation on AMD GCN3 uses DPP as follows,
16 to 8 => reverse order of 16-wide (DPP_ROW_MIRROR)
8 to 4 => reverse order of 8-wide (DPP_ROW_HALF_MIRROR)
4 to 2 => reverse order using full 4-wide permutation mode
2 to 1 => reverse order using full 4-wide permutation mode
32 from finalized 16 => DPP_ROW_BCAST15
64 from finalized 32 => DPP_ROW_BCAST32

reducedData = allReduce{op}{width}({type} data)
The difference being that all invocations end up with the result. Uses "xor" mode on NV for log2("width") operations. On AMD this is the same as "reduce" except for "width"={32 or 64}. The 64 case can use V_READLANE_B32 from the "reduce" version to keep the result in an SGPR to save from using a VGPR. The 32 case can use DS_SWIZZLE_B32 for the 32 to 16 step.
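The xor-mode construction of allReduce can be sketched on the CPU like this (add as the example op; after log2(width) butterfly steps every lane in a segment holds the segment result; names are mine):

```c
#define WAVE 32

// allReduceAdd{width}: after each doubling xor step, every lane holds
// the partial sum of a segment twice as wide as before.
static void all_reduce_add(int v[WAVE], int width) {
    for (int xor_mask = 1; xor_mask < width; xor_mask <<= 1) {
        int partner[WAVE];
        for (int lane = 0; lane < WAVE; ++lane)
            partner[lane] = v[lane ^ xor_mask];  // butterfly exchange
        for (int lane = 0; lane < WAVE; ++lane)
            v[lane] += partner[lane];
    }
}
```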

Possible Portable Shuffle Interface 2nd Extension : AMD GCN3 + NV Maxwell
This is just a start of ideas, have not had time to fully explore the options, feedback welcomed...
SIMD width would be different for each platform so developer would need to build shader permutations for different platform SIMD width in various cases.

Backwards permutation of full SIMD width is portable across platforms, maps on NV to shuffleNV(data, index, 32), and DS_BPERMUTE_B32 on AMD,
permutedData = bpermute(data, index)


ISA Toolbox

For years now I have found that nearly everything I work on can be made better by leveraging ISA features which are not always exposed in all the graphics APIs. For example, currently working on a project now which could use the combination of the following,

(1.) From AMD_shader_trinary_minmax, max3(). Direct access to max of three values in a single V_MAX3_F32 operation. If the GPU has 3 read ports on the register file for FMA, might as well take advantage of that for min/max/median. AMD's DX driver shader compiler automatically optimizes these cases, for example "min(x,min(y,z))" gets transformed to "min3(x,y,z)".

(2.) Direct exposure of V_SIN_F32 and V_COS_F32, which have a range of +/- 512 PI and take normalized input. Avoids an extra V_MUL_F32 and V_FRACT_F32 per operation. Nearly all the time I use sin() or cos() I'm in range (no need for V_FRACT_F32). Nearly all the time I'm in the {0 to 1} range for 360 degrees, and need to scale by 2 PI only so code generation can later scale back by 1/2 PI. A portable fallback for machines without V_SIN_F32 and V_COS_F32 like functionality looks like,

const float PI = 3.14159265358979;
float sinNormalized(float x) { return sin(x * 2.0 * PI); }
float cosNormalized(float x) { return cos(x * 2.0 * PI); }

(3.) Branching if any or all of the SIMD vector want to do something. Massively important tool to avoid divergence. For example in a full screen triangle, if any pixel needs the more complex path, just have the full SIMD vector only do the complex path instead of divergently processing both complex and simple. API can be quite simple,

bool anyInvocations(bool x)
bool allInvocations(bool x)

Example of how these could map in GCN (these scalar instructions execute in parallel with vector instructions, so low cost),

// S_CMP_NEQ_U64 x,0
if(anyInvocations(x)) { }

// S_CMP_EQ_U64 x,-1
if(allInvocations(x)) { }

(4.) Quad swizzle for fragment shaders for cross-invocation communication is super useful. Given a 2x2 fragment quad with invocations indexed as follows,

0 1
2 3

These functions would be quite useful (they map to DS_SWIZZLE_B32 in GCN),

// Swap value horizontally.
type quadSwizzle1032(type x)

// Swap value vertically.
type quadSwizzle2301(type x)

For example one could simultaneously write out the results of a fragment shader to the standard full screen pass and write out the 1/2 x 1/2 resolution next smaller mip level at the same time using an extra image store. Just use the following to do a 2x2 box filter in the shader,

boxFilterColor = quadSwizzle1032(color) + color;
boxFilterColor += quadSwizzle2301(boxFilterColor);
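On the CPU those two swizzle lines reduce to the following check (lane layout 0,1 top and 2,3 bottom assumed), leaving every lane of the quad holding the full 2x2 sum:

```c
// 2x2 box filter (sum form) built from the two quad swaps:
// horizontal swap + add, then vertical swap + add.
static void quad_box_filter(const float quad[4], float out[4]) {
    float h[4];
    for (int i = 0; i < 4; ++i) h[i] = quad[i ^ 1] + quad[i];  // quadSwizzle1032
    for (int i = 0; i < 4; ++i) out[i] = h[i ^ 2] + h[i];      // quadSwizzle2301
}
```

Divide by 4 (or fold the 1/4 into later math) to turn the sum into the box average before the image store.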


Mixing Temporal AA and Transparency

Jon Greenberg asks on twitter, "Okay, so here's the TemporalAA question of the day - transparency isn't TAA'd - how do you manage the jittered camera? Ignore it? Oy..."

The context of this question is often the following graphics pipeline,

(1.) Render low-poly proxy geometry for some of the major occluders in a depth pre-pass.
(2.) Render g-buffer without MSAA, each frame using a different jittered sub-pixel viewport offset.
(3.) Render transparency (without viewport jitter) in a separate render target.
(4.) Later apply temporal AA to opaque and somehow composite over the separate transparent layer.

Here are some ideas on paths which might solve the associated problems,

Soft Transparency Only
If the transparent layer has soft-particle depth intersection only (no triangle windows, etc), then things are a lot easier. Could attempt to apply temporal AA to the depth buffer, creating a "soft" depth buffer where edges are partly eroded towards the far background neighborhood of a pixel. Then do a reduction on this "soft" depth buffer, getting lower resolution near and far depth values for the local neighborhood (with some overlap between neighborhoods). Then render particles into two smaller resolution color buffers (soft blending a particle to both near and far layers). Can use the far depth reduction as the Z buffer to test against. Later composite into the back-buffer over the temporal AA output, using the "soft" full-res depth buffer to choose a value between the colors in "near" and "far". Note there is an up-sample involved inline in this process, and various quality/performance trade-offs in how this combined up-sample/blend/composite operation happens. I say "back-buffer" because I don't want to feed the transparency back into the next temporal AA pass.

Hard Transparency
Meaning what to do about windows, glasses, and other things which require full-resolution hard intersections with other opaque geometry. Any working solution here also needs an anti-aliased mask for post temporal AA composite. There are no great solutions to my knowledge with the traditional raster based rendering pipelines with viewport jitter. One option is to work around the problem in the art side, to make glass surfaces mostly opaque and render with matching viewport jitter over the lit g-buffer, also correcting so reprojection or motion vectors pick up the new glass surface instead of what is behind it. So glass goes down the temporal AA path.

Another option might be to use the "soft" depth buffer technique but at full resolution. Probably need to build a full resolution "far" erosion depth buffer (take the far depth of the local neighborhood), then depth test against that. Note depth buffer generated by a shader will have an associated perf cost when tested against. Then when rendering transparency can blend directly over the temporal AA output in the back-buffer. In the shader, fetch the pre-generated "near" and "far" reductions, and soft-blend the hard triangle with both. Then take those two results, lookup the "soft" depth from the full resolution, and use as a guide to lerp between the "near" and "far" result. This will enable a "soft" anti-aliased edge, in theory, assuming all the details that matter are correct...

Note on "Soft" Depth
The "soft" depth probably requires that temporal AA not be applied to linear depth, but instead to some non-linear function of depth. I don't remember anymore which transform works the best, but guessing: if you take this transformed depth, output it to a color channel, and see a clean depth version of the scene with anti-aliased edges, from nearest to far objects, that is a good sign.


Rethinking the Symbolic Dictionary

Another permutation of dictionary implementation for forth like languages...

Exported source is composed of two parts,

(1.) Token array, where tokens can reference a local symbol by index into local hash table.
(2.) Local symbol hash table, has string for each entry.

Strings are 64 bits maximum and are stored in a reversible, nearly pre-hashed form, so hashing a string is just an AND operation. Tokens are 32 bits. The local symbol hash is after the token array, so it can be trivially discarded after import.
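As a guess at what "reversible nearly pre-hashed" could mean in practice (the packing scheme and names here are mine, not from the text): a symbol of at most 8 bytes packs losslessly into a uint64, and hashing into a power-of-2 table is then a single AND of the low bits.

```c
#include <stdint.h>
#include <string.h>

// Pack up to 8 characters into one 64-bit value, losslessly (reversible).
static uint64_t pack_symbol(const char *s) {
    uint64_t v = 0;
    size_t n = strlen(s);
    if (n > 8) n = 8;
    for (size_t i = 0; i < n; ++i)
        v |= (uint64_t)(unsigned char)s[i] << (8 * i);
    return v;
}

// Hashing the pre-packed form is just an AND (table size a power of 2).
static uint32_t symbol_slot(uint64_t packed, uint32_t table_size) {
    return (uint32_t)(packed & (uint64_t)(table_size - 1));
}
```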

Global Dictionary
Global dictionary maps 32-bit index to 32-bit value. Each 32-bit index has an associated 64-bit string stored in the same reversible nearly pre-hashed form. Dictionary entries are allocated by just taking the next entry in a line. There is no deletion. Just two arrays (32-bit value array, and 32-bit string array), and an index for the top.

Source Import
Starts with loaded source in memory and allocated space for one extra array,

(1.) Source token array, gets translated to loaded-in-memory form.
(2.) Source local symbol hash table, with each entry being a 64-bit string.
(3.) Remap space, extra zeroed array with a 32-bit value per entry in local hash table.

Import streams through the global dictionary, checking for a match in the source's local symbol hash table. Upon finding a match, it writes the global index for the symbol into the associated remap space entry. Import next streams through the source token array, replacing the local symbol index with the global index from the remap space entry. When the remap space entry is zero, a new symbol is allocated in the global dictionary (this involves adding a symbol to the end of the dictionary, and copying over the string from the local symbol hash table to the global dictionary string array). After import the local symbol hash table and remap space are discarded.
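A compact model of the token remap step (a sketch that skips the initial global dictionary streaming pass: remap entries start at zero, any symbol still unmapped when its token is hit gets the next dictionary slot, and global indices start at 1 so zero can mean "unmapped"; names are mine):

```c
#include <stdint.h>

// tokens[]: source tokens indexing the local symbol table.
// remap[]:  one 32-bit entry per local symbol, zero until assigned.
// global_top: index of the last allocated global dictionary entry.
static void import_tokens(uint32_t *tokens, int num_tokens,
                          uint32_t *remap, uint32_t *global_top) {
    for (int i = 0; i < num_tokens; ++i) {
        uint32_t local = tokens[i];
        if (remap[local] == 0)
            remap[local] = ++*global_top;  // allocate next dictionary entry
        tokens[i] = remap[local];          // local index -> global index
    }
}
```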

This solves many of the core problems of a more conventional design where the global dictionary is a giant hash table. That conventional design suffers from bad cache locality (because of the huge hash table). This new design maintains a cache packed global dictionary (no gaps). That conventional design can also have worst case first load behavior: each initial lookup of a new word in the dictionary on load would miss through to DRAM, adding 100 ns per lookup. This new design is composed of linear streaming operations for big data (global dictionary, source token array, etc), all of which get hardware auto-prefetch. The source local symbol hash table is expected to be not too big and easily stay in cache (the only thing with random access).

Note with this new design, interpreting source at run-time no longer has any hash lookup, just a direct lookup.

First Source Import
First source import (after machine reboot) has effectively an empty dictionary, so import can be optimized.

Edit time operations, such as finding the index for an existing symbol, checking if a symbol already exists, or tab completing a symbol, are done via a full stream through the global dictionary string table. This is a linear operation with full auto-prefetch, so expected to be quite fast in practice. Edit time operations are limited by human factors, so not a problem.

Source Export
Source export requires first checking how many unique symbols are in the chunk of source. Use a bit array with one bit per global dictionary entry. Zero the bit array. Stream through the chunk of source tokens and check for a clear bit in the bit array. For each clear bit, set the bit, and advance the count of unique words.
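The counting pass described above, sketched directly (bit array words assumed zeroed by the caller; names are mine):

```c
#include <stdint.h>

// One bit per global dictionary entry; the unique count advances only
// on the first sighting of each symbol in the source chunk.
static int count_unique_symbols(const uint32_t *tokens, int num_tokens,
                                uint64_t *bits) {
    int unique = 0;
    for (int i = 0; i < num_tokens; ++i) {
        uint32_t word = tokens[i] >> 6;                 // which 64-bit word
        uint64_t mask = (uint64_t)1 << (tokens[i] & 63); // which bit
        if (!(bits[word] & mask)) { bits[word] |= mask; ++unique; }
    }
    return unique;
}
```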

Setup space for the local symbol hash. Scale up the unique symbol count to make sure the hashing is efficient. Pad up to the next power of 2 in size. Stream through the source tokens, using the token index to get a global dictionary string, hash the string into the local symbol hash, writing the associated string if new entry, and remapping the source token index to the local hash.

Export is the most complex part of the design, but still quite simple.


Continuing on the TS Blog Conversation Chain

Re Joshua Barczak - Texel Shader Discussion...

"I’m suggesting that the calling wave be re-used to service the TS misses (if any), so instead of waiting for scheduling and execution, it can jump into a TS and do the execution itself."

I'm going to attempt to digest actually building this on something similar to current hardware and see where the pitfalls would be. Basically the PS stage shader gets recompiled to include conditional TS execution. This would roughly look like,

(1.) Do some special IMAGE_LOADS which set bit on miss in a wave bitmask stored in a pair of SGPRs.
(2.) Do standard independent ALU work to help hide latency.
(3.) Do S_WAITCNT to wait for IMAGE_LOADS to return.
(4.) Check if bitmask in SGPR is non-zero, if so do wave coherent branch to TS execution (this needs to activate inactive lanes).

Continuing with TS execution,

(5.) Loop while bitmask is non-zero.
(6.) Find first one bit.
(7.) Start TS shader wave-wide corresponding to the lane with the one bit.
(8.) Use the TEX return {x,y} to get an 8x8 tile coordinate to re-generate and {z} for mip level.
(9.) Do TS work and write results directly back into L1.
(10.) When (5.) ends, re-issue IMAGE_LOADS.
(11.) Do S_WAITCNT to wait for loads to return.
(12.) For any new invocations which didn't pass before, save off successful results to other registers.
(13.) Check again if bitmask in SGPR is non-zero, if so go back to (5.).
(14.) Branch back to PS execution (which needs to disable inactive lanes).
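The (5.) through (8.) part of the loop can be modeled as a toy scalar sketch, with Python standing in for wave-level code. `serve_tile` is a hypothetical stand-in for launching the wave-wide TS for a lane's 8x8 tile; on hardware the bitmask would live in SGPRs.

```python
def service_misses(bitmask, lane_coords, serve_tile):
    while bitmask:  # (5.) loop while the miss bitmask is non-zero
        # (6.) find first set bit (lowest lane with a miss)
        lane = (bitmask & -bitmask).bit_length() - 1
        # (8.) TEX return gives the 8x8 tile coordinate and mip level
        x, y, mip = lane_coords[lane]
        # (7.) run the TS wave-wide for this lane's tile
        serve_tile(x, y, mip)
        bitmask &= bitmask - 1  # clear the serviced lane's bit
```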

This kind of design has a bunch of issues, getting into a few of them,

(A.) Step (10.) has no post-load ALU before S_WAITCNT, so it hides less of its own latency (even though it will hit in the cache).

(B.) Need to assume a texture can miss at the point where the wave has already reached peak register usage in the shader, which implies total VGPR usage is the PS peak plus the TS needs. Given the frequency of PS work which is VGPR pressure limited without any TS pass compiled in, this is quite scary. Cannot afford to save out the PS registers. Also cannot afford to build hardware to dynamically allocate registers at run-time just for TS (deadlock issues, worst case problem of too many waves attempting to run TS code paths at the same time, etc). So VGPR usage would be a real problem with TS embedded in PS.

(C.) Need a hardware change to ensure TS results are in pinned cache lines until the first access finishes serving a given invocation. This way the IMAGE_LOAD in (10.) is ensured a hit, guaranteeing some forward progress. There is a real problem that 8x8 tiles generated early in the (5.) loop might normally be evicted by the time all the data was generated.

(D.) Consider random access at 64-bit/texel (aka a fetch from 64 different 8x8 tiles) where all fetches miss. That's 64*8*8*8 bytes (32KB), or double the size of the L1 cache. This causes multiple major terminal design problems, including the wasteful (12.) step: need to support the possibility that one cannot service all texture loads for 64 invocations in one pass.

(E.) The TS embedded in PS option would lead to some radically extreme cases, like multiple waves missing on the same 8x8 tiles and possibly attempting to regenerate the same tiles in parallel.

(F.) The TS embedded in PS option would result in extreme variation in PS execution time, causing a requirement for more buffering in the in-order ROP fixed-function pipeline.

So gut feeling is that this isn't practical.

I feel like many of these ideas fall into a similar design trap: the idea of borrowing the concept of "call and return" from CPUs. Devs have decades of experience solving problems by taking advantage of a "stack", depending on what seems like "free" saving of state and later restoring of state. That idea only applies to hardware which has tiny register files and massive caches. GPUs are the opposite: there is no room in any cache for saving working state. And the working state of a kernel is massive in comparison to the bandwidth used to fetch inputs and write outputs, so one never wants the working set to ever go off-chip. Any time anyone builds a GPU-based API which has a "return", or a "join", this is an immediate red flag for me. GPUs instead require "fire and forget" solutions: things that are "stackless", things which look more like message passing where a job never waits for the return. The message needs to include something which triggers the thing which ultimately consumes the data, or the consumer is pre-scheduled to run and waits on some kind of signal which blocks launch.

Location of "Filtering" in the Graphics Pipeline

This is a tangent related to the prior thread...

The hardware provides fast filtering from the texture unit, but only the worst-case filter: in terms of classic re-sampling filters, bilinear filtering has horrible quality. A proper filter would take many taps and would be phase adaptive: way too expensive for fixed-function texture fetch.

Second issue: filtering components before shading is fundamentally flawed. This practice exists as an evolution of practical trade-offs made to enable real-time graphics. The extent to which developers now attempt to work around this problem can be quite amazing (for example CLEAN mapping).

An alternative to all of this is to shade at fixed positions in object space, then defer filtering into some final screen-space reconstruction pass. The object space positions get translated into world space then later view space, and sub-pixel precision view space coordinates get passed to the reconstruction pass. This provides one critical advantage: anti-aliasing quality becomes decoupled from sampling rate!

Traditionally, if one renders a scene with 8xMSAA using a high quality resolve filter with around a 1 pixel radius, that resolve filter is going to use somewhere around 16 to 24 taps (or sample reads) per output pixel. Anything which does re-sizing during the resolve needs to compute filter weights per tap (phase adaptive). Output of the filter is a weighted average. If one uses temporal AA with jittered rendering, typically 9 taps are required for registration (to remove the jitter), with another 1 to many more taps required for re-projection (depending on filter quality). Ultimately an even more complex filter. If one completely avoids AA and attempts to make a frame look less ugly with MLAA/FXAA/etc, same situation: many filter taps, and a complex filter, this time involving searching for edges.

The common element here in post process anti-aliasing is a complex filter which sources many samples and has become increasingly expensive over the years.
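The core of such a filter can be sketched in a few lines: per-tap weights computed from each sample's distance to the output pixel center (phase adaptive), then a weighted average. This is a minimal illustration only; the Gaussian kernel and its falloff constant are assumptions, not any particular engine's resolve.

```python
import math

def resolve_pixel(samples, cx, cy, radius=1.0):
    """samples: list of (x, y, color) with sub-pixel positions."""
    total_w = 0.0
    total_c = 0.0
    for x, y, c in samples:
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        if d2 > radius * radius:
            continue  # tap outside the filter radius
        w = math.exp(-2.0 * d2 / (radius * radius))  # Gaussian falloff by distance
        total_w += w
        total_c += w * c
    return total_c / total_w if total_w > 0.0 else 0.0
```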

So why limit those samples to shaded points at fixed positions on the screen?

Effectively some amount of the cost of a non-fixed position reconstruction filter is already paid by some AA filter. Some more cost is paid by buffering data across graphics pipeline stages. Even more cost is paid via the complexity introduced by filtering before shading. A reconstruction filter is going to do the same process of computing the distance of each sample to the pixel center to compute a sample weight, then take a weighted average for the final output color. Except this time it takes into consideration the projected size of a sample. The giant leap required is to transform from shading at interpolated positions in screen space on a triangle, to taking shaded samples in object space (or texture space, aka the texture shade cache talked about in prior posts), directly binning them into screen space, then doing frame reconstruction from the bins. Or more specifically: bin the shaded tiles from the texture cache into screen-space tiles, then for each screen-space tile, build up a per-pixel list of samples in local memory for reconstruction, then do reconstruction without going out to RAM again. That is one possible method, there are others.
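The bin-then-reconstruct idea reduces to something like the following toy model. Flat Python lists stand in for the per-tile local memory, and the trivial box average stands in for a real reconstruction filter (which would weight by distance and projected sample size); all names here are illustrative assumptions.

```python
def bin_and_reconstruct(samples, width, height):
    # Bin shaded samples (with sub-pixel screen positions) by the pixel
    # they land in.
    bins = [[] for _ in range(width * height)]
    for x, y, color in samples:
        px, py = int(x), int(y)
        if 0 <= px < width and 0 <= py < height:
            bins[py * width + px].append(color)
    # Trivial box reconstruction per pixel from its local bin.
    return [sum(b) / len(b) if b else 0.0 for b in bins]
```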

What previously might have been 8xMSAA with massive amounts of sub-pixel triangles wasting huge amounts of GPU performance, and which still cannot produce a perfectly anti-aliased scene, gets transformed into something where even fewer samples than pixels can produce a perfectly anti-aliased image which has absolutely no artifacts in motion.

This is where I would expect the ultimate in engine design could hit on current hardware when API restrictions are removed. Enter Vulkan, make sure enough of the ISA instructions for wave-level programming are available in SPIR-V, and perhaps someone will realize this kind of design. Certainly a major challenge which would require some massive transforms from current engine design if anyone wanted to crawl to the destination. I expect the right way to enable such a move is to systematically bite off chunks of the problem, and prove out optimized shader kernels...

Cached Texture Space Shading and Consistency?

Budget so that in the common case enough tiles are processed to completely update the screen at some sample density per pixel. When fewer tiles need to be updated (better scene temporal coherence) there is the ability to,
(a.) save power and battery life, or alternatively
(b.) get increased sample to pixel density for increased quality.

Core point is to have full scalability in engine.

Worst Case?
Worst case is really when the OS grabs some percentage of GPU processing for some other task, or the OS doesn't schedule the game on the GPU at the right time (perhaps because the CPU ran late generating draw calls because a task got preempted, etc). Engines designed around constant workloads are guaranteed to hitch. However in this case, want the engine to gracefully degrade quality while maintaining locked frame rate (this is an absolute must for VR). For example in this "worst case" hitch situation, the engine could maintain say 60Hz, 90Hz, 120Hz or 144Hz, and amortize shading updates across multiple frames. If fixed costs (aka drawing the frame using the shade cache) are low, then the game has more opportunity to maintain a consistent frame rate in the worst case. Shading into the cache simply fills whatever time is left over. This enables a better ability to deal with a variable amount of geometric complexity.

Want content creation to just be able to toss anything at the engine, with the engine automatically scaling to maintain the frame-rate requirement regardless of having a high-end or low-end GPU in the machine, or having simple or massively complex content. Likewise want the game to transparently look better in the future as faster GPUs are released. Shaded sample density, both spatially and temporally, can be the buffer which enables scalability.

Perceptual Masking
EDIT (added). There are natural limits to human perception. Some limits are related to eye scanning speed and eye attention, others perhaps to how fast the mind gains a full understanding of a situation. If a title has a context sensitive idea of where a player's attention is, the game could bias increased sample density in that area. Obvious candidates: the player in 3rd person, the player's active target, etc. Likewise on a scene change (or under fast motion which is not directly tracked by the eye) there are limits to the amount of detail the mind can perceive in just one frame. So if an image is initially not detailed in rapidly changing areas, in a way that does not directly trigger the mind's sense of a visual artifact (for example blocky high contrast edges), and then rapidly converges to high quality faster than perceptual limits under visual coherence, human limits can mask the lack of ability to produce the ideal still image in one frame.

Random Thoughts on TS vs Alternatives

In the context of: Joshua Barczak - Thoughts on Texel Shaders.

Here is where my mind gets stuck when I think about any kind of TS-like stage...

Looking at a single compute unit in GCN,

256 KB of vector registers
16 KB of L1

If a shader occupies 32 waves (relatively good occupancy out of 40 possible) that is a tiny 512 bytes of L1 cache on average per wave. Dividing that out into the 64 lanes of a wave provides just 8 bytes on average per invocation in the L1 cache. Interesting to think about these 8 bytes in the context of how many textures a fragment (or pixel) shader invocation accesses in its lifetime. The ratio of vector registers to L1 cache is 16:1. This working state to cache ratio provides a strong indication that data lifetime in the L1 cache is typically very short. L1 cache serves to collect coherence in a relatively small window temporally, and the SIMD lockstep execution of a full wave guarantees the tight timing requirements. Suggesting likely one could not cache TS stage results in L1. L2 is also relatively tiny in comparison with the amount of vector register state of the full machine...
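Working through the arithmetic above explicitly (all numbers taken from the GCN compute unit figures already quoted):

```python
VGPR_BYTES = 256 * 1024  # vector register file per compute unit
L1_BYTES = 16 * 1024     # L1 cache per compute unit

waves = 32                        # occupancy, out of 40 possible
l1_per_wave = L1_BYTES // waves   # bytes of L1 per wave on average
l1_per_lane = l1_per_wave // 64   # bytes per invocation (64 lanes per wave)
ratio = VGPR_BYTES // L1_BYTES    # working state to cache ratio
```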

Going back to R9 Nano ratios: 16 ops to 2 bytes to 1 texture fetch. The "op" in this context is a vector instruction (1 FMA instruction provides 2 flops). Let's work with the assumption of a balanced shader using those numbers. Say a shader uses 256 vector operations; it then has capacity for 16 texture fetches, and let's assume those 16 fetches are batched into 4 sets of 4 fetches. Let's simplify scheduling to exact round robin. Then simplify to assume magically 5 waves can always be scheduled (enough active waves to run 5 function units like scalar, vector, memory, export, etc). Then simplify to an average texture return latency of 384 cycles (made that up). Given vector ops take 4 clocks, we can ballpark shader runtime as,

4 clocks per op * 256 operations * 5 waves interleaved + 4 batches of fetch * 384 cycles of latency
= 5120 + 1536 = 6656, roughly 6.6 thousand cycles of run-time
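The same ballpark, reproduced with the made-up numbers from above:

```python
clocks_per_op = 4
ops = 256
waves_interleaved = 5
fetch_batches = 4
fetch_latency = 384  # cycles; invented for the example

alu = clocks_per_op * ops * waves_interleaved  # ALU portion of the runtime
mem = fetch_batches * fetch_latency            # exposed fetch latency
total = alu + mem                              # ~6.6 thousand cycles
```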

This made up example is just to point out that adding a TS stage serves as an amplifier on the amount of latency a texture miss can take to service. Instead of pulling in memory, the shader waiting on the miss now waits on another program instead. Assuming TS dumps results to L2 (which auto-backs to memory),

Dump out arguments for TS shading request
Schedule new TS job
Wait until machine has resources available to run scheduled work (free wave)
Wait for shader execution to finish filling in the missing cache lines
Send back an ack that the TS job is finished

If a TS shader can access a procedural texture, in theory that TS shader could also miss, resulting in a compounding amount of latency. The 16:1 ratio of vector registers to L1 cache hints at another problem: the shader has a huge amount of state. Any attempt to save out wave state and later restore it (for a wave which needs to sleep for many 1000's or maybe 10000's of cycles while a TS shader services a miss) is likely to use more bandwidth for the save/restore than the shader itself would use to fetch textures when running without a TS stage. Ultimately suggesting it would be better to service expected TS misses long before a shader runs, instead of attempting to service them while the shader is running...

The majority of visual coherence is temporal, not spatial. Comparing compression ratios of video to still images provides an idea of the magnitude. It might be more powerful to engineer around enabling temporal coherence instead of just very limited spatial coherence. This suggests the optimal end game caches all the way through to DRAM in some kind of view independent parameterization, to enable some amount of reuse across frames in the common case. This also could be a major stepping stone in decoupling shading rate from both refresh-rate and screen resolution. Suggesting again a pipeline which caches what would be TS results across frames...

Gut feeling based on a tremendous amount of hand waving is pointing to something which doesn't actually need any new hardware, something which can be done quite well on existing GCN GPUs for example. Unique virtual shading cache shaded in the same 8x8 texel tiles one might imagine for TS shaders, but in this case async shaded in CS instead. With a background mechanism which is actively pruning and expanding the tree structure of the cache based on the needs of view visibility. Each 8x8 tile with a high precision {scale, translation, quaternion}, paired with a compressed texture feeding a 3D object space displacement, providing texel world space position for rigid bodies or pre-fabs. Skinned objects perhaps have an additional per tile bone list, per tile base weights, and a compressed texture feeding per texel modifications to base weights. Lots of shading complexity is factored out into per tile work. For example with traditional lights, can cull lights to fully in/out of shadow to skip shadow sampling. Each frame can classify the highest priority tiles which need update, then shade them: tiles with actively changing shadow, tiles reflecting quickly changing specular, etc.
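One possible per-tile layout matching the description above. All field names here are assumptions for illustration; in practice this would be a packed GPU-side structure, not a Python object.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ShadeTile:
    # High precision per-tile transform for an 8x8 texel tile.
    scale: float
    translation: Tuple[float, float, float]
    rotation: Tuple[float, float, float, float]  # quaternion {x, y, z, w}
    # Compressed texture feeding 3D object-space displacement
    # (texel world-space position for rigid bodies / pre-fabs).
    displacement: bytes = b""
    # Skinned objects only: per-tile bone list and base weights, plus
    # compressed per-texel modifications to the base weights.
    bone_list: Optional[List[int]] = None
    base_weights: Optional[List[float]] = None
    weight_deltas: bytes = b""
```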


PS4 Dreams

Tech background from one month ago, Alex at Umbra Ignite 2015: Learning From Failure,


Finer Points of Living in the Raleigh North Carolina Area

Backyard BBQ Pit in Durham - Probably the best pulled pork I have ever had. Fatatarian friendly, can ask for extra fat, and classic Eastern North Carolina vinegar based sauce.
Second Empire - Best restaurant in Raleigh. Year round specials of Filet Mignon and Foie Gras plates constantly changing. Classic 1879 building in the heart of Raleigh.
Rush Hour Karting - Race all night Thursdays for $42. 5 hours interleaved with other customers, but often nearly non-stop.
Fickle Creek Farm - My primary local source for eggs, pork, lamb, duck, chicken, beef. Found at the Western Wake Farmers Market and others around the area.
Locals Seafood - Rainbow Trout, Scallops, etc. Also at Western Wake Farmers Market.
Earps Seafood - Local fish market, has fresh head-on shrimp.
Umstead State Park - 5 minute drive (for me), great trails.
Lake Crabtree County Park - And other areas to mountain bike.
Sarah P. Duke Gardens - Nice place to unwind on the weekend.


Thoughts on Minimal Filesystem Design

One of the problems that modern systems have, which I suspect a minimal single user system would not have, is ultra high file counts in a file system. I make this statement on the grounds that minimal design is carried out through the system from the OS through applications. There is a practical limit to the amount of "things" either consumed or produced by an individual. Also when quantity becomes ultra high, it is often better to have the application manually pack "things" instead of splitting them into separate files.

Minimal design as always. Files are always linear (no fragmentation). Files written in groups as a linear stream can be re-read as a linear stream (no seeks between files). Only one seek (one write) to finalize (in storage) the adding of a file entry (or group of entries) to the file system. Writing or reading a file that has an existing entry only costs the read/write (no extra metadata modifications).

Storage device split into two regions: the path stack (starts at the beginning of the device), and the file stack (starts at the end of the device, grows towards the path stack). When a device is mounted (aka opened), the path stack is loaded into RAM and remains in RAM. This contains the entire filesystem structure. The path stack contains {path of file, offset of file on device, size of file on device, deleted flag}. Files are created by adding an entry to the top of the path stack, which allocates room for the file on the top of the file stack. Files are deleted by just marking the delete flag. Files are moved or renamed by changing the path in the file entry. Files can be resized smaller by adjusting the size entry. When the path stack or the drive fills up, either clone the device (the empty device fills fully compacted), or run a local compaction. A minimal OS runs 100% in RAM anyway, so the drive crunching away compacting after a long time of usage is not a problem IMO.

Challenge to conventional design: why bother with organizing file structure in storage to match the directory structure? RAM and compute (to keep organized forms in RAM) are super cheap. Complexity is super expensive. So just load the complete file structure (path stack) into memory on mount. Then create acceleration structures to serve basic access patterns. A hash table (key = path, value = index of path stack entry) when the application knows the name of the file to open, etc.
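An in-RAM model of the path stack and its hash table acceleration structure might look like the following. This is a sketch of the design as described, not an actual implementation: field names are assumptions, and the on-device serialization and file stack allocation are omitted.

```python
class PathStack:
    def __init__(self):
        # Append-only stack of entries, mirrored to storage on finalize.
        self.entries = []   # each: {path, offset, size, deleted}
        # Acceleration structure rebuilt on mount: path -> entry index.
        self.by_path = {}

    def create(self, path, offset, size):
        # Adding an entry to the top of the path stack; offset/size refer
        # to the room allocated on top of the file stack.
        self.by_path[path] = len(self.entries)
        self.entries.append({"path": path, "offset": offset,
                             "size": size, "deleted": False})

    def delete(self, path):
        # Deletion just marks the flag; space is reclaimed by compaction.
        self.entries[self.by_path.pop(path)]["deleted"] = True

    def rename(self, old, new):
        # Move/rename is just changing the path in the file entry.
        i = self.by_path.pop(old)
        self.entries[i]["path"] = new
        self.by_path[new] = i
```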

GLSL Language Evolution

Feel free to post in comments, I'm always attempting to collect feedback from other developers. Often the missing link in getting priority for changes is community feedback!

Curious where developers feel GLSL needs to go in the future. Specifically, what are the desired improvements to the base language for continuing use in WebGL or in OpenGL, or even using GLSL to translate into SPIR-V. Guessing from the recent twitter thread on GLSL spec strictness, the listing of requested changes is as follows:

(a.) Would like GLSL to not require the "U" appended on unsigned integer literals.

(b.) Would like GLSL to support automatic scalar to vector, or smaller vector to larger vector, conversion by implied repetition of the last scalar value, when there is otherwise no ambiguity possible. For instance, if GLSL can spit out an exact error message telling what conversion is required, it could in theory be modified to just do the conversion.

(c.) Might like other forms of typecast to work automatically. Would be good to get a listing of these.
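As a concrete sketch of (a.) and (b.), here is the kind of code that strict GLSL rejects today, with the currently required form in the comments. These are illustrative examples only; exact behavior varies by GLSL version and compiler.

```glsl
uint mask = 7;                // (a.) strict GLSL wants the literal written as 7u
vec3 gray = 0.25;             // (b.) strict GLSL wants vec3(0.25)
vec4 rgba = vec4(gray, 1.0);  // explicit today, and unambiguous either way
```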

Some of the desired reasons for these changes are that they improve programmer productivity and portability of source code across languages. Perhaps a good compromise for those who still desire not to have implicit conversions/typecasts is to still support that via some enable (many of those looking to use implied conversions have the need to minimize total shader size in bytes).