20160127

Local/Global Dimming, Blooming, Contrast Ratios - CRT/LCD/Plasma/OLED

Global Dimming
OLED HDTVs drop to about 33% of peak brightness on a full white screen. My Samsung Plasma HDTV has similar issues. The random made-in-1996 CRT I just profiled at home drops to about 60% of peak brightness on a full white screen. Global dimming has been around for a long time.

LCD Local Dimming vs CRT Blooming
LCD TVs often have LED back-lights split into zones (on the order of dozens for poor quality and low hundreds for better quality), with individual control of the brightness per zone. This zone brightness control moves the {black level, peak brightness} up or down together as a joined pair. A single-pixel-thick peak-brightness line on a black background forces the black level high in a blooming pattern where the line intersects zones. For this reason I always turn off local dimming if it is an option on an LCD TV display. ANSI contrast can still reach 4000:1 with the best LCD panels without any local dimming.

CRTs in contrast have local blooming (talking about the large diffuse effect, not the shadow-mask-scale effect). Let's look at some measured numbers from my Sony Wega CRT TV, with ANSI contrast being the standard 4x4 rectangle checkerboard pattern of black and white boxes.
  • 317 nits on a small white box on a black screen.
  • 302 nits on an ANSI contrast white test rectangle (50% APL).
  • 1.91 nits on an ANSI contrast black test rectangle.
  • 0.12 nits on the bottom of a black background with a large white bar up top.
  • 0.01 nits on a black background with only a mouse on screen.
ANSI contrast ratio for this Wega is a poor 158:1 (302 / 1.91; my random 1996 CRT gets 265:1 in comparison). If one stops here and compares to the great LCD, the LCD appears to have over 4 extra stops of dynamic range. But this isn't the full story. For the Wega, as Average Picture Level (APL) drops from the 50% of the ANSI contrast test to something more natural to look at, the contrast ratio improves: roughly 2500:1 for black with the large white bar case, and then almost up to 32000:1 measuring a black screen with only a white mouse cursor away from the sensor. And all these tests were done in a room at night lit by a few bright lamps.

The CRT does not suffer from the LCD local dimming problem: a single-pixel line won't cause blooming because it does not add enough energy (APL stays low). The CRT's blooming looks totally natural in comparison.

Temporal AA Neighborhood Clamp

Thoughts too long for the twitter thread on this, back from my memory of working on this problem...

In TAA the purpose of the neighborhood clamp is to remove ghosting, using the surrounding local neighborhood of the current frame to bound the color of the reprojection of the prior frame. Many techniques have been tried here, including using the min and max of the neighborhood as an AABB to clip to, clipping to two AABBs (one in RGB and the other in YCoCg) for a tighter bound, etc. It is possible to continue down this path, adding more complex and tighter bounding shapes and doing better clipping for when colors are out of bounds. But ultimately it seemed like this exploration had passed the optimum !/$ point, and I settled on the following low-tech solution for the fast path (a sketch follows after the list),
  • Work in YCoCg.
  • Can reduce to a 5 sample neighborhood in a + pattern.
  • Use simple min and max instead of clipping to AABB.
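
A sketch of that fast path (my reconstruction for illustration, not the original production code; the input names, historyWeight, and the YCoCg helpers are assumptions),

float3 rgbToYCoCg(float3 c) { return float3(
  dot(c, float3( 0.25, 0.5,  0.25)),    // Y
  dot(c, float3( 0.5,  0.0, -0.5 )),    // Co
  dot(c, float3(-0.25, 0.5, -0.25))); } // Cg
float3 yCoCgToRgb(float3 c) { return float3(c.x + c.y - c.z, c.x + c.z, c.x - c.y - c.z); }

float3 taaClampResolve(float3 centerColor, float3 northColor, float3 southColor,
                       float3 eastColor, float3 westColor,
                       float3 reprojectedPrior, float historyWeight)
{
  float3 c = rgbToYCoCg(centerColor);
  float3 n = rgbToYCoCg(northColor);
  float3 s = rgbToYCoCg(southColor);
  float3 e = rgbToYCoCg(eastColor);
  float3 w = rgbToYCoCg(westColor);
  // Simple min/max of the 5 sample + pattern instead of AABB clipping.
  float3 boxMin = min(c, min(min(n, s), min(e, w)));
  float3 boxMax = max(c, max(max(n, s), max(e, w)));
  // Clamp the reprojected prior-frame color to the neighborhood bound.
  float3 history = clamp(rgbToYCoCg(reprojectedPrior), boxMin, boxMax);
  // Blend the clamped history with the current frame as usual.
  return yCoCgToRgb(lerp(c, history, historyWeight));
}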

Why?
Let's reduce the problem to strictly monochrome. The complex 3D bounding shape is now reduced to a 1D segment defined by the min and max of the neighborhood. When the neighborhood is high frequency and high contrast, which is common say with a field of grass, the min can be a shadow tone, and the max can be a bright highlight. Then just about any reprojection passes the neighborhood clamp without adjustment, and ghosting results. Or even if the reprojection gets clamped, it will be at the extreme ends of the large segment, resulting in distortion to the image.

Ultimately this kind of approach cannot solve the problem at hand, so I stopped searching in this specific solution space...

20160118

Avegant Glyph

Curious what the real tech specs are for the Avegant Glyph. Working from comments on the Polygon article, apparently since it doesn't obscure the entire field of vision, you are still grounded in the real world, so the stereo video need not be head tracked and won't cause motion sickness. The site says 1280x720p per eye, side-by-side stereo on HDMI (guessing that is 1280x720p @ 60Hz). Projected directly into the eye via something like DLP where the retina is the screen.

Questions
WHAT IS THE STEREO FOCUS? Do you focus on infinity, or does this have the same fail as traditional shutter glasses where 3D depth is limited to a virtual box (meaning everything in the distance flattens out to a wall projected at some distance from your face)?

WHAT IS THE FOCUS LIKE? Does the image look like a bunch of square pixels like regular DLP, or do the pixels blur together like a CRT to create an artifact-free image?

WHAT IS THE REFRESH QUALITY? Guessing this is a 60 Hz device (media playback), but how long is the image displayed in the eye? For the full 16.6 ms? Can this device do better than 60 Hz (HDMI 1.4a has enough bandwidth for stereo 1280x720 at 90 Hz based on my calculations)?

20160112

20160109

Driving NTSC TVs From Modern GPUs at 59.94 Hz Without Interlacing



Equipment
Using an HDFury 3 (purchased directly from HDFury.com; DHL shipping is free and only takes a few days from Taiwan), along with a Crescendo Systems TC1600. Unfortunately it is only possible to get the TC1600 used now. There are few converters which will output progressive NTSC signals...



The HDFury 3 was designed for HDTV component output. I was not able to get it working for the lower resolution NTSC TVs. So I drive standard VGA output into the TC1600 which converts to component for the NTSC TV.

Limitations
NTSC TVs are fixed frequency, with a 15.734 KHz horizontal rate doing either around 480 lines at 29.97 Hz interlaced or 240 lines at 59.94 Hz without interlacing. The horizontal resolution is variable, but vertical resolution is fixed. Also there is an overscan region which scans outside the visible area of the screen. The lowest horizontal resolution I could drive from this setup was 512 pixels, using the following modeline pulled from Useful Modelines,

Modeline "512x240_60,0Hz 15,7KHz (60Hz)" 9.935 512 520 567 632 240 241 244 262 -hsync -vsync

Guessing anything under that has a dot clock which is too slow for the GPU I tried, the HDMI output, or something else in the pipeline. Going higher, it is possible to easily drive 720x240, but the TV's slot mask is not fine enough to resolve the extra pixels. I drive the TV using a mode which starts out as a 640x240 mode. Pixels are near 1:2 aspect ratio, which is fine for custom rendering. Here is the starting modeline,

Modeline "640x240" 12.312 640 662 719 784 240 244 247 262 -hsync -vsync

The first and last 8 lines tend to be in overscan, so I drop the vertical resolution to 224 by moving lines into the blanking regions. Then I drop horizontal overscan, which reduces the width to 592, and finally adjust the modeline to center the image horizontally by changing the h-sync region. Here is the final modeline next to the original modeline,

Modeline "592x224" 12.312 592 654 711 784 224 236 239 262 -hsync -vsync
Modeline "640x240" 12.312 640 662 719 784 240 244 247 262 -hsync -vsync


20160104

Not Happy About This

Engadget: Lawsuit Over 4K Copy Protection
According to Engadget, Intel and Warner Bros are suing LegendSky (makers of HDFury devices). The Engadget article claims, "it's not very likely that people are buying HDFury solely to reclaim some convenience". Actually, analog gamers, like myself, depend on HDFury devices to drive CRTs from more modern devices which lack VGA outputs. It is a 100% legitimate usage case which has nothing to do with illegal ripping and distribution of encrypted content.

I'm not following how going after this company has any effect on movie piracy. First, anyone using the Integral 4K would also need a 4K video capture setup. Based on a quick browsing of bhphotovideo.com the total hardware investment is going to be pricey enough to cull out the casual would-be pirate with intent to distribute. Second, I'm assuming there are plenty of countries in which LegendSky can continue to legally sell devices like the Integral 4K. So at best they would be blocked from selling the device in places like the US. Ultimately it just takes one rip, it makes no difference if that rip originated in the US. Third, I'm going to take a guess that the more serious rippers are always going to find another way to rip 4K movies, regardless of the existence of the Integral 4K device. Perhaps via 100% software crack, or pulling the signal from display hardware after decrypt, etc.

Shooting Yourself in the Foot
4K was a great opportunity to do the opposite of what the industry ultimately did. The alternative I'd like as a consumer is losslessly compressed original quality content on physical media, with TVs which display unmolested (aka unprocessed) pass-through video, with some baseline factory color calibration. What the industry did instead is focus on higher amounts of compression, more random image molestation in the TV, cheaper streaming costs, more ads, less ownership, less freedom, and now putting my CRT obsession at risk by suing LegendSky. In many ways the industry eroded the quality difference between the purchased content and pirated re-compressed content, to the point where the consumer doesn't know the difference.

In my opinion this is another example of the wrong way to attempt to solve the problem. Instead of focusing on limiting freedom to force consumers into some revenue stream, how about using technology to provide legitimate advantage which consumers actually desire to buy into...

20151222

Random Holiday 2015

Google Photos = Fail Great
Minus has basically imploded, losing a lot of the image content on this blog. Unfortunately Google Photos is not a great alternative: while it seems to support original quality, it does not, see the 2nd image (got low-passed)!



Vintage Interactive Realtime Operating System
Taking part of my Christmas vacation time to reboot my personal console x86-64 OS project. I've rebooted this project more times than I'm willing to disclose on this blog. Hopefully this reboot will stick long enough to build something real with it. Otherwise it's been great practice, and the evolution of ideas has certainly been enlightening. Here is a shot of the binary data for the boot sector in the editor,


My goal is to rewind to the principles of the Commodore 64, just applied to modern x86-64 hardware and evolved down a different path. In my case that means building a free OS with open source code which boots into a programming interface, fully exposes the machine, and provides some minimal interface to access hardware including modern AMD GPUs. With a feature list which is the inverse of a modern operating system: single-user, single-tasking, no paging, no memory protection ... the application gets total control over the machine, free to use parallel CPU cores and GPU(s) for whatever it wants. Ultimately manifesting as a thumb drive which can be plugged into an x86-64 machine with compatible hardware and boot the OS to run 100% from RAM, to run software.

A Tale of Two Languages
C64 provided BASIC at boot; I'm doing something vastly different. Booting into an editor which is a marriage between a spreadsheet, hex editor, raw memory view, debugging tool, interactive live programming environment, annotated sourceless binary editor with automatic relink on edit, and a forth-like language. Effectively I've embedded a forth-like language in a sourceless binary framework. The editor runs in a virtualized console which can easily be embedded inside an application. The editor shows 8 32-bit "cells" of memory per line (half a cacheline), with a pair of rows per cell. The top row has a 10-character annotation (with automatic green highlight if the cell is referenced in the binary), the bottom row shows the data formatted based on a 4-bit tag stored in the annotation. Note the screen shot showing the boot sector was hand assembled 8086 (so is embedded data), built from bits of NASM code chunks then disassembled (it's not showing any of the embedded language source). Tags are as follows,

unsigned 32-bit word
signed 32-bit word
four unsigned 8-bit bytes
32-bit floating point value
-----
unsigned 32-bit word with live update
signed 32-bit word with live update
four unsigned 8-bit bytes with live update
32-bit floating point value with live update
-----
32-bit absolute memory address
32-bit relative memory address [RIP+imm32]
toe language (subset of x86-64 with 32-bit padded opcodes)
ear language (forth-like language, encoded in binary form)
-----
5-character word (6-bits per character)
last 3 saved for GPU binary disassembly

The editor is designed to edit an annotated copy of the live binary, with a framework designed to allow realtime update of the live binary as a snapshot of the edited copy. The "with live update" tags mean that the editor copy saves a 32-bit address to the live data in its copy of the binary (instead of the data itself). This allows for direct edit and visualization of the live data, with the ability to still move bits of the binary around in memory.

The "toe" and "ear" tagged cells show editable disassembled x86-64 code in the form of these languages. The "ear" language is a zero-operand forth-like language. Current design,
regs
----
rax = top
rcx = temp for shift
rbx = 2nd item on data stack, grows up
rbp = 4
rdi = points to last written 32-bit item on compile stack

bin word  name  x86-64 meaning
--------  ----  ----------------------------
0389dd03  ,     add ebx,ebp; mov [rbx],eax;
dd2b3e3e  \     sub ebx,ebp;
c3f2c3f2  ;     ret;
dd2b0303  add\  add eax,[rbx]; sub ebx,ebp;
dd2b0323  and\  and eax,[rbx]; sub ebx,ebp;
07c7fd0c  dat#  add edi,ebp; mov [rdi],imm;
d0ff3e3e  cal   call rax;
15ff3e3e  cal#  call [rip+imm];
058b3e3e  get#  mov eax,[rip+imm];
890fc085  jns#  test eax,eax; jns imm; 
850fc085  jnz#  test eax,eax; jnz imm;
880fc085  js#   test eax,eax; js imm;
840fc085  jz#   test eax,eax; jz imm;
c0c73e3e  lit#  mov eax,imm;
03af0f3e  mul   imul eax,[rbx];
d8f73e3e  neg   neg eax;
00401f0f  nop   nop;
d0f73e3e  not   not eax;
dd2b030b  or\   or eax,[rbx]; sub ebx,ebp;
05893e3e  put#  mov [rip+imm],eax;
f8d20b8b  sar   mov ecx,[rbx]; sar eax,cl;
e0d20b8b  shl   mov ecx,[rbx]; shl eax,cl;
e8d20b8b  shr   mov ecx,[rbx]; shr eax,cl;
dd2b032b  sub\  sub eax,[rbx]; sub ebx,ebp;
dd2b0333  xor\  xor eax,[rbx]; sub ebx,ebp;

For symbols the immediate operands are all 32-bit and use "absolute" or "relative" tagged cells following the "ear" tagged cell. Likewise for "dat#", which pushes an immediate on the compile stack, and "lit#", which pushes immediate data on the data stack, the following cell would have a data tag. The dictionary is directly embedded in the binary, using the edit-time relinking support. No interpretation is done at run-time, only edit-time, as the language is kept in an executable form.

After building so many prototypes which "compile" source to a binary form, I've come to the conclusion that it is many times easier to just keep the source in the binary form, then disassemble and reassemble in the editor on edit. The "toe" subset of x86-64 uses 32-bit opcodes with a null "3e" prefix to pad to alignment. Disassembly/reassembly uses a table with,

{opcode form index, 32-bit opcode base, 32-bit 5-character name}

There are under 16 opcode forms, each of which uses a pair of functions, one for disassembly to a common {name,r,rm,s/imm8} form, and the other for assembly. Other than entering data for the table it was trivial engineering work.

A Tale of Two Editors
To fast track this reboot I've built the first editor as a fully non-live prototype of the editor for the OS. This enables me to bring up the OS using an emulator with the standard pass/fail instant debug method (it works if the emulator doesn't enter an infinite reboot loop). I'm using the same custom VGA font, just different colors because of the bad 256-color palette in xterm. The C based version of the editor was built using the "as a child would do it" method of development over a few days: just start building without any planning, writing code as fast as possible to bring up a prototype. See what works, refine, repeat. It took around 1900 lines total, and produces an insanely large "gcc -Os" compiled binary (optimized for smallest size) at around 160 KB. Later I will see how this compares to the editor fully re-written in the "toe/ear" language in the editor itself...


20151215

GPUOpen!

TechReport | TomsHardware | PCPer
Full disclosure, I'm part of this project at AMD, working in RTG's ISV Engineering Team led by Nicolas Thibieroz ...

GPUOpen from my Personal Perspective as a Developer
There is a transformation happening in the PC space reaching critical mass, catalyzed by Mantle, continuing with two explicit APIs: DX12 and Vulkan. A process of opening up the ability to leverage the GPU. I've been looking forward to this for a long time: the Next Renaissance for GPU Programming on PC. It is a very exciting time to be working in Raja Koduri's Radeon Technology Group (RTG) at AMD, with an entire company driven and aligned to make this happen.

From TechReport: "...GPUOpen resources will get a dedicated portal with links to open-source content hosted on GitHub. The company [AMD] also plans to offer blog posts related to its resources and the graphics development community. The company was adamant that those posts will be written by developers..."

Oh yeah! Open hardware docs, tools, direct access, in combination with a portal for free flow of information. Translation: Open Season on Promoting Advanced GPU Techniques.

For me, the GPUOpen portal is the ideal home to do technical blogging of all the ideas/knowledge/techniques I've been accumulating over the years in the form of self-contained open-source portable and extended shader header files backed with GPU disassembly, timing results, images, video, and/or discussion of how things directly map to the hardware. I'll be one of the developers posting on GPUOpen next year, looking forward to 2016!

20151202

Quality MSAA Resolve

When MSAA is done right I avoid all morphological AAs like FXAA because they add artifacts. Here are some hints on how to do MSAA right. These techniques work in HLSL or GLSL; I'm using the HLSL terms here.

When using MSAA with alpha test, you can run at the standard pixel rate, but manually use EvaluateAttributeSnapped() to super-sample the alpha test inside the shader and then manually set SV_Coverage to get an MSAAed edge. There are other tricks for soft particle blending, but out of time right now...
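
A rough sketch of that trick (my illustration, not code from this post; the 4x sample count, the 0.5 threshold, the resource names, and the 1/16 pixel offset conversion are assumptions),

Texture2D alphaTexture;
SamplerState linearSampler;

float4 alphaTestPS(float2 uv : TEXCOORD0, out uint coverage : SV_Coverage) : SV_Target
{
  coverage = 0;
  [unroll] for (int s = 0; s < 4; ++s)
  {
    // Re-evaluate the interpolated uv at each MSAA sample position.
    // EvaluateAttributeSnapped() takes offsets on a snapped 1/16 pixel grid,
    // so scale the [-0.5, 0.5) sample position by 16 (an assumption about the convention).
    int2 offset = int2(GetRenderTargetSamplePosition(s) * 16.0);
    float2 sampleUv = EvaluateAttributeSnapped(uv, offset);
    if (alphaTexture.Sample(linearSampler, sampleUv).a >= 0.5) coverage |= (1u << s);
  }
  // Shade once at pixel rate as usual; the per-sample coverage mask gives the MSAAed edge.
  return float4(1.0, 1.0, 1.0, 1.0);
}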

The 4xMSAA version of a proper filtered resolve is described below; it is trivial to extend to 8xMSAA. Read samples in a 2 pixel diameter (the minimum suggested; larger can improve quality at more cost). That means start with the 3x3 pixel neighborhood,

. N . .  . N . .  . N . . 
. . . E  . . . E  . . . E
W . . .  W . . .  W . . . 
. . S .  . . S .  . . S .

. N . .  . N . .  . N . . 
. . . E  . . . E  . . . E
W . . .  W . . .  W . . . 
. . S .  . . S .  . . S .

. N . .  . N . .  . N . . 
. . . E  . . . E  . . . E
W . . .  W . . .  W . . . 
. . S .  . . S .  . . S .

And read the following 16 samples (make sure to have linear data after the read for the filtering process),

. . . .  . . . .  . . . . 
. . . .  . . . .  . . . .
. . . .  W . . .  W . . . 
. . S .  . . S .  . . . .

. . . .  . N . .  . N . . 
. . . E  . . . E  . . . .
. . . .  W . . .  W . . . 
. . S .  . . S .  . . . .

. . . .  . N . .  . N . . 
. . . E  . . . E  . . . .
. . . .  . . . .  . . . . 
. . . .  . . . .  . . . .

Filter weights can be any filter you want (see Reconstruction Filters in Computer Graphics for many options, but note negative lobes with only 4xMSAA tend to not work well, and play with the radius until it looks right). Each of the four following lettered sample groups shares the same filter weight (same distance from the center of the pixel),

. . . .  . . . .  . . . . 
. . . .  . . . .  . . . .
. . . .  c . . .  d . . . 
. . d .  . . b .  . . . .

. . . .  . a . .  . c . . 
. . . b  . . . a  . . . .
. . . .  a . . .  b . . . 
. . c .  . . a .  . . . .

. . . .  . b . .  . d . . 
. . . d  . . . c  . . . .
. . . .  . . . .  . . . . 
. . . .  . . . .  . . . .

Filter weights should be pre-computed. You end up with a resolve filter which does the following math where {a,b,c,d} are pre-computed filter weights,

outColor.rgb =
sample0 * a +
sample1 * a +
sample2 * a +
sample3 * a +
sample4 * b +
sample5 * b +
sample6 * b +
sample7 * b +
sample8 * c +
sample9 * c +
sampleA * c +
sampleB * c +
sampleC * d +
sampleD * d +
sampleE * d +
sampleF * d;


I'd highly suggest re-ordering the math and loads so that samples are processed from pixels in this order,

0 1 2
5 4 3
6 7 8


Fetch all samples needed for the pixel and FMA them into the weighted average output. This way the compiler is pre-warmed to the best fetch ordering.
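
Putting the above together, a sketch of the resolve pass (my illustration, not the author's shipping code; it computes a placeholder Gaussian-like kernel on the fly where the text above says to pre-compute the weights, and assumes GetSamplePosition() returns positions relative to the pixel center),

Texture2DMS<float3, 4> srcColor; // 4xMSAA color, assumed linear at this point

float filterWeight(float2 offsetFromPixelCenter)
{
  // Placeholder radial kernel with roughly 2 pixel diameter support; pre-compute in practice.
  float d2 = dot(offsetFromPixelCenter, offsetFromPixelCenter);
  return (d2 < 1.0) ? exp(-2.0 * d2) : 0.0;
}

float3 filteredResolve(int2 pixel)
{
  float3 sum = float3(0.0, 0.0, 0.0);
  float totalWeight = 0.0;
  // Plain row order here for brevity; reorder the fetches into the snake order above in practice.
  [unroll] for (int y = -1; y <= 1; ++y)
  {
    [unroll] for (int x = -1; x <= 1; ++x)
    {
      [unroll] for (int s = 0; s < 4; ++s)
      {
        // Offset of this sample from the destination pixel center, in pixels.
        float2 offset = float2(x, y) + srcColor.GetSamplePosition(s);
        float w = filterWeight(offset);
        sum += srcColor.Load(pixel + int2(x, y), s) * w;
        totalWeight += w;
      }
    }
  }
  return sum / totalWeight;
}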

20151127

Black Friday vs Display Technology

My wife and I are in the Toronto area in Canada visiting family for the US Thanksgiving holiday. Naturally food and shopping resulted in a trip to Costco and the mall, where I ran across an Apple store, and managed to entertain myself by getting back up to speed again with the state of 4K...

I'm Lost, Where is the 4K?
In some Markham Apple store, I tried one of these 27" 5K iMacs. Was very confused at first, looking at the desktop photo backgrounds, everything is up-sampled. But the rendered text in the UI, that was sharp, actually too sharp. Had to find out what GPU was driving this 5K display. Opened Safari and went to www.apple.com, went to the website with iMac tech specs. Couldn't read the text. Solved this problem by literally sticking my face to the screen. AMD GPU, awesome.

Second, Costco: everything is 4K, and yet nothing shows 4K content. Seems like out of desperation they were attempting to show some time-lapse videos from DSLR still frames (easier to get high resolution from stills than from video). Except even those looked massively over-processed and up-sampled (looked like less than Canon 5D Mark I source resolution). All the BluRay content looks like someone accidentally set the median filter strength to 1000%. All the live digital TV content looks like "larger-than-life" video compression artifacts post processed by some haloing sharpening up-sampling filter. There is a tremendous opportunity for improvement here driving the 4K TV from the PC, and doing better image processing.

Ok So Lets Face Reality, Consumers Have Never Seen Real 4K Video Content
Well except for IMAX in the theater. Scratch that, consumers haven't even seen "real" 2K content (again other than at the movie theater). Artifact-free 2K content simply doesn't exist at the bit-rate people stream content. Down-sample to 540p and things look massively better.

What About Photographs
Highest-end "prosumer" Canon 5DS's 8688x5792 output isn't truly 8K. It is under 4K by the time it is down-sampled enough to remove artifacts. At a minimum the Bayer sampling pattern results in technically something around 8K/2 in pixel area, but typical Bayer reconstruction introduces serious chroma aliasing artifacts. Go see for yourself, zoom into the white on black text pixels on dpreview's studio scene web app. But that is just ideal lighting, at sharpest ideal aperture, on studio stills. Try keeping pixel-perfect focus everywhere in hard lighting in a real scene with a 8K 35mm full-frame camera... Anyway most consumers have relatively low resolution cameras in comparison. Even the Canon 5D Mark III is only 5760x3840 which is well under 4K after down-sampling to reach pixel perfection. Ultimately a Canon 5D Mark III still looks better full screen on a 2K panel because the down-sampling removes all the artifacts which the 4K panel makes crystal clear.

4K Artifact Amplification Machine Problem
Judging by actual results on Black Friday, no one is doing a good job of real-time video super-resolution in 4K TVs. It is like trying to squeeze blood from a stone, more so when placing in the context that the source is typically suffering from severe compression artifacts. The HDTV industry needs its form of the "Organic" food trend: please bring back some non-synthetic image processing. The technology to do a good job of image enlargement was actually perfected a long time ago, it is called the slide projector. Hint: use a DOF filter to up-sample.

4K and Rendered Content
What would sell 4K as a TV to me: a Pixar movie rendered at 4K natively, played back from losslessly compressed video. That is how I define "real" content.

What about games? I've been too busy to try Battlefront yet except the beta, and I'm on vacation now without access to my machine (1st world problems). But there are some perf reviews for 4K online. Looks like Fury X is the top single-GPU score at 45 fps. Given there are no 45 Hz 4K panels, that at best translates to 30 Hz without tearing. Seems like variable refresh is even a more important feature for 4K than it was at 1080p, this is something IMO TV OEMs should get on board with. Personally speaking, 60 Hz is the minimum fps I'm willing to accept for a FPS, so I'm probably only going to play at 1080p down-sampled to 540p (for the anti-aliasing) on a CRT scanning out around 85 Hz (to provide a window to avoid v-sync misses).

Up-Sampling Quality and Panel Resolution
The display industry seems to have adopted a one-highest-resolution fits all model, which is quite scary, because for gamers like myself, FPS and anti-aliasing quality is the most important metric. Panel resolution beyond the capacity to drive perfect pixels is actually what destroys quality, because it is impossible to up-sample with good quality.

CRT can handle variable resolutions because the reconstruction filtering is at an infinite resolution. Beam has a nice filtered falloff which blends scan-lines perfectly. 1080p up-sampled on a 4K panel will never look great in comparison, because filtering quality is limited to alignment to two square display pixels. 4K however probably will provide a better quality 540p. Stylized up-sampling is only going to get more important as resolutions increase. Which reminds me, I need to release the optimized versions of all my CRT and stylized up-sampling algorithms...

CRT vs LCD or OLED in the Context of HDR
Perhaps it might be possible to say that HDR would finally bring forward something superior to CRTs. Except there is one problem: display persistence. A typical 400 nit PC LCD driven at 120 Hz has an 8.33 ms refresh. Dropping this to a 1 ms low persistence strobed frame would result in 1/8 the brightness (aka it would act like a 50 nit panel). With LCDs there is also the problem of the strobe mismatching the scanning pattern (which changes the pixel value), resulting in ghosting towards the top and bottom of the screen. Add on top that large displays like TV panels have power per screen area problems resulting in global dimming. So I highly doubt LCD displaces the CRT any time soon even in the realm of 1000 nit panels.

OLED is a wild card. Works great for low persistence in VR, but what about large panels? Seems like the WRGB LG 4K OLED panel is the one option right now (it is apparently in the Panasonic as well). Based on displaymate.com results, OLED is not there yet. Blacks should be awesome, and better than CRTs, but according to hdtvtest.co.uk's review of the latest LG OLED, there is a serious problem with near black vertical banding after 200+ hours of use. Also, with only around 400-some nits peak and more global dimming issues than typical LCDs, it looks like large-panel low-persistence is going to be a problem for now. Hopefully this tech gets improved and they eventually release a 1080p panel which does low persistence at as low as 80 Hz.

Reading tech specs of these 4K HDR TVs paints a clear picture of what HDR means for Black Friday consumers. Around a 400-500 nit peak, just like current PC panels, but PC panels don't have global dimming problems. The newest TV LCD panels seem to have a 1-stop ANSI contrast advantage, perhaps less back-light leakage with larger pixels. Screen reflectance has been dropping (this is an important step in getting to better blacks). TV LCD panels are approaching P3 gamut. PC panels have been approaching Adobe RGB gamut. Both are similar in area. Adobe RGB mostly adds green area to the sRGB/Rec709 primaries, where P3 adds less green, but more red. So ultimately if you grab a Black Friday OLED and don't get the near black 200+ hour problem, HDR translates to literally "better shadows".

Rec 2020 Gamuts and Metamerism
This Observer Variability in Color Image Matching on a LCD monitor and a Laser Projector paper is a great read. The laser projector is a Microvision (the same tech as in a Pico Pro) and has a huge Rec 2020 like gamut. The route to this gamut is via really narrow band primaries. As the primaries get narrow and move towards the extremes of the visible spectrum, the human perception of the color generated by the RGB signal becomes quite divergent, see Figure 6. Note it is impossible to use a measurement tool to calibrate this problem away. The only way to fix it is via manual user adjustment visually: probably via selecting between 2 or 3 stages of 3x3 swatches on the screen. And note that manual "calibration" would only be good for one user... anyone else looking at the screen is probably seeing a really strangely tinted image.

HDR and Resolution and Anti-Aliasing
While it will still take maybe 5 to 10 years before the industry realizes this, HDR + high resolution has already killed off the current screen-grid aligned shading. Let's start with the concept of "nyquist frequency" in the context of still and moving images. For a set display resolution, stills can have 2x the spatial resolution of video if they either align features to pixels (aka text rendering) or accept that pixel sized features not aligned to pixel center will disappear as they align with the pixel border. LCDs adopted "square pixels" and amplified the spatial sharpness of pixel-aligned text, and this works to the disadvantage of moving video. Video without temporal aliasing can only safely resolve a 2 pixel wide edge at full contrast, as one pixel edges under proper filtering would disappear as they move towards a pixel border (causing temporal aliasing). So contrast, as a function of frequency of detail, needs to approach zero as features approach 1 pixel in width. HDR can greatly amplify temporal aliasing, making this filtering many times more important.

Screen-grid aligned shading via the raster pipe requires N samples to represent N gradations between pixels. LDR white on black requires roughly 64 samples/pixel to avoid visible temporal aliasing, the worst aliasing in this case being around 1/64 of white (possible to mask the remaining temporal aliasing in film grain). With HDR the number of samples scales by the contrast between the lightest and darkest sample. To improve this situation requires factoring in the fractional coverage of the shaded sample. And simply counting post-z sample coverage won't work (not enough samples for HDR). Maybe using barycentric distance of triangle edges to compute a high precision coverage might be able to improve things...

The other aspect of graphics pipelines which needs to evolve is the gap between the high frequency MSAA resolve filter and low frequency bloom. The MSAA resolve filter cannot afford to get wide enough to properly resolve an HDR signal: the more the contrast, the larger the resolve filter kernel must be. An MSAA resolve without temporal aliasing in LDR requires a 2 pixel window. With HDR a 1/luma weighting bias is typically used, which creates a wrong image. The correct way is to factor the larger-than-2-pixel window into a bloom filter which starts at pixel frequency (instead of, say, half pixel frequency).

But these are really just bandages, shading screen aligned samples doesn't scale. Switching to object/texture space shading with sub-pixel precision reconstruction is the only way to decouple resolution from the problem. And after reaching this point, the video contrast rule of approaching zero contrast around 1 pixel wide features starts to work to a major advantage, as it reduces the number of shaded samples required across the image...

20151122

CRT Inventory

Makvision M2929 ----------------- 29", _4:3, _800x600__, 30-40_ KHz, 47-90_ Hz, 90___ MHz, slot mask_____, _____0.73 mm pitch, VGA_
Sony Wega KV-30HS420 ------------ 30", 16:9, _853x1080i, ______ KHz, ______ Hz, _____ MHz, aperture grill, ____________ pitch, HDMI
HP "1024" D2813 ----------------- 14", _4:3, 1024x768__, 30-49_ KHz, 50-100 Hz, _____ MHz, shadow mask___, ____________ pitch, VGA_
MaxTech XT-4800 ----------------- 14", _4:3, 1024x768__, 30-48_ KHz, 50-100 Hz, 80___ MHz, shadow mask___, _____0.28 mm pitch, VGA_
Compaq MV740 -------------------- 17", _4:3, 1280x1024_, 30-70_ KHz, 50-120 Hz, _____ MHz, shadow mask___, _____0.28 mm pitch, VGA_
Future Power 17db77 ------------- 17", _4:3, 1280x1024_, 30-70_ KHz, 50-160 Hz, _____ MHz, shadow mask___, _____0.27 mm pitch, VGA_
Sony Vaio CPD E200 -------------- 17", _4:3, 1600x1200_, 30-85_ KHz, 50-120 Hz, _____ MHz, aperture grill, _____0.24 mm pitch, VGA_
ViewSonic Optiquest Q95 --------- 19", _4:3, 1600x1200_, 30-86_ KHz, 50-160 Hz, 202__ MHz, shadow mask___, _____0.21 mm pitch, VGA_
ViewSonic G75f ------------------ 17", _4:3, 1600x1200_, 30-86_ KHz, 50-180 Hz, 135__ MHz, shadow mask___, 0.21-0.25 mm pitch, VGA_
ViewSonic PS790 ----------------- 19", _4:3, 1600x1200_, 30-95_ KHz, 50-180 Hz, 202.5 MHz, shadow mask___, _____0.25 mm pitch, VGA_
Dell Ultrascan 1600HS D1626HT --- 21", _4:3, 1600x1200_, 30-107 KHz, 48-160 Hz, _____ MHz, aperture grill, 0.25-0.27 mm pitch, VGA_

Possible Pickup List
empty

Dead List
Dell Trinitron UltraScan P991 --- 19", _4:3, 1600x1200_, 30-107 KHz, 48-120 Hz, _____ MHz, aperture grill, _________ mm pitch, VGA_
Dell Ultrascan 20TX D2026T-HS --- 20", _4:3, 1600x1200_, 31-96_ KHz, 50-100 Hz, _____ MHz, aperture grill, _____0.26 mm pitch, VGA_
eMachines eView 17f3 ------------ 17", _4:3, 1280x1024_, 30-72_ KHz, 50-160 Hz, _____ MHz, shadow mask___, _____0.25 mm pitch, VGA_
KDS VS-195 ---------------------- 19", _4:3, 1600x1200_, 30-95_ KHz, __-120 Hz, _____ MHz, shadow mask___, _____0.26 mm pitch, VGA_

Comments
Picked up the MaxTech XT-4800 never having been used. Works awesome. The manual didn't have peak KHz and Hz, but did have max bandwidth. Wasn't able to hit peak bandwidth, but could get to 120 Hz. An internet search for the peak says 100 Hz.

The KDS VS-195 is total garbage as it has a forced strong sharpening filter which cannot be disabled. The Sony Vaio CPD E200 also has a forced less strong sharpening filter which cannot be disabled.

The eMachines eView was never used by the prior owner, and promptly died after one day of my use. It was a nice CRT, flat screen with a shadow mask. Would like to figure out a way to fix it.

My HP 1024 does not have an internal degaussing coil and now has a color problem thanks to being stored too close to an old TV. Looking for an external degaussing coil; in the meantime I'm probably going to place it near another monitor and degauss that one to get the HP 1024 fixed.

The Dell Ultrascan 20TX was picked up on Craigslist; the owner sold me a broken CRT without disclosing that the signal goes black for part of the middle of the screen.

The Dell Trinitron UltraScan P991 is having problems, too used up.

Cross-Invocation Data Sharing Portability

A general look at the possibility of portability for dGPUs with regards to cross-invocation data sharing (aka to go next after ARB_shader_ballot which starts exposing useful SIMD-level programming constructs). As always I'd like any feedback anyone has on this topic, feel free to write comments or contact me directly. Warning this was typed up fast to collect ideas, might be some errors in here...

References: NV_shader_thread_group | NV_shader_thread_shuffle | AMD GCN3 ISA Docs

NV Quad Swizzle (supported on Fermi and beyond)
shuffledData = quadSwizzle{mode}NV({type} data, [{type} operand])
Where,
(1.) "mode" is {0,1,2,3,X,Y}
(2.) "type" must be a floating point type (implies possible NaN issues issues with integers)
(3.) "operand" is an optional extra unshuffled operand which can be added to the result
The "mode" is either a direct index into the 2x2 fragment quad, or a swap in the X or Y directions.

AMD GCN DS_SWIZZLE_B32
swizzledData = quadSwizzleAMD({type} data, mode)
Where,
(1.) "mode" is a bit array, can be any permutation (not limited to just what NVIDIA exposes)
(2.) "type" can be integer or floating point

Possible Portable Swizzle Interface
bool allQuad(bool value) // returns true if all invocations in quad are true
bool anyQuad(bool value) // returns true for entire quad if any invocations are true
swizzledData = quadSwizzleFloat{mode}({type} data)
swizzledData = quadSwizzle{mode}({type} data)
Where,
(1.) "mode" is the portable subset {0,1,2,3,X,Y} (same as NV)
(2.) "type" is limited to float based types only for quadSwizzleFloat()
This is the direct union of common functionality from both dGPU vendors. NV's swizzled data returns 0 for "swizzledData" if any invocation in the quad is inactive according to the GL extension. AMD returns 0 for "swizzledData" only for inactive invocations. So the portable spec would have undefined results for "swizzledData" if any invocation in the fragment quad is inactive. This is a perfectly acceptable compromise IMO. It would work on all AMD GCN GPUs and any NVIDIA GPU since Fermi for quadSwizzleFloat(), and since Maxwell for quadSwizzle() (using shuffle, see below); this implies two extensions. Quads in non-fragment shaders are defined by directly splitting the SIMD vector into aligned groups of 4 invocations.
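
As a small usage note (my example, not from either extension spec): the X swap gives the horizontal quad neighbor, so a ddx()-like magnitude falls out directly,

// Same magnitude as ddx(v) within the quad; the sign flips between the two columns.
float dx = abs(quadSwizzleX(v) - v);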



NV Shuffle (supported starting with Maxwell)
shuffledData = shuffle{mode}NV({type} data, uint index, uint width, [out bool valid])
Where,
(1.) "mode" is one of {up, down, xor, indexed}
(2.) "data" is what to shuffle
(3.) "index" is a invocation index in the SIMD vector (0 to 31 on NV GPUs)
(4.) "width" is {2,4,8,16, or 32}, divides the SIMD vector into equal sized segments
(5.) "valid" is optional return which is false if the shuffle was out-of-segment
Below the "startOfSegmentIndex" is the invocation index of where the segment starts in the SIMD vector. The "selfIndex" is the invocation's own index in the SIMD vector. Each invocation computes a "shuffleIndex" of another invocation to read "data" from, then returns the read "data". Out-of-segment means that "shuffleIndex" is out of the local segment defined by "width". Out-of-segment shuffles result in "valid = false" and sets "shuffleIndex = selfIndex" (to return un-shuffled "data"). The computation of "shuffleIndex" before the out-of-segment check depends on "mode".
(indexed) shuffleIndex = startOfSegmentIndex + index
(_____up) shuffleIndex = selfIndex - index
(___down) shuffleIndex = selfIndex + index
(____xor) shuffleIndex = selfIndex ^ index


AMD DS_SWIZZLE_B32 (all GCN)
Also can do swizzle across segments of 32 invocations using the following math.
and_mask = offset[4:0];
or_mask = offset[9:5];
xor_mask = offset[14:10];
for (i = 0; i < 32; i++) {
j = ((i & and_mask) | or_mask) ^ xor_mask;
thread_out[i] = thread_valid[j] ? thread_in[j] : 0; }

The "_mask" values are compile time immediate values encoded into the instruction.

AMD VOP_DPP (starts with GCN3: Tonga, Fiji, etc)
DPP can do many things,
For a segment size of 4, can do full permutation by immediate operand.
For a segment size of 16, can shift invocations left by an immediate operand count.
For a segment size of 16, can shift invocations right by an immediate operand count.
For a segment size of 16, can rotate invocations right by an immediate operand count.
For a segment size of 64, can shift or rotate, left or right, by 1 invocation.
For a segment size of 16, can reverse the order of invocations.
For a segment size of 8, can reverse the order of invocations.
For a segment size of 16, can broadcast the 15th segment invocation to fill the next segment.
Can broadcast invocation 31 to all invocations after 31.
Has option of either using "selfIndex" on out-of-segment, or forcing return of zero.
Has option to force on invocations for the operation.

AMD DS_PERMUTE_B32 / DS_BPERMUTE_B32 (starts with GCN3: Tonga, Fiji, etc)
Supports something like this (where "temp" is in hardware),
bpermute(data, uint index) { temp[selfIndex] = data; return temp[index]; }
permute(data, uint index) { temp[index] = data; return temp[selfIndex]; }


Possible Portable Shuffle Interface : AMD GCN + NV Maxwell
This is just a start of ideas, have not had time to fully explore the options, feedback welcomed...
SIMD width would be different for each platform, so the developer would need to build shader permutations for different platform SIMD widths in some cases.

SEGMENT-WIDE BUTTERFLY
butterflyData = butterfly{width}({type} data)
Where "width" is {2,4,8,16,32}. This is "xor" mode for shuffle on NV, and DS_SWIZZLE_B32 on AMD (with and_mask = ~0, and or_mask = 0) with possible DPP optimizations on GCN3 for "width"={2 or 4}. The XOR "mask" field for both NV and AMD is "width>>1". This can be used to implement a bitonic sort (see slide 19 here).

SEGMENT-WIDE PARALLEL SCAN
TODO: Weather is nice outside, will write up later...

SEGMENT-WIDE PARALLEL REDUCTIONS
reducedData = reduce{op}{width}({type} data)
Where,
(1.) "op" specifies the operation to use in the reduction (add, min, max, and, ... etc)
(2.) "width" specifies the segment width
At the end of this operation only the largest indexed invocation in each segment has the result, the values for all other invocations in the segment are undefined. This enables both NV and AMD to have optimal paths. This uses "up" or "xor" mode on NV for log2("width") operations. Implementation on AMD GCN uses DS_SWIZZLE_B32 as follows,
32 to 16 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=16
16 to 8 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=8
8 to 4 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=4
4 to 2 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=2
2 to 1 => DS_SWIZZLE_B32 and_mask=31, or_mask=0, xor_mask=1
64 from finalized 32 => V_READFIRSTLANE_B32 to grab invocation 0 to apply to all invocations

Implementation on AMD GCN3 uses DPP as follows,
16 to 8 => reverse order of 16-wide (DPP_ROW_MIRROR)
__0123456789abcdef__
__fedcba9876543210__
8 to 4 => reverse order of 8-wide (DPP_ROW_HALF_MIRROR)
_01234567_
_76543210_
4 to 2 => reverse order using full 4-wide permutation mode
__0123__
__3210__
2 to 1 => reverse order using full 4-wide permutation mode
__0123__
__1032__
32 from finalized 16 => DPP_ROW_BCAST15
__...............s...............t...............u................__
__................ssssssssssssssssttttttttttttttttuuuuuuuuuuuuuuuu__
64 from finalized 32 => DPP_ROW_BCAST32
__...............................s................................__
__................................ssssssssssssssssssssssssssssssss__

reducedData = allReduce{op}{width}({type} data)
The difference being that all invocations end up with the result. Uses "xor" mode on NV for log2("width") operations. On AMD this is the same as "reduce" except for "width"={32 or 64}. The 64 case can use V_READLANE_B32 from the "reduce" version to keep the result in an SGPR to save from using a VGPR. The 32 case can use DS_SWIZZLE_B32 for the 32 to 16 step.


Possible Portable Shuffle Interface 2nd Extension : AMD GCN3 + NV Maxwell
This is just a start of ideas, have not had time to fully explore the options, feedback welcomed...
SIMD width would be different for each platform, so the developer would need to build shader permutations for different platform SIMD widths in various cases.

SIMD-WIDE PERMUTE
Backwards permutation of full SIMD width is portable across platforms, maps on NV to shuffleNV(data, index, 32), and DS_BPERMUTE_B32 on AMD,
permutedData = bpermute(data, index)

20151121

ISA Toolbox

For years now I have found that nearly everything I work on can be made better by leveraging ISA features which are not always exposed in all the graphics APIs. For example, I'm currently working on a project which could use the combination of the following,

(1.) From AMD_shader_trinary_minmax, max3(). Direct access to the max of three values in a single V_MAX3_F32 operation. If the GPU has 3 read ports on the register file for FMA, might as well take advantage of that for min/max/median. AMD's DX driver shader compiler automatically optimizes these cases, for example "min(x,min(y,z))" gets transformed to "min3(x,y,z)".
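
A portable fallback for targets without the extension is trivial (my sketch); the win is purely in getting the compiler to emit the single 3-operand instruction instead,

float max3(float x, float y, float z) { return max(x, max(y, z)); }
float min3(float x, float y, float z) { return min(x, min(y, z)); }
float med3(float x, float y, float z) { return max(min(x, y), min(max(x, y), z)); }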

(2.) Direct exposure of V_SIN_F32 and V_COS_F32, which have a range of +/- 512 PI and take normalized input. Avoids an extra V_MUL_F32 and V_FRACT_F32 per operation. Nearly all the time I use sin() or cos() I'm in range (no need for V_FRACT_F32). Nearly all the time I'm in the {0 to 1} range for 360 degrees, and need to scale by 2 PI only so code generation can later scale back by 1/(2 PI). A portable fallback for machines without V_SIN_F32 and V_COS_F32 like functionality looks like,

float sinNormalized(float x) { return sin(x * 2.0 * PI); }
float cosNormalized(float x) { return cos(x * 2.0 * PI); }


(3.) Branching if any or all of the SIMD vector want to do something. Massively important tool to avoid divergence. For example in a full screen triangle, if any pixel needs the more complex path, just have the full SIMD vector only do the complex path instead of divergently processing both complex and simple. API can be quite simple,

bool anyInvocations(bool x)
bool allInvocations(bool x)


Example of how these could map in GCN (these scalar instructions execute in parallel with vector instructions, so low cost),

// S_CMP_NEQ_U64 x,0
// S_CBRANCH_SCCNZ
if(anyInvocations(x)) { }

// S_CMP_EQ_U64 x,-1
// S_CBRANCH_SCCNZ
if(allInvocations(x)) { }


(4.) Quad swizzle for fragment shaders for cross-invocation communication is super useful. Given a 2x2 fragment quad as follows,

01
23


These functions would be quite useful (they map to DS_SWIZZLE_B32 in GCN),

// Swap value horizontally.
type quadSwizzle1032(type x)

// Swap value vertically.
type quadSwizzle2301(type x)


For example one could write out the results of a fragment shader for the standard full screen pass and at the same time write out the 1/2 x 1/2 resolution next smaller mip level using an extra image store. Just use the following to do a 2x2 box filter in the shader,

boxFilterColor = quadSwizzle1032(color) + color;
boxFilterColor += quadSwizzle2301(boxFilterColor);
boxFilterColor *= 0.25; // normalize the 2x2 sum into the box filter average


20151116

Mixing Temporal AA and Transparency

Jon Greenberg asks on twitter, "Okay, so here's the TemporalAA question of the day - transparency isn't TAA'd - how do you manage the jittered camera? Ignore it? Oy..."

The context of this question is often the following graphics pipeline,

(1.) Render low-poly proxy geometry for some of the major occluders in a depth pre-pass.
(2.) Render g-buffer without MSAA, each frame using a different jittered sub-pixel viewport offset.
(3.) Render transparency (without viewport jitter) in a separate render target.
(4.) Later apply temporal AA to opaque and somehow composite over the separate transparent layer.

Here are some ideas on paths which might solve the associated problems,

Soft Transparency Only
If the transparent layer has soft-particle depth intersection only (no triangle windows, etc), then things are a lot easier. Could attempt to apply temporal AA to the depth buffer, creating a "soft" depth buffer where edges are partly eroded towards the far background neighborhood of a pixel. Then do a reduction on this "soft" depth buffer, getting smaller resolution near and far depth values for the local neighborhood (with some overlap between neighborhoods). Then render particles into two smaller resolution color buffers (soft blending a particle to both near and far layers). Can use the far depth reduction as the Z buffer to test against. Later composite into the back-buffer over the temporal AA output, using the "soft" full-res depth buffer to choose a value between the colors in "near" and "far". Note there is an up-sample involved inline in this process, and various quality/performance trade-offs in how this combined up-sample/blend/composite operation happens. I say "back-buffer" because I don't want to feed the transparency back into the next temporal AA pass.
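
To make the final composite step concrete, a rough sketch (the resource names, depth transform, and blend details are all my assumptions about one way this could look, not a tested implementation),

Texture2D<float>  softDepthTex;   // full-res TAA'd (transformed) depth
Texture2D<float2> nearFarTex;     // reduced-res {near, far} depth of the local neighborhood
Texture2D<float4> nearLayerTex;   // reduced-res transparency, near layer (premultiplied alpha)
Texture2D<float4> farLayerTex;    // reduced-res transparency, far layer (premultiplied alpha)
SamplerState      linearSampler;

float3 compositeTransparency(float2 uv, float3 backBuffer)
{
  float  softDepth = softDepthTex.SampleLevel(linearSampler, uv, 0);
  float2 nearFar   = nearFarTex.SampleLevel(linearSampler, uv, 0);
  float4 nearRgba  = nearLayerTex.SampleLevel(linearSampler, uv, 0); // inline up-sample happens here
  float4 farRgba   = farLayerTex.SampleLevel(linearSampler, uv, 0);
  // Where the full-res "soft" depth sits between the near and far reductions
  // chooses a value between the two low resolution layers.
  float t = saturate((softDepth - nearFar.x) / max(nearFar.y - nearFar.x, 1e-6));
  float4 layer = lerp(nearRgba, farRgba, t);
  return layer.rgb + backBuffer * (1.0 - layer.a); // premultiplied-alpha blend over the back-buffer
}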

Hard Transparency
Meaning what to do about windows, glasses, and other things which require full-resolution hard intersections with other opaque geometry. Any working solution here also needs an anti-aliased mask for the post temporal AA composite. There are no great solutions to my knowledge with the traditional raster based rendering pipelines with viewport jitter. One option is to work around the problem on the art side: make glass surfaces mostly opaque and render them with matching viewport jitter over the lit g-buffer, also correcting so reprojection or motion vectors pick up the new glass surface instead of what is behind it. So glass goes down the temporal AA path.

Another option might be to use the "soft" depth buffer technique but at full resolution. Probably need to build a full resolution "far" erosion depth buffer (take the far depth of the local neighborhood), then depth test against that. Note a depth buffer generated by a shader will have an associated perf cost when tested against. Then when rendering transparency, one can blend directly over the temporal AA output in the back-buffer. In the shader, fetch the pre-generated "near" and "far" reductions, and soft-blend the hard triangle with both. Then take those two results, look up the "soft" depth from the full resolution buffer, and use it as a guide to lerp between the "near" and "far" results. This will enable a "soft" anti-aliased edge, in theory, assuming all the details that matter are correct...

Note on "Soft" Depth
The "soft" depth probably requires that temporal AA not be applied to linear depth, but perhaps some non-linear function of depth. I don't remember any more which transform works the best, but guessing if you take this transformed depth and output it to a color channel and see a clear depth version of the scene with anti-aliased edges, from nearest to far objects, that is a good sign.