The Source of the Strange "Win7" Color Distortion?

EDIT: Root caused. Two problems: (1.) The Dell monitors at work have some problems in "Game" mode. They work fine in "Standard" mode. I'm guessing "Standard" uses some (latency adding?) logic to correct for some color distortion of the panel, and this logic gets turned off in "Game" mode. (2.) Displays tested at work are slow IPS panels with larger gamut than the fast displays at home. The hue-to-warm-to-white transition in my algorithm is too punchy in the yellows for large gamut displays. Algorithm needs some more tuning there.

I have a shadertoy program which presents photo editing controls and a different way to handle over-exposure. Fully saturated hues blend towards white instead of clamping at a hue, and all colors in over-exposure take a path towards white which follows a warm hue shift. So red won't go to pink then white (which looks rather unnatural), but rather red to orange to yellow to white. This test program also adds 50% saturation to really stress the hard exposure cases.
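
As a rough illustration of the idea (this is not the shadertoy code, which is not shown here), one way to get a warm path to white is to take the energy a plain clamp would throw away and spill it into all channels, with weights biased towards red and green. The function name and the spill weights below are made up for this sketch.

```python
def overexpose_to_white(rgb, spill_weights=(0.50, 0.35, 0.15)):
    """Map a linear RGB triple (channels may exceed 1.0) into [0, 1].

    Hypothetical sketch: the weights are illustrative, not tuned values.
    """
    # Energy above 1.0 that a plain clamp would simply discard.
    excess = sum(max(c - 1.0, 0.0) for c in rgb)
    # Spill that energy into every channel. Blue gets the smallest share,
    # so it fills in last and the path to white passes through warm hues.
    return tuple(min(c + excess * w, 1.0) for c, w in zip(rgb, spill_weights))


# Pure red at increasing exposure: red -> orange -> yellow -> white.
for exposure in (1.0, 2.0, 4.0, 8.0):
    print(exposure, overexpose_to_white((exposure, 0.0, 0.0)))
```

Colors that are not over-exposed pass through unchanged, since the excess is zero.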

The result looks awesome on my wife's MacPowerBook, and on my home Linux laptop using either a CRT or laptop LCD. However at work on a Win7 box with two Dell monitors and on a co-worker's personal Win7 laptop, the result looks like garbage. Specifically some hues on their route towards white have a non-contiguous gradient, which looks like some color management mapping operation failed.

I tried lots of different things to adjust color management in Win7 to fix it, but nothing worked. Was certain this had to be a problem in Win7 because two different Win7 machines with completely different displays both had the same problem. Then another co-worker running Win8 with a set of two different Dell monitors (also color calibrated) got a different result: one display had the same problem, the other looked good.

So not a Windows problem, but rather some new problem I had never seen before on any display I personally have owned: LCD displays with really bad distortion on saturated hues. And apparently it is common enough such that 4 of 5 different displays in the office had the problem. 4 of these displays were color calibrated via the GPU's LUT per color channel, but calibrated to maintain maximum brightness. Resetting that LUT made no difference. However changing some of the color temp settings on the display itself did reduce the distortion on one monitor (only tested one monitor).

Maybe the source of the problem is that the display manufacturers decided that the "brightness" and "contrast" numbers were more important, so they overdrive the display by default to the point where they have bad distortion? Changing the color temp would reduce the maximum output of one or two channels. Not at work right now, so not able to continue to test the theory, but guessing the solution is to re-calibrate the display at a lower brightness.


Driving NTSC TV from GTX 880M

Got the Crescendo Systems: TC1600 VGA to YPrPb Transcoder in this week, and combined with the HDFury Nano, I was able to generate a 720x240 progressive NTSC component signal and drive a 2003 TV from the HDMI out of my GTX 880M based laptop (using a custom Modeline in X).

Still have some challenges to iron out. The only VGA connector I had was a little thick, so had to open up the TC1600 box and add more clearance to the VGA connector cutout (no problem). First attempt would not maintain sync, ended up just trying the second jumper setting, and everything worked enough to get a signal. EDIT: tried manual tuning, not able to resolve a problem where image brightness (input signal) affects h-sync (bright lines have different h-offset?) and what looks like a few random red or blue horizontal streaks in dark but not black regions. Otherwise the signal works well enough to try a bunch of things, but I do not have the tools to track down and resolve the remaining problems.

Impressions vs Memory
While I'm able to also run an interlaced signal with double the vertical resolution, this is effectively useless due to visual artifacts (as expected). However the NTSC TV looks better at 60Hz interlaced than the VGA CRT at 180Hz interlaced. The combined higher persistence and more diffuse beam makes a big difference here. The progressive 60Hz output on the higher persistence TV does visibly flicker a lot more than I remember thanks to viewing the TV a foot away with a white background. Old TVs managed to get away with this probably because content was almost never white, and the TV was far enough away to not trigger the peripheral vision as much.

Believe the crappy TV I have is actually low-pass filtering the component chroma input to match s-video or even horrible composite. Not sure, but single red pixels look very bad. Have a bit of work here to get the vintage arcade feel (probably need a better TV).

Improved Sampling for Gradient-Domain Metropolis Light Transport


Maxwell 2 Extensions

Cyril Crassin: Maxwell GM204 OpenGL extensions

EXT_post_depth_coverage + NV_framebuffer_mixed_samples
Going to enable really fast good transparency blending with MSAA. Render opaque with standard MSAA. Then attach a no-MSAA color buffer to use with the MSAA depth buffer to composite transparent content. When blending, use "post depth coverage" to get the percentage of the pixel which passes the depth test, which is then used to reduce transparency. Render via front-first alpha blending (where at the end, alpha keeps amount of transparency left). Finally during a custom MSAA resolve, composite the final transparency over the resolved MSAA.
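
A minimal CPU-side sketch of the front-first blend for a single pixel, assuming layers arrive sorted nearest first and that the post-depth coverage fraction simply scales each fragment's alpha (my reading of the text above, not something taken from the extension specs). The function and parameter names are made up for illustration.

```python
def composite_front_first(layers, opaque_rgb):
    """Blend transparent layers nearest first, then composite over opaque.

    layers: iterable of (rgb, alpha, coverage), sorted nearest first, where
            coverage is the fraction of the pixel passing the depth test.
    opaque_rgb: the already resolved MSAA color for this pixel.
    """
    accum = [0.0, 0.0, 0.0]   # color accumulated from transparent layers
    remaining = 1.0           # amount of transparency left
    for rgb, alpha, coverage in layers:
        a = alpha * coverage  # occluded samples reduce the contribution
        for i in range(3):
            accum[i] += remaining * a * rgb[i]
        remaining *= 1.0 - a
    # Custom resolve step: whatever transparency is left shows the opaque color.
    return tuple(accum[i] + remaining * opaque_rgb[i] for i in range(3))


# Half-transparent red, fully visible, over opaque blue.
print(composite_front_first([((1.0, 0.0, 0.0), 0.5, 1.0)], (0.0, 0.0, 1.0)))
```

Keeping the remaining transparency in alpha is what lets the final composite over the resolved MSAA happen once at the end.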


Using GDB Without Source and With Runtime Changing of Code

Getting disassembly of code generated at runtime to verify dynamic code generation can be a slow process. On Linux, I just use gdb. Using gdb requires setting the breakpoint for run-time generated code after that code is generated by the application (if there is an easier way, I don't know it). Here is a quick example,

Find the entry point of the ELF: readelf -a filename
Look for "Entry point address:" and write down address
Start gdb: gdb filename
Insert breakpoint, this example using 0x2c3 as entry point: break *0x2c3
Start program which will break at above entry point: run
Show disassembly with raw bytes example: disassemble /r 0x2c3,0x300
The "=>" shows the instruction pointer
Set another breakpoint to something after run-time generated code
Use the "continue" command to continue to that breakpoint
Show disassembly of the run-time generated code
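
For the first step, the entry point can also be read straight out of the ELF header without readelf. The sketch below relies only on the standard ELF header layout (e_entry sits at byte offset 24 in both 32-bit and 64-bit files); the function name is made up for illustration.

```python
import struct


def elf_entry_point(header: bytes) -> int:
    """Return e_entry from the first bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big
    fmt = endian + ("Q" if is_64bit else "I")
    # e_ident (16) + e_type (2) + e_machine (2) + e_version (4) = offset 24
    (entry,) = struct.unpack_from(fmt, header, 24)
    return entry


# Usage, with "filename" standing in for the real binary:
#   with open("filename", "rb") as f:
#       print(hex(elf_entry_point(f.read(64))))
```

The printed address is what goes into the gdb "break *ADDRESS" command.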


Interlacing at high frame rates and low persistence?

After hearing Carmack's talk at the Oculus keynote, wanted to see first hand if interlacing artifacts are still visible at high rates on low persistence displays. Tried 180Hz interlaced on my CRT (a pair of even-line and odd-line fields 90 times per second). Still see massive artifacts. For instance when on-screen vertical motion is some multiple of scan-out, it is possible to see the black gaps between lines. The visible size of the gap is a function of the velocity of motion. Horizontal motion also does not look good. Probably has a lot to do with resolution (running 640x480 on a CRT which can do 1600x1200). Coarse shadow mask interlaced NTSC TVs did not look as bad at low frame rates, probably because the even and odd lines would partly illuminate the same bits of the screen (or maybe they had higher persistence phosphors)? If I use the display's Vertical Moire control to align the even and odd scan-lines to the same position in space, then the output looks much better even with constant black gaps between the lines. To me this whole experiment suggests that some kind of motion adaptive de-interlacing logic is needed at high frame rates (best case) or at a minimum the OLED display controller needs to fill in the missing lines in any interlaced mode using some spatial filter.
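
To make "fill in the missing lines using some spatial filter" concrete, here is about the simplest filter possible: each missing line becomes the average of the lines above and below it. This is only a sketch of the minimum case, not a claim about what any display controller does; the function name and the row-of-floats image layout are assumptions.

```python
def fill_missing_lines(field, parity, height):
    """Expand one interlaced field into a full frame.

    field:  rows (lists of floats) for lines parity, parity + 2, ...
    parity: 0 if the field holds the even lines, 1 for the odd lines.
    height: number of lines in the full frame.
    """
    frame = [None] * height
    for i, row in enumerate(field):
        frame[parity + 2 * i] = list(row)
    for y in range(height):
        if y % 2 == parity:
            continue  # this line came from the field
        above = frame[y - 1] if y > 0 else None
        below = frame[y + 1] if y + 1 < height else None
        if above is not None and below is not None:
            # Interior line: average the two neighbouring field lines.
            frame[y] = [(a + b) / 2.0 for a, b in zip(above, below)]
        else:
            # Top or bottom edge: copy the only neighbour available.
            frame[y] = list(above if above is not None else below)
    return frame


print(fill_missing_lines([[0.0, 0.0], [2.0, 2.0]], parity=0, height=4))
```

A motion adaptive version would switch between this and weaving in the previous field depending on how much the content moved.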


GTX 980 from 680

Looking again at the Wikipedia specs,

GTX 980 vs GTX 680: Perf/Area
GTX 980: 11.6 Mflop/ms/mm2, 0.56 MB/ms/mm2, 0.36 Mtex/ms/mm2, 0.18 Mpix/ms/mm2
GTX 680: 10.5 Mflop/ms/mm2, 0.65 MB/ms/mm2, 0.44 Mtex/ms/mm2, 0.11 Mpix/ms/mm2
Performance per total chip area goes down a little for bandwidth and texture, but up for ALU and ROP.

GTX 980 vs GTX 680: Perf/TDP
GTX 980: 28.0 Mflop/ms/W, 1.36 MB/ms/W, 0.87 Mtex/ms/W, 0.44 Mpix/ms/W
GTX 680: 15.8 Mflop/ms/W, 0.99 MB/ms/W, 0.66 Mtex/ms/W, 0.17 Mpix/ms/W
Performance per total chip power goes way up for ALU and ROP, a little less for bandwidth and even less for texture.
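
The numbers above are just the peak rates divided by die area and by TDP. The sketch below reproduces them from spec values that are my assumption of the Wikipedia figures used (they are not listed in this post, but they do reproduce every number above to the printed precision).

```python
# Assumed specs: (Mflop/ms, MB/ms, Mtex/ms, Mpix/ms, die area in mm2, TDP in W).
# Gflop/s is numerically the same as Mflop/ms, likewise for the other rates.
SPECS = {
    "GTX 980": (4612.0, 224.0, 144.0, 72.0, 398.0, 165.0),
    "GTX 680": (3090.0, 192.2, 128.8, 32.2, 294.0, 195.0),
}
AREA, TDP = 4, 5  # indices of the two divisors in each spec tuple


def per_unit(name, divisor_index):
    """Divide the four peak rates of a GPU by its die area or its TDP."""
    spec = SPECS[name]
    return tuple(rate / spec[divisor_index] for rate in spec[:4])


for name in SPECS:
    area = per_unit(name, AREA)
    power = per_unit(name, TDP)
    print(name, "per mm2: %.1f %.2f %.2f %.2f" % area)
    print(name, "per W:   %.1f %.2f %.2f %.2f" % power)
```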


GTX 980/970

NVIDIA Whitepaper - Wikipedia Specs - Techreport Review

For framebuffer compression, looks like for 32-bit/pixel a block of 16x16 fragments gets compressed in 3 steps. Implies that samples for 8xMSAA are stored in 2x4 or 4x2 fragment blocks (depending on if the image or text description is correct), and samples for 4xMSAA stored in 2x2 fragment blocks. The three steps: (1.) 8:1 compression if the fragment values are all uniform for 2x4? or 4x2? blocks, (2.) 4:1 compression if fragment values are all uniform for 2x2 blocks, or lastly (3.) 2:1 compression if delta compression works out (stores deltas between fragments in fewer bits). Looks like Maxwell adds some more options for delta compression. Sounds like 8xMSAA shaded with a forced 2 samples per pixel would nicely fall back to the 4:1 compression case for blocks in triangle interiors. This compression ability is one of the major reasons MSAA is ultra fast on NVIDIA hardware.

Maxwell in the 980/970 has viewport multicast from VS without the need for GS, conservative raster, programmable sample positions for a repeating block of samples, and ability to stall the pixel shader to implement in-raster-order shader processing for a given pixel (could use for programmable blending for example).


Forward vs Deferred From the Perspective of GPU Limits

EDIT: "Mb/ms" in this post means megabytes (not megabits) per millisecond.

Starting with specs from my notebook's GPU,
2930 Mflop/ms : 160 Mb/ms : 122 Mtex/ms : 30 Mpix/ms

Normalized for one pixel written to one render target,
97 flop/pix : 5.3 byte/pix : 4 tex/pix : 1 pix

Now looking at GPU capacity during the time for writing to 5 32-bit targets in a G-buffer,
488 flop/pix (244 op/pix) : 26 byte/pix : 20 tex/pix : 5 pix

G-buffer export eats 4*5=20 bytes of bandwidth, leaving 6 (not counting Z and/or stencil, and assuming color is fully uncompressed). This hints at why it is important to source from compressed textures when filling the G-buffer. Also important to note that G-buffer fill suffers from the same exact invocation occupancy problem (quad packing) for small triangles as clustered forward shading.

Next, GPU capacity during the time for reading the G-buffer back for lighting (again going to approximate and skip bandwidth used for fetching Z),
366 flop/pix (183 op/pix) : 20 byte/pix : 15 tex/pix

Adding G-buffer fill and G-buffer readback for lighting, presents a quick estimation of the floor of overhead for deferred shading in this example,
854 flop/pix (427 op/pix) : 46 byte/pix : 35 tex/pix
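
The arithmetic behind these figures, as a sketch. The one step not spelled out above is how long the readback takes: the numbers match if the readback is taken to be bandwidth bound, so that reading 20 bytes at 5.3 byte/pix costs 3.75 pixel-times. That is my reading of the figures, and the variable names are made up. The post's values are these results rounded down.

```python
# Peak rates per millisecond for the notebook GPU quoted above.
FLOP, BYTE, TEX, PIX = 2930.0, 160.0, 122.0, 30.0

TARGETS = 5           # render targets in the G-buffer
BYTES_PER_TARGET = 4  # 32 bits each


def capacity(pixel_times):
    """GPU capacity (flop, byte, tex) available during N pixel-write times."""
    return tuple(rate / PIX * pixel_times for rate in (FLOP, BYTE, TEX))


# Fill: assumed ROP bound, one pixel-time per target written.
fill = capacity(TARGETS)

# Readback: assumed bandwidth bound, time to pull the G-buffer back in.
gbuffer_bytes = TARGETS * BYTES_PER_TARGET
read = capacity(gbuffer_bytes / (BYTE / PIX))

total = tuple(f + r for f, r in zip(fill, read))

for label, values in (("fill", fill), ("read", read), ("total", total)):
    print("%-5s %.1f flop/pix : %.1f byte/pix : %.1f tex/pix" % ((label,) + values))
```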

Massive amount of GPU capacity in the shadow of G-buffer overhead. I believe this is one of the primary reasons some developers really like clustered forward shading (or other modern variants): ability to choose simple shaders and cut minimum shader cost by a factor of 4 or so.

Defining Feature of the PC Platform?
A core of the PC as a platform is ultra high resolution, high framerate, super-sampling, or all of the above at the same time. Forward shading had a lot to do with building this legacy. Low overhead shading with easy SGSSAA driver override. Ability to play games with great pixel quality. This is often what I personally miss most with the current trend of high pixel overhead games.


Samsung Gear VR Innovator Edition

Samsung Gear VR highlights,

"Custom calibrated sensors talk to a dedicated kernel driver
Enabling real time scheduled multithreaded application processes at guaranteed clock rates
Context prioritized GPU rendering, enabling asynchronous time warp
Facilitating completely unbuffered display surfaces for minimal latency
Supporting low-persistence display mode for improved comfort, visual stability, and reduced motion blur / judder"

No position tracking like the DK2, 2560x1440 screen, and a 600MHz Adreno 420 GPU.

AnandTech talks about the capacity of 25.6 GB/s on Snapdragon 805. Adreno 420 GPU benchmarks for a different product, Pantech IM-A930 Vega show roughly 6 Mtex/ms (using the on-screen benchmark) or roughly 1.6 texture fetches per pixel per millisecond at the native resolution.
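
The per-pixel figure is just the benchmark rate divided by the pixel count of the 2560x1440 screen. A quick check, with a made-up function name:

```python
def fetches_per_pixel_per_ms(mtex_per_ms, width, height):
    """Texture fetches available per screen pixel in each millisecond."""
    return mtex_per_ms * 1.0e6 / (width * height)


# Roughly 6 Mtex/ms spread over a 2560x1440 screen.
print("%.2f" % fetches_per_pixel_per_ms(6.0, 2560, 1440))
```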

For a platform without position tracking: async timewarp paired with the higher resolution screen, high memory bandwidth, and a GPU which is a deferred vertex shading tiler sounds like a really smart setup for mobile VR. The deferred vertex shading tiler reduces the bandwidth taken by vertex outputs (pushing geometry is important for VR), and the tiling for MSAA reduces the cost of high quality anti-aliasing. Having an Oculus store removes the problem of applications fighting for space in an infinite sea of non-VR Android application listings.


HDFury Nano GX: Part 2

Continuing to talk about the radical HDFury Nano GX HDMI to VGA converter...

Got a better VGA monitor (ViewSonic G75f) which can do up to an 85 KHz horizontal scan, which is good for 960x540 @ 140 Hz. This works perfectly through the Nano. Getting a 16:9 aspect ratio on display is as easy as changing the monitor's vertical scaling.

Method for Generating Modelines
Using the EDID, or fetching the modes from the monitor automatically, is effectively useless for anything interesting (non-standard). Not sure if this has more to do with the Monitor's EDID settings, what the Nano returns, or what the driver does with the EDID information. Instead I go back to classic "modelines" in the xorg configuration files.

In order to easily generate "modelines", first take the monitor's advertised peak specs, in my case, 1600x1200 @ 68Hz, run them through "cvt" and write down the "hsync". Use this as the maximum hsync the monitor can handle,

[/etc/X11/xorg.conf.d] cvt 1600 1200 68
# 1600x1200 67.96 Hz (CVT) hsync: 84.95 kHz; pclk: 183.50 MHz
Modeline "1600x1200_68.00" 183.50 1600 1712 1880 2160 1200 1203 1207 1250 -hsync +vsync

Then use "cvt" to generate whatever modes are desired at a given resolution, making sure to limit the vertical refresh low enough such that the "hsync" value is never exceeded. At 640x480 this monitor can do 160 Hz,

[/etc/X11/xorg.conf.d] cvt 640 480 160
# 640x480 159.42 Hz (CVT) hsync: 84.49 kHz; pclk: 73.00 MHz
Modeline "640x480_160.00" 73.00 640 688 752 864 480 483 487 530 -hsync +vsync
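
The "hsync" and refresh numbers that cvt prints can be recomputed from any modeline, which is handy for checking hand-written ones: horizontal frequency is the pixel clock divided by the total pixels per line, and vertical refresh is that divided by the total number of lines. A small sketch (the function name is made up, and flags such as interlace or doublescan are ignored):

```python
import shlex


def modeline_rates(modeline):
    """Return (hsync in kHz, vertical refresh in Hz) for an X modeline."""
    # Modeline "name" pclk hdisp hsyncstart hsyncend htotal
    #                      vdisp vsyncstart vsyncend vtotal [flags...]
    fields = shlex.split(modeline)
    pclk_mhz = float(fields[2])
    htotal = int(fields[6])
    vtotal = int(fields[10])
    hsync_khz = pclk_mhz * 1000.0 / htotal
    vrefresh_hz = hsync_khz * 1000.0 / vtotal
    return hsync_khz, vrefresh_hz


for line in (
    'Modeline "1600x1200_68.00" 183.50 1600 1712 1880 2160 1200 1203 1207 1250 -hsync +vsync',
    'Modeline "640x480_160.00" 73.00 640 688 752 864 480 483 487 530 -hsync +vsync',
):
    print("%.2f kHz, %.2f Hz" % modeline_rates(line))
```

Any mode whose computed hsync stays under the 84.95 kHz from the peak mode should be within what the monitor can scan.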

X.org Configuration
Thankfully with NVIDIA drivers, it is possible to completely turn off mode validation. This is what enables the interesting non-standard modes, and also enables users to destroy old VGA monitors which have no out-of-range protection. In the "Device" section add (this is all one line),

Option "ModeValidation" "AllowNon60HzDFPModes, NoEdidDFPMaxSizeCheck, NoVertRefreshCheck, NoHorizSyncCheck, NoDFPNativeResolutionCheck, NoMaxSizeCheck, NoMaxPClkCheck, AllowNonEdidModes, NoEdidMaxPClkCheck"

Using this, I was able to try something similar to a 15.5 KHz arcade monitor modeline. This is actually for an Amiga I think. Also note "cvt" won't make correct 15.5 KHz modelines, horizontal frequency will be wrong.

Modeline "640x240" 13.22 640 672 736 832 240 243 246 265 -hsync -vsync

Which is out of range of the ViewSonic (too low of a horizontal sync) and triggers the monitor's out-of-range protection. The same mode seems to work on the really old VGA monitor I have (which probably supported CGA, EGA, and VGA horizontal frequencies). This seems to validate that newer NVIDIA GPUs and drivers (in X at least) and the Nano, can be made to generate the lower hsync signals needed to drive CGA arcade monitors. I'm still missing the conversion box required to test this (see the bottom of this post).

Solving the "Broken Panning"
The problem I was having before is the X.org+driver automatically configuring and using both the laptop and HDMI out displays whenever I switch from "Metamode" to classic "Modeline". Since the "Modeline" was not supported by the laptop display, the display would show nothing but still be enabled, and that is why it looked as if the panning was broken. Everything was actually working, the mouse was just going over to the other display which was a black screen.

Fixing this is easy, just disable the laptop screen using the "Monitor-" option (use the "xrandr" command line to list the display connections to find the proper name),

Section "Monitor"
    Identifier "nope"
    Option "Ignore" "true"
    Option "Enable" "false"
EndSection

Section "Monitor"
    Identifier "Monitor0"
    VendorName "Unknown"
    ModelName "Unknown"
    Option "Primary" "true"
    Option "Enable" "true"
    HorizSync 31-86
    VertRefresh 60-162
    UseModes "Modes0"
    Option "DPMS"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device "Device0"
    Option "Monitor-DP-2" "nope"
    Option "Monitor-HDMI-0" "Monitor0"
    Monitor "Monitor0"
EndSection

Next step is to attempt to see if I can drive both screens at different frequency and resolution. Then I can use one for programming and the other for analog output.

240p Arcade Options
Great place to go for information: scanlines.hazard-city.de. Under the assumption that my setup can generate a proper 240p VGA signal at the standard 15.5KHz used by arcade monitors (and NTSC TVs), this thread post talks about a converter which will convert that VGA signal into component for a US TV: TC1600 VGA to YPrPb Transcoder (no resampling/scaling).


HDFury Nano GX: HDMI to VGA

Got my HDFury Nano GX, now have the ability to run my little CRT from the HDMI out of the laptop with the 120Hz screen and a GTX 880M...

The Nano is simply awesome. Can now also run the PS4 on a VGA CRT with this device, way better than the PlasmaHDTV I've been using. When the device needs to decrypt HDMI signals, it needs the cord for USB power. My little CRT can do 720p60Hz, and the tiny amount of super-sampling from the PS4's downscaler in combination with the CRT scanline bloom, low latency, and low persistence creates an awesome visual.

Running from the GTX 880M with NVIDIA drivers in Linux worked right out of the box at 720p60Hz also on the little CRT. I ran with 720p GPU-downsampled from 1080p to compare apples to apples. Yes 60Hz flickers with a white screen, but with the typically low intensity content I run, I don't really notice the flicker. Comparing the 120Hz LCD and the CRT at 60Hz is quite interesting. The CRT definitely looks better motion wise. The 120Hz LCD has no strobe backlight, so it has 4-8x higher persistence than the CRT at 60Hz. Very fast motion is easy to track visually on the CRT. When the eye tracking fails, it still does not look as bad as the 120Hz LCD. The 120Hz LCD is harder to track in fast motion without seeing what looks similar to full-shutter 4-tap motion blur at 30Hz. It is still visually obvious that the frame sits for a bit in time.

In terms of responsiveness, the 120Hz LCD is direct driven by the 880M. The loop from reprogramming the GPU LUT to reading the new value on a color calibration sensor takes just 8 ms. My test application also minimizes input latency by reading controller input directly from CPU memory right before view dependent rendering. Even with that, the 120Hz LCD definitely feels more responsive. The 8ms difference between CRT at 60Hz and LCD at 120Hz seems to make an important difference.

Thoughts on Motion Blur
Motion blur on the CRT at 60Hz in my mind is completely not necessary. Motion blur on the 120Hz LCD (or even 60Hz LCD) is something I would not waste perf on any more. However it does seem as if the entire point of motion blur for "scan-and-hold" displays like LCDs is to simply reduce the confusion that the human visual system is subjected to. Specifically just to break the hard edges of objects in motion, as to reduce the blur confusion the mind is getting from the full persistence "hold". Seems like if motion blur is used at 60Hz and above on an LCD, it is much better to just limit it to a very short size with no banding.

Nano and NVIDIA Drivers in X
Had to manually adjust the xorg conf to get anything outside 720p60Hz to work. I disabled the driver from using the EDID data, went back to classic horz and vert ranges. Noticed a bunch of issues with the NVIDIA Drivers and/or hardware,

(a.) Down-sampling at scan-out has some limits, for example going from 1920x1080 to 640x480 won't work. However scaling only height or width with virtual panning in the unscaled direction does work. This implies that either the driver has a bug, or more likely that the hardware does not have enough on-chip line buffer for the scaler to do such high reductions.

(b.) Up-sampling at scan-out from NVIDIA GPUs is completely useless because they all introduce ringing (hopefully some day they will fix that, I haven't tried Maxwell GPUs yet).

(c.) Metamodes won't do under 31KHz horizontal frequency modes. Instead it forces the dead ugly "doublescan".

(d.) Skipping metamodes and falling back to modelines has a bug where small resolutions automatically extend out the virtual panning resolution in width only, but have broken panning. I still have not found a workaround for this. The modelines even under 31KHz seem to work however (must use "ModeValidation" options to turn off the safety checks).

The Nano does work with frequencies outside 60Hz: managed to get 85Hz going at 640x480. Seems as if the Nano also supports low resolutions and arcade horizontal frequencies (a modeline with 320x240 around 60Hz worked). Unfortunately I'm somewhat limited in testing by the limits of the CRT. It also seemed as if I could get the 880M to actually use an arcade horizontal frequency. But I don't have a way to validate this yet. Won't be sure until I eventually grab another converter (VGA to Component supporting 240p) and try driving an NTSC TV with 240p like old consoles did.