AMD64 Assembly Porting Between Linux and Windows

One of the unfortunate differences between Linux and Windows for AMD64 assembly is that the two platforms use completely different ABI calling conventions, even when calling common system libraries like OpenGL. However it is possible to bridge the gap. Linux passes the first six integer arguments in registers, Windows only the first four; arguments 0 through 3 are in registers on both platforms, just in different registers (easy macro workaround),

Linux__: rdi rsi rdx rcx r8 r9
Windows: rcx rdx r8 r9

The solution for portability is to target Windows as if it had 6 argument registers, since both rdi and rsi are callee-saved under the Windows ABI,

Windows: rcx rdx r8 r9 {rdi rsi}

But prior to a C library call with more than 4 integer arguments, push rsi and rdi onto the stack, then subtract 32 bytes from the stack pointer for the Windows "register parameter area" (the shadow space).

Finally don't use the "red zone" from the Linux ABI, and also don't rely on the "register parameter area" from the Windows ABI.
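As an illustration of the mapping (just a sketch in Python, not real assembler; the names and helper are mine), the unified six-register scheme can be written as a lookup from argument index to register:

```python
# Integer-argument registers per ABI. The "unified" scheme targets
# Windows as if it had six argument registers, reusing the
# callee-saved rdi/rsi for arguments 4 and 5 (which must be moved to
# the stack before a real Windows C library call).
SYSV_ARGS   = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
WIN64_ARGS  = ["rcx", "rdx", "r8", "r9"]
UNIFIED_WIN = ["rcx", "rdx", "r8", "r9", "rdi", "rsi"]

def arg_register(abi: str, index: int) -> str:
    """Register holding integer argument `index` (0-based), or 'stack'."""
    table = {"sysv": SYSV_ARGS, "win64": WIN64_ARGS, "unified": UNIFIED_WIN}[abi]
    return table[index] if index < len(table) else "stack"
```

Note how argument 4 is "stack" under the native Windows convention but rdi under the unified scheme; pushing rsi then rdi before the `sub rsp, 32` places them exactly where the Windows ABI expects stack arguments 4 and 5.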


Random Notes on Maxwell

Notes From GTX 980 Whitepaper | Maxwell Tuning Guide
GTX 980
16 geometry pipes
... One per SM
16 SMs
... 96KB shared memory per SM
... ?KB instruction cache per SM
... SMs divided into 4 quadrants
... Pair of quadrants sharing a TEX unit
... Each Quadrant
....... Issue 2 ops/clk per warp to different functional units
....... Supports up to 16 warps
....... Supports up to 8 workgroups
4 Memory Controllers
... 512KB per MC (2MB total)
... 16 ROPs per MC (64 total)

The only safe way to get L1-cached reads for read/write images is to run warp-sized workgroups and work on global memory not shared with other workgroups. Hopefully an application can express this by typecasting to a read-only image before a read.

Using just shared memory, this GPU can run 64 parallel instances of a 24KB (data) computer without going out to L2.

(EDIT from Christophe's comments) This GPU has an insane untapped capacity for geometry: 16 pipes * more than 1GHz * 0.333 = maybe 5.3 million triangles per millisecond. Or enough for 2 single-pixel triangles per 1080p screen pixel per millisecond...
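The arithmetic in the two notes above can be sanity-checked (numbers taken from the whitepaper figures quoted here; the 1GHz clock is the "more than 1GHz" lower bound):

```python
# Shared memory: 16 SMs x 96KB equals 64 instances x 24KB of data.
sms, shared_kb = 16, 96
instances, data_kb = 64, 24
assert sms * shared_kb == instances * data_kb  # 1536 KB either way

# Geometry: 16 pipes x ~1GHz x 1/3 triangle per clock per pipe.
pipes, clock_hz, tri_per_clk = 16, 1.0e9, 1.0 / 3.0
tris_per_ms = pipes * clock_hz * tri_per_clk / 1000.0  # ~5.33 million/ms

# Versus 2 single-pixel triangles per 1080p pixel per millisecond.
needed_per_ms = 1920 * 1080 * 2  # 4,147,200
assert needed_per_ms < tris_per_ms
```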


The Source of the Strange "Win7" Color Distortion?

EDIT: Root caused. Two problems: (1.) The Dell monitors at work have some problems in "Game" mode. They work fine in "Standard" mode. I'm guessing "Standard" uses some (latency adding?) logic to correct for some color distortion of the panel, and this logic gets turned off in "Game" mode. (2.) Displays tested at work are slow IPS panels with larger gamut than the fast displays at home. The hue-to-warm-to-white transition in my algorithm is too punchy in the yellows for large gamut displays. Algorithm needs some more tuning there.

I have a shadertoy program which presents photo editing controls and a different way to handle over-exposure. Fully saturated hues blend towards white instead of clamping at a hue, and all colors in over-exposure take a path towards white which follows a warm hue shift. So red won't go to pink then white (which looks rather unnatural), but rather red to orange to yellow to white. This test program also adds 50% saturation to really stress the hard exposure cases.
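This is not the actual shadertoy code, but the general idea of the warm path to white can be sketched as a toy scalar function (the function name, the clamp at 4.0, and the piecewise blend are my own invented placeholders): over-exposed red should pass through yellow on its way to white, rather than clamping at red.

```python
def warm_to_white(rgb, exposure):
    """Toy sketch: over-exposure blends toward white along a warm path."""
    r, g, b = (min(c * exposure, 4.0) for c in rgb)
    m = max(r, g, b, 1.0)
    t = 1.0 - 1.0 / m            # 0 when in range, -> 1 as exposure grows
    # Green catches up first (red -> orange -> yellow), then blue
    # (yellow -> white), giving the warm hue shift instead of red -> pink.
    tg = min(t * 2.0, 1.0)
    tb = max(t * 2.0 - 1.0, 0.0)
    r_o = min(r, 1.0)
    g_o = min(g, 1.0) + tg * (1.0 - min(g, 1.0))
    b_o = min(b, 1.0) + tb * (1.0 - min(b, 1.0))
    return (r_o, g_o, b_o)
```

For example, pure red at 2x exposure lands on yellow rather than pink, and keeps heading toward white as exposure rises further.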

The result looks awesome on my wife's MacBook Pro, and on my home Linux laptop using either a CRT or the laptop LCD. However at work, on a Win7 box with two Dell monitors and on a co-worker's personal Win7 laptop, the result looks like garbage. Specifically, some hues on their route towards white have a discontinuous gradient, which looks like some color management mapping operation failed.

I tried lots of different things to adjust color management in Win7 to fix it, but was unable to. I was certain this had to be a problem in Win7, because two different Win7 machines with completely different displays had the same problem. Then another co-worker running Win8 with a set of two different Dell monitors (also color calibrated) got a different result: one display had the same problem, the other looked good.

So not a Windows problem, but rather some new problem I had never seen before on any display I personally have owned: LCD displays with really bad distortion on saturated hues. And apparently it is common enough that 4 of 5 different displays in the office had the problem. 4 of these displays were color calibrated via the GPU's per-channel LUT, but calibrated to maintain maximum brightness. Resetting that LUT made no difference. However, changing some of the color temp settings on the display itself did reduce the distortion on one monitor (only tested one monitor).

Maybe the source of the problem is that the display manufacturers decided that the "brightness" and "contrast" numbers were more important, so they overdrive the display by default to the point where it has bad distortion? Changing the color temp would reduce the maximum output of one or two channels. Not at work right now, so not able to continue to test the theory, but guessing the solution is to re-calibrate the display at a lower brightness.


Driving NTSC TV from GTX 880M

Got the Crescendo Systems: TC1600 VGA to YPrPb Transcoder in this week, and combined with the HDFury Nano, I was able to generate a 720x240 progressive NTSC component signal and drive a 2003 TV from the HDMI out of my GTX 880M based laptop (using a custom Modeline in X).

Still have some challenges to iron out. The only VGA connector I had was a little thick, so had to open up the TC1600 box and add more clearance to the VGA connector cutout (no problem). First attempt would not maintain sync; ended up just trying the second jumper setting, and everything worked enough to get a signal. EDIT: tried manual tuning, but was not able to resolve a problem where image brightness (input signal) affects h-sync (bright lines have a different h-offset?) and what looks like a few random red or blue horizontal streaks in dark but not black regions. Otherwise the signal works well enough to try a bunch of things, but I do not have the tools to track down and resolve the remaining problems.

Impressions vs Memory
While I'm able to also run an interlaced signal with double the vertical resolution, this is effectively useless due to visual artifacts (as expected). However, the NTSC TV looks better at 60Hz interlaced than the VGA CRT at 180Hz interlaced. The combination of higher persistence and a more diffuse beam makes a big difference here. The progressive 60Hz output on the higher-persistence TV does visibly flicker a lot more than I remember, thanks to viewing the TV a foot away with a white background. Old TVs managed to get away with this probably because content was almost never white, and the TV was far enough away to not trigger the peripheral vision as much.

Believe the crappy TV I have is actually low-pass filtering the component chroma input to match s-video or even horrible composite. Not sure, but single red pixels look very bad. Have a bit of work here to get the vintage arcade feel (probably need a better TV).

Improved Sampling for Gradient-Domain Metropolis Light Transport



Maxwell 2 Extensions

Cyril Crassin: Maxwell GM204 OpenGL extensions

EXT_post_depth_coverage + NV_framebuffer_mixed_samples
This is going to enable really fast, good-looking transparency blending with MSAA. Render opaque with standard MSAA. Then attach a non-MSAA color buffer to use with the MSAA depth buffer to composite transparent content. When blending, use "post depth coverage" to get the percentage of the pixel which passes the depth test, which is then used to reduce transparency. Render via front-first alpha blending (where at the end, alpha keeps the amount of transparency left). Finally, during a custom MSAA resolve, composite the final transparency over the resolved MSAA.
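A scalar sketch of the front-to-back compositing math (plain Python rather than GLSL; the per-layer coverage value stands in for what post_depth_coverage provides in the shader, and the function name is mine):

```python
def composite_front_to_back(layers, background):
    """layers: list of (color, alpha, coverage) tuples, sorted front-first.
    coverage is the fraction of the pixel passing the MSAA depth test."""
    color = 0.0
    transmittance = 1.0  # amount of transparency left (the "alpha" above)
    for c, a, coverage in layers:
        a = a * coverage                # depth-occluded samples don't blend
        color += transmittance * a * c
        transmittance *= (1.0 - a)
    # Custom resolve step: composite over the resolved opaque background.
    return color + transmittance * background
```

With coverage 0 (fully depth-occluded) a layer contributes nothing; with coverage 1 it blends exactly as ordinary front-to-back alpha blending.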


Using GDB Without Source and With Runtime Changing of Code

Getting a disassembly of code generated at runtime, to verify dynamic code generation, can be a slow process. On Linux, I just use gdb. Using gdb requires setting the breakpoint for run-time generated code after that code is generated by the application (if there is an easier way, I don't know it). Here is a quick example:

Find the entry point of the ELF: readelf -a filename
Look for "Entry point address:" and write down address
Start gdb: gdb filename
Insert breakpoint, this example using 0x2c3 as entry point: break *0x2c3
Start program which will break at above entry point: run
Show disassembly with raw bytes example: disassemble /r 0x2c3,0x300
The "=>" shows the instruction pointer
Set another breakpoint to something after run-time generated code
Use the "continue" command to continue to that breakpoint
Show disassembly of the run-time generated code
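The first step (finding the entry point) can also be done without readelf: in the ELF64 header, e_entry is a little-endian 64-bit value at byte offset 24. A minimal Python sketch (helper name is mine):

```python
import struct

def elf64_entry(data: bytes) -> int:
    """Return e_entry from an ELF64 header (little-endian).
    Layout: 16-byte e_ident, then e_type (2), e_machine (2),
    e_version (4), so e_entry is a uint64 at byte offset 24."""
    assert data[:4] == b"\x7fELF", "not an ELF file"
    return struct.unpack_from("<Q", data, 24)[0]

# Usage: print(hex(elf64_entry(open("filename", "rb").read(64))))
```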


Interlacing at high frame rates and low persistence?

After hearing Carmack's talk at the Oculus keynote, I wanted to see first hand if interlacing artifacts are still visible at high rates on low-persistence displays. Tried 180Hz interlaced on my CRT (a pair of even-line and odd-line frames 90 times per second). Still see massive artifacts. For instance, when on-screen vertical motion is some multiple of scan-out, it is possible to see the black gaps between lines. The visible size of the gap is a function of the velocity of motion. Horizontal motion also does not look good. Probably has a lot to do with resolution (running 640x480 on a CRT which can do 1600x1200).

Coarse shadow mask interlaced NTSC TVs did not look as bad at low frame rates, probably because the even and odd lines would partly illuminate the same bits of the screen (or maybe they had higher-persistence phosphors)? If I use the display's Vertical Moire control to align the even and odd scan-lines to the same position in space, then the output looks much better, even with constant black gaps between the lines.

To me this whole experiment suggests that some kind of motion-adaptive de-interlacing logic is needed at high frame rates (best case), or at a minimum the OLED display controller needs to fill in the missing lines in any interlaced mode using some spatial filter.