Maxwell 2 Extensions

Cyril Crassin: Maxwell GM204 OpenGL extensions

EXT_post_depth_coverage + NV_framebuffer_mixed_samples
This combination is going to enable really fast, good-quality transparency blending with MSAA. Render opaque with standard MSAA. Then attach a no-MSAA color buffer to use with the MSAA depth buffer to composite transparent content. When blending, use "post depth coverage" to get the percentage of the pixel which passes the depth test, which is then used to reduce transparency. Render via front-first alpha blending (where at the end, alpha keeps the amount of transparency left). Finally, during a custom MSAA resolve, composite the final transparency over the resolved MSAA.


Using GDB Without Source and With Runtime Changing of Code

Getting disassembly of code generated at runtime to verify dynamic code generation can be a slow process. On Linux, I just use gdb. Using gdb requires setting the breakpoint for run-time generated code after that code is generated by the application (if there is an easier way, I don't know it). Here is a quick example,

Find the entry point of the ELF: readelf -a filename
Look for "Entry point address:" and write down address
Start gdb: gdb filename
Insert breakpoint, this example using 0x2c3 as entry point: break *0x2c3
Start program which will break at above entry point: run
Show disassembly with raw bytes example: disassemble /r 0x2c3,0x300
The "=>" shows the instruction pointer
Set another breakpoint to something after run-time generated code
Use the "continue" command to continue to that breakpoint
Show disassembly of the run-time generated code
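The first two steps above (finding the entry point) can also be done without readelf. Here is a minimal sketch, assuming a well-formed ELF file; the function names are hypothetical helpers, not part of gdb or readelf. It reads the e_entry field straight from the ELF header and formats the matching gdb breakpoint command. Note that for position-independent executables the entry address is relocated at load time, so this applies directly only to non-PIE binaries.

```python
import struct

def elf_entry_point(header: bytes) -> int:
    """Return e_entry from the first bytes of an ELF file (hypothetical helper)."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"    # EI_DATA: 1 = little, 2 = big endian
    # e_entry follows e_ident (16 bytes), e_type (2), e_machine (2), e_version (4)
    fmt = endian + ("Q" if is_64bit else "I")
    return struct.unpack_from(fmt, header, 24)[0]

def gdb_break_command(path: str) -> str:
    """Build the 'break *ADDRESS' line to type into gdb for this binary."""
    with open(path, "rb") as f:
        header = f.read(32)
    return "break *0x%x" % elf_entry_point(header)
```

Calling gdb_break_command("filename") gives the same address readelf reports under "Entry point address:".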


Interlacing at high frame rates and low persistence?

After hearing Carmack's talk at the Oculus keynote, wanted to see first hand if interlacing artifacts are still visible at high rates on low persistence displays. Tried 180Hz interlaced on my CRT (a pair of even-line and odd-line fields 90 times per second). Still see massive artifacts. For instance when on-screen vertical motion is some multiple of scan-out, it is possible to see the black gaps between lines. The visible size of the gap is a function of the velocity of motion. Horizontal motion also does not look good. Probably has a lot to do with resolution (running 640x480 on a CRT which can do 1600x1200). Coarse shadow mask interlaced NTSC TVs did not look as bad at low frame rates, probably because the even and odd lines would partly illuminate the same bits of the screen (or maybe they had higher persistence phosphors)? If I use the display's Vertical Moire control to align the even and odd scan-lines to the same position in space, then the output looks much better even with constant black gaps between the lines. To me this whole experiment suggests that some kind of motion adaptive de-interlacing logic is needed at high frame rates (best case) or at a minimum the OLED display controller needs to fill in the missing lines in any interlaced mode using some spatial filter.


GTX 980 from 680

Looking again at the Wikipedia specs,

GTX 980 vs GTX 680: Perf/Area
GTX 980: 11.6 Mflop/ms/mm2, 0.56 MB/ms/mm2, 0.36 Mtex/ms/mm2, 0.18 Mpix/ms/mm2
GTX 680: 10.5 Mflop/ms/mm2, 0.65 MB/ms/mm2, 0.44 Mtex/ms/mm2, 0.11 Mpix/ms/mm2
Performance per total chip area goes down a little for bandwidth and texture, but up for ALU and ROP.

GTX 980 vs GTX 680: Perf/TDP
GTX 980: 28.0 Mflop/ms/W, 1.36 MB/ms/W, 0.87 Mtex/ms/W, 0.44 Mpix/ms/W
GTX 680: 15.8 Mflop/ms/W, 0.99 MB/ms/W, 0.66 Mtex/ms/W, 0.17 Mpix/ms/W
Performance per total chip power goes way up for ALU and ROP, a little less for bandwidth and even less for texture.
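To put numbers on "a little" and "way up", here is a small sketch that divides the GTX 980 figures by the GTX 680 figures from the two tables above. The dictionary layout and names are my own; the values are copied from the tables.

```python
# Columns: ALU (Mflop/ms), bandwidth (MB/ms), texture (Mtex/ms), ROP (Mpix/ms)
COLUMNS = ("ALU", "bandwidth", "texture", "ROP")

PER_AREA = {"GTX 980": (11.6, 0.56, 0.36, 0.18),   # per mm2
            "GTX 680": (10.5, 0.65, 0.44, 0.11)}
PER_WATT = {"GTX 980": (28.0, 1.36, 0.87, 0.44),   # per W of TDP
            "GTX 680": (15.8, 0.99, 0.66, 0.17)}

def ratios(table):
    """Return {column: GTX 980 value / GTX 680 value}."""
    new, old = table["GTX 980"], table["GTX 680"]
    return {name: n / o for name, n, o in zip(COLUMNS, new, old)}

for label, table in (("per area", PER_AREA), ("per watt", PER_WATT)):
    line = ", ".join("%s %.2fx" % (k, v) for k, v in ratios(table).items())
    print(label + ": " + line)
```

Values above 1.0 mean the 980 does better on that metric than the 680.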


GTX 980/970

NVIDIA Whitepaper - Wikipedia Specs - Techreport Review

For framebuffer compression, looks like for 32-bit/pixel a block of 16x16 fragments gets compressed in 3 steps. Implies that samples for 8xMSAA are stored in 2x4 or 4x2 fragment blocks (depending on if the image or text description is correct), and samples for 4xMSAA stored in 2x2 fragment blocks. The three steps: (1.) 8:1 compression if the fragment values are all uniform for 2x4? or 4x2? blocks, (2.) 4:1 compression if fragment values are all uniform for 2x2 blocks, or lastly (3.) 2:1 compression if delta compression works out (stores deltas between fragments in fewer bits). Looks like Maxwell adds some more options for delta compression. Sounds like 8xMSAA shaded with a forced 2 samples per pixel would nicely fall back to the 4:1 compression case for blocks in triangle interiors. This compression ability is one of the major reasons MSAA is ultra fast on NVIDIA hardware.

Maxwell in the 980/970 has viewport multicast from VS without the need for GS, conservative raster, programmable sample positions for a repeating block of samples, and ability to stall the pixel shader to implement in-raster-order shader processing for a given pixel (could use for programmable blending for example).


Forward vs Deferred From the Perspective of GPU Limits

EDIT: "Mb/ms" in this post means megabytes (not megabits) per millisecond.

Starting with specs from my notebook's GPU,
2930 Mflop/ms : 160 Mb/ms : 122 Mtex/ms : 30 Mpix/ms

Normalized for one pixel written to one render target,
97 flop/pix : 5.3 byte/pix : 4 tex/pix : 1 pix

Now looking at GPU capacity during the time for writing to 5 32-bit targets in a G-buffer,
488 flop/pix (244 op/pix) : 26 byte/pix : 20 tex/pix : 5 pix

G-buffer export eats 4*5=20 bytes of bandwidth, leaving 6 (not counting Z and/or stencil, and assuming color is fully uncompressed). This hints at why it is important to source from compressed textures when filling the G-buffer. Also important to note that G-buffer fill suffers from the exact same invocation occupancy problem (quad packing) for small triangles as clustered forward shading.

Next, GPU capacity during the time for reading the G-buffer back for lighting (again going to approximate and skip bandwidth used for fetching Z),
366 flop/pix (183 op/pix) : 20 byte/pix : 15 tex/pix

Adding G-buffer fill and G-buffer readback for lighting, presents a quick estimation of the floor of overhead for deferred shading in this example,
854 flop/pix (427 op/pix) : 46 byte/pix : 35 tex/pix

Massive amount of GPU capacity in the shadow of G-buffer overhead. I believe this is one of the primary reasons some developers really like clustered forward shading (or other modern variants): ability to choose simple shaders and cut minimum shader cost by a factor of 4 or so.
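The arithmetic behind the figures above can be sketched as follows. This is my reading of how the numbers were derived: the fill pass is limited by ROP rate (5 pixel writes), the readback pass is limited by bandwidth (time to read 20 bytes), and the post's figures are truncated to integers. The variable names are mine.

```python
# Notebook GPU peak rates per millisecond, from the post
MFLOP, MBYTE, MTEX, MPIX = 2930.0, 160.0, 122.0, 30.0

def capacity(pixel_times):
    """GPU capacity (flop, byte, tex) available in N single-pixel write times."""
    return (MFLOP / MPIX * pixel_times,
            MBYTE / MPIX * pixel_times,
            MTEX / MPIX * pixel_times)

TARGETS = 5            # five 32-bit render targets
BYTES_PER_TARGET = 4

# G-buffer fill: bounded by writing 5 pixels
fill = capacity(TARGETS)                        # ~488 flop, ~26 byte, ~20 tex

# G-buffer readback: bounded by the time to read 20 bytes at peak bandwidth
gbuffer_bytes = TARGETS * BYTES_PER_TARGET      # 20 bytes
readback_time = gbuffer_bytes / (MBYTE / MPIX)  # about 3.75 pixel-times
readback = capacity(readback_time)              # ~366 flop, ~20 byte, ~15 tex

# Floor of deferred shading overhead: fill plus readback
total = tuple(a + b for a, b in zip(fill, readback))  # ~854 flop, ~46 byte, ~35 tex
```

The op/pix figures in the post are half the flop/pix figures (for example 488 / 2 = 244).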

Defining Feature of the PC Platform?
A core of the PC as a platform is ultra high resolution, high framerate, super-sampling, or all of the above at the same time. Forward shading had a lot to do with building this legacy. Low overhead shading with easy SGSSAA driver override. Ability to play games with great pixel quality. This is often what I personally miss most with the current trend of high pixel overhead games.


Samsung Gear VR Innovator Edition

Samsung Gear VR highlights,

"Custom calibrated sensors talk to a dedicated kernel driver
Enabling real time scheduled multithreaded application processes at guaranteed clock rates
Context prioritized GPU rendering, enabling asynchronous time warp
Facilitating completely unbuffered display surfaces for minimal latency
Supporting low-persistence display mode for improved comfort, visual stability, and reduced motion blur / judder"

No position tracking like the DK2, 2560x1440 screen, and a 600MHz Adreno 420 GPU.

AnandTech talks about the capacity of 25.6 GB/s on Snapdragon 805. Adreno 420 GPU benchmarks for a different product, Pantech IM-A930 Vega show roughly 6 Mtex/ms (using the on-screen benchmark) or roughly 1.6 texture fetches per pixel per millisecond at the native resolution.

For a platform without position tracking: async timewarp paired with the higher resolution screen, high memory bandwidth, and a GPU which is a deferred vertex shading tiler sounds like a really smart setup for mobile VR. The deferred vertex shading tiler reduces the bandwidth taken by vertex outputs (pushing geometry is important for VR), and the tiling for MSAA reduces the cost of high quality anti-aliasing. Having an Oculus store removes the problem of applications fighting for space in an infinite sea of non-VR Android application listings.


HDFury Nano GX: Part 2

Continuing to talk about the radical HDFury Nano GX HDMI to VGA converter...

Got a better VGA monitor (ViewSonic G75f) which can do up to an 85 KHz horizontal scan, which is good for 960x540 @ 140 Hz. This works perfectly through the Nano. Getting a 16:9 aspect ratio on the display is as easy as changing the monitor's vertical scaling.

Method for Generating Modelines
Using the EDID, or fetching the modes from the monitor automatically, is effectively useless for anything interesting (non-standard). Not sure if this has more to do with the Monitor's EDID settings, what the Nano returns, or what the driver does with the EDID information. Instead I go back to classic "modelines" in the xorg configuration files.

In order to easily generate "modelines", first take the monitor's advertised peak specs, in my case, 1600x1200 @ 68Hz, run them through "cvt" and write down the "hsync". Use this as the maximum hsync the monitor can handle,

[/etc/X11/xorg.conf.d] cvt 1600 1200 68
# 1600x1200 67.96 Hz (CVT) hsync: 84.95 kHz; pclk: 183.50 MHz
Modeline "1600x1200_68.00" 183.50 1600 1712 1880 2160 1200 1203 1207 1250 -hsync +vsync

Then use "cvt" to generate whatever modes are desired at a given resolution, making sure to limit the vertical refresh low enough such that the "hsync" value is never exceeded. At 640x480 this monitor can do 160 Hz,

[/etc/X11/xorg.conf.d] cvt 640 480 160
# 640x480 159.42 Hz (CVT) hsync: 84.49 kHz; pclk: 73.00 MHz
Modeline "640x480_160.00" 73.00 640 688 752 864 480 483 487 530 -hsync +vsync
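The hsync check can also be done by hand from any modeline: horizontal frequency is the pixel clock divided by the total pixels per line (the 4th horizontal number), and vertical refresh is that divided by the total lines (the 4th vertical number). A minimal sketch, with hypothetical function names, using the two modelines above:

```python
def modeline_rates(pclk_mhz, htotal, vtotal):
    """Return (hsync in kHz, vertical refresh in Hz) for a modeline."""
    hsync_khz = pclk_mhz * 1000.0 / htotal
    vrefresh_hz = hsync_khz * 1000.0 / vtotal
    return hsync_khz, vrefresh_hz

def fits_monitor(pclk_mhz, htotal, vtotal, max_hsync_khz):
    """True if the mode's horizontal frequency stays within the monitor limit."""
    return modeline_rates(pclk_mhz, htotal, vtotal)[0] <= max_hsync_khz

# Limit taken from the monitor's peak mode: "1600x1200_68.00" 183.50 ... 2160 ... 1250
max_hsync_khz, _ = modeline_rates(183.50, 2160, 1250)   # about 84.95 kHz

# Candidate mode: "640x480_160.00" 73.00 ... 864 ... 530
hsync, vrefresh = modeline_rates(73.00, 864, 530)       # about 84.49 kHz, 159.42 Hz
ok = fits_monitor(73.00, 864, 530, max_hsync_khz)
```

This is only the horizontal frequency test; it says nothing about whether the monitor accepts the vertical refresh or sync polarity.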

X.org Configuration
Thankfully with NVIDIA drivers, it is possible to completely turn off mode validation. This is what enables the interesting non-standard modes, and also enables users to destroy old VGA monitors which have no out-of-range protection. In the "Device" section add (this is all one line),

Option "ModeValidation" "AllowNon60HzDFPModes, NoEDIDDFPMaxSizeCheck, NoVertRefreshCheck, NoHorizSyncCheck, NoDFPNativeResolutionCheck, NoMaxSizeCheck, NoMaxPClkCheck, AllowNonEdidModes, NoEdidMaxPClkCheck"

Using this, I was able to try something similar to a 15.5 KHz arcade monitor modeline. This is actually for an Amiga I think. Also note "cvt" won't make correct 15.5 KHz modelines, horizontal frequency will be wrong.

Modeline "640x240" 13.22 640 672 736 832 240 243 246 265 -hsync -vsync

Which is out of range of the ViewSonic (too low a horizontal sync) and triggers the monitor's out-of-range protection. The same mode seems to work on the really old VGA monitor I have (which probably supported CGA, EGA, and VGA horizontal frequencies). This seems to validate that newer NVIDIA GPUs and drivers (in X at least) and the Nano can be made to generate the lower hsync signals needed to drive CGA arcade monitors. I'm still missing the conversion box required to test this (see the bottom of this post).

Solving the "Broken Panning"
The problem I was having before is the X.org+driver automatically configuring and using both the laptop and HDMI out displays whenever I switch from "Metamode" to classic "Modeline". Since the "Modeline" was not supported by the laptop display, the display would show nothing but still be enabled, and that is why it looked as if the panning was broken. Everything was actually working, the mouse was just going over to the other display which was a black screen.

Fixing this is easy, just disable the laptop screen using the "Monitor-" option (use the "xrandr" command line to list the display connections to find the proper name),

Section "Monitor"
Identifier "nope"
Option "Ignore" "true"
Option "Enable" "false"
EndSection

Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
Option "Primary" "true"
Option "Enable" "true"
HorizSync 31-86
VertRefresh 60-162
UseModes "Modes0"
Option "DPMS"
EndSection

Section "Screen"
Identifier "Screen0"
Device "Device0"
Option "Monitor-DP-2" "nope"
Option "Monitor-HDMI-0" "Monitor0"
Monitor "Monitor0"
EndSection

Next step is to attempt to see if I can drive both screens at different frequency and resolution. Then I can use one for programming and the other for analog output.

240p Arcade Options
Great place to go for information: scanlines.hazard-city.de. Under the assumption that my setup can generate a proper 240p VGA signal at the standard 15.5KHz used by arcade monitors (and NTSC TVs), this thread post talks about a converter which will convert that VGA signal into component for a US TV: TC1600 VGA to YPrPb Transcoder (no resampling/scaling).


HDFury Nano GX: HDMI to VGA

Got my HDFury Nano GX, now have the ability to run my little CRT from the HDMI out of the laptop with the 120Hz screen and a GTX 880M...

The Nano is simply awesome. Can now also run the PS4 on a VGA CRT with this device, way better than the plasma HDTV I've been using. When the device needs to decrypt HDMI signals, it needs the cord for USB power. My little CRT can do 720p60Hz, and the tiny amount of super-sampling from the PS4's downscaler in combination with the CRT scanline bloom, low latency, and low persistence creates an awesome visual.

Running from the GTX 880M with NVIDIA drivers in Linux worked right out of the box at 720p60Hz also on the little CRT. I ran with 720p GPU-downsampled from 1080p to compare apples to apples. Yes 60Hz flickers with a white screen, but with the typically low intensity content I run, I don't really notice the flicker. Comparing the 120Hz LCD and the CRT at 60Hz is quite interesting. The CRT definitely looks better motion wise. The 120Hz LCD has no strobe backlight, so it has 4-8x higher persistence than the CRT at 60Hz. Very fast motion is easy to track visually on the CRT. When the eye tracking fails, it still does not look as bad as the 120Hz LCD. The 120Hz LCD is harder to track in fast motion without seeing what looks similar to full-shutter 4-tap motion blur at 30Hz. It is still visually obvious that the frame sits for a bit in time.

In terms of responsiveness, the 120Hz LCD is driven directly by the 880M. The loop from reprogramming the GPU LUT to reading the new value on a color calibration sensor takes just 8 ms. My test application also minimizes input latency by reading controller input directly from CPU memory right before view dependent rendering. Even with that, the 120Hz LCD definitely feels more responsive. The 8ms difference between the CRT at 60Hz and the LCD at 120Hz seems to make an important difference.

Thoughts on Motion Blur
Motion blur on the CRT at 60Hz in my mind is completely unnecessary. Motion blur on the 120Hz LCD (or even 60Hz LCD) is something I would not waste perf on any more. However it does seem as if the entire point of motion blur for "scan-and-hold" displays like LCDs is to simply reduce the confusion that the human visual system is subjected to. Specifically just to break the hard edges of objects in motion, so as to reduce the blur confusion the mind is getting from the full persistence "hold". Seems like if motion blur is used at 60Hz and above on an LCD, it is much better to just limit it to a very short size with no banding.

Nano and NVIDIA Drivers in X
Had to manually adjust the xorg conf to get anything outside 720p60Hz to work. I disabled the driver from using the EDID data, went back to classic horz and vert ranges. Noticed a bunch of issues with the NVIDIA Drivers and/or hardware,

(a.) Down-sampling at scan-out has some limits, for example going from 1920x1080 to 640x480 won't work. However scaling only height or width with virtual panning in the unscaled direction does work. This implies that either the driver has a bug, or more likely that the hardware does not have enough on-chip line buffer for the scaler to do such high reductions.

(b.) Up-sampling at scan-out from NVIDIA GPUs is completely useless because they all introduce ringing (hopefully some day they will fix that, I haven't tried Maxwell GPUs yet).

(c.) Metamodes won't do under 31KHz horizontal frequency modes. Instead it forces the dead ugly "doublescan".

(d.) Skipping metamodes and falling back to modelines has a bug where small resolutions automatically extend out the virtual panning resolution in width only, but have broken panning. I still have not found a workaround for this. The modelines even under 31KHz seem to work however (must use "ModeValidation" options to turn off the safety checks).

The Nano does work with frequencies outside 60Hz: managed to get 85Hz going at 640x480. Seems as if the Nano also supports low resolutions and arcade horizontal frequencies (a modeline with 320x240 around 60Hz worked). Unfortunately I'm somewhat limited in testing by the limits of the CRT. It also seemed as if I could get the 880M to actually use an arcade horizontal frequency. But I don't have a way to validate this yet. Won't be sure until I eventually grab another converter (VGA to Component supporting 240p) and try driving an NTSC TV with 240p like old consoles did.


Thoughts on Display Color Calibration for Games

On Apple products across the board, the factory tonal configuration is Gamma 2.2, not sRGB. Using an sRGB backbuffer is totally useless; instead whatever shader converts from linear high dynamic range to display target needs to manually do the pow(). Typically this step is manual anyway, because that is required to properly dither the floating point color to 8-bit per channel output. On the plus side, Apple products are so well calibrated and matched even between desktop and mobile, that anyone with a color calibrated authoring pipeline can target the hardware and the consumer will experience the artist's intent. This is simply awesome.

On the PC side, and from what I could see on a very small sampling of the fragmented Android space, sRGB is a better match (than Gamma 2.2) to default device factory calibration. This is not surprising given that both sRGB and Rec.709 (and later) HDTV standards adopted a linear segment close to black. The idea being that the linear segment enables a better perceptual distribution given a fixed set of bits.
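To make the difference between the two targets concrete, here is a minimal sketch of both encodings plus a manually dithered 8-bit quantize. The sRGB constants are the standard published ones; the function names and the plain uniform random dither are my own assumptions, not what any particular game or driver does.

```python
import random

def gamma22_encode(linear):
    """Pure power curve: a single pow(), no linear segment."""
    return linear ** (1.0 / 2.2)

def srgb_encode(linear):
    """Standard sRGB curve: linear segment near black, then a 2.4 power curve."""
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * linear ** (1.0 / 2.4) - 0.055

def quantize_dithered(encoded, rng=random):
    """Quantize an encoded [0,1] value to 8 bits with +/- half a step of noise."""
    noise = rng.random() - 0.5                # hypothetical uniform dither
    value = int(encoded * 255.0 + 0.5 + noise)
    return max(0, min(255, value))

# Near black the two curves disagree strongly: a linear value of 0.001
# lands on a much higher code value with Gamma 2.2 than with sRGB.
dark = 0.001
codes = (gamma22_encode(dark) * 255.0, srgb_encode(dark) * 255.0)
```

The branch in srgb_encode is the extra cost referred to when the conversion has to be done manually in a shader.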

The disadvantage of encodings like sRGB, which mix both a linear segment and gamma curve into the tonal curve, is that "correct" manual dithering can be more expensive (because the conversion is much more expensive). Given that all realtime digital content should use temporal dithering to avoid output banding, Apple's choice of fixed Gamma 2.2 seems like a much better choice. However...

On the Topic of Banding
TN panels often are 6-bit/channel, with temporal dithering. Plasma hits the other end of the extreme (maybe 1 or 2-bit/channel?) with extreme temporal dithering (600 Hz). In both these cases, applications need to manually dither beyond the exact amount required for 8-bit output (display dither is too conservative). Also in both these cases, the application temporal dither can mix in bad ways with the display's temporal dithering. My current feeling is that the correct solution to this problem is to replace the "correct" temporal dither with a film grain with a gamma response (like film) applied in the linear HDR colorspace. This film grain would have a minimum amount even in the light areas which is large enough to serve as the temporal dither (to remove banding on the worst case target). Also the film grain would be between 1.5 and 2 pixels in size, so that it does not conflict with a display's 1 pixel sized temporal dithering. The end result of this is that sRGB again is a fine target, and Gamma 2.2 requires extra shader overhead.

White Point Calibration
Seems best to just target the D65 (daylight filtered 6500K) of sRGB. Knowing that: (a.) displays will be +/- that value, (b.) the mind automatically adapts to small differences in white point, and (c.) that white point will change +/- towards darks as well even on a given display. The cause of (c.) is that even if the display is calibrated to D65, the native black point of the display typically is not D65, and the only way to fix that is to raise the black level (adding intensity to some channels, reducing contrast), which is not something OEMs and users want.

Simple Production Calibration Goals
So the goal of simple calibration of displays is to get the {R,G,B} LUTs to provide a D65 white, through the entire gray scale, with the sRGB tonal curve, with the exception that somewhere in the darks, the very dark grey color will start to color shift to the native black point color tint, and then terminate at something which is not fully black. The color gamut of the display ultimately decides saturation scaling, which changes per type of display.

Simple In-Game Controls
Display gamma is the wild west, so at least need some user-adjustable gamma. Something like "move the slider until the dark symbol is just barely visible". Still not sure if user-adjustable offset is required as well.

Scifi Reading Suggestion List from Twitters

Accelerando - Charles Stross
Altered Carbon - Richard K. Morgan
Anathem - Neal Stephenson
Blindsight - Peter Watts
Blue Remembered Earth - Alastair Reynolds
Book of the New Sun: Series of 4 books - Gene Wolfe
Commonwealth Saga: Pandora's Star, Judas Unchained - Peter F. Hamilton
Cryptonomicon - Neal Stephenson
The Culture Series - Iain M. Banks
Deepness in the Sky - Vernor Vinge
Diamond Age - Neal Stephenson
Dune Series - Frank Herbert
A Fire Upon the Deep - Vernor Vinge
Leviathan Wakes - James S. A. Corey
Lord of Light - Roger Zelazny
Heechee Series - Frederik Pohl
House of Suns - Alastair Reynolds
Hyperion Cantos: Hyperion, ... - Dan Simmons
In Her Name Series - Michael R. Hicks
The Martian - Andy Weir
The Moon is a Harsh Mistress - Robert A. Heinlein
Neptune's Brood - Charles Stross
Nexus - Ramez Naam
The Night's Dawn Trilogy: The Reality Dysfunction, The Neutronium Alchemist, The Naked God - Peter F. Hamilton
Old Man's War - John Scalzi
Only Forward - Michael Marshall Smith
Player of Games - Iain M. Banks
Redshirts - John Scalzi
Revelation Space - Alastair Reynolds
Sandkings - G.R.R Martin
Silo Series: Wool, Shift, Dust - Hugh Howey
Singularity Sky, Iron Sunrise - Charles Stross
The Skinner - Neal Asher
Solaris - Stanislaw Lem
Star Wolf - David Hamilton
The Unincorporated Man - Dani Kollin and Eytan Kollin
Tuf Voyaging - G.R.R Martin
Windup Girl - Paolo Bacigalupi



Put the X Window Manager I've been using for over a decade up on GitHub. It provides simple full or split screen tiled windowing with virtual windows. Nothing more. Ideal for me, probably not ideal for you. Works great now that GIMP can be placed in "single window mode".

 Simple yet very useful single screen X Window Manager.
 Designed to minimize wasted user time interacting with windows.
 No configuration files.
 Tiny x86-64 binary.

 ALT+ESC .......... Close window.
 ALT+TAB .......... Cycle through window list on virtual screen (like Windows).
 ALT+` ............ Cycle window shape between full, and tiled positions.
 ALT+1 ............ Switch virtual screen left.
 ALT+2 ............ Switch virtual screen right.
 ALT+3 ............ Move focus window to virtual screen left.
 ALT+4 ............ Move focus window to virtual screen right.

 The windows list is ordered as follows,

  { most recently used, 2nd most recently used, ..., last used }

 While ALT is held down pressing TAB will cycle through list,
 going to the last recently used window from the current window.
 It will wrap around at the end.
 After ALT is released, the list is updated.
 The new current window is moved to the front of the list.

 Only requires a C compiler and the X11 library.
 Try something like,

  gcc minwm.c -Os -o minwm -I/usr/X11/include -L/usr/X11/lib -lX11
  strip minwm

 Then setup your .xinitrc file like,

  xrdb -merge $HOME/.Xresources
  xterm -rv -ls +sb -sl 4096 &
  exec $HOME/minwm

 Then run xinit and then start programs from the terminal.
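The ALT+TAB ordering described in the README above can be sketched in a few lines. This is only a model of the described behavior (most recently used list, cycle while ALT is held, reorder on release); the class and method names are hypothetical and not taken from minwm.c.

```python
class WindowList:
    """Most-recently-used window order for one virtual screen (illustrative)."""

    def __init__(self, windows):
        self.order = list(windows)   # index 0 = most recently used
        self.cursor = 0              # position reached while ALT is held

    def tab(self):
        """ALT held, TAB pressed: step to the next window, wrapping at the end."""
        self.cursor = (self.cursor + 1) % len(self.order)
        return self.order[self.cursor]

    def release_alt(self):
        """ALT released: the selected window moves to the front of the list."""
        current = self.order.pop(self.cursor)
        self.order.insert(0, current)
        self.cursor = 0
        return current

wm = WindowList(["xterm", "gimp", "browser"])
wm.tab()           # selects "gimp"
wm.release_alt()   # list becomes ["gimp", "xterm", "browser"]
```

A single ALT+TAB therefore toggles between the two most recent windows, because the list is only reordered on release.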


Next Generation OpenGL Initiative Details from Khronos BOF

OpenGL Ecosystem BOF 2014


Cross vendor project between OpenGL and OpenGL ES working groups:
- Chair = Tom Olson (ARM)
- IL Group Chair = Bill Licea-Kane (Qualcomm)
- API Spec Editors = Graham Sellers (AMD) and Jeff Bolz (NVIDIA)

Committed to adopting a portable intermediate language for shaders.
Compatibility break from existing OpenGL.
Starting from first principles.
Multi-thread friendly.
Greatly reduced CPU overhead.
Full support for tiled and direct renderers.
Explicit control: application tells driver what it wants.



Link to the Shadertoy example.

Growing up in the era of the CRT "CGA" Arcade Monitor was just awesome. Roughly 320x240 or lower resolution at 60 Hz with a low persistence display. Mix that with stunning pixel art. One of the core reasons I got into the graphics industry.

Built the above Shadertoy example to show what I personally like in attempting to simulate that old look and feel on modern LCD displays. The human mind is exceptionally good at filling in hidden visual information. The dark gaps between scanlines enable the mind to reconstruct a better image than what is actually there. The rightmost panel adds a quick attempt at a shadow mask. It is nearly impossible to do a good job simulating that because the LCD cannot get bright enough. The compromise in the shader example is to rotate the mask 90 degrees to reduce chromatic aberration. The mask could definitely be improved, but this is a great place to start...

Feel free to use/modify the shader. Hopefully I'll get lucky and have the option to turn on the vintage scanline look when I play those soon to be released games with awesome pixel art!


Vintage Programming

A photo (not a screenshot) of one of my home vintage development environments running on modern fast PCs. Shot shows colored syntax highlighted source to the compiler of the language I use most often (specifically the part which generates the ELF header for Linux). More on this below.

This is running 640x480 on a small mid 90's VGA CRT which supports around 1000 lines. So no garbage double scan and no horrible squares for pixels. Instead a high quality analog display running at 85 Hz. The font is my 6x11 fixed size programming font.

This specific compiler binary on x86-64 Linux is under 1700 bytes.

A Language
The language is ultra primitive: it does not include a linker or anything to do code generation, and there is no debugger (frankly one is not needed, as debuggers are slower than instant run-time recompile/reload style development). Instead the ELF (or platform) header for the binary, and the assembler or secondary language which actually describes the program, are written in the language itself.

Over the years I've been playing with both languages which are in classic text form, and languages which require custom editors and are in a binary form. This A language is the classic text source form. All the variations of languages I've been interested in are heavily influenced by Color Forth.

This A compiler works in 2 passes, the first both parses and translates the source into x86-64 machine code. Think of this as factoring out the interpreter into the parser. The second pass simply calls the entry point of the source code to interpret the source (by running the existing generated machine code). After that whatever is written in the output buffer gets saved to a file.

Below is the syntax for the A language. A symbol is an untyped 64-bit value in memory. Like Forth there is a separate data and return stack.

012345- \compile: push -0x12345 on the data stack\
,c3 \write a literal byte into the compile stream\
symbol \compile: call to symbol, symbol value is a pointer to function\
'symbol \compile: pop top of data stack, if value is true, call symbol\
`symbol \copy the symbol data into the compile stream, symbol is {32-bit pointer, 32-bit size}\
:symbol \compile: pop data stack into symbol value\
.symbol \compile: push symbol value onto data stack\
%symbol \compile: push address of symbol value onto data stack\
"string" \compile: push address of string, then push size of string on the data stack\
{ symbol ... } \define a function, symbol value set to head of compile stream\

And that is the A language. The closing "}" writes out the 32-bit size to the packed {32-bit pointer, 32-bit size} symbol value, and also adds an extra RET opcode to avoid needing to add one at the end of every define. There is one other convention missing in the above description, there is a hidden register used for the pointer to the output buffer.

Writing Parts of the Language in the Language
The first part of any source file is a collection of opcodes, like the { xor ,48 ... } at the top of the image which is the raw x86-64 machine code to do the following in traditional assembly language (rax = top of data stack, rbx points to second data stack entry),

XOR rax, [rbx]
SUB rbx, 8

This collection of opcodes generates symbols which form the stack based language the interpreter uses. They would get used like `xor in the code (the copy symbol to compile stream syntax). For instance `long pops the top of the data stack and writes out 8-bytes to the output buffer, and `asm pushes the output buffer pointer onto the data stack.

I use this stack based language to then define an assembler (in the source code), and then I write code in the assembler using the stack based language as effectively the ultimate macro language. For instance if I was to describe the `xor command in the assembly it would look as follows,
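For reference, the raw bytes behind the two instructions shown above can be worked out with a tiny encoder. This is a sketch in Python rather than in the A language, the helper names are hypothetical, and it only handles the low eight registers with the simple addressing forms used here.

```python
RAX, RBX = 0, 3      # x86-64 register numbers
REX_W = 0x48         # REX prefix selecting 64-bit operand size

def modrm(mod, reg, rm):
    """Pack the ModRM byte: 2-bit mod, 3-bit reg, 3-bit rm."""
    return (mod << 6) | (reg << 3) | rm

def xor_reg_mem(reg, base):
    """XOR reg64, [base]: opcode 33 /r. Not valid for base = rsp or rbp,
    which need a SIB byte or displacement in this addressing mode."""
    return bytes([REX_W, 0x33, modrm(0b00, reg, base)])

def sub_reg_imm8(reg, imm):
    """SUB reg64, imm8: opcode 83 /5 ib (sign-extended 8-bit immediate)."""
    return bytes([REX_W, 0x83, modrm(0b11, 5, reg), imm & 0xFF])

# XOR rax, [rbx] ; SUB rbx, 8
code = xor_reg_mem(RAX, RBX) + sub_reg_imm8(RBX, 8)
print(" ".join("%02x" % b for b in code))   # 48 33 03 48 83 eb 08
```

The leading 48 matches the ",48" literal byte at the start of the { xor ... } define.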

{ xor .top .stk$ 0 X@^ .stk$ 8 #- }

Which is really hard to read without syntax coloring (sorry my HTML is lazy). For naming, the "X" = 64-bit extended, the "@" = load, and the "#" = immediate. So the "X@^" means assemble "XOR reg,[mem+imm]". The symbols "top" and "stk$" contain the numbers of the registers for the top of the stack and the pointer to the second item on the stack respectively.

Compiler Parser
The compiler parsing pass is quite easy, just a character jump table based on prefix character to a function which parses the {symbol, number, comment, white space, etc}. These functions don't return, they simply jump to the next thing to parse. As symbol strings are read they are hashed into a register and bit packed into two extra 64-bit registers (lower 4-bits/character in one register, upper 3-bits/character in another register). This packing makes string compare easy later when probing. Max symbol string is 16 characters. Hash table is a simple linear probing style, but with an array of 2 entries per hash value filling one cacheline. Each hash table entry has the following 8-byte values {lower bits of string, upper bits of string, pointer to symbol storage, unused}. The symbol storage is allocated from another stack (which only grows). Upon lookup, if a symbol isn't in the hash table it is added with new storage. Symbols never get deleted.
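The symbol packing and probing described above can be modeled as follows. This is a sketch in Python, not the actual compiler: the hash function (toy_hash), the table size, and all names are assumptions made for illustration, since the post does not specify them.

```python
BUCKETS = 1024                      # hypothetical table size
MASK64 = (1 << 64) - 1

def pack_symbol(name):
    """Pack up to 16 7-bit characters into (low 4 bits each, high 3 bits each)."""
    data = name.encode("ascii")
    if not 1 <= len(data) <= 16:
        raise ValueError("symbols are 1 to 16 characters")
    lo = hi = 0
    for i, ch in enumerate(data):
        lo |= (ch & 0xF) << (4 * i)   # 16 chars * 4 bits = 64 bits
        hi |= (ch >> 4) << (3 * i)    # 16 chars * 3 bits = 48 bits
    return lo, hi

def toy_hash(lo, hi):
    """Placeholder hash; the real compiler's hash is not described."""
    return ((lo * 0x9E3779B97F4A7C15) ^ hi) & MASK64

class SymbolTable:
    def __init__(self):
        # Two entries per hash value; 2 * 32 bytes would fill one 64-byte cacheline
        self.buckets = [[None, None] for _ in range(BUCKETS)]
        self.storage = []             # grow-only symbol storage

    def lookup(self, name):
        """Return the storage index for name, adding the symbol if missing."""
        lo, hi = pack_symbol(name)
        start = toy_hash(lo, hi) % BUCKETS
        for probe in range(BUCKETS):  # linear probing over buckets
            bucket = self.buckets[(start + probe) % BUCKETS]
            for way in range(2):
                entry = bucket[way]
                if entry is None:     # never deleted, so empty means absent
                    self.storage.append(0)
                    bucket[way] = (lo, hi, len(self.storage) - 1)
                    return bucket[way][2]
                if entry[0] == lo and entry[1] == hi:
                    return entry[2]   # string compare is two integer compares
        raise MemoryError("symbol table full")
```

Because symbols are never deleted, the first empty slot found while probing proves the symbol is absent, which is what lets lookup and insert share one loop.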