## 20150630

### Sugar Free Peppermint Chocolate Chip Custard Ice Cream [SFPCCCIC]

EDIT: Photo and updated ingredients...

Been experimenting with ketogenic friendly ultra-low-carb ice cream. This, along with bacon-wrapped sour cream, is one of the perks of the extremely high fat and low carb life style. First experiment didn't go as planned, placing the result in the freezer was a mistake. Apparently sugar in normal ice cream actually is the key component which enables the ice cream to maintain a great texture when frozen. Second pass, I'm keeping the result in the fridge. Ingredients for part one,

1 pint - whole cream
3 - egg yokes
1/2 tsp - peppermint extract
6 drops - pure liquid stevia

Blend everything together in a blender on high. Pour in a pan on the stove, stirring and slowly bring up to 160 deg F. Pour in a chilled container, store in fridge until chilled. Extra ingredients for part two,

1 - Lindt 90% chocolate square

Pour chilled mixture in ice cream maker, along with chopped chocolate square. Lindt 90% is better than the lower cocoa thanks to it's higher fat content. Churn until mixture looks like ice cream. Add to chilled container, store in fridge until chilled. Eat. Turned out awesome. Going to try higher peppermint next time and maybe another drop of stevia...

## 20150624

### AMD Fury X (aka Fiji) is a Beast of a GPU Compute Platform

Compute Raw Specs
Raw specs from wikipedia adjusted to per millisecond: comparing what the two vendors built around 600 mm^2 on 28 nm,

AMD FURY X: 8.6 GFlop/ms, 0.5 GB/ms, 0.27 GTex/ms
NV TITAN X: 6.1 GFlop/ms, 0.3 GB/ms, 0.19 GTex/ms

Or the same numbers in operations per pixel at 1920x1080 at 60 Hz,

AMD FURY X: 69 KFlop/pix, 4.0 KB/pix, 2.2 KTex/pix
NV TITAN X: 49 KFlop/pix, 2.4 KB/pix, 1.5 KTex/pix

Think about what is possible with 69 thousand flops per pixel per frame.

HBM
HBM definitely represents the future of bandwidth scaling for GPUs: a change which brings the memory clocks down and bus width up (512 bytes wide on Fury X vs 48 bytes wide on Titan X). This will have side effects on ideal algorithm design: ideal access granularity gets larger. Things like random access global atomics and random access 16-byte vector load/store operations become much less interesting (bad idea before, worse idea now). Working in LDS with shared atomics, staying in cache, etc, becomes more rewarding.

## 20150623

### BenQ XL2730Z Blur Reduction vs CRT

TFT Central Review for the BenQ XL2730Z

The BenQ XL2730Z is one of the best 2560x1440 LCDs on the market. This display has a blur reduction mode which enables a strobe back-light (which only works properly with fixed 120Hz or 144Hz refresh rates). Its sharpness control also enables a reduction of sharpness, which can cut out a lot of the hard exact square pixel LCD feel, leaving an image which is more pleasing to look at. And it provides an exact pixel mode which instead of enlarging lower resolution modes, keeps a 1:1 mapping of input to output pixel, and draws the outside borders with black. So 1920x1080 still looks good, just doesn't fully fill the screen. Compared to other LCDs this display is awesome, and running with a 1000Hz gaming mouse at 144Hz with a strobe back-light is a wonderful experience in comparison to what most people are used to.

Compared to CRTs?
CRTs still easily surpass even these top of the line LCDs in quality. One of the core differences is that even the best LCDs cannot transition pixels fast enough for a true low persistence display. The BenQ for instance has frame cross-talk (seeing part of one or more frames). At the default "Area" setting (which controls the vertical position of minimal cross-talk), at 144Hz, up to 2 extra frames are partly visible at the top and bottom of the screen, however toward the center just 1 extra frame is partly visible. I have a feeling that variable cross-talk has to do with the fact that the back-light has to be a global strobe, but pixel rows are still scanned (changed) in the classic CRT scan order. So row to strobe timing is only best at one point on the screen. Likewise the LCD uses a column dither on the transition overdrive, even and odd pixel columns over-shoot then under-shoot respectively the transition such that the 2 column average is closer to the correct signal. The minimal 1 extra frame cross-talk is a product of the lack of ability to transition to the proper signal in one frame. This is quite distracting in comparison with the CRT. The CRT can easily offer low persistence at much lower frame rates without transition artifacts, while providing filtered pixels at a lower resolution while presenting a proper image instead of a lesson in cubism. So what would be 2560x1440x120Hz or 442 Mpix/s on the LCD, still does not compare to the quality of 1600x1200x76Hz or 146 Mpix/s on the CRT. And the CRT can use that 3x GPU performance for higher quality visuals.

Other Technology?

## 20150525

### OS Project : 7 - PS/2 and Misc

Just started back up again after a long break. Finished up the PS/2 driver. Simplified keyboard interface to one 64-bit bit array in memory, and a simple polling function.

   __0 __1 __2 __3 __4 __5 __6 __7 __8 __9 __A __B __C __D __E __F
0| 0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
1| G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V
2| W   X   Y   Z      -   =   [   ]   \   ;   '   ,   .   /   SPC
3| LF  RT  UP  DN  HOM END PUP PDN INS DEL BS  TAB RET SHF CTL ALT`

Key down sets a bit. Key release clears bits for {SHF, CTL, ALT} only. It is up to the application to clear bits otherwise. Ended up being a good design choice as keys like '\' immediately generate release commands even when held down on my keyboard. Key bitmap organized to make editor usage (like key-to-hex conversions) simple. Keys limited to what I expect to use, and supporting everything on typical PC arcade controllers.

Humans and modern computers share a common thread: neither is any good at multitasking. The context of multitasking in this post is not async IO, but rather running multiple applications at the same time on the same CPU core. Suspect that the needs of multitasking change quite a bit at the point where nearly everything on the machine is perceptually instantaneous: as it would be once the 30 years of 40% yearly compounding complexity dept is removed. SSDs have throughput similar to last-level cache on early P4 era PCs. Logical end point: CPU cores owned by the applications themselves, no single-core multitasking, no preemption. App using power control to drop CPU core to lowest power state required to sustain a hard real-time user interface.

## 20150522

### The Other Project : Cleaned Up

From the CLP Motorsports open house,

## 20150519

### Voxelnauts

"Voxelnauts: one universe, infinite worlds, with possibilities limited only by your imagination. Your adventure awaits - come build it"

As a young kid I grew up building things out of legos. Had one box which collected all the legos from a few sets. Would set out to build things and imagine what they could be in a virtual world. Voxelnauts looks to be the modern realization of that, but done in a way which was impossible 30+ years ago: ability to feel the presence of what you build in a massive real-life scale, to share that with others, and interact in a persistent world...

## 20150509

### OS Project : 6 - Hashing

Been a while since I had any time to work on this project. Renamed as I've done some major design changes...

Axed
Dropping source-less programming part. Source-less worked great for x86 programming, however there are a few things I didn't like about it. Little issues like having 2 words for each common forth construct like "CALL WORD" instead of just "WORD". Larger issues being the requirement to have another editor to generate complex data or non-x86 machine code.

From the Ashes
Pulled the old work into a new bootloader. This time rapid prototyping using just NASM. Eventually I'll have to rewrite the bootloader in the source/language I end up creating. Switched back to 64-bit long mode (see why below). Running with 1GB pages since long mode forces page tables.

Re-Thinking The Forth Dictionary
Time off from this project brought forth some new ideas addressing the original motivation for going source-less: (a.) don't like linear searches at interpreter-time through large dictionaries (simple forth implementation), (b.) don't like how hashing (complex forth implementation) takes a bunch of ALU-time, (c.) really don't like how dynamic hashing can leave 75% of the cache unused, (d.) and still want to have full 32-bit values in source without multiple source tokens.

Solving (d.) is easy, just switch to 64-bit source tokens. Source is now 2x the size, but 32-bit immediates are trivial, and words can now use 10 characters instead of 5. Source is a prefetchable linear read, so the "more memory for more simple" trade is worth it to me.

Now Can Hashing be Made Better?
Solving (b.): how about factoring everything in the hashing algorithm except the "AND" operation into the string encoding itself? The tokens encode strings as 60-bit integers. Just need to switch from a generic 6-bits per character, to something where the characters are encoded in a way that they effect all bits.

One thing I tried is to use a reversible cellular automata (CA) like Wolfram Rule 30R. So strings in tokens are kept in memory in the form after being processed through N steps of applying the CA. Editor to display the string simply has to reverse the CA process.

To test the theory I first grabbed an english dictionary, then hashed to {256, 512, 1K, 2K, ... 512K, 1 M} entry tables, and tracked the {min,max,average} table bin counts. Comparing the reversible CA to just the 64-bit finalizer in MurmurHash3 (not reversible but provides a good baseline): both have similar results.

Also tested against a dictionary with all possible {1,2,3} character strings (including symbols and numbers). Results are again similar.

Second I tried a recursive interval encoding: each character is encoded in the interval of the parent character. First character starts with the full 0 to 2^60-1 interval. Character 63 is a special terminal character (signals no more characters are encoded). Character 63 gets a smaller interval, allowing the other characters to get an extended interval designed to spread out the value when ANDed to a power of two. This works similar in theory to arithmetic encoding, except this has constant probabilities. Multiplers used to encode each character (starting with the 1st character and going to the last),

64*64*64*64*64*64*64*64*64 +63*63*63*63*63*63*63*63,
64*64*64*64*64*64*64*64 +63*63*63*63*63*63*63,
64*64*64*64*64*64*64 +63*63*63*63*63*63,
64*64*64*64*64*64 +63*63*63*63*63,
64*64*64*64*64 +63*63*63*63,
64*64*64*64 +63*63*63,
64*64*64 +63*63,
64*64 +63,
64 +1,
1,

This evenly distributes the effect of a character across all bits. The synthetic all {1,2,3} length string case performs better with this encoding. The english dictionary does not see much of an improvement.

A forth option would be to just use arithmetic encoding, or predictive arithmetic encoding on the string. I suspect this would have similar results on the english dictionary as the above fixed interval encoder had on the all possible strings case. However I didn't try it, because I don't have any great way to get symbol probabilities for source that does not exist yet, and current results are probably good enough.

Hash Cache Utilization
Solving (c.) hash cache utilization: how about a software hash cache. Source tends to have a small working set of words, even if the dictionary gets quite large. Now that hashing to different size tables is free via AND, I can split the hash table into two parts: a 256 entry table (or larger) which will effectively always be filled and in hardware cache, and a much larger table for software misses which still wastes 75% of the hardware cacheline. The software hash cache contains the following per entry,

{4-byte data, 4-byte larger table address, 8-byte string}

The larger table is inclusive. Eviction of the small hash uses the larger table address, to avoid searching (or probing) for the right entry.

Moving On
This should have the right combination of being simple enough and fast enough to avoid triggering the "something is wrong" re-design impulse again.

The high-level view I have of the OS is effectively an ultra-high-power micro-controller that boots into a forth editor like a C64 boots into basic...

## 20150426

### Source-Less Programming : 5

Managed to get the boot loader done, which includes the following steps,

(1.) Move the stack seg:pointer (since next step overwrites it).
(2.) Use BIOS to read the other 62 512-byte sectors for the first track.
(3.) Use BIOS to switch to 80x50 text mode and load custom character glyphs.
(4.) Use BIOS to set EGA text palette to 0-15 with 0 for overscan.
(5.) Program VGA palette registers for those 16 colors.
(6.) Use BIOS to enable A20.
(7.) Turn off interrupts, and relocate the image's 63 sectors to zero.
(8.) Load zero entry IDT, minimal 3 entry GDT.

The 2nd 512-byte sector contains the 8x8 character bitmaps for the first 64 characters. The majority of the time was spent making a nice font, getting colors the way I wanted, and prototyping editor look and feel (without building it).

Didn't feel like fully hand assembling 16-bit x86 machine code for the boot loader, so I used NASM and hexdump to accellerate the process (to provide machine code I could pad out to 32-bit alignment). Also wrote a quick C based tool to bootstrap the process of building the loader. Something which would enable me to easily build out an annotated image, and show a print out in the console of what I'd be seeing in the editor. Here is a shot of a bit of the scratch C code I used to make the font,

Here is a shot in QEMU of the loader displaying the font,

And another shot from QEMU showing the pallet,

What the Current Annotated Image Looks Like
Below is a shot captured from the terminal window output of the C tool. I'm using 3 cache lines for the loader code.

Grey lines separate the 512-byte sectors. Memory address on the left in grey. Each pair of lines shows half a x86 cacheline. The blue to white shows the 5 character/word annotation strings (now using the extra 2 bits of the label for color). The red hex show the image data. Not using {GET,ABS,REL} tagged words in this part, so everything in the bootloader is just hand assembled 16-bit machine code, and this is not representative of what the rest of the system will look like. The rest of the system will have {GET opcode} followed by {HEX} or {ABS} for opcode immediates (easy to write). The 16-bit code is {HEX} mixed opcode and immediates, quite a bit different (hard to write).

Some hints on the annotations,

Everything is in base 16. AX is TOP so I don't bother with "A=9000" (which wouldn't fit anyway), instead I just write "9000" (the A= is implied). The "!" means store so "SSSP!" is storing TOP (or AX) into both SS and SP. The "B=200" means BX=200h. In this 16-bit x86 case I use 3E to pad out opcodes to 32-bit. The "X" = SI, "Y" = DI, "F" = BP.

Next Step
Ground work is done, next step is to bring up the opcode dictionary for {GET} words, then write a little IDE driver to get access to load the rest of the image, and to be able to save in the editor. After that, write the drawing code for the editor, then a mini PS/2 driver for the input, then write editor input handling. Then I have a full OS ready to start on a real machine.

## 20150423

### Source-Less Programming : 4

Still attempting to fully vet the design before the bootstrap reboot...

DAT words in the edit image need to maintain their source address in the live image This way on reload the live data can be copied over, and persistent data gets saved to disk. DAT annotations no longer have 30 bits of free space, instead they have a live address. When live address is zero. then DAT words won't maintain live data. This way read-only data can be self-repairing (as long as the annotations don't get modified). Going to use a different color for read-only DAT words. New persistent data DAT words will reference their edit-image hex value before reload (then get updated to the live address).

REL words always get changed on reload (self repairing). No need to keep the live address. REL is only used for relative branching x86 opcodes. Don't expect to have any run-time (non-edit-time) self-modifying of relative branch addresses. Given that branching to a relative branch opcode immedate is not useful, the LABEL of a REL word is only useful as a comment.

GET words also get changed on reload (self repairing). GET is only designed for opcodes and labeled constants. GET words will often be LABELed as a named branch/call target. Been thinking about removing GET, and instead making a new self-annotating word (display searches for a LABELed DAT word with the same image value, then displays the LABEL instead of HEX). This opens up the implicit possibility of mis-annotations. Would be rare for opcodes given they are large 32-bit values. But for annotating things like data structure immediate offsets, this will be a problem (4 is the second word offset in any structure).

ABS words always get changed on reload (self repairing). ABS words are targets for self-modifying code/data, so they also need LABELs. Reset on reload presents a problem in that ABS cannot be used to setup persistent data unless that persistent data is constant or only built/changed in the editor. But this limitation makes sense in the context that ABS addresses in live data structures can get invalidated by moving stuff around in memory. The purpose of ABS is edit-time relinking.

### Source-Less Programming : 3

Annotation Encoding
Refined from last post, two 32-bit annotation words per binary image word,

FEDCBA9876543210FEDCBA9876543210
================================
00EEEEEEDDDDDDCCCCCCBBBBBBAAAAAA - LABEL : 5 6-bit chr string ABCDE

FEDCBA9876543210FEDCBA9876543210
================================
..............................00 - DAT : hex data
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA01 - GET : get word from address A*4
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA02 - ABS : absolute address to A*4
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA03 - REL : relative address to A*4

Going to switch to just 2 lines per word displayed in the editor, Only DAT annotations show hex value, other types show LABEL of referenced address in the place of the hex value. So no need for an extra note. In practice will be using some amount of binary image memory to build up a dictionary of DAT words representing all the common somewhat forth like opcodes, then GET words in the editor to build up source.

Need to redo the bootloader from floppy to harddrive usage, and switch even the bootloader's 16-bit x86 code to 32-bit aligned LABEL'ed stuff so the final editor can edit the bootloader. Prior was avoiding manually assembling the 16-bit x86 code in the boot loader, but might as well ditch NASM and use something else to bootstrap everything.

## 20150422

### Source-Less Programming : 2

Continuing with what will either be an insanely great or amazingly stupid project...

Making slow progress with bits of free-time after work, far enough thinking through the full editor design to continue building. Decided to ditch 64-bit long mode for 32-bit protected mode. Not planning on using the CPU for much other than driving more parallel friendly hardware... so this is mostly a question of limiting complexity. Don't need 16 registers and the REX prefix is too ugly for me to waste time on any more. The 32-bit mode uses much more friendly mov reg,[imm32] absolute addressing, also with ability to use [EBP+imm32] without an SIB byte (another thing I mostly avoid). Unfortunately still need relative addresses for branching. 32-bit protected mode thankfully doesn't require page tables unlike 64-bit long mode. Can still pad out instructions to 32-bits via reduntant segment selectors.

Source-Less Analog to Compile-Time Math?
Compile-time math is mostly for the purpose of self-documenting code: "const uint32_t willForgetHowICameUpWithThisNumber = startHere + 16 * sizeof(lemons);". The source-less analog is to write out the instructions to compute the value, execute that code at edit time, then have anotations for 32-bit data words which automatically pull from the result when building 32-bit words for opcode immediates for the new binary image.

Reduced Register Usage Via Self Modifying Code
Sure, kills the trace cache in two ways, what do I care. Sometimes the easist way to do something complex is to just modify the opcode immediates before calling into the function...

What Will Annotations Look Like?
The plan so far is for the editor to display a grid of 8x8 32-bit words. Each word is colored according to a tag annotation {data, absolute address, relative address, pull value}. Each word has two extra associated annotations {LABEL, NOTE}. Both are 5 6-bit character strings. Words in grid get drawn showing {LABEL, HEX VALUE, NOTE} as follows,

LABEL
00000000
NOTE

The LABEL provides a name for an address in memory (data or branch address). Words tagged with absolute or relative addresses or pull value show in the NOTE field the LABEL of the memory address they reference. Words tagged with data use NOTE to describe the opcode, or the immediate value. Editor when inserting a NOTE can grab the data value from other words with the same NOTE (so only need to manually assemble an opcode once). Edit-time insert new words, delete words, and move blocks of words, all just relink the entire edit copy of the binary image. ESC key updates a version number in the edit copy, which the excuting copy sees triggering it to replace itself with the edit copy.

Boot Strapping
I'm bootstrapping the editor in NASM in a way that I'll be able to see and edit later at run-time. This is a time consuming process to get started because instead of using NASM to assemble code, I need to manually write the machine code to get the 32-bit padded opcodes. Once enough of the editor is ready, I need a very tiny IDE/PATA driver to be able to store to the disk image. Then I can finish the rest of the editor in the editor. Then I'll also be self hosted outside the emulator and running directly on an old PC with a non-USB keyboard, but with a proper PCIe slot...