20150426

Source-Less Programming : 5

Boot Loader Bring-up
Managed to get the boot loader done, which includes the following steps,

(1.) Move the stack seg:pointer (since next step overwrites it).
(2.) Use BIOS to read the other 62 512-byte sectors for the first track.
(3.) Use BIOS to switch to 80x50 text mode and load custom character glyphs.
(4.) Use BIOS to set EGA text palette to 0-15 with 0 for overscan.
(5.) Program VGA palette registers for those 16 colors.
(6.) Use BIOS to enable A20.
(7.) Turn off interrupts, and relocate the image's 63 sectors to zero.
(8.) Load zero entry IDT, minimal 3 entry GDT.
(9.) Enable protected mode and jump to the 3rd sector.
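
Step (8.) mentions a minimal 3 entry GDT without listing the entries. A common choice for such a table is a null descriptor followed by flat 4 GiB code and data segments. The sketch below packs that layout, which is an assumption and not necessarily what this loader uses. The helper name gdt_entry is made up for illustration.

```python
import struct

def gdt_entry(base, limit, access, flags):
    """Pack one 8-byte x86 segment descriptor (20-bit limit, 4-bit flags)."""
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                                # limit bits 0..15
        base & 0xFFFF,                                 # base bits 0..15
        (base >> 16) & 0xFF,                           # base bits 16..23
        access,                                        # present / ring / type byte
        ((flags & 0xF) << 4) | ((limit >> 16) & 0xF),  # flags + limit bits 16..19
        (base >> 24) & 0xFF,                           # base bits 24..31
    )

# Assumed layout: null, flat 32-bit code, flat 32-bit data.
GDT = b"".join([
    gdt_entry(0, 0x00000, 0x00, 0x0),  # entry 0: mandatory null descriptor
    gdt_entry(0, 0xFFFFF, 0x9A, 0xC),  # entry 1: code, execute/read, 4 KiB granular
    gdt_entry(0, 0xFFFFF, 0x92, 0xC),  # entry 2: data, read/write, 4 KiB granular
])

print(GDT.hex())
```

Three entries take 24 bytes, so the whole table fits in less than half of a 64-byte cache line.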

The 2nd 512-byte sector contains the 8x8 character bitmaps for the first 64 characters. The majority of the time was spent making a nice font, getting colors the way I wanted, and prototyping editor look and feel (without building it).
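
The sector size works out exactly: 64 glyphs of 8 rows at one byte per row is 512 bytes. Here is a minimal sketch of packing a glyph drawn as text into that form. The '#' and '.' drawing format, the bit order (leftmost pixel in the high bit), and the sample glyph are assumptions for illustration, not the scratch C tool mentioned below.

```python
def pack_glyph(rows):
    """Turn 8 strings of 8 characters ('#' = lit pixel) into 8 bytes, one per row."""
    assert len(rows) == 8 and all(len(r) == 8 for r in rows)
    out = bytearray()
    for row in rows:
        byte = 0
        for ch in row:
            byte = (byte << 1) | (1 if ch == "#" else 0)  # leftmost pixel lands in bit 7
        out.append(byte)
    return bytes(out)

# Hypothetical glyph using only horizontal and vertical strokes.
LETTER_I = [
    "########",
    "...##...",
    "...##...",
    "...##...",
    "...##...",
    "...##...",
    "########",
    "........",
]

glyph = pack_glyph(LETTER_I)
font_sector = glyph * 64  # 64 glyphs fill one 512-byte sector
print(glyph.hex(), len(font_sector))
```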

Didn't feel like fully hand assembling 16-bit x86 machine code for the boot loader, so I used NASM and hexdump to accelerate the process (to provide machine code I could pad out to 32-bit alignment). Also wrote a quick C-based tool to bootstrap the process of building the loader. Something which would enable me to easily build out an annotated image, and show a print out in the console of what I'd be seeing in the editor. Here is a shot of a bit of the scratch C code I used to make the font,



Here is a shot in QEMU of the loader displaying the font,



And another shot from QEMU showing the palette,



What the Current Annotated Image Looks Like
Below is a shot captured from the terminal window output of the C tool. I'm using 3 cache lines for the loader code.



Grey lines separate the 512-byte sectors. Memory address on the left in grey. Each pair of lines shows half an x86 cache line. The blue to white shows the 5 character/word annotation strings (now using the extra 2 bits of the label for color). The red hex shows the image data. Not using {GET,ABS,REL} tagged words in this part, so everything in the bootloader is just hand assembled 16-bit machine code, and this is not representative of what the rest of the system will look like. The rest of the system will have {GET opcode} followed by {HEX} or {ABS} for opcode immediates (easy to write). The 16-bit code is {HEX} mixed opcode and immediates, quite a bit different (hard to write).

Some hints on the annotations,

Everything is in base 16. AX is TOP so I don't bother with "A=9000" (which wouldn't fit anyway), instead I just write "9000" (the A= is implied). The "!" means store so "SSSP!" is storing TOP (or AX) into both SS and SP. The "B=200" means BX=200h. In this 16-bit x86 case I use 3E to pad out opcodes to 32-bit. The "X" = SI, "Y" = DI, "F" = BP.
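
The 3E padding can be sketched as below. Prefix bytes sit in front of the opcode, so padding means prepending 3E until the instruction length is a multiple of 4 bytes. The helper name pad_to_32 and the sample instructions are illustrative assumptions, not taken from the actual image.

```python
def pad_to_32(code, prefix=0x3E):
    """Prepend prefix bytes until the instruction length is a multiple of 4."""
    pad = -len(code) % 4
    return bytes([prefix]) * pad + bytes(code)

# 16-bit "mov ax, 9000h" is B8 00 90: three bytes, so one prefix byte is added.
mov_ax = pad_to_32(b"\xB8\x00\x90")
# "cli" is the single byte FA, so three prefix bytes are added.
cli = pad_to_32(b"\xFA")

print(mov_ax.hex(), cli.hex())
```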

Next Step
Ground work is done, next step is to bring up the opcode dictionary for {GET} words, then write a little IDE driver to get access to load the rest of the image, and to be able to save in the editor. After that, write the drawing code for the editor, then a mini PS/2 driver for the input, then write editor input handling. Then I have a full OS ready to start on a real machine.

20150423

Source-Less Programming : 4

Still attempting to fully vet the design before the bootstrap reboot...

DAT words in the edit image need to maintain their source address in the live image. This way on reload the live data can be copied over, and persistent data gets saved to disk. DAT annotations no longer have 30 bits of free space, instead they have a live address. When the live address is zero, DAT words won't maintain live data. This way read-only data can be self-repairing (as long as the annotations don't get modified). Going to use a different color for read-only DAT words. New persistent data DAT words will reference their edit-image hex value before reload (then get updated to the live address).

REL words always get changed on reload (self repairing). No need to keep the live address. REL is only used for relative branching x86 opcodes. Don't expect to have any run-time (non-edit-time) self-modifying of relative branch addresses. Given that branching to a relative branch opcode immediate is not useful, the LABEL of a REL word is only useful as a comment.

GET words also get changed on reload (self repairing). GET is only designed for opcodes and labeled constants. GET words will often be LABELed as a named branch/call target. Been thinking about removing GET, and instead making a new self-annotating word (display searches for a LABELed DAT word with the same image value, then displays the LABEL instead of HEX). This opens up the implicit possibility of mis-annotations. Would be rare for opcodes given they are large 32-bit values. But for annotating things like data structure immediate offsets, this will be a problem (4 is the second word offset in any structure).

ABS words always get changed on reload (self repairing). ABS words are targets for self-modifying code/data, so they also need LABELs. Reset on reload presents a problem in that ABS cannot be used to setup persistent data unless that persistent data is constant or only built/changed in the editor. But this limitation makes sense in the context that ABS addresses in live data structures can get invalidated by moving stuff around in memory. The purpose of ABS is edit-time relinking.
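
The four reload rules above can be put together in a rough sketch. Everything about the data layout here is an assumption for illustration: the image is a list of 32-bit words, each annotation is a (tag, arg) pair, the live image is a dict from byte address to value, and a REL displacement is measured from the end of the REL word (as for an x86 rel32 that is the last word of its instruction).

```python
DAT, GET, ABS, REL = 0, 1, 2, 3
MASK = 0xFFFFFFFF

def reload(image, notes, live, base=0):
    """Build the next running image from the edit image.

    image: list of 32-bit words, notes: one (tag, arg) per word,
    live: dict of byte address -> current 32-bit value in the running image.
    For DAT, arg is the live byte address (0 = no live data).
    For GET/ABS/REL, arg is the referenced word index A (address A*4).
    """
    out = list(image)
    # DAT first: persistent data is copied over from the live image.
    for i, (tag, arg) in enumerate(notes):
        if tag == DAT and arg != 0:
            out[i] = live.get(arg, out[i])
    # GET, ABS and REL are always rebuilt, which makes them self repairing.
    for i, (tag, arg) in enumerate(notes):
        here = base + 4 * i
        if tag == GET:
            out[i] = out[arg]                             # pull the referenced word
        elif tag == ABS:
            out[i] = (base + 4 * arg) & MASK              # absolute address
        elif tag == REL:
            out[i] = (base + 4 * arg - (here + 4)) & MASK # branch displacement
    return out

image = [0x11111111, 0, 0, 0, 0xAAAA]
notes = [(DAT, 0), (GET, 0), (ABS, 4), (REL, 0), (DAT, 0x100)]
live = {0x100: 0xBBBB}
result = reload(image, notes, live)
print([f"{w:08X}" for w in result])
```

Word 0 is read-only DAT (live address zero) and keeps its edit-image value, while word 4 picks up the live value.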

Source-Less Programming : 3

Annotation Encoding
Refined from last post, two 32-bit annotation words per binary image word,

FEDCBA9876543210FEDCBA9876543210
================================
00EEEEEEDDDDDDCCCCCCBBBBBBAAAAAA - LABEL : 5 6-bit chr string ABCDE


FEDCBA9876543210FEDCBA9876543210
================================
..............................00 - DAT : hex data
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA01 - GET : get word from address A*4
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA10 - ABS : absolute address to A*4
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA11 - REL : relative address to A*4
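
A minimal sketch of packing and unpacking the two annotation words laid out above. The post does not give the character ordering of the font, so the 64-entry CHARSET here (ASCII 20h through 5Fh) is a stand-in, and the function names are made up for illustration.

```python
# Stand-in 64-character set: ASCII 0x20..0x5F (space, punctuation, digits, A-Z).
CHARSET = "".join(chr(c) for c in range(0x20, 0x60))

DAT, GET, ABS, REL = 0, 1, 2, 3

def pack_label(text):
    """Pack up to 5 characters, first character in the low 6 bits, top 2 bits zero."""
    text = text.upper().ljust(5)[:5]
    word = 0
    for i, ch in enumerate(text):
        word |= CHARSET.index(ch) << (6 * i)
    return word

def unpack_label(word):
    """Recover the label string, dropping trailing padding spaces."""
    return "".join(CHARSET[(word >> (6 * i)) & 0x3F] for i in range(5)).rstrip()

def pack_tag(tag, word_index=0):
    """Low 2 bits hold the tag, upper 30 bits hold A (the referenced address is A*4)."""
    return ((word_index & 0x3FFFFFFF) << 2) | tag

def unpack_tag(word):
    return word & 3, word >> 2

label = pack_label("SSSP!")
tag_word = pack_tag(REL, 0x40)
print(f"{label:08X} {tag_word:08X}", unpack_label(label), unpack_tag(tag_word))
```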


Going to switch to just 2 lines per word displayed in the editor. Only DAT annotations show the hex value, other types show the LABEL of the referenced address in place of the hex value. So no need for an extra note. In practice will be using some amount of binary image memory to build up a dictionary of DAT words representing all the common somewhat Forth-like opcodes, then GET words in the editor to build up source.

Need to redo the bootloader from floppy to harddrive usage, and switch even the bootloader's 16-bit x86 code to 32-bit aligned LABEL'ed stuff so the final editor can edit the bootloader. Prior was avoiding manually assembling the 16-bit x86 code in the boot loader, but might as well ditch NASM and use something else to bootstrap everything.

20150422

Source-Less Programming : 2

Continuing with what will either be an insanely great or amazingly stupid project...

Making slow progress with bits of free-time after work, far enough thinking through the full editor design to continue building. Decided to ditch 64-bit long mode for 32-bit protected mode. Not planning on using the CPU for much other than driving more parallel friendly hardware... so this is mostly a question of limiting complexity. Don't need 16 registers and the REX prefix is too ugly for me to waste time on any more. The 32-bit mode uses much more friendly mov reg,[imm32] absolute addressing, also with ability to use [EBP+imm32] without an SIB byte (another thing I mostly avoid). Unfortunately still need relative addresses for branching. 32-bit protected mode thankfully doesn't require page tables unlike 64-bit long mode. Can still pad out instructions to 32-bits via redundant segment selectors.

Source-Less Analog to Compile-Time Math?
Compile-time math is mostly for the purpose of self-documenting code: "const uint32_t willForgetHowICameUpWithThisNumber = startHere + 16 * sizeof(lemons);". The source-less analog is to write out the instructions to compute the value, execute that code at edit time, then have annotations for 32-bit data words which automatically pull from the result when building 32-bit words for opcode immediates for the new binary image.

Reduced Register Usage Via Self Modifying Code
Sure, kills the trace cache in two ways, what do I care. Sometimes the easiest way to do something complex is to just modify the opcode immediates before calling into the function...

What Will Annotations Look Like?
The plan so far is for the editor to display a grid of 8x8 32-bit words. Each word is colored according to a tag annotation {data, absolute address, relative address, pull value}. Each word has two extra associated annotations {LABEL, NOTE}. Both are 5 6-bit character strings. Words in grid get drawn showing {LABEL, HEX VALUE, NOTE} as follows,

LABEL
00000000
NOTE


The LABEL provides a name for an address in memory (data or branch address). Words tagged with absolute or relative addresses or pull value show in the NOTE field the LABEL of the memory address they reference. Words tagged with data use NOTE to describe the opcode, or the immediate value. Editor when inserting a NOTE can grab the data value from other words with the same NOTE (so only need to manually assemble an opcode once). Edit-time insert new words, delete words, and move blocks of words, all just relink the entire edit copy of the binary image. ESC key updates a version number in the edit copy, which the executing copy sees, triggering it to replace itself with the edit copy.
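
A rough sketch of how one row of that grid could be laid out as text, three lines per row of words. The column width, the function name draw_words, and the sample words are assumptions for illustration. Coloring by tag is left out.

```python
def draw_words(words, labels, notes, per_row=8):
    """Return text lines: a LABEL line, a HEX line and a NOTE line per row of words."""
    lines = []
    for start in range(0, len(words), per_row):
        row = range(start, min(start + per_row, len(words)))
        # LABEL and NOTE are at most 5 characters, padded to the 8-digit hex width.
        lines.append(" ".join(f"{labels[i][:5]:<8}" for i in row).rstrip())
        lines.append(" ".join(f"{words[i]:08X}" for i in row))
        lines.append(" ".join(f"{notes[i][:5]:<8}" for i in row).rstrip())
    return lines

# Hypothetical two-word example.
grid = draw_words([0x00009000, 0xD08E3E3E], ["START", ""], ["9000", "SS!"])
print("\n".join(grid))
```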

Boot Strapping
I'm bootstrapping the editor in NASM in a way that I'll be able to see and edit later at run-time. This is a time consuming process to get started because instead of using NASM to assemble code, I need to manually write the machine code to get the 32-bit padded opcodes. Once enough of the editor is ready, I need a very tiny IDE/PATA driver to be able to store to the disk image. Then I can finish the rest of the editor in the editor. Then I'll also be self hosted outside the emulator and running directly on an old PC with a non-USB keyboard, but with a proper PCIe slot...

Look No Triangles : Scatter vs Gather

There are a bunch of people working on and succeeding in non-triangle rendering. With GPU perf still climbing IMO it is possible to return to the golden age of a different kind of software rendering: the kind done in a pipeline built out of compute shaders.

In my sphere tracing of math based SDF fields I was purely ALU bound, tracing to the limit of floating point precision. The largest performance win was found by doing a many-level hierarchical trace (starting with very coarse grain empty space skipping). But the limit of all this is just a log reduction of the number of steps in the search, still requires many search steps per pixel. And when doing a memory based trace (instead of a math based trace) the search is just a very long latency chain with divergent access patterns. Tracing via searching on the GPU hits a wall. To make matters worse when tracing, the ALU units are loaded up with work involved in tracing, instead of something useful.
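
For readers who have not seen it, the basic (non-hierarchical) sphere tracing loop looks roughly like this. It is a CPU-side sketch in Python with a single hard-coded sphere as the SDF, which is an assumption for illustration and not the GPU tracer described above. It does show why the cost per pixel is a chain of dependent search steps.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 5.0), radius=1.0):
    """Signed distance from point p to a sphere (negative inside)."""
    return math.dist(p, center) - radius

def sphere_trace(origin, direction, sdf, max_steps=128, eps=1e-4, max_dist=100.0):
    """March along a unit-length ray, stepping by the distance bound each time.

    Returns (hit distance or None, number of steps taken).
    """
    t = 0.0
    for step in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t, step          # close enough to the surface: hit
        t += dist                   # safe step: nothing is closer than dist
        if t > max_dist:
            return None, step + 1   # left the scene: miss
    return None, max_steps

hit = sphere_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sphere_sdf)
miss = sphere_trace((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), sphere_sdf)
print(hit, miss)
```

Each step depends on the result of the previous one, so a memory-backed distance field turns this loop into the long latency chain mentioned above.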

The alternative to this is to switch to a mostly scatter based design. A large amount of the tree structure traversed each frame in a gather based approach is similar across frames. Why not just have the tree stored mostly expanded in memory based on the needs of the view? Then expand or collapse the tree based on the new visibility needs of the next frame. Rendering is then a mostly scatter process which reads leaves in the tree once. Reads of memory can now be coherent, and ALU can be used for things more interesting than search. Scatter will be somewhat divergent, but that cost can be managed by loading up enough useful ALU work in parallel. There are a lot of ways to skin this. Nodes of the tree can be bricks. Bricks can be converted into little view based depth sprites, then binned into tiles and composited. Seems as if bricks converted into triangle meshes and rasterized is the popular path now, but still using the CPU to feed everything. This could get much more interesting when the GPU is generating the cached geometry bricks: artistically controlled procedural volume generation...

20150421

From Scratch Bug 2 : Source-Less Programming

This is a disruptive idea which comes back periodically: source-less programming. Is it possible to efficiently program at a level even lower than an assembler?

The general idea is that the editor is now something similar to an enhanced hex editor which edits a binary image directly. Lowest memory is split into three parts {{running image, annotations for edit image}, edit image}. The {running image, annotations for edit image} is the boot image. The {edit image} is working space which gets snapshot replacement for {running image} on a safe transition zone. The "annotation" section is what enables human understanding of the binary image.

Words
One way to keep the system very simple is to always work in 32-bit words. A 32-bit word in memory is one of four things {data, opcode, absolute address, rip relative address}. Data is easily handled by the hex editor part. The annotation could provide a name for the data or a comment. Opcodes in x86 can be complex but it is easy to simplify. Start with something similar to Forth zero-operand and one-operand operations (calls, etc). Make all operations padded to 32-bit alignment (x86-64 can use the 2e ignored prefix to pad). A call for instance becomes {32-bit opcode for call, 32-bit rip relative branch address}. Or a global load becomes {32-bit opcode for load, 32-bit rip relative data address}. Annotation memory can provide a name for the opcode. Annotation can provide a tag for each word in memory which marks if the memory is a relative or absolute address (word gets a different color based on tag similar to Color Forth). Addresses can be auto annotated by displaying the annotation associated with the destination. Editor works on the {edit image}, with insert and delete of words automatically adjusting all the words tagged as address (effectively relinking at edit time). The {edit image} can also keep a mapping of original {running image} address so that it is possible to view the live values of any data. Editor provides something to easily be able to write an annotation and have the associated opcode or address automatically get updated. For example type the opcode name and the 32-bit value is automatically updated. Very simple and yet powerful minimal tool.
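
Edit-time relinking on insert can be sketched as follows. The representation is an assumption for illustration: each word is a (tag, payload) pair, address-tagged words store the index of their target word, and a relative address is measured from the end of the word holding it. The opcode values are the call and ret opcodes padded with three 2e prefix bytes, read as little-endian words.

```python
ADDRESS_TAGS = {"abs", "rel"}

def insert_word(words, index, new_word):
    """Insert new_word at index and retarget every address word that points at or past it."""
    out = []
    for tag, payload in words:
        if tag in ADDRESS_TAGS and payload >= index:
            payload += 1            # the target moved down by one word
        out.append((tag, payload))
    out.insert(index, new_word)     # new_word's own target uses post-insert indices
    return out

def link(words, base=0):
    """Resolve target indices into the 32-bit values stored in the binary image."""
    image = []
    for i, (tag, payload) in enumerate(words):
        if tag == "abs":
            image.append((base + 4 * payload) & 0xFFFFFFFF)
        elif tag == "rel":
            image.append((4 * payload - 4 * (i + 1)) & 0xFFFFFFFF)
        else:
            image.append(payload)
    return image

words = [
    ("opcode", 0xE82E2E2E),  # 2e 2e 2e e8: padded call
    ("rel", 3),              # branch target: word 3
    ("data", 0),
    ("opcode", 0xC32E2E2E),  # 2e 2e 2e c3: padded ret
]
before = link(words)
edited = insert_word(words, 2, ("data", 0x1234))
after = link(edited)
print(hex(before[1]), hex(after[1]))
```

The displacement grows from 4 to 8 bytes without anyone touching the call itself, which is the point of tagging address words.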

20150415

Pixel Art and Slot Mask Pitch

This and the prior post are all shots from the same late model Arcade CRT, a 29" SVGA Makvision which can scan 30-40KHz and 47-90Hz. I'm cheating somewhat in taking a Metal Slug screen shot and displaying it on a non-15KHz monitor. Metal Slug was roughly 304x224 if I'm remembering right, so ultra low resolution to enable a 60Hz scan-out on CGA CRTs.

Arcade titles over the years with CRTs had a range of monitors and resolutions. Displays would provide a different look depending on the Slot Mask Pitch (effectively the number of dots for a given scanline). In this next shot I'm driving the monitor near its lowest resolution (at roughly 312 lines), then using H-size and V-size control to enlarge the screen shot as much as possible (showing maybe 250 lines on a 600 line display, so higher slots/line count than the Metal Slug titles). The 29" Makvision is a flat screen and thus suffers from moire patterns more than a curved display. In order to get the classic scan-line look (which is caused by scanning only half the display's lines to get double the frame rate), this shot has the moire reduction turned off (which keeps the beam from having vertical line jitter, which would otherwise cause lines to blend together).



Alternatively I can drive this monitor at 800x600 and then set the moire reduction to blend scan lines. This is to simulate various Arcade games which displayed a relatively higher resolution compared to the display slot mask pitch (lower slots/line count, the other extreme). The prior post's image was somewhere in-between these two examples.

Indie vs Real Slug Fest

If you see squares you are doing it wrong. The classic pixel art masters never intended for it to look as ugly as exact square pixels.

Shot from Metal Slug. The shot on the right is from a photo of an arcade CRT monitor.

20150414

From Scratch Bug

Inspired by Jaymin's JayStation2 effort and remembering a past life building custom OSs for early x86 machines, haven't been able to avoid the custom OS bug any longer. It starts easy with a harmless QEMU install, followed by a 512-byte bootloader switching to 80x50 text mode and installing a custom 48 character Forth font, then bring up of a Forth assembler/editor, then on to the pain of modern PCI and USB driver bring-up... with the eventual goal of a tiny bootable USB thumb system.

Amazingly refreshing to not have the OS telling you NO. Or the API telling you NO. Modern systems are all about the NO. Systems I grew up on were all about the YES.

Reworking my language from scratch, trying something new, replacing the Forth data stack with a new concept, but maintaining zero operand opcodes. Not sure if the idea will pan out. Dropping everything but 32-bit word support from the language, no need to interop with other software. No more 8/16/64-bit loads or stores (can still just inline machine code if required). Still running in x86-64 64-bit mode, so return stack PUSH/POP/CALL/RET is still a 64-bit stack operation, just don't need that 64-bit address space or 64-bit pointers anywhere else. Trying padding out all x86 opcodes to 32-bit alignment. This makes the 32-bit immediate 32-bit aligned. Wastes space, gives up some perf? Why would I care when most of the CPU side of the system fits in the L1 cache. Dropping paging, dropping interrupts, dropping everything, none of that stuff is needed.

Reworking an editor and binary source encoding. Switching to 32-bit tokens with 5 character max strings. 48 character character-set. Doing something horrible with font design: 1=I, 0=O, 2=Z, 5=S, etc. All caps font with no non-vertical or non-horizontal lines. Actually looks awesome. When you don't have to interop with the NO machine, long symbol names are not required. Color Forth like editors have almost no state. It is magical how they function simultaneously as an editor/assembler/console/debugger/UI/etc. Take the idea of "editor-time-words", words embedded in the source code which are evaluated when the block of source is drawn to the screen. Becomes possible to build out UI tools in the source. Can have an editor-time word read system data and draw in real-time updates in the source code itself. Editor-time words are just like any other word in the system, just color tagged to only be evaluated at draw time.
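
One way to see why 5 characters fit comfortably in a 32-bit token: five base-48 digits need fewer than 28 bits, which would leave 4 bits for a color tag (five 6-bit fields would need 30 bits and leave 2). The post does not say which packing is used, so the base-48 scheme, the character ordering, and the function names below are all assumptions.

```python
# Hypothetical 48-character set: space, digits, capitals, and 11 symbols.
CHARSET48 = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ!?+-*/=<>.,"

def pack48(text):
    """Pack up to 5 characters as base-48 digits, first character least significant."""
    text = text.ljust(5)[:5]
    value = 0
    for ch in reversed(text):
        value = value * 48 + CHARSET48.index(ch)
    return value

def unpack48(value):
    """Recover the string, dropping trailing padding spaces."""
    chars = []
    for _ in range(5):
        value, digit = divmod(value, 48)
        chars.append(CHARSET48[digit])
    return "".join(chars).rstrip()

token = pack48("DUP")
print(48 ** 5, 2 ** 28, token, unpack48(token))
```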

Minimal systems are a blessing, more so when you have only minimal free time to work on them.

20150406

End of an Era

A followup with stills from the "Why I'm Using Fedex From Now On" post...
The 29" Makvision/Wei-ya 30K-50KHZ XGA monitor. Actually the one I found looks like an early 50khz model in the original box, never used, perfect condition!!! Unable to find any of the 50khz models, was super thrilled to finally find one in this condition,



Then UPS Killed It
The last and only "new" 50khz model I could find, destroyed by UPS. Looks like it was dropped on the corner, or rammed by a forklift, causing the tube to implode. Would have been a huge bang, someone probably got a good laugh, then sent the corpse to its new owner, me. The loss of a CRT is very sad. No one will ever again manufacture them, they are a superior technology forgotten by the world, far better than even the best low persistence flat panels. To avoid another tragic loss like this, do the world a favor and ship with Fedex instead.



Revision 2015 Tubes : Part 3

Ziphead by Fairlight & CNCD


We Stand Divided by cocoon


20150403

Why I'm Using Fedex From Now On

Paid extra to have a 100 lbs UPS package delayed a day so I could ensure I'd be there to sign for it on Friday (was out of town Thursday), very important to me that I get the package before the weekend. This required the hassle of signing up for a UPS account and paying extra to have it delayed a day. Website accepts the money and provides an estimated delivery window of 11am - 3pm. Delivery day at home waiting, 3pm rolls around and no package. Use the online chat, let them know something is wrong, double check the online tracking and I don't see an "Out For Delivery" on the tracking webpage. The first online chat I tell them something is wrong, they say no everything is fine. I ask is the package on a local truck for delivery, they say yes, then tell me to wait around until the end of the day which is 7pm. I ask for confirmation, they say yes it will make it by 7pm. Then an hour passes, I use the UPS online chat again, saying the tracking webpage still does not say "Out For Delivery", I tell them again something is wrong, and again they tell me I'm wrong. I ask for a guarantee that the package is in fact on a local truck with a guaranteed delivery today, I wait, they again confirm and say that for sure I will get the package by 7pm. I wait around until 7pm. Still no package after waiting for 8 hours. I call their customer support line, asking where is the package, they put me on hold for 30 minutes. Then 5 or so more minutes saying I can pick it up tomorrow at a location which is 15 minutes away. They say the package is there at a pickup location which is 15 minutes away but closes at 8pm. With roughly 17 minutes to get there, I speed over, manage to get to the UPS location with a few minutes to spare. Sharing the building with a bunch of other random people who are also waiting for something. 8pm rolls around and they lock us in the building so no one else can enter after close.
They cannot seem to find my package, they tell me the person on the phone lied and the package isn't in the building. Person by person, they give excuses of why they cannot get their package. One woman got confirmation that her package was pulled and ready for pickup. Except it was left on a trailer next door, and they cannot get the package until next Monday. Roughly 30 minutes of waiting they tell me the same line. My package is right next door in the warehouse on a truck which cannot be "sorted" until Sunday at midnight. Even UPS has Saturday delivery, they won't bother to just take the package off their truck on the weekend. UPS has managed a rare perfect score in the "ensure your customer will never use your service ever again if they have a choice not to" category!

Finally got the package today, many days late on Monday, as they wheeled it to my door, it sounded like broken glass was banging around in the package, after taking a look inside, it was totally destroyed. Brand new (from 2006) never been opened 29" VGA arcade monitor of specs which are impossible to find anywhere (special 800x600 @ 90Hz tube). Picture tube totaled. Would have made some serious noise when the vacuum imploded, something very obvious. Instead of sending it back to the shipper after they destroyed it, they decided it was better to give me the problem and drop it on my doorstep.