20140420

Minimal x86-64 Elf Header For Dynamic Loading

It is possible to get the Linux x86-64 ELF overhead down to 495 bytes while still including enough information to support one symbol, dlsym(), which is the only symbol required to do manual dynamic loading of whatever else is needed at runtime. This is a very important step in reducing the work required for those building custom languages. Below is a readelf -a dump of a test binary.

ReadELF
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1f0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         3
  Size of section headers:           64 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  INTERP         0x00000000000001c4 0x00000000000001c4 0x00000000000001c4
                 0x000000000000001a 0x000000000000001a  RWE    1
      [Requesting program interpreter: /lib/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000361 0x0000000000000361  RWE    200000
  DYNAMIC        0x00000000000000e8 0x00000000000000e8 0x00000000000000e8
                 0x0000000000000080 0x0000000000000080  RWE    8

Dynamic section at offset 0xe8 contains 8 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000004 (HASH)               0x1b0
 0x0000000000000005 (STRTAB)             0x1dd
 0x0000000000000006 (SYMTAB)             0x168
 0x0000000000000007 (RELA)               0x198
 0x0000000000000008 (RELASZ)             24 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x0000000000000000 (NULL)               0x0

There are no relocations in this file.

There are no unwind sections in this file.

Histogram for bucket list length (total of 1 buckets):
 Length  Number     % of total  Coverage
      0  0          (  0.0%)
      1  1          (100.0%)    100.0%

No version information found in this file.

Details
Everything is aliased into one Read/Write/Execute binary blob with all ELF related stuff at the beginning of the blob. The ordering of the ELF related stuff is {ELF header, program header, dynamic section, symbol table, relocation table, hash table, interpreter string, dynamic string table}.

There is no need for section headers as they are redundant and not read by the dynamic loader. There is no PHDR program header. This also uses the SYSV style hash instead of the GNU style hash (same as the -hash-style=sysv option for ld). The hash table can be a simple array of 32-bit words {1,2,1,0,0}. So just one bucket, and 2 symbols in the file (undefined and dlsym).
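
For reference, a sketch in C of that hash table, plus the standard SYSV ELF hash function the loader applies to the string "dlsym" (with one bucket, any hash value maps to bucket[hash % 1] = bucket[0]):

#include <stdint.h>

/* nbucket=1, nchain=2, bucket[0]=1 (lookups start at symbol index 1,
   which is dlsym), chain[0]=chain[1]=0 (STN_UNDEF ends the chain) */
static const uint32_t hashTable[5] = {1, 2, 1, 0, 0};

/* standard SYSV ELF hash from the System V ABI */
static uint32_t ElfHash(const unsigned char* name) {
    uint32_t h = 0, g;
    while (*name) {
        h = (h << 4) + *name++;
        g = h & 0xf0000000;
        if (g) h ^= g >> 24;
        h &= ~g; }
    return h; }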

This uses "/lib/ld-linux-x86-64.so.2" as the interpreter string. Note when messing around with ld and simple assembly programs, ld can easily place in the ELF standard /lib/ld64.so.1, which does not work on Linux because of a missing symlink, so the --dynamic-linker=/lib/ld-linux-x86-64.so.2 option would be required.

This packs the interpreter string then the dynamic string table, with the dynamic string table starting at the null terminator of the interpreter string. This one byte overlap covers the required null first string.
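
Given the offsets in the dump above (INTERP at 0x1c4 for 0x1a bytes, STRTAB at 0x1dd), one packing consistent with that overlap looks like the following sketch (the order of the two strtab strings is an assumption):

/* the interpreter string's NUL terminator doubles as the required
   empty string at dynamic strtab offset 0 */
static const char packed[] =
    "/lib/ld-linux-x86-64.so.2"  /* file offset 0x1c4, 25 chars (PT_INTERP) */
    "\0"                         /* file offset 0x1dd = strtab offset 0 */
    "libdl.so.2\0"               /* strtab offset 0x1 (DT_NEEDED) */
    "dlsym";                     /* strtab offset 0xc (symbol name) */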

This uses PF_X+PF_W+PF_R for p_flags for all the program headers, and runs from offset 0 instead of 0x400000 to make file offset = virtual address.

This does not use DT_BIND_NOW or the associated flag from the dynamic section, and that seems to work at least in this case when using STT_OBJECT instead of STT_FUNC for the symbol entry for dlsym.

The readelf fails to print the relocation info and objdump simply cannot process the binary. The binary's only symbol, dlsym, is set up with STV_DEFAULT, STB_WEAK, and STT_OBJECT. The single relocation entry is R_X86_64_JUMP_SLOT, set up to modify a 64-bit address in the binary blob which is then used directly. There is no real PLT or GOT. This is set up to do binding at load time, so the 64-bit address for dlsym is simply used directly after load.
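
In <elf.h> terms the symbol and relocation described above would look roughly like this sketch; st_name assumes the string table layout sketched earlier, and r_offset is purely illustrative (somewhere in the blob for the loader to patch with the address of dlsym), while the rest follows the dump (two 24-byte symbols at SYMTAB 0x168, one 24-byte entry at RELA 0x198):

#include <elf.h>

static const Elf64_Sym dlsymSym = {
    .st_name  = 0xc,                                 /* "dlsym" in the strtab */
    .st_info  = ELF64_ST_INFO(STB_WEAK, STT_OBJECT), /* weak object, not STT_FUNC */
    .st_other = STV_DEFAULT, };

static const Elf64_Rela dlsymRel = {
    .r_offset = 0x358,                               /* illustrative patch address */
    .r_info   = ELF64_R_INFO(1, R_X86_64_JUMP_SLOT), /* symbol index 1 = dlsym */
    .r_addend = 0, };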

Note, other relocation types simply don't work and I'm not sure why. Using "LD_DEBUG=all ./a.out" to debug shows the following when things work,

relocation processing: ./a.out (lazy)
3306:     symbol=dlsym;  lookup in file=./a.out [0]
3306:     symbol=dlsym;  lookup in file=/lib/libdl.so.2 [0]
3306:     binding file ./a.out [0] to /lib/libdl.so.2 [0]: normal symbol `dlsym'

And then shows something like "binding file ./a.out [0] to ./a.out [0]" (binding to itself) when things fail. From what I can remember, attempting to switch to relocations R_X86_64_64 or R_X86_64_GLOB_DAT always hit the fail case.

Fail
My new favorite measure of fail in system engineering is the ratio of the garbage required to do something vs the minimal amount of stuff actually required. In this case binaries could be as simple as {a 64-bit pointer to dlsym() at address 0 (filled in at run time), program entry at address 8, and then the rest of the program}. By this metric ELF has roughly a 64x fail ratio.

Revision 2014 Tubes

20140413

Minimal Single Symbol ELF For Using GL

With the exception of graphics, everything on Linux can easily be done via kernel system calls. For those working in a custom language, using GL adds the complexity nightmare of dynamic linking. However there is a way to reduce this complexity to the following (which enables a simpler ELF),

(1.) Link to libdl.so only.
(2.) Use only one symbol: dlsym().

Using dlsym() one can get the address for dlopen(), which together enables manually opening libraries. Unfortunately some GL libraries require that the caller have the full libc treatment (everything that happens between _start() and main()). Implementing all that crap in a custom language is a great way to gain the full realization that the Linux people lost track of the K.I.S.S. principle a long time ago. Instead one can just dlopen() libc.so itself and manually pass through the complexity filter via __libc_start_main(). Here is a proof-of-concept in C for x86-64 which lacks a bunch of important stuff, but works enough to get into main() and then print "ok" via a Linux syscall.

static inline long Linux1(long num, long arg1) { long ret; asm volatile ("syscall" : "=a" (ret) : "a" (num), "D" (arg1) : "cc", "memory", "%rcx", "%rdx", "%rsi", "%r8", "%r9", "%r10", "%r11"); return ret; }

static inline long Linux3(long num, long arg1, long arg2, long arg3) { long ret; asm volatile ("syscall" : "=a" (ret) : "a" (num), "D" (arg1), "S" (arg2), "d" (arg3) : "cc", "memory", "%rcx", "%r8", "%r9", "%r10", "%r11"); return ret; }

void* dlsym(void*, const char*);

typedef void* (*dlopenF)(const char*, int);

#define RTLD_NOW 2

typedef int (*__libc_start_mainF)(int (*)(int, char**, char**), int, char**, void (*)(void), void (*)(void), void (*)(void), void *);

static void None(void) { }

static int Main(int a, char** b, char** c) { char msg[]="ok\n"; Linux3(1, 1, (long)msg, 3); Linux1(231, 0); return 0; }

void _start(void) {
dlopenF dlopen;
dlopen = (dlopenF)dlsym(0, "dlopen");
void* libc = dlopen("libc.so", RTLD_NOW);
__libc_start_mainF __libc_start_main;
__libc_start_main = (__libc_start_mainF)dlsym(libc, "__libc_start_main");
__libc_start_main(Main, 0, 0, None, None, None, 0); }


This can be built with,

gcc -std=gnu99 src.c -ldl -Os -fno-unwind-tables -fno-exceptions -nostdlib

Even given the unnecessary garbage GNU ld dumps in the ELF, this binary will be under 3K, which is almost half the size of compiling "void main(void) { }" using the standard libs. Going to assembly or some other custom language and manually building the ELF can make this much smaller.

20140410

Why GL's ARB_Bindless_Texture is Awesome

Unquestionably GL bindless is well matched to NVIDIA hardware. After all, they wrote the extension that ARB_bindless_texture was based on. Over time I have slowly started to understand how it is a great match for AMD's GCN hardware as well...

Quick GCN Hardware Review
Each non-buffer texture in GL can be accessed by either a floating point offset or an integer texel offset. On GCN the integer texel offset case requires only a 16-byte or 32-byte descriptor depending on texture type. The floating point offset case requires an extra 16-byte sampler descriptor. GCN has the ability to block-load descriptors into the scalar register file 4, 8, 16, 32, or 64 bytes at a time with a single instruction.

GL vs DX With Regards to Sampler Style
The DX style of separate samplers and textures on GCN would require two block load instructions for the first texture accessed with a new sampler (one load for the texture descriptor the other for the sampler descriptor). Each additional texture using the same sampler would require just one block load instruction.

The GL style of a combined sampler and texture on GCN only requires one block load instruction for any texture. The texture and sampler descriptors sitting side by side in memory can be fetched together with just one 32-byte or 64-byte scalar block load instruction. As long as the K$ is not a limiter, the GL case could be faster (fewer instructions). This could be more important as wavefront occupancy goes down (fewer opportunities to multi-issue scalar, vector, and memory instructions).

A Bindless Indirection can be an Optimization?
Let's look in detail at a situation which would be common for a Clustered Forward based engine. Typically objects are drawn with a material shader, and this material shader will have a selection of new textures in combination with some textures shared with the prior shader, such as per-view textures (to fetch lights, etc). If the graphics API builds a unique buffer of descriptors for each draw call, then there is an extra amount of CPU overhead to copy all the per-view texture descriptors into each chunk of buffer per draw call. Another option would be an API which enables keeping a unique cached table/material of descriptors in a buffer (table changes per draw). However in that case during rendering, after each table change, parts of the K$ need to get reloaded for the per-view descriptors which might already exist in the cache at a different address. Usage of the indirection of GL bindless could in theory reduce CPU setup (less to build), and also give a better chance of keeping a warm K$ between draw calls (as descriptors shared between shaders keep the same address).

Using Bindless in a "Binding" Way
Another option for using bindless, which was recently brought to my attention, is just replacing glBindTextures() with glProgramUniformHandleui64vARB(). The latter sets the traditional GL sampler bind points with a collection of bindless handles, and in theory has the ability to remove the driver overhead and validation of the legacy glBindTextures() path. This also in theory opens up the option for the driver to skip the bindless indirection if that indirection would possibly be slower on some kinds of hardware. Lastly this option is 100% compatible with engines which need to support GPUs which have no ARB_bindless_texture support.
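
A minimal sketch of that usage, assuming a GL loader (GLEW, glad, etc.) provides the ARB_bindless_texture entry points; the names tex, prog, and loc are illustrative:

static GLuint64 SetupHandle(GLuint tex) {
    GLuint64 handle = glGetTextureHandleARB(tex); /* combined texture+sampler */
    glMakeTextureHandleResidentARB(handle);       /* make resident once, up front */
    return handle; }

/* per draw: set sampler uniforms from handles instead of binding textures */
static void SetDrawTextures(GLuint prog, GLint loc, const GLuint64* handles, GLsizei count) {
    glProgramUniformHandleui64vARB(prog, loc, count, handles); }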

AMD's Socketed Kabini

Semiaccurate on the Socketed Kabini

CHIP________ CORES GHz MemMhz L2MB $__
Athlon_ 5350 4____ 2.0 1600__ 2___ $59
Athlon_ 5150 4____ 1.6 1600__ 2___ $49
Sempron 3850 4____ 1.3 1600__ 2___ $39
Sempron 2650 2____ 1.4 1333__ 1___ ???


Interesting options for those who need maximum single GPU performance, but can accept minimum CPU performance and a PCIe x4 limit, all at minimum cost. With USB 3.0 support, this kind of machine can be run from a fast USB stick. The ASRock AM1B-ITX mITX motherboard for the AM1 platform is just $40. Total build cost for a case-less system could be around,

$40_ motherboard
$50_ Athlon 5150
$60_ 8GB memory (2x 4GB)
$70_ 550W power supply
$570 AMD R9 290X GPU
----
$790 total

20140406

Practical Assembly

Market Evolution
The market has evolved such that x86-64 is the only practical consumer platform which supports an attached GPU of medium to high performance. Personally I'd rather work on ARM-64 as it is a much better ISA, but market forces have made that impossible. The first sign was that AMD (an x86 company) secured all the console wins. Then when Microsoft decided to limit Windows on ARM to Metro only, that effectively killed ARM-64 as an interesting platform for desktop graphics.

One positive side effect of this is that for those who build graphics requiring 100W and up, there is only one ISA, x86-64, which is, by design, fully backwards compatible. Thus for those who are brave, ISA portability is no longer an excuse to avoid leveraging assembly.

Where Traditional Assembly Goes Wrong
The core problem with traditional assembly usage is the complete lack of practical front-end tools for the language. A second problem is developers attempting to program assembly in a C style, using the platform ABI, resulting in a massive mess of constantly pushing and popping everything from a stack around functions.

There is a much better way to look at x86-64 as a platform for solving problems. First, while x86-64 only has 16 integer registers, this is somewhat mitigated by the ISA's built-in ability to take one operand from a memory location. This provides enough effective working space to factor register allocation from inside calls to outside calls. This is a powerful simplification tool as many algorithms can then effectively have a programmer-defined global register allocation. Different subsets of the program running at different times simply alias the meaning of various registers. Likewise the stack is no longer used for data (it is only used to interoperate with dynamic link libraries) and a pure data flow based design is much easier to realize. Assembly, with a good front-end tool, becomes easier to write than C or C++.

What is a Good Front-End Tool?
My choice has always been something similar to the Forth language. A language front-end should be powerful enough to use code to generate code. For instance a language front-end should be able to load a file, transform that file as an executable would, and produce an inline data structure when compiling. In contrast think about how limited the "#include" functionality is in the C pre-processor.

I have been leveraging Forth-like interpreters for ages now. For instance, my prior FarrarFocus website was written in a server-side Forth-like language which was interpreted by PHP which generated client-side JavaScript. One of the advantages of using a Forth style front-end is that the assembler itself can be defined in the language. Also in contrast to the hundreds of MB of install required to compile "hello world" in gcc or clang, a simple complete Forth based system could easily be under 16K.

Here is what I've found to be useful over the years as front-end language syntax,

\comment is anything between slashes\
0deafb175- \hex number starting with 0-9, with postfix sign\
builtin \evaluate a built-in symbol\
.symbol \push the value of a symbol on the data stack\
:symbol \pop the top of data stack into symbol\
{ symbol \define a symbol as some code\ }
`symbol \evaluate a defined symbol\
'symbol \push the address of a symbol on the data stack\
"string" \push the address of a string on the data stack\


Note the use of a character prefix to define the usage of each symbol. When editing, I use simple colored syntax highlighting to make code meaning easy to understand visually. It is important to note the data stack is a construct of the interpreter (not code generation). The language works with postfix notation: "a + b" is written "a b +". Now for some examples of parts of writing an x86-64 assembler, starting with the syntax of the assembly,

\somewhere setup the aliased global register allocation\
0 :dst \the symbol "dst" is register 0 or RAX\
1 :address \the symbol "address" is register 1 or RCX\
\then in code use symbols instead of register names or numbers\
.dst .address 10 `@MOV8, \mov rax,[rcx+16]\


The "@MOV8," is used for the 64-bit form of the MOV instruction in the form where the source operand is a load (the @ means load) such as [reg] or [reg+imm8] or [reg+imm32]. The comma in this case "," means write to the output file. The language is really flexible. The above code could be also written as follows,

0 :RAX 1 :RCX
.RAX :dst .RCX :address
.dst .address 10 `@MOV8, \mov rax,[rcx+16]\


Here are the bits of an assembler (written in the front-end language) which could implement the "@MOV8" symbol. I'm writing these in reverse order,

{ @MOV8, 8b `@OP28, } \MOV is opcode 8b then call common @OP2 for 64-bit\

\FOLLOWING CODE REUSED FOR ALL @OP2 FORMS\
\pull args from stack then write opcode\
{ @OP28, :$op :$imm :$rm :$r `REX8, .$op ,1 `@MODrm, }
\generate 64-bit REX prefix\
{ REX8, .$r 1 >> 4 & .$rm 3 >> + 48 + ,1 }
\generate MODrm byte and offset in [r] [r+imm8] [r+imm32] cases\
{ @MODrm, .$imm .@MODrm_0, if0 `#1? .@MODrm_4, if!0 40 :$mod `MODrm, .$imm ,1 }
{ @MODrm_0, 0 :$mod `MODrm, drp; }
{ @MODrm_4, 80 :$mod `MODrm, .$imm ,4 drp; }
\write out MODrm byte\
{ MODrm, .$r 7 & 8 * .$rm 7 & + .$mod + ,1 }
\check if immediate is 8-bit\
{ #1? .$imm dup s1>2 sub }
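
As a sanity check on the encoding, here is a small C sketch (not part of the front-end) mirroring what REX8, and MODrm, emit for the earlier "mov rax,[rcx+16]" example; remember the front-end numbers are hex, so its 48 is 0x48 and its 10 is 0x10:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int r = 0, rm = 1, imm = 0x10, op = 0x8b; /* dst=RAX, address=RCX */
    uint8_t rex   = 0x48 + ((r >> 1) & 4) + (rm >> 3); /* mirrors REX8, */
    uint8_t modrm = 0x40 + ((r & 7) * 8) + (rm & 7);   /* mirrors MODrm, with mod=40 (disp8) */
    printf("%02x %02x %02x %02x\n", rex, op, modrm, imm); /* prints: 48 8b 41 10 */
    return 0; }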


Typically the selection of builtin ops for the front-end would match whatever the programmer wants, making it easy to write code. The builtin ops in this case work as follows,

2 ,1 \append 2 as one byte to the memory block of the executable\
2 ,4 \append 2 as one 32-bit word to the memory block of the executable\
.a .b >> \pop a and b from stack as 64-bit integers, push a>>b (shift right)\
.a .b & \... a&b (and)\
.a .b + \... a+b (add)\
{ bob \some defined symbol\ }
.a .bob if0 \if a==0 then call bob\
.a .bob if!0 \if a!=0 then call bob\
drp; \drop top value of return stack, later returns to caller's caller\
s1>2 \sign extend the top of data stack from byte to 2 byte\


The process of writing code for this kind of system becomes more of an exercise of generating words (or symbols) in the language, so that a programmer can solve a problem by writing sentences and paragraphs of code. For instance here is an example implementation of a simple data stack,

0 :stk \stack top pointer in RAX\
{ Stk+ .stk 8 `#ADD8, }
{ Stk- .stk 8- `#ADD8, }
{ @Stk0 \reg\ .stk 0 `@MOV8, }
{ @Stk1 \reg\ .stk 8- `@MOV8, }
{ !Stk \reg\ .stk 0 `!MOV8, }


And a simple program to implement something which takes two 64-bit values off the stack, adds them, and pushes the result (all using the above interface),

1 :stk \later decided data stack pointer is in RCX\
.tempA `@Stk0 .tempB `@Stk1 .tempA .tempB `ADD8, `Stk- .tempA `!Stk


This is only starting to scratch the surface of how such a front-end can be used to generate code...

20140330

GCN and Wavefront Occupancy

References: GCN Architecture Whitepaper | Sea Islands ISA

Review
A Compute Unit (CU) is partitioned into four separate SIMD units.
Each SIMD unit has the capacity of 1 to 10 wavefronts.
Once launched, wavefronts do not migrate across SIMD units.
CU can decode and issue 5 instructions/clk for one SIMD unit.
It takes 4 clocks to issue across all four SIMD units.
The 5 instructions need to come from different wavefronts.
The 5 instructions need to be of different types.
Each SIMD unit has 512 scalar registers.
Each SIMD unit has 256 vector registers.
Each CU has 64KB of LDS space.

Occupancy Table
Waves = Wavefronts per SIMD unit (4 SIMD units/CU)
Scalar = Maximum number of scalar registers/wavefront
Vector = Maximum number of 64-wide vector registers/wavefront
LdsW/I = Maximum amount of LDS space per vector lane per wavefront in 32-bit words
Issue = Maximum number of instructions which can issue per clock

Waves Scalar Vector LdsW/I Issue
1____ 128___ 256___ 64____ 1
2____ 128___ 128___ 32____ 2
3____ 128___ 85____ 21____ 3
4____ 128___ 64____ 16____ 4
5____ 102___ 51____ 12____ 5
6____ 85____ 42____ 10____ 5
7____ 73____ 36____ 9_____ 5
8____ 64____ 32____ 8_____ 5
9____ 56____ 28____ 7_____ 5
10___ 51____ 25____ 6_____ 5
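
The table follows directly from the limits listed in the review; here is a quick C sketch which reproduces it (the 128 scalar cap is the per-wavefront allocation limit the table assumes):

#include <stdio.h>

int main(void) {
    printf("Waves Scalar Vector LdsW/I Issue\n");
    for (int waves = 1; waves <= 10; waves++) {
        int scalar = 512 / waves;                        /* 512 scalar regs per SIMD */
        if (scalar > 128) scalar = 128;                  /* per-wavefront cap */
        int vector = 256 / waves;                        /* 256 vector regs per SIMD */
        int ldsWI  = (64 * 1024) / (4 * waves * 64 * 4); /* 64KB LDS, 4 SIMDs, 64 lanes, 4-byte words */
        int issue  = waves < 5 ? waves : 5;              /* issue caps at 5/clk */
        printf("%-5d %-6d %-6d %-6d %d\n", waves, scalar, vector, ldsWI, issue); }
    return 0; }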


Notes
Shaders can easily get issue limited when the number of wavefronts becomes small. Without at least 3 wavefronts per SIMD unit, the device cannot tri-issue a {vector, scalar, and memory} group of operations. This will be further limited by the number of wavefronts which are blocked during execution (say because they are waiting on memory). Abusing instruction level parallelism and increasing register pressure at the expense of occupancy can result in low ALU utilization.

20140328

Related to Filtering for VR

Ideas for high quality VR warp and reconstruction filtering for higher-end GPUs:

(1.) Instead of applying a resolve or filter before the warp, apply during the warp.

(2.) Starting with a ray-tracing example with temporally changing stratified sampling. For all samples before the warp, use a pre-pass to compute then store out the screen position for the sample after the warp. Then when filtering for the warp, fetch this position per sample in the reconstruction filter and use it to compute the weighting for the filter kernel. Weight is a function of the distance from the post-warp sample position to the center of the back-buffer pixel. Note one can also rotate and scale the offset before computing the distance (and weight) to reshape a circle kernel into an ellipse aligned to the gradient of curvature of the warp (see the weight sketch after this list). The core optimization here is to avoid doing the warp computation many times per output pixel.

(3.) Post warp screen position output needs sub-pixel precision for quality filtering. If using FP16 to store position, make sure to use a signed pixel offset with {0,0} at the center of the eye. This doubles precision, and places the high precision area in the center of the eye's view where it is needed the most. Remember FP16 represents -2048 to +2048 exactly, but -4096 to +4096 with a gap of 2. So process left and right eyes separately when filtering. This way the worst precision, which occurs half-way out from the center of each view, is in steps of 960/4096 pixels for the DK2.

(4.) The reconstruction filter can do the inverse warp computation once to get a location in the source framebuffer, then use fixed immediate offsets from that position to do fast sampling of a large fixed pattern. Must use more than four samples; 12 samples is a good starting point.

(5.) For raster based engines with fixed sample positions, warp and kernel weights can be pre-computed. Per back-buffer pixel, compute the source framebuffer pattern center offset. Then store out weights for all the samples in the pattern. These weights will vary per pixel, and can even be compressed. Compute the maximum weight per sample in the fixed pattern for all pixels in the back buffer. Use this per sample maximum as a constant to scale the sample weight before packing.

(6.) The Oculus DK1 can support higher resolution in X by computing filtering weights separately for {red,green,blue}. Not sure what to do yet with the display of the DK2.

(7.) Fixing for chromatic aberration requires computing separate filter weights for {red,green,blue}. However towards the center of the eye, the sample pattern can remain constant. Can use different shader source based on distance of a screen tile from the eye. Tiles at the ends of the screen for instance need different sample patterns.

(8.) If ray tracing (or ray marching) with the Oculus (I ray march at home), use temporally stratified sampling which is evenly distributed in the warped space. Do not just shoot rays in a regular grid.

(9.) If forward rendering, especially if on NVIDIA hardware, instead of rendering higher resolution, render with MSAA and some amount of super-sampling on. In OpenGL this can be done using glMinSampleShadingARB() from the ARB_sample_shading extension (see the sketch after this list). This can enable for instance 4xMSAA shading at 2 samples per pixel, or 8xMSAA shading at 2 or 4 samples per pixel. The end result of this is a better sample pattern distribution for the resolve filter. Also it enables a reduction in shading rate compared to regular grid super-sampling.

(10.) It is possible to up-sample during the reconstruction filter on devices which cannot do super-sampled shading at good frame rates. In this case it is still a good idea to use the MSAA trick for a better base sampling pattern.

(11.) Temporal filtering is very hard and beyond the scope of this blog post. It is possible to use temporal filtering for either jittered AA or noise reduction with stratified sampling and a ray traced (or marched) engine.

(12.) Crazy optimization which I haven't tried yet. Could work with a pair (or more) of output pixels during the reconstruction filtering pass. Have the pair share the same slightly larger fixed pattern of input samples. This better amortizes the cost of the inverse warp computation, and the cost of fetching all the samples. Might increase the number of samples/pixel which can be done in the same amount of time. Output of the pixel shader in this case would be surface stores instead of ROP.

(13.) Continuing that line of thought, in the pre-computed weights situation, one can reuse the weights for both the left and right eye. In this case, a pixel's pair is the mirror across the center of the back-buffer. Surface stores in this case would be coherent...
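
A sketch of the kernel weighting from (2.) and (3.); all names are illustrative and the Gaussian falloff is just one reasonable radial kernel:

#include <math.h>

/* postWarp = pre-computed post-warp position of an input sample,
   pixel = center of the back-buffer pixel being reconstructed,
   rot/scale reshape the circular kernel into an ellipse aligned to the
   warp's curvature, sharp controls the falloff */
static float KernelWeight(float postWarpX, float postWarpY,
                          float pixelX, float pixelY,
                          float rotCos, float rotSin,
                          float scaleX, float scaleY, float sharp) {
    float dx = postWarpX - pixelX;
    float dy = postWarpY - pixelY;
    float ex = (dx * rotCos - dy * rotSin) * scaleX; /* rotate then scale: */
    float ey = (dx * rotSin + dy * rotCos) * scaleY; /* circle -> ellipse */
    return expf(-sharp * (ex * ex + ey * ey)); }

And the GL state for the super-sampled shading in (9.), assuming a context and loader with ARB_sample_shading (core since GL 4.0):

/* fraction = shaded samples / MSAA samples, e.g. 0.5f shades 2 of 4xMSAA */
static void EnableSuperSampledShading(float fraction) {
    glEnable(GL_SAMPLE_SHADING_ARB);
    glMinSampleShadingARB(fraction); }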

Brigade 3.0 Preview on Youtube