Problem Statement ----------------- QEMU uses dynamic translation (vs. acting as an interpreter) to achieve reasonably fast simulation of guest code. In order to avoid a machine-code translator for each (guest, host) pair, QEMU breaks each guest instruction into one or more C functions, called ``micro-operations.'' QEMU compiles these with the host's C compiler and stitches them together during translation. Micro-ops range from ``instruction building blocks'' like condition code update routines and memory access primitives to implementations of entire guest instructions. In broad overview, for each basic block of guest code, QEMU translates it into a string of micro-ops and then the byte streams of the micro-ops are concatenated to produce a translation of that basic block. This translation is then cached so that subsequent use of this block by guest code will (hopefully) not need re-translation. Because each micro-op will be used in many translations, it will exist in many places in memory. Every time the micro-op is copied, the translator must rewrite it to relocate references to external code and variables. There are essentially three cases: 1. References to global data, such as guest CPU register state. 2. References to external code, such as the MMU implementation. 3. Abusive relocation to simulate immediate values. The first two are relatively straight forward uses of relocation. On CISC machines like X86, code and data references tend to be relative to the program counter; on RISC machines accesses to global data tend to be relative to a register reserved to point into the loaded data section. QEMU's dynamic translator relies on access to relocation records enumerating all external references made by each micro-op. QEMU's use of relocation-based rewriting also contains an abusive side. Whenever possible, guest code immediate values (viz. $0x3 in a PPC instruction such as "MOV $0x3, R2") are placed directly into the translation buffer, as host code immediate values. This is accomplished by having micro-ops load the address of fictitious external symbols such as "__op_param_1", as in long temp = (long) &__op_param_1; When compiled, the host code will have a word-sized ``hole'' where, under normal circumstances, the linker would place the address of __op_param_1, acting on relocation records placed in the file's meta-data. Instead, QEMU's translator will place the value 3 there, basing this special treatment on the name of the external symbol referenced. If we are, for example, running on an X86 machine, then the compiler will turn the above C into ``MOV $__op_param_1, %eax'' which will be assembled into a .o with an external reference to __op_param_1. While translating, then, QEMU will temporarily pretend that the address of __op_param_1 is 3, yielding ``MOV $0x3, %eax.'' Later, when translating ``MOV $0xCAFE, %eax,'' QEMU could pretend that the address of __op_param_1 is 0xCAFE. In summary, every time QEMU copies a micro-op's code into a translation buffer, every external reference made by the micro-op is relocated. Many such relocations are done ``naturally'' (i.e., a reference to a helper function is simply updated to point at the helper function's code), but some relocations are ``abusive,'' meaning that QEMU overrides the normal relocation process to insert values which need not be the addresses of any particular object. Dyngen ------ QEMU uses a build-time tool, called dyngen, to statically analyze the compiled micro-ops. For each micro-op, dyngen emits a small bit of C code which knows how to copy the micro-op onto the end of the translation buffer, patch ``normal'' relocations, and fill in ``abusive'' references. Each such chunk of code is a case arm in an enormous switch(micro-op) statement. On all existing platforms, using GCC/gas/GNU ld, the build process compiles op.c to op.o. Dyngen analyzes the relocation records in op.o to produce the C code bit and some tables used by the runtime translator. Then dyngen's output is compiled and the result is linked together with op.o into the QEMU executable. This approach will not work with kencc on Plan 9: op.8 contains no instructions and does not really even contain relocation data -- in terms of patch points and rewriting rules -- as symbols are still just names. Thus we propose building op.c into a Plan 9 dynamically loaded module, which contains both instructions and relocation meta-data. However, as QEMU's use of relocation is very different from ``regular'' relocation, some changes to Plan 9's dynamic-load module, libdynld, are required. What Dyngen Does ---------------- Dyngen internally produces several tables and can turn them into various forms as requested: 1. Dyngen parses the micro-ops library to produce an in-memory list of functions therein. This list is used internally as a guide for all subsequent operations. 2. Each function is investigated a little more closely, to determine its boundaries. These boundaries are used A. To determine size of the function as copied and needed in generated code. B. To associate relocation records with this function, below. 3. The module's relocation meta-data (i.e., its patch-point list) is iterated over and each entry is associated with the function it effects. This allows dyngen to map a micro-op to its list of patch points. This is used in code generation. 4. Each patch point must further be identified with the symbol it references. This is necessary for the abusive constant-folding approach outlined above. Specifically, if the name of the symbol matches a given prefix, the value landed via relocation will not be patched with a reference to the symbol (which doesn't really exist); instead the value is abusively loaded with a translator-specified constant. 5. On platforms using ELF, OSX (using MACHO), and Windows (using COFF), dyngen deeply understands the loader format: it can interpret relocation records and emit patch instructions directly in C. As a concrete example, when a micro-op calls raise_exception(), dyngen's emitted code contains a line like: *(uint32_t *)(gen_code_ptr + 39) = (long)(&raise_exception) - (long)(gen_code_ptr + 39) + -4; If abusive constant folding is taking place, the relocation instead looks like: long param1; param1 = *opparam_ptr++; *(uint32_t *)(gen_code_ptr + 3) = param1 - (long)(gen_code_ptr + 3) + -4; Both of these examples are using EIP-relative loads, but provision exists in dyngen for absolute loads; also, arbitrary offsets, as might be found when looking inside structures, are understood, as in: *(uint32_t *)(gen_code_ptr + 2) = param1 - (long)(gen_code_ptr + 2) + 56; Dyngen For Plan 9 ----------------- Most of the above steps are immediately possible on Plan 9, using libmach and code from libdynld: 1. Libmach's crackhdr and symbol family of functions will allow us to iterate fully over the symbol table, providing type data for each. 2. For each text symbol, libmach's fnbounds() gives us exactly what we want. We do not care about symbols outside the text region, so all is well. 3. Because libdynld is designed to insulate its users from the notion of a patch point, a Plan 9 dyngen will require either assigning a public interface to some of libdynld's internals or copying some of libdynld's code into dyngen. More explicitly, dyngen needs access to the dlm's import table and relocation records. The code in libdynld which decodes these is only specific to the dlm a.out format, so its duplication is possibly acceptable, but it probably should be exported beside libmach's decoding facilities, for use by, e.g. acid. For example, a function int dyn_import_table(int fd, Fhdr* fh, Dynsym **table); which took the file descriptor and cracked header, generated the import table, and returned the number of entries would suffice. For the proposed function, the user is responsible for freeing the table. 4. Determination of the referenced symbol name from the patch point is not currently possible through the API exposed by libdynld. For each patch point, dyngen will need to know the symbol name to decide which, if any, abusive case is at hand. Because mapping from a table index to a name is trivial, solving the patch point to index problem also solves the patch point to name problem. Therefore, I propose an addition to libdynld's API: a new structure to capture state currently only held in temporary variables struct Dynrel { ulong offset; /* Patch point offset in module */ int mode; /* Patch point mode from relocation record */ int reloc_index; /* Index into import table, -1 if none. */ uvlong reloc_offset; /* Offset from symbol or, if none, from base */ }; and a new function int dyn_reloc_table(int fd, Fhdr* fh, Dynrel **table); which returns the number of relocations stored in the table. Here again the user would be responsible for freeing the table. [Strictly speaking, for the project at hand, reloc_offset is not necessary; it is proposed for future users such as acid or a symbolic disassembler that is dlm aware.] 5. Once dyngen knows what symbol each patch point refers to, it knows whether the relocation is ``normal'' or ``abusive'' and which type of code it should emit. However, the encoding of the value into the instruction stream (relative vs. absolute, base, byte order, etc.) depends on the host instruction set and the particular instruction at hand. While the gcc dyngen understands each host's relocation record format and can thus generate the right code scraps, in Plan 9 this information is currently encapsulated within the per-architecture dynreloc() function. The simplest approach for Plan 9's dyngen would be to rely on the existing dynreloc() implementation. Since dynreloc() works using a module import table, it will perform the necessary mixture of regular and abusive patching for us, if we provide it with an import table which maps regular symbols to their addresses and abusive symbols to their desired values. It seems straightforward, at build time, to have dyngen emit a C array declaration which will reproduce the 8.op import table. This avoids copying code from libdynld to build the import table using QEMU's export table. It is not possible to use libdynld's loader function, dynloadgen(), for the purpose at hand, as it assumes that an entire module is loaded in one place, all at once, destructively. It is not possible to take a loaded dlm and relocate it elsewhere, even if the dlm's import and relocation tables are kept in RAM, as the act of relocation overwrites some relocation meta-data originally placed in the machine code stream. libdynld API Addition Discussion -------------------------------- The above API functions rely on libmach's decoding facilities to provide parsed headers. An alternative, making them more at home in libdyngen, would be have them to mirror dynloadgen()'s prototype, as in: int dyn_import_table(void *file, long (*read)(void *, void*, long), vlong (*seek)(void *, vlong, int), void (*werr)(char*), Dynsym **table); int dyn_reloc_table(void *file, long (*read)(void *, void*, long), vlong (*seek)(void *, vlong, int), void (*werr)(char*), Dynrel **table); However, it seems the reason for dynloadgen()'s prototype taking function pointers to do read, seek, and error printing is so that it is usable in the kernel as well as by user land (where these functions can simply be wrappers around system calls). As the kernel seems an unlikely consumer of these decoding functions, at least relative to some future version of acid, the former prototypes feel more natural. As to which library they belong in, at the moment libdynld is not widely distributed but libmach is. Thus it would seem less disruptive to place these functions in libdynld, at least temporarily. The definition of dyn_reloc_table() makes it require per-architecture knowledge to fill in the reloc_offset and reloc_index fields. At present, the entirety of the per-architecture relocation behavior is encapsulated in dynreloc(), which does the entirety of the relocation necessary for a given patch point, including modifying the instruction byte stream. This makes dynreloc() an unsuitable API for dyn_reloc_table()'s use. To encapsulate the required knowledge to fill in the ->reloc_index and ->reloc_offset values in the Dynrel struct, I envision a per-architecture function, to be called by dyn_reloc_table: void crack_dynreloc(uchar *base, Dynrel *reloc); where reloc is a pointer to a Dynrel which has ->offset and ->mode specified. This function will fill in ->reloc_index and ->reloc_offset. The index is -1 if no symbol is imported (if the relocation is purely an offset into the dlm). Essentially, this is dynreloc() without side-effects (notably, it would leave the patch point unrelocated). dynloadgen() would continue to call dynreloc(), which may or may not be a consumer of crack_dynreloc(). On the one hand, dynreloc() is very small and simple as is, I'd rather not change the implementation of dynreloc(), and the change to consume Dynrel structures doesn't seem to dramatically reduce or simplify dynreloc(). On the other hand, keeping dynreloc() and crack_dynreloc() in sync might be a drag. I think, unless strong preference is indicated the other way, that I will go with leaving dynreloc() alone. To be painfully explicit, loading a dlm will be no different after the above changes; the extant code paths through dynloadgen() and dynreloc() would be entirely unaltered by the proposed changes. The envisioned purpose of the new API functions is to facilitate inspection of a dlm; possible consumers beyond dyngen include acid and objdump-like tools. Current Demo ------------ I have written a brief, proof-of-concept demo version of dyngen running on Plan 9. To work around the problems above it copies code and contains its own x86-specific version of crack_dynreloc. It has many other deficiencies, but on the other hand, it was able to take a micro-op extern ulong __abuseme; extern void externfun(void *); void foo() { externfun(&__abuseme); } and generate a C function to relocate it (including abusive patches). When combined with a master program, which defined void externfun(ulong param) { printf("externfun called: %d\n", param); } and which had at its core two invocations of dyngen's code, each using different abusive values: uchar *buf31337 = malloc(20); put_fun(buf31337, 31337); ( (void (*)(void)) buf31337 )(); uchar *buf2BAD = malloc(20); put_fun(buf2BAD, 0x2BADD00D); ( (void (*)(void)) buf2BAD )(); Here put_fun() is the dyngen generated code, which does both a normal relocation for externfun() and an abusive relocation for __abuseme. Each of the patched buffers managed to call printf() with the right value. The tarball is available on Uriel's site at http://gsoc.cat-v.org/people/nwf/dyngen-demo.tgz . I apologize in advance for the demo quality of the code. As an aside, if dyngen emits code which uses dynreloc() (such as the one discussed and the one of the demo), the translator will be slightly slower than the GCC-style dyngen which understands each relocation type, as it will be making function calls per relocation record for each function in a translation buffer. However, a sufficiently aggressive cross-function optimizer should be able to reduce it to being a few memory and math operations more than code generated from gcc dyngen's traditional output. Conclusion ---------- I propose that libdynld export the Dynrel struct and the dyn_import_table() and dyn_reloc_table() functions as described above. In addition, per-architecture code currently in dynreloc() would be copied into crack_dynreloc(), which would be used by dyn_reloc_table(). These changes would be made to libdynld, leaving libmach unmodified. This approach would enable a Plan 9 port of dyngen and would also be helpful for extending acid to deal with dlm's. Experience ---------- Experience suggests that the current dlm meta-data is difficult to extract (in particular, dyn_import_table and dyn_reloc_table both know how to scan the import table's packed form). Further, the relocation meta-data is spread between the relocation table, which specifies a relative offset (from the last relocation) and the mode of this relocation. dynreloc() must then go read the symbol import index (if any) and offset (if any) from the dlm's text and data sections. We currently assume that dynreloc() reads only near the offset given, but we have no correct ground to stand for this assumption. Moving the symbol and offset information into the relocation record would grow the dlm in size slightly but would eliminate the need for crack_dynrel() and would ease concerns that some future dynreloc() may expect more of the dlm to be present than QEMU's dynamic translator ensures. All of these features of dynld seem to stem from the mindset that dynamic modules are to be loaded once then forgotten. While the attempt to stear quite clear of the madness of dynamic libraries on UNIX hosts is laudable, the current design may make debuggers' (both people and programs) lives harder and leave lingering doubts in the minds of slightly-off-the-beaten-path users (e.g. QEMU) by being overly opaque. Unfortunately, changing these properties requires chaning the loaders, which are complex. (It is worth noting that libdynld is itself remarkably short and concise.) Fortunately, all these complaints are nitpicks and do not currently prevent us from moving forward. The above API will translate naturally to a modified libdynld, should changes ever be made.