Problem Statement

-----------------



QEMU uses dynamic translation (vs.  acting as an interpreter) to achieve

reasonably fast simulation of guest code.  In order to avoid a machine-code

translator for each (guest, host) pair, QEMU breaks each guest instruction

into one or more C functions, called ``micro-operations.'' QEMU compiles

these with the host's C compiler and stitches them together during

translation.  Micro-ops range from ``instruction building blocks'' like

condition code update routines and memory access primitives to

implementations of entire guest instructions.  In broad overview, for each

basic block of guest code, QEMU translates it into a string of micro-ops and

then the byte streams of the micro-ops are concatenated to produce a

translation of that basic block.  This translation is then cached so that

subsequent use of this block by guest code will (hopefully) not need

re-translation.



Because each micro-op will be used in many translations, it will exist in

many places in memory.  Every time the micro-op is copied, the translator

must rewrite it to relocate references to external code and variables.

There are essentially three cases:



  1.  References to global data, such as guest CPU register state.

  2.  References to external code, such as the MMU implementation.

  3.  Abusive relocation to simulate immediate values.



The first two are relatively straight forward uses of relocation.  On CISC

machines like X86, code and data references tend to be relative to the

program counter; on RISC machines accesses to global data tend to be

relative to a register reserved to point into the loaded data section.

QEMU's dynamic translator relies on access to relocation records enumerating

all external references made by each micro-op.



QEMU's use of relocation-based rewriting also contains an abusive side.

Whenever possible, guest code immediate values (viz.  $0x3 in a PPC

instruction such as "MOV $0x3, R2") are placed directly into the

translation buffer, as host code immediate values.  This is accomplished by

having micro-ops load the address of fictitious external symbols such as

"__op_param_1", as in



  long temp = (long) &__op_param_1;



When compiled, the host code will have a word-sized ``hole'' where, under

normal circumstances, the linker would place the address of __op_param_1,

acting on relocation records placed in the file's meta-data.  Instead,

QEMU's translator will place the value 3 there, basing this special

treatment on the name of the external symbol referenced.  If we are, for

example, running on an X86 machine, then the compiler will turn the above C

into ``MOV $__op_param_1, %eax'' which will be assembled into a .o with an

external reference to __op_param_1.  While translating, then, QEMU will

temporarily pretend that the address of __op_param_1 is 3, yielding ``MOV

$0x3, %eax.'' Later, when translating ``MOV $0xCAFE, %eax,'' QEMU could

pretend that the address of __op_param_1 is 0xCAFE.



In summary, every time QEMU copies a micro-op's code into a translation

buffer, every external reference made by the micro-op is relocated.  Many

such relocations are done ``naturally'' (i.e., a reference to a helper

function is simply updated to point at the helper function's code), but

some relocations are ``abusive,'' meaning that QEMU overrides the normal

relocation process to insert values which need not be the addresses of any

particular object.



Dyngen

------



QEMU uses a build-time tool, called dyngen, to statically analyze the

compiled micro-ops.  For each micro-op, dyngen emits a small bit of C code

which knows how to copy the micro-op onto the end of the translation buffer,

patch ``normal'' relocations, and fill in ``abusive'' references.  Each such

chunk of code is a case arm in an enormous switch(micro-op) statement.



On all existing platforms, using GCC/gas/GNU ld, the build process compiles

op.c to op.o.  Dyngen analyzes the relocation records in op.o to produce the

C code bit and some tables used by the runtime translator.  Then dyngen's

output is compiled and the result is linked together with op.o into the

QEMU executable.



This approach will not work with kencc on Plan 9: op.8 contains no

instructions and does not really even contain relocation data -- in terms of

patch points and rewriting rules -- as symbols are still just names.  Thus

we propose building op.c into a Plan 9 dynamically loaded module, which

contains both instructions and relocation meta-data.  However, as QEMU's use

of relocation is very different from ``regular'' relocation, some changes to

Plan 9's dynamic-load module, libdynld, are required.



What Dyngen Does

----------------



Dyngen internally produces several tables and can turn them into various forms

as requested:



  1.  Dyngen parses the micro-ops library to produce an in-memory list of

  functions therein.  This list is used internally as a guide for all

  subsequent operations.



  2.  Each function is investigated a little more closely, to determine its

  boundaries.  These boundaries are used



     A. To determine size of the function as copied and needed in generated

     code.

     B. To associate relocation records with this function, below.



  3.  The module's relocation meta-data (i.e., its patch-point list) is

  iterated over and each entry is associated with the function it effects.

  This allows dyngen to map a micro-op to its list of patch points.  This

  is used in code generation.



  4.  Each patch point must further be identified with the symbol it

  references.  This is necessary for the abusive constant-folding approach

  outlined above.  Specifically, if the name of the symbol matches a given

  prefix, the value landed via relocation will not be patched with a

  reference to the symbol (which doesn't really exist); instead the value is

  abusively loaded with a translator-specified constant.



  5.  On platforms using ELF, OSX (using MACHO), and Windows (using COFF),

  dyngen deeply understands the loader format: it can interpret relocation

  records and emit patch instructions directly in C. As a concrete example,

  when a micro-op calls raise_exception(), dyngen's emitted code contains a

  line like:



    *(uint32_t *)(gen_code_ptr + 39) = (long)(&raise_exception)

					- (long)(gen_code_ptr + 39) + -4;



  If abusive constant folding is taking place, the relocation instead looks

  like:



      long param1;

      param1 = *opparam_ptr++;

      *(uint32_t *)(gen_code_ptr + 3) = param1

					- (long)(gen_code_ptr + 3) + -4;



  Both of these examples are using EIP-relative loads, but provision exists

  in dyngen for absolute loads; also, arbitrary offsets, as might be

  found when looking inside structures, are understood, as in:



    *(uint32_t *)(gen_code_ptr + 2) = param1

				      - (long)(gen_code_ptr + 2) + 56;





Dyngen For Plan 9

-----------------



Most of the above steps are immediately possible on Plan 9, using libmach

and code from libdynld:



  1.  Libmach's crackhdr and symbol family of functions will allow us to

  iterate fully over the symbol table, providing type data for each.



  2.  For each text symbol, libmach's fnbounds() gives us exactly what we

  want.  We do not care about symbols outside the text region, so all is

  well.



  3.  Because libdynld is designed to insulate its users from the notion of

  a patch point, a Plan 9 dyngen will require either assigning a public

  interface to some of libdynld's internals or copying some of libdynld's

  code into dyngen.  More explicitly, dyngen needs access to the dlm's

  import table and relocation records.  The code in libdynld which decodes

  these is only specific to the dlm a.out format, so its duplication is

  possibly acceptable, but it probably should be exported beside libmach's

  decoding facilities, for use by, e.g.  acid.  For example, a function



    int dyn_import_table(int fd, Fhdr* fh, Dynsym **table);



  which took the file descriptor and cracked header, generated the import

  table, and returned the number of entries would suffice.  For the proposed

  function, the user is responsible for freeing the table.



  4.  Determination of the referenced symbol name from the patch point is

  not currently possible through the API exposed by libdynld.



  For each patch point, dyngen will need to know the symbol name to decide

  which, if any, abusive case is at hand.  Because mapping from a table

  index to a name is trivial, solving the patch point to index problem also

  solves the patch point to name problem.  Therefore, I propose an addition

  to libdynld's API: a new structure to capture state currently only held in

  temporary variables



    struct Dynrel {

      ulong offset; /* Patch point offset in module */

      int mode; /* Patch point mode from relocation record */

      int reloc_index; /* Index into import table, -1 if none.  */

      uvlong reloc_offset; /* Offset from symbol or, if none, from base */

    };



  and a new function



    int dyn_reloc_table(int fd, Fhdr* fh, Dynrel **table);



  which returns the number of relocations stored in the table.  Here again

  the user would be responsible for freeing the table.  [Strictly speaking,

  for the project at hand, reloc_offset is not necessary; it is proposed for

  future users such as acid or a symbolic disassembler that is dlm aware.]



  5.  Once dyngen knows what symbol each patch point refers to, it knows

  whether the relocation is ``normal'' or ``abusive'' and which type of code

  it should emit.  However, the encoding of the value into the instruction

  stream (relative vs.  absolute, base, byte order, etc.) depends on the host

  instruction set and the particular instruction at hand.  While the gcc

  dyngen understands each host's relocation record format and can thus

  generate the right code scraps, in Plan 9 this information is currently

  encapsulated within the per-architecture dynreloc() function.



  The simplest approach for Plan 9's dyngen would be to rely on the existing

  dynreloc() implementation.  Since dynreloc() works using a module import

  table, it will perform the necessary mixture of regular and abusive

  patching for us, if we provide it with an import table which maps regular

  symbols to their addresses and abusive symbols to their desired values.



  It seems straightforward, at build time, to have dyngen emit a C array

  declaration which will reproduce the 8.op import table.  This avoids

  copying code from libdynld to build the import table using QEMU's export

  table.



It is not possible to use libdynld's loader function, dynloadgen(), for the

purpose at hand, as it assumes that an entire module is loaded in one place,

all at once, destructively.  It is not possible to take a loaded dlm and

relocate it elsewhere, even if the dlm's import and relocation tables are

kept in RAM, as the act of relocation overwrites some relocation meta-data

originally placed in the machine code stream.



libdynld API Addition Discussion

--------------------------------



The above API functions rely on libmach's decoding facilities to provide

parsed headers.  An alternative, making them more at home in libdyngen,

would be have them to mirror dynloadgen()'s prototype, as in:



  int dyn_import_table(void *file, long (*read)(void *, void*, long),

		       vlong (*seek)(void *, vlong, int),

		       void (*werr)(char*), Dynsym **table);



  int dyn_reloc_table(void *file, long (*read)(void *, void*, long),

		       vlong (*seek)(void *, vlong, int),

		       void (*werr)(char*), Dynrel **table);



However, it seems the reason for dynloadgen()'s prototype taking function

pointers to do read, seek, and error printing is so that it is usable in the

kernel as well as by user land (where these functions can simply be wrappers

around system calls).  As the kernel seems an unlikely consumer of these

decoding functions, at least relative to some future version of acid, the

former prototypes feel more natural.



As to which library they belong in, at the moment libdynld is not widely

distributed but libmach is.  Thus it would seem less disruptive to place

these functions in libdynld, at least temporarily.



The definition of dyn_reloc_table() makes it require per-architecture

knowledge to fill in the reloc_offset and reloc_index fields.  At present,

the entirety of the per-architecture relocation behavior is encapsulated in

dynreloc(), which does the entirety of the relocation necessary for a given

patch point, including modifying the instruction byte stream.  This makes

dynreloc() an unsuitable API for dyn_reloc_table()'s use.



To encapsulate the required knowledge to fill in the

->reloc_index and ->reloc_offset values in the Dynrel struct, I envision a

per-architecture function, to be called by dyn_reloc_table:



  void crack_dynreloc(uchar *base, Dynrel *reloc);



where reloc is a pointer to a Dynrel which has ->offset and ->mode

specified.  This function will fill in ->reloc_index and ->reloc_offset.

The index is -1 if no symbol is imported (if the relocation is purely an

offset into the dlm).  Essentially, this is dynreloc() without side-effects

(notably, it would leave the patch point unrelocated).



dynloadgen() would continue to call dynreloc(), which may or may not be a

consumer of crack_dynreloc().  On the one hand, dynreloc() is very small and

simple as is, I'd rather not change the implementation of dynreloc(), and

the change to consume Dynrel structures doesn't seem to dramatically reduce

or simplify dynreloc().  On the other hand, keeping dynreloc() and

crack_dynreloc() in sync might be a drag.  I think, unless strong preference

is indicated the other way, that I will go with leaving dynreloc() alone.



To be painfully explicit, loading a dlm will be no different after the above

changes; the extant code paths through dynloadgen() and dynreloc() would be

entirely unaltered by the proposed changes.  The envisioned purpose of the

new API functions is to facilitate inspection of a dlm; possible consumers

beyond dyngen include acid and objdump-like tools.



Current Demo

------------



I have written a brief, proof-of-concept demo version of dyngen running on

Plan 9.  To work around the problems above it copies code and contains its

own x86-specific version of crack_dynreloc.  It has many other deficiencies,

but on the other hand, it was able to take a micro-op



extern ulong __abuseme;

extern void externfun(void *);



void foo() {

  externfun(&__abuseme);

}



and generate a C function to relocate it (including abusive patches).  When

combined with a master program, which defined



void externfun(ulong param)

{

  printf("externfun called: %d\n", param);

}



and which had at its core two invocations of dyngen's code, each using

different abusive values:



uchar *buf31337 = malloc(20);

put_fun(buf31337, 31337);

( (void (*)(void)) buf31337 )();



uchar *buf2BAD = malloc(20);

put_fun(buf2BAD, 0x2BADD00D);

( (void (*)(void)) buf2BAD )();



Here put_fun() is the dyngen generated code, which does both a normal

relocation for externfun() and an abusive relocation for __abuseme.  Each of

the patched buffers managed to call printf() with the right value.  The

tarball is available on Uriel's site at

http://gsoc.cat-v.org/people/nwf/dyngen-demo.tgz . I apologize in advance

for the demo quality of the code.



As an aside, if dyngen emits code which uses dynreloc() (such as the one

discussed and the one of the demo), the translator will be slightly slower

than the GCC-style dyngen which understands each relocation type, as it will

be making function calls per relocation record for each function in a

translation buffer.  However, a sufficiently aggressive cross-function

optimizer should be able to reduce it to being a few memory and math

operations more than code generated from gcc dyngen's traditional output.



Conclusion

----------



I propose that libdynld export the Dynrel struct and the dyn_import_table()

and dyn_reloc_table() functions as described above.  In addition,

per-architecture code currently in dynreloc() would be copied into

crack_dynreloc(), which would be used by dyn_reloc_table().  These changes

would be made to libdynld, leaving libmach unmodified.  This approach would

enable a Plan 9 port of dyngen and would also be helpful for extending acid

to deal with dlm's.



Experience

----------



Experience suggests that the current dlm meta-data is difficult to extract (in

particular, dyn_import_table and dyn_reloc_table both know how to scan the

import table's packed form).  Further, the relocation meta-data is spread

between the relocation table, which specifies a relative offset (from the last

relocation) and the mode of this relocation.  dynreloc() must then go read the

symbol import index (if any) and offset (if any) from the dlm's text and data

sections.  We currently assume that dynreloc() reads only near the offset

given, but we have no correct ground to stand for this assumption.  Moving the

symbol and offset information into the relocation record would grow the dlm in

size slightly but would eliminate the need for crack_dynrel() and would ease

concerns that some future dynreloc() may expect more of the dlm to be present

than QEMU's dynamic translator ensures.



All of these features of dynld seem to stem from the mindset that dynamic

modules are to be loaded once then forgotten.  While the attempt to stear quite

clear of the madness of dynamic libraries on UNIX hosts is laudable, the

current design may make debuggers' (both people and programs) lives harder and

leave lingering doubts in the minds of slightly-off-the-beaten-path users (e.g.

QEMU) by being overly opaque.



Unfortunately, changing these properties requires chaning the loaders, which

are complex.  (It is worth noting that libdynld is itself remarkably short and

concise.) Fortunately, all these complaints are nitpicks and do not currently

prevent us from moving forward.  The above API will translate naturally to a

modified libdynld, should changes ever be made.