Problem Statement
-----------------

QEMU uses dynamic translation (vs. acting as an interpreter) to achieve
reasonably fast simulation of guest code.  In order to avoid a machine-code
translator for each (guest, host) pair, QEMU breaks each guest instruction
into one or more C functions, called ``micro-operations.''  QEMU compiles
these with the host's C compiler and stitches them together during
translation.  Micro-ops range from ``instruction building blocks'' like
condition code update routines and memory access primitives to
implementations of entire guest instructions.  In broad overview, for each
basic block of guest code, QEMU translates it into a string of micro-ops and
then the byte streams of the micro-ops are concatenated to produce a
translation of that basic block.  This translation is then cached so that
subsequent use of this block by guest code will (hopefully) not need
re-translation.

Because each micro-op will be used in many translations, it will exist in
many places in memory.  Every time the micro-op is copied, the translator
must rewrite it to relocate references to external code and variables.
There are essentially three cases:

  1. References to global data, such as guest CPU register state.
  2. References to external code, such as the MMU implementation.
  3. Abusive relocation to simulate immediate values.

The first two are relatively straight forward uses of relocation.  On CISC
machines like X86, code and data references tend to be relative to the
program counter; on RISC machines accesses to global data tend to be
relative to a register reserved to point into the loaded data section.
QEMU's dynamic translator relies on access to relocation records enumerating
all external references made by each micro-op.

QEMU's use of relocation-based rewriting also contains an abusive side.
Whenever possible, guest code immediate values (viz. $0x3 in a PPC
instruction such as  "MOV $0x3, R2") are placed directly into the
translation buffer, as host code immediate values.  This is accomplished by
having micro-ops load the address of fictitious external symbols such as
"__op_param_1", as in

  long temp = (long) &__op_param_1;

When compiled, the host code will have a word-sized ``hole'' where, under
normal circumstances, the linker would place the address of __op_param_1,
acting on relocation records placed in the file's meta-data.  Instead,
QEMU's translator will place the value 3 there, basing this special
treatment on the name of the external symbol referenced.  If we are, for
example, running on an X86 machine, then the compiler will turn the above C
into ``MOV $__op_param_1, %eax'' which will be assembled into a .o with an
external reference to __op_param_1.  While translating, then, QEMU will
temporarily pretend that the address of __op_param_1 is 3, yielding ``MOV
$0x3, %eax.''  Later, when translating ``MOV $0xCAFE, %eax,'' QEMU could
pretend that the address of __op_param_1 is 0xCAFE.

In summary, every time QEMU copies a micro-op's code into a translation
buffer, every external reference made by the micro-op is relocated.  Many
such relocations are done ``naturally'' (i.e., a reference to a helper
function is simply updated to point at the helper function's code), but
some relocations are ``abusive,'' meaning that QEMU overrides the normal
relocation process to insert values which need not be the addresses of any
particular object.

Dyngen
------

QEMU uses a build-time tool, called dyngen, to statically analyze the
compiled micro-ops.  For each micro-op, dyngen emits a small bit of C code
which knows how to copy the micro-op onto the end of the translation buffer,
patch ``normal'' relocations, and fill in ``abusive'' references.  Each such
chunk of code is a case arm in an enormous switch(micro-op) statement.

On all existing platforms, using GCC/gas/GNU ld, the build process compiles
op.c to op.o.  Dyngen analyzes the relocation records in op.o to produce the
C code bit and some tables used by the runtime translator.  Then dyngen's
output is compiled and the result is linked together with op.o into the
QEMU executable.

This approach will not work with kencc on Plan 9: op.8 contains no
instructions and does not really even contain relocation data -- in terms of
patch points and rewriting rules -- as symbols are still just names.  Thus
we propose building op.c into a Plan 9 dynamically loaded module, which
contains both instructions and relocation meta-data.  However, as QEMU's use
of relocation is very different from ``regular'' relocation, some changes to
Plan 9's dynamic-load module, libdynld, are required.

What Dyngen Does
----------------

Dyngen internally produces several tables and can turn them into various forms
as requested:

  1.  Dyngen parses the micro-ops library to produce an in-memory list of
  functions therein.  This list is used internally as a guide for all
  subsequent operations.

  2.  Each function is investigated a little more closely, to determine its
  boundaries.  These boundaries are used

     A. To determine size of the function as copied and needed in generated
     code.
     B. To associate relocation records with this function, below.

  3.  The module's relocation meta-data (i.e., its patch-point list) is
  iterated over and each entry is associated with the function it effects.
  This allows dyngen to map a micro-op to its list of patch points.  This
  is used in code generation.

  4.  Each patch point must further be identified with the symbol it
  references.  This is necessary for the abusive constant-folding approach
  outlined above.  Specifically, if the name of the symbol matches a given
  prefix, the value landed via relocation will not be patched with a
  reference to the symbol (which doesn't really exist); instead the value is
  abusively loaded with a translator-specified constant.

  5.  On platforms using ELF, OSX (using MACHO), and Windows (using COFF),
  dyngen deeply understands the loader format: it can interpret relocation
  records and emit patch instructions directly in C.  As a concrete example,
  when a micro-op calls raise_exception(), dyngen's emitted code contains a
  line like:

    *(uint32_t *)(gen_code_ptr + 39) = (long)(&raise_exception) 
                                        - (long)(gen_code_ptr + 39) + -4;

  If abusive constant folding is taking place, the relocation instead looks
  like:

      long param1;
      param1 = *opparam_ptr++;
      *(uint32_t *)(gen_code_ptr + 3) = param1 
                                        - (long)(gen_code_ptr + 3) + -4;

  Both of these examples are using EIP-relative loads, but provision exists
  in dyngen for absolute loads; also, arbitrary offsets, as might be
  found when looking inside structures, are understood, as in:

    *(uint32_t *)(gen_code_ptr + 2) = param1
                                      - (long)(gen_code_ptr + 2) + 56;


Dyngen For Plan 9
-----------------

Most of the above steps are immediately possible on Plan 9, using libmach
and code from libdynld:

  1.  Libmach's crackhdr and symbol family of functions will allow us to
  iterate fully over the symbol table, providing type data for each.

  2.  For each text symbol, libmach's fnbounds() gives us exactly what we
  want.  We do not care about symbols outside the text region, so all is
  well.

  3.  Because libdynld is designed to insulate its users from the notion of
  a patch point, a Plan 9 dyngen will require either assigning a public
  interface to some of libdynld's internals or copying some of libdynld's
  code into dyngen.  More explicitly, dyngen needs access to the dlm's
  import table and relocation records.  The code in libdynld which decodes
  these is only specific to the dlm a.out format, so its duplication is
  possibly acceptable, but it probably should be exported beside libmach's
  decoding facilities, for use by, e.g. acid.  For example, a function
  
    int dyn_import_table(int fd, Fhdr* fh, Dynsym **table);

  which took the file descriptor and cracked header, generated the import
  table, and returned the number of entries would suffice.  For the proposed
  function, the user is responsible for freeing the table.

  4.  Determination of the referenced symbol name from the patch point is
  not currently possible through the API exposed by libdynld.

  For each patch point, dyngen will need to know the symbol name to decide
  which, if any, abusive case is at hand.  Because mapping from a table
  index to a name is trivial, solving the patch point to index problem also
  solves the patch point to name problem.  Therefore, I propose an addition
  to libdynld's API: a new structure to capture state currently only held in
  temporary variables

    struct Dynrel {
      ulong offset;         /* Patch point offset in module */
      int mode;             /* Patch point mode from relocation record */
      int reloc_index;      /* Index into import table, -1 if none. */
      uvlong reloc_offset;  /* Offset from symbol or, if none, from base */
    };

  and a new function

    int dyn_reloc_table(int fd, Fhdr* fh, Dynrel **table);

  which returns the number of relocations stored in the table.  Here again
  the user would be responsible for freeing the table.  [Strictly speaking,
  for the project at hand, reloc_offset is not necessary; it is proposed for
  future users such as acid or a symbolic disassembler that is dlm aware.]

  5. Once dyngen knows what symbol each patch point refers to, it knows
  whether the relocation is ``normal'' or ``abusive'' and which type of code
  it should emit.  However, the encoding of the value into the instruction
  stream (relative vs. absolute, base, byte order, etc.) depends on the host
  instruction set and the particular instruction at hand.  While the gcc
  dyngen understands each host's relocation record format and can thus
  generate the right code scraps, in Plan 9 this information is currently
  encapsulated within the per-architecture dynreloc() function.

  The simplest approach for Plan 9's dyngen would be to rely on the existing
  dynreloc() implementation.  Since dynreloc() works using a module import
  table, it will perform the necessary mixture of regular and abusive
  patching for us, if we provide it with an import table which maps regular
  symbols to their addresses and abusive symbols to their desired values.

  It seems straightforward, at build time, to have dyngen emit a C array
  declaration which will reproduce the 8.op import table.  This avoids
  copying code from libdynld to build the import table using QEMU's export
  table.

It is not possible to use libdynld's loader function, dynloadgen(), for the
purpose at hand, as it assumes that an entire module is loaded in one place,
all at once, destructively.  It is not possible to take a loaded dlm and
relocate it elsewhere, even if the dlm's import and relocation tables are
kept in RAM, as the act of relocation overwrites some relocation meta-data
originally placed in the machine code stream.

libdynld API Addition Discussion
--------------------------------

The above API functions rely on libmach's decoding facilities to provide
parsed headers.  An alternative, making them more at home in libdyngen,
would be have them to mirror dynloadgen()'s prototype, as in:

  int dyn_import_table(void *file, long (*read)(void *, void*, long),
                       vlong (*seek)(void *, vlong, int),
                       void (*werr)(char*), Dynsym **table);

  int dyn_reloc_table(void *file, long (*read)(void *, void*, long),
                       vlong (*seek)(void *, vlong, int),
                       void (*werr)(char*), Dynrel **table);

However, it seems the reason for dynloadgen()'s prototype taking function
pointers to do read, seek, and error printing is so that it is usable in the
kernel as well as by user land (where these functions can simply be wrappers
around system calls).  As the kernel seems an unlikely consumer of these
decoding functions, at least relative to some future version of acid, the
former prototypes feel more natural.

As to which library they belong in, at the moment libdynld is not widely
distributed but libmach is.  Thus it would seem less disruptive to place
these functions in libdynld, at least temporarily.

The definition of dyn_reloc_table() makes it require per-architecture
knowledge to fill in the reloc_offset and reloc_index fields.  At present,
the entirety of the per-architecture relocation behavior is encapsulated in
dynreloc(), which does the entirety of the relocation necessary for a given
patch point, including modifying the instruction byte stream.  This makes
dynreloc() an unsuitable API for dyn_reloc_table()'s use.

To encapsulate the required knowledge to fill in the
->reloc_index and ->reloc_offset values in the Dynrel struct, I envision a
per-architecture function, to be called by dyn_reloc_table:

  void crack_dynreloc(uchar *base, Dynrel *reloc);

where reloc is a pointer to a Dynrel which has ->offset and ->mode
specified.  This function will fill in ->reloc_index and ->reloc_offset.
The index is -1 if no symbol is imported (if the relocation is purely an
offset into the dlm).  Essentially, this is dynreloc() without side-effects
(notably, it would leave the patch point unrelocated).

dynloadgen() would continue to call dynreloc(), which may or may not be a
consumer of crack_dynreloc().  On the one hand, dynreloc() is very small and
simple as is, I'd rather not change the implementation of dynreloc(), and
the change to consume Dynrel structures doesn't seem to dramatically reduce
or simplify dynreloc().  On the other hand, keeping dynreloc() and
crack_dynreloc() in sync might be a drag.  I think, unless strong preference
is indicated the other way, that I will go with leaving dynreloc() alone.

To be painfully explicit, loading a dlm will be no different after the above
changes; the extant code paths through dynloadgen() and dynreloc() would be
entirely unaltered by the proposed changes.  The envisioned purpose of the
new API functions is to facilitate inspection of a dlm; possible consumers
beyond dyngen include acid and objdump-like tools.

Current Demo
------------

I have written a brief, proof-of-concept demo version of dyngen running on
Plan 9.  To work around the problems above it copies code and contains its
own x86-specific version of crack_dynreloc.  It has many other deficiencies,
but on the other hand, it was able to take a micro-op

extern ulong __abuseme;
extern void externfun(void *);

void foo() {
  externfun(&__abuseme);
}

and generate a C function to relocate it (including abusive patches).  When
combined with a master program, which defined

void externfun(ulong param)
{
  printf("externfun called: %d\n", param);
}

and which had at its core two invocations of dyngen's code, each using
different abusive values:

uchar *buf31337 = malloc(20);
put_fun(buf31337, 31337);
( (void (*)(void)) buf31337 )();

uchar *buf2BAD = malloc(20);
put_fun(buf2BAD, 0x2BADD00D);
( (void (*)(void)) buf2BAD )();

Here put_fun() is the dyngen generated code, which does both a normal
relocation for externfun() and an abusive relocation for __abuseme.  Each of
the patched buffers managed to call printf() with the right value.  The
tarball is available on Uriel's site at
http://gsoc.cat-v.org/people/nwf/dyngen-demo.tgz .  I apologize in advance
for the demo quality of the code.

As an aside, if dyngen emits code which uses dynreloc() (such as the one
discussed and the one of the demo), the translator will be slightly slower
than the GCC-style dyngen which understands each relocation type, as it will
be making function calls per relocation record for each function in a
translation buffer.  However, a sufficiently aggressive cross-function
optimizer should be able to reduce it to being a few memory and math
operations more than code generated from gcc dyngen's traditional output.

Conclusion
----------

I propose that libdynld export the Dynrel struct and the dyn_import_table()
and dyn_reloc_table() functions as described above.  In addition,
per-architecture code currently in dynreloc() would be copied into
crack_dynreloc(), which would be used by dyn_reloc_table().  These changes
would be made to libdynld, leaving libmach unmodified.  This approach would
enable a Plan 9 port of dyngen and would also be helpful for extending acid
to deal with dlm's.

Experience
----------

Experience suggests that the current dlm meta-data is difficult to extract (in
particular, dyn_import_table and dyn_reloc_table both know how to scan the
import table's packed form).  Further, the relocation meta-data is spread
between the relocation table, which specifies a relative offset (from the last
relocation) and the mode of this relocation.  dynreloc() must then go read the
symbol import index (if any) and offset (if any) from the dlm's text and data
sections.  We currently assume that dynreloc() reads only near the offset
given, but we have no correct ground to stand for this assumption.  Moving the
symbol and offset information into the relocation record would grow the dlm in
size slightly but would eliminate the need for crack_dynrel() and would ease
concerns that some future dynreloc() may expect more of the dlm to be present
than QEMU's dynamic translator ensures.

All of these features of dynld seem to stem from the mindset that dynamic
modules are to be loaded once then forgotten.  While the attempt to stear quite
clear of the madness of dynamic libraries on UNIX hosts is laudable, the
current design may make debuggers' (both people and programs) lives harder and
leave lingering doubts in the minds of slightly-off-the-beaten-path users (e.g.
QEMU) by being overly opaque.

Unfortunately, changing these properties requires chaning the loaders, which
are complex. (It is worth noting that libdynld is itself remarkably short and
concise.)  Fortunately, all these complaints are nitpicks and do not currently
prevent us from moving forward.  The above API will translate naturally to a
modified libdynld, should changes ever be made.