Faulty Page Fault Handler (By gammal Last mod: Aug 5 2007 )
I'm still puzzled by the “hw tlb loading disabled w/o sw loading available” warning displayed by the simulator. I verified that software TLB management is enabled, and that the correct interrupt handler is invoked on a page fault. Having run out of ideas, I moved on to the next issue. The page fault handler (which eventually causes mmuput() to be called) behaves correctly the first time it's called, that is, a vacant TLB entry is used to map the page, but when the next fault occurs, mmuput() panics as it finds that the virtual page it's trying to map already exists in the TLB, which means one of two things: either the checking used in mmuput() itself is buggy, or that there's inconsistency between the hardware TLB & the STLB cached by the kernel, which is unlikely. My biggest problem now is that I can't even use print() for debugging. mmuput() is extremely sensitive to any modification I make (may be because it's running in an interrupt -- there might be some restrictions that I'm not aware of). For example, this is what I get when I modify the condition used in checking if an entry is duplicated:
Cell Memory Management (By gammal Last mod: Jul 29 2007 )
Several bugs have been identified and fixed in the MMU code. Although the first user process still dies prematurely. I hope that we're close to finding the remaining issues. The simulator output has been very useful, although it's really a tedious task having to single-step through long runs of instructions to find where things go wrong.
Most changes have been done in l.s, an assembly file containing processor initialization code & other essential functions, and mmu.c, which has the MMU code specific to the Cell BE. This is a brief summary of the issues that were solved so far:
- As an enhancement, l.s now uses the TLBIA instruction to invalidate the whole TLB, instead of looping through all entries. This is both smaller & faster.
- The simulator was producing warnings about bad values used in specifying the TLB way. I found that a bit-mask used to specify the way had an invalid value.
- We're software-managing the TLB, and I found that in certain places, inconsistency is created between the hardware TLB and the cached Software TLB.
- kernel pages weren't mapped correctly in the TLB whenswitching between processes.
Being working at a very early stage in the kernel, fixing all of these issues hasn't shown any perceivable improvement yet (other than the internal state of the processor being more like what we expect), simply because any little remaining issue is capable of screwing everything up!
By enabling Memory XLATE_FAULTS debug control in the simulator, I get a message saying “hw tlb loading disabled w/o sw loading available” right after switching to the user process. This is interesting because software management of TLB is enabled in the very beginning, long before the the user process is created, and I don't see that message during that. Anyway, at least I have a clue about where to look next!