Tuning The Forth VM
Tuning The Portable Forth Environment
Loop unrolling in the inner interpreter
The most time-critical piece of code in pfe is the inner interpreter, a tight loop that calls each primitive compiled into a high-level definition. You find it in the file support.c, in the function run_forth().
On some CPUs it saves significant time when the code of the inner interpreter is unrolled several times, so that there is no jump back to the start of the loop after every primitive. On other CPUs it doesn't help or even makes things slower.
For example, the benchmark performance of pfe on a 486 is about 15% better with unrolled NEXT, while the performance on a Pentium becomes slightly worse.
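To make the idea concrete, here is a minimal sketch of a plain and an unrolled inner interpreter. It is not the code of run_forth() in support.c: the names cell, prim_fn, NEXT and run, the tiny demo primitives, and the longjmp-based exit are all assumptions made for illustration.

```c
#include <setjmp.h>

typedef long cell;
typedef void (*prim_fn)(void);      /* a primitive is a C function */

/* Hypothetical VM state, kept in plain globals for this sketch. */
static cell stack[16];
static cell *sp;                    /* parameter stack pointer, grows downward */
static const prim_fn *ip;           /* instruction pointer into threaded code */
static jmp_buf stop;                /* used by bye_ to leave the endless loop */

static void lit2(void)  { *--sp = 2; }
static void dup_(void)  { --sp; sp[0] = sp[1]; }
static void plus_(void) { sp[1] += sp[0]; ++sp; }
static void bye_(void)  { longjmp(stop, 1); }

/* NEXT: fetch the next primitive, advance ip, execute the primitive. */
#define NEXT ((*ip++)())

/* Runs threaded code until bye_ is executed; returns the top of stack. */
cell run(const prim_fn *code, int unrolled)
{
    sp = stack + 16;
    ip = code;
    if (setjmp(stop) == 0) {
        if (unrolled)
            for (;;) { NEXT; NEXT; NEXT; NEXT; }  /* one backward jump per four primitives */
        else
            for (;;) { NEXT; }                    /* one backward jump per primitive */
    }
    return sp[0];
}

/* Demo program: 2 DUP + DUP + */
const prim_fn demo[] = { lit2, dup_, plus_, dup_, plus_, bye_ };
```

Both run(demo, 0) and run(demo, 1) execute the same primitives in the same order; only the number of backward jumps differs. Whether that pays off depends on the CPU, as described above.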
You'll have to try which is better on your machine. To enable the
feature, you need the following compiler option in the Makefile:
Using global register variables
pfe is designed for best portability. This means it can be compiled with a variety of compilers on many systems. Obviously this prevented me from squeezing the last bit of performance out of any special system.
Fortunately there's a way to tune it up significantly with little effort, provided you have GNU-C at hand.
Let me explain: As most of you probably know, a Forth-interpreter
traditionally contains a so-called virtual machine. PFE does. This
virtual machine consists of several virtual registers and a basic set
of operations. The virtual registers are:
In a traditional assembler-based Forth implementation these virtual registers would be mapped to physical registers of the CPU at hand. How efficient such an implementation is depends heavily on how cleverly this mapping is done.
pfe has no other choice than to declare C-language global variables to represent these virtual registers. These variables are accessed very frequently.
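As a rough sketch of what this means in portable C, assume a VM with just a stack pointer and an instruction pointer. The names cell, prim_fn, demo_stack, push_, dupe_ and demo_dup_ok are illustrative assumptions, not pfe's actual declarations.

```c
typedef long cell;
typedef void (*prim_fn)(void);

/* Virtual registers as ordinary C globals: every access is a load
   from or a store to memory, unless the compiler can prove otherwise. */
cell demo_stack[8];
cell *sp = demo_stack + 8;      /* parameter stack pointer, grows downward */
const prim_fn *ip;              /* instruction pointer (unused in this demo) */

/* Push a value onto the parameter stack. */
void push_(cell x) { *--sp = x; }

/* A DUP-like primitive: reads and writes sp several times. */
void dupe_(void)
{
    --sp;
    sp[0] = sp[1];
}

/* Returns 1 if pushing x and duplicating it leaves two copies of x. */
int demo_dup_ok(cell x)
{
    sp = demo_stack + 8;
    push_(x);
    dupe_();
    return sp == demo_stack + 6 && sp[0] == x && sp[1] == x;
}
```

Even a primitive as small as dupe_ touches sp three times, which is why the location of these variables matters so much.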
Now GNU-C allows us to put global variables in registers! Obviously the number of registers in a CPU is limited and the use of registers by library functions and the compiler itself interferes.
In spite of these restrictions it is possible, even on an i386, to find a niche for the two most important virtual registers, resulting in a performance boost of about 50%. (Just one more detail that shows what a great job the GNU-C developers did.)
If your system is one of those known to the config script, then all provisions for using global register variables have already been made. You can enable or disable the use of global register variables by specifying the option '-DPFE_USE_REGS' in the Makefile, or by using the configure option '--with-regs'.
If your system isn't known by the config script, then first make sure you have a stable port according to the instructions in the file `INSTALL'. Then read the next section to enable the usage of register variables on your system. If all works well please send me your changes.
Versions of gcc up to 2.6.0 seem to compile incorrect code in very special situations when global register variables are used. This has been reported and is fixed in later gcc versions.
When you find something not working that worked in previous versions of pfe, then please check if it works again after recompiling pfe without -DPFE_USE_REGS. Please inform me of such cases: "email@example.com (Dirk Zoller)"
The use of gcc register allocation has proven to be a very fine optimization technique for a Forth system implemented in C. However, most versions of gcc have some kind of problem on some platform. The latest 2.8.x version is currently the most stable one, with just one problem for the i960 target due to the peculiar behaviour of the call instruction on that processor.
Later versions of gcc (2.95.x at the time of writing) show some problems of their own. The most common: many platforms use a register-based calling scheme (they have 32 CPU registers or more). Reserving some of the CPU registers does not make the call-frame generator save the Forth global registers; they are simply overwritten and never restored. If you look through the def-regs.h file you'll see that we use register numbers 11 and up, so the only thing you have to watch out for are function calls with ten or more arguments. If you have such a call, save the CPU registers the hard way and restore them after the call.
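Saving "the hard way" can be sketched as follows. This is only an illustration under stated assumptions: sp is an ordinary global standing in for a global register variable, clobbering_callee is a hypothetical function that simulates the overwrite, and the volatile local is one way to make sure the saved copy lives in memory rather than in another register.

```c
typedef long cell;

cell demo_stack[4];
cell *sp;                       /* stand-in for a global register variable */

/* Hypothetical callee that destroys sp, as an argument-passing
   sequence might do to a reserved register. */
void clobbering_callee(void) { sp = (cell *)0; }

/* Save the virtual register in memory, make the call, restore it. */
void call_preserving_sp(void (*fn)(void))
{
    cell *volatile saved_sp = sp;   /* volatile: keep the copy in a stack slot */
    fn();
    sp = saved_sp;
}

/* Returns 1 if sp has its old value after the clobbering call. */
int sp_survives_call(void)
{
    sp = demo_stack + 4;
    call_preserving_sp(clobbering_callee);
    return sp == demo_stack + 4;
}
```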
Another problem is the builtin functions, especially on i386 platforms with later gcc versions. They effectively reserve the special-purpose registers for themselves, so that you rarely get more than three CPU registers for the Forth VM. Sometimes -fno-builtin helps, sometimes the compiler will warn you about register-allocation overlaps, and some versions of gcc won't even do that and go on to generate bad code. All these problems made me disable --with-regs by default, which is clearly a very poor choice, so I advise you to enable it in your default configuration right after running some tests to ensure the compiler you installed works correctly.
The versions of pfe up to 0.30.x also need the gcc register reservation for multithreading: the global thread pointer is put into a register that is saved and restored on a thread switch. Without it, some glue code would be needed that does not currently exist in the sources, so this is another strong hint to configure --with-regs in your projects.
If you have a new processor with yet another register assignment, please mail me the information so that I can include it in pfe-regs.h - mail to firstname.lastname@example.org
Choosing registers to use
When you use global register variables in GNU-C then you have to
explicitly state which machine register to use for the global variable
to declare "register". The syntax is like this:
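The general GNU C form is `register TYPE NAME asm ("REGISTER");` at file scope. A minimal sketch, assuming gcc on x86-64: the choice of r15, the names cell, demo_stack and demo_roundtrip, and the way PFE_USE_REGS is tested here are illustrative assumptions, not pfe's actual declarations.

```c
typedef long cell;

#if defined(PFE_USE_REGS) && defined(__GNUC__) && !defined(__clang__) \
    && defined(__x86_64__)
/* GNU C global register variable: sp lives in r15 throughout this
   translation unit. __asm__ is the spelling of asm that also works
   in strict ISO C modes. */
register cell *sp __asm__("r15");
#else
/* Portable fallback: an ordinary global variable in memory. */
cell *sp;
#endif

cell demo_stack[8];

/* Global register variables cannot have initializers, so the stack
   pointer is set up at run time. */
void reset_stack(void) { sp = demo_stack + 8; }

void push(cell x) { *--sp = x; }
cell pop(void)    { return *sp++; }

/* Pushes 3 and 4, pops both; returns 43 if the stack behaves (4 first). */
cell demo_roundtrip(void)
{
    cell first, second;
    reset_stack();
    push(3);
    push(4);
    first = pop();
    second = pop();
    return first * 10 + second;
}
```

The declaration should appear before any function definition in the file, so that no function is compiled with the register already in use for something else.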
As far as I can see, choosing machine registers for global register variables is just a matter of trial and error.
First find out how registers are named on your machine. Not how the CPU manufacturer names them, but how the assembler used by gcc (as or gas, depending on the configuration of gcc) names them. It's easy: simply use gcc to compile one of the C files with the option -S. I changed the `makefile' so that a simple `make core.s' does this.
Then look at `core.s': You don't have to know much of assembly
language programming and even less of the particular CPU. All you are
interested in is: what are the registers? In `core.s' search for the
label `dupe_' i.e. the compiled function that does the work of the
Forth word `DUP'. The C-source for DUP is:
On an RS/6000 (where you won't have to do this because I did it
already) using gcc you'd find the following assembler lines generated:
Next edit the file `pfe/def-regs.h'. Add a system specific section of
preprocessor definitions naming CPU registers to use for virtual
machine registers like this:
Ok, the full set needed a little more experimentation. Maybe start with only P4_REGSP or P4_REGIP.
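One way to start with a single register is sketched below. The macro names P4_REGSP, P4_REGIP and PFE_USE_REGS come from the text above, but their values (r15, r14 on x86-64 with gcc), the string format, and the declarations that consume them are assumptions; check the existing sections of the header for the real conventions.

```c
typedef long cell;
typedef void (*prim_fn)(void);

/* Hypothetical system-specific section: map only the stack pointer first. */
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__clang__)
#  define P4_REGSP "r15"
/* #define P4_REGIP "r14" */    /* add once P4_REGSP alone works */
#endif

/* Each virtual register falls back to memory if it has no mapping. */
#if defined(PFE_USE_REGS) && defined(P4_REGSP)
register cell *sp __asm__(P4_REGSP);
#else
cell *sp;
#endif

#if defined(PFE_USE_REGS) && defined(P4_REGIP)
register const prim_fn *ip __asm__(P4_REGIP);
#else
const prim_fn *ip;
#endif

cell demo_stack[4];

/* Small check that the stack pointer works with either mapping. */
cell top_after_push(cell x)
{
    sp = demo_stack + 4;
    *--sp = x;
    return sp[0];
}
```

Because every virtual register has a memory fallback, the mapping can grow one register at a time, with a rebuild and a test run after each step.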
After enabling these declarations with the -DPFE_USE_REGS command line
option, another `make core.s' yields the following translation for DUP:
If your CPU has different types of registers for data and for pointers, then pfe needs the pointer registers. (On M68k the Ax registers, not the Dx.)
If you don't have enough free registers in your CPU, then map the virtual registers at the top of the above list first. They are ordered by importance.
Then do a `make new' with option -DPFE_USE_REGS. If you get compiler errors and warnings about `spilled' or `clobbered' registers then change the mapping until it compiles quietly. There's a good chance that it still runs now and if it does it runs significantly faster than before.
(last changed by guidod)