Some Typical Results.
=====================

All runs were conducted on a 400MHz AMD K6-2, 100MHz motherboard, 128Mb
PC100 RAM. All figures are entire application run times. The actual
overhead of memory administration is just one part of the total run time,
which makes the observed impact of the memory management implementation
even more striking. Measured memory usage is total application usage,
including the executable image.


Benchmark 1: Random subtree operations.
---------------------------------------

This is a synthetic benchmark designed to test the garbage collector.
The program builds a binary tree. It then repeatedly seeks out a random
subtree and replaces it with a new one. Every interior node in the tree
has the same size, while the leaves are of random size. The average node
size is 50 bytes, unless otherwise stated. Occasionally the program
allocates 256Kb just to be mean, and occasionally it replaces the entire
tree with a new one. During replacement the new and the old tree coexist
in memory, roughly doubling the storage requirements. This kind of
behaviour is seldom seen in real programs, so the picture presented by
this application is a very pessimistic one. A sketch of the core loop is
given after the notes below.

The program is used in four variants with SGC in the benchmarks below:

1) As a new/delete provider. The application handles deletes by itself;
   no SGC specific interfaces are imported. SGC is simply included in
   the link phase to replace new/delete.

2) As a separate garbage collector. SGC is included in the link phase
   and all calls to delete are removed from the program. No SGC specific
   interfaces are imported.

3) With leaf optimization enabled. SGC is informed when a leaf (a
   pointer-free object) is allocated. Usually 60-85% of the heap is used
   for the leaves.

4) With leaf+inline optimization. Small objects with compiler determined
   sizes are allocated using special optimized variants of new.

The amount of live memory in the heap changes dramatically during the
run of this benchmark, and the heap often becomes highly fragmented.

(*)  The runs marked (*) are not representative. Due to blacklisting
     (see sgc.h) the heap is expanded to accommodate the 256Kb
     allocations. Without the 256Kb allocations, the heap would be
     significantly smaller, and the programs perhaps a teeny weeny bit
     slower.

(**) For the Tiny configuration, the allocation of 256Kb objects was
     disabled and the parameters were set to make the entire used memory
     fit in the 1Mb L2 cache, especially when run with traditional
     new/delete. The figures obtained confirm that SGC as a new/delete
     provider is sensitive to caching.

It is not known whether the 256Kb allocations are actually visible in
the numbers given for traditional new/delete: they are not alive for
very long, and they may have been allocated and released between updates
of the 'top' status window used to collect the figures.
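In its new/delete form, the core of the benchmark behaves roughly like
the sketch below. The sketch was written for this note and is not the
actual benchmark source; tree depths, probabilities and payload sizes
are illustrative only.

    #include <cstdlib>

    struct Node {                      // interior nodes: fixed size, two children
        Node* left;
        Node* right;
        char* payload;                 // leaves only: a block of random size
    };

    static Node* build_tree(int depth)
    {
        Node* n = new Node;
        if (depth > 0) {
            n->left    = build_tree(depth - 1);
            n->right   = build_tree(depth - 1);
            n->payload = 0;
        } else {
            n->left = n->right = 0;
            n->payload = new char[1 + std::rand() % 100];  // random sized leaf data
        }
        return n;
    }

    static void destroy_tree(Node* n)  // explicit frees: new/delete variant only
    {
        if (!n) return;
        destroy_tree(n->left);
        destroy_tree(n->right);
        delete [] n->payload;
        delete n;
    }

    int main()
    {
        Node* root = build_tree(14);
        for (long i = 0; i < 1000000; ++i) {
            Node* p = root;                             // walk to a random node
            while (p->left && (std::rand() & 1))
                p = (std::rand() & 1) ? p->left : p->right;
            if (p->left) {
                destroy_tree(p->left);                  // omitted in the GC variants
                p->left = build_tree(std::rand() % 5);  // replace one subtree
            }
            if (std::rand() % 1000 == 0)
                delete [] new char[256 * 1024];         // the occasional 256Kb allocation
            if (std::rand() % 100000 == 0) {            // occasionally replace the whole tree;
                Node* fresh = build_tree(14);           // old and new briefly coexist
                destroy_tree(root);                     // omitted in the GC variants
                root = fresh;
            }
        }
        destroy_tree(root);
        return 0;
    }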
Size of executable image: About 0.5 Mb

Tiny Configuration (**)                 Running Time      Memory Used (max)
^^^^^^^^^^^^^^^^^^^^^^^
Built in GNU new/delete:                241 s   1          844Kb   1
SGC as new/delete provider:             154 s   0.64      1080Kb   1.28
non-optimized GC                        184 s   0.76      1708Kb   2.02
GC with leaf optimization               168 s   0.70      1456Kb   1.73
GC with leaf+inline optimization        160 s   0.66      1440Kb   1.71

Small Configuration                     Running Time      Memory Used (max)
^^^^^^^^^^^^^^^^^^^
Built in GNU new/delete:                790 s   1         1648Kb   1
SGC as new/delete provider:             638 s   0.81      2736Kb   1.66
non-optimized GC                        631 s   0.80      6436Kb   3.91  (*)
GC with leaf optimization               561 s   0.71      7476Kb   4.54  (*)
GC with leaf+inline optimization        535 s   0.68      7472Kb   4.54  (*)

Medium Configuration                    Running Time      Memory Used (max)
^^^^^^^^^^^^^^^^^^^^
Built in GNU new/delete:                560 s   1          4.6Mb   1
SGC as new/delete provider:             553 s   0.99       7.8Mb   1.70
non-optimized GC                        445 s   0.79        14Mb   3.04
GC with leaf optimization               398 s   0.71      12.6Mb   2.73
GC with leaf+inline optimization        387 s   0.69      12.6Mb   2.73

Huge Configuration                      Running Time      Memory Used (max)
^^^^^^^^^^^^^^^^^^
Built in GNU new/delete:                684 s   1           31Mb   1
SGC as new/delete provider:             662 s   0.97        49Mb   1.58
non-optimized GC                        558 s   0.82        91Mb   2.94
GC with leaf optimization               498 s   0.73        69Mb   2.23
GC with leaf+inline optimization:       483 s   0.71        67Mb   2.16

Using smaller objects (1-32b leaves, average object size: 19 bytes)
^^^^^^^^^^^^^^^^^^^^^                   Running Time      Memory Used (max)
Built in GNU new/delete:                230 s   1          7.4Mb   1
SGC as new/delete provider:             181 s   0.79       9.0Mb   1.22
non-optimized GC                        147 s   0.64        16Mb   2.16
GC with leaf optimization               141 s   0.61      13.9Mb   1.88
GC with leaf+inline optimization:       136 s   0.59      14.4Mb   1.95

Using larger objects (1-512b leaves, average object size: 195 bytes)
^^^^^^^^^^^^^^^^^^^^                    Running Time      Memory Used (max)
Built in GNU new/delete:                165 s   1         3948Kb   1
SGC as new/delete provider:             176 s   1.07      7240Kb   1.83
non-optimized GC                        237 s   1.44      13.4Mb   3.39
GC with leaf optimization               158 s   0.96      11.8Mb   2.99
GC with leaf+inline optimization:       158 s   0.96      11.5Mb   2.91

Using very large objects (1-2048b leaves, average object size: 695 bytes)
^^^^^^^^^^^^^^^^^^^^^^^^                Running Time      Memory Used (max)
Built in GNU new/delete:                374 s   1         10.7Mb   1
SGC as new/delete provider:             541 s   1.45        21Mb   1.96
non-optimized GC                        709 s   1.90        41Mb   3.83
GC with leaf optimization               415 s   1.11        28Mb   2.62
GC with leaf+inline optimization:       416 s   1.11        28Mb   2.62


Benchmark 2: Random GC-friendly subtree operations
--------------------------------------------------

This benchmark is a simplified variant of benchmark 1 in which all
objects are of the same size. It can also be run in an "optimal" variant
that uses a class specific free-stack to administer memory (sketched
after the tables below). This is the fastest possible administration
scheme when the order of new/delete is arbitrary, and thus gives a lower
bound on the execution time.

(avg size of subtree replaced: 1-2 levels)

Config                                  Speed             Memory
builtin                                 108.38  1           16Mb  1
optimal                                  48.56  0.45        16Mb  1.00
sgc as new/delete provider               70.38  0.65        17Mb  1.06
non-optimized GC                         61.11  0.56        31Mb  1.94
gc with inline optimization              57.42  0.53        31Mb  1.94

(avg size of subtree replaced: ~4 levels)

Config                                  Speed             Memory
builtin                                 1140    1           16Mb  1
optimal                                  250    0.22        16Mb  1.00
sgc as new/delete provider               425    0.37        17Mb  1.06
non-optimized GC                         442    0.39        35Mb  2.19
gc with inline optimization              389    0.34        35Mb  2.19
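For reference, the class specific free-stack behind the "optimal" variant
is the classic per class free list: operator delete pushes the dead node
onto a stack and operator new pops it straight back off. The sketch below
was written for this note and is not the benchmark's actual source.

    #include <cstddef>
    #include <new>

    struct FixedNode {                 // all objects in this benchmark have this size
        FixedNode* left;
        FixedNode* right;

        static void* operator new(std::size_t size)
        {
            if (free_stack) {                   // reuse the most recently freed node
                FixedNode* n = free_stack;
                free_stack = n->left;
                return n;
            }
            return ::operator new(size);        // otherwise take a fresh block
        }

        static void operator delete(void* p)
        {
            FixedNode* n = static_cast<FixedNode*>(p);
            n->left = free_stack;               // the dead node itself is the stack link
            free_stack = n;
        }

        static FixedNode* free_stack;
    };

    FixedNode* FixedNode::free_stack = 0;

With all objects of the same size, allocation and deallocation are each
only a few instructions, with no size lookup, searching or coalescing,
which is why this scheme serves as a practical lower bound.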
Benchmark 3: Digital Logic Simulation.
--------------------------------------

This benchmark uses the SimSys programming framework to simulate a 5
stage pipeline and the memory hierarchy of a generic RISC microprocessor
doing a quicksort of 512 integers. The program is a real application, not
a synthetic benchmark. It is designed with traditional new/delete in
mind, and only trivial changes were made to run it with SGC. A single 1Mb
object representing the simulated memory is allocated from the standard
heap instead of the garbage collected one.

Most of the memory used by the application is either small link objects
from general container classes, or variable sized blocks representing
data packets flowing through the simulated components. The average object
size is 16 bytes. Following a brief initialisation phase, the amount of
live memory remains almost constant throughout the entire run, and
fragmentation is very low. Leaf+inline optimization is used to speed up
allocation of the link objects; leaf optimization only is used for the
data packets.

During the 60 second run shown below, this application consumes 340Mb of
memory in 22 million objects. Garbage collection occurs several hundred
times during the benchmark, costing a total of 8-9% of the CPU time.
During a typical collection 10,000 objects are alive, while 23,000
objects are found to be dead, all within at most 20 milliseconds. The
average tracing overhead is about 250 ns per object (or roughly 16 ns
per byte).

Configuration                           Running Time      Memory Used (max)
Built in GNU new/delete:                 68 s   1          2708Kb   1
GC with leaf+inline optimization:        60 s   0.88       3428Kb   1.27

Adjusted for the 1Mb object on the standard heap:
Built in GNU new/delete:                                   1684Kb   1
GC with leaf+inline optimization:                          2404Kb   1.43

Further adjusted for the 1376Kb executable image:
Built in GNU new/delete:                                    308Kb   1
GC with leaf+inline optimization:                          1028Kb   3.34


Conclusions
===========

The benchmarks generally confirm the cost model given in "sgc.h". The
cost of garbage collected memory can be divided into a per object cost
and a per word cost. This differs from traditional new/delete, which
carries only a per object cost. The net result is that for larger
objects, where the per word cost dominates, garbage collected memory
becomes significantly more expensive than non-garbage collected memory.
For small objects, the per object cost dominates, and the figures show
that the per object cost is significantly lower with SGC than with GNU
new/delete.

The per word cost is actually dominated by the initialisation cost, not
the tracing overhead. Objects allocated from the garbage collected heap
MUST be zeroed to eliminate stale pointers. If not, old pointers will be
revived, causing massive overuse of memory (see the sketch at the end of
this section). The cost of zeroing a memory block can be lowered
significantly if the underlying architecture supports either

1) User mode cache block zero instructions
2) User mode ST with lazy block load

but unfortunately the IA-32 architecture offers no such support.

It is a design goal that SGC's memory requirements should be bounded by a
factor of 3 when compared to GNU new/delete. As seen above, this goal is
often not met. Further work is required in this area.
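To make the zeroing requirement concrete: a conservative collector treats
every word of the heap it traces as a potential pointer, so leftover
words from a block's previous occupant can keep dead objects, and
everything reachable from them, alive. The sketch below illustrates the
idea; it is not SGC's actual allocation path, and the function name is
made up for the example.

    #include <cstring>
    #include <cstdlib>

    // Hypothetical allocation path of a conservative collector.
    void* gc_allocate(std::size_t n)
    {
        void* p = std::malloc(n);   // stands in for carving a block out of the GC heap
        // Without this memset the block still contains whatever words its
        // previous occupant left behind. A conservative trace treats every
        // word as a possible pointer, so a stale word holding the address
        // of a dead object would "revive" that object and everything
        // reachable from it -- the massive overuse of memory mentioned
        // above. Zeroing is therefore mandatory, and this initialisation
        // dominates the per word cost.
        std::memset(p, 0, n);
        return p;
    }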