Some Typical Results.
=====================

All runs were conducted on a 400MHz AMD K6-2, 100MHz motherboard, 128Mb
PC100 RAM. All figures are entire application run times. The actual
overhead of memory administration is just one part of the total run time,
which makes the observed impact of the memory management implementation
even more striking. Measured memory usage is total application usage,
including the executable image.


Benchmark 1: Random subtree operations.
---------------------------------------

This is a synthetic benchmark designed to test the garbage collector.
The program builds a binary tree. It then repeatedly seeks out a random
subtree and replaces it with a new one. Every interior node in the tree
has the same size, while the leaves are of random size. The average node
size is 50 bytes, unless otherwise stated. Occasionally the program
allocates 256Kb just to be mean, and occasionally it replaces the entire
tree with a new one. During replacement the new and the old tree coexist
in memory, roughly doubling the storage requirements. This kind of
behaviour is seldom seen in real programs, so the picture presented by
this application is a very pessimistic one. A sketch of the core loop is
given after the notes below.

The program is used in four variants with SGC in the benchmarks below:

1) As a new/delete provider. The application handles deletes by itself;
   no SGC specific interfaces are imported. SGC is simply included in
   the link phase to replace new/delete.

2) As a separate garbage collector. SGC is included in the link phase
   and all calls to delete are removed from the program. No SGC specific
   interfaces are imported.

3) With leaf optimization enabled. SGC is informed when a leaf (a
   pointer-free object) is allocated. Usually 60-85% of the heap is used
   for the leaves.

4) With leaf+inline optimization. Small objects with compiler determined
   sizes are allocated using special optimized variants of new.

The amount of live memory in the heap changes dramatically during the
run of this benchmark, and the heap often becomes highly fragmented.

(*)  The runs marked (*) are not representative. Due to blacklisting
     (see sgc.h) the heap is expanded to accommodate the 256Kb
     allocations. Without the 256Kb allocations, the heap would be
     significantly smaller, and the programs perhaps a teeny weeny bit
     slower.

(**) For the Tiny configuration, the allocation of 256Kb objects was
     disabled and the parameters were set to make the entire used memory
     fit in the 1Mb L2 cache, especially when run with traditional
     new/delete. The figures obtained confirm that SGC as a new/delete
     provider is sensitive to caching.

It is not known whether the 256Kb allocations are actually visible in
the numbers given for traditional new/delete: they are not alive for
very long, and they may have been allocated and released between updates
of the 'top' status window used to collect the figures.
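In its new/delete form, the core of the benchmark behaves roughly like
the sketch below. The sketch was written for this note and is not the
actual benchmark source; tree depths, probabilities and payload sizes
are illustrative only.

    #include <cstdlib>

    struct Node {                      // interior nodes: fixed size, two children
        Node* left;
        Node* right;
        char* payload;                 // leaves only: a block of random size
    };

    static Node* build_tree(int depth)
    {
        Node* n = new Node;
        if (depth > 0) {
            n->left    = build_tree(depth - 1);
            n->right   = build_tree(depth - 1);
            n->payload = 0;
        } else {
            n->left = n->right = 0;
            n->payload = new char[1 + std::rand() % 100];  // random sized leaf data
        }
        return n;
    }

    static void destroy_tree(Node* n)  // explicit frees: new/delete variant only
    {
        if (!n) return;
        destroy_tree(n->left);
        destroy_tree(n->right);
        delete [] n->payload;
        delete n;
    }

    int main()
    {
        Node* root = build_tree(14);
        for (long i = 0; i < 1000000; ++i) {
            Node* p = root;                             // walk to a random node
            while (p->left && (std::rand() & 1))
                p = (std::rand() & 1) ? p->left : p->right;
            if (p->left) {
                destroy_tree(p->left);                  // omitted in the GC variants
                p->left = build_tree(std::rand() % 5);  // replace one subtree
            }
            if (std::rand() % 1000 == 0)
                delete [] new char[256 * 1024];         // the occasional 256Kb allocation
            if (std::rand() % 100000 == 0) {            // occasionally replace the whole tree;
                Node* fresh = build_tree(14);           // old and new briefly coexist
                destroy_tree(root);                     // omitted in the GC variants
                root = fresh;
            }
        }
        destroy_tree(root);
        return 0;
    }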
Size of executable image: About 0.5 Mb

Tiny Configuration (**)                 Running Time      Memory Used (max)
^^^^^^^^^^^^^^^^^^^^^^^
Built in GNU new/delete:                241 s   1          844Kb   1
SGC as new/delete provider:             154 s   0.64      1080Kb   1.28
non-optimized GC                        184 s   0.76      1708Kb   2.02
GC with leaf optimization               168 s   0.70      1456Kb   1.73
GC with leaf+inline optimization        160 s   0.66      1440Kb   1.71

Small Configuration                     Running Time      Memory Used (max)
^^^^^^^^^^^^^^^^^^^
Built in GNU new/delete:                790 s   1         1648Kb   1
SGC as new/delete provider:             638 s   0.81      2736Kb   1.66
non-optimized GC                        631 s   0.80      6436Kb   3.91  (*)
GC with leaf optimization               561 s   0.71      7476Kb   4.54  (*)
GC with leaf+inline optimization        535 s   0.68      7472Kb   4.54  (*)

Medium Configuration                    Running Time      Memory Used (max)
^^^^^^^^^^^^^^^^^^^^
Built in GNU new/delete:                560 s   1          4.6Mb   1
SGC as new/delete provider:             553 s   0.99       7.8Mb   1.70
non-optimized GC                        445 s   0.79        14Mb   3.04
GC with leaf optimization               398 s   0.71      12.6Mb   2.73
GC with leaf+inline optimization        387 s   0.69      12.6Mb   2.73

Huge Configuration                      Running Time      Memory Used (max)
^^^^^^^^^^^^^^^^^^
Built in GNU new/delete:                684 s   1           31Mb   1
SGC as new/delete provider:             662 s   0.97        49Mb   1.58
non-optimized GC                        558 s   0.82        91Mb   2.94
GC with leaf optimization               498 s   0.73        69Mb   2.23
GC with leaf+inline optimization:       483 s   0.71        67Mb   2.16

Using smaller objects (1-32b leaves, average object size: 19 bytes)
^^^^^^^^^^^^^^^^^^^^^                   Running Time      Memory Used (max)
Built in GNU new/delete:                230 s   1          7.4Mb   1
SGC as new/delete provider:             181 s   0.79       9.0Mb   1.22
non-optimized GC                        147 s   0.64        16Mb   2.16
GC with leaf optimization               141 s   0.61      13.9Mb   1.88
GC with leaf+inline optimization:       136 s   0.59      14.4Mb   1.95

Using larger objects (1-512b leaves, average object size: 195 bytes)
^^^^^^^^^^^^^^^^^^^^                    Running Time      Memory Used (max)
Built in GNU new/delete:                165 s   1         3948Kb   1
SGC as new/delete provider:             176 s   1.07      7240Kb   1.83
non-optimized GC                        237 s   1.44      13.4Mb   3.39
GC with leaf optimization               158 s   0.96      11.8Mb   2.99
GC with leaf+inline optimization:       158 s   0.96      11.5Mb   2.91

Using very large objects (1-2048b leaves, average object size: 695 bytes)
^^^^^^^^^^^^^^^^^^^^^^^^                Running Time      Memory Used (max)
Built in GNU new/delete:                374 s   1         10.7Mb   1
SGC as new/delete provider:             541 s   1.45        21Mb   1.96
non-optimized GC                        709 s   1.90        41Mb   3.83
GC with leaf optimization               415 s   1.11        28Mb   2.62
GC with leaf+inline optimization:       416 s   1.11        28Mb   2.62


Benchmark 2: Random GC-friendly subtree operations
--------------------------------------------------

This benchmark is a simplified variant of benchmark 1 in which all
objects are of the same size. It can also be run in an "optimal" variant
that uses a class specific free-stack to administer memory (sketched
after the tables below). This is the fastest possible administration
scheme when the order of new/delete is arbitrary, and thus gives a lower
bound on the execution time.

(avg size of subtree replaced: 1-2 levels)

Config                                  Speed             Memory
builtin                                 108.38  1           16Mb  1
optimal                                  48.56  0.45        16Mb  1.00
sgc as new/delete provider               70.38  0.65        17Mb  1.06
non-optimized GC                         61.11  0.56        31Mb  1.94
gc with inline optimization              57.42  0.53        31Mb  1.94

(avg size of subtree replaced: ~4 levels)

Config                                  Speed             Memory
builtin                                 1140    1           16Mb  1
optimal                                  250    0.22        16Mb  1.00
sgc as new/delete provider               425    0.37        17Mb  1.06
non-optimized GC                         442    0.39        35Mb  2.19
gc with inline optimization              389    0.34        35Mb  2.19
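For reference, the class specific free-stack behind the "optimal" variant
is the classic per class free list: operator delete pushes the dead node
onto a stack and operator new pops it straight back off. The sketch below
was written for this note and is not the benchmark's actual source.

    #include <cstddef>
    #include <new>

    struct FixedNode {                 // all objects in this benchmark have this size
        FixedNode* left;
        FixedNode* right;

        static void* operator new(std::size_t size)
        {
            if (free_stack) {                   // reuse the most recently freed node
                FixedNode* n = free_stack;
                free_stack = n->left;
                return n;
            }
            return ::operator new(size);        // otherwise take a fresh block
        }

        static void operator delete(void* p)
        {
            FixedNode* n = static_cast<FixedNode*>(p);
            n->left = free_stack;               // the dead node itself is the stack link
            free_stack = n;
        }

        static FixedNode* free_stack;
    };

    FixedNode* FixedNode::free_stack = 0;

With all objects of the same size, allocation and deallocation are each
only a few instructions, with no size lookup, searching or coalescing,
which is why this scheme serves as a practical lower bound.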
Benchmark 3: Digital Logic Simulation.
--------------------------------------

This benchmark uses the SimSys programming framework to simulate a 5
stage pipeline and the memory hierarchy of a generic RISC microprocessor
doing a quicksort of 512 integers. The program is a real application, not
a synthetic benchmark. It is designed with traditional new/delete in
mind, and only trivial changes were made to run it with SGC. A single 1Mb
object representing the simulated memory is allocated from the standard
heap instead of the garbage collected one.

Most of the memory used by the application is either small link objects
from general container classes, or variable sized blocks representing
data packets flowing through the simulated components. The average object
size is 16 bytes. Following a brief initialisation phase, the amount of
live memory remains almost constant throughout the entire run, and
fragmentation is very low. Leaf+inline optimization is used to speed up
allocation of the link objects; leaf optimization only is used for the
data packets.

During the 60 second run shown below, this application consumes 340Mb of
memory in 22 million objects. Garbage collection occurs several hundred
times during the benchmark, costing a total of 8-9% of the CPU time.
During a typical collection 10,000 objects are alive, while 23,000
objects are found to be dead, all within at most 20 milliseconds. The
average tracing overhead is about 250 ns per object (or roughly 16 ns
per byte).

Configuration                           Running Time      Memory Used (max)
Built in GNU new/delete:                 68 s   1          2708Kb   1
GC with leaf+inline optimization:        60 s   0.88       3428Kb   1.27

Adjusted for the 1Mb object on the standard heap:
Built in GNU new/delete:                                   1684Kb   1
GC with leaf+inline optimization:                          2404Kb   1.43

Further adjusted for the 1376Kb executable image:
Built in GNU new/delete:                                    308Kb   1
GC with leaf+inline optimization:                          1028Kb   3.34


Conclusions
===========

The benchmarks generally confirm the cost model given in "sgc.h". The
cost of garbage collected memory can be divided into a per object cost
and a per word cost. This differs from traditional new/delete, which
carries only a per object cost. The net result is that for larger
objects, where the per word cost dominates, garbage collected memory
becomes significantly more expensive than non-garbage collected memory.
For small objects, the per object cost dominates, and the figures show
that the per object cost is significantly lower with SGC than with GNU
new/delete.

The per word cost is actually dominated by the initialisation cost, not
the tracing overhead. Objects allocated from the garbage collected heap
MUST be zeroed to eliminate stale pointers. If not, old pointers will be
revived, causing massive overuse of memory (see the sketch at the end of
this section). The cost of zeroing a memory block can be lowered
significantly if the underlying architecture supports either

1) User mode cache block zero instructions
2) User mode ST with lazy block load

but unfortunately the IA-32 architecture offers no such support.

It is a design goal that SGC's memory requirements should be bounded by a
factor of 3 when compared to GNU new/delete. As seen above, this goal is
often not met. Further work is required in this area.
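To make the zeroing requirement concrete: a conservative collector treats
every word of the heap it traces as a potential pointer, so leftover
words from a block's previous occupant can keep dead objects, and
everything reachable from them, alive. The sketch below illustrates the
idea; it is not SGC's actual allocation path, and the function name is
made up for the example.

    #include <cstring>
    #include <cstdlib>

    // Hypothetical allocation path of a conservative collector.
    void* gc_allocate(std::size_t n)
    {
        void* p = std::malloc(n);   // stands in for carving a block out of the GC heap
        // Without this memset the block still contains whatever words its
        // previous occupant left behind. A conservative trace treats every
        // word as a possible pointer, so a stale word holding the address
        // of a dead object would "revive" that object and everything
        // reachable from it -- the massive overuse of memory mentioned
        // above. Zeroing is therefore mandatory, and this initialisation
        // dominates the per word cost.
        std::memset(p, 0, n);
        return p;
    }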