**Jyrki Katajainen
Autumn 1998**

Last modified 4.5.1999 18:00

- Meetings
- Credit points
- 1. Lecture: Introduction
- 2. Lecture: Programming style
- 3. Lecture: Software tools
- 4. Lecture: Reducing instruction count
- 5. Lecture: Utilizing word parallelism
- Extra Lecture: Learning C in four hours
- 6. Lecture: Improving paging performance
- 7. Lecture: Improving cache behaviour
- 8. Lecture: Utilizing processor parallelism
- 9. Lecture: Utilizing instruction parallelism
- 10. Lecture: Utilizing computer parallelism

| Date | Time | Room | Place |
|------|------|------|-------|
| 7 Sept | 10.15–14.00 | E35B | Jagtvej 155B |
| 14 Sept | 10.15–14.00 | E:3318 | E-huset, LTH |
| 21 Sept | 10.15–14.00 | E35B | Jagtvej 155B |
| 28 Sept | 10.15–14.00 | E:3318 | E-huset, LTH |
| 5 Oct | 10.15–14.00 | E35B | Jagtvej 155B |
| 12 Oct | 10.15–14.00 | E35B | Jagtvej 155B |
| 19 Oct | 10.15–14.00 | E:3318 | E-huset, LTH |
| 26 Oct | 10.15–14.00 | E35B | Jagtvej 155B |
| 2 Nov | 10.15–14.00 | E:1407 | E-huset, LTH |
| 9 Nov | 10.15–14.00 | E35B | Jagtvej 155B |
| 16 Nov | 10.15–14.00 | E:1407 | E-huset, LTH |

Danish credits:

- 1 point: active participation
- *x*/50 point(s): *x*% of all home exercises
- 1 point: both programming exercises
- 1 point per lecture note
- 1 point per presentation (1 hour)

Total: 0–4 Danish points

Swedish credits (all of the following):

- active participation
- both programming exercises
- one of the following:
  - more than 50% of all home exercises
  - a lecture note
  - a presentation (1 hour)

Total: 5 Swedish points

My main sources of inspiration were the books by Bentley:

Jon Bentley, *Programming Pearls*, Addison-Wesley Publishing Company,
Reading (1986)

Jon Bentley, *More Programming Pearls: Confessions of a Coder*,
Addison-Wesley Publishing Company, Reading (1988)

1. Write a program that computes a minimum spanning tree for an undirected, weighted graph. Your program should be based on Kruskal's algorithm. The emphasis in this exercise is on correctness and programming style.

2. Keep a diary of the errors you make when solving Exercise 1. What was the most difficult error to find and correct? What did you learn from this error?

3. How would you classify the errors reported for Exercise 2?

4. What is the most difficult programming error you have ever made? Could you describe it briefly? How were you able to recover from it?

Rob Pike, *Notes
on programming in C*, Worldwide Web Document (1989)

1. Write a short review of Knuth's implementation of Kruskal's algorithm
given in [Donald E. Knuth, *The Stanford GraphBase: A Platform for Combinatorial
Computing*, Addison-Wesley Publishing Company, Reading (1993), 460–468].

2. Consider the string-copying routines discussed in the lecture. Test the speed of these
routines on your computer by using an optimizing compiler (`gcc -O4`
or higher). Take an assembler listing of the program (`gcc -O4 -S`)
and try to explain why the slowest routine is slower than the fastest one.

3. Write a string-copying routine that copies a string into another string wordwise, not bytewise as the routines discussed in the lecture do. Is your routine faster than the routine available in the library? (A possible starting point is sketched below.)
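The following sketch is one way to begin; it is mine, not the library routine. It assumes a 32-bit `unsigned int` and word-aligned strings, uses the classical zero-byte test, and knowingly breaks strict-aliasing rules, which is acceptable for an experiment like this.

`/* A sketch of wordwise string copy; assumes 32-bit unsigned int and`
`   word-aligned strings. Reads the whole word containing '\0'. */`
`int has_zero_byte(unsigned int w) {`
`    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;`
`}`
`char* wordwise_strcpy(char* dst, const char* src) {`
`    unsigned int* d = (unsigned int*) dst;`
`    const unsigned int* s = (const unsigned int*) src;`
`    while (!has_zero_byte(*s))    /* no terminator in this word yet */`
`        *d++ = *s++;`
`    char* dc = (char*) d;         /* finish the last word bytewise */`
`    const char* sc = (const char*) s;`
`    while ((*dc++ = *sc++) != '\0')`
`        ;`
`    return dst;`
`}`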

4. Given two arrays *s* and *t*, both of size *n*, what
is the pure C cost [Jyrki Katajainen and Jesper Larsson Träff, A
meticulous analysis of mergesort programs, *Proceedings of the 3rd
Italian Conference on Algorithms and Complexity*, *Lecture Notes in
Computer Science* **1203**, Springer-Verlag, Berlin/Heidelberg (1997),
217–228] of copying all elements from array *t* to array *s*?
How much can you improve the program by using loop unrolling?
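As a reminder, 4-way unrolling of the copying loop might look as follows (a sketch using the names *s*, *t*, and *n* from the exercise); unrolling trades longer code for fewer loop-control operations.

`for (i = 0; i + 4 <= n; i += 4) {    /* four assignments per loop test */`
`    s[i] = t[i];`
`    s[i + 1] = t[i + 1];`
`    s[i + 2] = t[i + 2];`
`    s[i + 3] = t[i + 3];`
`}`
`for (; i < n; ++i)                   /* clean-up for the n % 4 leftovers */`
`    s[i] = t[i];`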

Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick, An execution
profiler for modular programs, *Software–Practice and Experience*
**13** (1983), 671–685

Donald E. Knuth, Literate programming, *The Computer Journal* **27**
(1984), 97–111

Martin Bruce: Profiling

1. Evaluate briefly the research proposal that I will send to the Danish Natural Science Research Council. Could you imagine taking part in this project?

2. Consider a simple maximum-finding
subroutine that scans through the input array and updates the current
maximum when necessary. Assume that the input for this subroutine is a
random
permutation of the numbers 0, 1, ..., *n*-1. Let *X* denote
the number of times the value of the current maximum (variable *t*
in my program) is updated. What is the expected value of *X*? First
guess the answer, then profile the program, and finally prove the result
analytically. How good was your guess? (Hint: To solve puzzles like this, it
is often helpful to consult one of Knuth's books [Donald E. Knuth, *The
Art of Computer Programming*, 3 volumes, Addison-Wesley Publishing Company,
Reading (1968–1998)].)

3. Read Column 1 on profilers in [Jon Bentley, *More Programming Pearls:
Confessions of a Coder*, Addison-Wesley Publishing Company, Reading
(1988)] and answer Exercise 2 therein.

4. In [Jon Bentley, *More Programming Pearls: Confessions of a Coder*,
Addison-Wesley Publishing Company, Reading (1988), p. 184] Bentley writes:
"For faster implementations of prime sieves, see the Communications
of the ACM papers by Mairson (September 1977), Gries and Misra (December
1978), and Pritchard (January 1981), or Pritchard's 'Linear prime-number
sieves: A family tree' in Science of Computer Programming, vol. 9, pp.
17–35, 1987." Go to the library, look these papers up, and give
your opinion of the words "faster implementations".

Donald E. Knuth, An empirical study of FORTRAN programs, *Software–Practice
and Experience* **1** (1971), 105–133

1. Consider the problem of sorting an array of four integers. Write a program that solves this problem as efficiently as possible. What is the pure C cost of your program? How many element comparisons do you need?
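One natural candidate to measure against is a sorting network; the following sketch is the classical 5-comparator network for four elements (the function name is mine; each `if` is a compare-exchange).

`void sort4(int x[4]) {`
`    int t;`
`    if (x[0] > x[1]) { t = x[0]; x[0] = x[1]; x[1] = t; }`
`    if (x[2] > x[3]) { t = x[2]; x[2] = x[3]; x[3] = t; }`
`    if (x[0] > x[2]) { t = x[0]; x[0] = x[2]; x[2] = t; }    /* min is now at x[0] */`
`    if (x[1] > x[3]) { t = x[1]; x[1] = x[3]; x[3] = t; }    /* max is now at x[3] */`
`    if (x[1] > x[2]) { t = x[1]; x[1] = x[2]; x[2] = t; }    /* sort the middle */`
`}`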

2. Many sorting routines are hybrids in which small inputs are handled
by insertion sort. Usually diversion thresholds between 10 and 50 work
well. Write a tuned version of insertion sort and analyse the pure C cost
of your routine. (My tuned version sorts *n* integers with 2*n*^{2}
+ 8*n* + O(1) pure C operations in the worst case. Can you do better?)

3. Below are two program segments that partition an array *x*[*l*..*h*]
around the value *v*, i.e., after the partitioning all elements smaller
than *v* appear in the array before the elements larger than
*v*. Transform the programs to use pointer notation instead of
array notation and remove all useless pointer arithmetic. Analyse
both the worst case and the best case of these optimized programs in
pure C terms.

Partitioning by the wheel technique [Jon Bentley, *Programming Pearls*,
Addison-Wesley Publishing Company, Reading (1986), p. 110]:

`m = l - 1;`
`for (i = l; i <= h; i++) {`
`    if (x[i] < v) {`
`        m = m + 1;`
`        temp = x[m]; x[m] = x[i]; x[i] = temp;`
`    }`
`}`

Partitioning with two meeting pointers [Robert Sedgewick, *Algorithms*,
2nd edition, Addison-Wesley Publishing Company, Inc. (1988), p. 118]:

`i = l - 1; j = h + 1;`
`x[i] = v; x[j] = v;`
`do {`
`    do { i = i + 1; } while (x[i] < v);`
`    do { j = j - 1; } while (x[j] > v);`
`    temp = x[i]; x[i] = x[j]; x[j] = temp;`
`} while (i < j);`
`temp = x[i]; x[i] = x[j]; x[j] = temp;`

4. Sedgewick [Robert Sedgewick, *Algorithms in C*, *Parts 1–4*,
3rd edition, Addison-Wesley Publishing Company, Inc. (1998), p. 346] points
out that by reversing the order of the second subarray and merging from
the outside to the middle, two consecutive subarrays can be merged without
checking to see if subarrays have been exhausted. When one of the two subarrays
is fully consumed, its array index will be incremented off the end of the
subarray and will point at the largest element in the other subarray. Since
this element is larger than each of the other elements in the remaining
subarray, it serves as a sentinel and will not be chosen until the final
merge step.

Program this merge routine as efficiently as you can and verify that it is faster than the standard merge.
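For concreteness, here is a sketch of the routine described above (the name and the auxiliary-array interface are mine): the second run is copied in reverse, so the largest remaining element of either run acts as the other's sentinel. Note that the reversal makes the merge unstable.

`/* Merges the sorted runs x[l..m] and x[m+1..r] using aux[l..r]. */`
`void sentinel_merge(int x[], int aux[], int l, int m, int r) {`
`    int i, j, k;`
`    for (i = l; i <= m; i++)`
`        aux[i] = x[i];                /* first run as is */`
`    for (j = m + 1; j <= r; j++)`
`        aux[j] = x[r + m + 1 - j];    /* second run reversed */`
`    i = l; j = r;`
`    for (k = l; k <= r; k++)          /* no exhaustion tests needed */`
`        x[k] = (aux[i] <= aux[j]) ? aux[i++] : aux[j--];`
`}`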

Torben Hagerup, Sorting and searching on the word RAM, *Proceedings
of the 15th Annual Symposium on Theoretical Aspects of Computer Science*,
*Lecture Notes in Computer Science* **1373**, Springer-Verlag,
Berlin/Heidelberg (1998), 366–398

Morten Nicolaj Pedersen: Sorting on the word RAM

1. Prove analytically that \sum_{i=1}^{*n*} 1/*i* = ln *n*
+ \Theta(1) for an integer *n* >= 1. (Hint: If you have problems
with this kind of basic calculus, the book [Ronald L. Graham, Donald E.
Knuth, and Oren Patashnik, *Concrete Mathematics: A Foundation for Computer
Science*, 2nd edition, Addison-Wesley Publishing Company, Reading (1994)]
is a must for you.)

2. [Jon Bentley, *Programming Pearls*, Addison-Wesley Publishing
Company, Reading (1986), Exercise 6, p. 90] How can sentinels be used in
a program to find a maximum element in an array? How can sentinels decrease
the search time in sets represented by linked lists, hash tables, and binary
search trees?
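As a reminder, the basic sentinel idea in its simplest form, here for linear search in an array (a sketch; the interface is mine and assumes one spare slot at position *n*):

`/* Placing the key at the end removes the index-bound test from the loop. */`
`int find(int x[], int n, int key) {    /* x has room for n + 1 elements */`
`    int i = 0;`
`    x[n] = key;                        /* the sentinel */`
`    while (x[i] != key)`
`        ++i;`
`    return i;                          /* i == n means "not found" */`
`}`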

3. Assume that a binary tree is being represented as follows:

`template <class D>`
`struct Node {`
`    D data;`
`    Node* left;`
`    Node* right;`
`    Node(const D& v, Node* p, Node* q) : data(v), left(p), right(q) {}`
`};`

Remove the recursion from the following C++ program that is also available electronically with a small driver. Try to make your solution as space-efficient as possible.

`void inorder(ostream& s, Node* root) {`
`    if (root != 0) {`
`        inorder(s, root->left);`
`        s << root->data;`
`        inorder(s, root->right);`
`    }`
`}`

4. Assume that *n* is a positive integer that fits into one machine
word. Consider the problem of computing the whole-number logarithm of *n*.
This problem can be solved in constant time with the following float-conversion
hack which is taken from [Arne Andersson and Mikkel Thorup, *A
pragmatic implementation of monotone priority queues*, Worldwide
Web Document (1997)].

`typedef struct {`
`    unsigned int signbit : 1;`
`    unsigned int exponentbias1023 : 11;`
`    /* unsigned int mantissa : 52; */`
`} doublemask;`

`typedef union {`
`    double as_double;`
`    doublemask as_mask;`
`} doubleandmask;`

`/* Takes a 32-bit word and finds the leftmost 1-bit */`
`inline unsigned int leftmost_one(unsigned int n) {`
`    doubleandmask fm;`
`    fm.as_double = (double) n;`
`    return (fm.as_mask.exponentbias1023 - 1023);`
`}`

Propose alternative solutions for the problem and time these solutions against the float-conversion hack.
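One obvious alternative to time against is the trivial shifting loop sketched below (the name is mine); its cost is proportional to the answer. Table lookup byte by byte and binary search over the bits are other candidates.

`/* A baseline sketch: repeated shifting, cost proportional to the answer. */`
`inline unsigned int leftmost_one_by_shifting(unsigned int n) {`
`    unsigned int position = 0;`
`    while (n >>= 1)`
`        ++position;`
`    return position;`
`}`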

Implement at least two variants of Kruskal's algorithm for finding a
minimum spanning tree of a given undirected, weighted, and connected graph
*G* = (*V*, *E*, *w*). Assume that the vertex set *V*
of the input graph is a subset of {0,1,...,*N*}, where *N* is
some non-negative integer. The vertex set is represented as a boolean array
so that `true` in position *i* of this array means that vertex
*i* is present, and `false` that it is not. The edge set *E*
is represented as an array of edges. Furthermore, assume that the edge weights
given by the function *w* are non-negative integers that fit into one
machine word, but note that the sum of two weights may already
overflow.

In the computing literature several improvements to Kruskal's algorithm have been proposed. Below I list some of them; you are welcome to use any of them in your implementation, but you should also consider whether there are other possibilities.

1) Both in [Bernard M. E. Moret and Henry D. Shapiro, An
empirical analysis of algorithms for constructing a minimum spanning tree,
*Proceedings of the 2nd International Workshop on Data Structures and
Algorithms*, *Lecture Notes in Computer Science* **519**, Springer-Verlag,
Berlin/Heidelberg (1991), 400–411] and in [John R. Black Jr., Charles U.
Martel, and Hongbin Qi, Graph
and hashing algorithms for modern architectures: Design and performance,
*Proceedings of the 2nd Workshop on Algorithm Engineering*, Technical
Report MPI-I-98-1-019, Max-Planck-Institut für Informatik, Saarbrücken
(1998), 37–48] it was reported that the representation of the graph
has a significant effect on the performance of many graph algorithms. Observe
that in our case the edges are given in an array. How costly is it to create
an adjacency-list representation, if such a representation is used?

2) In [Donald E. Knuth, *The Stanford GraphBase: A Platform for Combinatorial
Computing*, Addison-Wesley Publishing Company, Reading (1993), 460–468]
it was pointed out that in the case of integer weights sorting can be carried
out efficiently by using radix sort. The question is whether this is still
a good idea when the weights are as large as 2^{32}.

3) In [J. J. Brennan, Minimal spanning trees and partial sorting, *Information
Processing Letters* **1** (1982), 113–116] it was observed that
it is profitable to perform partial sorting in the hope that not all edges
need be considered. A related technique is used in the LEDA library; for
further details see the source code available at DIKU
[`/usr/local/projects2/algorithm/LEDA-3.4.1/src/graph_alg/min_spanning.c`].

4) In [A. Kershenbaum and R. Van Slyke, Computing minimum spanning trees
efficiently, *Proceedings of the 1972 Annual Conference*, ACM, New
York (1972), 518–527] the edges were maintained in a priority queue
and, if the endpoints of the edge to be inserted into the queue were already
in the same fragment, the edge was rejected immediately. This way the sorting
of all edges is avoided.

5) The UNION-FIND part of the algorithm can actually be problematic, since most UNION-FIND data structures make heavy use of the random-access capabilities of computers. Due to caching, random accesses are expensive. Can this part of the algorithm be implemented so that the memory accesses are more local? (A baseline sketch follows.)
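For concreteness, a baseline to start from: union by rank with path halving, kept in two plain arrays (a sketch; all names are mine). Whether the accesses can be made more local than this is exactly the question.

`struct UnionFind {`
`    int* parent;`
`    int* rank;`
`    UnionFind(int n) : parent(new int[n]), rank(new int[n]) {`
`        for (int i = 0; i < n; ++i) { parent[i] = i; rank[i] = 0; }`
`    }`
`    ~UnionFind() { delete[] parent; delete[] rank; }`
`    int find(int i) {                 /* path halving keeps trees shallow */`
`        while (parent[i] != i) {`
`            parent[i] = parent[parent[i]];`
`            i = parent[i];`
`        }`
`        return i;`
`    }`
`    void unite(int i, int j) {        /* union by rank */`
`        i = find(i); j = find(j);`
`        if (i == j) return;`
`        if (rank[i] < rank[j]) { int t = i; i = j; j = t; }`
`        parent[j] = i;`
`        if (rank[i] == rank[j]) ++rank[i];`
`    }`
`};`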

I recommend that you write your programs in C++ and use the tools available in the STL library when possible. Your programs can be built upon the graph class written by me, but it is not obligatory to use it.

Compare the performance of your implementations on randomly generated graphs. When you are measuring actual running times, it might be a good idea to run the programs on at least two different kinds of computers. Other relevant measures of goodness are the number of memory references, cache misses, etc.

The emphasis in this exercise is on the speed of your programs. In your report you should briefly document your final programs and the results of your experiments. Do not forget to mention the unsuccessful attempts that did not give any speed-up. I would also like to hear about the problems you encountered when writing the programs (compiler errors, library bugs, etc.).

Return your answer electronically to `jyrki@di.ku.dk` in a format
I can read (LaTeX, PostScript, rtf, html, plain text). The deadline for
this exercise is 18 October 1998 at 24.00 Danish time.

Al Kelley and Ira Pohl, *A Book on C*, 4th edition, The Benjamin
Cummings Publishing Company, Inc., Redwood City (1998)

Bjarne Stroustrup, *The C++ Programming Language*, 3rd edition,
Addison-Wesley, Reading (1997)

1. How would you write a routine that checks whether a sorting program works properly or not?
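One possible shape for such a checker (a sketch; the name is mine). Note that checking the order alone is not enough: a complete checker should also verify that the output is a permutation of the input, for example by comparing against a trusted sort.

`bool is_nondecreasing(const int x[], int n) {`
`    for (int i = 1; i < n; ++i)`
`        if (x[i - 1] > x[i])`
`            return false;    /* a witness of incorrectness */`
`    return true;`
`}`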

2. It is generally accepted that inline functions are as fast as macros. Is this the case in your installation?

3. Test the quality of the pseudo-random numbers produced by the class
provided in [Bjarne Stroustrup, *The C++ Programming Language*, 3rd
edition, Addison-Wesley, Reading (1997), p. 686].

4. Read through the example programs provided in the lecture notes. Which parts of the programs were difficult to understand and how would you improve them?

Jeffrey S. Vitter, *External
memory algorithms*, Worldwide Web Document (1998)

1. Given two words, show that the position of their first differing bit can be found in constant time.
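A sketch, assuming *a* != *b* and the availability of a constant-time whole-number logarithm such as the `leftmost_one` hack from the exercises of an earlier lecture:

`/* Position of the most significant bit in which a and b differ;`
`   assumes a != b. */`
`inline unsigned int first_difference(unsigned int a, unsigned int b) {`
`    return leftmost_one(a ^ b);`
`}`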

2. Write a function that computes a table of size 2^{*b*} whose

3. Consider the simulation of a sequential access machine with multiple memories (multitape Turing machine) on a normal computer with a two-level paged memory. The standard page-replacement strategy LRU (least recently used) does not necessarily work in a satisfactory manner. Why not? Devise a memory management strategy with which the simulation can be carried out efficiently.

4. Complete the graph
class written by me with a function that tests whether a given graph
is a tree or not. Assume that your function is executed on a two-level
paged memory for a graph with *n* vertices and *m* edges. How
many page transfers will be performed as a function of *n*, *m*,
and *B*, where *B* is the capacity of a page measured in words?
Can you do better?

Alvin R. Lebeck and David A. Wood, Cache profiling and the SPEC benchmarks:
A case study, *Computer* **27**,10 (1994), 11–26

Maz H. Spork: Cache-conscious design of algorithms

1. [Jon Bentley, *Programming Pearls*, Addison-Wesley Publishing
Company, Reading (1986), Exercise 8, p. 90] Consider the searching problem
where we are given a sorted array of *n* elements of type *T*
and one more element *x* of the same type and the task is to determine
whether the array contains the element *x*. If *x* is in the
array, a reference to it is returned, otherwise a reference to a null element
is returned. It has been claimed that, for *n* = 1 000, this
problem can be solved more efficiently by hashing than by a tuned binary
search. Implement a fast hashing routine and compare it to the tuned binary-search
routine written by me.

2. An alternative to binary search is square-root(*n*) search.
Here a new array is created from a given sorted array of size *n*
by putting every square-root(*n*)'th element of the original array
into the first locations of the new array, and the remaining elements in sorted
order after them. Explain how the creation of the new array can be done
in-place with linear cost. By using the new array, an element searched
for is found with O(square-root(*n*)) comparisons by scanning the
first square-root(*n*) elements and then the square-root(*n*)
elements in the subarray that may contain the element. Implement a square-root(*n*)-search
routine and compare it experimentally to the tuned binary-search routine
written by me. Which of these two methods is faster when *n* = 1 000 000?

3. In [Alvin R. Lebeck and David A. Wood, Cache profiling and the SPEC
benchmarks: A case study, *Computer* **27**,10 (1994), 11–26]
it was pointed out that by reducing the size of a hash table from 69 001
to 5 003 it was possible to speed up a program using it, because the smaller
hash table fits in the data cache. Test whether something similar happens
on your computer.

4. In [John R. Black Jr., Charles U. Martel, and Hongbin Qi, Graph
and hashing algorithms for modern architectures: Design and performance,
*Proceedings of the 2nd Workshop on Algorithm Engineering*, Technical
Report MPI-I-98-1-019, Max-Planck-Institut für Informatik, Saarbrücken
(1998), 37–48] the performance of arrays and linked lists was compared.
It was reported that reading all elements of an array can be 10 times faster
than the corresponding linked list scan when the space required by the
elements exceeds the capacity of the data cache. Carry out experiments
to see what happens on your computer.

Cherri M. Pancake, Is parallelism for you?, *IEEE Computational Science
& Engineering* **3**,2 (1996), 18–37

1. Compare the performance of the following three programs for multiplying
matrices of size *n* x *n* when *n* is large. The last two
of these programs are taken from [Monica S. Lam, Edward E. Rothberg, and
Michael E. Wolf, The cache performance and optimizations of blocked algorithms,
*Proceedings of the 4th International Conference on Architectural Support
for Programming Languages and Operating Systems*, *SIGPLAN Notices*
**26**,4 (1991), 63–74].

The standard matrix multiplication code:

`unsigned int i, j, k;`
`for (i = 0; i < n; ++i)`
`    for (j = 0; j < n; ++j)`
`        for (k = 0; k < n; ++k)`
`            z[i][j] += x[i][k] * y[k][j];`

The standard matrix multiplication code after changing the order of the last two loops:

`unsigned int i, j, k, r;`
`for (i = 0; i < n; ++i)`
`    for (k = 0; k < n; ++k) {`
`        r = x[i][k];`
`        for (j = 0; j < n; ++j)`
`            z[i][j] += r * y[k][j];`
`    }`

The matrix multiplication code blocked to reduce cache misses:

`unsigned int kk, jj, i, j, k, r;`
`for (kk = 0; kk < n; kk += b)`
`    for (jj = 0; jj < n; jj += b)`
`        for (i = 0; i < n; ++i)`
`            for (k = kk; k < min(kk + b, n); ++k) {`
`                r = x[i][k];`
`                for (j = jj; j < min(jj + b, n); ++j)`
`                    z[i][j] += r * y[k][j];`
`            }`
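Note that `min` in the blocked version is not a standard C function; the code assumes a helper along the following lines.

`inline unsigned int min(unsigned int a, unsigned int b) {`
`    return a < b ? a : b;`
`}`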

What is the optimal blocking factor *b* on your computer? How does
the performance of these programs vary with small changes to the matrix
size *n*?

2. Read Chapter 28 of the book [Thomas H. Cormen, Charles E. Leiserson,
and Ronald L. Rivest, *Introduction to Algorithms*, The MIT Press,
Cambridge (1990)] and solve Problem 28-2 therein. (Hint: Correct the error
in the problem formulation before solving the problem. This error is not
yet mentioned in the bug
list of the book.)

3. Show that the computation of the whole-number logarithm of a positive
integer *n* fitting into a machine word is an AC^{0} instruction,
i.e., it is computable by an unbounded-fanin circuit of constant depth
and polynomial size, where "polynomial" means polynomial in the
word length. (Hint: Chapter 29 of the book [Thomas H. Cormen, Charles E.
Leiserson, and Ronald L. Rivest, *Introduction to Algorithms*, The
MIT Press, Cambridge (1990)] might be of some help here.)

4. Read Chapter 30 of the book [Thomas H. Cormen, Charles E. Leiserson,
and Ronald L. Rivest, *Introduction to Algorithms*, The MIT Press,
Cambridge (1990)] and solve Problem 30-1 therein.

Margaret Reid-Miller, List ranking and list scan on the Cray C90, *Journal
of Computer and System Sciences* **53** (1996), 344–356

1. Show how the contents of two registers can be swapped without using any other registers. Prove also that your solution is correct.
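The classical solution uses exclusive-or assignments, as sketched below (`a0` and `b0` denote the original contents). Proving it correct, and spotting what happens when both names refer to the same register, is the point of the exercise.

`a ^= b;    /* a == a0 ^ b0               */`
`b ^= a;    /* b == (a0 ^ b0) ^ b0 == a0  */`
`a ^= b;    /* a == (a0 ^ b0) ^ a0 == b0  */`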

2. Analyse the pure C cost of the trivial serial routine for list ranking. Do the same for the serial version of the parallel pointer-jumping routine. Also test experimentally how the order of the list elements affects the performance of the serial routine when the size of the input is larger than that of the cache of your computer.

3. Consider the block interchange problem studied in Column 2 of [Jon
Bentley, *Programming Pearls*, Addison-Wesley Publishing Company,
Reading (1986)]. In this problem we are given an array consisting of two
consecutive blocks *AB* and the task is to reorder the elements of
the array to the form *BA*. The standard in-place solution for this
problem is to first reverse *A* to get *A*^{R}*B*, then reverse
*B* to get *A*^{R}*B*^{R}, and finally reverse the whole array
to obtain *BA*.
4. Write a short evaluation of this course. What was good? What was not so good? What was missing altogether?

David Culler, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Brent Chun,
Steve Lumetta, Alan Mainwaring, Richard Martin, Chad Yoshikawa, Frederick
Wong, Parallel
computing on the Berkeley NOW, Presented at *the 9th Joint Symposium
on Parallel Processing*, Kobe, Japan (1997)

Do A or B, and return your answer electronically to `jyrki@di.ku.dk`
in a format I can read (LaTeX, PostScript, rtf, html, plain text). The
deadline for this exercise is 29 November 1998 at 20.00.

Jesper Bojesen has created a library
of heaps. In this exercise your task is to extend the library. When solving
the exercise, you are welcome to use the source codes available on the
Worldwide Web (provided that the copyright allows that). Jesper's programs
are available at DIKU in the directory
`/vol/www/forskningpPerformance-engineering/Jesper/heaplab/`.
In particular, note how Jesper has used the tools make, LaTeX, and gnuplot,
and the shell scripts. By imitating him you can probably increase your
productivity in future programming projects. Other relevant resources are
the STL heap
implementation, the LEDA
priority-queue implementations, and LaMarca's
priority-queue implementations.

A. Compare the performance of the following four heap construction algorithms:

- Williams' original algorithm, where the elements are inserted into the heap
one at a time [J. W. J. Williams, Algorithm 232: HEAPSORT,
*Communications of the ACM* **7** (1964), 347–348].
- Floyd's improved algorithm, which requires only a linear number of operations
[Robert W. Floyd, Algorithm 245: TREESORT 3,
*Communications of the ACM* **7** (1964), 701].
- The modification of Floyd's algorithm proposed by Jesper in Listing 2.5 of his report.
- The heap-construction algorithm given in Theorem 8 of [R. Fadel, K. V.
Jakobsen, J. Katajainen, and J. Teuhola, Heaps
and heapsort on secondary storage,
*Theoretical Computer Science*, to appear], which is optimal in a paged environment.
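For orientation, a minimal array-based sketch of Floyd's construction (max-heap ordering; the names are mine, and Jesper's actual code differs):

`/* siftdown() restores the max-heap property below position i in x[0..n-1]. */`
`void siftdown(int x[], int n, int i) {`
`    int t = x[i];`
`    for (int child = 2 * i + 1; child < n; child = 2 * i + 1) {`
`        if (child + 1 < n && x[child + 1] > x[child])`
`            ++child;                   /* the larger child */`
`        if (t >= x[child])`
`            break;`
`        x[i] = x[child];`
`        i = child;`
`    }`
`    x[i] = t;`
`}`
`void make_heap_floyd(int x[], int n) {`
`    for (int i = n / 2 - 1; i >= 0; --i)    /* bottom-up over inner nodes */`
`        siftdown(x, n, i);`
`}`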

B. Implement (in C++) any priority-queue structure mentioned in any
of the articles in my literature list. The data structure implemented should
be STL compatible and dynamic, i.e., it should use \Theta(*n*) space
for storing *n* elements with their associated priorities. In your
report you should briefly describe the implemented data structure and compare
its performance to the priority queue provided in the STL library and the
dynamic heap structure described in Chapter
4 of Jesper's report.