The following two computational problems are studied:
*Duplicate grouping:*
Assume that n items are given, each of which is labeled by an
integer key from the set {0,…, U − 1}.
Store the items in an array of size n
such that items with the same key
occupy a contiguous segment of the array.
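As an illustration only, the duplicate-grouping task can be sketched in a few lines of Python. The function name `group_duplicates` and its `key` parameter are hypothetical, and Python's built-in dictionary hashing stands in for the hashing scheme developed in the paper, so none of the paper's time or reliability bounds should be read into this sketch.

```python
from collections import defaultdict


def group_duplicates(items, key=lambda item: item):
    """Return the items reordered so that equal keys are contiguous.

    Illustrative sketch: relies on Python's dict, not on the paper's
    hash functions.
    """
    buckets = defaultdict(list)          # key -> items labeled with that key
    for item in items:
        buckets[key(item)].append(item)

    grouped = []                         # output array of size n
    for same_key_items in buckets.values():
        grouped.extend(same_key_items)   # one contiguous segment per key
    return grouped
```

Note that the output is grouped but not sorted: segments appear in the order in which their keys were first seen.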
*Closest pair:* Assume that a multiset of n points in
the d-dimensional Euclidean space is given,
where d ≥ 1 is a fixed integer.
Each point is represented as a d-tuple of integers in the
range {0,…, U − 1} (or
of arbitrary real numbers). Find a closest pair, i.e., a
pair of points whose distance is minimal
over all such pairs.
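To pin down the problem statement, a quadratic-time reference solution may help. It is only a definitional baseline, not the linear-time algorithm discussed in the paper; the name `closest_pair_bruteforce` is hypothetical. Because the input is a multiset, a repeated point yields a closest pair at distance 0.

```python
from itertools import combinations
from math import dist


def closest_pair_bruteforce(points):
    """Return a pair of points at minimal Euclidean distance.

    Examines all n*(n-1)/2 pairs, so it takes quadratic time.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    return min(combinations(points, 2), key=lambda pair: dist(*pair))
```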
In 1976 Rabin described a randomized algorithm for the closest-pair problem
that takes linear expected time.
As a subroutine, he used a hashing procedure whose implementation
was left open. Only years later were
randomized hashing schemes
suitable for filling this gap developed.

In this paper, we return to Rabin’s classic algorithm
in order to provide
a fully detailed description and analysis,
thereby also extending and strengthening his result.
As a preliminary step, we study randomized algorithms for the
duplicate-grouping problem.
In the course of solving
the duplicate-grouping problem,
we describe a new universal class of hash functions
of independent interest.

It is shown that both of the foregoing problems can be solved
by randomized algorithms that use
O(n) space and finish in O(n) time with probability
tending to 1 as n grows to infinity.
The model of computation is a unit-cost RAM capable of
generating random numbers
and of performing arithmetic
operations from the set {+, −, *, div, log_{2}, exp_{2}},
where div denotes integer division
and log_{2} and exp_{2} are the mappings from
ℕ to ℕ ∪ {0} with
log_{2}(m) = ⌊log_{2} m⌋ and
exp_{2}(m) = 2^{m}, for all m ∈ ℕ.
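For concreteness, the two non-standard operations can be mirrored on Python integers as follows. The names `log2_floor` and `exp2` are hypothetical, and Python integers are arbitrary-precision, so this sketch shows only the input-output behavior of the operations, not their unit cost on the RAM.

```python
def log2_floor(m):
    """floor(log2(m)) for a positive integer m."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return m.bit_length() - 1   # position of the highest set bit


def exp2(m):
    """2**m for a positive integer m."""
    return 1 << m
```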
If the
operations log_{2} and exp_{2} are
not available,
the running time of the algorithms increases by an additive
term of O(log log U).
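One plausible way to see where a term of this order can come from is that ⌊log_{2} m⌋ can be computed with O(log log m) multiplications, comparisons, and integer divisions by repeated squaring. The sketch below is an assumption made for illustration, not necessarily the construction used in the paper, and the name `log2_floor_arith` is hypothetical.

```python
def log2_floor_arith(m):
    """floor(log2(m)) using only comparison, +, * and integer division."""
    if m < 1:
        raise ValueError("m must be a positive integer")

    # Pairs (2**(2**i), 2**i), built by repeated squaring until the next
    # square would exceed m; this takes O(log log m) iterations.
    squares = [(2, 1)]
    while squares[-1][0] * squares[-1][0] <= m:
        value, exponent = squares[-1]
        squares.append((value * value, exponent + exponent))

    # Peel off the binary digits of the result, largest power first.
    result = 0
    for value, exponent in reversed(squares):
        if m >= value:
            m = m // value               # the "div" operation
            result = result + exponent
    return result
```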
All numbers manipulated by the algorithms consist of O(log n + log U)
bits.

The algorithms for both of the problems exceed the time bound O(n)
or O(n + log log U) with probability 2^{−n^{Ω(1)}}.
Variants of the algorithms are also given that use only O(log n + log U)
random bits and have probability O(n^{−α}) of exceeding the
time bounds,
where α ≥ 1 is a constant that can be chosen arbitrarily.

The algorithm for the closest-pair problem
also works if the coordinates of the points are
arbitrary real numbers,
provided that the RAM is able to perform arithmetic operations from
{+, −, *, div} on real numbers,
where a div b now means ⌊a / b⌋. In this case,
the running time is O(n) with log_{2} and exp_{2} and
O(n + log log(δ_{max} / δ_{min}))
without them, where δ_{max} is the maximum and
δ_{min} is the minimum distance between any two
distinct input points.
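To illustrate how the operation a div b = ⌊a / b⌋ on real numbers can be used, the following sketch maps each point to a grid cell and finds a closest pair with a simple incremental randomized grid method. This is a different, well-known technique chosen for brevity; it is neither Rabin's algorithm nor the algorithm of this paper. All names (`cell_of`, `closest_pair_grid`, `seed`) are hypothetical, Python's dictionary stands in for the hashing scheme, and floating-point rounding effects are ignored.

```python
import random
from itertools import product
from math import dist, floor


def cell_of(point, side):
    """Grid cell of a point: each coordinate a is mapped to floor(a / side)."""
    return tuple(floor(x / side) for x in point)


def closest_pair_grid(points, seed=None):
    """Closest pair via an incremental randomized grid (illustrative sketch)."""
    pts = list(points)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    random.Random(seed).shuffle(pts)

    best = (pts[0], pts[1])
    delta = dist(*best)                  # smallest distance seen so far
    if delta == 0:
        return best
    offsets = list(product((-1, 0, 1), repeat=len(pts[0])))

    def build(prefix):
        """Hash every point of the prefix into a grid of side delta."""
        grid = {}
        for q in prefix:
            grid.setdefault(cell_of(q, delta), []).append(q)
        return grid

    grid = build(pts[:2])
    for i in range(2, len(pts)):
        p = pts[i]
        cell = cell_of(p, delta)
        nearest, nearest_dist = None, delta
        # Any point closer than delta to p lies in one of the 3**d
        # cells surrounding (and including) the cell of p.
        for off in offsets:
            neighbor = tuple(a + b for a, b in zip(cell, off))
            for q in grid.get(neighbor, ()):
                dq = dist(p, q)
                if dq < nearest_dist:
                    nearest, nearest_dist = q, dq
        if nearest is None:
            grid.setdefault(cell, []).append(p)
        else:
            best, delta = (nearest, p), nearest_dist
            if delta == 0:
                return best
            grid = build(pts[:i + 1])    # finer grid for the smaller delta
    return best
```

The grid is rebuilt only when the minimum distance shrinks, which for a random insertion order happens rarely in expectation; this is the usual argument for the method's expected linear running time when d is fixed and dictionary operations take constant expected time.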