The following two computational problems are studied: Duplicate grouping:
Assume that n items are given, each of which is labeled by an integer key
from the set {0,…,U − 1}.
Store the items in an array of size n
such that items with the same key
occupy a contiguous segment of the array. Closest pair: Assume that a multiset of n points in
the d-dimensional Euclidean space is given,
where d ≥ 1 is a fixed integer.
Each point is represented as a d-tuple of integers in the
range {0,…,U − 1} (or
of arbitrary real numbers). Find a closest pair, i. e., a
pair of points whose distance is minimal
over all such pairs. In 1976 Rabin described a probabilistic algorithm for the closest-pair problem
that takes linear expected time.
As a subroutine, he used a hashing procedure whose implementation
was left open. Only years later
randomized hashing schemes
suitable for filling this gap were developed. In this paper, we return to Rabin’s classic algorithm
in order to provide
a fully detailed description and analysis,
thereby also extending and strengthening his result.
As a preliminary step, we study randomized algorithms for the
duplicate-grouping problem.
In the course of solving
the duplicate-grouping problem,
we describe a new universal class of hash functions
of independent interest. It is shown that both of the above problems can be solved
by probabilistic algorithms that use
O(n) space and finish in O(n) time with probability
tending to 1 as n grows to infinity.
The model of computation is a unit-cost RAM capable of
generating random numbers
and of performing arithmetic
operations from the set {+, −, *, div, log2, exp2},
where div denotes integer division,
and log2 and exp2 are the mappings from
ℕ to ℕ∪{0} with
log2(m) = ⌊log2 m⌋ and
exp2(m) = 2m, for all m ∈ ℕ.
If the
operations log2 and exp2 are
not allowed,
the running time of the algorithms increases by an additive
term of O(log log U).
All numbers manipulated by the algorithms consist of O(log n + log U)
bits. We consider two variants of the algorithm
for the closest-pair problem. One uses only
O(log n + log U)
random bits and still has probability O(n−α)
of exceeding the time bound O(n),
where α ≥ 1 is a constant that can be chosen arbitrarily.
The other one uses O(n log n + log U) random bits, but reduces the
probability that the time bound is exceeded to 2−nΩ(1). The algorithm for the closest-pair problem
also works if the coordinates of the points are
arbitrary real numbers,
provided that the RAM is able to perform arithmetic operations from
{+, −, *, div} on real numbers,
where a div b now means ⌊a / b⌋. In this case,
the running time is O(n) with log2 and exp2 and
O(n + log log(δmax / δmin))
without them, where δmax is the maximum and
δmin is the minimum distance between any two
distinct input points.
|