Hi, prashant_jain-ga:
There are several ways to efficiently enumerate all the k-subsets of n
items. The three methods that I'm about to discuss are all based on a
common ordering of the solution search space, so that the same methods
might be used to divide up the search space between multiple
computers. If you have a hard requirement to perform such a long
computation in a fixed time period, then ultimately "parallel
processing" may be your best solution, as more efficient
implementations can only improve performance up to a certain point.
Most likely the approach you'll want to code is the first method that
I'll outline. It is easy to understand and program, and very
efficient; its one drawback is a "hard coded" dependence on
the size k of subsets. Two more methods are then outlined for the
sake of completeness which overcome that shortcoming.
We will assume that the selectable items themselves can be indexed,
e.g. through an array of "objects", so that our problem can without
much indirection be reduced to one of selecting k-subsets of {1,...,n}
(or for the C purists, of {0,...,n-1}).
The first method that I'd like to discuss may be the closest to what
you are already doing, since it involves a nested set of loops. Now
this method, because it involves a fixed number of loops corresponding
to the cardinality k of the subsets, has a disadvantage in terms of
generality. However the range of values for k which you cited is
narrow (k = 5 to 10), so this disadvantage may be a bit academic. One
can easily write this sort of code six times for the given values of k
and be done.
I will try to express the algorithms in a suitably intuitive
"pseudocode", but if a construction is unclear, please let me know and
I'll be happy to provide further clarification.
Suppose that we have an objective function F(..) that applies to any
k-subset we select, and we want to evaluate this for every k-subset in
order to determine the largest value. Since this first method is "k
specific", we will assume that in fact F(..) takes k arguments,
although the value returned is independent of the order of these
arguments (as a set is independent of any order of its elements). [If
your notion of "best weighted subset" is a minimum rather than a
maximum, then this change in perspective is easily accommodated, e.g.
by taking the minimum over the values found rather than the maximum as
here, or by exchanging -F for F in the code.]
So "nest" k for-loops in this manner (ik' denotes the predecessor variable to ik):
0.   Let max := F(1,2,..,k).
1.   for i1 := 1 to n-k+1
2.     for i2 := i1+1 to n-k+2
           . . .
k.         for ik := ik'+1 to n
k+1.         Let new := F(i1,i2,...,ik',ik).
k+2.         if new > max then max := new
           next ik
           . . .
       next i2
     next i1
What this approach does is to enumerate the k-subsets of {1,...,n} in
so-called lexicographic ordering (ascending) on the sorted
representations i1,i2,..,ik of these subsets. The arrangement of
nested index ranges guarantees that these arguments will always be
generated in ascending order and that every possible combination is
generated.
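To make the nested-loop idea concrete, here is a sketch in C for the fixed case k = 3. The objective function F shown (a simple sum of the chosen indices) is only a placeholder for your weighting function, and the name max_over_3subsets is my own invention for illustration:

```c
/* Placeholder objective function for illustration: the sum of the
 * three chosen indices.  Your weighting function goes here. */
static double F(int i1, int i2, int i3)
{
    return (double)(i1 + i2 + i3);
}

/* Enumerate all 3-subsets of {1,...,n} in ascending lexicographic
 * order and return the largest value of F found. */
double max_over_3subsets(int n)
{
    double max = F(1, 2, 3);      /* step 0: the first subset {1,2,3} */
    for (int i1 = 1; i1 <= n - 2; i1++)
        for (int i2 = i1 + 1; i2 <= n - 1; i2++)
            for (int i3 = i2 + 1; i3 <= n; i3++) {
                double val = F(i1, i2, i3);
                if (val > max)
                    max = val;
            }
    return max;
}
```

For your range k = 5 to 10 you would write the analogous block once per value of k, as noted above.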
Aside from the necessity of committing to a specific value of k, this
method is attractive because it is simple to code and efficient. If
the looping is done in C (or even better, in hand coded assembly!),
then it achieves the purpose of enumeration with quite low overhead.
In fact for the given counts 5 <= k <= 10, the overhead of the looping
is certainly comparable to the overhead of the inner function call
(meaning just the stack operations, not the execution thereof), so
that your CPU cycles will be concentrated on the evaluation of the
objective function.
The second method that I will outline uses function recursion to
enumerate the k-subsets. Here depth of recursion takes the place of
loop nesting to order the distinct items selected. It is less
efficient in most architectures to do a function call than to
increment a loop variable, as the stack frame must be pushed and
popped, but this may be negligible in comparison to the evaluation of
your objective function. The nice thing is that the size k of subsets
need not be "hard coded", because it can be represented instead by a
function argument. It can be a little difficult to grasp how
recursion takes the place of looping though.
Many such implementations are possible, but let me throw out some
ideas. At the deepest level we will need to access all the items
chosen at outer levels of recursion (so that the objective function
can be evaluated there). This can be accomplished either by a
"static" array in the given function (or "global" array, visible there
and elsewhere in the program) or by passing an array of selections
along as a function argument. [I'm describing the structure as an
array for simplicity, but it could be any of various "collection
class" datatypes, e.g. lists or queues.]
In this approach we might have a top level call to function pick(..)
with one argument that specifies the number of items to choose and
another that specifies the items chosen. For notational convenience
I'll use a Prolog-like syntax for lists to represent the array of
items chosen. Thus, at the top level we would invoke:
pick(k,[])
The function pick(..) will then (repeatedly) call itself, decrementing
the value of k by 1 and adjoining a selected element (greater than any
chosen so far by "outer" calls), until by recursion one reaches the
inner call:
pick(0,[i1,i2,...,ik])
At that point we are ready to call the objective function, written to
accept the array or list as an argument:
F([i1,i2,...,ik])
The inner invocation of pick(..) must also keep track of the maximum
value returned by an objective function call, and to do this a global
variable "max" probably makes the most sense (though it is possible to
do it with "output" arguments for the function pick as well). The
function pick(..) itself does not return (from a given level) until it
has "tried" all the possible items to adjoin to list/array passed in
to it.
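Under the same placeholder objective (a sum of the chosen indices), the recursive scheme might be sketched in C as follows. The globals, the bound MAXK, and the names pick and best_k_subset are illustrative choices of mine, not part of any library:

```c
#include <float.h>

#define MAXK 10

static int n_items;               /* number of items to choose from */
static int chosen[MAXK];          /* indices chosen at outer recursion levels */
static double max_val;            /* best objective value found so far */

/* Placeholder objective function: the sum of the chosen indices,
 * received as an array so that k need not be hard coded. */
static double F(const int *s, int k)
{
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += s[i];
    return sum;
}

/* pick(remaining, depth): adjoin `remaining` more items, each greater
 * than the last one placed in chosen[], and evaluate F at the bottom. */
static void pick(int remaining, int depth)
{
    if (remaining == 0) {                  /* inner call: a full k-subset */
        double v = F(chosen, depth);
        if (v > max_val)
            max_val = v;
        return;
    }
    int lo = (depth == 0) ? 1 : chosen[depth - 1] + 1;
    for (int i = lo; i <= n_items - remaining + 1; i++) {
        chosen[depth] = i;
        pick(remaining - 1, depth + 1);
    }
}

/* Top-level wrapper: the invocation pick(k,[]) from the text. */
double best_k_subset(int n, int k)
{
    n_items = n;
    max_val = -DBL_MAX;
    pick(k, 0);
    return max_val;
}
```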
This approach does avoid the need to hard code the size k of the
subsets, but at a price of some loss in efficiency and (arguably) in
legibility of the code, since the logic for enumerating the subsets is
now "spread around" among coordinated calls to function pick(..).
The third method that I'll propose is an attempt to have the best of
both approaches by iterating a data structure directly. We can avoid the
overhead of function recursion on the one hand and avoid having to
hard code the size k of subsets on the other.
The characteristic of this data structure is that it represents any of
the lexicographically ordered k-subsets, and indeed has an associated
"next" operation that advances from one such subset to the successor
in this ordering. At a high level the algorithm can be written this
way:
Let s := {k,...,2,1}, max := F(s).
while s < {n,...,n-k+1}
begin
  s := next(s)
  max := greaterof(max, F(s))
end
I've written the elements of subset s "backwards" to suggest the
implementation for the operation "next", which for efficiency works
mostly on the "tail" of the selections. We can in fact implement the
data structure for s as an array of length k, so this can be done in
almost any programming language that supports arrays. Conceptually
though the array needs to be interpreted in context with the
values n (defining the range of possible selections) and k (the size
of the subset or "array length" essentially).
To implement the next(s) operation one starts at the end of an array
representing s:
[i1,i2,...,ik]
and locates the "farthest" index j between 1 and k such that there is
room to increment the selection in that slot.
To describe this a bit more clearly, once ik reaches n, it no longer has
"room to increment" because n is the largest possible value. So if we
want to move up in the ordering of subsets, we have to move
"backwards" in the index of items to ik' and ask ourselves whether ik'
has reached n-1. If it has, then there's no room to increment there,
and we must continue to back through the array's entries until we find
a place that can be incremented.
Ultimately we reach the beginning of the array, and if i1 = n-k+1,
then we are done; we've gone from subset [1,..,k] all the way to
subset [n-k+1,..,n], and there are no more subsets to check.
In the meantime when we do locate a position in the array where an
index can be incremented, we do so and reset all the remaining indexes
beyond that point to their successive values.
An illustration may help. Suppose that we are trying to enumerate the
3-subsets of 9 items. If the current s is represented by [1,5,9],
then when we check the final index we find it cannot be incremented
further. Backing up one entry, we see that 5 < 8, so we still have
room there. Hence:
next([1,5,9]) = [1,6,7]
The following iteration will discover that there is now room in the
last entry to increment, and thus the next operation will produce
[1,6,8]. And so on.
A skillful implementation of the data structure iteration can be more
efficient in a high level language than the nesting for-loops
approach, though this requires careful attention to the details. The
"high level" code sketched above would be only a crude road map to
such an implementation. The efficiency comes from needing "most of
the time" only to check the final entry in the array/subset and
increment that one, so that as a fraction of operations it is like
spending most of the time in the innermost nested loop. However, to
make the implementation competitive, one must combine the "test" for
s < {n,...,n-k+1} with the "work" of discovering the furthest index
that has room to increment. When one digs into coding this, one
discovers that the combination is a natural consequence of computing
such an index j: while we have not yet reached the end of the
enumeration, j will lie between k and 1, but when we finish, this
manifests itself in "computing" the index j = 0.
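Putting the pieces together, the next(s) operation and the driving loop might be sketched in C like this. The array s is 0-based, holding i1,...,ik in s[0..k-1]; returning the incremented position j, with j = 0 signalling the end, combines the "test" and the "work" as just described. The objective F is again a placeholder sum, and next_subset and best_subset_value are names of my own choosing:

```c
/* Placeholder objective: the sum of the selected indices. */
static double F(const int *s, int k)
{
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += s[i];
    return sum;
}

/* Advance s[0..k-1], a sorted k-subset of {1,...,n}, to its
 * lexicographic successor in place.  Returns the (1-based) position j
 * that was incremented, or 0 when s is already the last subset
 * {n-k+1,...,n} -- so the "are we done?" test falls out of the same
 * backward scan that locates j. */
int next_subset(int *s, int k, int n)
{
    int j = k;
    /* Back up past entries with no room to increment:
     * slot j can hold at most n - k + j. */
    while (j >= 1 && s[j - 1] == n - k + j)
        j--;
    if (j == 0)
        return 0;                 /* enumeration is finished */
    s[j - 1]++;
    /* Reset the entries beyond position j to their successive values. */
    for (int i = j; i < k; i++)
        s[i] = s[i - 1] + 1;
    return j;
}

/* The high-level loop sketched earlier, for k up to 32. */
double best_subset_value(int n, int k)
{
    int s[32];
    for (int i = 0; i < k; i++)
        s[i] = i + 1;             /* start at the first subset {1,...,k} */
    double max = F(s, k);
    while (next_subset(s, k, n)) {
        double v = F(s, k);
        if (v > max)
            max = v;
    }
    return max;
}
```

On the example from the text, next_subset applied to [1,5,9] with n = 9 backs up past the 9, increments 5 to 6, and resets the tail, yielding [1,6,7] just as illustrated.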
If you need any clarification of the algorithmic ideas here, I would
be happy to provide it as a Clarification to this Question.
regards, mathtalk-ga