Hi, stephen12-ga:
While I'm not entirely satisfied with the presentation of ideas below,
in the interest of expedience I will outline what I believe to be
polynomial-time algorithms for the minimum R and maximum R problems.
I will shortly add some further details on the computation of expected
R as well, and I'll be interested in doing some computations with
larger problems as a check on my conjecture.
Alternative Descriptions of the Problem Data
============================================
The number of distinct ball-stick color combinations can be stated in
terms familiar to a statistician. The counts of ball color vs. stick
color become the entries of what would be called a two-way contingency
table, and the given number p_i of balls of one color (resp. q_j of
sticks of one color) becomes a "fixed marginal total" for a row (resp.
column) of that table.
Another description of the problem data can be given in terms of a
bipartite multigraph, with the ball colors being one set of the
vertices and the stick colors the disjoint complement, and
combinations of the two kinds of colors being edges of the graph.
Restrictions on the available colors then correspond to restrictions
on the degrees of the respective vertices.
The former description seems to be helpful in visualizing the search
for a solution with minimum R, while the latter is a natural choice
for seeking maximum R because the count of distinct outcomes
corresponds to counting the edges of the underlying simple graph.
Let's begin, however, with the "contingency table" description. More
formally, we have an s by t table of nonnegative integers with row
sums:
p_1 + p_2 + ... + p_s = N
and column sums:
q_1 + q_2 + ... + q_t = N
We define the "test statistic" R to be the number of nonzero cell entries.
The exact distribution of R, assuming balls and sticks are randomly
paired, e.g. from a pair of proverbial urns, is likely to be a complex
calculation. One can of course calculate the distribution of any such
test statistic by exhaustive enumeration of the possible contingency
tables, an enterprise which is taken up in this lengthy doctoral
dissertation:
[Exact Tests via Complete Enumeration: A Distributed Computing Approach]
(Ph.D. dissertation by Danius Takis Michaelides, 1997)
http://eprints.ecs.soton.ac.uk/749/04/thesis.ps
Test statistic R is quite different from the usual test statistics X
and G that are used to test "independence" of categorical variables
(see above for definitions), both of which assume chi-squared
distributions for "sufficiently large" sample sizes. However there is
an important connection to be pointed out, in that the use of the
chi-square approximation (rather than Fisher's Exact Test, which
corresponds to the exactly determined probabilities) depends on having
sufficiently many expected observations in each cell of the
contingency table, conventionally given as at least 5. Contingency
tables for which some cells have expected values below 5 are called
"sparse", so an analysis of how many cells may be expected to be zero
(or nonzero) is at least in this tangential way related to the usual
analysis of contingency tables.
Proposition and Conjecture
==========================
Prop. Let C = {c_ij} be an s by t contingency table with fixed marginal totals:
-----
t
SUM c_ij = p_i for i = 1,..,s
j=1
s
SUM c_ij = q_j for j = 1,..,t
i=1
and consider a subtable C', say of dimensions s' by t', whose own row
and column sums, p'_i and q'_j, depend on the entries inherited from
C.
(i) If C is arranged so that the number of nonzero entries R is as
small as possible for any contingency table with the same marginal
totals, then C' also has a minimal number of nonzero entries relative
to the row and column sum constraints inherited as a subtable of C.
(ii) If C is arranged so that the number of nonzero entries R is as
large as possible for any contingency table with the same marginal
totals, then C' also has a maximal number of nonzero entries relative
to the row and column sum constraints inherited as a subtable of C.
Sketch of proof:
----------------
Choosing a subset of rows and columns of C, the entries of the
corresponding subtable C' may be freely rearranged subject to the
conservation of its own row and column sums, without adjusting any
entries in C outside of C' or affecting the overall row and column
sums of C. Thus the optimality of C either with respect to minimum or
maximum R implies that the same must be true of any subtable C', for
otherwise the possibility of reducing or enlarging the number of
nonzero entries in C' would imply that possibility for C.
We will be concerned hereafter only with applying this principle for 2
by 2 subtables, so let me recognize those cases for minimum and
maximum R.
A 2 by 2 subtable with at least one nonzero entry in each row and each
column has min(R) equal to either 2 or 3 and max(R) equal to 2, 3, or
4.
More particularly:
min(R) = 2 if and only if the p's and q's are the same up to permutation
max(R) = 2 if and only if every p_i and q_j equals 1 (that is, N = 2)
max(R) = 3 if and only if some p_i = 1 or some q_j = 1, but N > 2
max(R) = 4 if and only if both p_i's and both q_j's are greater than 1
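These small cases are easy to confirm mechanically, since a 2 by 2
table is determined by its marginals together with the single entry
c_11. A brute-force sketch in Python (the function name is mine):

  def min_max_R_2x2(p1, p2, q1, q2):
      # sweep the one free entry c11 over its feasible range; the
      # other three cells are then forced by the marginal sums
      assert p1 + p2 == q1 + q2
      counts = []
      for c11 in range(max(0, q1 - p2), min(p1, q1) + 1):
          cells = (c11, p1 - c11, q1 - c11, p2 - q1 + c11)
          counts.append(sum(1 for c in cells if c != 0))
      return min(counts), max(counts)

  # e.g. min_max_R_2x2(1, 3, 1, 3) gives (2, 3)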
We have begun to omit discussion of the case when a row or column sum
is zero, since this implies that all entries in that row or column
resp. are zero. The only reason for not excluding it earlier is that
subtable C' may have a zero row or column sum despite the fact that
table C does not. However there is certainly nothing special to
optimize one way or the other if a row or column sum is zero; the
problems of min(R) and max(R) for such a table are equivalent to those
for a table with the offending row or column omitted.
Conjecture
----------
Let fixed marginal totals p_i and q_j be prescribed for an s by t
table. Subject to these constraints:
(i) Table C has minimum R if and only if every 2 by 2 subtable has
minimum number of nonzero entries (relative to its row & column sums
inherited from C).
(ii) Table C has maximum R if and only if every 2 by 2 subtable has
maximum number of nonzero entries (relative to its row & column sums
inherited from C).
The Conjecture is of course a strong converse to the Proposition,
which says that optimality of a table implies the corresponding
optimality in each 2 by 2 subtable. It's the sort of conjecture that
I would expect to fail, if it did, with fairly small counterexamples,
and thus I'm taking the absence of these in the less-than-exhaustive
analysis carried out so far as a pretty strong hint that it is true.
An illustration of this will turn up in discussing a method for
finding min(R) below.
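For anyone inclined to hunt for small counterexamples by machine, here
is a brute-force sketch in Python (all names are mine; it is
exponential in the table size, so only sensible for tiny s, t and N).
It enumerates every table with the given marginals and reports any
table violating either direction of the Conjecture:

  from itertools import combinations

  def rows_with_sum(total, caps):
      # all tuples of nonnegative ints summing to total, with entry j
      # bounded by caps[j] (the remaining column sums)
      if not caps:
          if total == 0:
              yield ()
          return
      for first in range(min(total, caps[0]) + 1):
          for rest in rows_with_sum(total - first, caps[1:]):
              yield (first,) + rest

  def all_tables(p, q):
      # every table with row sums p and column sums q
      if not p:
          if all(x == 0 for x in q):
              yield ()
          return
      for row in rows_with_sum(p[0], tuple(q)):
          rest_q = tuple(a - b for a, b in zip(q, row))
          for rest in all_tables(p[1:], rest_q):
              yield (row,) + rest

  def nnz(table):
      return sum(1 for row in table for c in row if c)

  def local_min_max(a, b, c, d):
      # min and max nonzero count over rearrangements of a 2x2 block
      # that preserve its own row and column sums
      r1, r2, c1 = a + b, c + d, a + c
      counts = [sum(1 for v in (x, r1 - x, c1 - x, r2 - c1 + x) if v)
                for x in range(max(0, c1 - r2), min(r1, c1) + 1)]
      return min(counts), max(counts)

  def locally_optimal(T, sense):
      # 'min': every 2x2 subtable attains its local minimum;
      # 'max': every 2x2 subtable attains its local maximum
      s, t = len(T), len(T[0])
      for i1, i2 in combinations(range(s), 2):
          for j1, j2 in combinations(range(t), 2):
              a, b, c, d = T[i1][j1], T[i1][j2], T[i2][j1], T[i2][j2]
              cur = sum(1 for v in (a, b, c, d) if v)
              lo, hi = local_min_max(a, b, c, d)
              if cur != (lo if sense == 'min' else hi):
                  return False
      return True

  def conjecture_violations(p, q):
      ts = list(all_tables(tuple(p), tuple(q)))
      r_min, r_max = min(map(nnz, ts)), max(map(nnz, ts))
      return [T for T in ts
              if (nnz(T) == r_min) != locally_optimal(T, 'min')
              or (nnz(T) == r_max) != locally_optimal(T, 'max')]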
In principle the Conjecture implies we can start with an arbitrary
contingency table C having the prescribed marginal totals, and proceed
to optimize either for min(R) or max(R) by incremental modifications
to its 2 by 2 subtables. Since each modification must either decrease
or increase R by at least 1, and since trivially:
s, t ≤ R ≤ s*t
the number of such steps is polynomial in s and t. (Perhaps some
discussion is needed about whether these are the right criteria for
polynomial complexity; we will return to the topic later.)
Then we need only consider the complexity of searching a table for 2
by 2 subtables that are not optimal. As is clear from the details of
min(R) and max(R) given above for these small subtables, it suffices
to restrict attention to 2 by 2 subtables with at least two nonzero
entries. Even without this restriction the number of 2 by 2 subtables
is a simple polynomial in s and t:
# of 2 by 2 subtables of C = s*(s-1)*t*(t-1)/4
Note that a scanning step is "expensive", since the degree of this
polynomial is 4. It is therefore of practical value to find starting
configurations that are close to optimal, so as to reduce the number
of such steps.
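To make the intended scan-and-improve procedure concrete, here is a
sketch for the min(R) direction in Python (the max(R) direction is the
same with the comparison reversed; the names and the particular way of
realizing a 2 by 2 modification are mine):

  from itertools import combinations

  def improve_toward_min_R(table):
      # repeatedly look for a 2x2 subtable whose entries can be
      # rearranged (keeping its own row and column sums) so as to lose
      # a nonzero entry, apply that rearrangement, and stop only when
      # no such subtable remains
      T = [list(row) for row in table]
      s, t = len(T), len(T[0])
      improved = True
      while improved:
          improved = False
          for i1, i2 in combinations(range(s), 2):
              for j1, j2 in combinations(range(t), 2):
                  a, b, c, d = T[i1][j1], T[i1][j2], T[i2][j1], T[i2][j2]
                  r1, r2, c1 = a + b, c + d, a + c
                  cur = sum(1 for v in (a, b, c, d) if v)
                  for x in range(max(0, c1 - r2), min(r1, c1) + 1):
                      cells = (x, r1 - x, c1 - x, r2 - c1 + x)
                      if sum(1 for v in cells if v) < cur:
                          (T[i1][j1], T[i1][j2],
                           T[i2][j1], T[i2][j2]) = cells
                          improved = True
                          break
      return T

Each full scan costs on the order of s^2 * t^2 subtable checks, and
each successful modification lowers R by at least 1, so (granting the
Conjecture) the whole loop runs in time polynomial in s and t.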
With these preliminaries behind us, we are ready for the first
algorithm, which seeks a minimum value for R. We begin by
constructing a table C with the required row and column sums and a
"generically minimum" R as is to be explained afterwards.
Method I
========
Let s,t ≥ 1 be given together with positive summands p_i, q_j such
that both finite series sum to N, as described earlier under Problem
Data:
p_1 + p_2 + ... + p_s = N
q_1 + q_2 + ... + q_t = N
If s,t = 1, then p_1 = q_1 = N is to be assigned to c_11, and the
algorithm terminates.
Otherwise compare p_1 and q_1 and proceed by:
(Case 1) p_1 > q_1
Set the first column of C to (q_1,0,...,0)' and assign entries in the
remaining (t-1) columns by recursively applying the method to the
reduced series:
(p_1 - q_1) + p_2 + ... + p_s = N - q_1
q_2 + ... + q_t = N - q_1
(Case 2) p_1 = q_1
Set the first row and first column of C to zeroes except c_11 = p_1 =
q_1, and assign entries in the remaining (s-1) by (t-1) subtable by
applying this method to the reduced series:
p_2 + ... + p_s = N - p_1
q_2 + ... + q_t = N - q_1
(Case 3) p_1 < q_1
Set the first row of C to (p_1,0,...,0) and assign entries in the
remaining (s-1) rows by recursively applying the method to the reduced
series:
p_2 + ... + p_s = N - p_1
(q_1 - p_1) + q_2 + ... + q_t = N - p_1
--------------------------
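For definiteness, here is Method I transcribed into Python
(iteratively rather than recursively, but following Cases 1-3 exactly;
as above it assumes all p_i and q_j are positive and that the two
series have equal sums):

  def method_I(p, q):
      # build a table with row sums p and column sums q; the indices
      # i, j point at the leading (still unsatisfied) row and column
      # summands of the current reduced problem
      s, t = len(p), len(q)
      C = [[0] * t for _ in range(s)]
      p, q = list(p), list(q)                 # working copies
      i = j = 0
      while not (i == s - 1 and j == t - 1):
          if p[i] > q[j]:                     # Case 1: column j is finished
              C[i][j] = q[j]
              p[i] -= q[j]
              j += 1
          elif p[i] < q[j]:                   # Case 3: row i is finished
              C[i][j] = p[i]
              q[j] -= p[i]
              i += 1
          else:                               # Case 2: both finished at once
              C[i][j] = p[i]
              i += 1
              j += 1
      C[i][j] = p[i]                          # termination: here p_i = q_j
      return C

For p = (5, 2, 2, 2) and q = (4, 3, 3, 1) this call reproduces the
table displayed in the example below, with R = 6 nonzero entries.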
It is not hard to verify that each step of Method I prior to
termination defines a new problem of the same form and of reduced
dimension s or t or both by 1. So the method requires at most (s-1) +
(t-1) recursive steps before reaching the termination step. Each step
including the termination step introduces a single nonzero entry to
the final constructed C, so this method constructively gives:
min(R) ≤ s + t - 1
The actual R obtained by this method falls below this bound by one for
each occurrence of Case 2 prior to the termination step. In fact, if
the termination step (where p_1 = q_1 necessarily holds) is counted as
an instance of Case 2, then:
R = s + t - (# of times Case 2 occurs)
We illustrate these ideas by recalling an example cited earlier:
p_1 = 5, p_2 = p_3 = p_4 = 2
q_1 = 4, q_2 = q_3 = 3, q_4 = 1
Executing Method I on the p's and q's so ordered produces this:
4 1 0 0
0 2 0 0
0 0 2 0
0 0 1 1
Case 2 occurs once prior to the termination step, because p_1 + p_2 =
q_1 + q_2, so that R = s + t - 2 in this instance. Determining a
priori the arrangements of p's and q's that will maximize the number
of times that Case 2 occurs is a
daunting task. While one can always safely pair off any single terms
p_i = q_j before proceeding to the optimization of the remaining
summands' order, finding the optimum groupings of terms is probably
NP-hard.
However if the earlier Conjecture is true, we can simply use the
output of Method I as a convenient starting point for a final
reduction of R. Suppose that the summands were instead taken in this
order:
p_1 = p_2 = 2, p_3 = 5, p_4 = 2
q_1 = q_2 = 3, q_3 = 1, q_4 = 4
The resulting table has the "generic" value R = s + t - 1 = 7:
2 0 0 0
1 1 0 0
0 2 1 2
0 0 0 2
because that ordering of p's and q's did not occasion any Case 2 prior
to the termination step.
But with a little effort we can spot a 2 by 2 subtable in the last two
rows, second and fourth columns, that cries out for minimization:
2 2                     0 4
0 2     modifies to     2 0
thereby introducing one additional zero entry:
2 0 0 0
1 1 0 0
0 0 1 4
0 2 0 0
and min(R) = 6, as appears from the absence of any further 2 by 2
subtable that is not minimal with regard to R.
This construction shows that s + t - 1 is an upper bound for min(R),
and one that is generically sharp: treating the p's and q's as free
parameters, it is always possible to find s by t examples in which no
Case 2 occurs prior to the termination step.
Description in terms of bipartite graphs
========================================
Next we present an algorithm which seeks the maximum value of R. It
will be convenient at this point to switch to using the bipartite
graph terminology mentioned briefly above.
Our point of view will be to treat each color of balls as belonging to
one set of nodes S and each color of sticks to a disjoint set of nodes
T.
Note that |S| = s and |T| = t.
For each node i in S we define a "valence" p_i which is the maximum
number of edges (or multi-edges, counted according to multiplicity)
that can be drawn to this node. Similarly for node j in T we have
valence q_j.
Provided that SUM p_i = SUM q_j as our problem definition guarantees,
we can always construct a bipartite multigraph which attains the
prescribed valences. However, from the perspective of maximizing the
number of distinct outcomes R, the only thing that counts is the
number of edges in the underlying simple graph. Therefore we can
identify max(R) with the maximum number of edges in a bipartite simple
graph that satisfies the valence (maximum degree) restrictions above,
noting that from such a simple graph one can always attain the
prescribed degrees by adding multi-edges by turns until all counts p_i
and q_j are exhausted.
Method II
=========
Let p_1 ≥ p_2 ≥ ... ≥ p_s.
For i = 1 to s, consider node i in S:
Draw edges from node i in S to nodes j in T having q_j > 0,
up to the limit min(p_i, # of q_j's > 0); if p_i is less than
the number of positive q_j's, choose those nodes j where
q_j is largest.
Decrement q_j by 1 for each edge drawn from i to j.
--------------------------
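In Python the same procedure might read as follows (a sketch; ties
among equally large remaining q_j are broken by index, which is how
the worked example below proceeds):

  def method_II(p, q):
      # process ball colors in descending order of p_i; give color i
      # edges to min(p_i, number of remaining positive q_j) stick
      # colors, preferring those j with the largest remaining q_j,
      # and decrement each q_j that receives an edge
      order = sorted(range(len(p)), key=lambda i: -p[i])
      q = list(q)                             # working copy
      edges = []
      for i in order:
          positive = [j for j in range(len(q)) if q[j] > 0]
          chosen = sorted(positive, key=lambda j: -q[j])
          for j in chosen[:min(p[i], len(positive))]:
              edges.append((i, j))
              q[j] -= 1
      return edges                            # R = len(edges)

For p = (5, 2, 2, 2) and q = (4, 3, 3, 1) this draws the 10 edges
traced step by step below.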
The collection of (simple) edges so drawn corresponds to a set of
distinct outcomes in the ball/stick pairing, or to nonzero entries in
a contingency table C as in the earlier treatment.
Note that the values q_j are modified during the algorithm, so that it
might be helpful to write q_j(i-1) for the value of q_j at the
beginning of step i and q_j(i) for the value at the end of step i.
The value of R constructed in this fashion could then be expressed as:
s
SUM min(p_i, (# of q_j(i-1) > 0) )
i=1
Again we illustrate the ideas by reference to the earlier example.
Note that we here impose a descending order on the p's:
Step 1: p_1 = 5 and q(0) = (4,3,3,1)
Draw an edge from the first node in S to each node in T.
Decrement each valence q accordingly.
Step 2: p_2 = 2 and q(1) = (3,2,2,0)
Draw an edge from the second node in S to the first two in T.
Decrement those two valences in q accordingly.
Step 3: p_3 = 2 and q(2) = (2,1,2,0)
Draw an edge from the third node in S to the first and third in T.
Decrement those two valences in q accordingly.
Step 4: p_4 = 2 and q(3) = (1,1,1,0)
Draw an edge from the fourth node in S to the first two in T.
Decrement those two valences in q accordingly.
By the end of Method II we have constructed 10 = 4+2+2+2 distinct
"outcomes" or edges in the bipartite graph G. There was one
unsatisfied valence of p_1 remaining from Step 1, and the final vector
q(4) = (0,0,1,0), so that we could complete the construction of
contingency table C (or the pairings of balls and sticks) by adding a
multi-edge from node 1 in S to node 3 in T, an edge that by
construction duplicates one already added in Step 1.
That is, prior to adding that final edge we would have this partially
populated contingency table corresponding to the simple graph G:
1 1 1 1
1 1 0 0
1 0 1 0
1 1 0 0
and after adding the multi-edge it becomes:
1 1 2 1
1 1 0 0
1 0 1 0
1 1 0 0
which may be seen to satisfy all the required row and column sums.
Furthermore, scanning this table for 2 by 2 subtables that are
suboptimal in the sense of maximizing R turns up none.
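The finished table is also easy to check mechanically; the short
sketch below merely re-confirms the marginal sums and recounts the
nonzero cells:

  C = [[1, 1, 2, 1],
       [1, 1, 0, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 0]]
  p, q = [5, 2, 2, 2], [4, 3, 3, 1]

  assert [sum(row) for row in C] == p            # row sums are the p_i
  assert [sum(col) for col in zip(*C)] == q      # column sums are the q_j
  R = sum(1 for row in C for cell in row if cell)  # R = 10 distinct outcomes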
Note on complexity measurement
==============================
It is easy to become careless about claims of "polynomial time" where
the size of the problem is not simply defined. Above I
referenced polynomials in s and t as a justification for polynomial
time, but I would invite critical consideration of this as the basis
for measuring the size of the input.
To express numbers s and t requires only log(s) + log(t) space. My
proposal to use s and t themselves as the measurements of problem size
is therefore based on an additional consideration, namely the
(effectively nonzero) summands p_i and q_j that are also required by
the problem formulation. Given that at least a bit is required for
the expression of each such summand, we are able to argue that the
problem size is (up to some positive constant multiplier) at least s +
t:
s ≤ SUM (1 + log(p_i))
t ≤ SUM (1 + log(q_j))
and the right-hand sides are, up to a constant multiplier, the space
needed to write down the p's and the q's respectively.
The arithmetic required by Methods I and II above is minimal, though
subtraction is called for. Therefore it is helpful to anticipate that
in a careful analysis the cost of such arithmetic will also entail
terms like log(p_i) and log(q_j).
regards, mathtalk-ga