Subject: a probability/combinatorics question
Category: Science > Math
Asked by: stephen12-ga
List Price: $50.00
Posted: 25 May 2005 12:32 PDT
Expires: 24 Jun 2005 12:32 PDT
Question ID: 525535
```
Hi there, I am looking for a solution to the following problem:

Suppose we have N balls as well as N sticks. Each ball takes on one of the S available colors, with the number of balls having the i-th color being p_i (i = 1, 2, ..., S). Clearly, p_1 + p_2 + ... + p_S = N. The balls having the same color are not distinguishable. The same applies for the N sticks: each stick takes on one of the T available colors, with the number of sticks having the j-th color being q_j (j = 1, 2, ..., T; q_1 + ... + q_T = N). Sticks having the same color are not distinguishable, either.

Now if we pair those balls and sticks up and form N "ball-stick" pairs, what is the expected number of distinct pairs? And what is the probability that there are r distinct pairs?

Thanks. A quick answer is highly appreciated.
```

Request for Question Clarification by mathtalk-ga on 27 May 2005 11:39 PDT

```
Hi, stephen12-ga:

I gave some thought to your Question and posted a Comment below outlining a computational approach. If you are only interested in an analytic result (one that would take all the p's and q's into consideration), then this would be of little interest to you. However, if you have a practical application in mind for which efficient and accurate calculations are useful, I could post as an Answer a more detailed outline of the coding, together with some optimizations of the procedure.

regards, mathtalk-ga
```

Clarification of Question by stephen12-ga on 27 May 2005 11:58 PDT

```
Hi mathtalk,

Although I was initially looking for an analytic result, an efficient (polynomial-time) algorithm will also suffice. Please post it as an Answer if you have such an algorithm.

Thanks, stephen
```
 Subject: Re: a probability/combinatorics question Answered By: mathtalk-ga on 05 Jun 2005 17:47 PDT
```
Hi, stephen12-ga:

While I'm not entirely satisfied with the presentation of ideas below, in the interest of expedience I will outline what I believe to be polynomial-time algorithms for the minimum R and maximum R problems. I will shortly add some further details on the computation of expected R as well, and I'll be interested in doing some computations with larger problems as a check on my conjecture.

Alternative Descriptions of the Problem Data
============================================

The number of distinct ball-stick color combinations can be stated in terms familiar to a statistician. The counts of ball color vs. stick color become the entries of what would be called a two-way contingency table, and the given number p_i of balls of one color (resp. q_j of sticks of one color) becomes a "fixed marginal total" for a row (resp. column) of that table.

Another description of the problem data can be given in terms of a bipartite multigraph, with the ball colors being one set of the vertices and the stick colors the disjoint complement, and combinations of the two kinds of colors being edges of the graph. Restrictions on the available colors then correspond to restrictions on the degrees of the respective vertices.

The former description seems to be helpful in visualizing the search for a solution with minimum R, while the latter is a natural choice for seeking maximum R, because the count of distinct outcomes corresponds to counting the edges of the underlying simple graph.

Let's begin, however, with the "contingency table" description. More formally, we have an s by t table of nonnegative integers with row sums:

    p_1 + p_2 + ... + p_s = N

and column sums:

    q_1 + q_2 + ... + q_t = N

We define the "test statistic" R to be the number of nonzero cell entries. The exact distribution of R, assuming balls and sticks are randomly paired, e.g. from a pair of proverbial urns, is likely to be a complex calculation.
One can of course calculate the distribution of any such test statistic by exhaustive enumeration of the possible contingency tables, an enterprise which is taken up in this lengthy doctoral dissertation:

[Exact Tests via Complete Enumeration: A Distributed Computing Approach]
(Ph.D. dissertation by Danius Takis Michaelides, 1997)
http://eprints.ecs.soton.ac.uk/749/04/thesis.ps

Test statistic R is quite different from the usual test statistics X and G that are used to test "independence" of categorical variables (see the dissertation above for definitions), both of which assume chi-squared distributions for "sufficiently large" sample sizes. However, there is an important connection to be pointed out: the use of the chi-square approximation (rather than Fisher's Exact Test, which corresponds to the exactly determined probabilities) depends on having sufficiently many expected observations in each cell of the contingency table, conventionally given as at least 5. Contingency tables for which some cells have expected values below 5 are called "sparse", so an analysis of how many cells may be expected to be zero (or nonzero) is at least in this tangential way related to the usual analysis of contingency tables.

Proposition and Conjecture
==========================

Prop. Let C = {c_ij} be an s by t contingency table with fixed marginal totals:
-----
     t
    SUM c_ij = p_i   for i = 1,..,s
    j=1

     s
    SUM c_ij = q_j   for j = 1,..,t
    i=1

and consider a subtable C', say of dimensions s' by t', whose own row and column sums, p'_i and q'_j, depend on the entries inherited from C.

(i) If C is arranged so that the number of nonzero entries R is as small as possible for any contingency table with the same marginal totals, then C' also has a minimal number of nonzero entries relative to the row and column sum constraints inherited as a subtable of C.
(ii) If C is arranged so that the number of nonzero entries R is as large as possible for any contingency table with the same marginal totals, then C' also has a maximal number of nonzero entries relative to the row and column sum constraints inherited as a subtable of C.

Sketch of proof:
----------------

Choosing a subset of rows and columns of C, the entries of the corresponding subtable C' may be freely rearranged subject to the conservation of its own row and column sums, without adjusting any entries in C outside of C' or affecting the overall row and column sums of C. Thus the optimality of C with respect to either minimum or maximum R implies that the same must be true of any subtable C', for otherwise the possibility of reducing or enlarging the number of nonzero entries in C' would imply that possibility for C.

We will be concerned hereafter only with applying this principle to 2 by 2 subtables, so let me catalog those cases for minimum and maximum R. A 2 by 2 subtable with at least one nonzero entry in each row and each column has min(R) equal to either 2 or 3 and max(R) equal to 2, 3, or 4. More particularly:

    min(R) = 2 if and only if the p's and q's are the same up to permutation
    max(R) = 2 if and only if some p_i = q_j = 1
    max(R) = 3 if and only if some p_i = 1 or some q_j = 1, but not both
    max(R) = 4 if and only if both p_i's and both q_j's are greater than 1

We have begun to omit discussion of the case when a row or column sum is zero, since this implies that all entries in that row or column, respectively, are zero. The only reason for not excluding it earlier is that subtable C' may have a zero row or column sum despite the fact that table C does not. However, there is certainly nothing special to optimize one way or the other if a row or column sum is zero; the problems of min(R) and max(R) for such a table are equivalent to those for a table with the offending row or column omitted.
Conjecture
----------

Let fixed marginal totals p_i and q_j be prescribed for an s by t table. Subject to these constraints:

(i) Table C has minimum R if and only if every 2 by 2 subtable has the minimum number of nonzero entries (relative to its row & column sums inherited from C).

(ii) Table C has maximum R if and only if every 2 by 2 subtable has the maximum number of nonzero entries (relative to its row & column sums inherited from C).

The Conjecture is of course a strong converse to the Proposition, which says that optimality of a table implies the corresponding optimality in each 2 by 2 subtable. It's the sort of conjecture that I would expect to fail, if it did, with fairly small counterexamples, and thus I'm taking the absence of these in the less-than-exhaustive analysis carried out so far as a pretty strong hint that it is true. An illustration of this will turn up in discussing a method for finding min(R) below.

In principle the Conjecture implies we can start with an arbitrary contingency table C having the prescribed marginal totals, and proceed to optimize either for min(R) or max(R) by incremental modifications to its 2 by 2 subtables. Since each modification must either decrease or increase R by at least 1, and since trivially:

    s, t ≤ R ≤ s*t

the number of such steps is polynomial in s and t. (Perhaps some discussion is needed about whether these are the right criteria for polynomial complexity; we will return to the topic later.)

Then we need only consider the complexity of searching a table for 2 by 2 subtables that are not optimal. As is clear from the details of min(R) and max(R) given above for these small subtables, it suffices to restrict attention to 2 by 2 subtables with at least two nonzero entries. Even without this restriction the number of 2 by 2 subtables is a simple polynomial in s and t:

    # of 2 by 2 subtables of C = s*(s-1)*t*(t-1)/4

Note that a scanning step is "expensive", as the degree of this polynomial is 4.
It is therefore of practical value to find starting configurations that are close to optimal, to reduce the number of such steps. With these preliminaries behind us, we are ready for the first algorithm, which seeks a minimum value for R. We begin by constructing a table C with the required row and column sums and a "generically minimum" R, as is to be explained afterwards.

Method I
========

Let s, t ≥ 1 be given together with positive summands p_i, q_j such that both finite series sum to N, as described earlier under Problem Data:

    p_1 + p_2 + ... + p_s = N
    q_1 + q_2 + ... + q_t = N

If s = t = 1, then p_1 = q_1 = N is to be assigned to c_11, and the algorithm terminates. Otherwise compare p_1 and q_1 and proceed by:

(Case 1) p_1 > q_1

Set the first column of C to (q_1,0,...,0)' and assign entries in the remaining (t-1) columns by recursively applying the method to the reduced series:

    (p_1 - q_1) + p_2 + ... + p_s = N - q_1
    q_2 + ... + q_t = N - q_1

(Case 2) p_1 = q_1

Set the first row and first column of C to zeroes except c_11 = p_1 = q_1, and assign entries in the remaining (s-1) by (t-1) subtable by applying this method to the reduced series:

    p_2 + ... + p_s = N - p_1
    q_2 + ... + q_t = N - q_1

(Case 3) p_1 < q_1

Set the first row of C to (p_1,0,...,0) and assign entries in the remaining (s-1) rows by recursively applying the method to the reduced series:

    p_2 + ... + p_s = N - p_1
    (q_1 - p_1) + q_2 + ... + q_t = N - p_1

--------------------------

It is not hard to verify that each step of Method I prior to termination defines a new problem of the same form, with dimension s or t or both reduced by 1. So the method requires at most (s-1) + (t-1) recursive steps before reaching the termination step. Each step, including the termination step, introduces a single nonzero entry to the final constructed C, so this method constructively gives:

    min(R) ≤ s + t - 1

The actual R obtained by this method is potentially less than this bound by just so many times as Case 2 occurs during its execution. In fact, if the termination step where p_1 = q_1 also holds is counted as an instance of Case 2, then:

    R = s + t - (# of times Case 2 occurs)

We illustrate these ideas by recalling an example cited earlier:

    p_1 = 5, p_2 = p_3 = p_4 = 2
    q_1 = 4, q_2 = q_3 = 3, q_4 = 1

Executing Method I on the p's and q's so ordered produces this:

    4 1 0 0
    0 2 0 0
    0 0 2 0
    0 0 1 1

Counting the termination step, Case 2 occurs twice here (the interior occurrence arises because p_1 + p_2 = q_1 + q_2), so that R = s + t - 2 in this instance.

Determining a priori the arrangements of p's and q's that will maximize the number of times that Case 2 occurs is a daunting task. While one can always safely pair off any single terms p_i = q_j before proceeding to the optimization of the remaining summands' order, finding the optimum groupings of terms is probably NP-hard. However, if the earlier Conjecture is true, we can simply use the output of Method I as a convenient starting point for a final reduction of R.

Suppose that the summands were instead taken in this order:

    p_1 = p_2 = 2, p_3 = 5, p_4 = 2
    q_1 = q_2 = 3, q_3 = 1, q_4 = 4

The resulting table has the "generic" value R = s + t - 1 = 7:

    2 0 0 0
    1 1 0 0
    0 2 1 2
    0 0 0 2

because that ordering of p's and q's did not occasion any Case 2 prior to the termination step. But with a little effort we can spot a 2 by 2 subtable in the last two rows, second and fourth columns, that cries out for minimization:

    2 2                 0 4
    0 2   modifies to   2 0

thereby introducing one additional zero entry:

    2 0 0 0
    1 1 0 0
    0 0 1 4
    0 2 0 0

and min(R) = 6, as appears from the absence of any further 2 by 2 subtable that is not minimal with regard to R.

This construction shows that s + t - 1 is an upper bound for min(R), and one that is generically sharp: treating the p's and q's as free parameters, it is always possible to find s by t examples where no Case 2 occurs prior to the termination step.
Description in terms of bipartite graphs
========================================

Next we present an algorithm which seeks the maximum value of R. It will be convenient at this point to switch to the bipartite graph terminology mentioned briefly above.

Our point of view will be to treat each color of balls as belonging to one set of nodes S and each color of sticks to a disjoint set of nodes T. Note that |S| = s and |T| = t. For each node i in S we define a "valence" p_i, which is the maximum number of edges (or multi-edges, counted according to multiplicity) that can be drawn to this node. Similarly, for node j in T we have valence q_j.

Provided that SUM p_i = SUM q_j, as our problem definition guarantees, we can always construct a bipartite multigraph which attains the prescribed valences. However, from the perspective of maximizing the number of distinct outcomes R, the only thing that counts is the number of simple edges underlying such a graph. Therefore we can identify max(R) with the maximum number of edges in a bipartite simple graph that satisfies the valence (maximum degree) restrictions above, noting that from such a simple graph one can always attain the prescribed degrees by adding multi-edges by turns until all counts p_i and q_j are exhausted.

Method II
=========

Let p_1 ≥ p_2 ≥ ... ≥ p_s. For i = 1 to s, consider node i in S:

Draw edges from node i in S to distinct nodes j in T with q_j > 0, up to the limit min(p_i, # of q_j's > 0); if p_i is less than the number of q_j's > 0, choose those nodes j where q_j is largest. Decrement q_j by 1 whenever an edge is drawn from i to j.

--------------------------

The collection of (simple) edges so drawn corresponds to a set of distinct outcomes in the ball/stick pairing, or to nonzero entries in a contingency table C as in the earlier treatment.
Note that the values q_j are modified during the algorithm, so that it might be helpful to write q_j(i-1) for the value of q_j at the beginning of step i and q_j(i) for the value at the end of step i. The value of R constructed in this fashion could then be expressed as:

     s
    SUM min( p_i, # of q_j(i-1) > 0 )
    i=1

Again we illustrate the ideas by reference to the earlier example. Note that we here impose a descending order on the p's:

Step 1: p_1 = 5 and q(0) = (4,3,3,1)
Draw an edge from the first node in S to each node in T. Decrement each valence q accordingly.

Step 2: p_2 = 2 and q(1) = (3,2,2,0)
Draw an edge from the second node in S to the first two in T. Decrement those two valences in q accordingly.

Step 3: p_3 = 2 and q(2) = (2,1,2,0)
Draw an edge from the third node in S to the first and third in T. Decrement those two valences in q accordingly.

Step 4: p_4 = 2 and q(3) = (1,1,1,0)
Draw an edge from the fourth node in S to the first two in T. Decrement those two valences in q accordingly.

By the end of Method II we have constructed 10 = 4+2+2+2 distinct "outcomes", or edges in the bipartite graph G. There was one unsatisfied valence of p_1 remaining from Step 1, and the final vector q(4) = (0,0,1,0), so that we could complete the construction of contingency table C (or the pairings of balls and sticks) by adding a multi-edge from node 1 in S to node 3 in T, an edge that by construction duplicates one already added in Step 1.

That is, prior to adding that final edge we would have this partially populated contingency table corresponding to the simple graph G:

    1 1 1 1
    1 1 0 0
    1 0 1 0
    1 1 0 0

and after adding the multi-edge it becomes:

    1 1 2 1
    1 1 0 0
    1 0 1 0
    1 1 0 0

which may be seen to satisfy all the required row and column sums. Furthermore, scanning this table for potential 2 by 2 subtables that are suboptimal in the sense of maximizing R, none are found.
Note on complexity measurement
==============================

It is easy to become careless about claims of "polynomial time" where the size of the problem is not simply defined. Above I referenced polynomials in s and t as a justification for polynomial time, but I would invite critical consideration of this as the basis for measuring the size of the input. To express the numbers s and t requires only log(s) + log(t) space.

My proposal to use s and t themselves as the measurements of problem size is therefore based on an additional consideration, namely the (effectively nonzero) summands p_i and q_j that are also required by the problem formulation. Given that at least a bit is required for the expression of each such summand, we are able to argue that the problem size is (up to some positive constant multiplier) at least s + t:

    s < c SUM log(p_i)
    t < c SUM log(q_j)

The arithmetic required by Methods I and II above is minimal, though subtraction is called for. Therefore it is helpful to anticipate that in a careful analysis the cost of such arithmetic will also entail terms like log(p_i) and log(q_j).

regards, mathtalk-ga
```

Clarification of Answer by mathtalk-ga on 07 Jun 2005 18:51 PDT

```
Hi, stephen12-ga:

Sadly my Conjecture is false, at least with regard to minimum values of R, as the following 2 by 3 contingency table shows:

    2 0 0
    1 1 1

Here p_1 = 2, p_2 = 3; q_1 = 3, q_2 = q_3 = 1. Each of the three 2 by 2 subtables is minimal with respect to the number of nonzero entries, but the marginal totals would also be attained with one fewer nonzero entry overall in this way:

    0 1 1
    3 0 0

A pair of interchanges is needed to reduce the value of R. It remains my unproven conjecture that R can be minimized by Method I if the p's and q's are taken in an order which maximizes the number of times that Case 2 occurs. However, trying all possible arrangements is clearly not a polynomial-time algorithm, since there are s! * t! of these.
So, is there a polynomial-time algorithm here, in effect to determine the optimal subgroupings of p's and q's with equal sums? In absolute terms I suspect not. Doing some searches on terms from the graph-theoretic formulation (bipartite, edge-packing), I came across a lengthy article, actually a book chapter, dealing with the classification of NP-hard problems that have polynomial-time "near-optimal" approximations:

[Approximation algorithms for NP-hard optimization problems -- Klein and Young]
http://www.cs.ucr.edu/~neal/index/non_arxiv/crc-approx-algs/final.pdf

Even if the minimum R problem is NP-hard, it seems likely to be one with "nice" approximation algorithms. Indeed we already have bounds:

    max(s,t) ≤ min(R) ≤ s + t - 1

and a sketch of how to find a maximum number of subgroupings of p's and q's with equal sums is the following:

Note that if subset S' of S = {1,..,s} and subset T' of T = {1,..,t} are such that:

    SUM  p_i  =  SUM  q_j
    i in S'      j in T'

then the same is true of S\S' and T\T'. Therefore we can restrict our search a priori to subtotals for p's and q's less than or equal to half of N. Moreover, if we are trying to form a maximum number of such subsets, it would make sense to search for subsets that are as small as possible, because the larger the subset, the less flexibility remains for finding additional "matching" subsets (other than the entire complements).

Here we can note that the special case of singletons p_i = q_j fits smoothly into the approach of looking for the smallest possible matching subsets, as pairing off two equal summands does not encumber at all the search for further matching subsets.

Consider then maintaining lists of sums less than a target, say SQRT(N), that can be attained by a subset of p's, resp. q's. As matching sums on the two lists are generated by incrementally increasing the number of terms allowed (e.g. starting from single terms, adding pairs together, etc.), these are matched off.
Such a procedure of course makes quick work of the "counterexample" presented above, as in fact there is the singleton match p_2 = q_1 = 3 to remove.

regards, mathtalk-ga
```

Clarification of Answer by mathtalk-ga on 08 Jun 2005 19:07 PDT

```
On the other hand, I did find references in the graph theory/combinatorial optimization literature indicating that the maximum R problem is of polynomial complexity. In fact, more general problems than what we treated here have polynomial-time algorithms for exact solutions, called maximum b-matching problems. The notes on the Web tend to be aimed at treating the more complex versions, so I'm working on winnowing out some references for the basic situation.

regards, mathtalk-ga
```
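The two greedy constructions in the Answer can be sketched in Python. This is my own rendering, not mathtalk-ga's code: `method_one` collapses Cases 1-3 into a single "place min(p_i, q_j) and advance" loop (the northwest-corner pattern), and `method_two` draws simple edges greedily; both are checked against the worked example p = (5,2,2,2), q = (4,3,3,1).

```python
def method_one(p, q):
    """Method I: build a table with row sums p and column sums q, adding
    one nonzero entry per step, so it has at most s + t - 1 nonzeros."""
    assert sum(p) == sum(q), "row and column totals must agree"
    p, q = list(p), list(q)              # working copies are consumed
    s, t = len(p), len(q)
    C = [[0] * t for _ in range(s)]
    i = j = 0
    while i < s and j < t:
        c = min(p[i], q[j])              # Cases 1, 2, 3 collapse to this
        C[i][j] = c
        p[i] -= c
        q[j] -= c
        if p[i] == 0:
            i += 1                       # row satisfied (Cases 2 and 3)
        if q[j] == 0:
            j += 1                       # column satisfied (Cases 1 and 2)
    return C

def method_two(p, q):
    """Method II: greedily draw simple edges in the bipartite graph,
    returning the resulting candidate for max(R)."""
    q = list(q)
    edges = 0
    for pi in sorted(p, reverse=True):   # descending order on the p's
        # stick colors with remaining capacity, largest q_j first
        alive = sorted((j for j in range(len(q)) if q[j] > 0),
                       key=lambda j: q[j], reverse=True)
        for j in alive[:pi]:
            q[j] -= 1                    # draw one simple edge i -- j
            edges += 1
    return edges

# the worked example: p = (5,2,2,2), q = (4,3,3,1)
C = method_one([5, 2, 2, 2], [4, 3, 3, 1])
R_min = sum(1 for row in C for x in row if x)    # 6 nonzero entries
R_max = method_two([5, 2, 2, 2], [4, 3, 3, 1])   # 10 simple edges
```

With the summands in this order, `method_one` reproduces the table shown in the Answer (R = 6); as the thread's counterexample shows, other orderings may need the 2 by 2 interchange step afterwards.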
 Subject: Re: a probability/combinatorics question From: mathtalk-ga on 26 May 2005 18:32 PDT
```
This appears to be a fairly difficult computation, with the complexity of intermediate calculations growing rapidly with increasing numbers of colors.

As a notational device, I prefer capital letter R for the (random variable) number of distinguishable pairs of balls and sticks, and lowercase letters s and t for the (deterministic) numbers of ball and stick colors, respectively. Of course the roles of the p's and the q's are symmetric in this problem, so we will omit repeating observations that could well be restated by swapping them. Without loss of generality we may assume that all p_i, q_j are positive. The case s = 1 is trivial, since Pr(R = t) is 1.

One approach to managing a computation of the distribution of R is to formulate the problem in terms of sampling from an urn without replacement. The contents of the urn, together with the number of distinguishable outcomes already attained, will define a state for the purposes of describing a system of transition probabilities.

Let an urn be filled with the N balls, having p_i of the corresponding color:

     s
    SUM p_i = N
    i=1

We propose to conduct t withdrawals (without replacement) from the urn, taking q_j balls at the j-th step and joining the selected q_j balls with the respective q_j sticks of a common color:

     t
    SUM q_j = N
    j=1

Let the initial state be "concentrated" with probability 1 on a state (p_1,p_2,...,p_s;0) that reflects the initial contents of the urn and the absence of any ball/stick combinations yet formed. Each step will "distribute" that probability among new states. The number of distinguishable outcomes for any consequent state will increase with each step by the number of distinct colors of balls present in the sample withdrawn. Since each q_j > 0, there must be at least one color of ball in each step, so the final number of distinguishable outcomes is at least the number of steps t.

We illustrate the computation with s = t = 2 and equal numbers m of both colors for sticks and balls.
The initial state is (m,m;0). The first withdrawal of m balls without replacement produces a hypergeometric distribution on m+1 new states (k,m-k;r), where r = 1 if k = 0 or k = m, and r = 2 otherwise:

    Pr(k,m-k;r) = C(m,k)*C(m,m-k)/C(2m,m) = C(m,k)^2/C(2m,m)

The second withdrawal leads to the "random" outcome (0,0;R), as all the remaining balls will be finally used up. The possible values for R are 2 and 4; R = 2 occurs exactly when the first withdrawal is monochromatic (k = 0 or k = m), so:

    Pr( R = 2 ) = 2/C(2m,m)
    Pr( R = 4 ) = 1 - 2/C(2m,m)

The expected value for R is then:

    E(R) = 2*Pr(R = 2) + 4*Pr(R = 4) = 4 - 4/C(2m,m)

Note in particular that for m = 1 (N = 2) we have E(R) = 4 - 2 = 2. But as m tends to infinity, E(R) rapidly approaches 4.

regards, mathtalk-ga
```
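This small s = t = 2 case can be checked exactly by brute force: a uniformly random matching is the same as pairing a uniformly random ordering of the balls with a fixed ordering of the sticks. A minimal sketch (my own helper, not part of the comment), practical only for tiny N:

```python
from fractions import Fraction
from itertools import permutations

def expected_distinct_pairs(balls, sticks):
    """Exact E(R) by enumerating all orderings of the balls against a
    fixed ordering of the sticks and averaging the pair counts."""
    total = count = 0
    for perm in permutations(balls):
        total += len(set(zip(perm, sticks)))   # R for this matching
        count += 1
    return Fraction(total, count)

print(expected_distinct_pairs("AB", "ab"))      # m = 1: prints 2
print(expected_distinct_pairs("AABB", "aabb"))  # m = 2: prints 10/3
```

The enumeration over-counts each matching of indistinguishable balls equally often, so the average is unaffected.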
 Subject: Re: a probability/combinatorics question From: stephen12-ga on 27 May 2005 11:50 PDT
```
Hi mathtalk,

Thanks very much for the detailed analysis. I also doubted that a closed-form expression exists for the expected number of distinct pairs. Calculating it in a procedural fashion is possible, but seems to have very high complexity.

Now let's relax the problem a little bit: is there an efficient (polynomial-time) way to find the maximum and minimum numbers of distinct pairs obtainable under the constraints posed by those p_i's and q_j's?

thanks, stephen
```
 Subject: Re: a probability/combinatorics question From: mathtalk-ga on 30 May 2005 11:39 PDT
```
Hi, stephen12-ga:

I strongly suspect there are fast "greedy" algorithms to find the maximum and minimum values of R (the number of distinguishable ball-stick pairs). Greedy algorithms exist where there is a "matroid" structure to the search space, and here the search space can be conceived of as a permutation of the balls keeping the sticks in some fixed order (or vice versa).

Here's a fairly interesting case, s = t = 4 with N = 11:

    p_1 = 5, p_2 = p_3 = p_4 = 2
    q_1 = 4, q_2 = q_3 = 3, q_4 = 1

where min(R) = 6 and max(R) = 10.

It seems plausible that a deeper investigation of the matroid structure behind the problem might lead to an efficient (polynomial-time) algorithm for the expected value of R (lower complexity than the complete distribution for R), but I have doubts about my own ability to work it out.

regards, mathtalk-ga
```
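For instances this small, min(R) and max(R) can be confirmed by exhausting every contingency table with the given margins. A brute-force sketch (my own code, exponential in general and usable only for small s, t, N):

```python
def support_range(p, q):
    """Brute-force (min R, max R) over all nonnegative integer tables
    with row sums p and column sums q."""

    def rows(total, caps):
        # every way to split `total` across the columns, capped entrywise
        if len(caps) == 1:
            if total <= caps[0]:
                yield (total,)
            return
        for first in range(min(total, caps[0]) + 1):
            for rest in rows(total - first, caps[1:]):
                yield (first,) + rest

    best_min, best_max = float("inf"), 0

    def rec(i, caps, nonzeros):
        nonlocal best_min, best_max
        if i == len(p):                  # caps are exactly exhausted here
            best_min = min(best_min, nonzeros)
            best_max = max(best_max, nonzeros)
            return
        for row in rows(p[i], caps):
            rec(i + 1,
                tuple(c - r for c, r in zip(caps, row)),
                nonzeros + sum(1 for x in row if x))

    rec(0, tuple(q), 0)
    return best_min, best_max

print(support_range([5, 2, 2, 2], [4, 3, 3, 1]))   # prints (6, 10)
```

Because the row totals sum to N and no column cap is ever exceeded, the leftover caps are automatically zero when the last row is placed, so no feasibility check is needed at the leaves.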
 Subject: Expected number of color pairs in the balls and sticks problem From: edcetera-ga on 21 Sep 2005 05:09 PDT
```
Hi, stephen12. In your question (May 27th 2005) you asked for the expectation of the number of distinct colors of ball-stick pairs in a random matching of balls to sticks, given the data p_i and q_j describing the numbers of balls and sticks of each color. I think the discussion that followed was all about how to compute the distribution of this number, in particular the maximum and minimum. However, there is a simple way to compute the expectation, using a standard technique from probability theory.

Let x(i,j) be the random variable which takes the value 1 when the ball-stick combination (i,j) occurs, and 0 when it does not occur. So the number of distinct ball-stick color pairs is just the sum of all the x(i,j). The expectation of a sum of random variables is the sum of their expectations. The expectation of x(i,j) is just the probability that the pair (i,j) occurs, which is easy to compute:

    P[(i,j) occurs] = 1 - P[(i,j) does not occur]
                    = 1 - P[each of the q_j sticks gets a ball of some color other than i]
                    = 1 - (N-p_i)(N-1-p_i)...(N-(q_j-1)-p_i) / (N(N-1)...(N-(q_j-1)))
                    = 1 - (N-p_i)!(N-q_j)! / (N!(N-q_j-p_i)!)

Here the final expression in terms of factorials is only valid if q_j + p_i is less than or equal to N; in the case where it is not, the probability is 1, of course. Now just sum over all pairs (i,j) to get the expectation you wanted.

Of course this will only be practical if your problem is of a reasonably small size. If there are thousands of colors then you would be better off doing a Monte Carlo simulation to estimate the expectation, or using the approximation that when all the p_i and q_j are small compared to N, then P[(i,j) occurs] is close to 1 - (1 - p_i/N)^q_j, which is well approximated by p_i q_j / N. So then the expectation is approximately the sum of p_i q_j / N, which is exactly N. You will notice that N is also the maximum possible number of pairs, so this approximation is saying that the expected number of repeats is small.
In this case it might be of interest to know just how small, and you can work this out using an approximation to P[(i,j) occurs] correct up to second order in the p_i and up to second order in the q_j. The resulting approximation is valid when the q_j and p_i are small compared to the square root of N. It is:

    E(number of distinct color pairs) = N - (SUM p_i(p_i-1))(SUM q_j(q_j-1))/(2N^2)

To interpret this in another way, consider choosing a random stick and counting the number of other sticks of the same color. Let x be the expectation of this number. Do the same with the balls to define y. Then:

    E(number of distinct color pairs in a matching) = N - xy/2 + small error

Compare this to a different model in which, instead of choosing a matching, you start with blank sticks and balls and paint them, choosing the colors independently with the same probabilities as in your problem, i.e. the probability of choosing color i for a given ball is p_i/N, and so on. You get the same first approximation for the expected number of distinct color pairs when all the probabilities are small. But in this model the exact formula for the expected number of distinct color pairs is the sum over all (i,j) of (1 - (1 - p_i q_j/N^2)^N).
```
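edcetera-ga's exact formula is straightforward to code. A sketch using exact rational arithmetic (the function name is mine), with the guard for the p_i + q_j > N case where the pair is certain to occur:

```python
from fractions import Fraction
from math import factorial

def expected_R(p, q):
    """E(R) by linearity of expectation:
    E(R) = sum over (i,j) of P[pair (i,j) occurs]."""
    N = sum(p)
    assert N == sum(q), "ball and stick counts must agree"
    total = Fraction(0)
    for pi in p:
        for qj in q:
            if pi + qj > N:
                total += 1       # pair (i,j) is certain to occur
            else:
                # P[(i,j) does not occur]
                #   = (N-p_i)! (N-q_j)! / (N! (N-p_i-q_j)!)
                miss = Fraction(factorial(N - pi) * factorial(N - qj),
                                factorial(N) * factorial(N - pi - qj))
                total += 1 - miss
    return total

print(expected_R([2, 2], [2, 2]))              # prints 10/3
print(expected_R([5, 2, 2, 2], [4, 3, 3, 1]))  # the s = t = 4, N = 11 example
```

The s*t terms make this polynomial in the number of colors, in contrast to computing the full distribution of R.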