Google Answers Logo
View Question
 
Q: Calculate the median using only counters ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: Calculate the median using only counters
Category: Computers > Algorithms
Asked by: jh963-ga
List Price: $20.00
Posted: 06 Oct 2006 13:20 PDT
Expires: 05 Nov 2006 12:20 PST
Question ID: 771372
Construct an algorithm to determine the median of a set of integers.

You are given:
The number of integers in the set (which may be even or odd).
The range of the integers (from 0 to 2^12 - 1)
Counter 1, which counts all values <= an integer limit value you set
Counter 2, which counts all values >= an integer limit value you set 
(The limit values can be different on the 2 counters.)

You can stream the entire set of integers past the counters any number of times
(although, obviously, fewer times is better).  After each streaming, you can read
the counter values and adjust the limit values.  One streaming will increment the
values in both counters.

You cannot:
* Change any values in the stream (e.g., to sort the numbers)
* Stream only part of the set past the counters. 
* Determine the index in the stream of any particular value.
* Read the values off the stream into memory and manipulate them there.
 
Each time you stream the integers past the counters takes 5 milliseconds. An
excellent solution will take less than 100 milliseconds to compute.  An 
acceptable solution can take up to 200 milliseconds.

For your testing, consider the following abbreviated data sets with a sample
size of 8 and range of values from 0 to 7:

Sample 1	6	2	2	7	5	1	6	5
Sample 2	4	7	3	3	2	1	6	1
Sample 3	5	2	7	2	4	1	7	2
Sample 4	1	6	6	5	6	3	2	7
Sample 5	7	2	7	6	6	6	2	5
Sample 6	1	0	2	6	5	4	3	7
Sample 7	0	0	0	0	0	0	0	0
Sample 8	1	1	1	1	1	1	1	1
Sample 9	4	4	4	4	4	4	4	4
Sample 10	2	7	7	2	2	7	7	2


Your algorithm must correctly calculate the median value.  If the sample size
is odd, the median is the (N/2 + 1)'th element (if the sequence were sorted).
If the sample size is even, the median is defined as the AVERAGE of the
N/2 and N/2 + 1 element.

No, this isn't a homework problem.  We have a custom piece of hardware that
can stream a data set past the 2 counters.  We need to be able to calculate
the median of this data set.

J.

Request for Question Clarification by maniac-ga on 07 Oct 2006 09:52 PDT
Hello Jh963,

Are there any additional constraints that should be accounted for?

For example, can you [or NOT] remember the results of the counters
between streams? [this would make the algorithm REALLY hard]
Do you know the total number of integers that will be streamed before
the first run? [if false, use the first stream with C1<=max and/or
C2>=0 to determine the total count - adding one to the number of
streams required]

Also - you don't indicate how you want the algorithm represented -
would something like the explanation below be acceptable or do you
need a specific format for the algorithm?

Let me outline a solution and see if it meets your needs. The
constraints listed include:
 - an "excellent solution" will have at most 19 cycles of the counters
 - an "acceptable" solution will have at most 40 cycles of the counters
 - the solution must accomodate both odd & even values
 - the values are seen only by the counters (and not available for
other processing)

In addition, I believe we should also assume that
 - the data can be sparse (one or more values can be omitted)

For the odd case, a relatively simple approach is to perform a binary
search on the 2^12 possible values to locate the median value. That
require the data to be streamed across the counters a dozen times or
less. You could set both counter 1 and 2 to the same limit value &
stop when the counters of both exceed (N/2)+1.

For the even case, its necessary to find both the N/2 and (N/2)+1
values. We can apply a method similar to the odd case, to find one
(say the N/2 value), but then we need to find the (N/2)+1 value within
8 streams which will work in most cases - but not all.

As an alternative, do a pair of parallel binary searches to locate the
N/2 and (N/2)+1 values at the same time. For example, use
  C1 & search for the minimum value which has the counter >= N/2
  C2 & search for the maximum value which has the counter >= N/2
If the counters reach the same value - that's the median. If not, use
the average calculation to determine the median. This latter solution
should find the solution within the same dozen streams as the odd
solution.

Please reply with clarifications to assumptions I've referred to and
if you need more detail in the algorithm to complete the answer.

  --Maniac

Clarification of Question by jh963-ga on 07 Oct 2006 11:45 PDT
I'll try to answer your questions:

Yes, you know beforehand how many integers are in the stream.  Consider it an 
argument passed in to the function.

Yes, you can remember the counter results between streaming the integers.

I'd like the algorithm represented as pseudo code or C code.  Something like:

counter1_limit = 0
counter2_limit = 0xffff
stream_integers();
if counter1_value > X ...

You get the idea.

Yes, as my sample data sets show, the data can be "sparse".  In particular, see
samples 7, 8, 9, and 10.

The difficulty in this problem is in the actual design of the binary search.
I'd like to see the pseudo code solution, and you should have run all of the
sample sets through the algorithm to ensure your algorithm works.  

J.
Answer  
Subject: Re: Calculate the median using only counters
Answered By: maniac-ga on 07 Oct 2006 16:37 PDT
Rated:5 out of 5 stars
 
Hello Jh963,

I pasted your data into an Excel spreadsheet, generated a pair of
functions & compared the result of those functions with the built in
MEDIAN function. These pass all your sample data without problem for
the even cases & when removing the last value from each sample does
the same for odd cases. The functions are included at the end of the
answer for reference / as "pseudo code" for the solution. Every place
that INT is used in that code - be sure to use integer arithmetic (or
truncate floating point values) to get the right result.

Let's explain a little bit what the functions do and how they do it.

The CountRange function acts as your "counter". The first parameter is
the set of values, the second is the limit set, and the third should
be True for counter C1, and False for C2. The return value is the
number of values that match the counter condition / limit.

The DoMedian function has two main cases as described before, the
"odd" and the "even" case. The odd case is much simpler - let's
describe it first.

The loop is initialized to start at the middle of the range of
possible values. The loop then repeats until the values of the
counters are both greater or equal to LCount ((Count+1) / 2). The top
of the loop passes the data pass both counters (calling CountRange for
each), setting C1 and C2 respectively. The IF statement that follows
allows an early exit if we stumble onto the result "early". The
sequence of IF statements that follow have three conditions:
 - IF LDelta is more than one, divide the range by 2 & move up/down
based on the relative values of C1 and LCount.
 - the second condition is used is LDelta is one & this is the second
time we've gotten to this point. If that's the case, return the
current index (CurL). [note - my code did not trigger this condition
when I set a breakpoint there, but I needed this for a limit on the
even side]
 - the third condition means that LDelta is one & this is the first
time we've gotten here. If that's the case, we try "one more time" and
move the index in the proper direction. This is necessary for a corner
condition at the extreme ranges of possible values (e.g., if the
median is 7).

Now, for the even case, the coding is similar but handles the index
for C1 and C2 separately. It also has to remember the highest / lowest
match values to determine if an index is acceptable. I believe I
misstated the conditions previously - we need to maximize the index
for C1 and minimize the index for C2. So, after setting up the loop
(with a few additional variables), The loop again streams the data
against the counters C1 and C2. The first IF statement checks C1
against the limit & based on that computes the new index for C1,
remembering a "match" index if this is the maximum seen so far. The
second IF statement does the same for C2, remembering the match to
minimize seen so far. The termination condition is similar to the end
case for the odd code - adjusting the index delta to 1/2 each time
until one is reached. When one is reached the first time, go ahead and
do that. The second time the delta is one, we now know we've generated
a solution - and the average of the two indices is returned as the
function value.

That's it. As you noted, there are some issues with the careful design
and implementation of a binary search. It is sometimes difficult to
handle the edge cases (e.g., zero and your limit) - the code has to be
carefully constructed to start at the right location in your data.

If you have any difficulty understanding the answer or if the answer
is incomplete, please make a clarification request and I would be glad
to provide further information.

Good luck with your work.

  --Maniac

Function DoMedian(X As Range, Limit) As Double
'
'   Function DoMedian determines the median of a set of values based
'   using a pair of "counters" that count values within X as being above
'   or below some limit. The limit to this function is the maximum value
'   of the values expected (zero is the minimum).
'   Created by maniac on October 7, 2006
'
    DoMedian = 0
    XCount = X.Count
    LDelta = Int((Limit + 1) / 2)
    SeenOne = False
    '   Two different algorithms, simple for odd number of samples,
more complex if even
    If XCount = (Int(XCount / 2) * 2) Then
        '   Even case
        '   Remember how many samples for the counters
        LCount = LDelta
        CurL1 = LDelta
        CurL2 = LDelta
        MatchL1 = -1
        MatchL2 = Limit + 1
        LDelta = Int((LDelta + 1) / 2)
        Do
            '   We're trying to maximize CurL1 and minimize CurL2
            '   where they satisfy C1 (or C2) >= LCount
            C1 = CountRange(X, CurL1, True)
            C2 = CountRange(X, CurL2, False)
            '   Have we found the "right" values?
            If (C1 >= LCount) Then
                '   OK, we have a possible match to maximize
                If (CurL1 > MatchL1) Then
                    MatchL1 = CurL1
                End If
                CurL1 = CurL1 + LDelta
            Else
                CurL1 = CurL1 - LDelta
            End If
            If (C2 >= LCount) Then
                '   OK, we have a possible match to minimize
                If (CurL2 < MatchL2) Then
                    MatchL2 = CurL2
                End If
                CurL2 = CurL2 - LDelta
            Else
                CurL2 = CurL2 + LDelta
            End If
            If (LDelta > 1) Then
                LDelta = Int((1 + LDelta) / 2)
            ElseIf SeenOne Then
                '   All done, return the result
                DoMedian = (MatchL1 + MatchL2) / 2
                Exit Do
            Else
                SeenOne = True
            End If
        Loop
    Else
        '   Odd case
        '   remember how many samples for the counters
        LCount = LDelta
        CurL = LDelta - 1
        Do
            C1 = CountRange(X, CurL, True)
            C2 = CountRange(X, CurL, False)
            '   Have we found the "right" value?
            If ((C1 >= LCount) And (C2 >= LCount)) Then
                DoMedian = CurL
                Exit Do
            End If
            If LDelta > 1 Then
                LDelta = Int((1 + LDelta) / 2)
                If (C1 < LCount) Then
                    '   Need to go down
                    CurL = CurL - LDelta
                Else
                    '   Need to go up
                    CurL = CurL + LDelta
                End If
            ElseIf SeenOne Then
                DoMedian = CurL
                Exit Do
            Else
                SeenOne = True
                If (C1 < LCount) Then
                    '   Need to go down
                    CurL = CurL - LDelta
                Else
                    '   Need to go up
                    CurL = CurL + LDelta
                End If
            End If
        Loop
    End If

End Function

Function CountRange(X As Range, Limit, GE As Boolean) As Integer
    Count = 0
    If GE Then
        For Each C In X
            If C.Value >= Limit Then
                Count = Count + 1
            End If
        Next C
    Else
        For Each C In X
            If C.Value <= Limit Then
                Count = Count + 1
            End If
        Next C
    End If
    CountRange = Count
End Function

Request for Answer Clarification by jh963-ga on 09 Oct 2006 10:59 PDT
I converted your code to C code and ran my 10 samples: they worked.  I
then tried this data set:

2061 2106 1537 2230 1962 1024 2075 616 2522 487 2630 3205 
3299 2795 1568 3209 2404 2559 3166 2976 488 3409 261 524

I pass in this set of 24 integers and pass in a limit of 4095.
I think the median should be 2168.
The answer I get from the algorithm is 2047.5.

It's possible that I made a mistake in my translation to C.  When I stepped 
through your code, however, I saw CurL1 and CurL2 DIVERGING.  One getting 
closer to 0 and 1 getting closer to 4096.  There may be a problem with your
code when the limit size is not equal to the sample size.  

Could you try this data set in your test code and see what result you get?

Thanks,
J.

Clarification of Answer by maniac-ga on 10 Oct 2006 08:29 PDT
Hello Jh963,

Hmm. Not enough diversity on the test cases. Please change the value of LCount to
   LCount = Int((XCount + 1) / 2)
near the top of each case (even & odd) [note: the +1 is not needed on
the even case]. Its coincidental that I used 7 or 8 samples (with a
range of 0-7) for all my testing so the value set to LCount was always
correct. As you noted - the big difference in the number of samples
and range triggered the bug.

As a side note - you may get some diversion between CurL1 and CurL2 in
the even case as they try to find the N/2 and N+1/2 samples. They
should go to the limits only if the data drives them there (e.g.,
alternating 0, 4095). Oops - I found another bug. Add
  SeenTwo = False
near the top & revise the ElseIF SeenOne case to read
  ElseIf SeenTwo Then
     '   All done, return the result
     DoMedian = (MatchL1 + MatchL2) / 2
     Exit Do
  ElseIf SeenOne Then
     '   In even case, make sure we get to the minimum value (0)
     SeenTwo = True

The even case was stopping one "too soon" in my sample testing, this
tries "one more time" to force CurL1 and CurL2 to the limits if
needed. The odd case doesn't need this extra pass (checked that too -
can return both 0 & 4095).

Please let me know if you have any further problems - I am glad to help.

  --Maniac

Request for Answer Clarification by jh963-ga on 10 Oct 2006 10:13 PDT
Ok, it's better but I'm still getting some failures.  I created a test
loop that generates a random number of sample elements (between 1 and
128).  I then ran this loop 64 times and got errors on all of the odd
sample size cases.  Below are a few representative examples of the
failures (I chose the smaller sample sizes for brevity here).  "Median
= XXX" is result from the DoMedian routine.  I then calculate the
median myself by sorting the sample.  BTW "s/b" == "should be".

J.


Sample (size 7): 709 1097 1142 1447 1977 3389 4016      Median = 1535.000000
******** Error: median s/b 1447.000000 but is 1535.000000.

Sample (size 21): 29 331 598 707 746 995 1264 1514 1774 1785 2029 2035 2791 2916
 2997 3261 3523 3548 3567 3817 3916     Median = 1791.000000
******** Error: median s/b 2029.000000 but is 1791.000000.

Sample (size 9): 886 1000 1005 1334 2260 2792 3775 3780 4087    Median = 2047.00
0000
******** Error: median s/b 2260.000000 but is 2047.000000.

Sample (size 17): 138 322 806 1011 1080 1427 1567 1595 1756 1882 1910 2614 2867
3188 3353 3621 3818     Median = 1791.000000
******** Error: median s/b 1756.000000 but is 1791.000000.

Request for Answer Clarification by jh963-ga on 10 Oct 2006 10:16 PDT
OOOPPPPPSSSS!

That was my mistake, not yours.  I misread your last message and
thought it said the "+ 1" was not needed on the ODD case, not the even
case.  I fixed that and the algorithm seems to be working now.

J.

Clarification of Answer by maniac-ga on 10 Oct 2006 11:16 PDT
Hello Jh963,

Thank you for the kind words and I am glad everthing has worked out
for your problem.

  --Maniac
jh963-ga rated this answer:5 out of 5 stars and gave an additional tip of: $20.00
Great extra effort on this harder-than-it-looks problem.

Comments  
There are no comments at this time.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy