Hi, epoulsen-ga:
Let's begin by stating some assumptions about the data in your problem
and the required output.
Your first sample data shows integer endpoints for the "event"
intervals. The last example uses "times" instead of integers.
What is important about these data types is that they are discrete;
for any particular value there is a well-defined predecessor value and
a well-defined successor value. This characteristic of the datatype
comes into play in your treatment of the input segments and output
segments as "closed" intervals. An artifact that results from this is
that output segments may have "end points" which differ by "one" from
some "end point" that appears in the input.
This "off by one" aspect is tricky. The input and output segments in
your illustration seem to be "closed" intervals, in the sense of
containing both their beginning and ending points. For example, if
the input consists of two segments:
{begin 10, end 20}
{begin 20, end 30}
then judging by your example, the proper output would be three
segments:
{begin 10, end 19, count 1}
{begin 20, end 20, count 2}
{begin 21, end 30, count 1}
The "natural" data structure to perform this counting operation turns
out to be "half-open" intervals which contain their beginning point
but not their ending point. The "first" and "last" steps of our
algorithm provide adjustments to these ending points so that the body
of the algorithm is effectively working with such half-open intervals.
A less intricate aspect of your illustrated output is that intervals
during which no events are active are suppressed in the output.
After we outline the algorithm next, we will apply it to two example
data sets. The final section discusses possible ways to optimize the
algorithm by combining multiple "steps" into a common pass through the
data.
If you have any questions about the algorithm or examples, please use
the "Request Answer Clarification" button and I will be notified of
your concerns.
regards, mathtalk-ga
STEP 0: Input
------
Let's assume the event intervals are represented in some sequential
data file as Begin, End pairs:
{begin Tx1, end Ty1}
{begin Tx2, end Ty2}
. . .
{begin Txn, end Tyn}
STEP 1: Separate beginnings and endings
------
Make a pass through the file, and put the "times" on separate
lines/records with increment +1 for begins and -1 for ends. Adjust
the ending times so that these become "plus one" in whatever sense is
appropriate for the datatype:
Tx1, +1
(Ty1 + 1), -1
Tx2, +1
(Ty2 + 1), -1
.
.
.
Txn, +1
(Tyn + 1), -1
This resulting data can be sent to a new text file or, if feasible, to
an array/list, depending on the facilities of your programming
environment.
STEP 2: Sort the resulting records
------
Sort the output of the first step according to "time" values, either
in a file or in memory (if the array was used). Efficient sorting is
provided by a variety of external file utilities. If the number of
segments in the input is not much more than 25, then an insertion sort
can be done in combination with the activities of Step 1. If the
number of segments is typically well beyond this, it will probably pay
in the long run to implement a quicksort or heapsort algorithm. A
typical perl interpreter under Unix would use the system provided
qsort( ) routine for sorting.
STEP 3: Combine records with same time
------
Make a pass through the sorted list of "times" and increments:
Tz1, +1
Tz2, +/-1
Tz3, +/-1
Tz4, +/-1
.
.
.
TzN, -1
where N = 2n. Naturally e1 = +1 and eN = -1, but we cannot predict in
general more than this about the pattern of increments.
There may be duplicates among the "times" given here. As we pass
through the list, when a group of records with the same "time" Tzk is
encountered, we treat that group as a whole by summing up all the
increments. If the total of increments for such a group is zero, it
can be ignored (skipped over). Otherwise create a single entry for
any group of entries that has a nonzero total increment:
Tw1, e1
Tw2, e2
. . .
TwK, eK
STEP 4: Create draft output segments
------
Initialize a counter $count = e1, the first increment of a distinct
entry, at the beginning of the pass. Also store a "time" value
$timeOld = Tw1 of that entry.
Loop over each additional entry listed:
Set the new "time" value $timeNew = Twj and increment $incr = ej.
If $count > 0, then
output {begin [$timeOld], end [$timeNew], count [$count]}
Then update the variables $count and $timeOld:
$count = $count + $incr
$timeOld = $timeNew
When the last entry pair TwK, eK is processed, we are done with this
step.
STEP 5: Adjust output ending points
------
Thus far the ending point of one segment will match the beginning
point of the next segment. In your desired output there will not be
any "overlap" of these ending and beginning points. To remove the
overlap, "subtract one" from the ending point of each output record.
Example 1
---------
Let's illustrate the steps of the algorithm using the simple input
described above:
{begin 10, end 20}
{begin 20, end 30}
In Step 1 we make a pass through the input, separating the beginning
and ending values. The ending "times" are adjusted by adding 1, and
an increment of +1 or -1 is associated with the beginning and ending
times, respectively:
10, +1
21, -1
20, +1
31, -1
In Step 2 we sort the records according to "time" values:
10, +1
20, +1
21, -1
31, -1
In Step 3 we would combine records sharing a common "time", but in
this case no records are combined.
In Step 4 we create the draft output records. After initializing
$count = 1 and $timeOld = 10 using the first pair of entries, we
proceed to output a record for each of the three remaining pairs of
entries:
{begin 10, end 20, count 1}
{begin 20, end 21, count 2}
{begin 21, end 31, count 1}
In Step 5 we adjust the ending points of these output records:
{begin 10, end 19, count 1}
{begin 20, end 20, count 2}
{begin 21, end 30, count 1}
Example 2
---------
Now let's do the more complicated data given in your initial example:
{begin 10, end 50 }
{begin 10, end 20 }
{begin 15, end 20 }
{begin 40, end 70 }
{begin 100, end 120 }
In Step 1 we get this list of entries by separating the begin and end
"times":
10, +1
51, -1
10, +1
21, -1
15, +1
21, -1
40, +1
71, -1
100, +1
121, -1
After sorting in Step 2 we have:
10, +1
10, +1
15, +1
21, -1
21, -1
40, +1
51, -1
71, -1
100, +1
121, -1
After "telescoping" the entries in Step 3 to obtain distinct "time"
entries:
10, +2
15, +1
21, -2
40, +1
51, -1
71, -1
100, +1
121, -1
Now Step 4 creates these draft output records:
{begin 10, end 15, count 2}
{begin 15, end 21, count 3}
{begin 21, end 40, count 1}
{begin 40, end 51, count 2}
{begin 51, end 71, count 1}
Note: We supress outputting {begin 71, end 100, count 0}
next, because $count is zero.
{begin 100, end 121, count 1}
At last in Step 5 we adjust the ending points:
{begin 10, end 14, count 2}
{begin 15, end 20, count 3}
{begin 21, end 39, count 1}
{begin 40, end 50, count 2}
{begin 51, end 70, count 1}
{begin 100, end 120, count 1}
to get the final output.
Discussion
----------
The purpose of breaking the algorithm down into five separate steps is
partly for clarity, esp. to explain the treatment of the adjustments
to ending points, and partly to allow for flexibility. In
implementing the algorithm it would be natural to combine various
steps so that the number of passes through the data is reduced.
Step 0 doesn't "do" anything; it merely acts to identify our input.
Step 1 is a first pass through the data. If one were implementing the
sort of Step 2 in custom code, e.g. as a straight insertion sort, then
Steps 1 and 2 could be naturally combined. However for larger data
sets it would be desirable to reuse a well tested "fast" sorting
routine, and thus to keep Step 1 and 2 distinct.
Likewise Steps 3,4,5 can be combined into a single pass through the
data. One would aggregate the first set of records and use that to
initialize $count and $timeOld. Thereafter each set of records that
gets aggregated, summing the increments of all pairs of entries up to
one with a distinct "time" value (or to the end of the list), allows
us to create an new output record. The ending point can be adjusted
as the output record is written. |