Consider a set of n points that all lie in a plane. Call them
individually p_1, p_2, ?, p_n, and for the sake of simplicity, let's
assume they all fall in the unit square.
Now suppose that we assign to each point a height, h(p_i), so that the
points and their heights together define a type of landscape. One way
to imagine this landscape is to say that it is defined by a rubber
sheet stretched over sticks perpendicular to the plane, where each
stick corresponds to one of the points at a particular height.
We stipulate that the height of each point p_i is between zero and
one, i.e., 0<=h(p_i) <=1
Now, consider two different landscapes L_1 and L_2 over the same set
of n points, i.e., each landscape assigns (most likely) different
heights to the points p_1, ? p_n.
How should I define a metric that measures the difference between two
different landscapes, based on their heights and the positions of the
given points? Informally, I think that I'm looking for a measure of
how well the two landscapes would "fit" if you tried to push them
together. I realize there are many different possible measures, from
the utterly trivial to an actual physical simulation, but I'm looking
for something in between that simply makes sense.
So in summary, I want a function d(L_1, L_2, p_1, ?, p_n) that
provides this measure.
Also, what, if anything, is changed if we further stipulate that for
any landscape, the sum of all the point heights = 1, namely:
For each landscape:
sum[i=1 to n] (h(p_i)) = 1
(In this case, one could assume the landscapes correspond to
probability distributions defined over the points.) |
Request for Question Clarification by
mathtalk-ga
on
16 Nov 2004 18:20 PST
Hi, prettypolynomial:
I worked out a fairly simple formula for integrating the square of
differences between two "landscapes" over each triangle (of the
underlying Delaunay triangulation). I can present that as a solution,
together with some remarks about the restricted case of the landscape
points summing to 1, if you think it might be satisfactory.
regards, mathtalk-ga
|
Clarification of Question by
prettypolynomial-ga
on
22 Nov 2004 13:47 PST
Mathtalk, apologies for the delay in responding.
May I ask you a question first about the solution you've formulated?
I'd like to know if it addresses an issue I'm concerned about.
To make this easier, let's consider for the moment a 1-dimensional
version of this problem along the unit interval. Say we have a set of
points along a line. We define a landscape just as above, but now a
landscape is a 2-d rather than 3-d object.
I'm concerned about cases like the following. Suppose we examine a
small subset of this interval and have:
Landscape 1:
|
|
|
__|___|___|___|___
x x+epsilon
Landscape 2:
|
|
|
__|___|___|___|___
x x+epsilon
If we perform a type of averaging by considering Delaunay
triangulations rather than the points themselves, the fact that the
two peaks are near each other in the interval -- but may have an
arbitrary number (here, I've used two) of intervening points in our
set -- seems to get lost.
Does your solution address this? Can you give me a brief idea of how?
|
Clarification of Question by
prettypolynomial-ga
on
22 Nov 2004 13:50 PST
It just occurred to me that I should have used at least 3 intervening
points to make my concern clearer, and with that addendum, I look
forward to your response.
|
Request for Question Clarification by
mathtalk-ga
on
22 Nov 2004 19:35 PST
Hi, prettypolynomial-ga:
I think the device of using a one-dimensional problem to discuss your
concerns is a good one, but I'm not yet clear on what consideration
you wish to give for "spikes" that are at nearby but distinct points
of the "landscape".
To back up a minute, the nature of your application is of course the
most important source of criteria for what makes a good measure of
distance between "landscapes". I've assumed that referring to a
"landscape" is a bit metaphorical, it has some significance for how
best to measure any discrepancy.
In particular you've noted that the underlying points p_i for both
landscapes are precisely the same, and that these form a Delauney
triangulation of a region. These aspects indicate to me that, as with
a real geodetic application, the intra-landmark distances are known
with greater precision a priori than the relative elevations.
Now let's contrast how the two proposed methods for comparing
landscapes handle the sorts of "spiking" discrepancies suggested by
your one-dimension example, and then consider a further alternative
that may address your concern.
If we use a simple pointwise sum of squares as hfshaw-ga proposed
initially, then the locations of points p_i are not important for the
calculation. All points are equally weighted.
An integral over the areas defined by the Delauney triangulation, as I
proposed, weights the contributions of points by the size of the
triangles which they bound, since for piecewise linear "elements" the
influence of any measured h(p_i) extends no further than that.
Hence in this measure of discrepancy, the appearance of two spikes at
different nearby locations in the landscapes will contribute to their
computed distance, but the magnitude of the contribution is limited by
the granularity of the distance between them (a distance epsilon, as
you described it).
I can think of applications where this would be just the result
desired. For example, in comparing two (one-dimensional) NMR spectra,
the horizontal coordinates (magnetic frequencies) are known with high
precision and form a reproducible basis for comparing two curves. (If
necessary samples may be "doped" with a compound that has a known
spike somewhere in the frequency range of interest to produce a
synthetic "landmark" for reference purposes.)
On the other hand you may have in mind a application of a different
sort, in which not only the heights but also the "spatial" coordinates
p_i may have inter-sampling uncertainties. We measured a spike at
"x", but really (given the registration of instruments from run to
run), it might have been at x plus or minus epsilon.
In that case I would think about "smoothing" the data. Instead of
comparing the two landscape datasets "as is", I'd "smear out" the
contributions of individual heights by replacing them with nearest
neighbor averages (the distance over which the averaging is done being
a function of the uncertainty in spatial coordinates). In this way,
depending on their proximity, two spikes that appear quite separated
in their respective landscapes would have overlapping support, and
this reduces their overall contribution to the computed discrepancy.
In the extreme limit all heights would be averaged out, and the
comparison made by differencing the mean values of h(p_i) between the
two landscapes.
Something similar to this is involved in image compression, where a
full palette of 24-bit colors needs to be approximated by a skillful
selection of an 8-bit subset. A contrary kind of algorithm is needed
for edge detection/enhancement in imaging, with the goal of
identifying and contrasting more sharply some perceived
discontinuities in a picture.
regards, mathtalk-ga
|
Clarification of Question by
prettypolynomial-ga
on
23 Nov 2004 17:35 PST
Mathtalk, thank you for your thoughtful comments.
Let me briefly explain where the data comes from, and I think that
will help make matters clearer. I'll stay in the 1-d framework for
this.
Suppose we have an instrument that periodically outputs a value in
[0,1]. We collect these measurements and cluster them using k-means.
(For our purposes, k is a fairly large value, i.e., k>64, so one might
view this clustering as creating a codebook.) The cluster prototypes
are the points referred to in the original problem statement as p_1,
...,p_n.
We then create a Voronoi partitioning of the space around these
prototypes. For each partitioned region, we count its number of
observations with respect to the measured data. This number, when
normalized over all observations, is assigned as the prototype's
height. In this way, we define the (indeed) metaphorical landscape in
the problem statement.
The positional data therefore have no privileged status over the
height data, as both sets are determined by the particular (and often
somewhat arbitrary) output of our k-means clustering.
(One could ask here how the points, i.e., the cluster prototypes,
remain the same for different landscapes, which essentially correspond
to different sets of measurements, and might seem by necessity to be
clustered differently. Here, I'll beg your indulgence that for
reasons beyond the scope of my question, the sets of prototypes points
are in fact identical.)
Thus, my question is how can I measure differences between
"landscapes" created in this fashion. I'll mention that I did
initially consider a spreading (or "smearing") mechanism by
considering each landscape as a parameterized Gaussian mixture model
and iteratively increasing the influence of each Voronoi region on its
neighbors. However, I was unable to find a principled means for
effecting this spreading or any theoretical justification for what I
was doing, which is what lead me here.
|
Request for Question Clarification by
mathtalk-ga
on
25 Nov 2004 14:04 PST
Hi, prettypolynomial-ga:
I applaud your search for a "principled" justification, but given my
ignorance of what underlies the identification of p_i's in the two
datasets, I'm reluctant to guess at an appropriate treatment.
It seems to me that the Voronoi partitioning suggests a meaningful
notion of distance between points and of the areas of Delaunay
triangulation, but since you've considered this so thoroughly, I'm
sure your doubts about their significance are well founded.
regards, mathtalk-ga
|
Clarification of Question by
prettypolynomial-ga
on
25 Nov 2004 15:42 PST
Hi Mathtalk,
Please do me the following the favor, as I would like to reward you
for the time you've spent in discussion with me. Post the original
solution you worked out as the answer, which I'm interested in seeing
in any event, and I'll then mark this question as answered.
I really do appreciate the effort you've exerted so far, particularly
given the overly abbreviated problem statement.
Best regards,
prettypolynomial
|
Request for Question Clarification by
mathtalk-ga
on
28 Nov 2004 16:54 PST
Hi, prettypolynomial-ga:
Have you studied the Comment posted by ram_nathaniel-ga? The Earth
Mover's Distance (EMD) certainly sounds like the sort of thing your
application requires. I looked at the linear programming definition
but haven't yet reached a clear grasp of why it matches the intuitive
notion.
regards, mathtalk-ga
|
Clarification of Question by
prettypolynomial-ga
on
28 Nov 2004 18:46 PST
Hi Mathtalk,
I've now looked through several papers on the Earth Mover's Distance
(EMD), which I've found extremely helpful. In particular, I read one
by Levina and Bickel which demonstrates that the EMD is equivalent in
many cases to something called the Mallows Distance and gives some
insight into why it works.
(http://www.stat.lsa.umich.edu/~elevina/EMD.pdf)
As for the linear programming definition, the only thing it doesn't
seem to actually guarantee is that one-to-one peak matching is
preferred over one-to-many peak matching, as in Figure 1d of the
original Rubner et al. paper, although that specification is somewhat
informal.
I also took slight issue with the notion that people have intuitive
preferences for comparing histograms, which seems somewhat farfetched
outside of graduate schools. People surely have strong intuitions
governing image comparisons, but assuming those preferences
automatically transfer to other image representations, e.g.,
histograms, needs to be substantiated, although I would certainly tend
to believe it.
|