Google Answers: Identifying zip codes between two points

View Question

Q: Identifying zip codes between two points ( Answered 5 out of 5 stars

Question

Subject: Identifying zip codes between two points
Category: Computers > Programming
Asked by: maxout-ga
List Price: $75.00

Posted: 22 Aug 2003 17:03 PDT
Expires: 21 Sep 2003 17:03 PDT
Question ID: 247783

I have a databaes of zip codes along with the corresponding latitude and longitude. I need to find all of the zip codes between the two inputs, including a buffer of N% or N miles of the distance between the two. For example, Dallas to Houston is 250 miles (75207 to 77050). I need to find all of the zip codes betweent he two zips as well as 12 miles (5%) on either side of the line. This is all in SQL Server 2K and t-sql. The steps I expect to see are: - Convert lat/lon to polar - Project polar coords down to cartesian coordinates - An algorithm/SQL function that takes the line between the two points and make a polygon that represents the buffer area - A SQL function that tests whether the point/zip is inside the polygon above - A SQL query that calls the function to give the results
Request for Question Clarification by mathtalk-ga on 23 Aug 2003 20:48 PDT Hi, maxout-ga: The steps you outline are not quite the ones I envision to solve your problem. Let me describe my approach and if it seems satisfactory to you, I will post the details as an answer (including the T-SQL). Given any two points on a sphere, it can be rotated so that those two points are actually on the "equator". In effect we can treat a great circle passing through two points as the equator of the sphere so far as distances between points are concerned. So consider the "special case" of two points on the equator. If I've understood the region of interest to you, it consists of those points within a given latitude above or below the "equator" whose longitudes lie between that of those two given points. Note that this region is not exactly a spherical rectangle. Although our beginning and ending segments are each arcs of a great circle and bisected by one of the given points, the upper and lower segments are arcs of circles of longitude and hence not of great circles. However the area is more nearly a spherical rectangle as its width narrows. If we can agree this is the solution, at least in the special case of two points on the equator, then by providing an explicit formula that rotates all the points on the sphere rigidly so as to position any of two given points on the equator, we can map a prospective point using this formula and see if it lies in the proper relationship to the two derived points on the equator. There are additional issues I plan to discuss in regard to making the query efficient, both in terms of an execution plan and computations. regards, mathtalk-ga
Clarification of Question by maxout-ga on 24 Aug 2003 09:28 PDT So if I understand correctly, you are wanting to move the two points to the equator, then perform the operation on an easer to work with spherical rectangle. Having an exact rectangle is not of great importance. As long as the amount of variance is not significant in a NY -> LA search it'll work for my problem. I do have to offer that the computations need to efficient enough to allow for real time searches. (Real time being defined as < 15-20 seconds on an average server) I don't think this will be a problem, especially given that these are simple operations, however. It has been about 15 years since I took any geometry, so forgive me if my terms are off. I'm glad others still remember/use it :)
Clarification of Question by maxout-ga on 26 Aug 2003 21:29 PDT Please unlock the question if you are not going to answer. It has been locked for almost 36 hours with no answer. Thanks.
Request for Question Clarification by mathtalk-ga on 27 Aug 2003 00:27 PDT Hi, maxout-ga: Please read the Comment which I posted at bottom. If you wish to provide a deadline to work to, or if you simply wish to have another Researcher attempt your question, please advise me. I will be happy to respect your wishes. regards, mathtalk-ga
Clarification of Question by maxout-ga on 27 Aug 2003 07:34 PDT THanks for your detailed response mathtalk. Having waited for a bit I was a bit concerned there was either a bug in the system causing the lock not to release, or a rogue expert keeping me from getting an answer. In the mean time I've read through many of your past answers, and I appreciate it is you working on the problem as I know I will receive a correct and detailed answer. As best as I can follow, the steps outlined below seem to approach a solution. There seems to be a bit of uncertainty at the SQL point, but something you are addressing via one of the two alternatives at the end. To further clarify my problem (and hopefully simplify it): - Stored procedures and sQL functions are both fine and are the preferred way to house this code as they will perform well - The buffer area should be N miles or N%. It does not need to support both unless you feel like it. - On your SQL query below, providing the endpoints is probably redundant. You should just be able to provide the coordiantes of the area and return all the results. Also, a question. Why is it necessary to perform the orthogonal decomposition after we convert to cartesian coordinates if we are dealing in a plane? Is it because we are converting and need the handedness? (Please forgive my lack of knowledge in this area of math)
Request for Question Clarification by mathtalk-ga on 27 Aug 2003 09:21 PDT Hi, maxout-ga: Thanks for clearing up your comfort with a stored procedure as an approach. The orthogonal decomposition "milks" the Cartesian coordinates for information needed to decide whether a point C falls inside the region or not: u = v + w How big \|\|w\|\| is determines how far C (corresponding to unit vector u) lies above or below the "equator" (great circle) passing through A,B. The direction of v, i.e. whether it lies between v_A and v_B or not, tells us whether C is closer to the "short" segment arc'ing from A to B, or to the "long" portion of that great circle. We only want those points such that v is "in between". It is not necessary to explicitly compute both v,w from u, as you will see in my final answer. It is possible to compute: <u, v_A + v_B> = <v, v_A + v_B> and to rely on \|\|v\|\|^2 = 1 - \|\|w\|\|^2, so that our computation with individual vectors u can be limited to a couple of inner products (once the vectors v_A and v_B have been suitably "preprocessed"). regards, mathtalk-ga
Clarification of Question by maxout-ga on 27 Aug 2003 11:35 PDT Thanks, mathtalk. I look forward to seeing your final response. Out of curiousity, what is your background? Your answers have ranged from math related problems to theological analysis, and you appear to grasp the minutia of each topic well.
Request for Question Clarification by mathtalk-ga on 28 Aug 2003 15:34 PDT Hi, maxout-ga: I've written and tested a SQL script that creates a table T with test case info, see my second comment. I threw in a user-defined function which computes the angular separation between points on a unit sphere from latitude and longitude coordinates. When this function is scaled up to the Earth's radius, i.e. slightly less than 4000 miles, it gives the "great circle" distance between points. Note that this function expects the coordinates to be given in degrees (heresy to mathematicians) because I suspect this is how your actual table will be laid out. I chose the test cases to illustrate some of the circumstances which might arise, but did not want to spend a great deal of time on it since presumably you have a complete dataset to play with. The paths I have in mind to illustrate are: Dallas to Houston (per your original example) Dallas to San Antonio (changing one endpoint) Los Angeles to New York (a longer distance) I ran these commands on my local SQL Server 2K, using Windows authentication. It was mildly surprising to me that the name of the function needs to be qualified by an owner name (dbo.SphereDist in this case), but I guess a "user-defined function" really does imply association with a specific user. Please try to execute the script in a "scratch" database and let me know if there are any questions or concerns. I expect to post an answer tonight. regards, mathtalk-ga
Request for Question Clarification by mathtalk-ga on 28 Aug 2003 21:37 PDT Hi, maxout-ga: I finished the initial coding of a stored procedure, ZipStrip, which takes two zipcodes (for endpoints A,B) and a "half-width" distance N as arguments. Some debugging of the rather complex arithmetic inside is unfortunately needed, and given the lateness of the hour, I no longer expect to be able to post an answer tonight. regards, mathtalk-ga
Clarification of Question by maxout-ga on 28 Aug 2003 23:06 PDT Not a problem. If you are able to finish by tomorrow EOB I should be able to test it, otherwise it may be Monday before I have a chance. Thanks..

Answer

Subject: Re: Identifying zip codes between two points
Answered By: mathtalk-ga on 02 Sep 2003 07:49 PDT
Rated: 5 out of 5 stars

Hi, maxout-ga:

REMARKS
=======

Having kicked around the math portion of the question as much as we
already have, I'm now thinking that I'll lead off the answer with the
SQL stored procedure that selects the appropriate zipcodes, and then
back into the mathematical derivation by way of explaining the code.

First some words about optimization.  This code is written to optimize
for the case when a very large number of zipcode locations are to be
checked for the geographic conditions (what we've called the
"latitude" and "longitude" tests).  Thus the stored procedure is
designed to minimize the computation that is done by querying each
individual record (little more than two inner products and inequality
comparisons) at the expense of some "up front" computations that are
shared across all locations.

Improvements in speed beyond this will depend on "winnowing" the set
of records that are to be checked.  For example, although my
definition of the table T has a primary key on zip[code]:
 
CREATE TABLE T ( 
    zip    char(5) not Null PRIMARY KEY, 
    phi    float   not Null, 
    theta  float   not Null 
    ) 

which by default would produce a clustered index on this field, we
might prefer to build a clustered index on phi (latitude) or theta
(longitude).  A clustered index uses the physical arrangement of the
records to "sort" its key, so that in creating the execution plan if a
query includes upper and lower bounds on (say) the latitude T.phi,
only those records whose latitudes are within the given bounds would
need to be checked.

So the next level of performance optimization I would look at would be
along these lines; I did not explore it here since I'm working only
with my own table definition, and decisions about the restructuring of
indexes on that table need to be made in perspective of all uses
(including population) placed on it by the application.

Also a few words about the mathematical background assumed in our
derivation.  We will of course discuss converting latitude and
longitude into Cartesian coordinates, albeit in a slightly customized
fashion for the purposes of our problem.  We assume some familiarity
with the notion of inner product or "dot product" between vectors in
three dimensions, which we abbreviate:

    u.v = u_x*v_x + u_y*v_y + u_z*v_z

where the subscripts on the right-hand side denote x,y,z components of
standard Cartesian coordinates.  We will also use the cross-product
and explain its definition in three dimensions.

STORED PROCEDURE
================

Here then is the definition of the stored procedure ZipStrip,
including some commented-out "instrumentation" used to verify its
validity:

if exists (select * from dbo.sysobjects 
    where id = object_id(N'[dbo].[ZipStrip]') 
    and OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[ZipStrip]
GO

-- create a stored procedure to find all locations by zipcode
-- within N miles of the shortest path between locations A and B

CREATE PROCEDURE ZipStrip(
    @zipA    char(5),
    @zipB    char(5),
    @N_mi    float
    )
AS
BEGIN

-- convert coordinate degrees to radians
-- TODO: add logic in case locations for A,B don't exist

DECLARE @phiA float, @thetaA float, @phiB float, @thetaB float

SELECT @phiA = T.phi, @thetaA = T.theta FROM T WHERE T.zip = @zipA

SELECT @phiB = T.phi, @thetaB = T.theta FROM T WHERE T.zip = @zipB

DECLARE @deg2rad float, @phiDiff float, @thetaDiff float

SET @deg2rad = PI( )/180
SET @phiDiff = @deg2rad * (@phiA - @phiB)
SET @thetaDiff = @deg2rad * (@thetaA - @thetaB)
SET @phiA = @phiA * @deg2rad
SET @thetaA = @thetaA * @deg2rad
SET @phiB = @phiB * @deg2rad
SET @thetaB = @thetaB * @deg2rad

-- convert N miles to angle for "latitude" test
-- NB: the only place R = 3958 is used is here

DECLARE @CosNR float

SET @CosNR = COS(@N_mi/3958)

-- convert latitude and longitude to Cartesian 3D coordinates

-- (phi,theta) --> [cos(phi)cos(theta),cos(phi)sin(theta),sin(phi)]

-- Works with Northern latitudes and Western longitudes regarded
-- as positive angles, placing +x-axis on the Prime Meridian and
-- +z-axis running out the North Pole.  While this produces a 
-- "left-handed" coordinate system, that's not an issue if one is
-- interested only in distances, as here.

DECLARE @vAx float, @vAy float, @vAz float

SET @vAz = SIN(@phiA)
SET @vAx = COS(@phiA)
SET @vAy = @vAx * SIN(@thetaA) -- COS(@phiA)*SIN(@thetaA)
SET @vAx = @vAx * COS(@thetaA) -- COS(@phiA)*COS(@thetaA)

DECLARE @vBx float, @vBy float, @vBz float

SET @vBz = SIN(@phiB)
SET @vBx = COS(@phiB)
SET @vBy = @vBx * SIN(@thetaB) -- COS(@phiB)*SIN(@thetaB)
SET @vBx = @vBx * COS(@thetaB) -- COS(@phiB)*COS(@thetaB)

-- find the sum V = v_A + v_B and scale it by 1/(1 + v_A . v_B)

DECLARE @Vx float, @Vy float, @Vz float, @scaleV float

SET @scaleV = 1.0/(1.0 + ( COS(@phiDiff) 
      - COS(@phiA)*COS(@phiB)*SIN(@thetaDiff)*TAN(0.5*@thetaDiff) ))
        
SET @Vx = (@vAx + @vBx) * @scaleV
SET @Vy = (@vAy + @vBy) * @scaleV
SET @Vz = (@vAz + @vBz) * @scaleV

-- find the unit vector W perpendicular to both v_A and v_B
-- take cross product of those two vectors and normalize it

DECLARE @Wx float, @Wy float, @Wz float, @scaleW float

SET @Wz = -COS(@phiA)*COS(@phiB)*SIN(@thetaDiff)
SET @Wy = @vAz*@vBx - @vBz*@vAx
SET @Wx = @vBz*@vAy - @vAz*@vBy

SET @scaleW = 1.0/SQRT( @Wx*@Wx + @Wy*@Wy + @Wz*@Wz )

SET @Wx = @Wx * @scaleW
SET @Wy = @Wy * @scaleW
SET @Wz = @Wz * @scaleW

-- We've now precomputed all the quantities depending solely
-- on v_A and v_B, so we're ready to perform the main SELECT

-- check V.W, should be zero
-- SELECT @Vx*@Wx + @Vy*@Wy + @Vz*@Wz as VdotW

-- check V.vA and V.vB, should be equal
-- SELECT @Vx*@vAx + @Vy*@vAy + @Vz*@vAz as VdotvA,
--        @Vx*@vBx + @Vy*@vBy + @Vz*@vBz as VdotvB

-- check norm of V squared
-- SELECT @Vx*@Vx + @Vy*@Vy + @Vz*@Vz as NormVSqr


SELECT T.zip
FROM T
WHERE SQRT( 1.0 - SQUARE( SIN(T.phi*@deg2rad)*@Wz 
              + COS(T.phi*@deg2rad)*SIN(T.theta*@deg2rad)*@Wy
              + COS(T.phi*@deg2rad)*COS(T.theta*@deg2rad)*@Wx ) )
    BETWEEN @CosNR AND 1.0E-8 + ( SIN(T.phi*@deg2rad)*@Vz 
              + COS(T.phi*@deg2rad)*SIN(T.theta*@deg2rad)*@Vy
              + COS(T.phi*@deg2rad)*COS(T.theta*@deg2rad)*@Vx )
END

TEST CASES
==========

Recalling the test case records which were given in a Comment below,
here are a few tests of the stored procedure based on them:

-- this first series tests the Dallas/Houston strip

ZipStrip '75207','77050',5
go

-- 75110 'Corsicana TX'
-- 75207 'Dallas'
-- 77050 'Houston'

ZipStrip '75207','77050',50
go

-- increased width, no additional records

ZipStrip '75207','77050',100
go

-- 76710 'Waco TX' appears now

ZipStrip '75207','77050',500
go

-- 78210 'San Antonio TX' at this width

-- NB:  76110 'Fort Worth TX' never appears, despite
-- its proximity to Dallas, because of its angle to
-- the path from Dallas to Houston

-- this second series tests the LA/NY strip

ZipStrip '90010','10010',50
go

-- 10010 'New York'
-- 63401 'Hannibal MO'
-- 90010 'Los Angeles'

ZipStrip '90010','10010',500
go

-- two more records appear at this distance
-- 75207 'Dallas'
-- 76110 'Fort Worth'

HOW IT WORKS
============

The stored procedure ZipStrip is intended to take as input the
zipcodes for two distinct locations A,B within an existing table T of
such locations.

A constant @deg2rad which gives the number pi/180 of radians per
degree is defined for our use in converting degrees of latitude phi
and longitude theta given in that table into the radian values
expected by the trigonometric functions in MS SQL Server's
Transact-SQL implementation.

The mapping from latitude/longitude pairs we use here:

[x,y,z] = [ COS(phi)*COS(theta), COS(phi)*SIN(theta), SIN(phi) ]

deserves a bit of explanation.  First of all the mapping is to
Cartesian coordinates on a unit sphere.  This effectively assumes that
distances on the Earth can be modelled as if it were a perfect sphere.
 Actually the Earth is more of a oblate spheroid or ellipsoid, with a
slightly greater radius of 3963 miles at the equator versus 3950 miles
at the poles, but I've chosen to use an intermediate value of R = 3958
miles for the purpose of this computation.  The upshot of the scaling
is that the radius is only needed in one expression, so for the rest
of the procedure we are fine with working only on a unit sphere.

The one place where R is needed is in the condition for "half-width"
of the strip of zipcode locations, which elsewhere we refer to as the
"latitude test" because the interpretation of the path from A to B as
an "equator" would make an distance from this path correspond to an
angle of latitude.  Precisely, a spherical distance of N miles on a
sphere of radius R corresponds to an angle (as viewed from the
sphere's center) of N/R radians.  Therefore we will want to select
points whose "unit vector" u lies within this angle of "latitude" with
respect to an "equator" passing through input points A,B.

The other test that we impose is the "longitude test," which asks that
this same unit vector u should fall between the "lines of longitude"
that pass through A,B respectively.  Here we are not speaking of real
lines of longitude, but rather of great circles at A,B resp. which are
perpendicular there to the great circle "equator" through A and B.

Computationally we need to decompose the vector u into "orthogonal"
components:

    u = v + w

such that v lies in the same plane as the great circle "equator"
through A and B, and w is perpendicular to this plane.  Our approach
is to take the two unit vectors v_A and v_B that correspond to points
A,B and construct a unit vector:

    v_A x v_B

Here is meant a special "cross product" of three dimensional vectors
which has the characteristic of being perpendicular to both "factors"
v_A and v_B.  We then "normalize" the resulting vector to have unit
length and call this W.

The relationship of u,w, and W is then given by a simple inner or
"dot" product:

    w = (u.W) W

This is the scalar multiple (u.W) of W, and it has length or "norm":

   ||w|| = |u.W|

because ||W|| = 1.

The cross-product of two vectors is only defined in three dimensions,
and it can be a bit cumbersome to state.  The most compact form of
definition is to describe it in terms of a determinant, but here it is
written out:

    u x v = [ u_y*v_z - u_z*v_y,
              u_z*v_x - u_x*v_z,
              u_x*v_y - u_y*v_x ]

All we really need to know for our problem can be verified by direct
computation, namely that:

    (u x v).u = 0

    (u x v).v = 0

i.e. that the cross-product is orthogonal (perpendicular) to both its
factors.

In our procedure we apply this to get W from v_A x v_B to be
perpendicular to both of them, and hence to the entire plane of the
"equator" passing through A and B.

Once we have w from W by the inner product relation above, the other
component v is also known:

    v = u - w

In fact we don't need to explicitly compute v at all, because the only
things we really will need about v are:

  1) its norm ||v||, and

  2) its inner product with the "bisector" of v_A and v_B

        v.(v_A+v_B)

The first of these values is easily obtained by the Pythagorean
relationship:

   ||v||^2 + ||w||^2 = ||u||^2 = 1

because v,w are the perpendicular "legs" that form sum vector u (which
is of unit length), and thus:

   ||v|| = SQRT( 1 - ||w||^2 )

The second of those values above, the inner product of v with the sum
of vectors of v_A and v_B, is important because of its relationship to
the "longitude test".  Imagine for a moment that v is restricted to
being a unit vector lying in the plane of the "equator" passing
through A and B, i.e. the plane that contains both v_A and v_B.  This
expression:

    v.(v_A + v_B) 

will then take on positive and negative values as v "rotates" around
that great circle.  In fact the maximum value occurs when v is
parallel to v_A + v_B, namely at the midpoint of the (shortest) path
between A and B, and the expression has a strictly smaller value at
the "endpoints" of this path:

    when v = v_A or v = v_B,

    v.(v_A + v_B) = 1 + (v_A . v_B)

Therefore if v were a unit vector (in the plane of v_A and v_B), our
test for whether v falls "between" v_A and v_B would simply be:

    v.(v_A + v_B) >= 1 + (v_A . v_B)

where the right-hand side of this inequality is simply a constant,
depending only on v_A and v_B, that we can precompute.

More generally, since v need not be a unit vector, we need to scale
the test according to the actual length of v:

    v.(v_A + v_B) >= ( 1 + (v_A . v_B) ) ||v||

A couple of tricks are used to make the test more "compact".  We
define:

    V = ( 1/ (1 + (v_A . v_B) ) (v_A + v_B)

so that we have only to check:

    v.V >= ||v||

and second we note that by the decomposition u = v + w, w is
orthogonal to V and:

    u.V = v.V

Therefore, we have the implemented form of the longitude test:

    u.V >= ||v||

But this ties in easily with the other half, our "latitude test". 
Instead of asking:

    ||w|| <= SIN(N/R)

which expresses the notion that the perpendicular distance ||w|| that
u lies above the plane of v_A and v_B is within the allowed N miles,
we can turn the relationship around to a condition on ||v||:

    ||v|| = SQRT( 1 - ||w||^2 )

          >=  COS(N/R)  [note reversal in inequality direction]

Now the advantage of stating the combined conditions like this:

    COS(N/R) <= ||v|| <= u.V

is that we can use the BETWEEN syntax of a WHERE condition to insure
that the expressions are only mentioned and thus only evaluated once:

    WHERE SQRT( 1 - ||w||^2 )
        BETWEEN  COS(N/R)  AND  u.V

I need to look back over my notes to pick out some links for
additional resources, but I'm posting this now for you to test.

regards, mathtalk-ga

maxout-ga rated this answer: 5 out of 5 stars

I'd sure like to crawl in Mathtalk's brain for a day.  My answer
turned out to be fairly involved, but he broke it down step by step
not only providing the answer, but also the reasoning behind the
answer.

Upon testing his answer operates exactly as expected, and allows my
application to run real time instead of batch.

Thanks!

Comments

Subject: Re: Identifying zip codes between two points
From: mathtalk-ga on 27 Aug 2003 00:25 PDT

Hi, maxout-ga:

I've been actively working on your question during this time, and
while I can appreciate that you might think this is an overly long
time, please bear in mind that it involves both a mathematical problem
and SQL Server issues.

To post a partial answer for either part as an answer would be against
the Google Answer's Terms of Service.  Thus I will use this comment to
give an outline of the answer.


MATHEMATICAL PROBLEM
====================

Let two distinct points A,B taken from your zipcode table be
specified.  These provide the coordinates of latitude and longitude on
the Earth's surface, which we will model as a perfect sphere of radius
R miles.

As can be seen from the special case in which A,B are endpoints on the
equator, the set of points within N miles of the (great circle arc)
shortest path from A to B has two pairs of boundaries.  The first pair
we will describe by a "latitude test" since in the special case the
upper and lower boundaries are simply two lines of latitude, parallel
to the equator.  The second pair we will describe by a "longitude test
since in the special case the left and right boundaries are simly two
lines of longitude (passing through A,B respectively).

In the general case we will require these formulas for the
mathematical expressions.  Here v_A and v_B represent unit vectors in
three dimensions whose Cartesian coordinates are converted from the
latitude/longitude measurements of points A,B respectively.  We will
use ||v|| to denote the norm (length) of a vector v, and <v,w> to
denote the inner product (dot product) of two vectors.


Coordinate Conversion:

Given the latitude phi and longitude theta of a point on the unit
sphere, its Cartesian coordinates can be mapped by these formulas:

    x = cos(phi) cos(theta)

    y = cos(phi) sin(theta)

    z = sin(phi)

[Note: This mapping produces a left-handed coordinate system in 3D,
but it is especially convenient for locations in the United States,
which all lie in the Northern latitudes and (with minor exceptions)
have Western longitudes.]


Orthogonal Decomposition:

Let u be a unit vector.  Assuming that v_A and v_B are not anti-podal
(exactly on opposite sides of the sphere), then one can uniquely
express:

    u = v + w

where v is a vector in the plane spanned by {v_A,v_B}, and w is a
vector that is orthogonal (perpendicular) to that plane.

In general v,w will each have length less than 1, as the unit length
of u is the hypotenuse of a right triangle formed with "legs" of
length equal to that of v,w.

  ||v||^2 + ||w||^2 = 1

Think of v as the "shadow" of u cast onto the plane that contains both
v_A and v_B, if the Earth were a crystal sphere and the Sun were moved
directly overhead of this plane.  Hmm.  Okay, I'd better just stick to
the math...


Latitude Test:

  ||w|| <= sin( N/R )

Longitude Test:

  <v, v_A + v_B>  >= (1 + <v_A, v_B>) ||v||

Note that, as is to be expected, the first "latitude" test involves
the parameter of N miles (suitably scaled by radius R), but the second
"longitude" test does not.


Summary:

Any points u which pass these two tests correspond to locations within
the desired region "near" the shortest path from A to B.


SQL SERVER ISSUES
=================

For the sake of discussion we assume the existence of a simplified
table T in which only a primary five digit zipcode key (T.zip) and the
two dependent fields of latitude and longitude are present (T.phi and
T.theta, respectively).

Your original question asks for an implementation based on a
user-defined function and a single SQL query.  Ignoring for the moment
the distinction between specifying N miles versus a percentage N% (of
the distance from A to B) as the region half-width, such a formulation
might look like this:

SELECT C.*
FROM T as A, T as B, T as C
WHERE A.zip = @zipA
 AND  B.zip = @zipB
 AND  MyTestFunction(A.phi,A.theta,B.phi,B.theta,C.phi,C.theta,@N) = 1

in which we've made use of "parameters" @zipA and @zipB to identify
the endpoints by their zipcodes and @N to specify the threshold
distance in miles.

Here MyTestFunction( ) should be written to compute v_A from
A.phi,A.theta and v_B from B.phi,B.theta, to compute u from
C.phi,C.theta and determine its orthogonal decomposition, and to
return 1 if both tests described above are satisfied (otherwise 0).

While such an implementation is possible, its structure requires the
evaluation of MyTestFunction( ) on all records in table T (as aliased
here by C).  I fear that the performance of this approach would be
much slower than optimal.

Two alternatives to this can be considered.  One is to use a stored
procedure, and another is to use a user-defined function that returns
a table value (which could be an image of the selected rows from T). 
Either of these approaches would allow us to compute once for all rows
those values, like the inner product <v_A,v_B>, which depend only on
the endpoints A,B and not on the prospective test point.

I therefore request your patience as I work to supply a clear and
complete explanation, both for the mathematical methods and SQL
techniques which I will recommend.

regards, mathtalk-ga

Subject: Re: Identifying zip codes between two points
From: mathtalk-ga on 28 Aug 2003 15:21 PDT

-- create a sample table T to hold latitude and longitude
-- coordinates keyed to zipcode

if exists (SELECT * FROM dbo.sysobjects
           WHERE id = object_id(N'[T]')
            AND  OBJECTPROPERTY(id, N'IsUserTable') = 1)
DROP TABLE [T]
GO

CREATE TABLE T (
    zip    char(5) not Null PRIMARY KEY,
    phi    float   not Null,
    theta  float   not Null
    )
GO

INSERT INTO T VALUES ('75207', 32.7853, 96.8223)  -- Dallas
go
INSERT INTO T VALUES ('77050', 29.9023, 95.2677)  -- Houston
go
INSERT INTO T VALUES ('75110', 32.0983, 96.5183)  -- Corsicana TX
go
INSERT INTO T VALUES ('76710', 31.5356, 97.1921)  -- Waco TX
go
INSERT INTO T VALUES ('78210', 29.3962, 98.4659)  -- San Antonio TX
go
INSERT INTO T VALUES ('76110', 32.7066, 97.3382)  -- Fort Worth TX
go
INSERT INTO T VALUES ('90010', 34.0589, 118.3299)  -- Los Angeles CA
go
INSERT INTO T VALUES ('10010', 40.7386, 73.9826)  -- New York NY
go
INSERT INTO T VALUES ('63401', 39.7145, 91.4209)  -- Hannibal MO
go


-- create a user-defined function which finds distance/angle between
-- two points on the unit sphere from "geographic" coordinates 
-- latitude phi and longitude theta measured in degrees

CREATE FUNCTION SphereDist (
    @phiA    float,
    @thetaA  float,
    @phiB    float,
    @thetaB  float
    )

RETURNS float AS

BEGIN

-- convert all degrees to radians

DECLARE @deg2rad float, @phiDiff float

SET @deg2rad   = PI( )/180
SET @phiDiff   = @deg2rad * (@phiA - @phiB)
SET @phiA      = @deg2rad * @phiA
SET @phiB      = @deg2rad * @phiB
SET @deg2rad   = @deg2rad * (@thetaA - @thetaB)

-- this form of the dot product between vectors A & B
-- minimizes cancellation errors due to small differences
-- between pairs of latitude or longitude coordinates

    RETURN( ACOS( COS(@phiDiff) 
        - COS(@phiA)*COS(@phiB)*SIN(@deg2rad)*TAN(0.5*@deg2rad) )  )

END

-- test the user-defined function by finding all distance pairs
-- using 3958 miles as the approximate radius of the earth

SELECT A.zip, B.zip, 3958*dbo.SphereDist(A.phi,A.theta,B.phi,B.theta)
FROM T as A, T as B
WHERE A.zip < B.zip
go

Subject: Re: Identifying zip codes between two points
From: sanjay_gandhi-ga on 30 Mar 2004 13:20 PST

This is amazing. We spent 12 hours just to find the formula for
calculating distance between 2 lat/long pairs!!!

We have a separate problem though. We got a database of around 150
different stores and 1.6 million customers. I came up with a SQL
procedure to create a table with these fields: Store, Customer,
Distance, Rank. The rank is based on the closest store to a customer.
So sample output will be somthing like this:
Customer  Store  Distance  Rank
1          100     2.1      1
1          101     2.9      2
1          120     6.0      3

It is currently taking 60 hours of processing time on a dual proc server.

Appreciate if anyone could give any tips to speed it up. I will post
it as a question to the math genius maxout-ga if no one responds!

Subject: Re: Identifying zip codes between two points
From: mathtalk-ga on 02 Apr 2004 06:17 PST

Hi, sanjay_gandhi-ga:

Your description is too brief to make detailed suggestions for
improvements possible.  It seems unlikely that your goal is to rank
all 150 stores for each of the 1.6 million customers; surely some
savings in computation can be obtained by limiting the results to
finding the 5 closest stores for each customer, although of course
this is still a whopping big job if you need to do it for all
customers at once.

I believe that it would be practical to take an "on demand" approach
as I did for maxout-ga, given a particular customer, to find the 5
closest stores.  You may wish, for practical reasons, to limit the
store search to a distance of (say) 500 miles.  One element of this
would be to quickly apply longitude and latitude limits so that the
distance calculation is carried out, not over all 150 stores, but only
over a modest subset of them that are "reasonably" close.

If you are doing this as a batch, then I'd suggest that you consider
an incremental strategy in which you update the results only as
entries for stores or customers are added, changed, or deleted.

The more details you can supply, the more likely it is that someone
with an interest in assisting you could be successful.  If you are
interested in programming this yourself, I think you'll find much of
the computation has been done here for maxout-ga.  My efforts to
optimize the calculations for his particular need may have obscured
some of the basics, however.

regards, mathtalk-ga

Subject: Re: Identifying zip codes between two points
From: sanjay_gandhi-ga on 02 Apr 2004 10:08 PST

--Your description is too brief to make detailed suggestions for
--improvements possible.  It seems unlikely that your goal is to rank
--all 150 stores for each of the 1.6 million customers; surely some
--savings in computation can be obtained by limiting the results to
--finding the 5 closest stores for each customer, although of course
--this is still a whopping big job if you need to do it for all
--customers at once.

Let me clarify my goal: Get all stores within 30 miles of a customer
and rank them based on distance. If a customer has less than 5 stores
within that range, then go further till 300 miles to get at least 5
closest stores.

I cannot use an incremental approach because when a new store is
added, it can change the rankings of any of the 1.6 million customers.
So I still have to scan the entire customer database.

BTW, I got the processing down to ***10 hours*** on a dual 1.2Ghz
machine. Just in case anyone is curious how I did it:

NOTE: Avoid using SQL cursors. That's why it took 40 hrs before!

1. First, fill up the destination table (let's say Output_Result)
without the rank information. So, just get all stores within 30 miles
of the 1.6 million customers.
You can speed this up by using what I call "perfect square" method.
Instead of using a formula to calculate distance between a customer
and all 150 stores, we can limit the number of stores used in the
query. Remember, we need all stores within 30 mile radius. So, imagine
this 30 mile radius within a square of width=(Diameter of circle).
Doing some geometric calulcations, get the upper and lower limit of
lat/long first. This will limit your calculation to around 30 stores.
And then use the distance formula to calulate distance beteen 2
lat/long pairs and batch insert the result set.
This takes about ***6 hours processing time***. 

2. Then, do the ranking using SQL Update statement. This was the
toughest part. I got this off from some newsgroup. Here is the SQL
statement:

update Output_Result
set Rank = (SELECT Count_Big(*) + 1
     FROM Output_Result OR2
     Where Output_Result.CustID = OR2.CustID and 
     Output_Result.DistanceToStore > OR2.SADA_DistanceToStore
    )

This part takes around ***4 hours*** when the Output_Result table has
around 20 million rows.
Thats it!!!

Subject: Re: Identifying zip codes between two points
From: sanjay_gandhi-ga on 02 Apr 2004 10:24 PST

CORRECTION:
The update sql statement should have been:

update Output_Result
set Rank = (SELECT Count_Big(*) + 1
     FROM Output_Result OR2
     Where Output_Result.CustID = OR2.CustID and 
     Output_Result.DistanceToStore > OR2.DistanceToStore
    )

Subject: Re: Identifying zip codes between two points
From: mathtalk-ga on 02 Apr 2004 10:45 PST

You don't mention what indexes, if any, are defined on your
Output_Results table, but that might be one area to investigate.

I believe that a table like Output_Results can be most efficiently
maintained incrementally.  If a new store is added, you don't have to
revise results for all 1.6 million customers, only those within 300
miles of the store (actually only for a subset of these, customers
within 30 miles of the store and those with at most 5 store rankings).
 However running a batch refresh of all the rankings may not be a
burden for you if this only needs to be done infrequently, and it's
certainly a good idea to have this capability.

Cursors can be a good thing, but one needs to be careful to use them
efficiently.  A cursor over the entire set of customers, for example,
is not a good place to begin (because the number of stores is much
smaller than the number of customers).

If I were starting from scratch I'd try a cursor over the list of
Stores and use your "perfect square" method to find all the candidate
Customers that might be within 30 miles of each Store.  Do the
rankings with that smaller set of "output results", then go back and
identify Customers with fewer than 5 rankings.  Then work with this
smaller set of Customers, which requires throwing "perfect squares"
out to 300 miles for any Stores.

You undoubtedly already have an idea what percentage of your Customers
will wind up in the latter category (fewer than 5 stores within 30
miles), so you can judge for yourself if this divide-and-conquer
scheme is promising.

regards, mathtalk-ga

Subject: Re: Identifying zip codes between two points
From: sanjay_gandhi-ga on 02 Apr 2004 14:30 PST

WOW! You are quick in responding.

I got 2 Foreign Key indices on CustID and StoreID. 

I am still not confident about updating table incrementally, since we
run this update 2 times a year and its not just the store addresses
that change, its the customers too. So I'll have to build the logic of
tracking which addresses changed and all that. Not worth spending time
on.

I did try the cursor on store instead of doing a "batch"
(Select...Insert Into..) statement. It didnt give any performance
improvement - in fact it took an hour more to execute. I think the SQL
2000 is smart enough to optimize the batch insert. It was a good try
though.

Thanks mathtalk-ga. It really helps to get a second opinion

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy