|
|
Subject:
Identifying zip codes between two points
Category: Computers > Programming Asked by: maxout-ga List Price: $75.00 |
Posted:
22 Aug 2003 17:03 PDT
Expires: 21 Sep 2003 17:03 PDT Question ID: 247783 |
I have a databaes of zip codes along with the corresponding latitude and longitude. I need to find all of the zip codes between the two inputs, including a buffer of N% or N miles of the distance between the two. For example, Dallas to Houston is 250 miles (75207 to 77050). I need to find all of the zip codes betweent he two zips as well as 12 miles (5%) on either side of the line. This is all in SQL Server 2K and t-sql. The steps I expect to see are: - Convert lat/lon to polar - Project polar coords down to cartesian coordinates - An algorithm/SQL function that takes the line between the two points and make a polygon that represents the buffer area - A SQL function that tests whether the point/zip is inside the polygon above - A SQL query that calls the function to give the results | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
|
|
Subject:
Re: Identifying zip codes between two points
Answered By: mathtalk-ga on 02 Sep 2003 07:49 PDT Rated: |
Hi, maxout-ga: REMARKS ======= Having kicked around the math portion of the question as much as we already have, I'm now thinking that I'll lead off the answer with the SQL stored procedure that selects the appropriate zipcodes, and then back into the mathematical derivation by way of explaining the code. First some words about optimization. This code is written to optimize for the case when a very large number of zipcode locations are to be checked for the geographic conditions (what we've called the "latitude" and "longitude" tests). Thus the stored procedure is designed to minimize the computation that is done by querying each individual record (little more than two inner products and inequality comparisons) at the expense of some "up front" computations that are shared across all locations. Improvements in speed beyond this will depend on "winnowing" the set of records that are to be checked. For example, although my definition of the table T has a primary key on zip[code]: CREATE TABLE T ( zip char(5) not Null PRIMARY KEY, phi float not Null, theta float not Null ) which by default would produce a clustered index on this field, we might prefer to build a clustered index on phi (latitude) or theta (longitude). A clustered index uses the physical arrangement of the records to "sort" its key, so that in creating the execution plan if a query includes upper and lower bounds on (say) the latitude T.phi, only those records whose latitudes are within the given bounds would need to be checked. So the next level of performance optimization I would look at would be along these lines; I did not explore it here since I'm working only with my own table definition, and decisions about the restructuring of indexes on that table need to be made in perspective of all uses (including population) placed on it by the application. Also a few words about the mathematical background assumed in our derivation. We will of course discuss converting latitude and longitude into Cartesian coordinates, albeit in a slightly customized fashion for the purposes of our problem. We assume some familiarity with the notion of inner product or "dot product" between vectors in three dimensions, which we abbreviate: u.v = u_x*v_x + u_y*v_y + u_z*v_z where the subscripts on the right-hand side denote x,y,z components of standard Cartesian coordinates. We will also use the cross-product and explain its definition in three dimensions. STORED PROCEDURE ================ Here then is the definition of the stored procedure ZipStrip, including some commented-out "instrumentation" used to verify its validity: if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[ZipStrip]') and OBJECTPROPERTY(id, N'IsProcedure') = 1) drop procedure [dbo].[ZipStrip] GO -- create a stored procedure to find all locations by zipcode -- within N miles of the shortest path between locations A and B CREATE PROCEDURE ZipStrip( @zipA char(5), @zipB char(5), @N_mi float ) AS BEGIN -- convert coordinate degrees to radians -- TODO: add logic in case locations for A,B don't exist DECLARE @phiA float, @thetaA float, @phiB float, @thetaB float SELECT @phiA = T.phi, @thetaA = T.theta FROM T WHERE T.zip = @zipA SELECT @phiB = T.phi, @thetaB = T.theta FROM T WHERE T.zip = @zipB DECLARE @deg2rad float, @phiDiff float, @thetaDiff float SET @deg2rad = PI( )/180 SET @phiDiff = @deg2rad * (@phiA - @phiB) SET @thetaDiff = @deg2rad * (@thetaA - @thetaB) SET @phiA = @phiA * @deg2rad SET @thetaA = @thetaA * @deg2rad SET @phiB = @phiB * @deg2rad SET @thetaB = @thetaB * @deg2rad -- convert N miles to angle for "latitude" test -- NB: the only place R = 3958 is used is here DECLARE @CosNR float SET @CosNR = COS(@N_mi/3958) -- convert latitude and longitude to Cartesian 3D coordinates -- (phi,theta) --> [cos(phi)cos(theta),cos(phi)sin(theta),sin(phi)] -- Works with Northern latitudes and Western longitudes regarded -- as positive angles, placing +x-axis on the Prime Meridian and -- +z-axis running out the North Pole. While this produces a -- "left-handed" coordinate system, that's not an issue if one is -- interested only in distances, as here. DECLARE @vAx float, @vAy float, @vAz float SET @vAz = SIN(@phiA) SET @vAx = COS(@phiA) SET @vAy = @vAx * SIN(@thetaA) -- COS(@phiA)*SIN(@thetaA) SET @vAx = @vAx * COS(@thetaA) -- COS(@phiA)*COS(@thetaA) DECLARE @vBx float, @vBy float, @vBz float SET @vBz = SIN(@phiB) SET @vBx = COS(@phiB) SET @vBy = @vBx * SIN(@thetaB) -- COS(@phiB)*SIN(@thetaB) SET @vBx = @vBx * COS(@thetaB) -- COS(@phiB)*COS(@thetaB) -- find the sum V = v_A + v_B and scale it by 1/(1 + v_A . v_B) DECLARE @Vx float, @Vy float, @Vz float, @scaleV float SET @scaleV = 1.0/(1.0 + ( COS(@phiDiff) - COS(@phiA)*COS(@phiB)*SIN(@thetaDiff)*TAN(0.5*@thetaDiff) )) SET @Vx = (@vAx + @vBx) * @scaleV SET @Vy = (@vAy + @vBy) * @scaleV SET @Vz = (@vAz + @vBz) * @scaleV -- find the unit vector W perpendicular to both v_A and v_B -- take cross product of those two vectors and normalize it DECLARE @Wx float, @Wy float, @Wz float, @scaleW float SET @Wz = -COS(@phiA)*COS(@phiB)*SIN(@thetaDiff) SET @Wy = @vAz*@vBx - @vBz*@vAx SET @Wx = @vBz*@vAy - @vAz*@vBy SET @scaleW = 1.0/SQRT( @Wx*@Wx + @Wy*@Wy + @Wz*@Wz ) SET @Wx = @Wx * @scaleW SET @Wy = @Wy * @scaleW SET @Wz = @Wz * @scaleW -- We've now precomputed all the quantities depending solely -- on v_A and v_B, so we're ready to perform the main SELECT -- check V.W, should be zero -- SELECT @Vx*@Wx + @Vy*@Wy + @Vz*@Wz as VdotW -- check V.vA and V.vB, should be equal -- SELECT @Vx*@vAx + @Vy*@vAy + @Vz*@vAz as VdotvA, -- @Vx*@vBx + @Vy*@vBy + @Vz*@vBz as VdotvB -- check norm of V squared -- SELECT @Vx*@Vx + @Vy*@Vy + @Vz*@Vz as NormVSqr SELECT T.zip FROM T WHERE SQRT( 1.0 - SQUARE( SIN(T.phi*@deg2rad)*@Wz + COS(T.phi*@deg2rad)*SIN(T.theta*@deg2rad)*@Wy + COS(T.phi*@deg2rad)*COS(T.theta*@deg2rad)*@Wx ) ) BETWEEN @CosNR AND 1.0E-8 + ( SIN(T.phi*@deg2rad)*@Vz + COS(T.phi*@deg2rad)*SIN(T.theta*@deg2rad)*@Vy + COS(T.phi*@deg2rad)*COS(T.theta*@deg2rad)*@Vx ) END TEST CASES ========== Recalling the test case records which were given in a Comment below, here are a few tests of the stored procedure based on them: -- this first series tests the Dallas/Houston strip ZipStrip '75207','77050',5 go -- 75110 'Corsicana TX' -- 75207 'Dallas' -- 77050 'Houston' ZipStrip '75207','77050',50 go -- increased width, no additional records ZipStrip '75207','77050',100 go -- 76710 'Waco TX' appears now ZipStrip '75207','77050',500 go -- 78210 'San Antonio TX' at this width -- NB: 76110 'Fort Worth TX' never appears, despite -- its proximity to Dallas, because of its angle to -- the path from Dallas to Houston -- this second series tests the LA/NY strip ZipStrip '90010','10010',50 go -- 10010 'New York' -- 63401 'Hannibal MO' -- 90010 'Los Angeles' ZipStrip '90010','10010',500 go -- two more records appear at this distance -- 75207 'Dallas' -- 76110 'Fort Worth' HOW IT WORKS ============ The stored procedure ZipStrip is intended to take as input the zipcodes for two distinct locations A,B within an existing table T of such locations. A constant @deg2rad which gives the number pi/180 of radians per degree is defined for our use in converting degrees of latitude phi and longitude theta given in that table into the radian values expected by the trigonometric functions in MS SQL Server's Transact-SQL implementation. The mapping from latitude/longitude pairs we use here: [x,y,z] = [ COS(phi)*COS(theta), COS(phi)*SIN(theta), SIN(phi) ] deserves a bit of explanation. First of all the mapping is to Cartesian coordinates on a unit sphere. This effectively assumes that distances on the Earth can be modelled as if it were a perfect sphere. Actually the Earth is more of a oblate spheroid or ellipsoid, with a slightly greater radius of 3963 miles at the equator versus 3950 miles at the poles, but I've chosen to use an intermediate value of R = 3958 miles for the purpose of this computation. The upshot of the scaling is that the radius is only needed in one expression, so for the rest of the procedure we are fine with working only on a unit sphere. The one place where R is needed is in the condition for "half-width" of the strip of zipcode locations, which elsewhere we refer to as the "latitude test" because the interpretation of the path from A to B as an "equator" would make an distance from this path correspond to an angle of latitude. Precisely, a spherical distance of N miles on a sphere of radius R corresponds to an angle (as viewed from the sphere's center) of N/R radians. Therefore we will want to select points whose "unit vector" u lies within this angle of "latitude" with respect to an "equator" passing through input points A,B. The other test that we impose is the "longitude test," which asks that this same unit vector u should fall between the "lines of longitude" that pass through A,B respectively. Here we are not speaking of real lines of longitude, but rather of great circles at A,B resp. which are perpendicular there to the great circle "equator" through A and B. Computationally we need to decompose the vector u into "orthogonal" components: u = v + w such that v lies in the same plane as the great circle "equator" through A and B, and w is perpendicular to this plane. Our approach is to take the two unit vectors v_A and v_B that correspond to points A,B and construct a unit vector: v_A x v_B Here is meant a special "cross product" of three dimensional vectors which has the characteristic of being perpendicular to both "factors" v_A and v_B. We then "normalize" the resulting vector to have unit length and call this W. The relationship of u,w, and W is then given by a simple inner or "dot" product: w = (u.W) W This is the scalar multiple (u.W) of W, and it has length or "norm": ||w|| = |u.W| because ||W|| = 1. The cross-product of two vectors is only defined in three dimensions, and it can be a bit cumbersome to state. The most compact form of definition is to describe it in terms of a determinant, but here it is written out: u x v = [ u_y*v_z - u_z*v_y, u_z*v_x - u_x*v_z, u_x*v_y - u_y*v_x ] All we really need to know for our problem can be verified by direct computation, namely that: (u x v).u = 0 (u x v).v = 0 i.e. that the cross-product is orthogonal (perpendicular) to both its factors. In our procedure we apply this to get W from v_A x v_B to be perpendicular to both of them, and hence to the entire plane of the "equator" passing through A and B. Once we have w from W by the inner product relation above, the other component v is also known: v = u - w In fact we don't need to explicitly compute v at all, because the only things we really will need about v are: 1) its norm ||v||, and 2) its inner product with the "bisector" of v_A and v_B v.(v_A+v_B) The first of these values is easily obtained by the Pythagorean relationship: ||v||^2 + ||w||^2 = ||u||^2 = 1 because v,w are the perpendicular "legs" that form sum vector u (which is of unit length), and thus: ||v|| = SQRT( 1 - ||w||^2 ) The second of those values above, the inner product of v with the sum of vectors of v_A and v_B, is important because of its relationship to the "longitude test". Imagine for a moment that v is restricted to being a unit vector lying in the plane of the "equator" passing through A and B, i.e. the plane that contains both v_A and v_B. This expression: v.(v_A + v_B) will then take on positive and negative values as v "rotates" around that great circle. In fact the maximum value occurs when v is parallel to v_A + v_B, namely at the midpoint of the (shortest) path between A and B, and the expression has a strictly smaller value at the "endpoints" of this path: when v = v_A or v = v_B, v.(v_A + v_B) = 1 + (v_A . v_B) Therefore if v were a unit vector (in the plane of v_A and v_B), our test for whether v falls "between" v_A and v_B would simply be: v.(v_A + v_B) >= 1 + (v_A . v_B) where the right-hand side of this inequality is simply a constant, depending only on v_A and v_B, that we can precompute. More generally, since v need not be a unit vector, we need to scale the test according to the actual length of v: v.(v_A + v_B) >= ( 1 + (v_A . v_B) ) ||v|| A couple of tricks are used to make the test more "compact". We define: V = ( 1/ (1 + (v_A . v_B) ) (v_A + v_B) so that we have only to check: v.V >= ||v|| and second we note that by the decomposition u = v + w, w is orthogonal to V and: u.V = v.V Therefore, we have the implemented form of the longitude test: u.V >= ||v|| But this ties in easily with the other half, our "latitude test". Instead of asking: ||w|| <= SIN(N/R) which expresses the notion that the perpendicular distance ||w|| that u lies above the plane of v_A and v_B is within the allowed N miles, we can turn the relationship around to a condition on ||v||: ||v|| = SQRT( 1 - ||w||^2 ) >= COS(N/R) [note reversal in inequality direction] Now the advantage of stating the combined conditions like this: COS(N/R) <= ||v|| <= u.V is that we can use the BETWEEN syntax of a WHERE condition to insure that the expressions are only mentioned and thus only evaluated once: WHERE SQRT( 1 - ||w||^2 ) BETWEEN COS(N/R) AND u.V I need to look back over my notes to pick out some links for additional resources, but I'm posting this now for you to test. regards, mathtalk-ga |
maxout-ga
rated this answer:
I'd sure like to crawl in Mathtalk's brain for a day. My answer turned out to be fairly involved, but he broke it down step by step not only providing the answer, but also the reasoning behind the answer. Upon testing his answer operates exactly as expected, and allows my application to run real time instead of batch. Thanks! |
|
Subject:
Re: Identifying zip codes between two points
From: mathtalk-ga on 27 Aug 2003 00:25 PDT |
Hi, maxout-ga: I've been actively working on your question during this time, and while I can appreciate that you might think this is an overly long time, please bear in mind that it involves both a mathematical problem and SQL Server issues. To post a partial answer for either part as an answer would be against the Google Answer's Terms of Service. Thus I will use this comment to give an outline of the answer. MATHEMATICAL PROBLEM ==================== Let two distinct points A,B taken from your zipcode table be specified. These provide the coordinates of latitude and longitude on the Earth's surface, which we will model as a perfect sphere of radius R miles. As can be seen from the special case in which A,B are endpoints on the equator, the set of points within N miles of the (great circle arc) shortest path from A to B has two pairs of boundaries. The first pair we will describe by a "latitude test" since in the special case the upper and lower boundaries are simply two lines of latitude, parallel to the equator. The second pair we will describe by a "longitude test since in the special case the left and right boundaries are simly two lines of longitude (passing through A,B respectively). In the general case we will require these formulas for the mathematical expressions. Here v_A and v_B represent unit vectors in three dimensions whose Cartesian coordinates are converted from the latitude/longitude measurements of points A,B respectively. We will use ||v|| to denote the norm (length) of a vector v, and <v,w> to denote the inner product (dot product) of two vectors. Coordinate Conversion: Given the latitude phi and longitude theta of a point on the unit sphere, its Cartesian coordinates can be mapped by these formulas: x = cos(phi) cos(theta) y = cos(phi) sin(theta) z = sin(phi) [Note: This mapping produces a left-handed coordinate system in 3D, but it is especially convenient for locations in the United States, which all lie in the Northern latitudes and (with minor exceptions) have Western longitudes.] Orthogonal Decomposition: Let u be a unit vector. Assuming that v_A and v_B are not anti-podal (exactly on opposite sides of the sphere), then one can uniquely express: u = v + w where v is a vector in the plane spanned by {v_A,v_B}, and w is a vector that is orthogonal (perpendicular) to that plane. In general v,w will each have length less than 1, as the unit length of u is the hypotenuse of a right triangle formed with "legs" of length equal to that of v,w. ||v||^2 + ||w||^2 = 1 Think of v as the "shadow" of u cast onto the plane that contains both v_A and v_B, if the Earth were a crystal sphere and the Sun were moved directly overhead of this plane. Hmm. Okay, I'd better just stick to the math... Latitude Test: ||w|| <= sin( N/R ) Longitude Test: <v, v_A + v_B> >= (1 + <v_A, v_B>) ||v|| Note that, as is to be expected, the first "latitude" test involves the parameter of N miles (suitably scaled by radius R), but the second "longitude" test does not. Summary: Any points u which pass these two tests correspond to locations within the desired region "near" the shortest path from A to B. SQL SERVER ISSUES ================= For the sake of discussion we assume the existence of a simplified table T in which only a primary five digit zipcode key (T.zip) and the two dependent fields of latitude and longitude are present (T.phi and T.theta, respectively). Your original question asks for an implementation based on a user-defined function and a single SQL query. Ignoring for the moment the distinction between specifying N miles versus a percentage N% (of the distance from A to B) as the region half-width, such a formulation might look like this: SELECT C.* FROM T as A, T as B, T as C WHERE A.zip = @zipA AND B.zip = @zipB AND MyTestFunction(A.phi,A.theta,B.phi,B.theta,C.phi,C.theta,@N) = 1 in which we've made use of "parameters" @zipA and @zipB to identify the endpoints by their zipcodes and @N to specify the threshold distance in miles. Here MyTestFunction( ) should be written to compute v_A from A.phi,A.theta and v_B from B.phi,B.theta, to compute u from C.phi,C.theta and determine its orthogonal decomposition, and to return 1 if both tests described above are satisfied (otherwise 0). While such an implementation is possible, its structure requires the evaluation of MyTestFunction( ) on all records in table T (as aliased here by C). I fear that the performance of this approach would be much slower than optimal. Two alternatives to this can be considered. One is to use a stored procedure, and another is to use a user-defined function that returns a table value (which could be an image of the selected rows from T). Either of these approaches would allow us to compute once for all rows those values, like the inner product <v_A,v_B>, which depend only on the endpoints A,B and not on the prospective test point. I therefore request your patience as I work to supply a clear and complete explanation, both for the mathematical methods and SQL techniques which I will recommend. regards, mathtalk-ga |
Subject:
Re: Identifying zip codes between two points
From: mathtalk-ga on 28 Aug 2003 15:21 PDT |
-- create a sample table T to hold latitude and longitude -- coordinates keyed to zipcode if exists (SELECT * FROM dbo.sysobjects WHERE id = object_id(N'[T]') AND OBJECTPROPERTY(id, N'IsUserTable') = 1) DROP TABLE [T] GO CREATE TABLE T ( zip char(5) not Null PRIMARY KEY, phi float not Null, theta float not Null ) GO INSERT INTO T VALUES ('75207', 32.7853, 96.8223) -- Dallas go INSERT INTO T VALUES ('77050', 29.9023, 95.2677) -- Houston go INSERT INTO T VALUES ('75110', 32.0983, 96.5183) -- Corsicana TX go INSERT INTO T VALUES ('76710', 31.5356, 97.1921) -- Waco TX go INSERT INTO T VALUES ('78210', 29.3962, 98.4659) -- San Antonio TX go INSERT INTO T VALUES ('76110', 32.7066, 97.3382) -- Fort Worth TX go INSERT INTO T VALUES ('90010', 34.0589, 118.3299) -- Los Angeles CA go INSERT INTO T VALUES ('10010', 40.7386, 73.9826) -- New York NY go INSERT INTO T VALUES ('63401', 39.7145, 91.4209) -- Hannibal MO go -- create a user-defined function which finds distance/angle between -- two points on the unit sphere from "geographic" coordinates -- latitude phi and longitude theta measured in degrees CREATE FUNCTION SphereDist ( @phiA float, @thetaA float, @phiB float, @thetaB float ) RETURNS float AS BEGIN -- convert all degrees to radians DECLARE @deg2rad float, @phiDiff float SET @deg2rad = PI( )/180 SET @phiDiff = @deg2rad * (@phiA - @phiB) SET @phiA = @deg2rad * @phiA SET @phiB = @deg2rad * @phiB SET @deg2rad = @deg2rad * (@thetaA - @thetaB) -- this form of the dot product between vectors A & B -- minimizes cancellation errors due to small differences -- between pairs of latitude or longitude coordinates RETURN( ACOS( COS(@phiDiff) - COS(@phiA)*COS(@phiB)*SIN(@deg2rad)*TAN(0.5*@deg2rad) ) ) END -- test the user-defined function by finding all distance pairs -- using 3958 miles as the approximate radius of the earth SELECT A.zip, B.zip, 3958*dbo.SphereDist(A.phi,A.theta,B.phi,B.theta) FROM T as A, T as B WHERE A.zip < B.zip go |
Subject:
Re: Identifying zip codes between two points
From: sanjay_gandhi-ga on 30 Mar 2004 13:20 PST |
This is amazing. We spent 12 hours just to find the formula for calculating distance between 2 lat/long pairs!!! We have a separate problem though. We got a database of around 150 different stores and 1.6 million customers. I came up with a SQL procedure to create a table with these fields: Store, Customer, Distance, Rank. The rank is based on the closest store to a customer. So sample output will be somthing like this: Customer Store Distance Rank 1 100 2.1 1 1 101 2.9 2 1 120 6.0 3 It is currently taking 60 hours of processing time on a dual proc server. Appreciate if anyone could give any tips to speed it up. I will post it as a question to the math genius maxout-ga if no one responds! |
Subject:
Re: Identifying zip codes between two points
From: mathtalk-ga on 02 Apr 2004 06:17 PST |
Hi, sanjay_gandhi-ga: Your description is too brief to make detailed suggestions for improvements possible. It seems unlikely that your goal is to rank all 150 stores for each of the 1.6 million customers; surely some savings in computation can be obtained by limiting the results to finding the 5 closest stores for each customer, although of course this is still a whopping big job if you need to do it for all customers at once. I believe that it would be practical to take an "on demand" approach as I did for maxout-ga, given a particular customer, to find the 5 closest stores. You may wish, for practical reasons, to limit the store search to a distance of (say) 500 miles. One element of this would be to quickly apply longitude and latitude limits so that the distance calculation is carried out, not over all 150 stores, but only over a modest subset of them that are "reasonably" close. If you are doing this as a batch, then I'd suggest that you consider an incremental strategy in which you update the results only as entries for stores or customers are added, changed, or deleted. The more details you can supply, the more likely it is that someone with an interest in assisting you could be successful. If you are interested in programming this yourself, I think you'll find much of the computation has been done here for maxout-ga. My efforts to optimize the calculations for his particular need may have obscured some of the basics, however. regards, mathtalk-ga |
Subject:
Re: Identifying zip codes between two points
From: sanjay_gandhi-ga on 02 Apr 2004 10:08 PST |
--Your description is too brief to make detailed suggestions for --improvements possible. It seems unlikely that your goal is to rank --all 150 stores for each of the 1.6 million customers; surely some --savings in computation can be obtained by limiting the results to --finding the 5 closest stores for each customer, although of course --this is still a whopping big job if you need to do it for all --customers at once. Let me clarify my goal: Get all stores within 30 miles of a customer and rank them based on distance. If a customer has less than 5 stores within that range, then go further till 300 miles to get at least 5 closest stores. I cannot use an incremental approach because when a new store is added, it can change the rankings of any of the 1.6 million customers. So I still have to scan the entire customer database. BTW, I got the processing down to ***10 hours*** on a dual 1.2Ghz machine. Just in case anyone is curious how I did it: NOTE: Avoid using SQL cursors. That's why it took 40 hrs before! 1. First, fill up the destination table (let's say Output_Result) without the rank information. So, just get all stores within 30 miles of the 1.6 million customers. You can speed this up by using what I call "perfect square" method. Instead of using a formula to calculate distance between a customer and all 150 stores, we can limit the number of stores used in the query. Remember, we need all stores within 30 mile radius. So, imagine this 30 mile radius within a square of width=(Diameter of circle). Doing some geometric calulcations, get the upper and lower limit of lat/long first. This will limit your calculation to around 30 stores. And then use the distance formula to calulate distance beteen 2 lat/long pairs and batch insert the result set. This takes about ***6 hours processing time***. 2. Then, do the ranking using SQL Update statement. This was the toughest part. I got this off from some newsgroup. Here is the SQL statement: update Output_Result set Rank = (SELECT Count_Big(*) + 1 FROM Output_Result OR2 Where Output_Result.CustID = OR2.CustID and Output_Result.DistanceToStore > OR2.SADA_DistanceToStore ) This part takes around ***4 hours*** when the Output_Result table has around 20 million rows. Thats it!!! |
Subject:
Re: Identifying zip codes between two points
From: sanjay_gandhi-ga on 02 Apr 2004 10:24 PST |
CORRECTION: The update sql statement should have been: update Output_Result set Rank = (SELECT Count_Big(*) + 1 FROM Output_Result OR2 Where Output_Result.CustID = OR2.CustID and Output_Result.DistanceToStore > OR2.DistanceToStore ) |
Subject:
Re: Identifying zip codes between two points
From: mathtalk-ga on 02 Apr 2004 10:45 PST |
You don't mention what indexes, if any, are defined on your Output_Results table, but that might be one area to investigate. I believe that a table like Output_Results can be most efficiently maintained incrementally. If a new store is added, you don't have to revise results for all 1.6 million customers, only those within 300 miles of the store (actually only for a subset of these, customers within 30 miles of the store and those with at most 5 store rankings). However running a batch refresh of all the rankings may not be a burden for you if this only needs to be done infrequently, and it's certainly a good idea to have this capability. Cursors can be a good thing, but one needs to be careful to use them efficiently. A cursor over the entire set of customers, for example, is not a good place to begin (because the number of stores is much smaller than the number of customers). If I were starting from scratch I'd try a cursor over the list of Stores and use your "perfect square" method to find all the candidate Customers that might be within 30 miles of each Store. Do the rankings with that smaller set of "output results", then go back and identify Customers with fewer than 5 rankings. Then work with this smaller set of Customers, which requires throwing "perfect squares" out to 300 miles for any Stores. You undoubtedly already have an idea what percentage of your Customers will wind up in the latter category (fewer than 5 stores within 30 miles), so you can judge for yourself if this divide-and-conquer scheme is promising. regards, mathtalk-ga |
Subject:
Re: Identifying zip codes between two points
From: sanjay_gandhi-ga on 02 Apr 2004 14:30 PST |
WOW! You are quick in responding. I got 2 Foreign Key indices on CustID and StoreID. I am still not confident about updating table incrementally, since we run this update 2 times a year and its not just the store addresses that change, its the customers too. So I'll have to build the logic of tracking which addresses changed and all that. Not worth spending time on. I did try the cursor on store instead of doing a "batch" (Select...Insert Into..) statement. It didnt give any performance improvement - in fact it took an hour more to execute. I think the SQL 2000 is smart enough to optimize the batch insert. It was a good try though. Thanks mathtalk-ga. It really helps to get a second opinion |
If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you. |
Search Google Answers for |
Google Home - Answers FAQ - Terms of Service - Privacy Policy |