Hi, behave-ga:
Thanks for posting this very interesting SQL problem. The trickiest
part is matching the EndOfMonth dates, and it is not clear just from
the given sample data what datatype you may have designed for this
column. I'm going to assume that the column is char(8) using yyyymmdd
format; feel free to ask for a clarification if you have another
datatype in mind. I'll also assume that of the columns shown, only
the MonthReturn column will allow Nulls.
The update you want to do implicitly involves a self-join of the
table. This can be expressed in Transact-SQL using syntax such as:
UPDATE tblMonthlySecurityHistory
SET new.MonthReturn = ( new.Price/old.Price ) - 1.0
FROM tblMonthlySecurityHistory new, tblMonthlySecurityHistory old
WHERE new.MonthReturn is Null
AND new.SourceCode = old.SourceCode
AND new.SecurityCode = old.SecurityCode
AND convert(char(8),
      dateadd(dd, -1, dateadd(mm, -1, dateadd(dd, +1,
        convert(datetime, new.EndOfMonth) ))),
      112) = old.EndOfMonth
Let me explain a bit about the update statement, and then make some
suggestions about how to tweak things for optimal performance.
Transact-SQL is the flavor of SQL supported by Microsoft SQL Server.
For documentation of the full range of options available in an
update statement, see this page from Microsoft's MSDN site:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_ua-uz_82n9.asp
Essentially the above statement is making a selection of rows from one
"copy" of the tblMonthlySecurityHistory table, which we have aliased
as "new", using information from another "copy" of the same table,
which we have aliased as "old" in order to frame the update conditions
without ambiguity.
The WHERE clause then filters the "new" copy so that only records with
Null values in the MonthReturn column will be updated, and adds the
conditions for the (inner) self-join:
- equal values of SourceCode
- equal values of SecurityCode
- values of "new" and "old" EndOfMonth "dates" are one month apart
As you would guess, the latter condition is the cause of most of the
complexity in this SQL formulation. I used the "date functions" in
Transact SQL to perform "date arithmetic" on the "new" EndOfMonth
string. Recall our assumption that this field would be a char(8)
value. After converting it to a datetime value, I add one day
(getting to presumably the beginning of the next month), subtract one
month (getting back to the beginning of the same month as the original
field), and then subtract one day (which should give me the last day
of the month for the previous month).
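Under the same char(8) yyyymmdd assumption, the chain of three nested dateadd calls can be mirrored in a small Python sketch (the function name is mine, purely for illustration):

```python
from datetime import datetime, timedelta

def prev_month_end(yyyymmdd: str) -> str:
    """Mirror the T-SQL chain: +1 day, -1 month, -1 day on a yyyymmdd string."""
    d = datetime.strptime(yyyymmdd, "%Y%m%d") + timedelta(days=1)  # first day of next month
    # "subtract one month" by hand (datetime has no month arithmetic);
    # the day is 1 at this point, so there is no day-overflow to worry about
    year, month = (d.year - 1, 12) if d.month == 1 else (d.year, d.month - 1)
    d = d.replace(year=year, month=month) - timedelta(days=1)      # back to the prior month-end
    return d.strftime("%Y%m%d")
```

For example, prev_month_end("20030131") walks 20030131 -> 20030201 -> 20030101 -> 20021231, which is exactly the sequence described above.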
Now this assumes that these dates are always precisely the last day
of a month. If there is any doubt about how the date may appear in this
column, then a more robust (but probably less efficient) method of
matching the "new" and "old" records must be applied. For example, we
might convert both EndOfMonth fields into datetime values and require
that their "date difference" in days be between some pair of values
(such as 20 and 35 days, if that much flexibility were to be needed in
the match). But as I've already gone somewhat out on a limb with
assuming that the field is to be represented as a fixed-length string,
I'll defer any more discussion of the exact condition until you have a
chance to clarify the circumstances.
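If that looser matching were needed, the day-window test could be sketched in Python as follows (the 20- and 35-day bounds are illustrative, not a recommendation, and the function name is mine):

```python
from datetime import datetime

def plausibly_one_month_apart(new_eom: str, old_eom: str,
                              lo: int = 20, hi: int = 35) -> bool:
    """Rough analogue of datediff(dd, old, new) BETWEEN 20 AND 35
    on char(8) yyyymmdd values; lo/hi bounds are illustrative only."""
    days = (datetime.strptime(new_eom, "%Y%m%d")
            - datetime.strptime(old_eom, "%Y%m%d")).days
    return lo <= days <= hi
```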
Even so, the efficiency of the UPDATE statement's execution plan will
depend on the details of how the database engine is able to match the
"new" and "old" records.
One clear recommendation is to put a clustered index on the fields:
(SourceCode, SecurityCode, EndOfMonth)
which by appearances might well be the intended primary key for this
table in any case. We could get into details of "in place" updates
and some more arcane aspects of performance tuning, but as you have
already stated that the actual table is a great deal more complicated
than the few fields shown might indicate, I will refrain from trying
to guess what might be helpful along those lines.
Another assumption I've made here is that Nulls are not allowed in any
of the columns shown _except_ MonthReturn. I would certainly be
interested to hear from you whether that is correct; if not we would
need more robust logic to deal with those Nulls.
I think I've addressed your points, but for clarity let me run through
your list of notes in sequence:
1) We are not using any cursors. The update statement could be
wrapped in a stored procedure, if that were desirable. Not only would
a stored procedure be more "brief" to invoke, it may also have a
"pre-compiled" execution plan. I don't think
that is significant for an update of this size, however.
2) The definition of the series is captured in our join conditions;
the "new" records will be compared only to their previous-month "old"
counterparts, even if no indexes are constructed (by you; SQL Server
is now smart enough to construct some indexes on the fly, if it
discovers that they would be helpful).
3) We avoid touching (updating) records whose MonthReturn field is
already populated (nonNull) by the first condition of the WHERE
clause.
4) With today's hardware (CPU and memory) updating a million records
(even with many more columns than shown) is not an intractable task.
I'd be happy to discuss techniques for performance tuning of the
update with you at some point in the future.
regards, mathtalk-ga
Request for Answer Clarification by behave-ga on 29 Oct 2002 07:22 PST
Sorry, I should have clarified some points:
All columns are integer (including EndOfMonth) except Price and
MonthReturn, which are numeric(15,5)
Also, all columns in the table will allow nulls; however, nulls will
never show up in the first 3 columns because of a pre-processing step
prior to the data getting to SQL Server. (Don't ask me why it was set
up this way, I've inherited this database late in the game...)
The EndOfMonth dates are actually the last trading day of the month,
which is normally the last business day of the month, unless there is
also a holiday. However, I discovered another column called
EndOfMonthYYYYMM (also integer) -- maybe this would be a better column
to use to join to the previous month?
There are individual indexes on the first 3 columns of the
table--would a clustered index still be needed?
Clarification of Answer by mathtalk-ga on 29 Oct 2002 08:38 PST
Hi, behave-ga:
You really did a nice job of presenting the essence of your problem,
shorn of unnecessary details. The additional information can be dealt
with as follows.
With regard to the presence/absence of Nulls, I understand from your
request for clarification that we are "guaranteed" by the process (but
not by table design) that the first three columns will not contain
Nulls, but that it is possible for the Price column to have Nulls. So
we should throw some conditions into the update to make sure that we
don't try to compute monthly returns when either the current month or
previous month has a missing Price.
With regard to having individual indexes on the first three columns
rather than a clustered index on the combination of the three, I would
be concerned that it implies a logical design flaw in the table. What
is the "primary key" on the table, i.e., what uniquely identifies a row
in this table? If the key is the combination of Source Code, Security
Code, and one of the EndOfMonth fields, then SQL Server needs to have
a unique compound index to enforce uniqueness. A "clustered" index
here would likely affect the efficiency of the process, but is not
logically required. The presence of multiple rows in the table where
your application clearly expects a single row can cause indeterminate
behaviour in the UPDATE.
Now back to the join condition, the integer type of EndOfMonth field,
and the possibility of using EndOfMonthYYYYMM instead of EndOfMonth.
It would be better to use that year-and-month-only field, since its
values are "better behaved" (no pun intended) than those in
EndOfMonth. We still have a tricky bit with the year-end wrap-around
logic, but it's certainly an improvement.
Here's my updated UPDATE:
UPDATE tblMonthlySecurityHistory
SET new.MonthReturn = ( new.Price/old.Price ) - 1.0
FROM tblMonthlySecurityHistory new, tblMonthlySecurityHistory old
WHERE new.MonthReturn is Null
AND new.Price is not Null
AND new.SourceCode = old.SourceCode
AND new.SecurityCode = old.SecurityCode
AND new.EndOfMonthYYYYMM
- CASE WHEN new.EndOfMonthYYYYMM % 100 = 1 THEN 89 ELSE 1 END
= old.EndOfMonthYYYYMM
AND old.Price is not Null
The "magic" number 89 appears in the year and month computation
whenever the current month is a January, e.g. 200301 - 89 = 200212.
Otherwise we are simply subtracting 1 to get the previous year and
month.
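The January wrap-around can be sanity-checked with a one-line Python sketch of the same arithmetic (the function name is mine):

```python
def prev_yyyymm(yyyymm: int) -> int:
    """Previous year-and-month for an integer in yyyymm form.
    January (mm = 01) wraps to December of the prior year, e.g.
    200301 - 89 = 200212; every other month just subtracts 1."""
    return yyyymm - 89 if yyyymm % 100 == 1 else yyyymm - 1
```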
The rationale for efficiency of this update is to encourage SQL Server
to go through the table with a "linear search" to find "new" records
that might be updateable (MonthReturn is Null), and then use the
indexes to locate (or fail to locate) corresponding "old" records that
meet the join condition. There is a facility in Microsoft's SQL Query
client to view the execution plan, to make sure that SQL Server is
doing the planning this way (or perhaps in an even more clever way).
Let me know if you want directions for viewing the plan and
interpreting it. You may have noticed that the full UPDATE syntax
allows one to introduce heavy-handed "hints" to influence the
execution plan.
Given that you are working with an existing table design whose every
field allows Nulls, we cannot expect to get away with an "in place"
update. Indeed, now that I think clearly about it, the fact that the
MonthReturn column allows Nulls, which I already knew, prohibits this
optimization. I think the structure of the UPDATE conditions above,
however, is close to optimal.
regards, mathtalk-ga
Request for Answer Clarification by behave-ga on 29 Oct 2002 10:14 PST
A very impressive response to my problem. Thank you very much.
Yes, I would be interested in how to view the execution plan in Query
Analyzer, and what is an "in place" update?
Again, a fantastic job on helping me out. Thanks!
Clarification of Answer by mathtalk-ga on 29 Oct 2002 23:30 PST
Hi, behave-ga:
Sorry that it took me a bit of a long time to reply. I installed SQL
Server 2000 on my laptop (which is what I use to prepare my google
answers), created a database and table with five columns:
SourceCode int (nullable)
SecurityCode int (nullable)
EndOfMonthYYYYMM int (nullable)
Price numeric(15,5) (nullable)
MonthReturn numeric(15,5) (nullable)
For an initial cut at explaining how to view the execution plan, I
went ahead and created a unique clustered index on the first three
columns (in their given order).
I pasted your sample data above into TextPad and massaged it into ten
INSERT statements. I then pasted the resulting SQL into a Query
Analyzer window, so that I could view the execution plan.
It was fortunate that I did, because I uncovered a _bug_ in my SQL.
It's a little tedious to explain, but kind of a funny story on me.
Going back at least to SQL Server 6.5 there was a disconnect between
the documentation and the reality. The help file for the SQL 6.5
UPDATE statement used an example:
UPDATE titles
SET t.ytd_sales = t.ytd_sales + qty
FROM titles t, sales s
WHERE t.title_id = s.title_id
of how aliases are supposed to work with UPDATE's, and even goes on to
add, "The order of the tables and views after the FROM keyword does
not affect the results set returned." This is curious, since an
UPDATE statement does not return any result set, possibly evidence of
a cut-and-paste error in making the help file. But mostly it just
turns out to be all wrong. The statement given doesn't work now &
didn't work under 6.5. SQL Server then (and now) will not allow you
to qualify the columns to be "SET" with the aliased tablename (in the
example, t.ytd_sales) because those columns are already understood to
belong to the table mentioned after the UPDATE keyword. As I recall,
we deduced a workaround: the table mentioned after the UPDATE keyword
was identified with the first mention of that table in the FROM
clause, if all such mentions were aliased, so that (if I remember
correctly) the order does make a difference in that backhanded way.
Now when I went to look at the SQL Server 2000 documentation in
preparing this reply, I saw the exact same example given (toward the
bottom of the page which I gave a link to above at MSDN). Ever the
optimist of sorts, I said to myself, ha, they must have fixed this in
SQL Server 2000! What I failed to notice is that the surrounding text
in that documentation now admits that the example does _not_ work.
Frankly, I wasn't expecting that!
Anyway, here are two ways to fix my bug. The first way is more or
less the obvious approach of abandoning the use of an alias in
connection with the "copy" of the table to be updated. The statement
would then read:
UPDATE tblMonthlySecurityHistory
SET MonthReturn = ( tblMonthlySecurityHistory.Price/old.Price ) - 1.0
FROM tblMonthlySecurityHistory, tblMonthlySecurityHistory old
WHERE tblMonthlySecurityHistory.MonthReturn is Null
  AND tblMonthlySecurityHistory.Price is not Null
  AND tblMonthlySecurityHistory.SourceCode = old.SourceCode
  AND tblMonthlySecurityHistory.SecurityCode = old.SecurityCode
  AND tblMonthlySecurityHistory.EndOfMonthYYYYMM
      - CASE WHEN tblMonthlySecurityHistory.EndOfMonthYYYYMM % 100 = 1
             THEN 89 ELSE 1 END
      = old.EndOfMonthYYYYMM
  AND old.Price is not Null
The better way, in my opinion, is this:
UPDATE new
SET MonthReturn = ( new.Price/old.Price ) - 1.0
FROM tblMonthlySecurityHistory new, tblMonthlySecurityHistory old
WHERE new.MonthReturn is Null
AND new.Price is not Null
AND new.SourceCode = old.SourceCode
AND new.SecurityCode = old.SecurityCode
AND new.EndOfMonthYYYYMM
- CASE WHEN new.EndOfMonthYYYYMM % 100 = 1 THEN 89 ELSE 1 END
= old.EndOfMonthYYYYMM
AND old.Price is not Null
where the aliased name now appears after the UPDATE keyword.
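For anyone without a SQL Server instance at hand, the same join logic can be exercised from Python against an in-memory SQLite database. SQLite's UPDATE statement historically lacked the Transact-SQL FROM extension, so this sketch uses a correlated subquery instead; the sample prices are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tblMonthlySecurityHistory (
    SourceCode INTEGER, SecurityCode INTEGER,
    EndOfMonthYYYYMM INTEGER, Price REAL, MonthReturn REAL)""")
# invented sample prices; note the December-to-January wrap-around
con.executemany("INSERT INTO tblMonthlySecurityHistory VALUES (?,?,?,?,?)",
                [(1, 10, 200211, 100.0, None),
                 (1, 10, 200212, 105.0, None),
                 (1, 10, 200301, 110.25, None)])
# same join conditions as the T-SQL, phrased as a correlated subquery
con.execute("""
UPDATE tblMonthlySecurityHistory SET MonthReturn =
    (SELECT tblMonthlySecurityHistory.Price / old.Price - 1.0
       FROM tblMonthlySecurityHistory old
      WHERE old.SourceCode = tblMonthlySecurityHistory.SourceCode
        AND old.SecurityCode = tblMonthlySecurityHistory.SecurityCode
        AND old.EndOfMonthYYYYMM = tblMonthlySecurityHistory.EndOfMonthYYYYMM
            - (CASE WHEN tblMonthlySecurityHistory.EndOfMonthYYYYMM % 100 = 1
                    THEN 89 ELSE 1 END)
        AND old.Price IS NOT NULL)
 WHERE MonthReturn IS NULL AND Price IS NOT NULL""")
returns = dict(con.execute(
    "SELECT EndOfMonthYYYYMM, MonthReturn FROM tblMonthlySecurityHistory"))
```

The earliest month (200211) has no prior row to join to, so its MonthReturn stays Null, while 200212 and 200301 each come out to a 5% return.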
With that correction behind us (I tried both approaches, the results
agree), let me move on to the two things you asked about.
How do we view the execution plan? If you open Query Analyzer, there
are a couple of menu items under the Query menu heading which relate
to this. The first is labelled "Display Estimated Execution Plan".
If you have one or more statements selected in the window (it assumes
you mean all statements if none are selected), then _without_
executing any of them, the Query Analyzer will immediately show a
sequence of graphical outlines for the corresponding execution plans.
The graphics read more or less from right to left, and as you run the
mouse over each iconic node, a text box pops up giving more details of
what exactly that node represents.
The second menu item under Query is labelled "Show Execution Plan".
This is actually a toggle between a mode of showing the execution plan
along with the result sets (if any) and messages in a tabbed control
below the query box itself _upon_ the actual execution of the query
(or queries). So this second option is really about a passive mode
for viewing the execution plan, and more or less requires you to go
ahead and execute the queries (while the first "Display" option gives
us the information without executing the queries).
These options can also be accomplished by icons on the default Toolbar
(instead of using the Query menu items). The more useful "Display"
command appears on the Toolbar in Query Analyzer with the same icon
that appears next to both options on the Query menu, three green
squares stacked to the right with lines drawn to a central yellow
square on the left. I guess it is supposed to be symbolic of how the
execution plans are graphically represented.
I won't go into any details of what the execution plans showed, since
it was only done for my sample table anyway (with indexes the way I
wanted them), but it did show the sort of approach I had in mind
(scanning through the first "new" copy of the table for potentially
updateable records, then matching each of these to the corresponding
"old" record).
Finally, what is an "in place" update? Well, when you update records
in SQL Server, by and large the way it works is that the old record is
deleted and a new record is inserted into the database. This seems
like unnecessary work if only one field is being changed.
Furthermore, if you have a clustered index on the table and none of
the fields in the clustered index are being changed, the record is apt
to wind up (after a certain amount of shuffling data) right back in
the same place it was before! How to avoid this waste of effort?
Enter the "in place" update. Rather than deleting and reinserting the
entire record, an "in place" update provides the ultimate efficiency
of changing just the fields that need to be changed, right in the
location where they already exist. Unfortunately there are datatypes
that make this impossible in many cases. Strings can have variable
lengths; nullable fields may hold data or simply a placeholder; and if
key fields are changing, then there are lots of "what if" issues. But
it can be done with some restrictions on the leading columns of the
table. They should be of "fixed" lengths (meaning no Nulls allowed
and no variable length character arrays, etc.) up to the field or
fields being changed, and the fields changing should not be part of a
primary key or clustered index.
A nice bonus, when it can be done, is that it reduces the impact on
transaction logging as well, since only changes to the fields need to
be logged (instead of the delete-and-insert pair that typically
accompanies the full record update process). To be honest
I'm not sure how "in place" updates are supported in SQL Server 2000.
A quick search of MSDN did not turn up a discussion on that issue.
Here's a link to an article by the inimitable Kalen Delaney on how to
make it work in SQL Server 7:
http://www.sqlmag.com/Articles/Index.cfm?ArticleID=8031
About halfway down this page at Amazon.com/UK, there's a user review
of Kalen Delaney's book Inside SQL Server 2000 which claims she
discusses it there:
http://www.amazon.co.uk/exec/obidos/ASIN/0735609985/202-4812466-0647851
best wishes, mathtalk-ga