Hi, behave-ga:
Thanks for posting this very interesting SQL problem. The trickiest
part is matching the EndOfMonth dates, and it is not clear just from
the given sample data what datatype you may have designed for this
column. I'm going to assume that the column is char(8) using yyyymmdd
format; feel free to ask for a clarification if you have another
datatype in mind. I'll also assume that of the columns shown, only
the MonthReturn column will allow Nulls.
The update you want to do implicitly involves a self-join of the
table. This can be expressed in Transact-SQL using syntax such as:
UPDATE tblMonthlySecurityHistory
SET new.MonthReturn = ( new.Price/old.Price ) - 1.0
FROM tblMonthlySecurityHistory new, tblMonthlySecurityHistory old
WHERE new.MonthReturn is Null
AND new.SourceCode = old.SourceCode
AND new.SecurityCode = old.SecurityCode
AND convert(char(8),
      dateadd(dd, -1, dateadd(mm, -1, dateadd(dd, +1,
        convert(datetime, new.EndOfMonth) ))),
      112) = old.EndOfMonth
Let me explain a bit about the update statement, and then make some
suggestions about how to tweak things for optimal performance.
Transact-SQL is the flavor of SQL supported by Microsoft SQL Server.
For documentation of the full range of options available in an
update statement, see this page from Microsoft's MSDN site:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_ua-uz_82n9.asp
Essentially the above statement is making a selection of rows from one
"copy" of the tblMonthlySecurityHistory table, which we have aliased
as "new", using information from another "copy" of the same table,
which we have aliased as "old" in order to frame the update conditions
without ambiguity.
The WHERE clause then filters the "new" copy so that only records with
Null values in the MonthReturn column will be updated, and adds the
conditions for the (inner) self-join:
- equal values of SourceCode
- equal values of SecurityCode
- values of "new" and "old" EndOfMonth "dates" are one month apart
As you would guess, the latter condition is the cause of most of the
complexity in this SQL formulation. I used the "date functions" in
Transact SQL to perform "date arithmetic" on the "new" EndOfMonth
string. Recall our assumption that this field would be a char(8)
value. After converting it to a datetime value, I add one day
(getting to presumably the beginning of the next month), subtract one
month (getting back to the beginning of the same month as the original
field), and then subtract one day (which should give me the last day
of the month for the previous month).
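Under the same char(8) yyyymmdd assumption, the chain of three nested dateadd calls can be mirrored in a small Python sketch (the function name is mine, purely for illustration):

```python
from datetime import datetime, timedelta

def prev_month_end(yyyymmdd: str) -> str:
    """Mirror the T-SQL chain: +1 day, -1 month, -1 day on a yyyymmdd string."""
    d = datetime.strptime(yyyymmdd, "%Y%m%d") + timedelta(days=1)  # first day of next month
    # "subtract one month" by hand (datetime has no month arithmetic);
    # the day is 1 at this point, so there is no day-overflow to worry about
    year, month = (d.year - 1, 12) if d.month == 1 else (d.year, d.month - 1)
    d = d.replace(year=year, month=month) - timedelta(days=1)      # back to the prior month-end
    return d.strftime("%Y%m%d")
```

For example, prev_month_end("20030131") walks 20030131 -> 20030201 -> 20030101 -> 20021231, which is exactly the sequence described above.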
Now this assumes that these dates are always precisely the last day
of a month. If there is any doubt about how the date may appear in this
column, then a more robust (but probably less efficient) method of
matching the "new" and "old" records must be applied. For example, we
might convert both EndOfMonth fields into datetime values and require
that their "date difference" in days be between some pair of values
(such as 20 and 35 days, if that much flexibility were to be needed in
the match). But as I've already gone somewhat out on a limb with
assuming that the field is to be represented as a fixed-length string,
I'll defer any more discussion of the exact condition until you have a
chance to clarify the circumstances.
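If that looser matching were needed, the day-window test could be sketched in Python as follows (the 20- and 35-day bounds are illustrative, not a recommendation, and the function name is mine):

```python
from datetime import datetime

def plausibly_one_month_apart(new_eom: str, old_eom: str,
                              lo: int = 20, hi: int = 35) -> bool:
    """Rough analogue of datediff(dd, old, new) BETWEEN 20 AND 35
    on char(8) yyyymmdd values; lo/hi bounds are illustrative only."""
    days = (datetime.strptime(new_eom, "%Y%m%d")
            - datetime.strptime(old_eom, "%Y%m%d")).days
    return lo <= days <= hi
```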
Even so, the efficiency of the UPDATE statement's execution plan will
depend on the details of how the database engine is able to match the
"new" and "old" records.
One clear recommendation is to put a clustered index on the fields:
(SourceCode, SecurityCode, EndOfMonth)
which by appearances might well be the intended primary key for this
table in any case. We could get into details of "in place" updates
and some more arcane aspects of performance tuning, but as you have
already stated that the actual table is a great deal more complicated
than the few fields shown might indicate, I will refrain from trying
to guess what might be helpful along those lines.
Another assumption I've made here is that Nulls are not allowed in any
of the columns shown _except_ MonthReturn. I would certainly be
interested to hear from you whether that is correct; if not we would
need more robust logic to deal with those Nulls.
I think I've addressed your points, but for clarity let me run through
your list of notes in sequence:
1) We are not using any cursors. The update statement could be
wrapped in a stored procedure, if that were desirable. Not only would
a stored procedure be more "brief" to invoke, it may also have a
"pre-compiled" execution plan. I don't think
that is significant for an update of this size, however.
2) The definition of the series is captured in our join conditions;
the "new" records will be compared only to their previous-month "old"
counterparts, even if no indexes are constructed (by you; SQL Server
is now smart enough to construct some indexes on the fly, if it
discovers that they would be helpful).
3) We avoid touching (updating) records whose MonthReturn field is
already populated (nonNull) by the first condition of the WHERE
clause.
4) With today's hardware (CPU and memory) updating a million records
(even with many more columns than shown) is not an intractable task.
I'd be happy to discuss techniques for performance tuning of the
update with you at some point in the future.
regards, mathtalk-ga
Request for Answer Clarification by behave-ga on 29 Oct 2002 07:22 PST
Sorry, I should have clarified some points:
All columns are integer (including EndOfMonth) except Price and
MonthReturn, which are numeric(15,5)
Also, all columns in the table will allow nulls; however, nulls will
never show up in the first 3 columns because of a pre-processing step
prior to the data getting to SQL Server. (Don't ask me why it was set
up this way, I've inherited this database late in the game...)
The EndOfMonth dates are actually the last trading day of the month,
which is normally the last business day of the month, unless there is
also a holiday. However, I discovered another column called
EndOfMonthYYYYMM (also integer) -- maybe this would be a better column
to use to join to the previous month?
There are individual indexes on the first 3 columns of the
table--would a clustered index still be needed?
Clarification of Answer by mathtalk-ga on 29 Oct 2002 08:38 PST
Hi, behave-ga:
You really did a nice job of presenting the essence of your problem,
shorn of unnecessary details. The additional information can be dealt
with as follows.
With regard to the presence/absence of Nulls, I understand from your
request for clarification that we are "guaranteed" by the process (but
not by table design) that the first three columns will not contain
Nulls, but that it is possible for the Price column to have Nulls. So
we should throw some conditions into the update to make sure that we
don't try to compute monthly returns when either the current month or
previous month has a missing Price.
With regard to having individual indexes on the first three columns
rather than a clustered index on the combination of the three, I would
be concerned that it implies a logical design flaw in the table. What
is the "primary key" on the table, i.e., what uniquely identifies a row
in this table? If the key is the combination of Source Code, Security
Code, and one of the EndOfMonth fields, then SQL Server needs to have
a unique compound index to enforce uniqueness. A "clustered" index
here would likely affect the efficiency of the process, but is not
logically required. The presence of multiple rows in the table where
your application clearly expects a single row can cause indeterminate
behaviour in the UPDATE.
Now back to the join condition, the integer type of EndOfMonth field,
and the possibility of using EndOfMonthYYYYMM instead of EndOfMonth.
It would be better to use that year-and-month-only field, since its
values are "better behaved" (no pun intended) than those in
EndOfMonth. We still have a tricky bit with the year-end wrap-around
logic, but it's certainly an improvement.
Here's my updated UPDATE:
UPDATE tblMonthlySecurityHistory
SET new.MonthReturn = ( new.Price/old.Price ) - 1.0
FROM tblMonthlySecurityHistory new, tblMonthlySecurityHistory old
WHERE new.MonthReturn is Null
AND new.Price is not Null
AND new.SourceCode = old.SourceCode
AND new.SecurityCode = old.SecurityCode
AND new.EndOfMonthYYYYMM
- CASE WHEN new.EndOfMonthYYYYMM % 100 = 1 THEN 89 ELSE 1 END
= old.EndOfMonthYYYYMM
AND old.Price is not Null
The "magic" number 89 appears in the year and month computation
whenever the current month is a January, e.g. 200301 - 89 = 200212.
Otherwise we are simply subtracting 1 to get the previous year and
month.
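The January wrap-around can be sanity-checked with a one-line Python sketch of the same arithmetic (the function name is mine):

```python
def prev_yyyymm(yyyymm: int) -> int:
    """Previous year-and-month for an integer in yyyymm form.
    January (mm = 01) wraps to December of the prior year, e.g.
    200301 - 89 = 200212; every other month just subtracts 1."""
    return yyyymm - 89 if yyyymm % 100 == 1 else yyyymm - 1
```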
The rationale for efficiency of this update is to encourage SQL Server
to go through the table with a "linear search" to find "new" records
that might be updateable (MonthReturn is Null), and then use the
indexes to locate (or fail to locate) corresponding "old" records that
meet the join condition. There is a facility in Microsoft's SQL Query
client to view the execution plan, to make sure that SQL Server is
doing the planning this way (or perhaps in an even more clever way).
Let me know if you want directions for viewing the plan and
interpreting it. You may have noticed that the full UPDATE syntax
allows one to introduce heavy-handed "hints" to influence the
execution plan.
Given that you are working with an existing table design whose every
field allows Nulls, we cannot expect to get away with an "in place"
update. Indeed, now that I think clearly about it, the fact that the
MonthReturn column allows Nulls, which I already knew, prohibits this
optimization. I think the structure of the UPDATE conditions above,
however, is close to optimal.
regards, mathtalk-ga
Request for Answer Clarification by behave-ga on 29 Oct 2002 10:14 PST
A very impressive response to my problem. Thank you very much.
Yes, I would be interested in how to view the execution plan in Query
Analyzer, and what is an "in place" update?
Again, a fantastic job on helping me out. Thanks!
Clarification of Answer by mathtalk-ga on 29 Oct 2002 23:30 PST
Hi, behave-ga:
Sorry that it took me a bit of a long time to reply. I installed SQL
Server 2000 on my laptop (which is what I use to prepare my google
answers), created a database and table with five columns:
SourceCode int (nullable)
SecurityCode int (nullable)
EndOfMonthYYYYMM int (nullable)
Price numeric(15,5) (nullable)
MonthReturn numeric(15,5) (nullable)
For an initial cut at explaining how to view the execution plan, I
went ahead and created a unique clustered index on the first three
columns (in their given order).
I pasted your sample data above into TextPad and massaged it into ten
INSERT statements. I then pasted the resulting SQL into a Query
Analyzer window, so that I could view the execution plan.
It was fortunate that I did, because I uncovered a _bug_ in my SQL.
It's a little tedious to explain, but kind of a funny story on me.
Going back at least to SQL Server 6.5 there was a disconnect between
the documentation and the reality. The help file for the SQL 6.5
UPDATE statement used an example:
UPDATE titles
SET t.ytd_sales = t.ytd_sales + qty
FROM titles t, sales s
WHERE t.title_id = s.title_id
of how aliases are supposed to work with UPDATE's, and even goes on to
add, "The order of the tables and views after the FROM keyword does
not affect the results set returned." This is curious, since an
UPDATE statement does not return any result set, possibly evidence of
a cut-and-paste error in making the help file. But mostly it just
turns out to be all wrong. The statement given doesn't work now &
didn't work under 6.5. SQL Server then (and now) will not allow you
to qualify the columns to be "SET" with the aliased tablename (in the
example, t.ytd_sales) because those columns are already understood to
belong to the table mentioned after the UPDATE keyword. As I recall,
we deduced a workaround: the table mentioned after the UPDATE keyword
was identified with the first mention of that table in the FROM
clause, if all such mentions were aliased, so that (if I remember
correctly) the order does make a difference in that backhanded way.
Now when I went to look at the SQL Server 2000 documentation in
preparing this reply, I saw the exact same example given (toward the
bottom of the page which I gave a link to above at MSDN). Ever the
optimist of sorts, I said to myself, ha, they must have fixed this in
SQL Server 2000! What I failed to notice is that the surrounding text
in that documentation now admits that the example does _not_ work.
Frankly, I wasn't expecting that!
Anyway, here are two ways to fix my bug. The first way is more or
less the obvious approach of abandoning the use of an alias in
connection with the "copy" of the table to be updated. The statement
would then read:
UPDATE tblMonthlySecurityHistory
SET MonthReturn = ( tblMonthlySecurityHistory.Price/old.Price ) - 1.0
FROM tblMonthlySecurityHistory, tblMonthlySecurityHistory old
WHERE tblMonthlySecurityHistory.MonthReturn is Null
  AND tblMonthlySecurityHistory.Price is not Null
  AND tblMonthlySecurityHistory.SourceCode = old.SourceCode
  AND tblMonthlySecurityHistory.SecurityCode = old.SecurityCode
  AND tblMonthlySecurityHistory.EndOfMonthYYYYMM
      - CASE WHEN tblMonthlySecurityHistory.EndOfMonthYYYYMM % 100 = 1
             THEN 89 ELSE 1 END
      = old.EndOfMonthYYYYMM
  AND old.Price is not Null
The better way, in my opinion, is this:
UPDATE new
SET MonthReturn = ( new.Price/old.Price ) - 1.0
FROM tblMonthlySecurityHistory new, tblMonthlySecurityHistory old
WHERE new.MonthReturn is Null
AND new.Price is not Null
AND new.SourceCode = old.SourceCode
AND new.SecurityCode = old.SecurityCode
AND new.EndOfMonthYYYYMM
- CASE WHEN new.EndOfMonthYYYYMM % 100 = 1 THEN 89 ELSE 1 END
= old.EndOfMonthYYYYMM
AND old.Price is not Null
where the aliased name now appears after the UPDATE keyword.
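For anyone without a SQL Server instance at hand, the same join logic can be exercised from Python against an in-memory SQLite database. SQLite's UPDATE statement historically lacked the Transact-SQL FROM extension, so this sketch uses a correlated subquery instead; the sample prices are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tblMonthlySecurityHistory (
    SourceCode INTEGER, SecurityCode INTEGER,
    EndOfMonthYYYYMM INTEGER, Price REAL, MonthReturn REAL)""")
# invented sample prices; note the December-to-January wrap-around
con.executemany("INSERT INTO tblMonthlySecurityHistory VALUES (?,?,?,?,?)",
                [(1, 10, 200211, 100.0, None),
                 (1, 10, 200212, 105.0, None),
                 (1, 10, 200301, 110.25, None)])
# same join conditions as the T-SQL, phrased as a correlated subquery
con.execute("""
UPDATE tblMonthlySecurityHistory SET MonthReturn =
    (SELECT tblMonthlySecurityHistory.Price / old.Price - 1.0
       FROM tblMonthlySecurityHistory old
      WHERE old.SourceCode = tblMonthlySecurityHistory.SourceCode
        AND old.SecurityCode = tblMonthlySecurityHistory.SecurityCode
        AND old.EndOfMonthYYYYMM = tblMonthlySecurityHistory.EndOfMonthYYYYMM
            - (CASE WHEN tblMonthlySecurityHistory.EndOfMonthYYYYMM % 100 = 1
                    THEN 89 ELSE 1 END)
        AND old.Price IS NOT NULL)
 WHERE MonthReturn IS NULL AND Price IS NOT NULL""")
returns = dict(con.execute(
    "SELECT EndOfMonthYYYYMM, MonthReturn FROM tblMonthlySecurityHistory"))
```

The earliest month (200211) has no prior row to join to, so its MonthReturn stays Null, while 200212 and 200301 each come out to a 5% return.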
With that correction behind us (I tried both approaches, the results
agree), let me move on to the two things you asked about.
How do we view the execution plan? If you open Query Analyzer, there
are a couple of menu items under the Query menu heading which relate
to this. The first is labelled "Display Estimated Execution Plan".
If you have one or more statements selected in the window (it assumes
you mean all statements if none are selected), then _without_
executing any of them, the Query Analyzer will immediately show a
sequence of graphical outlines for the corresponding execution plans.
The graphics read more or less from right to left, and as you run the
mouse over each iconic node, a text box pops up giving more details of
what exactly that node represents.
The second menu item under Query is labelled "Show Execution Plan".
This is actually a toggle between a mode of showing the execution plan
along with the result sets (if any) and messages in a tabbed control
below the query box itself _upon_ the actual execution of the query
(or queries). So this second option is really about a passive mode
for viewing the execution plan, and more or less requires you to go
ahead and execute the queries (while the first "Display" option gives
us the information without executing the queries).
These options can also be accomplished by icons on the default Toolbar
(instead of using the Query menu items). The more useful "Display"
command appears on the Toolbar in Query Analyzer with the same icon
that appears next to both options on the Query menu, three green
squares stacked to the right with lines drawn to a central yellow
square on the left. I guess it is supposed to be symbolic of how the
execution plans are graphically represented.
I won't go into any details of what the execution plans showed, since
it was only done for my sample table anyway (with indexes the way I
wanted them), but it did show the sort of approach I had in mind
(scanning through the first "new" copy of the table for potentially
updateable records, then matching each of these to the corresponding
"old" record).
Finally, what is an "in place" update? Well, when you update records
in SQL Server, by and large the way it works is that the old record is
deleted and a new record is inserted into the database. This seems
like unnecessary work if only one field is being changed.
Furthermore, if you have a clustered index on the table and none of
the fields in the clustered index are being changed, the record is apt
to wind up (after a certain amount of shuffling data) right back in
the same place it was before! How to avoid this waste of effort?
Enter the "in place" update. Rather than deleting and reinserting the
entire record, an "in place" update provides the ultimate efficiency
of changing just the fields that need to be changed, right in the
location where they already exist. Unfortunately there are datatypes
that make this impossible in many cases. Strings can have variable
lengths; nullable fields may hold data or simply a placeholder; and if
key fields are changing, then there are lots of "what if" issues. But
it can be done with some restrictions on the leading columns of the
table. They should be of "fixed" lengths (meaning no Nulls allowed
and no variable length character arrays, etc.) up to the field or
fields being changed, and the fields changing should not be part of a
primary key or clustered index.
A nice bonus, when it can be done, is that it reduces the impact on
transaction logging as well, since only changes to the fields need to
be logged (instead of the delete-and-insert pair that typically
accompanies the full record update process). To be honest
I'm not sure how "in place" updates are supported in SQL Server 2000.
A quick search of MSDN did not turn up a discussion on that issue.
Here's a link to an article by the inimitable Kalen Delaney on how to
make it work in SQL Server 7:
http://www.sqlmag.com/Articles/Index.cfm?ArticleID=8031
About halfway down this page at Amazon.com/UK, there's a user review
of Kalen Delaney's book Inside SQL Server 2000 which claims she
discusses it there:
http://www.amazon.co.uk/exec/obidos/ASIN/0735609985/202-4812466-0647851
best wishes, mathtalk-ga