Q: Xeon P4 latency ( Answered,   0 Comments )
Subject: Xeon P4 latency
Category: Computers > Hardware
Asked by: stelle-ga
List Price: $18.00
Posted: 29 Mar 2004 00:12 PST
Expires: 28 Apr 2004 01:12 PDT
Question ID: 321522
On a Xeon P4 (3.06 GHz, 1MB Level 3 cache), what is the latency from
Level 1 to the CPU, the latency from Level 2 to Level 1, and the latency
from Level 3 to Level 2?
Subject: Re: Xeon P4 latency
Answered By: haversian-ga on 30 Mar 2004 02:12 PST
Good morning Stelle-ga,

The question you ask is not as simple as it at first appears.  The P4
has separate L1 instruction and data caches.  The L1 Dcache is fairly
standard, but the Icache bears further comment.  The Icache does not
actually cache the contents of the L2, but rather caches
already-decoded instructions (hence its name, the trace cache).  As a
result, the contents of the L1 Icache are not fed into the same pipeline
stage as those of a traditional L1 Icache would be.  Since the
instructions are already decoded, the P4's Icache skips the initial X86
decoding stages which form the head of the P4's pipeline.  When an
instruction is not found in the L1 trace cache, the CPU checks the L2
cache.  If found there, the instruction passes through 8 pipeline
stages, where it is decoded into P4 micro-ops, before being deposited in
the L1 Icache.  There are a few other complications to your question as
well, which I mention below.

Ace's Hardware has a short post about the then-new Xeon DP with 1MB L3 cache that
answers many of the questions you asked.  The L1 (data cache) latency
is 2 cycles, the L2 latency is 9 cycles (2 + 7), and the L3 latency is
23 cycles (2 + 7 + 14).  These are all best-case scenarios.  For
example, while the L1 cache's integer load latency is indeed 2 cycles,
its floating-point load latency climbs to 9 cycles.
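
As a quick check of the arithmetic above, here is a minimal Python sketch that sums the per-level increments (2, 7, 14) into cumulative best-case latencies and converts them to nanoseconds.  The 3.06 GHz clock comes from the question; the function and variable names are my own, and the conversion assumes the caches run at the full core clock.

```python
from itertools import accumulate

CLOCK_GHZ = 3.06  # clock speed given in the question

# Additional cycles contributed by each cache level (best case, integer loads).
INCREMENTS = {"L1": 2, "L2": 7, "L3": 14}


def cumulative_latencies(increments):
    """Running total of cycles needed to reach each cache level."""
    totals = accumulate(increments.values())
    return dict(zip(increments.keys(), totals))


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert core clock cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


totals = cumulative_latencies(INCREMENTS)  # L1: 2, L2: 9, L3: 23
for level, cycles in totals.items():
    print(f"{level}: {cycles} cycles = {cycles_to_ns(cycles):.2f} ns")
```

At 3.06 GHz one cycle lasts roughly a third of a nanosecond, so even the 23-cycle L3 figure stays under 10 ns in the best case.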

Ace's also measured the L1 cache's miss and replace latencies at 31 and
17 cycles, respectively, and the L2's miss latency at 328 cycles.  What
this means is that, although the L2 latency is listed as 9 cycles
(2 + 7), it may take up to 38 cycles (the 31-cycle L1 miss plus the
7-cycle L2 hit).  Similar calculations hold
for the L3.  Bear in mind that, while the P4's cache architecture is
"mostly inclusive", it is not true that ALL L1 Dcache data is mirrored
in both the L2 cache and the L3 cache.  This makes it difficult to
give you precise numbers, as the latencies seen will depend on where
the data actually lies.

Another source lists an absolute ton of information.  Their P4
(including Celeron and Xeon) page confirms Ace's 2 / 7 / 14 cycle
latencies for L1, L2, and L3 caches.  It also provides information
about the P4's prefetch logic which is one of the reasons why data is
not mirrored in all caches.
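
To illustrate why the latency you actually see depends on where the data lies, here is a rough Python sketch.  It reproduces the 38-cycle figure above and then computes a weighted average load latency from the best-case numbers.  The hit fractions are entirely made up for illustration (they are not measurements of any real workload), and the function names are my own.

```python
# Best-case cumulative latencies in cycles, as quoted above.
BEST_CASE_CYCLES = {"L1": 2, "L2": 9, "L3": 23}

L1_MISS_CYCLES = 31   # measured L1 miss latency quoted above
L2_INCREMENT = 7      # extra cycles for an L2 hit


def l2_fetch_after_l1_miss(l1_miss=L1_MISS_CYCLES, l2_increment=L2_INCREMENT):
    """Upper estimate for data served from L2 after a measured L1 miss."""
    return l1_miss + l2_increment


def expected_latency(hit_fractions, latencies=BEST_CASE_CYCLES):
    """Average load latency, weighted by where each load is satisfied."""
    if abs(sum(hit_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("hit fractions must sum to 1")
    return sum(hit_fractions[level] * latencies[level] for level in hit_fractions)


# Hypothetical workload: 90% of loads hit L1, 8% hit L2, 2% hit L3.
example_fractions = {"L1": 0.90, "L2": 0.08, "L3": 0.02}

print(l2_fetch_after_l1_miss())             # 31 + 7 cycles
print(expected_latency(example_fractions))  # weighted average in cycles
```

Shifting even a few percent of loads from the L1 to the L3 moves the average noticeably, which is why a single "latency" number for a cache level should be treated as a best case.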

To summarize:
L1 -> CPU latency:  2-9 cycles (depending on data type)
L2 -> L1 latency:    +7 cycles (after L1 miss)
L3 -> L2 latency:   +14 cycles (after L2 miss)

Useful links for more information and sources:

Search terms:
P4 [xeon] cache latency[, cycles]
P4 [xeon] cache architecture
P4 L3 cache [architecture | latency]
