Good morning Stelle-ga,
The question you ask is not as simple as it at first appears. The P4
has separate L1 instruction and data caches. The L1 Dcache is fairly
standard, but the Icache bears further comment. The Icache does not
actually cache the contents of the L2, but rather caches
already-decoded instructions (Intel calls it the trace cache). As a
result, its contents are not fed into the same pipeline stage that a
traditional L1 Icache would feed. Since the instructions are already
decoded, the P4's Icache skips the initial x86 decoding stages which
form the head of the P4's pipeline. When an instruction is not found
in the L1 trace cache, the CPU checks the L2 cache. If it is found
there, the instruction passes through 8 pipeline stages, which decode
it into P4 micro-ops and deposit it in the L1 Icache. There are a few
other complications to your question as well, which I will mention
later.
Ace's Hardware ( http://www.aceshardware.com/read_news.jsp?id=75000402
) has a short post about the then-new Xeon DP with 1M L3 cache that
answers many of the questions you asked. The L1 (data cache) latency
is 2 cycles, the L2 latency is 9 cycles (2 + 7), and the L3 latency is
23 cycles (2 + 7 + 14). These are all best-case scenarios. For
example, while the L1 cache's integer load latency is indeed 2 cycles,
its floating-point load latency climbs to 9 cycles.
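
As a quick check on the arithmetic above, here is a minimal Python
sketch that adds up the per-level increments (2, 7, and 14 cycles) to
get the best-case totals. The names INCREMENTS and
cumulative_latencies are my own and are only for illustration; the
numbers are the ones quoted from Ace's Hardware.

```python
from itertools import accumulate

# Incremental latencies in cycles, as quoted above: the L1 itself, then
# the extra cost of going out to the L2, then the extra cost of the L3.
INCREMENTS = [("L1", 2), ("L2", 7), ("L3", 14)]


def cumulative_latencies(increments):
    """Return {level: best-case latency in cycles} as running totals."""
    names = [name for name, _ in increments]
    totals = accumulate(cycles for _, cycles in increments)
    return dict(zip(names, totals))


if __name__ == "__main__":
    for level, cycles in cumulative_latencies(INCREMENTS).items():
        print(f"{level}: {cycles} cycles")
```

These are best-case integer-load figures only; the 9-cycle
floating-point L1 latency mentioned above is not modeled.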
Ace's also measured the L1 cache's miss and replace latency at 31 and
17 cycles, respectively, and the L2's miss latency at 328 cycles. See
( http://www.aceshardware.com/read.jsp?id=55000253 ). What this means
is that, although the L2 latency is listed as 9 cycles (2 + 7), a load
that misses the L1 and hits the L2 may take up to 38 cycles (the
31-cycle L1 miss plus the 7-cycle L2 step). Similar calculations hold
for the L3. Bear in mind that, while the P4's cache architecture is
"mostly inclusive", it is not true that ALL L1 Dcache data is mirrored
in both the L2 cache and the L3 cache. This makes it difficult to
give you precise numbers, as the latencies seen will depend on where
the data actually lies.
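
The worst-case figure above can be reproduced with a small Python
sketch. It assumes the 38 cycles come from the measured 31-cycle L1
miss plus the 7-cycle L2 step. The L3 line is only my reading of
"similar calculations" (adding the 14-cycle L3 step on top) and is not
a measured number. The function name miss_path_latency is
hypothetical.

```python
L1_MISS = 31   # measured L1 miss latency in cycles (Ace's Hardware)
L2_STEP = 7    # extra cycles to reach the L2
L3_STEP = 14   # extra cycles to reach the L3


def miss_path_latency(l1_miss, *steps):
    """Add the per-level steps on top of the measured L1 miss cost."""
    return l1_miss + sum(steps)


if __name__ == "__main__":
    # L1 miss followed by an L2 hit.
    print("L2 hit after L1 miss:", miss_path_latency(L1_MISS, L2_STEP))
    # ASSUMPTION: the same pattern extended by one level, not measured.
    print("L3 hit after L2 miss:",
          miss_path_latency(L1_MISS, L2_STEP, L3_STEP))
```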
Sandpile.org lists an absolute ton of information. Their P4
(including Celeron and Xeon) page, found at (
http://www.sandpile.org/impl/p4.htm ), confirms Ace's 2 / 7 / 14 cycle
latencies for L1, L2, and L3 caches. It also provides information
about the P4's prefetch logic, which is one of the reasons why data is
not mirrored in all caches.
To summarize:
L1 -> CPU latency: 2-9 cycles (depending on data type)
L2 -> L1 latency: +7 cycles (after L1 miss)
L3 -> L2 latency: +14 cycles (after L2 miss)
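
Since the latency you actually see depends on where the data lies, a
rough way to use the summary numbers is a weighted average over the
levels. The Python sketch below does that with the best-case totals
(2, 9, and 23 cycles). The hit fractions in example_mix are made up
purely for illustration, and the sketch assumes every load is served
by one of the three caches (no trips to main memory).

```python
# Best-case cumulative latencies in cycles, from the summary above.
BEST_CASE = {"L1": 2, "L2": 9, "L3": 23}


def expected_latency(latency, fractions):
    """Weighted average latency; fractions must sum to 1."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(latency[level] * share for level, share in fractions.items())


# HYPOTHETICAL mix: share of loads served by each level (not measured).
example_mix = {"L1": 0.90, "L2": 0.08, "L3": 0.02}

if __name__ == "__main__":
    print(f"{expected_latency(BEST_CASE, example_mix):.2f} cycles on average")
```

With real hit rates from a profiler, the same function gives a
first-order estimate, though it ignores the measured miss penalties
discussed earlier.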
Useful links for more information and sources:
http://www.intel.com
http://www.intel.com/research/
http://www.sandpile.org
http://www.sandpile.org/impl/p4.htm
http://www.aceshardware.com
http://www.aceshardware.com/read.jsp?id=20000190
http://www.aceshardware.com/read.jsp?id=25000191
http://www.arstechnica.com
http://arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-1.html
http://arstechnica.com/cpu/01q4/p4andg4e2/p4andg4e2-1.html
Search terms:
P4 [xeon] cache latency[, cycles]
P4 [xeon] cache architecture
P4 L3 cache [architecture | latency]
-Haversian