Hello Aaslam,
Background: when referring to SRAM, I assume you are talking about
main memory. If you really mean SRAM caches (as the comment implies),
the performance gap is much smaller - perhaps 5:1 - but the
explanations still apply. I would be glad to revise the answer if
needed to describe SRAM in caches more accurately.
Q: Are registers inherently faster than SRAM?
A: Yes.
Q: Why?
A: Several factors including:
- distance signals must travel (speed of light; registers are part of the CPU)
- speed of the components implementing the registers (also more expensive)
- complexity of the interface (registers are accessed directly; memory
goes through one or more cache levels / memory controllers)
- access to memory may be delayed by I/O units (e.g., DMA, PCI transfers)
When a cache is involved, there is also some complexity related to
independent updates of memory. This will require cache flushes to keep
the cache / memory contents consistent (sometimes referred to as cache
coherency).
Refer to:
http://www.sc2001.org/papers/pap.pap120.pdf
(describes cycle stealing)
[several subsequent links also help describe this]
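To make the "direct vs. through the memory system" difference concrete,
here is a rough C sketch (my own illustration, not taken from the
references above). The first loop keeps its accumulator in a local
variable the compiler will normally place in a register; the second
forces every access through memory with "volatile". In practice that
memory location stays in the L1 cache, so this mostly shows the
register-vs-cache gap rather than register-vs-main-memory, and an
optimizing compiler may fold the first loop away entirely - treat it
as an illustration, not a benchmark.

  #include <stdio.h>
  #include <time.h>

  #define N 100000000LL

  int main(void)
  {
      long long sum = 0;              /* likely kept in a register */
      volatile long long mem_sum = 0; /* every access goes through the memory system */
      clock_t t0, t1, t2;

      t0 = clock();
      for (long long i = 0; i < N; i++)
          sum += i;                   /* register add */
      t1 = clock();
      for (long long i = 0; i < N; i++)
          mem_sum += i;               /* load + add + store on every iteration */
      t2 = clock();

      printf("register loop: %.2fs   memory loop: %.2fs   (sum=%lld mem_sum=%lld)\n",
             (double)(t1 - t0) / CLOCKS_PER_SEC,
             (double)(t2 - t1) / CLOCKS_PER_SEC,
             sum, mem_sum);
      return 0;
  }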
Q: How much faster are registers compared to SRAM?
A: Some old designs had basically a 1:1 ratio - each instruction took
one cycle (instruction fetch / memory operation). For modern systems
the ratio varies, but 30:1 is typical. The 100:1 ratio noted in the
comment is also possible. There are also "Non Uniform Memory Access"
(NUMA) systems where access times to local memory are similar to a
normal system, but access to memory on other nodes can be 10 times
worse (or more). Also note that write access may be "faster" than read
access due to caching effects (fewer CPU stalls).
For reference:
http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/mem_hierarchy.html
(data from 1996 - 200 to 400 MHz processors)
http://www-courses.cs.uiuc.edu/~cs333/slides/chapter5%5B1%5D-1.pdf
(general material, describes ratios of 2:1 up to 200:1 in one table)
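If it helps tie those numbers together, here is a small
back-of-the-envelope sketch in C (my own illustration; the cycle
counts and hit rate are made-up round numbers, not measurements). It
shows why "registers vs SRAM" comes out around 3:1 to 5:1 when the
SRAM is a cache that is usually hit, but 30:1 to 100:1 when an access
has to go all the way to main memory.

  #include <stdio.h>

  int main(void)
  {
      double reg_cycles   = 1.0;    /* register operand: part of the instruction */
      double cache_cycles = 3.0;    /* SRAM cache hit */
      double dram_cycles  = 100.0;  /* access that goes to main memory */
      double hit_rate     = 0.97;   /* fraction of accesses satisfied by the cache */

      /* average memory access time when the SRAM acts as a cache */
      double avg = hit_rate * cache_cycles + (1.0 - hit_rate) * dram_cycles;

      printf("cache hit vs register : %5.1f : 1\n", cache_cycles / reg_cycles);
      printf("average   vs register : %5.1f : 1\n", avg / reg_cycles);
      printf("memory    vs register : %5.1f : 1\n", dram_cycles / reg_cycles);
      return 0;
  }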
Q: How many storage bits are provided on chip by registers?
A: Wow - that varies a lot by processor type and age of the system.
The "right answer" is a pretty wide range. In the early to mid '70s it
was common to have a single register or a pair of registers. For a 16
bit word size, that would be 16 or 32 bits in registers. You may still
see that today in embedded microprocessors. Several CISC machines
would later have 8 to 16 registers - so at 32 bits each, you get 256
to 512 bits. Several RISC machines would have up to 32 registers, so
1024 to 2048 bits (at 32 or 64 bits each). Note this is MUCH smaller
than cache sizes and memory sizes.
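The arithmetic behind those bit counts is simple, but here it is
spelled out in a few lines of C, with a cache size for scale (the
32 KB figure is just an example I picked, not from the references
below).

  #include <stdio.h>

  struct machine { const char *name; int regs; int width_bits; };

  int main(void)
  {
      struct machine m[] = {
          { "PDP-8 (single 12-bit accumulator)", 1, 12 },
          { "typical CISC (16 x 32-bit)",        16, 32 },
          { "typical RISC (32 x 32-bit)",        32, 32 },
          { "typical RISC (32 x 64-bit)",        32, 64 },
      };
      long cache_bits = 32L * 1024 * 8;   /* a 32 KB cache, for comparison */

      for (int i = 0; i < 4; i++)
          printf("%-36s %6d bits\n", m[i].name, m[i].regs * m[i].width_bits);
      printf("%-36s %6ld bits\n", "32 KB cache, for comparison", cache_bits);
      return 0;
  }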
For reference, see:
http://www.cs.uiowa.edu/~jones/pdp8/man/registers.html
(PDP-8 series, single accumulator and some models w/ extended accumulator)
http://www.8052.com/tutbregs.phtml
(8051 series, w/ single A/B accumulator, a set of 8 limited "registers")
http://www.osdata.com/topic/language/asm/register.htm
(several machine references)
http://www.sics.se/~psm/sparcstack.html
(Sparc register explanation - a RISC machine, also describes stack usage)
Search phrases included:
memory delay I/O
8051 registers
pdp-8 registers
vax registers
sparc registers
memory register cycle ratio
If any part of this answer is unclear or does not meet your needs,
please use a clarification request.
--Maniac

Clarification of Answer by maniac-ga on 28 Jul 2004 10:05 PDT
Hello Aaslam,
Hmm. Getting "real world" (instead of student homework) data and
schematics is taking some digging. I can give you a partial answer to
your clarification now and will try to get more detailed information
later today.
For another top level diagram of use of SRAM in caches, see:
http://www.gsitechnology.com/MemoryTechnologyForCacheApps.pdf
Describes the use of SRAM in a cache; includes several block diagrams
showing the interconnects (not at a schematic level) as well as the
timing involved.
For some real world (and freely available) designs of systems and components, see:
http://www.opencores.org/
Has a number of publicly available designs for processors and
supporting items (e.g., arithmetic units, hardware interfaces). More
specifically, see
http://www.opencores.org/projects.cgi/web/or1k/openrisc_1200
which describes a full CPU implementation including instruction and
data caches. It has been implemented in some demonstration devices as
well. You can freely download the specifications and design from the
opencores web site.
I am still digging to find some specific timing / size and complexity
answers to your question clarification and will follow up later today.
--Maniac

Clarification of Answer by maniac-ga on 28 Jul 2004 16:43 PDT
Hello Aaslam,
I found some good references to answer the points raised in your clarification.
For the most part, the time taken to access (e.g., read / write cycle)
a register is included in the cycle time of the instruction. So an add
instruction doing something like:
R = R+M
will read and write the register within the time of the CPU
instruction. The memory value (M), however, must be fetched from the
appropriate level of the cache or from memory. Using
http://www.systemlogic.net/articles/01/8/p4/page2.php
as a guide, it indicates that:
- "up to 4 simple arithmetic instructions per clock"
- L1 cache has 2 clock delay
- L2 cache can deliver 1 value each clock after a 10 clock latency
From this information:
- you can manipulate a register at least once per clock
- the access to L1 cache introduces a two clock delay
- the access to L2 cache introduces a ten clock delay
So - yes, you can get an order of magnitude difference in timing
between register operations and SRAM operations (cache).
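To see where those clock counts bite, here is a small C sketch (again
my own illustration, not from the article). It contrasts a chain of
dependent register adds, which can retire roughly one per clock, with
a chain of dependent loads, where each step must wait at least the L1
latency (about 2 clocks in the figures above) before the next address
is known.

  #include <stdio.h>
  #include <stdlib.h>

  #define N (1 << 16)

  int main(void)
  {
      /* dependent register adds: roughly one per clock */
      long r = 0;
      for (long i = 0; i < N; i++)
          r = r + i;

      /* dependent loads: build a chain of indices, then walk it; every
         step is a load whose address comes from the previous load, so
         each one pays at least the L1 latency (or L2 on a miss) */
      long *next = malloc(N * sizeof *next);
      if (!next) return 1;
      for (long i = 0; i < N; i++)
          next[i] = (i + 1) % N;
      long p = 0;
      for (long i = 0; i < N; i++)
          p = next[p];

      printf("r=%ld p=%ld\n", r, p);   /* keep the results live */
      free(next);
      return 0;
  }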
I would like to be able to give you a firm answer on the ratio of
"speed of light" effects compared to "added components" effects on
cache access time but cannot. There are some good references that
describe how design of components has affected cache times but they
don't go into sufficient detail to answer that particular issue. For
example:
http://www.anandtech.com/showdoc.html?i=1235
compares the design of the Pentium III and Athlon, where the L2 cache
in the Pentium III is "on die" and the Athlon's is "off die". The
larger Athlon L2 cache was much slower than the smaller Pentium III
cache due to clock rates and distance. Clock rates may be the dominant
factor in this case.
http://www.kickassgear.com/Articles/Coppermine.htm
Describes the Pentium III Coppermine design. Note the number of
cache-related items described, including:
- width of cache accesses (fetch 256 bits, not 64)
- associative access
- speed increases (since the cache is now on die)
which increase the complexity of the cache and affect its performance.
Note that some of these improve throughput (e.g., the width of the
data path) but do not help latency. Others (such as the speed
increase) do improve latency.
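A quick back-of-the-envelope way to see why a wider data path helps
throughput but not latency (all numbers below are made up for
illustration, not taken from the Coppermine article):

  #include <stdio.h>

  int main(void)
  {
      double clock_ghz   = 1.0;          /* cache clock, GHz */
      double latency_clk = 7.0;          /* clocks from request to first data */
      int    widths[]    = { 64, 256 };  /* bits transferred per clock */

      for (int i = 0; i < 2; i++) {
          /* bytes per clock times clocks per nanosecond = GB/s */
          double gb_per_s = widths[i] / 8.0 * clock_ghz;
          printf("%3d-bit path: %5.1f GB/s bandwidth, latency still "
                 "%.0f clocks (%.1f ns)\n",
                 widths[i], gb_per_s, latency_clk, latency_clk / clock_ghz);
      }
      return 0;
  }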
http://www.hardwareanalysis.com/action/printarticle/1269/
Another look at the Pentium IV, also describing the Pentium Pro
through Pentium III. Talks about other factors, including the use of
branch prediction and a deep pipeline, that mitigate the impact of the
latency of accessing data values (in cache or memory).
For schematics / design data - I'll refer you again to the opencores site
http://www.opencores.com/
which provides complete designs to implement a system (or parts of a
system). The architectural information for the OpenRISC 1000 family is
at
http://www.opencores.com/projects.cgi/web/or1k/architecture/
and more specifics on the OpenRISC 1200 at
http://www.opencores.com/projects.cgi/web/or1k/openrisc_1200
which includes links to the design, tutorials on implementation, and a
mailing list for discussion.
Good luck with your work.
--Maniac