Assume the datapath has data forwarding, no branch hazard protection or
detection, memory reads and writes that take one cycle (unrealistic), and a
conventional five-stage pipeline of Fetch, Register Read, ALU, Memory
Read/Write, and Register Write.
; main loop
.L1:
l.s $f7, -4($1)
.L2:
l.w $3, 0($2)
beq $3, $0, #L3
l.s $f4,0($1)
l.s $f5, -N*4($1)
l.s $f6, N*4($1)
l.s $f8, 4($1)
mul.s $f9, $f4,$f2
add.s $f10, $f5, $f6
add.s $f11, $f7, $f8
add.s $f10, $f10, $f11
mul.s $f10, $f10, $f3
add.s $f7, $f10, $f9
sub.s $f11, $f4, $f7
mul.s $f11, $f11, $f11
add.s $f12, $f12, $f11
s.s $f10, 0($1)
.L3:
addi $1, $1, 4
addi $2, $2, 4
bne $1, $2, #L2
add $1, $0, $3
add $2, $0, $4
c.lt.s $f1,$f12
bc1t #L1
Q1 : Identify all hazards in the Laplace program fragment above.
Q2 : Write down the logical expressions in terms of pipeline
register-number fields and other control signals which will detect and
resolve the hazards.
Q3 : Will the pipeline be forced to stall when executing the program
above? Where and why?
Q4 : Are there any unresolvable load hazards?
Q5 : Identify all data and branch hazards in the optimised code.
Q6 : What major changes have been made to the algorithm to give better performance?
Q7 : We decided to increase the performance of the five-stage pipelined
architecture by increasing the clock frequency. This required the
introduction of an additional pipeline stage in the data memory
section, making it two clock cycles long. What changes would be
required for the data forwarding and hazard detection logic?
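
As a starting point for Q2 and Q7, the usual textbook forwarding and load-use checks for a five-stage pipeline can be sketched in Python. This is a sketch under stated assumptions, not the expected answer: the `Latch` fields (`reg_write`, `mem_read`, `dest`, `rs`, `rt`) are hypothetical names for pipeline-register fields, only one register file is modelled, and register 0 is treated as hardwired to zero (true of $0, not of $f0).

```python
from dataclasses import dataclass


@dataclass
class Latch:
    """Hypothetical subset of the fields held in one pipeline register."""
    reg_write: bool = False  # instruction will write a register
    mem_read: bool = False   # instruction is a load
    dest: int = 0            # destination register number
    rs: int = 0              # first source register number
    rt: int = 0              # second source register number


def forward_select(ex_mem: Latch, mem_wb: Latch, src: int) -> str:
    """Choose where the ALU should take source register `src` from."""
    # The newest result (one instruction ahead) takes priority.
    if ex_mem.reg_write and ex_mem.dest != 0 and ex_mem.dest == src:
        return "EX/MEM"
    # Otherwise use the result from two instructions ahead.
    if mem_wb.reg_write and mem_wb.dest != 0 and mem_wb.dest == src:
        return "MEM/WB"
    # No hazard: use the value read from the register file.
    return "REG"


def load_use_stall(if_id: Latch, id_ex: Latch) -> bool:
    """True when the instruction being decoded needs a value that the load
    immediately ahead of it has not yet fetched from memory."""
    return bool(id_ex.mem_read and id_ex.dest in (if_id.rs, if_id.rt))


# l.w $3, 0($2) followed immediately by beq $3, $0, #L3
load = Latch(reg_write=True, mem_read=True, dest=3)
branch = Latch(rs=3, rt=0)
print(load_use_stall(branch, load))  # True
```

The example pair is taken from the fragment above and assumes the branch compares its operands in the ALU stage. With a two-cycle data memory stage (Q7) there would be one more pipeline register between the load and its consumer, so both functions would need an extra comparison against that register.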
Another super-linear question
For small numbers of processors it is possible to obtain
"super-linear" speedup. That is, 2 processors complete the program in
less than half the time, and 4 in less than a quarter of the time, that
one processor would take. Why?
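
The definition above can be checked numerically with a short Python sketch. The timings used are hypothetical, chosen only to illustrate the test; they are not measurements of the Laplace program.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup is the one-processor time divided by the p-processor time."""
    return t_serial / t_parallel


def is_super_linear(t_serial: float, t_parallel: float, processors: int) -> bool:
    """Super-linear means the speedup exceeds the number of processors."""
    return speedup(t_serial, t_parallel) > processors


# Hypothetical timings: 100 s on one processor, 45 s on two.
print(is_super_linear(100.0, 45.0, 2))  # True: less than half the time
print(is_super_linear(100.0, 50.0, 2))  # False: exactly linear
```

The function only classifies a measured result; it says nothing about why the speedup occurs, which is what the question asks.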