## Computer Design (0907432) Homework 2 Submit Handwritten Solutions **Q1:** Given the following C code and its MIPS translation. for (i=1000; i>0; i=i-1) (6 points) ``` *k[i] = *k[i] * a[i]; Loop: ld r3, 0(r1) l.d f0, 0(r3) l.d f2, 0(r2) mul.d f4, f0, f2 s.d f4, 0(r3) daddui r1, r1, -8 daddui r2, r2, -8 bnez r2, Loop ``` Use the latencies given in Slide 16 of the 5<sup>th</sup> slide collection of this course. And assume that branch instructions are resolved in the decode stage. 1) How many cycles are needed to execute one iteration of this loop? ## 13 CYLCES 2) Perform your best pipeline scheduling on this loop and find how many cycles are needed to execute one iteration of the scheduled loop. ## 9 CYLCES 3) Unroll this loop twice and schedule it, then find how many cycles are needed to execute one iteration of the unrolled loop. ``` 1 2 3 4 5 6 7 8 9 0 1 2 3 ld r3, 0(r1) FDEMW Loop: ld r4, -8(r1) l.d f0, 0(r3) FDEMW FDEMW 1.d f2, 0(r2) 1.d f6, 0(r4) FDEMW FDEMW 1.d f8, -8(r2) 1.d f8, -8(r2) mul.d f4, f0, f2 mul.d f10, f6, f8 FDEMW F D A A A A M W FDAAAAMW daddui r1, r1, -16 FDEMW daddui r2, r2, -16 FDEMW s.d f4, 0 (r3) FDEMW s.d f10, 0(r4) F D E M W bnez r2, Loop FDEMW \mathbf{F} ``` ## 13 CYLCES **Q2:** Given two branch predictors that use BHT with n=1 and n=2. Which predictor performs best in the following situations? (3 points) - 1) Predicting the direction of a branch that is taken 100 times and then not taken 100 times. BHT with n=1 is better. - 2) Predicting the direction of an end-of-loop branch. BHT with n=2 is better. - 3) Predicting the direction of a branch that changes direction on every two executions. BHT with n=1 is better. ``` Q3: What is the size of (2, 1) correlating branch predictor with k = 12? Size = 2^m * n * 2^k = 4 * 1 * 4K = 16 Kbits ``` **Q4:** Using pipeline diagrams, how many cycles are needed to fetch and complete executing the first two iterations of the MIPS loop shown in **Q1**? Assume that the branch is predicted taken and resolved in the execute stage. Also assume that integer instructions (like FP instructions) are executed using Tomasulo's algorithm. Each integer instruction takes one cycle in the execute stage and each FP instruction takes three cycles in the execute stage. And assume that the available functional units are branch unit, integer ALU, memory port, FP adder, FP multiplier. (4 points) 1) Using Tomasulo's algorithm described in the class. 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 ld r3, 0(r1) IAMW Loop: 1.d f0, 0(r3) Ι A M W 1.d f2, 0(r2) I A M W mul.d f4, f0, f2 I E E E Ws.d f4, 0(r3)ΙA Μ daddui r1, r1, -8 I E W daddui r2, r2, -8 I E W bnez r2, Loop I Ε ld r3, 0(r1) Loop: Ι A M W 1.d f0, 0(r3) A M W 1.d f2, 0(r2) ΙA Mmul.d f4, f0, f2 Ι E E E Ws.d f4, 0(r3)ΙA M daddui r1, r1, -8 I E W daddui r2, r2, -8 I E bnez r2, Loop Ι E 22 CYLCES 2) Using the speculative Tomasulo's algorithm described in the class. 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 ld r3, 0(r1) IAMWC Loop: 1.d f0, 0(r3)I A M W C 1.d f2, 0(r2) I A M W C mul.d f4, f0, f2 I EEEWC s.d f4, 0(r3) IA daddui r1, r1, -8 I E W daddui r2, r2, -8 I E W bnez r2, Loop Ι $\mathbf{E}$ ld r3, 0(r1) Loop: I A M W 1.d f0, 0(r3)Ι A M W C 1.d f2, 0(r2) I A M W C mul.d f4, f0, f2 Ι EEEWC s.d f4, 0(r3)IA daddui r1, r1, -8 I E W I E W Ι Ε C C daddui r2, r2, -8 bnez r2, Loop 22 CYLCES