## 0907432 Computer Design (Spring 2012) <u>Midterm Exam Solution</u>

| رقم الشعبة:                                        | <u>Midterm Exam Solution</u><br>رقم التسلسل:                                                                                              | الاسم:                                    |
|----------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------|
|                                                    | Closed books & notes. No calculators or mole<br>work clearly and give the final answer in the sp                                          | 1 1                                       |
| executed on a processor that                       | tions and 50% of these instructions are memory<br>t runs on a 2-GHz clock, executes the memory<br>her instructions in an average 1.0 CPI. |                                           |
| (a) What is the overall aver                       | rage CPI?2.5 cycles                                                                                                                       |                                           |
| Average CPI = 0.50 *                               | 4.0 + 0.50 * 1.0 = 2.5                                                                                                                    |                                           |
| (b) What is the time needed                        | d to execute this program? <u>1.25</u> see                                                                                                | conds                                     |
| Time = $(10^9 * 2.5) / (2^5)$                      | $(*10^9) = 1.25$                                                                                                                          |                                           |
| (c) Using Amdahl's law, w of 4?2.5                 | what is the speedup when the memory instruction                                                                                           | ons are improved by a factor              |
| f = 2/2.5 = 0.8<br>Speedup = 1 / ((1-0.8)          | + 0.8/4) = 1 / 0.4 = 2.5                                                                                                                  |                                           |
|                                                    | er costs \$5,000 and has an average yield of 50% that of area $22 \times 7 \text{ mm}^2$ each.                                            | %. The wafer radius is 70 mm<br>[3 marks] |
| The approximate die co                             | ost is:100 dollars                                                                                                                        | [5 marks                                  |
| No of dies per wafer =<br>Cost ≅ \$5000 / (100 * 0 | - Wafer Area / Die Area ≅ ((22/7) * 70 * 70) /<br>0.50) = \$100                                                                           | / (22*7) = <b>100</b>                     |
|                                                    |                                                                                                                                           |                                           |
|                                                    |                                                                                                                                           |                                           |
|                                                    |                                                                                                                                           |                                           |

Q3. Assume that you have a typical 5-stage pipelined processor that uses forwarding and stalls to solve data hazards. After you do code scheduling (modifying and rearrange instructions) for the following code sequence, the minimum number of cycles needed to execute this sequence is: \_\_\_\_8\_\_\_ cycles (*You must show your modified code in the right space*)

[3 marks]

| Original Code Sequence |     |     |        |     | Rearra | nged and Modified Code Sequence |
|------------------------|-----|-----|--------|-----|--------|---------------------------------|
| L1:                    | lw  | R5, | 0(R1)  | L1: | lw     | R5, 0(R1)                       |
|                        | add | R6, | R5, R5 |     | lw     | R30, 0(R2)                      |
|                        | lw  | R5, | 0(R2)  |     | add    | R6, R5, R5                      |
|                        | add | R7, | R5, R5 |     | add    | R7, R30, R30                    |
|                        |     |     |        |     |        |                                 |
|                        |     |     |        |     |        |                                 |
|                        |     |     |        |     |        |                                 |
|                        |     |     |        |     |        |                                 |

Q4. Assume that the following eight instructions are executed by a speculative superscalar processor of degree 3. This processor uses reservation stations and 48-entry reorder buffer. The integer latency is 1 cycle, the memory latency is 2 cycles, and floating-point latency is 3 cycles. The processor has one address calculation unit, one memory access unit, one integer ALU unit, one floating-point unit, and one branch unit. The processor has 5 reservation stations for each functional unit. Using pipeline diagram in the space below, the number of cycles needed to fetch and commit these instructions is: \_\_\_\_15\_\_\_ cycles [6 marks]

|       |     |        | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|-------|-----|--------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|
| lw    | R1, | 0(R2)  | F | Ι | A | M | W | С |   |   |   |    |    |    |    |    |    |    |    |
| lw    | FO, | 0(R1)  | F | Ι |   |   |   | A | Μ | W | С |    |    |    |    |    |    |    |    |
| add.d | F4, | F2, F0 | F | Ι |   |   |   |   |   |   | E | E  | E  | W  | С  |    |    |    |    |
| SW    | F4, | 0(R1)  |   | F | Ι |   |   |   | A |   |   |    |    |    | С  |    |    |    |    |
| lw    | R1, | 8(R2)  |   | F | Ι | A | Μ | W |   |   |   |    |    |    | С  |    |    |    |    |
| lw    | FO, | 0(R1)  |   | F | Ι |   |   |   |   | A | Μ | W  |    |    |    | С  |    |    |    |
| sub.d | F4, | F0, F6 |   |   | F | Ι |   |   |   |   |   |    | E  | E  | E  | W  | С  |    |    |
| SW    | F4, | 0(R1)  |   |   | F | Ι |   |   |   |   | A |    |    |    |    |    | С  |    |    |

**Q5.** Assume that you have a typical 5-stage pipelined processor that uses forwarding and stalls to solve data hazards. Assume that the processor is executing the following code segment and that the first instruction has reached the write back stage.

[6 marks]

| sub | R1, | R2, | R3 |
|-----|-----|-----|----|
| add | R2, | R1, | R3 |
| and | R4, | R1, | R2 |
| or  | R1, | R2, | R3 |

- (a) In the following table specify what instruction has reached each of the four listed stages.
- (b) The figure below shows the last four stages of this pipeline. This figure has four multiplexers. For each multiplexer, draw a circle mark on the selected (active) multiplexer input.



| <b>Q6.</b> The size of a direct-mapped cache is 16 KB and is organized in 128 blocks. The processor accesses this cache using a 32-bit address in 1 cycle when the data is in the cache. This cache is a write-back cache and does not use write buffers. The second memory level is the main memory and it takes 100 cycles to read or write one memory block. |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [6 marks]                                                                                                                                                                                                                                                                                                                                                       |
| (a) What is the cache block size? <u>128</u> bytes                                                                                                                                                                                                                                                                                                              |
| Block size = cache size / no of blocks = $16 * 2^{10} / 2^7 = 2^{14} / 2^7 = 2^7 = 128$                                                                                                                                                                                                                                                                         |
| (b) How many bits are needed for the tag field?18 bits                                                                                                                                                                                                                                                                                                          |
| $< tag > = 32 - lg_2 \ 128 - lg_2 \ 128 = 32 - 7 - 7 = 18$                                                                                                                                                                                                                                                                                                      |
| (c) How many cycles are needed to access the cache when there is a cache miss and there is a need to                                                                                                                                                                                                                                                            |
| replace a dirty block?201 cycles                                                                                                                                                                                                                                                                                                                                |
| 1 cycle to access the cache plus 100 cycles to write back the dirty block plus 100 cycles to load<br>the missed block = 201 cycles.                                                                                                                                                                                                                             |
|                                                                                                                                                                                                                                                                                                                                                                 |

<Good Luck>