No More Worries!

Our orders are delivered strictly on time without delay

Paper Formatting

Double or single-spaced
1-inch margin
12 Font Arial or Times New Roman
300 words per page

No Lateness!

Our orders are delivered strictly on time without delay

LTP Guarantees

Free Unlimited revisions
Guaranteed Privacy
Money Return guarantee
Plagiarism Free Writing

Computer Architecture Homework

Question 1. Instruction Cache (16 points) In this problem, we will be exploring adding an instruction cache to a standard 5-stage pipelined processor. We will be using the MIPS assembly program shown below in Figure 1. The rst column shows the instruction address for each instruction. Note that these addresses are byte addresses. The value of r1 is initially 64, meaning that there are 64 iterations in the loop. In this problem, we will be considering the execution of this loop with a direct- mapped instruction cache microarchitecture with eight 16-Byte cache lines. This means each cache line can hold four instructions and the bottom four bits of an instruction address are the block oset. Hint: The rst instruction in Figure 1 (i.e., addiu r1, r1, -1), is in the middle of a cache line. Address Instruction Iteration 1 Iteration 2 loop: 0x108 addiu r1, r1, -1 0x11c addiu r2, r2, 1 0x110 j foo ... foo: 0x218 addiu r6, r6, 1 0x21c bne r1, r0, loop Figure 1: MIPS assembly loop Question 1.A Catagorizing Cache Misses (4 points) Fill out the table above. In the appropriate column, write compulsory, con ict, or ca- pacity next to each instruction which misses in the instruction cache to indicate the type of instruction cache misses that occur in the rst and second iteration of the loop. Assume that the instruction cache is initially completely empty. Compulsory misses occur when a memory access maps to a currently invalid (or empty) cache line or cache set. Con ict misses occur when multiple memory addresses map to the same cache line or cache set. Capacity misses occur when the cache is full and can no longer handle further memory access requests or contain the working set of data needed for program execution. 2 Question 1.B Average Memory Access Latency (6 points) Calculate the instruction cache miss rate for 64 iterations of the loop. Calculate the average instruction cache memory access latency in cycles for 64 iterations of the loop. Assume the hit time is one cycle and that the miss penalty is 5 cycles. You must show your work, especially the various components of the average memory access latency. Remark on which kind of miss is dominating the average memory access latency. Question 1.C Set-Associativity (6 points) Qualitatively, predict how the cache performance would change if we replace the eight-entry, direct-mapped cache with an eight-entry, two-way, set-associative cache. Both caches have a one-cycle hit latency. Assume the set-associative cache address interleaves the sets across the ways using the least signicant bits right after the block oset. What kind of misses would be present with this kind of cache microarchitecture? 3 Question 2. Cache Mapping and Access (16 points) Consider a 512-KByte cache with 64-word cachelines (a cacheline is also known as a cache block, each word is 4-Bytes). This cache uses write-back scheme, and the address is 32 bits wide. Question 2.A Direct-Mapped, Cache Fields (2 points) Assume the cache is direct-mapped. Fill in the table below to specify the size of each address eld. Field Size (bits) Cacheline Oset Cacheline Index Tag Question 2.B Fully-Associative, Cache Fields (2 points) Assume the cache is fully-associative. Fill in the table below to specify the size of each address eld. Field Size (bits) Cacheline Oset Cacheline Index Tag Question 2.C 16-Way Set-Associative, Cache Fields (2 point) Assume the cache is 16-way set-associative. Fill in the table below to specify the size of each address eld. Field Size (bits) Cacheline Oset Cacheline Index Tag 4 Question 2.D Direct-Mapped, Cache Transactions (4 points) Assume the cache is direct-mapped. Fill in the table on the next page to identify the content of the cache after each of the following memory accesses. Assume the cache is initially empty (also called a \cold cache"). Specify if an entry causes another line to be replaced from the cache, and if an entry has to write its data back to memory. For the data column, specify the data in the block by referring to its address like M[address]. Write accesses will modify the data, so lets indicate the data after a write access with D[address]. Request Cachline Hit or Caused Write-back Address Type Index Miss? Modied Tag Data Replace? to Memory? 0x128 read M[0x100] 0xF40 write D[0xF00] 0xA00051 read 0x093 write 0x4000B44 read Question 2.E 16-Way Set-Associative, Cache Transactions (4 points) Assume the cache is 16-way set-associative. Fill the table below to identify the content of the cache after each of the following memory accesses. Assume the cache is empty in the beginning (also known as cold cache). Specify if an entry causes another line to be replaced from the cache, and if an entry has to write its data back to memory. For the data column, specify the data in the block by referring to its address like M[address]. Write accesses will modify the data, so lets indicate the data after a write access with D[address]. Request Cachline Hit or Caused Write-back Address Type Index Miss? Modied Tag Data Replace? to Memory? 0x128 read M[0x100] 0xF40 write D[0xF00] 0xA00051 read 0x093 write 0x4000B44 read Question 2.F Overhead (2 points) What is the overhead and actual size of the direct-mapped cache? What is the overhead and actual size of the 16-way set-associative cache? Does the structure change the overhead in terms of number of memory bits? 5 Question 3. Average Memory Access Time (12 points) Considered two pipelined machines A and B, described in the table below. Property Computer A Computer B ISA MIPS x86 Clock Frequency 2 GHz 3 GHz Base CPI 1 cycle/instruction 2 cycles/instruction L1 Instruction Cache Hit Time 1 cycle 1 cycle L1 Instruction Cache Miss Rate 2% 2% L1 Data Cache Hit Time 1 cycle 2 cycles L1 Data CAche Miss Rate 8% 5% L2 Hit Time 15 cycles 12 cycles L2 Global Miss Rate 3% 4% Main Memory Access Time 250 cycles 300 cycles Figure 2: Table containing properties of two dierent machines Question 3.A Ideal System (4 points) Assume a perfect memory system (100% of memory accesses hit in the L1 caches) and perfect branch prediction. Assume that MIPS programs execute 1.25x as many instructions as x86 programs. Which machine has the faster execution time and what is the speedup? Question 3.B No L2 Cache (4 points) Now assume each machine has only L1 caches (split iL1 cache and dL1 cache), no L2 cache, perfect branch prediction, and that 30% of the instructions are loads/stores. Which machine has the faster execution time and what is the speedup? Question 3.C Full System (4 points) Now assume each machine has both L1 caches and a unied L2 cache, and that 30% of the instructions are loads/stores. Which machine has the faster execution time and what is the speedup? 6 Question 4. Array vs. List Cache Behavior (20 points) In this problem, you will explore the cache behavior for a basic operation on two common software data structures. Figure 3 illustrates a basic linear array and a doubly linked list. Each node in the doubly linked list has a 4-Byte value eld, a 4-Byte pointer that points to the next node in the list, and a 4-Byte pointer that points to the previous node in the list. The previous pointer for the head node is dened to be zero, and the next pointer for the tail node is dened to be zero. For this problem, you should assume that both data structures contain 64 4-Byte values (i.e., the array is 64 elements long, and the linked list contains 64 nodes). We wish to explore the performance of reversing the values in each data structure. Conven- tional wisdom in a basic computer science course on data structures might suggest that the time to complete this operation on both the array and linked list is O(n) where n is the num- ber of elements in the data structure. \Big-O" notation is useful when analyzing asymptotic behavior as n grows very large, but it abstracts many important \constant factors" that can dominate the performance for reasonable sized data structures on real architectures. For this problem, you should assume we have a very large, fully-associative data cache such that there are no capacity nor con ict misses. Assume a write-back, no write-allocate cache. The data cache uses 16-Byte cache lines and is initially empty. All data cache accesses will result in either a hit or a compulsory miss. You should also assume that the array data structure is cache-line aligned, each node in the linked list is also cache-line aligned, and there is only one linked list node per cache line. Cache-line aligned means that the rst element of the array is at the very beginning of a cache line, and that value eld for each linked list node is also at the very beginning of a cache line. The data cache has a hit latency of one cycle and a miss penalty of four cycles. This means if an instruction stalls in M waiting for a cache miss, it will remain in the M stage for a total of ve cycles (one cycle for the hit latency and four cycles for the miss penalty). The processor should stall for both read and write misses. Figure 3: Linear Array and Doubly Linked-List Data Structures 7 CPI Breakdown Number of Execution Useful RAW Control Memory Question Instructions CPI Time (cyc) Work Stalls Squashes Stalls 4.A read 4.B write Figure 4: Execution Time for Reverse Operation on Array and Linked List Data Structures Question 4.A Analyzing Performance of an Array Data Structure (7 points) Figure 5 on the next page shows C-code which implements the reverse operation for an array data structure with an even number of elements. Figure 6 shows corresponding MIPS assembly code for the function. Recall that arguments are stored in r4 and r5. Note that we are organizing the loop a bit dierently than some of the other loops we have studied in this course to better match the assembly code and also to better match the code used for the linked list. We use a few setup instructions to create a pointer to the last element in the array. We then iteratively move the forward and reverse pointers, swapping the data as we go along. Draw a pipeline diagram illustrating at least one iteration of the loop shown in Figure 6. You do not need to show the four setup instructions used to calculate the initial reverse pointer. You should draw as many iterations as you need in order to determine the steady-state behavior of the loop. Carefully consider which memory accesses hit or miss in the data cache. Use your pipeline diagram to estimate the overall execution time in cycles for this operation. Fill in the corresponding row of Figure 4. You will need to decompose the CPI into its various components based on what stalls and/or squashes occur in the execution. You must show your work. 8 Question 4.B Analyzing Performance of a Linked-List Data Structure (7 points) Figure 7 on the next page shows C-code which implements the reverse operations for a linked- list data structure with an even number of elements. Figure 8 shows corresponding MIPS assembly code for the function. Recall that arguments are stored in r4 and r5. We need additional load instructions to retrieve the next and previous pointers for iterating through the list. Note that we use a non-zero oset when accessing these next and previous pointers since we know ahead of time where these elds are located relative to the value eld. Draw a pipeline diagram illustrating at least one iteration of the loop shown in Figure 8. You should draw as many iterations as you need in order to determine the steady-state behavior of the loop. Carefully consider which memory accesses hit or miss in the data cache. Use your pipeline diagram to estimate the overall execution time in cycles for this operation. Fill in the corresponding row of Figure 4. You will need to decompose the CPI into its various components based on what stalls and/or squashes occur in the execution. You must show your work. 9 10 Question 4.C Comparison of Data Structures (6 points) Compare the performance of the two data structures and generalize your results by answering the following questions. Which data structure performs better in this specic example and why? How would the execution time change as a function of cache line size assuming we always only allocate one linked-list node per cache line? How would the execution time change if we assumed larger cache lines and that the memory allocator organizes multiple linked-list nodes on the same cache line? How does the execution time change as the number of elements in the data structure grows asymptotically large? How does this relate to the theoretical asymptotic behavior of O(n)? How would the execution time change if both data structures were already present in the cache such that there were no cache misses? Can we draw any broad conclusions about the cache behavior of more regular array- or matrix-based data structures vs. more irregular list-, tree-, or graph-based data- structures that make extensive use of dynamic memory allocation and pointers? 11

No More Worries!

Paper Formatting

No Lateness!

LTP Guarantees

Computer Architecture Homework

Our Services

Why Choose Us