gem5 RISC-V BOOM Processor Evaluation

Evaluating gem5's RISC-V ISA models using the BOOM processor

As part of a team project, we conducted a comprehensive evaluation of gem5’s RISC-V ISA models using the Berkeley Out-of-Order Machine (BOOM) processor. This research project focused on analyzing the performance characteristics, architectural trade-offs, and simulation accuracy of different processor configurations within the gem5 architectural simulator framework.

The project employed advanced computer architecture simulation techniques, performance benchmarking methodologies, and statistical analysis to evaluate processor designs across multiple workload scenarios. By utilizing the BOOM processor as our primary evaluation platform, we gained insights into out-of-order execution, pipeline optimization, and the impact of various architectural parameters on overall system performance.

Our primary goal was to investigate whether computer architecture simulation tools like gem5 are sufficient for determining the power, performance, and area (PPA) effects of custom processors, or whether the effort of developing a concrete processor core is necessary. We compared gem5’s simulation capabilities against the Chipyard framework and Verilator for BOOM processor modeling.


Table of Contents

  1. Main Goals
  2. Technical Background
  3. Experimental Platform
  4. Methodology
  5. Implementation and Results
  6. Discussion and Limitations
  7. Conclusion and Future Work
  8. References

Main Goals

  1. Evaluate gem5’s RISC-V ISA models using BOOM processor
    • Analyze performance characteristics of different processor configurations
    • Assess simulation accuracy and reliability of gem5 framework compared to Chipyard/Verilator
    • Compare architectural trade-offs in out-of-order execution

  2. Implement comprehensive benchmarking methodology
    • Design systematic evaluation framework for processor performance using riscv-tests and CoreMark
    • Develop statistical analysis techniques for simulation results
    • Create reproducible methodology for architectural research

  3. Analyze cache and memory system performance
    • Evaluate different cache configurations (32B vs 64B block sizes) and their impact on performance
    • Study memory hierarchy optimization strategies including prefetching mechanisms
    • Assess the effectiveness of different prefetcher configurations (AMPM, next-line, tagged)

  4. Investigate pipeline optimization techniques
    • Analyze out-of-order execution efficiency using custom matrix computation benchmarks
    • Evaluate branch prediction accuracy using TAGE predictor configurations
    • Study instruction-level parallelism and its limitations in 4-wide decode configurations

Technical Background

gem5 Architectural Simulator

The gem5 simulator is a modular, open-source computer architecture simulator that provides a flexible framework for evaluating different processor designs. It supports multiple instruction set architectures (ISAs) including RISC-V, ARM, x86, and others.[5] The simulator consists of several key components:

  • CPU Models: Various CPU models including simple, in-order, and out-of-order processors
  • Memory System: Configurable cache hierarchies and memory controllers
  • Interconnect: Flexible network-on-chip and bus models
  • I/O Systems: Support for various I/O devices and interfaces

gem5’s modular design allows researchers to easily configure different architectural parameters and evaluate their impact on performance, power consumption, and other metrics.
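
To make this modularity concrete, the minimal sketch below (gem5 configuration scripts are written in Python) wires an out-of-order RISC-V CPU, a system crossbar, and a DRAM controller into an SE-mode system. It is an illustrative skeleton rather than our project configuration: the binary path is a placeholder, and names such as RiscvO3CPU and cpu_side_ports assume a recent gem5 release (older releases use DerivO3CPU and master/slave ports).

```python
# minimal_riscv_se.py -- illustrative gem5 SE-mode skeleton (not our full setup)
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="1GHz", voltage_domain=VoltageDomain())
system.mem_mode = "timing"                    # use the timing memory model
system.mem_ranges = [AddrRange("512MB")]
system.cache_line_size = 64                   # block size in bytes

system.cpu = RiscvO3CPU()                     # out-of-order CPU (BaseO3CPU model)
system.membus = SystemXBar()                  # system interconnect

# No caches in this skeleton: CPU ports connect straight to the memory bus.
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports
system.cpu.createInterruptController()
system.system_port = system.membus.cpu_side_ports

# Single-channel DDR3 memory controller behind the crossbar.
system.mem_ctrl = MemCtrl(dram=DDR3_1600_8x8(range=system.mem_ranges[0]))
system.mem_ctrl.port = system.membus.mem_side_ports

# Statically linked RISC-V binary to execute (path is a placeholder).
binary = "tests/test-progs/hello/bin/riscv/linux/hello"
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = Process(cmd=[binary])
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
print("Exited @ tick %d: %s" % (m5.curTick(), m5.simulate().getCause()))
```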

BOOM Processor Architecture

The Berkeley Out-of-Order Machine (BOOM) is a high-performance, out-of-order RISC-V processor implementation designed for research and education. Key architectural features include:

  • Out-of-Order Execution: Dynamic instruction scheduling for improved performance
  • Speculative Execution: Branch prediction and speculative instruction execution
  • Register Renaming: Eliminates false dependencies and enables better instruction-level parallelism
  • Load/Store Queue: Manages memory operations and maintains memory consistency
  • Reorder Buffer: Ensures correct program execution order

SonicBOOM (BOOMv3) Configuration:

| Parameter                     | Value |
|-------------------------------|-------|
| Fetch Width                   | 8     |
| Decode Width                  | 4     |
| ROB Entries                   | 128   |
| Integer Registers             | 128   |
| Floating-Point (FP) Registers | 128   |
| Load Queue (LDQ) Size         | 32    |
| Store Queue (STQ) Size        | 32    |
| Maximum Number of Branches    | 20    |
| Fetch Buffer Entry Size       | 32    |
| Prefetching                   | TRUE  |

Table 1. SonicBOOM (MegaBoomConfig) Tile Parameters.

BOOM’s design provides a realistic baseline for evaluating modern processor architectures and understanding the complexities of out-of-order execution.[7]
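
To relate Table 1 to gem5, the sketch below sets the BaseO3CPU parameters that map most directly onto the MegaBoomConfig tile. The parameter names (fetchWidth, numROBEntries, LQEntries, and so on) are real O3CPU knobs, but the mapping is our approximation: some entries, such as the maximum number of in-flight branches, have no direct O3CPU equivalent, and the issue width shown is assumed rather than taken from Table 1.

```python
from m5.objects import RiscvO3CPU   # DerivO3CPU on older gem5 builds

cpu = RiscvO3CPU()

# Front end: Table 1 lists an 8-wide fetch and 32-byte fetch buffer entries.
cpu.fetchWidth = 8
cpu.fetchBufferSize = 32            # bytes per fetch buffer entry

# 4-wide decode through commit, matching MegaBoomConfig's decode width.
cpu.decodeWidth = 4
cpu.renameWidth = 4
cpu.dispatchWidth = 4
cpu.issueWidth = 4                  # assumption: BOOM issues from split queues
cpu.commitWidth = 4

# Back-end structures taken directly from Table 1.
cpu.numROBEntries = 128
cpu.numPhysIntRegs = 128
cpu.numPhysFloatRegs = 128
cpu.LQEntries = 32                  # load queue entries
cpu.SQEntries = 32                  # store queue entries
```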

Figure 1: BOOM version 3 (SonicBOOM) High-level Architecture showing the superscalar out-of-order pipeline design with fetch, decode, rename, dispatch, execute, and commit stages.
Figure 2: Architecture for the Alpha 21264 processor that the BaseO3CPU in gem5 follows closely, providing the foundation for our out-of-order processor modeling.

RISC-V ISA Models

RISC-V is an open-source instruction set architecture that provides a foundation for processor design and evaluation.[14] The ISA includes:

  • Base Integer Instructions: Core computational and control flow instructions
  • Floating-Point Extensions: Support for single and double precision operations
  • Vector Extensions: SIMD operations for data-parallel workloads
  • Privileged Architecture: Support for different privilege levels and virtualization

The modular nature of RISC-V allows for flexible processor implementations and enables detailed analysis of different architectural decisions.

| Feature      | SonicBOOM | RiscyOO  | RSD           | FabScalar                   |
|--------------|-----------|----------|---------------|-----------------------------|
| ISA          | RV64GC    | RV64G    | RV32IM        | PISA (subset)               |
| DMIPS/MHz    | 3.98      | ?        | 2.04          | ?                           |
| SPEC2006 IPC | 0.86      | 0.48     | N/A           | ?                           |
| Decode Width | 1-5       | 2        | 2             | 2-5                         |
| Mem Width    | 16b/cycle | 8b/cycle | 4b/cycle      | 4-8b/cycle                  |
| HDL          | Chisel3   | BSV+CMD  | SystemVerilog | FabScalar toolset with CPSL |

Table 2. Comparison of alternative academic or open-source out-of-order processors.

| Component                       | Present in BaseO3CPU? | Comments                                               |
|---------------------------------|-----------------------|--------------------------------------------------------|
| TAGE-L Branch Predictor         |                       |                                                        |
| MEM Issue Queue                 |                       |                                                        |
| L1/L2 Caches                    |                       | Dual-ported data cache present; needs to be expanded.  |
| L0 and Dense L1 BTB             |                       |                                                        |
| Instruction Fetch and PreDecode |                       | Needs to be integrated into a 16-byte window.          |
| RAR ReOrder Buffer              |                       |                                                        |
| Decoder Blocks                  |                       | Single decoder needs to be expanded to 4-wide.         |
| Execution Units                 |                       | Requires multithreading support; needs an AGU.         |

Table 3. Significant BOOM hardware modules required for modeling in gem5.

Experimental Platform

Hardware Configuration

Our experimental platform consisted of high-performance computing resources capable of running complex architectural simulations:

  • Multi-core CPU: High-performance processor for simulation execution
  • Large Memory Capacity: Sufficient RAM for complex simulation workloads
  • High-speed Storage: Fast storage for simulation data and results
  • Network Connectivity: Access to distributed computing resources

The hardware configuration was designed to support long-running simulations and handle the computational demands of architectural evaluation.

Software Framework

The software framework included several key components:

  • gem5 Simulator: Latest stable release with RISC-V and BOOM support
  • RISC-V Toolchain: Complete development environment for RISC-V applications
  • Benchmark Suite: Standardized workloads for performance evaluation
  • Analysis Tools: Custom scripts for data processing and visualization
  • Statistical Software: Tools for statistical analysis and result validation

The software stack was carefully configured to ensure reproducibility and enable detailed analysis of simulation results.

Methodology

Benchmark Selection

We selected a comprehensive set of benchmarks to evaluate different aspects of processor performance:

Custom C Matrix Computation Benchmark:

  • Computes determinant
  • Computes sum
  • Evaluates if identity matrix

riscv-tests Repository Benchmarks:

  • 1D 3 element median filter (median)
  • Two input stream multiplication (multiply)
  • QuickSort (qsort)
  • Reverse QuickSort (rsort)
  • Sparse matrix-vector multiplication (spmv)
  • Towers of Hanoi (towers)
  • Vector-vector add (vvadd)

riscv-coremark Repository:

  • Ultra-low power and IoT tests
  • Heterogeneous Compute tests
  • Single-core Performance tests
  • Multi-core Performance tests
  • Phone and Tablet tests

These benchmarks were chosen because they exercise general-purpose compute capability and stress-test the functional units within the BOOM processor.
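
A simple way to drive these workloads through gem5 is a host-side batch script that launches one simulation per benchmark binary and keeps each run’s statistics separate, as sketched below. The gem5 binary path, the classic se.py example script, the --cpu-type value, and the riscv-tests binary names are assumptions about the local checkout, and bare-metal riscv-tests builds may need a different workload setup than shown here.

```python
#!/usr/bin/env python3
"""Host-side batch runner: one gem5 SE-mode run per riscv-tests benchmark,
with each run's stats.txt kept in its own output directory. Paths, flags,
and binary names are assumptions about the local checkout."""
import subprocess
from pathlib import Path

GEM5 = "build/RISCV/gem5.opt"               # assumed gem5 binary location
SE_SCRIPT = "configs/example/se.py"         # classic SE-mode example script
BENCH_DIR = Path("riscv-tests/benchmarks")  # assumed prebuilt *.riscv binaries
BENCHMARKS = ["median", "multiply", "qsort", "rsort", "spmv", "towers", "vvadd"]

for bench in BENCHMARKS:
    outdir = Path("results") / bench
    outdir.mkdir(parents=True, exist_ok=True)
    cmd = [
        GEM5, f"--outdir={outdir}", SE_SCRIPT,
        "--cpu-type=RiscvO3CPU",            # DerivO3CPU on older gem5 releases
        "--caches", "--l2cache",            # enable the classic cache hierarchy
        f"--cmd={BENCH_DIR / (bench + '.riscv')}",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)         # stats land in <outdir>/stats.txt
```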

Performance Metrics

Our evaluation focused on several key performance metrics:

  • Instructions Per Cycle (IPC): Primary measure of processor efficiency
  • Cache Miss Rates: Evaluation of memory system performance
  • Branch Prediction Accuracy: Assessment of control flow prediction
  • Memory Bandwidth: Analysis of memory system utilization
  • Power Consumption: Energy efficiency considerations

These metrics provided a comprehensive view of processor performance and architectural trade-offs.
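
Most of these metrics come directly out of gem5’s stats.txt output. The parser sketch below shows the style of extraction we relied on; the exact stat keys (system.cpu.ipc, overallMissRate versus overall_miss_rate, and so on) differ between gem5 versions, so the patterns are assumptions to adjust for a given build, and power is not reported by gem5’s default statistics.

```python
import re
from pathlib import Path

# Stat keys of interest. Exact spellings differ between gem5 versions
# (e.g. overallMissRate::total vs overall_miss_rate::total), so treat these
# regular expressions as templates to adapt for the build in use.
PATTERNS = {
    "ipc": re.compile(r"^system\.cpu\.ipc\s+([\d.]+)"),
    "cpi": re.compile(r"^system\.cpu\.cpi\s+([\d.]+)"),
    "cycles": re.compile(r"^system\.cpu\.numCycles\s+(\d+)"),
    "dcache_miss_rate": re.compile(
        r"^system\.cpu\.dcache\.(?:overallMissRate|overall_miss_rate)::total\s+([\d.]+)"),
    "icache_miss_rate": re.compile(
        r"^system\.cpu\.icache\.(?:overallMissRate|overall_miss_rate)::total\s+([\d.]+)"),
}

def parse_stats(stats_file: Path) -> dict:
    """Return the first value found for each metric in a gem5 stats.txt."""
    results = {}
    for line in stats_file.read_text().splitlines():
        for name, pattern in PATTERNS.items():
            if name not in results:
                match = pattern.match(line)
                if match:
                    results[name] = float(match.group(1))
    return results

if __name__ == "__main__":
    # Assumes a results/<benchmark>/stats.txt directory layout.
    for stats in sorted(Path("results").glob("*/stats.txt")):
        print(stats.parent.name, parse_stats(stats))
```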

Statistical Analysis

We employed rigorous statistical methods to ensure the reliability of our results:

  • Multiple Runs: Each configuration was evaluated across multiple simulation runs
  • Confidence Intervals: Statistical analysis to quantify result uncertainty
  • Outlier Detection: Identification and handling of anomalous results
  • Correlation Analysis: Understanding relationships between different metrics

The statistical framework ensured that our conclusions were based on reliable and reproducible data.
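
To illustrate the flavor of this analysis, the sketch below computes a mean, a normal-approximation 95% confidence interval, and a simple k-sigma outlier check over repeated cycle counts; the thresholds and the sample values are illustrative choices, not our exact procedure.

```python
import statistics

def summarize(samples, z=1.96):
    """Mean, sample standard deviation, and an approximate 95% confidence
    interval (normal approximation; adequate for quick run-to-run comparisons)."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples) if len(samples) > 1 else 0.0
    half_width = z * sd / (len(samples) ** 0.5)
    return mean, sd, (mean - half_width, mean + half_width)

def flag_outliers(samples, k=3.0):
    """Simple k-sigma rule for spotting anomalous simulation runs."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return [x for x in samples if sd and abs(x - mean) > k * sd]

# Illustrative cycle counts from repeated runs of one configuration.
cycles = [56679, 56712, 56640, 56688, 61502]
mean, sd, ci = summarize(cycles)
print(f"mean={mean:.0f}  sd={sd:.0f}  95% CI=({ci[0]:.0f}, {ci[1]:.0f})")
print("outliers:", flag_outliers(cycles))
```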

Implementation and Results

Implementation Progress

We successfully set up the Chipyard framework in our local environment and created RISC-V configurations for gem5, along with a repository tracking our progress and gem5 outputs. The RISC-V GNU Toolchain was set up on an ISU ETG virtual machine to allow compilation of programs to run on the BOOM core.

Key Implementation Achievements:

  • Created custom gem5 configuration scripts in Python based on fs_linux.py and riscv-ubuntu-run.py[4]
  • Successfully tested static compilation of matrix computation programs to RISC-V
  • Configured BOOM processor under Chipyard framework using “MegaBoomConfig”[7]
  • Established GitHub repository for configuration, test, and result files

Technical Challenges Overcome:

  • Resolved lockfile issues by switching to WSL environment
  • Fixed JVM memory allocation challenges in Chipyard framework
  • Solved gem5 segmentation faults by removing HiFive() platform from system configuration
  • Corrected memory size validation errors by properly configuring IO and memory buses with Bridge components (see the sketch below)
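
The snippet below is a reduced sketch of that bus-and-bridge fix. Bridge, SystemXBar, and IOXBar are gem5 SimObjects, and the port names follow recent gem5 releases; the device address range and delay are illustrative, since full-system configs normally take them from the platform object.

```python
from m5.objects import System, SystemXBar, IOXBar, Bridge, AddrRange

system = System()               # stands in for the full-system setup
system.membus = SystemXBar()    # memory-side crossbar
system.iobus = IOXBar()         # I/O-side crossbar

# Bridge from the memory bus toward the I/O bus, restricted to the device
# address window so device ranges are not validated as DRAM. The base address
# and size here are illustrative placeholders.
system.bridge = Bridge(delay="50ns",
                       ranges=[AddrRange(0x10000000, size="256MB")])
system.bridge.cpu_side_port = system.membus.mem_side_ports
system.bridge.mem_side_port = system.iobus.cpu_side_ports
```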

CoreMark Benchmark Analysis

Our CoreMark benchmark analysis revealed several key insights into processor performance:

  • Baseline Performance: Established performance baseline for BOOM processor
  • Configuration Impact: Evaluated the effect of different architectural parameters
  • Scalability Analysis: Studied performance scaling with different workload sizes
  • Optimization Opportunities: Identified areas for architectural improvement

The CoreMark results provided a standardized comparison point for evaluating processor efficiency and identifying optimization opportunities.

Cache Performance Evaluation

Cache performance analysis focused on understanding memory system behavior across three different gem5 configurations:

Configuration 1 (32B Block Size with AMPM Prefetchers):

  • Average cycle count difference: ~8.76%
  • Closest matches: rsort (10.85%) and multiply (11.62%)
  • Largest differences: spmv (79.01%) and towers (46.21%)
  • Host execution time: ~197.51% faster in gem5

Configuration 2 (64B Block Size, No iCache Prefetcher):

  • Average cycle count difference: ~12.26%
  • Improved performance for rsort (8.34%) and qsort (10.28%)
  • Maintained similar host execution time differences (~197%)
  • Best accuracy for several individual benchmarks, despite a higher average difference

Configuration 3 (64B Block Size with Next-Line Prefetchers):

  • Average cycle count difference: ~6.56%
  • Best match: multiply benchmark (4.46%)
  • Most accurate model with next-line prefetchers and modified latencies
  • Host execution time: ~197.59% faster in gem5

Cache Configuration Comparison:

| Configuration | Block Size | Prefetcher | Avg Cycle Diff % |
|---------------|------------|------------|------------------|
| Config 1      | 32B        | AMPM       | 8.76%            |
| Config 2      | 64B        | None       | 12.26%           |
| Config 3      | 64B        | Next-Line  | 6.56%            |

Table 6. Cache configuration comparison showing the impact of block size and prefetcher settings on simulation accuracy.
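
These block-size and prefetcher choices map onto a small number of gem5 knobs. The sketch below shows their general shape: AMPMPrefetcher and TaggedPrefetcher are gem5’s prefetcher SimObjects, and we approximate a next-line prefetcher with a degree-1 tagged prefetcher, while the cache sizes, latencies, and the standalone System() object are placeholders rather than our exact configuration.

```python
from m5.objects import System, Cache, AMPMPrefetcher, TaggedPrefetcher

class L1DCache(Cache):
    """Illustrative L1 data cache; sizes and latencies are placeholders."""
    size = "32kB"
    assoc = 8
    tag_latency = 2
    data_latency = 2
    response_latency = 2
    mshrs = 8
    tgts_per_mshr = 20

system = System()                     # stands in for the full system setup

# Config 1: 32-byte blocks (a system-wide knob) with an AMPM prefetcher.
system.cache_line_size = 32
dcache_cfg1 = L1DCache(prefetcher=AMPMPrefetcher())

# Config 3: 64-byte blocks with a next-line-style prefetcher, approximated by
# gem5's TaggedPrefetcher fetching one block ahead. In practice each
# configuration lives in its own script rather than being mixed like this.
system.cache_line_size = 64
dcache_cfg3 = L1DCache(prefetcher=TaggedPrefetcher(degree=1))
```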

Key Findings:

  • Cache block size significantly impacts performance (32B vs 64B)
  • Prefetcher configuration has substantial effect on memory system behavior
  • Host execution time differences consistently ~198% (gem5 much faster than Verilator)
  • Complex benchmarks like CoreMark would take weeks/months on Verilator vs hours on gem5
  • Configuration 3 (64B, next-line prefetcher) provided the best overall accuracy

Detailed Performance Analysis

Our comprehensive performance analysis across multiple benchmarks revealed critical insights into processor behavior and simulation efficiency:

Benchmark Performance Summary:

| Benchmark   | gem5 Cycles | BOOM Cycles | Cycle Diff % | Host Time Diff % |
|-------------|-------------|-------------|--------------|------------------|
| matrix_prog | 56,679      | 45,525      | 21.83%       | 198.78%          |
| median      | 45,970      | 60,254      | 26.89%       | 198.46%          |
| multiply    | 60,776      | 68,277      | 11.62%       | 197.83%          |
| qsort       | 386,717     | 339,421     | 13.03%       | 196.77%          |
| rsort       | 274,268     | 246,044     | 10.85%       | 196.48%          |
| spmv        | 102,578     | 236,561     | 79.01%       | 198.95%          |
| towers      | 26,666      | 42,691      | 46.21%       | 198.58%          |
| vvadd       | 45,281      | 51,664      | 13.17%       | 197.39%          |
| Average     | 124,867     | 136,305     | 27.7%        | 197.51%          |

Table 4. Detailed performance comparison between gem5 and BOOM simulations across riscv-tests benchmarks.

CoreMark Benchmark Analysis:

| Configuration    | gem5 Host Hours | Expected BOOM Hours | Time Ratio |
|------------------|-----------------|---------------------|------------|
| coremark config1 | 19.62           | 3,182.59            | 162.2x     |
| coremark config2 | 30.78           | 4,992.81            | 162.2x     |
| Average          | 25.20           | 4,087.70            | 162.2x     |

Table 5. CoreMark benchmark performance showing dramatic simulation time differences.

Key Performance Insights:

Branch Prediction Performance:

  • BTB Hit Ratio: Average 93.1% across benchmarks, indicating effective branch target prediction
  • Conditional Prediction Accuracy: High accuracy with minimal mispredictions
  • RAS Usage: Effective return address stack utilization with very low incorrect predictions

Cache Performance Analysis:

  • Data Cache Miss Rates: Varied significantly by benchmark (2.6% to 13.1%)
  • Instruction Cache Miss Rates: Generally low (1.6% to 7.2%) indicating good spatial locality
  • L2 Cache Performance: Effective second-level cache with reasonable miss rates

Pipeline Efficiency:

  • Issue Rate: Average 0.81 instructions per cycle across benchmarks
  • CPI (Cycles Per Instruction): Average 1.72, indicating good instruction-level parallelism
  • IPC (Instructions Per Cycle): Average 0.74, showing effective out-of-order execution

Simulation Efficiency:

  • Host Time Difference: Consistent ~198% faster simulation in gem5 vs Verilator
  • Complex Benchmark Handling: CoreMark would require 162x longer simulation time on Verilator
  • Scalability: gem5 demonstrates excellent scalability for large-scale architectural evaluation

Microarchitectural Performance Analysis

Branch Prediction Performance:

  • BTB Hit Ratio: 93.1% average across all configurations, indicating excellent branch target prediction
  • Conditional Prediction Accuracy: High accuracy with minimal mispredictions across benchmarks
  • RAS Usage: Effective return address stack utilization with very low incorrect predictions (<1%)
  • Indirect Branch Handling: Low misprediction rates for indirect branches

Cache Hierarchy Performance:

Data Cache Analysis:

  • Miss Rates: Varied significantly by benchmark (2.6% to 13.1%)
  • Average Miss Latency: 59,547 cycles across configurations
  • MSHR Performance: Effective miss status holding register utilization
  • Prefetcher Impact: AMPM prefetchers showed 8.3% accuracy, 10.3% coverage

Instruction Cache Analysis:

  • Miss Rates: Generally low (1.6% to 7.2%) indicating good spatial locality
  • Average Miss Latency: 78,390 cycles across configurations
  • Prefetcher Effectiveness: Next-line prefetchers showed 28.6% accuracy, 36.1% coverage

L2 Cache Performance:

  • Overall Miss Rate: 28.3% average across configurations
  • Instruction vs Data: Balanced miss rates between instruction and data streams
  • Prefetcher Coverage: 48.8% average coverage across configurations

Pipeline Efficiency Metrics:

  • Issue Rate: 0.81 average instructions per cycle across benchmarks
  • CPI (Cycles Per Instruction): 1.72 average, indicating good instruction-level parallelism
  • IPC (Instructions Per Cycle): 0.74 average, showing effective out-of-order execution
  • Functional Unit Utilization: 0.4% average busy rate, indicating the functional units were far from saturated

Memory System Behavior:

  • Read vs Write Performance: Read requests showed higher miss rates but lower latencies
  • Memory Bandwidth: Effective utilization of available memory bandwidth
  • Cache Coherency: Minimal overhead from cache coherency protocols

Pipeline Optimization Results

Pipeline analysis provided insights into instruction execution efficiency using custom gem5 configurations:

gem5 Configuration Details:

  • Base Model: O3CPU with RISC-V architecture[3]
  • Pipeline Stages: 4-stage fetch + 6 additional stages
  • Branch Predictor: LTAGE with RAS of 32 entries
  • Functional Units: Configured according to BOOM documentation[15]
  • Decode Width: 4-wide to match BOOM’s MegaBoomConfig[7] (see the sketch below)
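
A hedged sketch of how these choices appear in a gem5 script is shown below: decodeWidth and branchPred are real O3CPU parameters and LTAGE is gem5’s predictor class, but the return-address-stack parameter name (RASSize here) depends on the gem5 release, with newer releases configuring the RAS as a separate object.

```python
from m5.objects import RiscvO3CPU, LTAGE   # DerivO3CPU on older gem5 builds

cpu = RiscvO3CPU()
cpu.decodeWidth = 4                        # 4-wide decode per MegaBoomConfig

# LTAGE predictor with a 32-entry return address stack. RASSize is the
# predictor parameter on the releases we targeted; newer gem5 versions size
# the RAS through a separate ReturnAddrStack object instead.
cpu.branchPred = LTAGE(RASSize=32)
```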

Key Pipeline Optimizations:

  • Instruction-Level Parallelism: Measured effectiveness of out-of-order execution across benchmarks
  • Branch Prediction: Evaluated TAGE predictor accuracy and impact on performance
  • Pipeline Stalls: Analyzed causes and frequency using custom matrix computation benchmarks
  • Resource Utilization: Studied efficiency of different pipeline resources

Performance Comparison Results:

  • Simulation Speed: gem5 consistently ~198% faster than Verilator under Chipyard
  • Cycle Accuracy: Achieved 4.46% difference for multiply benchmark in best configuration
  • Complex Benchmark Handling: CoreMark would require 100s of days on Verilator vs hours on gem5

The pipeline analysis helped identify bottlenecks and optimization opportunities in the processor design while demonstrating gem5’s superior simulation performance.

Discussion and Limitations

Our evaluation revealed several important findings and limitations:

Key Findings:

  • gem5 Superiority: gem5 demonstrates enormous capability in terms of statistical breadth compared to Chipyard’s Verilator simulation framework
  • Simulation Speed: gem5 is consistently ~198% faster than Verilator under Chipyard for host execution time
  • Configuration Flexibility: gem5 provides extensive configurability and automatic tracking of statistics
  • Cycle Accuracy: Achieved 4.46% cycle count difference for multiply benchmark in best configuration

Technical Challenges:

  • Incomplete BOOM Documentation: Limited information on exact latencies and composition of functional units for BOOM parameterizations[15]
  • Configuration Complexity: Required assumptions for gem5 configuration knobs not defined in BOOM documentation
  • Chisel/Scala Knowledge Gap: Better understanding of Chisel and Scala would have aided in deciphering hardware unit properties
  • Statistics Limitations: Lack of detailed statistics from Chipyard made refining configurations more difficult

Limitations:

  • CoreMark Execution: Unable to complete CoreMark benchmark on Chipyard due to complexity (would take weeks/months)
  • Configuration Accuracy: Difficult to achieve one-to-one cycle count match between gem5 and Chipyard simulations
  • Hardware Resource Constraints: Limited by Intel Xeon Gold 6140 CPU with 4 cores and 8GB RAM
  • Time Constraints: Limited time for refining and tuning gem5 configuration for optimal accuracy

Methodological Considerations:

  • Statistical Significance: Results were carefully evaluated across multiple configurations
  • Simulation Parameters: Validated against known benchmarks and BOOM documentation
  • Result Reproducibility: Ensured through careful experimental design and GitHub repository

Conclusion and Future Work

Our evaluation of gem5’s RISC-V ISA models using the BOOM processor provided valuable insights into modern processor architecture design and evaluation methodologies. The project successfully demonstrated the effectiveness of the gem5 framework for architectural research and highlighted important trade-offs in processor design.

Key Contributions:

  • Comprehensive Evaluation Methodology: Developed systematic approach for comparing gem5 vs Chipyard/Verilator simulations
  • Performance Analysis: Detailed analysis of cache configurations, prefetching mechanisms, and pipeline optimizations
  • Simulation Framework: Established reproducible methodology using riscv-tests and CoreMark benchmarks
  • Technical Infrastructure: Created GitHub repository with custom gem5 configurations and results

Key Conclusions:

  • gem5 Superiority: gem5 demonstrates enormous capability in terms of statistical breadth and simulation speed
  • Simulation Efficiency: gem5 is consistently ~198% faster than Verilator under Chipyard
  • Configuration Flexibility: gem5 provides extensive configurability and automatic tracking of statistics
  • Research Viability: gem5 is highly capable of accurately modeling hardware-level implementations of RISC-V architectures[3][4]

Future Work:

  • VCD Analysis: Use generated VCD (waveform dump files) from Verilator to reverse engineer processor models
  • Micro-architectural Event Tracking: Develop comprehensive framework to generate more performance statistics within Chipyard
  • Configuration Refinement: Given more time, refine and tune gem5 configuration for optimal accuracy
  • Additional Benchmarks: Run benchmarks that stress each functional unit and specific pipeline parts
  • Architectural Knowledge: Develop deeper understanding of Chisel HDL to streamline configuration process

The project established that properly configured simulators like gem5 are appropriate for modeling complex processor architectures in detail and could become the de facto standard for simulation across implementation levels.

References

[1] A. Akram and L. Sawalha, "A Comparison of x86 Computer Architecture Simulators," 2016.

[2] F. A. Endo, D. Couroussé and H.-P. Charles, "Micro-architectural simulation of in-order and out-of-order ARM microprocessors with gem5," in 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014.

[3] A. Roelke and M. R. Stan, "RISC5: Implementing the RISC-V ISA in gem5," 2017.

[4] P. Y. H. Hin, X. Liao, J. Cui, A. Mondelli, T. M. Somu and N. Zhang, "Supporting RISC-V Full System Simulation in gem5," in Proceedings of Computer Architecture Research with RISC-V (CARRV 2021), 2021.

[5] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, M. Andreozzi, A. Armejach, N. Asmussen, B. Beckmann, S. Bharadwaj, G. Black, G. Bloom, B. R. Bruce, D. R. Carvalho, J. Castrillon, L. Chen, N. Derumigny, S. Diestelhorst, W. Elsasser, C. Escuin, M. Fariborz, A. Farmahini-Farahani, P. Fotouhi, R. Gambord, J. Gandhi, D. Gope, T. Grass, A. Gutierrez, B. Hanindhito, A. Hansson, S. Haria, A. Harris, T. Hayes, A. Herrera, M. Horsnell, S. A. R. Jafri, R. Jagtap, H. Jang, R. Jeyapaul, T. M. Jones, M. Jung, S. Kannoth, H. Khaleghzadeh, Y. Kodama, T. Krishna, T. Marinelli, C. Menard, A. Mondelli, M. Moreto, T. Mück, O. Naji, K. Nathella, H. Nguyen, N. Nikoleris, L. E. Olson, M. Orr, B. Pham, P. Prieto, T. Reddy, A. Roelke, M. Samani, A. Sandberg, J. Setoain, B. Shingarov, M. D. Sinclair, T. Ta, R. Thakur, G. Travaglini, M. Upton, N. Vaish, I. Vougioukas, W. Wang, Z. Wang, N. Wehn, C. Weis, D. A. Wood, H. Yoon and É. F. Zulian, The gem5 Simulator: Version 20.0+, arXiv, 2020.

[6] W. Heirman, T. E. Carlson and L. Eeckhout, "Sniper: scalable and accurate parallel multi-core simulation," 2012.

[7] J. Zhao, "SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine," 2020.

[8] C. Celio, D. A. Patterson and K. Asanović, "The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor," 2015.

[9] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi and E. Rotenberg, "FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template," in 2011 38th Annual International Symposium on Computer Architecture (ISCA), 2011.

[10] B. H. Dwiel, N. K. Choudhary and E. Rotenberg, "FPGA modeling of diverse superscalar processors," in 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012.

[11] S. Mashimo, A. Fujita, R. Matsuo, S. Akaki, A. Fukuda, T. Koizumi, J. Kadomoto, H. Irie, M. Goshima, K. Inoue and R. Shioya, "An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor," in 2019 International Conference on Field-Programmable Technology (ICFPT), 2019.

[12] S. Zhang, A. Wright, T. Bourgeat and A. Arvind, "Composable Building Blocks to Open up Processor Design," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.

[13] C. Celio, P.-F. Chiu, B. Nikolic, D. A. Patterson and K. Asanović, "BOOM v2: an open-source out-of-order RISC-V core," 2017.

[14] C. Atwell, "RISC-V Serves Up Open-Source Possibilities for the Future," ElectronicDesign, 12-Jul-2022. [Online]. Available: https://www.electronicdesign.com/technologies/embedded-revolution/article/21246374/electronic-design-riscv-serves-up-opensource-possibilities-for-the-future. [Accessed: 06-Nov-2022].

[15] "Welcome to RISCV-Boom's Documentation!" RISCV-BOOM, https://docs.boomcore.org/en/latest/index.html.


This project was completed as part of CPRE 581 (Advanced Computer Architecture) at Iowa State University.