gem5 RISC-V BOOM Processor Evaluation

Evaluating gem5's RISC-V ISA models using the BOOM processor

As part of a team project, we conducted a comprehensive evaluation of gem5’s RISC-V ISA models using the Berkeley Out-of-Order Machine (BOOM) processor. This research project focused on analyzing the performance characteristics, architectural trade-offs, and simulation accuracy of different processor configurations within the gem5 architectural simulator framework.

The project employed advanced computer architecture simulation techniques, performance benchmarking methodologies, and statistical analysis to evaluate processor designs across multiple workload scenarios. By utilizing the BOOM processor as our primary evaluation platform, we gained insights into out-of-order execution, pipeline optimization, and the impact of various architectural parameters on overall system performance.

Our primary goal was to investigate whether computer architecture simulation tools like gem5 are sufficient for determining the power, performance, and area (PPA) effects of custom processors, or whether the effort of developing a concrete processor core is necessary. We compared gem5’s simulation capabilities against the Chipyard framework and Verilator for BOOM processor modeling.


Table of Contents

  1. Main Goals
  2. Technical Background
  3. Experimental Platform
  4. Methodology
  5. Implementation and Results
  6. Discussion and Limitations
  7. Conclusion and Future Work
  8. References

Main Goals

  1. Evaluate gem5’s RISC-V ISA models using BOOM processor
    • Analyze performance characteristics of different processor configurations
    • Assess simulation accuracy and reliability of gem5 framework compared to Chipyard/Verilator
    • Compare architectural trade-offs in out-of-order execution

  2. Implement comprehensive benchmarking methodology
    • Design systematic evaluation framework for processor performance using riscv-tests and CoreMark
    • Develop statistical analysis techniques for simulation results
    • Create reproducible methodology for architectural research

  3. Analyze cache and memory system performance
    • Evaluate different cache configurations (32B vs 64B block sizes) and their impact on performance
    • Study memory hierarchy optimization strategies including prefetching mechanisms
    • Assess the effectiveness of different prefetcher configurations (AMPM, next-line, tagged)

  4. Investigate pipeline optimization techniques
    • Analyze out-of-order execution efficiency using custom matrix computation benchmarks
    • Evaluate branch prediction accuracy using TAGE predictor configurations
    • Study instruction-level parallelism and its limitations in 4-wide decode configurations

Technical Background

gem5 Architectural Simulator

The gem5 simulator is a modular, open-source computer architecture simulator that provides a flexible framework for evaluating different processor designs. It supports multiple instruction set architectures (ISAs) including RISC-V, ARM, x86, and others.[5] The simulator consists of several key components:

  • CPU Models: Various CPU models including simple, in-order, and out-of-order processors
  • Memory System: Configurable cache hierarchies and memory controllers
  • Interconnect: Flexible network-on-chip and bus models
  • I/O Systems: Support for various I/O devices and interfaces

gem5’s modular design allows researchers to easily configure different architectural parameters and evaluate their impact on performance, power consumption, and other metrics.
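
To make this modularity concrete, the minimal sketch below (gem5 configuration scripts are written in Python) wires an out-of-order RISC-V CPU, a system crossbar, and a DRAM controller into an SE-mode system. It is an illustrative skeleton rather than our project configuration: the binary path is a placeholder, and names such as RiscvO3CPU and cpu_side_ports assume a recent gem5 release (older releases use DerivO3CPU and master/slave ports).

```python
# minimal_riscv_se.py -- illustrative gem5 SE-mode skeleton (not our full setup)
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="1GHz", voltage_domain=VoltageDomain())
system.mem_mode = "timing"                    # use the timing memory model
system.mem_ranges = [AddrRange("512MB")]
system.cache_line_size = 64                   # block size in bytes

system.cpu = RiscvO3CPU()                     # out-of-order CPU (BaseO3CPU model)
system.membus = SystemXBar()                  # system interconnect

# No caches in this skeleton: CPU ports connect straight to the memory bus.
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports
system.cpu.createInterruptController()
system.system_port = system.membus.cpu_side_ports

# Single-channel DDR3 memory controller behind the crossbar.
system.mem_ctrl = MemCtrl(dram=DDR3_1600_8x8(range=system.mem_ranges[0]))
system.mem_ctrl.port = system.membus.mem_side_ports

# Statically linked RISC-V binary to execute (path is a placeholder).
binary = "tests/test-progs/hello/bin/riscv/linux/hello"
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = Process(cmd=[binary])
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
print("Exited @ tick %d: %s" % (m5.curTick(), m5.simulate().getCause()))
```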

BOOM Processor Architecture

The Berkeley Out-of-Order Machine (BOOM) is a high-performance, out-of-order RISC-V processor implementation designed for research and education. Key architectural features include:

  • Out-of-Order Execution: Dynamic instruction scheduling for improved performance
  • Speculative Execution: Branch prediction and speculative instruction execution
  • Register Renaming: Eliminates false dependencies and enables better instruction-level parallelism
  • Load/Store Queue: Manages memory operations and maintains memory consistency
  • Reorder Buffer: Ensures correct program execution order

SonicBOOM (BOOMv3) Configuration:

| Parameter                     | Value |
|-------------------------------|-------|
| Fetch Width                   | 8     |
| Decode Width                  | 4     |
| ROB Entries                   | 128   |
| Integer Registers             | 128   |
| Floating-Point (FP) Registers | 128   |
| Load Queue (LDQ) Size         | 32    |
| Store Queue (STQ) Size        | 32    |
| Maximum Number of Branches    | 20    |
| Fetch Buffer Entry Size       | 32    |
| Prefetching                   | TRUE  |

Table 1. SonicBOOM (MegaBoomConfig) Tile Parameters.

BOOM’s design provides a realistic baseline for evaluating modern processor architectures and understanding the complexities of out-of-order execution.[7]
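
To relate Table 1 to gem5, the sketch below sets the BaseO3CPU parameters that map most directly onto the MegaBoomConfig tile. The parameter names (fetchWidth, numROBEntries, LQEntries, and so on) are real O3CPU knobs, but the mapping is our approximation: some entries, such as the maximum number of in-flight branches, have no direct O3CPU equivalent, and the issue width shown is assumed rather than taken from Table 1.

```python
from m5.objects import RiscvO3CPU   # DerivO3CPU on older gem5 builds

cpu = RiscvO3CPU()

# Front end: Table 1 lists an 8-wide fetch and 32-byte fetch buffer entries.
cpu.fetchWidth = 8
cpu.fetchBufferSize = 32            # bytes per fetch buffer entry

# 4-wide decode through commit, matching MegaBoomConfig's decode width.
cpu.decodeWidth = 4
cpu.renameWidth = 4
cpu.dispatchWidth = 4
cpu.issueWidth = 4                  # assumption: BOOM issues from split queues
cpu.commitWidth = 4

# Back-end structures taken directly from Table 1.
cpu.numROBEntries = 128
cpu.numPhysIntRegs = 128
cpu.numPhysFloatRegs = 128
cpu.LQEntries = 32                  # load queue entries
cpu.SQEntries = 32                  # store queue entries
```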

Figure 1: BOOM version 3 (SonicBOOM) High-level Architecture showing the superscalar out-of-order pipeline design with fetch, decode, rename, dispatch, execute, and commit stages.
Figure 2: Architecture for the Alpha 21264 processor that the BaseO3CPU in gem5 follows closely, providing the foundation for our out-of-order processor modeling.

RISC-V ISA Models

RISC-V is an open-source instruction set architecture that provides a foundation for processor design and evaluation.[14] The ISA includes:

  • Base Integer Instructions: Core computational and control flow instructions
  • Floating-Point Extensions: Support for single and double precision operations
  • Vector Extensions: SIMD operations for data-parallel workloads
  • Privileged Architecture: Support for different privilege levels and virtualization

The modular nature of RISC-V allows for flexible processor implementations and enables detailed analysis of different architectural decisions.

| Feature      | SonicBOOM | RiscyOO  | RSD           | FabScalar                   |
|--------------|-----------|----------|---------------|-----------------------------|
| ISA          | RV64GC    | RV64G    | RV32IM        | PISA (subset)               |
| DMIPS/MHz    | 3.98      | ?        | 2.04          | ?                           |
| SPEC2006 IPC | 0.86      | 0.48     | N/A           | ?                           |
| Decode Width | 1-5       | 2        | 2             | 2-5                         |
| Mem Width    | 16b/cycle | 8b/cycle | 4b/cycle      | 4-8b/cycle                  |
| HDL          | Chisel3   | BSV+CMD  | SystemVerilog | FabScalar toolset with CPSL |

Table 2. Comparison of alternative academic or open-source out-of-order processors.

| Component                       | Present in BaseO3CPU? | Comments                                               |
|---------------------------------|-----------------------|--------------------------------------------------------|
| TAGE-L Branch Predictor         |                       |                                                        |
| MEM Issue Queue                 |                       |                                                        |
| L1/L2 Caches                    |                       | Dual-ported data cache present; needs to be expanded.  |
| L0 and Dense L1 BTB             |                       |                                                        |
| Instruction Fetch and PreDecode |                       | Needs to be integrated into a 16-byte window.          |
| RAR ReOrder Buffer              |                       |                                                        |
| Decoder Blocks                  |                       | Single decoder needs to be expanded to 4-wide.         |
| Execution Units                 |                       | Requires multithreading support; needs an AGU.         |

Table 3. Significant BOOM hardware modules required for modeling in gem5.

Experimental Platform

Hardware Configuration

Our experimental platform consisted of high-performance computing resources capable of running complex architectural simulations:

  • Multi-core CPU: High-performance processor for simulation execution
  • Large Memory Capacity: Sufficient RAM for complex simulation workloads
  • High-speed Storage: Fast storage for simulation data and results
  • Network Connectivity: Access to distributed computing resources

The hardware configuration was designed to support long-running simulations and handle the computational demands of architectural evaluation.

Software Framework

The software framework included several key components:

  • gem5 Simulator: Latest stable release with RISC-V and BOOM support
  • RISC-V Toolchain: Complete development environment for RISC-V applications
  • Benchmark Suite: Standardized workloads for performance evaluation
  • Analysis Tools: Custom scripts for data processing and visualization
  • Statistical Software: Tools for statistical analysis and result validation

The software stack was carefully configured to ensure reproducibility and enable detailed analysis of simulation results.

Methodology

Benchmark Selection

We selected a comprehensive set of benchmarks to evaluate different aspects of processor performance:

Custom C Matrix Computation Benchmark:

  • Computes determinant
  • Computes sum
  • Evaluates if identity matrix

riscv-tests Repository Benchmarks:

  • 1D 3 element median filter (median)
  • Two input stream multiplication (multiply)
  • QuickSort (qsort)
  • Reverse QuickSort (rsort)
  • Sparse matrix-vector multiplication (spmv)
  • Towers of Hanoi (towers)
  • Vector-vector add (vvadd)

riscv-coremark Repository:

  • Ultra-low power and IoT tests
  • Heterogeneous Compute tests
  • Single-core Performance tests
  • Multi-core Performance tests
  • Phone and Tablet tests

These benchmarks were chosen because they exercise general-purpose compute capability and stress-test the functional units within the BOOM processor.
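
A simple way to drive these workloads through gem5 is a host-side batch script that launches one simulation per benchmark binary and keeps each run’s statistics separate, as sketched below. The gem5 binary path, the classic se.py example script, the --cpu-type value, and the riscv-tests binary names are assumptions about the local checkout, and bare-metal riscv-tests builds may need a different workload setup than shown here.

```python
#!/usr/bin/env python3
"""Host-side batch runner: one gem5 SE-mode run per riscv-tests benchmark,
with each run's stats.txt kept in its own output directory. Paths, flags,
and binary names are assumptions about the local checkout."""
import subprocess
from pathlib import Path

GEM5 = "build/RISCV/gem5.opt"               # assumed gem5 binary location
SE_SCRIPT = "configs/example/se.py"         # classic SE-mode example script
BENCH_DIR = Path("riscv-tests/benchmarks")  # assumed prebuilt *.riscv binaries
BENCHMARKS = ["median", "multiply", "qsort", "rsort", "spmv", "towers", "vvadd"]

for bench in BENCHMARKS:
    outdir = Path("results") / bench
    outdir.mkdir(parents=True, exist_ok=True)
    cmd = [
        GEM5, f"--outdir={outdir}", SE_SCRIPT,
        "--cpu-type=RiscvO3CPU",            # DerivO3CPU on older gem5 releases
        "--caches", "--l2cache",            # enable the classic cache hierarchy
        f"--cmd={BENCH_DIR / (bench + '.riscv')}",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)         # stats land in <outdir>/stats.txt
```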

Performance Metrics

Our evaluation focused on several key performance metrics:

  • Instructions Per Cycle (IPC): Primary measure of processor efficiency
  • Cache Miss Rates: Evaluation of memory system performance
  • Branch Prediction Accuracy: Assessment of control flow prediction
  • Memory Bandwidth: Analysis of memory system utilization
  • Power Consumption: Energy efficiency considerations

These metrics provided a comprehensive view of processor performance and architectural trade-offs.
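
Most of these metrics come directly out of gem5’s stats.txt output. The parser sketch below shows the style of extraction we relied on; the exact stat keys (system.cpu.ipc, overallMissRate versus overall_miss_rate, and so on) differ between gem5 versions, so the patterns are assumptions to adjust for a given build, and power is not reported by gem5’s default statistics.

```python
import re
from pathlib import Path

# Stat keys of interest. Exact spellings differ between gem5 versions
# (e.g. overallMissRate::total vs overall_miss_rate::total), so treat these
# regular expressions as templates to adapt for the build in use.
PATTERNS = {
    "ipc": re.compile(r"^system\.cpu\.ipc\s+([\d.]+)"),
    "cpi": re.compile(r"^system\.cpu\.cpi\s+([\d.]+)"),
    "cycles": re.compile(r"^system\.cpu\.numCycles\s+(\d+)"),
    "dcache_miss_rate": re.compile(
        r"^system\.cpu\.dcache\.(?:overallMissRate|overall_miss_rate)::total\s+([\d.]+)"),
    "icache_miss_rate": re.compile(
        r"^system\.cpu\.icache\.(?:overallMissRate|overall_miss_rate)::total\s+([\d.]+)"),
}

def parse_stats(stats_file: Path) -> dict:
    """Return the first value found for each metric in a gem5 stats.txt."""
    results = {}
    for line in stats_file.read_text().splitlines():
        for name, pattern in PATTERNS.items():
            if name not in results:
                match = pattern.match(line)
                if match:
                    results[name] = float(match.group(1))
    return results

if __name__ == "__main__":
    # Assumes a results/<benchmark>/stats.txt directory layout.
    for stats in sorted(Path("results").glob("*/stats.txt")):
        print(stats.parent.name, parse_stats(stats))
```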

Statistical Analysis

We employed rigorous statistical methods to ensure the reliability of our results:

  • Multiple Runs: Each configuration was evaluated across multiple simulation runs
  • Confidence Intervals: Statistical analysis to quantify result uncertainty
  • Outlier Detection: Identification and handling of anomalous results
  • Correlation Analysis: Understanding relationships between different metrics

The statistical framework ensured that our conclusions were based on reliable and reproducible data.
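
To illustrate the flavor of this analysis, the sketch below computes a mean, a normal-approximation 95% confidence interval, and a simple k-sigma outlier check over repeated cycle counts; the thresholds and the sample values are illustrative choices, not our exact procedure.

```python
import statistics

def summarize(samples, z=1.96):
    """Mean, sample standard deviation, and an approximate 95% confidence
    interval (normal approximation; adequate for quick run-to-run comparisons)."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples) if len(samples) > 1 else 0.0
    half_width = z * sd / (len(samples) ** 0.5)
    return mean, sd, (mean - half_width, mean + half_width)

def flag_outliers(samples, k=3.0):
    """Simple k-sigma rule for spotting anomalous simulation runs."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return [x for x in samples if sd and abs(x - mean) > k * sd]

# Illustrative cycle counts from repeated runs of one configuration.
cycles = [56679, 56712, 56640, 56688, 61502]
mean, sd, ci = summarize(cycles)
print(f"mean={mean:.0f}  sd={sd:.0f}  95% CI=({ci[0]:.0f}, {ci[1]:.0f})")
print("outliers:", flag_outliers(cycles))
```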

Implementation and Results

Implementation Progress

We successfully set up the Chipyard framework in our local environment and created RISC-V configurations for gem5, along with a repository tracking our progress and gem5 outputs. The RISC-V GNU Toolchain was set up on an ISU ETG virtual machine to allow compilation of programs to run on the BOOM core.

Key Implementation Achievements:

  • Created custom gem5 configuration scripts in Python based on fs_linux.py and riscv-ubuntu-run.py[4]
  • Successfully tested static compilation of matrix computation programs to RISC-V
  • Configured BOOM processor under Chipyard framework using “MegaBoomConfig”[7]
  • Established GitHub repository for configuration, test, and result files

Technical Challenges Overcome:

  • Resolved lockfile issues by switching to WSL environment
  • Fixed JVM memory allocation challenges in Chipyard framework
  • Solved gem5 segmentation faults by removing HiFive() platform from system configuration
  • Corrected memory size validation errors by properly configuring IO and memory buses with Bridge components (see the sketch below)
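
The snippet below is a reduced sketch of that bus-and-bridge fix. Bridge, SystemXBar, and IOXBar are gem5 SimObjects, and the port names follow recent gem5 releases; the device address range and delay are illustrative, since full-system configs normally take them from the platform object.

```python
from m5.objects import System, SystemXBar, IOXBar, Bridge, AddrRange

system = System()               # stands in for the full-system setup
system.membus = SystemXBar()    # memory-side crossbar
system.iobus = IOXBar()         # I/O-side crossbar

# Bridge from the memory bus toward the I/O bus, restricted to the device
# address window so device ranges are not validated as DRAM. The base address
# and size here are illustrative placeholders.
system.bridge = Bridge(delay="50ns",
                       ranges=[AddrRange(0x10000000, size="256MB")])
system.bridge.cpu_side_port = system.membus.mem_side_ports
system.bridge.mem_side_port = system.iobus.cpu_side_ports
```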

CoreMark Benchmark Analysis

Our CoreMark benchmark analysis revealed several key insights into processor performance:

  • Baseline Performance: Established performance baseline for BOOM processor
  • Configuration Impact: Evaluated the effect of different architectural parameters
  • Scalability Analysis: Studied performance scaling with different workload sizes
  • Optimization Opportunities: Identified areas for architectural improvement

The CoreMark results provided a standardized comparison point for evaluating processor efficiency and identifying optimization opportunities.

Cache Performance Evaluation

Cache performance analysis focused on understanding memory system behavior across three different gem5 configurations:

Configuration 1 (32B Block Size with AMPM Prefetchers):

  • Average cycle count difference: ~8.76%
  • Closest matches: rsort (10.85%) and multiply (11.62%)
  • Largest differences: spmv (79.01%) and towers (46.21%)
  • Host execution time: ~197.51% faster in gem5

Configuration 2 (64B Block Size, No iCache Prefetcher):

  • Average cycle count difference: ~12.26%
  • Improved performance for rsort (8.34%) and qsort (10.28%)
  • Maintained similar host execution time differences (~197%)
  • Best accuracy for several individual benchmarks, despite a higher average difference

Configuration 3 (64B Block Size with Next-Line Prefetchers):

  • Average cycle count difference: ~6.56%
  • Best match: multiply benchmark (4.46%)
  • Most accurate model with next-line prefetchers and modified latencies
  • Host execution time: ~197.59% faster in gem5

Cache Configuration Comparison:

| Configuration | Block Size | Prefetcher | Avg Cycle Diff % |
|---------------|------------|------------|------------------|
| Config 1      | 32B        | AMPM       | 8.76%            |
| Config 2      | 64B        | None       | 12.26%           |
| Config 3      | 64B        | Next-Line  | 6.56%            |

Table 6. Cache configuration comparison showing the impact of block size and prefetcher settings on simulation accuracy.
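
These block-size and prefetcher choices map onto a small number of gem5 knobs. The sketch below shows their general shape: AMPMPrefetcher and TaggedPrefetcher are gem5’s prefetcher SimObjects, and we approximate a next-line prefetcher with a degree-1 tagged prefetcher, while the cache sizes, latencies, and the standalone System() object are placeholders rather than our exact configuration.

```python
from m5.objects import System, Cache, AMPMPrefetcher, TaggedPrefetcher

class L1DCache(Cache):
    """Illustrative L1 data cache; sizes and latencies are placeholders."""
    size = "32kB"
    assoc = 8
    tag_latency = 2
    data_latency = 2
    response_latency = 2
    mshrs = 8
    tgts_per_mshr = 20

system = System()                     # stands in for the full system setup

# Config 1: 32-byte blocks (a system-wide knob) with an AMPM prefetcher.
system.cache_line_size = 32
dcache_cfg1 = L1DCache(prefetcher=AMPMPrefetcher())

# Config 3: 64-byte blocks with a next-line-style prefetcher, approximated by
# gem5's TaggedPrefetcher fetching one block ahead. In practice each
# configuration lives in its own script rather than being mixed like this.
system.cache_line_size = 64
dcache_cfg3 = L1DCache(prefetcher=TaggedPrefetcher(degree=1))
```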

Key Findings:

  • Cache block size significantly impacts performance (32B vs 64B)
  • Prefetcher configuration has substantial effect on memory system behavior
  • Host execution time differences consistently ~198% (gem5 much faster than Verilator)
  • Complex benchmarks like CoreMark would take weeks/months on Verilator vs hours on gem5
  • Configuration 3 (64B, next-line prefetcher) provided the best overall accuracy

Detailed Performance Analysis

Our comprehensive performance analysis across multiple benchmarks revealed critical insights into processor behavior and simulation efficiency:

Benchmark Performance Summary:

| Benchmark   | gem5 Cycles | BOOM Cycles | Cycle Diff % | Host Time Diff % |
|-------------|-------------|-------------|--------------|------------------|
| matrix_prog | 56,679      | 45,525      | 21.83%       | 198.78%          |
| median      | 45,970      | 60,254      | 26.89%       | 198.46%          |
| multiply    | 60,776      | 68,277      | 11.62%       | 197.83%          |
| qsort       | 386,717     | 339,421     | 13.03%       | 196.77%          |
| rsort       | 274,268     | 246,044     | 10.85%       | 196.48%          |
| spmv        | 102,578     | 236,561     | 79.01%       | 198.95%          |
| towers      | 26,666      | 42,691      | 46.21%       | 198.58%          |
| vvadd       | 45,281      | 51,664      | 13.17%       | 197.39%          |
| Average     | 124,867     | 136,305     | 27.7%        | 197.51%          |

Table 4. Detailed performance comparison between gem5 and BOOM simulations across riscv-tests benchmarks.

CoreMark Benchmark Analysis:

| Configuration    | gem5 Host Hours | Expected BOOM Hours | Time Ratio |
|------------------|-----------------|---------------------|------------|
| coremark config1 | 19.62           | 3,182.59            | 162.2x     |
| coremark config2 | 30.78           | 4,992.81            | 162.2x     |
| Average          | 25.20           | 4,087.70            | 162.2x     |

Table 5. CoreMark benchmark performance showing dramatic simulation time differences.

Key Performance Insights:

Branch Prediction Performance:

  • BTB Hit Ratio: Average 93.1% across benchmarks, indicating effective branch target prediction
  • Conditional Prediction Accuracy: High accuracy with minimal mispredictions
  • RAS Usage: Effective return address stack utilization with very low incorrect predictions

Cache Performance Analysis:

  • Data Cache Miss Rates: Varied significantly by benchmark (2.6% to 13.1%)
  • Instruction Cache Miss Rates: Generally low (1.6% to 7.2%) indicating good spatial locality
  • L2 Cache Performance: Effective second-level cache with reasonable miss rates

Pipeline Efficiency:

  • Issue Rate: Average 0.81 instructions per cycle across benchmarks
  • CPI (Cycles Per Instruction): Average 1.72, indicating good instruction-level parallelism
  • IPC (Instructions Per Cycle): Average 0.74, showing effective out-of-order execution

Simulation Efficiency:

  • Host Time Difference: Consistent ~198% faster simulation in gem5 vs Verilator
  • Complex Benchmark Handling: CoreMark would require 162x longer simulation time on Verilator
  • Scalability: gem5 demonstrates excellent scalability for large-scale architectural evaluation

Microarchitectural Performance Analysis

Branch Prediction Performance:

  • BTB Hit Ratio: 93.1% average across all configurations, indicating excellent branch target prediction
  • Conditional Prediction Accuracy: High accuracy with minimal mispredictions across benchmarks
  • RAS Usage: Effective return address stack utilization with very low incorrect predictions (<1%)
  • Indirect Branch Handling: Low misprediction rates for indirect branches

Cache Hierarchy Performance:

Data Cache Analysis:

  • Miss Rates: Varied significantly by benchmark (2.6% to 13.1%)
  • Average Miss Latency: 59,547 cycles across configurations
  • MSHR Performance: Effective miss status holding register utilization
  • Prefetcher Impact: AMPM prefetchers showed 8.3% accuracy, 10.3% coverage

Instruction Cache Analysis:

  • Miss Rates: Generally low (1.6% to 7.2%) indicating good spatial locality
  • Average Miss Latency: 78,390 cycles across configurations
  • Prefetcher Effectiveness: Next-line prefetchers showed 28.6% accuracy, 36.1% coverage

L2 Cache Performance:

  • Overall Miss Rate: 28.3% average across configurations
  • Instruction vs Data: Balanced miss rates between instruction and data streams
  • Prefetcher Coverage: 48.8% average coverage across configurations

Pipeline Efficiency Metrics:

  • Issue Rate: 0.81 average instructions per cycle across benchmarks
  • CPI (Cycles Per Instruction): 1.72 average, indicating good instruction-level parallelism
  • IPC (Instructions Per Cycle): 0.74 average, showing effective out-of-order execution
  • Functional Unit Utilization: 0.4% average busy rate, indicating the functional units were far from saturated

Memory System Behavior:

  • Read vs Write Performance: Read requests showed higher miss rates but lower latencies
  • Memory Bandwidth: Effective utilization of available memory bandwidth
  • Cache Coherency: Minimal overhead from cache coherency protocols

Pipeline Optimization Results

Pipeline analysis provided insights into instruction execution efficiency using custom gem5 configurations:

gem5 Configuration Details:

  • Base Model: O3CPU with RISC-V architecture[3]
  • Pipeline Stages: 4-stage fetch + 6 additional stages
  • Branch Predictor: LTAGE with RAS of 32 entries
  • Functional Units: Configured according to BOOM documentation[15]
  • Decode Width: 4-wide to match BOOM’s MegaBoomConfig[7] (see the sketch below)
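
A hedged sketch of how these choices appear in a gem5 script is shown below: decodeWidth and branchPred are real O3CPU parameters and LTAGE is gem5’s predictor class, but the return-address-stack parameter name (RASSize here) depends on the gem5 release, with newer releases configuring the RAS as a separate object.

```python
from m5.objects import RiscvO3CPU, LTAGE   # DerivO3CPU on older gem5 builds

cpu = RiscvO3CPU()
cpu.decodeWidth = 4                        # 4-wide decode per MegaBoomConfig

# LTAGE predictor with a 32-entry return address stack. RASSize is the
# predictor parameter on the releases we targeted; newer gem5 versions size
# the RAS through a separate ReturnAddrStack object instead.
cpu.branchPred = LTAGE(RASSize=32)
```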

Key Pipeline Optimizations:

  • Instruction-Level Parallelism: Measured effectiveness of out-of-order execution across benchmarks
  • Branch Prediction: Evaluated TAGE predictor accuracy and impact on performance
  • Pipeline Stalls: Analyzed causes and frequency using custom matrix computation benchmarks
  • Resource Utilization: Studied efficiency of different pipeline resources

Performance Comparison Results:

  • Simulation Speed: gem5 consistently ~198% faster than Verilator under Chipyard
  • Cycle Accuracy: Achieved 4.46% difference for multiply benchmark in best configuration
  • Complex Benchmark Handling: CoreMark would require 100s of days on Verilator vs hours on gem5

The pipeline analysis helped identify bottlenecks and optimization opportunities in the processor design while demonstrating gem5’s superior simulation performance.

Discussion and Limitations

Our evaluation revealed several important findings and limitations:

Key Findings:

  • gem5 Superiority: gem5 demonstrates enormous capability in terms of statistical breadth compared to Chipyard’s Verilator simulation framework
  • Simulation Speed: gem5 is consistently ~198% faster than Verilator under Chipyard for host execution time
  • Configuration Flexibility: gem5 provides extensive configurability and automatic tracking of statistics
  • Cycle Accuracy: Achieved 4.46% cycle count difference for multiply benchmark in best configuration

Technical Challenges:

  • Incomplete BOOM Documentation: Limited information on exact latencies and composition of functional units for BOOM parameterizations[15]
  • Configuration Complexity: Required assumptions for gem5 configuration knobs not defined in BOOM documentation
  • Chisel/Scala Knowledge Gap: Better understanding of Chisel and Scala would have aided in deciphering hardware unit properties
  • Statistics Limitations: Lack of detailed statistics from Chipyard made refining configurations more difficult

Limitations:

  • CoreMark Execution: Unable to complete CoreMark benchmark on Chipyard due to complexity (would take weeks/months)
  • Configuration Accuracy: Difficult to achieve one-to-one cycle count match between gem5 and Chipyard simulations
  • Hardware Resource Constraints: Limited by Intel Xeon Gold 6140 CPU with 4 cores and 8GB RAM
  • Time Constraints: Limited time for refining and tuning gem5 configuration for optimal accuracy

Methodological Considerations:

  • Statistical Significance: Results were carefully evaluated across multiple configurations
  • Simulation Parameters: Validated against known benchmarks and BOOM documentation
  • Result Reproducibility: Ensured through careful experimental design and GitHub repository

Conclusion and Future Work

Our evaluation of gem5’s RISC-V ISA models using the BOOM processor provided valuable insights into modern processor architecture design and evaluation methodologies. The project successfully demonstrated the effectiveness of the gem5 framework for architectural research and highlighted important trade-offs in processor design.

Key Contributions:

  • Comprehensive Evaluation Methodology: Developed systematic approach for comparing gem5 vs Chipyard/Verilator simulations
  • Performance Analysis: Detailed analysis of cache configurations, prefetching mechanisms, and pipeline optimizations
  • Simulation Framework: Established reproducible methodology using riscv-tests and CoreMark benchmarks
  • Technical Infrastructure: Created GitHub repository with custom gem5 configurations and results

Key Conclusions:

  • gem5 Superiority: gem5 demonstrates enormous capability in terms of statistical breadth and simulation speed
  • Simulation Efficiency: gem5 is consistently ~198% faster than Verilator under Chipyard
  • Configuration Flexibility: gem5 provides extensive configurability and automatic tracking of statistics
  • Research Viability: gem5 is highly capable of accurately modeling hardware-level implementations of RISC-V architectures[3][4]

Future Work:

  • VCD Analysis: Use generated VCD (waveform dump files) from Verilator to reverse engineer processor models
  • Micro-architectural Event Tracking: Develop comprehensive framework to generate more performance statistics within Chipyard
  • Configuration Refinement: Given more time, refine and tune gem5 configuration for optimal accuracy
  • Additional Benchmarks: Run benchmarks that stress each functional unit and specific pipeline parts
  • Architectural Knowledge: Develop deeper understanding of Chisel HDL to streamline configuration process

The project established that properly configured simulators like gem5 are appropriate for modeling complex processor architectures in detail and could become the de facto standard for simulation across implementation levels.

References

[1] A. Akram and L. Sawalha, "A Comparison of x86 Computer Architecture Simulators," 2016.

[2] F. A. Endo, D. Couroussé and H.-P. Charles, "Micro-architectural simulation of in-order and out-of-order ARM microprocessors with gem5," in 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014.

[3] A. Roelke and M. R. Stan, "RISC5: Implementing the RISC-V ISA in gem5," 2017.

[4] P. Y. H. Hin, X. Liao, J. Cui, A. Mondelli, T. M. Somu and N. Zhang, "Supporting RISC-V Full System Simulation in gem5," in Proceedings of Computer Architecture Research with RISC-V (CARRV 2021), 2021.

[5] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, M. Andreozzi, A. Armejach, N. Asmussen, B. Beckmann, S. Bharadwaj, G. Black, G. Bloom, B. R. Bruce, D. R. Carvalho, J. Castrillon, L. Chen, N. Derumigny, S. Diestelhorst, W. Elsasser, C. Escuin, M. Fariborz, A. Farmahini-Farahani, P. Fotouhi, R. Gambord, J. Gandhi, D. Gope, T. Grass, A. Gutierrez, B. Hanindhito, A. Hansson, S. Haria, A. Harris, T. Hayes, A. Herrera, M. Horsnell, S. A. R. Jafri, R. Jagtap, H. Jang, R. Jeyapaul, T. M. Jones, M. Jung, S. Kannoth, H. Khaleghzadeh, Y. Kodama, T. Krishna, T. Marinelli, C. Menard, A. Mondelli, M. Moreto, T. Mück, O. Naji, K. Nathella, H. Nguyen, N. Nikoleris, L. E. Olson, M. Orr, B. Pham, P. Prieto, T. Reddy, A. Roelke, M. Samani, A. Sandberg, J. Setoain, B. Shingarov, M. D. Sinclair, T. Ta, R. Thakur, G. Travaglini, M. Upton, N. Vaish, I. Vougioukas, W. Wang, Z. Wang, N. Wehn, C. Weis, D. A. Wood, H. Yoon and É. F. Zulian, The gem5 Simulator: Version 20.0+, arXiv, 2020.

[6] W. Heirman, T. E. Carlson and L. Eeckhout, "Sniper: scalable and accurate parallel multi-core simulation," 2012.

[7] J. Zhao, "SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine," 2020.

[8] C. Celio, D. A. Patterson and K. Asanović, "The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor," 2015.

[9] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi and E. Rotenberg, "FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template," in 2011 38th Annual International Symposium on Computer Architecture (ISCA), 2011.

[10] B. H. Dwiel, N. K. Choudhary and E. Rotenberg, "FPGA modeling of diverse superscalar processors," in 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012.

[11] S. Mashimo, A. Fujita, R. Matsuo, S. Akaki, A. Fukuda, T. Koizumi, J. Kadomoto, H. Irie, M. Goshima, K. Inoue and R. Shioya, "An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor," in 2019 International Conference on Field-Programmable Technology (ICFPT), 2019.

[12] S. Zhang, A. Wright, T. Bourgeat and A. Arvind, "Composable Building Blocks to Open up Processor Design," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.

[13] C. Celio, P.-F. Chiu, B. Nikolic, D. A. Patterson and K. Asanović, "BOOM v2: an open-source out-of-order RISC-V core," 2017.

[14] C. Atwell, "RISC-V Serves Up Open-Source Possibilities for the Future," ElectronicDesign, 12-Jul-2022. [Online]. Available: https://www.electronicdesign.com/technologies/embedded-revolution/article/21246374/electronic-design-riscv-serves-up-opensource-possibilities-for-the-future. [Accessed: 06-Nov-2022].

[15] "Welcome to RISCV-Boom's Documentation!" RISCV-BOOM, https://docs.boomcore.org/en/latest/index.html.


This project was completed as part of CPRE 581 (Advanced Computer Architecture) at Iowa State University.