# The Use of Multithreading for Exception Handling

Craig Zilles, Joel Emer\*, Guri Sohi

**University of Wisconsin - Madison** 

\*Compaq - Alpha Development Group

International Symposium on Microarchitecture - 32

November, 1999

# **Overview**

Extensions to a multithreaded processor to reclaim lost performance during exception handling in a pipelined, out-of-order processor

- HARDWARE EXCEPTIONS
- Performance in traditional implementation
- IMPORTANT CHARACTERISTICS OF EXCEPTION HANDLERS
- EXPLOIT THEM WITH EXTENSION TO SMT PROCESSOR
- Methodology/Performance
- AN OPTIMIZATION: QUICK-STARTING
- Conclusions

# Hardware exceptions

# COST-EFFECTIVE HARDWARE UNCOMMON CASE HANDLED BY SOFTWARE

# RECOVERABLE EXCEPTIONS (NOT SEGFAULTS)

- TIB miss
- unaligned access
- emulated instructions

# EVENT DETECTED BY HARDWARE, RESOLVED BY SOFTWARE

- A short piece of code is executed
- Control is returned to the application at the exception

# **MUCH LIKE BRANCH MISPREDICT**

- causes CHANGE IN CONTROL FLOW
- often detected at execute time

PRE-EXCEPT APPLICATION POST-EXCEPT APPLICATION

SQUASH THE EXCEPTION AND POST-EXCEPTION INSTRUCTIONS

PRE-EXCEPT APPLICATION

PRE-EXCEPT APPLICATION EXCPT. HANDLER

REFETCH APPLICATION CODE

PRE-EXCEPT APPLICATION EXCPT. HANDLER POST-EXCEPT APPLI

—— DYNAMIC INSTRUCTION STREAM

# WITH INCREASED PIPELINE LENGTH, SUPERSCALAR WIDTH, AND WINDOW SIZE

#### • IT ONLY GETS WORSE



The Use of Multithreading for Exception Handling - Craig Zilles, Joel Emer, and Guri Sohi International Symposium on Microarchitecture - 32, November 1999

# **Structure of Exception Handler**

# RECONVERGENT CONTROL FLOW

The same application instructions are executed in the same order INDEPENDENT of the exception handler's execution

MINIMAL DATA DEPENDENCES between application and exception handler

typically only involving excepting instruction

Example: TLB MISS HANDLER

- reads miss address from privileged register
- loads from page table
- writes TLB

# **Extension to SMT processor**

### RECONVERGENT CONTROL FLOW -> DON'T SQUASH

### ALLOCATE THE HANDLER TO SEPARATE THREAD

- FIFO management of window resources (within a thread)
- extra hardware required for ordering threads

### **THREAD**



### PROVIDE APPEARANCE OF SEQUENTIAL EXECUTION

Control thread retirement order

# **Extension to SMT processor**

# MINIMAL DATA DEPENDENCES — USE SEPARATE REGISTER FILE

Avoids additional renamer complexity

UNCOMMON CASE (TLB MISS  $\rightarrow$  PAGE FAULT  $\rightarrow$  CONTEXT SWITCH)

REVERT TO NORMAL MECHANISM

#### MEMORY DEPENDENCES

• DETECT CONFLICTS, RECOVER (MUCH LIKE R10K, OR ARB)

# Methodology

# **EXAMPLE IMPLEMENTATION: SOFTWARE TLB MISS HANDLING**

- EXECUTION DRIVEN SMT SIMULATOR
- BUILT FROM ALPHA ARCHITECTURE SIMPLESCALAR TOOLKIT
- SUPPORTS ENOUGH OF 21164 PRIVILEGED ARCHITECTURE TO RUN COMMON-CASE TLB HANDLER
  - SPECULATIVE EXECUTION, MULTIPLE IN-FLIGHT MISSES
- 8 WIDE, 128 WINDOW, 7 STAGE, BIG YAGS, 64K L1'S, 1M L2
- BENCHMARKS WITH NON-TRIVIAL TLB BEHAVIOR (FROM SPEC AND ELSEWHERE)
- SCALED DOWN (64 ENTRY) DATA TLB

METRIC: PENALTY PER MISS

(additional overhead vs. simulation with perfect TLB) / misses

# DOES MUCH BETTER THAN TRADITIONAL SOFTWARE APPROACH NOT AS GOOD AS AGGRESSIVE HARDWARE TLB MISS WIDGET



# **Optimization: Quick-starting**

#### PERFORMANCE GAP BETWEEN HARDWARE AND MULTI-THREADED

FETCH/DECODE LATENCY

SOLUTION: CACHE EXCEPTION HANDLER PARTWAY DOWN PIPELINE

#### **OUR SMT IMPLEMENTATION:**

- PER THREAD FETCH BUFFERS, IDLE RESOURCES WHEN THREAD IS IDLE
- PREDICT NEXT EXCEPTION, USE IDLE FETCH CYCLES TO PREFETCH HANDLER.
- REDUCES MULTI-THREADED EXCEPTION LATENCY.



# **A**LMOST CUTS PERFORMANCE GAP IN HALF



# Single Thread Performance vs. Throughput

SINGLE APPLICATION: (PREVIOUS RESULTS)

FOCUS: IMPROVE SINGLE THREAD PERFORMANCE

MULTIPROGRAMED/MULTITHREADED WORKLOAD:

FOCUS: MAXIMIZE THROUGHPUT

OUR EXPERIMENT: (NOT NECESSARILY A FAIR COMPARISON)

RUN 3 APPLICATIONS, 1 IDLE THREAD FOR EXCEPTION HANDLING

# Performance on Multiprogramed Workloads



# PERFORMANCE IS MORE COMPLICATED

- SMT IS MORE LATENCY TOLERANT
- SMT IS LESS TOLERANT OF WASTED BANDWIDTH

# **Related Work**

# **ARCHITECTURES:**

- M-MACHINE
  - FILLO, KECKLER, DALLY, CARTER, CHANG, GUREVICH, LEE
  - KECKLER, DALLY, CHANG, LEE, CHATTERJEE
- Multiscalar/Kestrel

# SUBORDINATE MULTITHREADING:

- Chappel, Stark, Kim, Reinhart, and Patt
- SONG AND DUBOIS

# **Conclusions**

#### SIGNIFICANTLY IMPROVES EXCEPTION HANDLING PERFORMANCE:

 software TLB miss performance approaching that of an aggressive hardware TLB miss performance

NOT ALL EXCEPTIONS CAN BE IMPLEMENTED IN HARDWARE

#### HIGH PERFORMANCE EXCEPTIONS ENABLE NOVEL SOFTWARE SYSTEMS

'a la software DSM or Concurrent GC