First Advisor

Michael A. Driscoll

Term of Graduation

Fall 2006

Date of Publication


Document Type


Degree Name

Doctor of Philosophy (Ph.D.) in Electrical and Computer Engineering


Electrical and Computer Engineering




Computer architecture, Memory management (Computer science), Parallel processing (Electronic computers)



Physical Description

1 online resource (2, ix, 163 pages)


Memory latency-tolerant architectures support thousands of in-flight instructions without proportionate scaling of cycle-critical processor resources, and thousands of useful instructions can complete in parallel with a long-latency miss to memory. These architectures, however, require large queues to track all loads and stores executed while a long-latency miss is pending. Hierarchical designs alleviate the cycle-time impact of these structures, but the content-addressable memory (CAM) and search functions required to enforce memory ordering and provide data forwarding place high demands on area and power.
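To make the search cost concrete, the following is a minimal sketch of a conventional store queue that forwards data by comparing a load's address against every buffered store. It is a toy software model and not the design evaluated in the dissertation; the class name `SearchedStoreQueue`, the `seq` field, and the `comparisons` counter are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    seq: int    # program-order sequence number (assumed, for illustration)
    addr: int   # store address
    value: int  # store data


class SearchedStoreQueue:
    """Toy store queue that forwards by searching every entry."""

    def __init__(self):
        self.entries = []
        self.comparisons = 0  # counts address compares, one per entry searched

    def store(self, seq, addr, value):
        self.entries.append(StoreEntry(seq, addr, value))

    def load(self, seq, addr):
        """Return the value of the youngest older store to addr, or None."""
        best = None
        for e in self.entries:
            # In hardware, a CAM performs these compares in parallel,
            # which is where the area and power cost comes from.
            self.comparisons += 1
            if e.addr == addr and e.seq < seq:
                if best is None or e.seq > best.seq:
                    best = e
        return best.value if best is not None else None
```

In this sketch every load touches every entry, so the number of compares grows with the queue size times the number of loads. That growth is the scaling pressure the abstract refers to.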

Many recent proposals address the complexity of load and store queues. However, none of these proposals addresses the fundamental source of complexity in these queues: the constant searching required to enforce ordering among memory operations and to forward data correctly. These earlier proposals only provide mechanisms for coping with search complexity. This dissertation presents a novel proposal for high-performance load and store queues that do not require fully associative searches.

We present new load and store processing mechanisms for latency-tolerant architectures. We augment small, primary load and store queues with large, secondary buffers. The secondary load buffer is an unordered, set-associative structure, similar to a cache. The secondary store buffer, the Store Redo Log (SRL), is a first-in, first-out structure that records the program order of all stores completed in parallel with a miss, and it has no CAM or search functions. In place of a secondary store queue, a cache provides temporary forwarding. The SRL enforces memory ordering by ensuring that memory updates occur in program order once the miss returns.
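The store path described above can be sketched as a small software model. This is a simplified illustration under stated assumptions, not the dissertation's hardware design: the forwarding cache is modeled as a plain dictionary, memory as a dictionary with a default of 0, and the set-associative load buffer, dependence checking, and recovery are omitted. The names `StoreRedoLogModel`, `store_during_miss`, `load_during_miss`, and `miss_returns` are hypothetical.

```python
from collections import deque


class StoreRedoLogModel:
    """Toy model: FIFO store log plus an indexed cache for temporary forwarding."""

    def __init__(self, memory):
        self.memory = memory   # dict addr -> value, stands in for memory state
        self.srl = deque()     # FIFO of (addr, value) in program order
        self.fwd_cache = {}    # temporary forwarding: addr -> latest value

    def store_during_miss(self, addr, value):
        # Log the store in program order; no search over older entries.
        self.srl.append((addr, value))
        # The cache keeps the latest value so later loads can find it by index.
        self.fwd_cache[addr] = value

    def load_during_miss(self, addr):
        # Forwarding is an indexed lookup, not an associative queue search.
        if addr in self.fwd_cache:
            return self.fwd_cache[addr]
        return self.memory.get(addr, 0)  # default of 0 is an assumption

    def miss_returns(self):
        # Apply the logged stores to memory oldest first, so updates
        # become visible in program order.
        while self.srl:
            addr, value = self.srl.popleft()
            self.memory[addr] = value
        self.fwd_cache.clear()
```

A short usage example: after `store_during_miss(0x10, 5)` followed by `store_during_miss(0x10, 8)`, a call to `load_during_miss(0x10)` returns 8 while `memory[0x10]` keeps its old value until `miss_returns()` drains the log.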

The new mechanisms eliminate the CAM and search functions in the secondary load and store buffers, and remove fundamental sources of complexity, power, and area inefficiency in load and store processing. The new organization, while being area- and power-efficient, is competitive in performance with hierarchical designs. The key idea behind our proposal is: "Redoing certain stores to fix dependences is better than trying to constantly enforce dependences." The design of both load and store queues is inherently scalable and significantly simpler because it lacks any CAM logic.

Our method shows 5x area and 6x total power savings over hierarchical designs.


In Copyright. URI:

This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).


If you are the rightful copyright holder of this dissertation or thesis and wish to have it removed from the Open Access Collection, please submit a request to and include clear identification of the work, preferably with URL.

Persistent Identifier