
Barcelona, Spain

September 8-12, 2001


Abstracts and Papers


Session 1: Simulation and Modeling

Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. Tim Sherwood, Erez Perelman and Brad Calder, (University of California, San Diego)

Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry-standard benchmark can take weeks to months to complete. To overcome this problem, researchers evaluate their results on a very small portion of a program's execution rather than simulating the entire program. In this paper we propose Basic Block Distribution Analysis as an automated approach for finding these small portions of the program to simulate that are representative of the entire program's execution. This approach is based on using profiles of a program's code structure (basic blocks) to uniquely identify different phases of execution in the program. We show that the periodicity of the basic block frequency profile reflects the periodicity of detailed simulation across several different architectural metrics (e.g., IPC, branch miss rate, cache miss rate, value misprediction, address misprediction, and reorder buffer occupancy). Since basic block frequencies can be collected using very fast profiling tools, our approach provides a practical technique for finding the periodicity and simulation points in applications.
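As a rough illustration of the idea, the sketch below (not the authors' tool; the interval length, the Manhattan metric, and the trace format are assumptions for illustration) builds one basic-block frequency vector per fixed-length execution interval and picks the interval that best represents the whole run:

```python
from collections import Counter

def basic_block_vectors(block_trace, interval_len):
    """Split a trace of executed basic-block IDs into fixed-length
    intervals and return one normalized frequency vector per interval."""
    vectors = []
    for start in range(0, len(block_trace), interval_len):
        interval = block_trace[start:start + interval_len]
        counts = Counter(interval)
        total = sum(counts.values())
        vectors.append({bb: n / total for bb, n in counts.items()})
    return vectors

def manhattan_distance(v1, v2):
    """Distance between two frequency vectors; 0 means identical mix."""
    keys = set(v1) | set(v2)
    return sum(abs(v1.get(k, 0.0) - v2.get(k, 0.0)) for k in keys)

def representative_interval(vectors):
    """Pick the interval whose vector is closest in total distance to all
    others: a crude stand-in for choosing a simulation point."""
    best, best_score = 0, float("inf")
    for i, v in enumerate(vectors):
        score = sum(manhattan_distance(v, w) for w in vectors)
        if score < best_score:
            best, best_score = i, score
    return best
```

Intervals that repeat the same basic-block mix land at distance zero from each other, which is how the profile exposes the program's periodic phases.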

Modeling Superscalar Processors via Statistical Simulation. Sebastien Nussbaum and James Smith (Dept. of Electrical and Computer Engineering, University of Wisconsin-Madison)

Abstract: Statistical simulation is a technique for fast performance evaluation of superscalar processors. First, intrinsic statistical information is collected from a single detailed simulation of a program. This information is then used to generate a synthetic instruction trace that is fed to a simple processor model, along with cache and branch prediction statistics. Because of the probabilistic nature of the simulation, it quickly converges to a performance rate. The simplicity and simulation speed make it useful for fast design space exploration; as such, it is a good complement to conventional detailed simulation. The accuracy of this technique is evaluated for different levels of modeling complexity. Both errors and convergence properties are studied in detail. A simple instruction model yields an average error of 8% compared with detailed simulation. A more detailed instruction model reduces the error to 5% but takes about three times as long to converge.
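A toy sketch of the workflow follows; the instruction mix, the latencies, and the single-issue timing model are illustrative assumptions only (the paper's models are far richer, also capturing dependence distances and cache/branch behavior):

```python
import random

# Illustrative latencies, not taken from the paper.
LATENCY = {"alu": 1, "load": 2, "branch": 1}

def collect_statistics(trace):
    """Profile one detailed trace: fraction of each instruction type."""
    stats = {}
    for op in trace:
        stats[op] = stats.get(op, 0) + 1
    total = len(trace)
    return {op: n / total for op, n in stats.items()}

def synthetic_trace(stats, length, seed=0):
    """Sample a synthetic instruction trace from the collected mix."""
    rng = random.Random(seed)
    ops = list(stats)
    weights = [stats[op] for op in ops]
    return rng.choices(ops, weights=weights, k=length)

def simple_ipc_model(trace):
    """Toy single-issue timing model: IPC = instructions / total cycles."""
    cycles = sum(LATENCY[op] for op in trace)
    return len(trace) / cycles
```

Because the synthetic trace is sampled from a fixed distribution, its measured IPC settles quickly as the trace grows, which is the convergence property the abstract describes.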

Hybrid Analytical-Statistical Modeling for Efficiently Exploring Architecture and Workload Design Spaces. Lieven Eeckhout and Koen De Bosschere (Department of Electronics and Information Systems, Ghent University)

Abstract: Microprocessor design time and effort are becoming impractical due to the huge number of simulations needed to evaluate various processor configurations for various workloads. An early design stage methodology could be useful to efficiently cull huge design spaces to identify regions of interest to be further explored using more accurate simulations. In this paper, we present an early design stage method that bridges the gap between analytical and statistical modeling. The hybrid analytical-statistical method presented here is based on the observation that register traffic characteristics exhibit power-law properties, which allows us to fully characterize a workload with just a few parameters. This is much more efficient than the collection of distributions that needs to be specified in classical statistical modeling. We evaluate the applicability and the usefulness of this hybrid analytical-statistical modeling technique to efficiently and accurately cull huge architectural design spaces. In addition, we demonstrate that this hybrid analytical-statistical modeling technique can be used to explore the entire workload space by varying just a few workload parameters.

Session 2: Efficient Caches

Filtering Techniques to Improve Trace-Cache Efficiency. Roni Rosner, Avi Mendelson and Ronny Ronen (Israel Design Center, Intel)

Abstract: Trace-caches have been shown to effectively increase the number of "useful" instructions that can be fetched into the machine, thus enabling more instructions to be executed each cycle. However, the trace cache has another important benefit that has received less attention in recent research: reducing instruction decoding power, which is particularly attractive for variable-length ISAs such as Intel's IA-32 (x86) architecture. Keeping the instruction traces in decoded format implies that decoding power is paid only when a trace is built, thus reducing the overall power consumption of the system. We observe that (1) the majority of traces that are inserted into the trace-cache are rarely used again before being replaced; (2) the majority of the instructions delivered for execution originate from the few traces that are heavily and repeatedly used; and (3) techniques that aim to improve instruction-fetch bandwidth may increase the number of traces built during program execution. Based on these observations, we propose splitting the trace cache into two components: the filter trace-cache (FTC) and the main trace-cache (MTC). Traces are first inserted into the FTC, which is used to filter out the infrequently used traces; traces that prove "useful" are later moved into the MTC itself. The FTC/MTC organization exhibits an important benefit: it decreases the number of traces built, thus reducing power consumption while improving overall performance. For medium-size applications, the FTC/MTC pair reduces the number of trace builds by 16% on average. An extension of the filtering concept involves adding a second-level (L2) trace-cache that stores less frequent traces that are replaced in the FTC or the MTC. The extra level of caching allows an order-of-magnitude reduction in the number of trace builds. The second-level trace cache proves particularly useful for applications with large instruction footprints.
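The filtering idea can be sketched as below; the sizes, the LRU replacement, and the promote-on-first-reuse rule are simplifying assumptions for illustration, not the paper's evaluated design:

```python
from collections import OrderedDict

class FilteredTraceCache:
    """Two-level sketch: new traces enter a small filter trace-cache
    (FTC); a trace reused while still in the FTC is promoted to the main
    trace-cache (MTC), so rarely reused traces never pollute the MTC."""

    def __init__(self, ftc_size, mtc_size):
        self.ftc = OrderedDict()   # ordered for LRU replacement
        self.mtc = OrderedDict()
        self.ftc_size, self.mtc_size = ftc_size, mtc_size
        self.builds = 0            # trace builds: a proxy for decode power

    def access(self, trace_id):
        if trace_id in self.mtc:
            self.mtc.move_to_end(trace_id)
            return "mtc-hit"
        if trace_id in self.ftc:
            # Reused while in the filter: promote to the main trace-cache.
            del self.ftc[trace_id]
            self.mtc[trace_id] = True
            if len(self.mtc) > self.mtc_size:
                self.mtc.popitem(last=False)
            return "promoted"
        # Miss: build (decode) the trace and insert it into the FTC only.
        self.builds += 1
        self.ftc[trace_id] = True
        if len(self.ftc) > self.ftc_size:
            self.ftc.popitem(last=False)
        return "built"
```

Single-use traces are evicted cheaply from the small FTC without ever displacing the hot traces held in the MTC.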

Reactive-Associative Caches. Brannon Batson (1) and T. Vijaykumar (2). (1) Compaq and (2) Purdue University

Abstract: While set-associative caches typically incur fewer misses than direct-mapped caches, set-associative caches have slower hit times. We propose the reactive-associative cache (r-a cache), which provides flexible associativity by placing most blocks in direct-mapped positions and reactively displacing only conflicting blocks to set-associative positions. The r-a cache uses way-prediction (like the predictive associative cache, PSA) to access displaced blocks on the initial probe. Unlike PSA, however, the r-a cache employs a novel feedback mechanism to prevent unpredictable blocks from being displaced. Reactive displacement and feedback allow the r-a cache to use a novel PC-based way-prediction and achieve high accuracy, without the impractical block swapping of column-associative and group-associative caches, and without relying on timing-constrained XOR way prediction. A one-port, 4-way r-a cache achieves up to 9% speedup over a direct-mapped cache and performs within 2% of an idealized 2-way set-associative, 1-cycle cache. A 4-way r-a cache achieves up to 13% speedup over a PSA cache, with both r-a and PSA using the PC scheme. CACTI estimates that for sizes larger than 8KB, a 4-way r-a cache is within 1% of direct-mapped hit times, and 24% faster than a 2-way set-associative cache.

Adaptive Mode Control: A Static-Power-Efficient Cache Design. Huiyang Zhou, Mark Toburen, Eric Rotenberg and Thomas Conte (North Carolina State University)

Abstract: Lower threshold voltages in deep sub-micron technologies cause more leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current approaches dynamically control the operating mode of large groups of cache lines or even individual cache lines. However, the performance monitoring mechanism that controls the percentage of sleep-mode lines, and identifies particular lines for sleep mode, is somewhat arbitrary. There is no way to know what the performance could be with all cache lines active, so arbitrary miss rate targets are set (perhaps on a per-benchmark basis using profile information) and the control mechanism tracks these targets. We propose applying sleep mode only to the data store and not the tag store. By keeping the entire tag store active, the hardware knows what the hypothetical miss rate would be if all data lines were active and the actual miss rate can be made to precisely track it. Simulations show an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64KB caches.
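The key mechanism, keeping the whole tag store awake so sleep-induced misses can be distinguished from real ones, can be sketched as follows. The direct-mapped organization and the fixed decay interval are illustrative assumptions; the paper's adaptive controller, which tunes the interval from these counts, is omitted:

```python
class AMCCache:
    """Sketch of adaptive mode control: the tag store stays awake, so a
    tag match on a sleeping line is a *sleep miss* the hardware can count,
    while a tag mismatch would miss even with every line awake.  A line
    untouched for more than `decay` accesses is considered asleep."""

    def __init__(self, n_sets, decay):
        self.tags = [None] * n_sets
        self.last_use = [0] * n_sets
        self.n_sets, self.decay = n_sets, decay
        self.time = 0
        self.real_misses = 0    # hypothetical miss rate (all lines awake)
        self.sleep_misses = 0   # extra misses caused by sleep mode alone

    def access(self, addr):
        self.time += 1
        s, tag = addr % self.n_sets, addr // self.n_sets
        asleep = self.time - self.last_use[s] > self.decay
        self.last_use[s] = self.time
        if self.tags[s] != tag:
            self.real_misses += 1      # would miss regardless of sleep mode
            self.tags[s] = tag
            return "miss"
        if asleep:
            self.sleep_misses += 1     # data lost to sleep, tag still valid
            return "sleep-miss"
        return "hit"
```

Because `real_misses` tracks exactly what the miss rate would be with all data lines active, a controller can hold `sleep_misses` to a precise fraction of it rather than chasing an arbitrary per-benchmark target.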

Session 3: Specialized Instruction Sets

Implementation and Evaluation of the Complex Streamed Instruction Set. Ben Juurlink (1), Dmitri Tcheressiz (2), Stamatis Vassiliadis (1), Harry Wijshoff (2). (1) Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, Delft. (2) Department of Computer Science, Leiden University, Leiden.

Abstract: An architectural paradigm designed to accelerate streaming operations on mixed-width data is presented and evaluated. The described Complex Streamed Instruction (CSI) set contains instructions that process data streams of arbitrary length. The number of bits or elements that will be processed in parallel is, therefore, not visible to the programmer, so no recompilation is needed in order to benefit from a wider datapath. CSI also eliminates many overhead instructions (such as instructions needed for data alignment and reorganization) often needed in applications utilizing media ISA extensions such as MMX and VIS, by replacing them with a hardware mechanism. Simulation results using several multimedia kernels demonstrate that CSI provides a factor of up to 9.9 (4.0 on average) performance improvement when compared to Sun's VIS extension. For complete applications, the performance gain is 9% to 36% with an average of 20%.

On the Efficiency of Reductions in micro-SIMD media extensions. Jesus Corbal, Roger Espasa, and Mateo Valero (Computer Architecture Department, UPC)

Abstract: Many important multimedia applications contain a significant fraction of reduction operations. Although, in general, multimedia applications are characterized by high amounts of data-level parallelism, reductions and accumulations are difficult to parallelize and show a poor tolerance to increases in instruction latency. This is especially significant for micro-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in micro-SIMD ISAs, designers tend to include ever more complex instructions able to deal with the most common forms of reductions in multimedia. As the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. This paper presents an in-depth discussion of how reductions/accumulations are performed in current micro-SIMD architectures and evaluates the performance trade-offs for a near-future, highly aggressive superscalar processor with three different styles of micro-SIMD extensions. We compare an MMX-like alternative to an MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with a large number of registers and long-latency operations. On the other hand, this paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.

Session 4: Prediction and Recovery

Boolean Formula-based Branch Prediction for Future Technologies. Daniel Jimenez (1), Heather Hanson (2) and Calvin Lin (1). (1) Department of Computer Sciences, The University of Texas at Austin. (2) Department of Electrical & Computer Engineering, The University of Texas at Austin.

Abstract: We present a new method for branch prediction that encodes in the branch instruction a formula, chosen by profiling, that is used to perform history-based prediction. By using a special class of Boolean formulas, our encoding is extremely concise. By replacing the large tables found in current predictors with a small, fast circuit, our scheme is ideally suited to future technologies that will have large wire delays. In a projected 70 nm technology and an aggressive clock rate of about 5 GHz, an implementation of our method that uses an 8-bit formula encoding has a misprediction rate of 6.0%, 42% lower than that of the best gshare predictor implementable in that same technology. In today's technology, a 16-bit version of our predictor can replace bias bits in an 8K-entry agree predictor to achieve a 2.86% misprediction rate, which is slightly lower than the 2.93% misprediction rate of the Alpha 21264 hybrid predictor, even though the Alpha predictor has almost twice the hardware budget. Our predictor also consumes much less power than table-based predictors. This paper describes our predictor, explains our profiling algorithm, and presents experimental results using the SPEC 2000 integer benchmarks.
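To convey the flavor of the scheme, here is a toy sketch; the four-bit history, the read-once formula shape (h0 op h1) op (h2 op h3), and the exhaustive profiling search are illustrative assumptions, not the paper's formula class or its per-branch encoding:

```python
from itertools import product

# Each op bit selects AND (0) or OR (1).
OP = [lambda a, b: a and b, lambda a, b: a or b]

def eval_formula(ops, history):
    """Evaluate a tiny read-once Boolean formula over 4 history bits:
    ops[0] combines h0,h1; ops[1] combines h2,h3; ops[2] is the root.
    The whole predictor is the 3-bit tuple `ops`: no tables needed."""
    left = OP[ops[0]](history[0], history[1])
    right = OP[ops[1]](history[2], history[3])
    return OP[ops[2]](left, right)

def profile_formula(samples):
    """Choose, by exhaustive search over the 3-bit encoding, the formula
    that mispredicts the fewest (history, outcome) profile samples."""
    best_ops, best_miss = None, None
    for ops in product((0, 1), repeat=3):
        miss = sum(eval_formula(ops, h) != out for h, out in samples)
        if best_miss is None or miss < best_miss:
            best_ops, best_miss = ops, miss
    return best_ops, best_miss
```

The chosen bits ride along in the branch instruction, and prediction at run time is a handful of gates over the history register rather than a large, slow table lookup.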

Using Dataflow Based Context for Accurate Value Prediction. Renju Thomas and Manoj Franklin (University of Maryland)

Abstract: We explore the reasons behind the rather low prediction accuracy of existing data value predictors. Our studies show that contexts formed only from the outcomes of the last several instances of a static instruction do not always encapsulate all of the information required for correct prediction. Complex interactions between data flow and control flow change the context in ways that result in predictability loss for a significant number of dynamic instructions. To improve prediction accuracy, we propose the concept of using contexts derived from the predictable portions of the data flow graph. That is, the predictability of hard-to-predict instructions can be improved by taking advantage of the predictability of the easy-to-predict instructions that precede them in the data flow graph. We propose and investigate a run-time scheme for producing such an improved context from the predicted values of previous instructions. We also propose a novel predictor, the dynamic dataflow-inherited speculative context (DDISC) predictor, specifically for predicting hard-to-predict instructions. Simulation results verify that the use of dataflow-based contexts yields significant improvements in prediction accuracies, ranging from 35% to 99%. This translates to an overall prediction accuracy of 68% to 99.9%.

Recovery mechanism for latency misprediction. Enric Morancho, Jose Maria Llaberia and Angel Olive (Computer Architecture Department, UPC)

Abstract: Signalling result availability from the functional units to the instruction scheduler can increase the cycle time and/or the effective latency of the instructions. Knowledge of all instruction latencies would allow the instruction scheduler to operate without the need for external signalling. However, the latency of some instructions is unknown; the scheduler can, though, optimistically predict the latency of these instructions and speculatively issue their dependent instructions. Although prediction techniques have great performance potential, their gain can vanish due to misprediction handling. For instance, holding speculatively scheduled instructions in the issue queue reduces its capacity to look ahead for independent instructions. This paper evaluates a recovery mechanism for latency mispredictions that retains the speculatively issued instructions in a structure apart from the issue queue: the recovery buffer. When data becomes available after a latency misprediction, the dependent instructions are re-issued from the recovery buffer. Moreover, in order to simplify the re-issue logic of the recovery buffer, the instructions are recorded in issue order. On mispredictions, the recovery buffer increases the effective capacity of the issue queue to hold instructions waiting for operands. Our evaluations on integer benchmarks show that the recovery-buffer mechanism reduces issue-queue size requirements by about 20-25%. This mechanism is also less sensitive to the verification delay than a recovery mechanism that retains the instructions in the issue queue.

Session 5: Memory Optimization

A Cost Framework for Evaluating Integrated Restructuring Optimizations. Bharat Chandramouli, John Carter, Wilson Hsieh and Sally McKee (University of Utah)

Abstract: Loop transformations and array restructuring optimizations usually improve performance by increasing the memory locality of applications, but not always. For instance, loop and array restructuring can either complement or compete with one another. Previous research has proposed integrating loop and array restructuring, but there existed no analytic framework for determining how best to combine the optimizations for a given program. Since the choice of which optimizations to apply, alone or in combination, is highly application- and input-dependent, a cost framework is needed if integrated restructuring is to be automated by an optimizing compiler. To this end, we develop a cost model that considers standard loop optimizations along with two potential forms of array restructuring: conventional copying-based restructuring and remapping-based restructuring that exploits a smart memory controller. We simulate eight applications on a variety of input sizes and with a variety of hand-applied restructuring optimizations. We find that employing a fixed strategy does not always deliver the best performance. Finally, our cost model accurately predicts the best combination of restructuring optimizations among those we examine, and yields performance within a geometric mean of 5% of the best combination across all benchmarks and input sizes.

Compiling for the Impulse Memory Controller. Xianglong Huang, Zhenlin Wang and Kathryn McKinley (Computer Science Dept., University of Massachusetts, Amherst)

Abstract: The Impulse memory controller provides an interface for remapping irregular or sparse memory accesses into dense accesses in the cache memory. This capability significantly increases processor cache and system bus utilization, and previous work shows performance improvements from a factor of 1.2 to 5 with current technology models for hand-coded kernels in a cycle-level simulator. To attain widespread use of any specialized hardware feature requires automating its use in a compiler. In this paper, we present compiler cost models using dependence and locality analysis that determine when to use Impulse to improve performance based on the reduction in misses, the additional cost for misses in Impulse, and the fixed cost for setting up a remapping. We implement the cost models and generate the appropriate Impulse system calls in the Scale compiler framework. Our results demonstrate that our cost models correctly choose when and when not to use Impulse. We also combine and compare Impulse with our implementation of loop permutation for improving locality. If loop permutation can achieve the same dense access pattern as Impulse, we prefer it, since it has no overheads, but we show that the combination can yield better performance.

On the Stability of Temporal Data Reference Profiles. Trishul Chilimbi (Microsoft Research)

Abstract: Growing computer system complexity has made program optimization based solely on static analyses increasingly difficult. Consequently, many code optimizations incorporate information from program execution profiles. Most memory system optimizations go further and rely primarily on profiles. This reliance on profiles makes off-line optimization effectiveness dependent on profile stability across multiple program runs. While code profiles such as basic block, edge, and branch profiles have been shown to satisfy this requirement, the stability of data reference profiles, especially the temporal data reference profiles that are necessary for cache-level optimizations, has neither been studied nor established. This paper shows that temporal data reference profiles expressed in terms of hot data streams, which are data reference sequences that frequently repeat, are quite stable; an encouraging result for memory optimization research. Most hot data streams belong to one of two categories: those that appear in multiple runs with their data elements referenced in the same order, and those with the same set of elements referenced in a different order. This category membership is extremely stable, and the fraction of hot data streams that belong to the first category is quite large.
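A minimal sketch of the stability question follows; treating hot data streams as fixed-length k-grams over the address trace is a deliberate simplification (the paper extracts variable-length streams), and the thresholds are arbitrary:

```python
from collections import Counter

def hot_data_streams(trace, k=3, min_count=2):
    """Extract 'hot data streams' approximated as length-k data-reference
    subsequences that repeat at least min_count times in the trace."""
    grams = Counter(tuple(trace[i:i + k])
                    for i in range(len(trace) - k + 1))
    return {g for g, n in grams.items() if n >= min_count}

def stability(run1, run2, k=3, min_count=2):
    """Fraction of run1's hot streams that are also hot in run2: a crude
    same-order stability measure across two program runs."""
    h1 = hot_data_streams(run1, k, min_count)
    h2 = hot_data_streams(run2, k, min_count)
    return len(h1 & h2) / len(h1) if h1 else 1.0
```

Two runs that traverse the same data in the same order score 1.0 even if their traces are shifted or padded differently, which is the kind of cross-run agreement an off-line memory optimizer needs.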

Session 6: Program Optimization

Code Reordering and Speculation Support for Dynamic Optimization Systems. Erik Nystrom, Ronald Barnes, Matthew Merten and Wen-mei Hwu (University of Illinois)

Abstract: For dynamic optimization systems, success is limited by two difficult problems arising from instruction reordering. Following optimization within and across basic block boundaries, both the ordering of exceptions and the observed processor register contents at each exception point must be consistent with the original code. While compilers traditionally utilize global data-flow analysis to determine which registers require preservation, this analysis is often infeasible in dynamic optimization systems due to both strict time/space constraints and incomplete code discovery. This paper presents an approach called Precise Speculation that addresses these problems. The proposed mechanism is a component of our vision for Run-time Optimization ARchitecture, or ROAR, to support aggressive dynamic optimization of programs. It utilizes a hardware mechanism to automatically recover the precise register states when a deferred exception is reported, utilizing the original unoptimized code to perform all recovery. We observe that Precise Speculation enables a dynamic optimization system to achieve a large performance gain over aggressively optimized base code, while preserving precise exceptions. For an 8-issue EPIC processor, the dynamic optimizer achieves between 3.6% and 57% speedup over a full-strength optimizing compiler that employs profile-guided optimization.

A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors. Josep Codina, Jesus Sanchez and Antonio Gonzalez (Computer Architecture Department, UPC)

Abstract: This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it allows optimizing the global code generation problem instead of searching for optimal solutions to each individual step. Besides, it avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on-the-fly and heuristics to evaluate the quality of partial schedules considering simultaneously inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on one type of resource for another are also included. We show that the proposed technique outperforms previously proposed techniques. For instance, the average speed-up for SPECfp95 is 36% for a 4-cluster configuration.

Cache-Friendly Implementations of Transitive Closure. Michael Penner and Viktor Prasanna (University of Southern California)

Abstract: In this paper we show cache-friendly implementations of the Floyd-Warshall algorithm for the All-Pairs Shortest-Path problem. We first compare the best commercial compiler optimizations available with standard cache-friendly optimizations and a simple improvement involving a block layout, which reduces TLB misses. We show approximately 15% improvements using these optimizations. We also develop a general representation, the Unidirectional Space Time Representation, which can be used to generate cache-friendly implementations for a large class of algorithms. We show analytically and experimentally that this representation can be used to minimize level-1 and level-2 cache misses and TLB misses and therefore exhibits the best overall performance. Using this representation we show a 2x improvement in performance with respect to the compiler optimized implementation. Experiments were conducted on Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We used the Simplescalar simulator to demonstrate improved cache performance.
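The cache-blocking idea behind such implementations can be sketched with the standard tiled Floyd-Warshall, shown below; this illustrates the locality transformation only, not the paper's Unidirectional Space Time Representation. For each tile of k values, the diagonal tile is updated first, then the tiles sharing its row or column, then the rest, so each B x B tile stays in cache across B iterations of k:

```python
def fw_update(dist, ib, jb, kb, B, n):
    """dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j]),
    restricted to one B x B tile of (i, j) and one tile of k values."""
    for k in range(kb, min(kb + B, n)):
        dk = dist[k]
        for i in range(ib, min(ib + B, n)):
            di = dist[i]
            for j in range(jb, min(jb + B, n)):
                if di[k] + dk[j] < di[j]:
                    di[j] = di[k] + dk[j]

def floyd_warshall_blocked(dist, B):
    """In-place tiled all-pairs shortest paths on an n x n distance
    matrix (float('inf') for missing edges, nonnegative weights)."""
    n = len(dist)
    tiles = range(0, n, B)
    for kb in tiles:
        fw_update(dist, kb, kb, kb, B, n)              # diagonal tile
        for jb in tiles:
            if jb != kb:
                fw_update(dist, kb, jb, kb, B, n)      # row of k-tile
        for ib in tiles:
            if ib != kb:
                fw_update(dist, ib, kb, kb, B, n)      # column of k-tile
        for ib in tiles:
            for jb in tiles:
                if ib != kb and jb != kb:
                    fw_update(dist, ib, jb, kb, B, n)  # remaining tiles
    return dist
```

The phase order matters: the diagonal tile must be fully updated before the row and column tiles that depend on it, and those before the remaining tiles.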

Session 7: Technology Implications

Exploring the Design Space of Future CMPs. Jaehyuk Huh, Doug Burger and Stephen Keckler (University of Texas at Austin)

Abstract: In this paper, we study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issue, and how big the per-processor on-chip caches should be. We find that, contrary to some conventional wisdom, out-of-order processing cores will maximize job throughput on future CMPs. As technology shrinks, limited off-chip bandwidth will begin to curtail the number of cores that can be effective on a single die. Current projections show that the transistor/signal pin ratio will increase by a factor of 45 between 180 and 35 nanometer technologies. That disparity will force increases in per-processor cache capacities as technology shrinks, from 128KB at 100nm, to 256KB at 70nm, and to 1MB at 50 and 35nm, reducing the number of cores that would otherwise be possible.

Area and System Clock Effects on SMT/CMP Processors. James Burns (Intel) and Jean-Luc Gaudiot (USC)

Abstract: Two approaches to high-throughput processors are Chip Multi-Processing (CMP) and Simultaneous Multi-Threading (SMT). CMP increases layout efficiency, which allows more functional units and a faster clock rate. However, CMP suffers from hardware partitioning of functional resources. SMT increases functional unit utilization by issuing instructions simultaneously from multiple threads. However, a wide-issue SMT suffers from layout and technology implementation problems. We use silicon resources as our basis for comparison and find that area and system clock have a large effect on the optimal SMT/CMP design trade-off. We show the area overhead of SMT on each processor and how it scales with the width of the processor pipeline and the number of SMT threads. The wide-issue SMT delivers the highest single-thread performance with improved multi-thread throughput. However, multiple smaller cores deliver the highest throughput.

Session 8: Parallel Machines

Limits on Speculative Module-level Parallelism in Imperative and Object-oriented Programs on CMP Platforms. Fredrik Warg and Per Stenstrom (Chalmers University of Technology)

Abstract: This paper considers program modules, e.g., procedures, functions, and methods, as the basic units for exploiting speculative parallelism in existing codes. We analyze how much inherent and exploitable parallelism exists in a set of C and Java programs on a set of chip-multiprocessor architecture models, and identify the inherent program features, as well as architectural deficiencies, that limit the speedup. Our data complement previous limit studies by indicating that the programming style -- object-oriented versus imperative -- does not seem to have any noticeable impact on the achievable speedup. Further, we show that as few as eight processors are enough to exploit all of the inherent parallelism. However, the memory-level data dependence resolution and thread management mechanisms of recent CMP proposals may impose overheads that severely limit the speedup obtained.

Compiler and Runtime Analysis for Efficient Communication in Data Intensive Applications. Renato Ferreira (1), Gagan Agrawal (2) and Joel Saltz (1). (1) University of Maryland, (2) University of Delaware

Abstract: Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. We are developing a compiler that processes data intensive applications written in a dialect of Java and compiles them for efficient execution on distributed memory parallel machines. In this paper, we focus on the problem of generating correct and efficient communication for data intensive applications. We present static analysis techniques for 1) extracting a global reduction function from a data parallel loop, and 2) determining if a subscript function is monotonic. We also present a runtime technique for reducing the volume of communication during the global reduction phase. We have experimented with two data intensive applications to evaluate the efficacy of our techniques. Our results show that 1) our techniques for extracting global reduction functions and establishing monotonicity of subscript functions can successfully handle these applications, 2) a significant reduction in communication volume and execution times is achieved through our runtime analysis technique, and 3) runtime communication analysis is critical for achieving speedups on parallel configurations.

Architectural Support for Parallel Reductions in Scalable Shared-Memory Multiprocessors. Maria Jesus Garzaran (1), Milos Prvulovic (2), Ye Zhang (2), Alin Jula (3), Hao Yu (3), Lawrence Rauchwerger (3) and Josep Torrellas (2). (1) Universidad de Zaragoza, Spain, (2) University of Illinois at Urbana-Champaign, (3) Texas A&M University

Abstract: Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalable. In this paper, we present new architectural support that significantly speeds up parallel reduction and makes it scalable in shared-memory multiprocessors. The required architectural changes are mostly confined to the directory controllers. Experimental results based on simulations show that the proposed support is very effective. While conventional software-only reduction parallelization delivers average speedups of only 2.7 for 16 processors, our scheme delivers average speedups of 7.6.

Session 9: Data Prefetching

Optimizing Software Data Prefetches with Rotating Registers. Gautam Doshi, Rakesh Krishnaiyer and Kalyan Muthukumar (Intel Corporation)

Abstract: Software data prefetching is a well-known technique to improve the performance of programs that suffer many cache misses at several levels of memory hierarchy. However, it has significant overhead in terms of increased code size, additional instructions, and possibly increased memory bus traffic due to redundant prefetches. This paper presents two novel methods to reduce the overhead of software data prefetching and improve the program performance by optimized prefetch scheduling. These methods exploit the availability of rotating registers and predication in architectures such as the Itanium(TM) architecture. The methods (1) minimize redundant prefetches, (2) reduce the number of issue slots needed for prefetch instructions, and (3) avoid branch mispredict penalties - all with minimal code size increase. Compared to traditional data prefetching techniques, these methods (i) do not require loop unrolling, (ii) do not require predicate computations and (iii) require fewer machine resources. One of these methods has been implemented in the Intel Production Compiler for the Itanium(TM) processor. This technique is compared with traditional approaches for software prefetching and experimental results are presented based on the floating-point benchmark suite of CPU2000.

Multi-Chain Prefetching: Effective Exploitation of Inter-Chain Memory Parallelism for Pointer-Chasing Codes. Nicholas Kohout (1), Seungryul Choi (2), Dongkeun Kim (3), Donald Yeung (3). (1) Intel Corp., (2) Department of Computer Science, University of Maryland at College Park, (3) Department of Electrical and Computer Engineering, University of Maryland at College Park

Abstract: Pointer-chasing applications tend to traverse composed data structures consisting of multiple independent pointer chains. While the traversal of any single pointer chain leads to the serialization of memory operations, the traversal of independent pointer chains provides a source of memory parallelism. This paper presents multi-chain prefetching, a technique that utilizes offline analysis and a hardware prefetch engine to prefetch multiple independent pointer chains simultaneously, thus exploiting inter-chain memory parallelism for the purpose of memory latency tolerance. This paper makes three contributions. First, we introduce a scheduling algorithm that identifies independent pointer chains in pointer-chasing codes and computes a prefetch schedule that overlaps serialized cache misses across separate chains. Our analysis focuses on static traversals. We also propose using speculation to identify independent pointer chains in dynamic traversals. Second, we present the design of a prefetch engine that traverses pointer-based data structures and overlaps multiple pointer chains according to our scheduling algorithm. Finally, we conduct an experimental evaluation of multi-chain prefetching and compare its performance against two existing techniques, jump pointer prefetching and prefetch arrays. Our results show multi-chain prefetching improves execution time by 40% across six pointer-chasing kernels from the Olden benchmark suite, and by 8% across four SPECInt CPU2000 benchmarks. Multi-chain prefetching also outperforms jump pointer prefetching and prefetch arrays by 28% on Olden, and by 12% on SPECInt. Furthermore, speculation can enable multi-chain prefetching for some dynamic traversal codes, but our technique loses its effectiveness when the pointer-chain traversal order is unpredictable. Finally, we also show that combining multi-chain prefetching with prefetch arrays can potentially provide higher performance than either technique alone.
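The intuition behind inter-chain memory parallelism can be sketched in software, even though the paper's mechanism is a hardware prefetch engine driven by a compile-time schedule. In the assumed sketch below (node layout and field names are illustrative), two independent lists are walked in lockstep, so the serialized miss on one chain's next node can overlap with the other chain's.

```c
#include <stddef.h>

typedef struct Node { struct Node *next; int val; } Node;

/* Walk two independent chains together, prefetching each chain's next
 * node so their (otherwise serialized) misses overlap in time. */
long sum_two_chains(Node *a, Node *b) {
    long s = 0;
    while (a || b) {
        if (a && a->next) __builtin_prefetch(a->next);
        if (b && b->next) __builtin_prefetch(b->next);
        if (a) { s += a->val; a = a->next; }
        if (b) { s += b->val; b = b->next; }
    }
    return s;
}
```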

Data Flow Analysis for Software Prefetching Linked Data Structures in Java. Brendon Cahoon and Kathryn McKinley (University of Massachusetts)

Abstract: In this paper, we describe an effective compile-time analysis for software prefetching in Java. Previous work in software data prefetching for pointer-based codes uses simple compiler algorithms and does not investigate prefetching for object-oriented language features that make compile-time analysis difficult. We develop a new data flow analysis to detect regular accesses to linked data structures in Java programs. We use intra- and interprocedural analysis to identify profitable prefetching opportunities for greedy and jump-pointer prefetching, and we implement these techniques in a compiler for Java. Our results show that both prefetching techniques improve performance on four of our ten programs. The largest performance improvement is 48% with jump pointers, but consistent improvements are difficult to obtain.
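The two techniques the analysis targets differ in how far ahead they look: greedy prefetching fetches a node's immediate successor, while jump-pointer prefetching follows an extra per-node pointer installed several nodes ahead, hiding more latency on long chains. A minimal sketch in C (the paper's implementation is for Java; struct layout and field names here are illustrative assumptions):

```c
#include <stddef.h>

typedef struct JNode {
    struct JNode *next;
    struct JNode *jump;  /* points d nodes ahead; installed by the compiler */
    int val;
} JNode;

/* Traversal with jump-pointer prefetching: fetch p->jump (far ahead)
 * instead of only p->next (greedy prefetching's target). */
long traverse(JNode *p) {
    long s = 0;
    for (; p; p = p->next) {
        if (p->jump) __builtin_prefetch(p->jump);
        s += p->val;
    }
    return s;
}
```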

Comparing and Combining Read Miss Clustering and Software Prefetching. Vijay Pai (1) and Sarita Adve (2). (1) Rice University, (2) University of Illinois

Abstract: A recent latency tolerance technique, read miss clustering, restructures code to send demand miss references in parallel to the underlying memory system. An alternate, widely-used latency tolerance technique is software prefetching, which initiates data fetches ahead of expected demand miss references by a certain distance. Since both techniques seem to target the same types of latencies and use the same system resources, it is unclear which technique is superior or if both can be combined. This paper shows that these two techniques are actually mutually beneficial, each helping to overcome limitations of the other. We perform our study for uniprocessor and multiprocessor configurations, in simulation and on a real machine (Convex Exemplar). Compared to prefetching alone (the state-of-the-art implemented in systems today), the combination of the two techniques reduces execution time by an average of 21% across all cases studied in simulation and by an average of 16% for 5 out of 10 cases on the Exemplar. Relative to clustering alone, the combination reduces execution time by an average of 15% for 6 out of 11 cases in simulation and by 20% for 6 out of 10 cases on the Exemplar.
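Read miss clustering can be illustrated with a simple source restructuring: instead of one potentially missing load per iteration, the loop body issues several independent loads to widely spaced addresses back-to-back, so an out-of-order processor (or a non-blocking cache) can overlap their misses. This is only a hand-written sketch of the idea, not the paper's compiler transformation.

```c
#include <stddef.h>

/* Clustered reduction: four independent load streams per iteration,
 * whose cache misses can be outstanding simultaneously.
 * Assumes n is divisible by 4 for brevity. */
long clustered_sum(const long *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t q = n / 4;
    for (size_t i = 0; i < q; i++) {
        s0 += a[i];          /* four independent loads: no data     */
        s1 += a[q + i];      /* dependence between them, so their   */
        s2 += a[2 * q + i];  /* misses overlap instead of serializing */
        s3 += a[3 * q + i];
    }
    return s0 + s1 + s2 + s3;
}
```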


Sponsored by IEEE TCCA, IEEE TCPP, ACM SIGARCH and IFIP Working Group 10.3

With generous institutional (UPC, CICYT and CIRIT) and corporate (Compaq, HP Labs, IBM Spain, IBM Research, Intel, Microsoft Research and SGI) support.

Edited and Designed by Eduard Ayguade