Abstracts and Papers
Proceedings front matter.
Session 1: Simulation and Modeling
Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. Tim Sherwood, Erez Perelman and Brad Calder (University of California, San Diego)
Abstract:
Modern architecture research relies heavily on detailed pipeline
simulation. Simulating the full execution of an industry-standard benchmark
can take weeks to months to complete. To overcome this problem, researchers
choose a very small portion of a program's execution to evaluate their
results, rather than simulating the entire program. In this paper we
propose Basic Block Distribution Analysis as an automated approach for
finding these small portions of the program to simulate that are representative
of the entire program's execution. This approach is based upon using
profiles of a program's code structure (basic blocks) to uniquely identify
different phases of execution in the program. We show that the periodicity
of the basic block frequency profile reflects the periodicity of detailed
simulation across several different architectural metrics (e.g., IPC,
branch miss rate, cache miss rate, value misprediction, address misprediction,
and reorder buffer occupancy). Since basic block frequencies can be
collected using very fast profiling tools, our approach provides a practical
technique for finding the periodicity and simulation points in applications.
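The core idea of the abstract above can be sketched in a few lines. This is a hypothetical simplification, not the authors' implementation: each execution interval is summarized as a basic block frequency vector, and a repeating distance pattern between interval vectors reveals the program's period. The interval data, vector size, and threshold below are illustrative assumptions.

```python
# Hypothetical sketch of basic-block-profile periodicity detection.
# Each interval of execution is summarized as a vector counting how
# often each static basic block ran; similar vectors mark recurring phases.

def bb_vector(block_ids, num_blocks):
    """Count executions of each static basic block in one interval."""
    v = [0] * num_blocks
    for b in block_ids:
        v[b] += 1
    return v

def manhattan(a, b):
    """Distance between two interval profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def find_period(vectors, threshold):
    """Return the smallest lag at which interval profiles repeat."""
    n = len(vectors)
    for lag in range(1, n):
        if all(manhattan(vectors[i], vectors[i + lag]) <= threshold
               for i in range(n - lag)):
            return lag
    return None

# Example: a trace whose basic-block behavior alternates every interval.
trace = [[0, 0, 1], [2, 2, 3], [0, 0, 1], [2, 2, 3]]
vecs = [bb_vector(iv, 4) for iv in trace]
print(find_period(vecs, 0))  # period of 2 intervals
```

An interval that matches the whole-program profile under the same distance metric would then be a candidate simulation point.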
Modeling Superscalar Processors via Statistical Simulation. Sebastien Nussbaum and James Smith (Dept. of Electrical and Computer Engineering, University of Wisconsin-Madison)
Abstract:
Statistical simulation is a technique for fast performance evaluation
of superscalar processors. First, intrinsic statistical information
is collected from a single detailed simulation of a program. This information
is then used to generate a synthetic instruction trace that is fed to
a simple processor model, along with cache and branch prediction statistics.
Because of the probabilistic nature of the simulation, it quickly converges
to a performance rate. The simplicity and simulation speed make it useful
for fast design space exploration; as such, it is a good complement
to conventional detailed simulation. The accuracy of this technique
is evaluated for different levels of modeling complexity. Both errors
and convergence properties are studied in detail. A simple instruction
model yields an average error of 8% compared with detailed simulation.
A more detailed instruction model reduces the error to 5% but requires
about three times as long to converge.
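A toy illustration of the statistical-simulation flow described above; the instruction mix, latencies, and trivial timing model here are our own assumptions, not the authors' model. A synthetic stream is sampled from a measured instruction-mix distribution and fed to a simple model until the IPC estimate converges.

```python
import random

# Toy statistical simulation: sample a synthetic instruction stream from
# a profiled instruction-mix distribution, then estimate IPC with a
# trivial single-issue timing model.

def synthetic_trace(mix, n, rng):
    """mix: {opcode: probability}; returns n sampled opcodes."""
    ops, probs = zip(*mix.items())
    return rng.choices(ops, weights=probs, k=n)

def simple_ipc(trace, latency):
    """One instruction at a time, each taking its nominal latency."""
    cycles = sum(latency[op] for op in trace)
    return len(trace) / cycles

mix = {"alu": 0.6, "load": 0.3, "branch": 0.1}   # from one detailed profile
latency = {"alu": 1, "load": 2, "branch": 1}      # assumed simple model
rng = random.Random(42)
est = simple_ipc(synthetic_trace(mix, 100_000, rng), latency)
print(round(est, 2))  # converges near 1/1.3, about 0.77
```

Because each sample is independent, the estimate converges quickly, which is exactly what makes the technique attractive for design space exploration.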
Hybrid Analytical-Statistical Modeling for Efficiently Exploring Architecture and Workload Design Spaces. Lieven Eeckhout and Koen De Bosschere (Department of Electronics and Information Systems, Ghent University)
Abstract:
Microprocessor design time and effort are becoming impractical due
to the huge number of simulations needed to evaluate various
processor configurations for various workloads. An early design stage
methodology could be useful to efficiently cull huge design spaces to
identify regions of interest to be further explored using more accurate
simulations. In this paper, we present an early design stage method
that bridges the gap between analytical and statistical modeling. The
hybrid analytical-statistical method presented here is based on the
observation that register traffic characteristics exhibit power-law
properties, which allows us to fully characterize a workload with just
a few parameters. This is much more efficient than the collection of
distributions that need to be specified in classical statistical modeling.
We evaluate the applicability and the usefulness of this hybrid analytical-statistical
modeling technique to efficiently and accurately cull huge architectural
design spaces. In addition, we demonstrate that this hybrid analytical-statistical
modeling technique can be used to explore the entire workload space
by varying just a few workload parameters.
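The power-law observation above is what collapses a whole distribution into a single knob. As a hedged illustration (our formulation, not the paper's model): if a workload characteristic follows P(k) proportional to k^(-alpha), the full histogram is recovered from the one parameter alpha.

```python
# Illustration of power-law workload characterization: one parameter
# (alpha) regenerates the entire distribution, replacing a measured
# histogram with thousands of bins.

def power_law_pmf(alpha, kmax):
    """P(k) proportional to k**(-alpha) for k = 1..kmax, normalized."""
    weights = [k ** -alpha for k in range(1, kmax + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pmf = power_law_pmf(2.0, 100)
print(round(pmf[0], 2))  # most of the probability mass sits at small k
```

Varying alpha then sweeps the workload space, which is the paper's point about exploring workloads with just a few parameters.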
Session 2: Efficient Caches
Filtering Techniques to Improve Trace-Cache Efficiency. Roni Rosner, Avi Mendelson and Ronny Ronen (Israel Design Center, Intel)
Abstract:
Trace-caches have been shown to effectively
increase the number of "useful" instructions that can be fetched into
the machine, thus enabling more instructions to be executed each cycle.
However, the trace cache has another important benefit that has received
less attention in recent research: reducing instruction decoding power,
which is particularly attractive for variable-length ISAs such as Intel's
IA-32 (x86) architecture. Keeping the instruction traces in decoded format
implies that decoding power is paid only when a trace is built, thus reducing
the overall power consumption of the system. We observe that (1) the
majority of traces that are inserted into the trace-cache are rarely
used again before being replaced; (2) the majority of the instructions
delivered for execution originate from the few traces that are heavily
and repeatedly used; and that (3) techniques that aim to improve instruction-fetch
bandwidth may increase the number of traces built during program execution.
Based on these observations, we propose splitting the trace cache into
two components: the filter trace-cache (FTC) and the main trace-cache
(MTC). Traces are first inserted into the FTC that is used to filter
out the infrequently used traces; traces that prove "useful" are later
moved into the MTC itself. The FTC/MTC organization exhibits an important
benefit: it decreases the number of traces built, thus reducing power
consumption while improving overall performance. For medium-size applications,
the FTC/MTC pair reduces the number of trace builds by 16% on average.
An extension of the filtering concept involves adding a second level
(L2) trace-cache that stores less frequent traces that are replaced
in the FTC or the MTC. The extra level of caching allows for an order-of-magnitude
reduction in the number of trace builds. The second-level trace cache proves
particularly useful for applications with large instruction footprints.
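The FTC/MTC filtering policy can be sketched as follows. This is our own minimal simplification (direct promotion counters, no capacity or replacement modeling), not Intel's design: new traces enter the filter cache, and only traces that prove reusable are promoted to the main cache, so one-shot traces never pollute it.

```python
# Minimal sketch of filter/main trace-cache promotion. A "build" is the
# expensive, power-hungry event the FTC/MTC split tries to minimize.

class FilteredTraceCache:
    def __init__(self, promote_after=2):
        self.ftc = {}                    # trace_id -> use count (filter)
        self.mtc = set()                 # promoted, frequently used traces
        self.builds = 0
        self.promote_after = promote_after

    def fetch(self, trace_id):
        if trace_id in self.mtc:
            return "mtc hit"
        if trace_id in self.ftc:
            self.ftc[trace_id] += 1
            if self.ftc[trace_id] >= self.promote_after:
                self.mtc.add(trace_id)   # trace proved "useful": promote
                del self.ftc[trace_id]
            return "ftc hit"
        self.builds += 1                 # trace build on first encounter
        self.ftc[trace_id] = 1
        return "build"

tc = FilteredTraceCache()
for t in ["A", "B", "A", "A", "C", "A"]:
    tc.fetch(t)
print(tc.builds)  # 3 builds: A, B, C; only A earns promotion to the MTC
```

A second-level trace cache, as in the abstract's extension, would catch traces evicted from either structure instead of forcing a rebuild.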
Reactive-Associative Caches. Brannon Batson (1) and T. Vijaykumar (2). (1) Compaq and (2) Purdue University
Abstract:
While set-associative caches typically incur fewer misses than direct-mapped
caches, set-associative caches have slower hit times. We propose the
reactive-associative cache (r-a cache), which provides flexible associativity
by placing most blocks in direct-mapped positions and reactively displacing
only conflicting blocks to set-associative positions. The r-a cache
uses way-prediction (like the predictive associative cache, PSA) to
access displaced blocks on the initial probe. Unlike PSA, however, the
r-a cache employs a novel feedback mechanism to prevent unpredictable
blocks from being displaced. Reactive displacement and feedback allow
the r-a cache to use a novel PC-based way-prediction and achieve high
accuracy, without the impractical block swapping of column-associative
and group-associative caches, and without relying on timing-constrained
XOR way-prediction. A one-port, 4-way r-a cache achieves up to 9% speedup
over a direct-mapped cache and performs within 2% of an idealized 2-way
set-associative, 1-cycle cache. A 4-way r-a cache achieves up to 13%
speedup over a PSA cache, with both r-a and PSA using the PC scheme.
CACTI estimates that for sizes larger than 8KB, a 4-way r-a cache is
within 1% of direct-mapped hit times, and 24% faster than a 2-way set-associative
cache.
Adaptive Mode Control: A Static-Power-Efficient Cache Design. Huiyang Zhou, Mark Toburen, Eric Rotenberg and Thomas Conte (North Carolina State University)
Abstract:
Lower threshold voltages in deep sub-micron
technologies cause more leakage current, increasing static power dissipation.
This trend, combined with the trend of larger/more cache memories dominating
die area, has prompted circuit designers to develop SRAM cells with
low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static
power dissipation but data stored in a sleeping cell is unreliable or
lost. So, at the architecture level, there is interest in exploiting
sleep mode to reduce static power dissipation while maintaining high
performance. Current approaches dynamically control the operating mode
of large groups of cache lines or even individual cache lines. However,
the performance monitoring mechanism that controls the percentage of
sleep-mode lines, and identifies particular lines for sleep mode, is
somewhat arbitrary. There is no way to know what the performance could
be with all cache lines active, so arbitrary miss rate targets are set
(perhaps on a per-benchmark basis using profile information) and the
control mechanism tracks these targets. We propose applying sleep mode
only to the data store and not the tag store. By keeping the entire
tag store active, the hardware knows what the hypothetical miss rate
would be if all data lines were active and the actual miss rate can
be made to precisely track it. Simulations show an average of 73% of
I-cache lines and 54% of D-cache lines are put in sleep mode with an
average IPC impact of only 1.7%, for 64KB caches.
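The key mechanism above, an always-active tag store, can be sketched in a few lines. This is a sketch under our own assumptions (no evictions or capacity modeled), not the paper's controller: because tags stay powered, a tag hit on a sleeping data line is distinguishable from a true miss, so the controller knows exactly how many misses sleep mode adds.

```python
# Sketch of the adaptive-mode-control observation: keep the tag store
# active so "sleep misses" (tag hit, data asleep) are separable from
# misses a fully active cache would also take.

class AMCCache:
    def __init__(self):
        self.tags = set()          # tag store: always active
        self.awake = set()         # lines whose data store is powered on
        self.true_misses = 0       # misses even a fully active cache takes
        self.sleep_misses = 0      # extra misses caused only by sleep mode

    def access(self, tag):
        if tag not in self.tags:
            self.true_misses += 1
            self.tags.add(tag)
            self.awake.add(tag)
        elif tag not in self.awake:
            self.sleep_misses += 1  # hypothetical hit lost to sleep mode
            self.awake.add(tag)

c = AMCCache()
c.access("x")
c.awake.discard("x")   # mode controller puts the line's data store to sleep
c.access("x")          # tag still hits: miss attributed to sleep mode
print(c.true_misses, c.sleep_misses)  # 1 1
```

With `sleep_misses` measured directly, the controller can track the hypothetical all-active miss rate instead of an arbitrary per-benchmark target.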
Session 3: Specialized Instruction Sets
Implementation and Evaluation of the Complex Streamed Instruction Set. Ben Juurlink (1), Dmitri Tcheressiz (2), Stamatis Vassiliadis (1), Harry Wijshoff (2). (1) Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, Delft. (2) Department of Computer Science, Leiden University, Leiden.
Abstract:
An architectural paradigm designed to accelerate
streaming operations on mixed-width data is presented and evaluated.
The described Complex Streamed Instruction (CSI) set contains instructions
that process data streams of arbitrary length. The number of bits or
elements that will be processed in parallel is, therefore, not visible
to the programmer, so no recompilation is needed in order to benefit
from a wider datapath. CSI also eliminates many overhead instructions
(such as instructions needed for data alignment and reorganization)
often needed in applications utilizing media ISA extensions such as
MMX and VIS, by replacing them with a hardware mechanism. Simulation results
using several multimedia kernels demonstrate that CSI provides a factor
of up to 9.9 (4.0 on average) performance improvement when compared
to Sun's VIS extension. For complete applications, the performance gain
is 9% to 36% with an average of 20%.
On the Efficiency of Reductions in micro-SIMD Media Extensions. Jesus Corbal, Roger Espasa, and Mateo Valero (Computer Architecture Department, UPC)
Abstract:
Many important multimedia applications contain
a significant fraction of reduction operations. Although, in general,
multimedia applications are characterized by high amounts of
Data Level Parallelism, reductions and accumulations are difficult to
parallelize and show a poor tolerance to increases in the latency of
the instructions. This is especially significant for micro-SIMD extensions
such as MMX or AltiVec. To overcome the problem of reductions in micro-SIMD
ISAs, designers tend to include ever more complex instructions able
to deal with the most common forms of reductions in multimedia. As
the number of processor pipeline stages grows, the number of cycles
needed to execute these multimedia instructions increases with every
processor generation, severely compromising performance. This paper
presents an in-depth discussion of how reductions/accumulations are
performed in current micro-SIMD architectures and evaluates the performance
trade-offs for near-future, highly aggressive superscalar processors
with three different styles of micro-SIMD extensions. We compare an MMX-like
alternative to an MDMX-like extension that has packed accumulators to
attack the reduction problem, and we also compare it to MOM, a matrix
register ISA. We show that while packed accumulators present several
advantages, they introduce artificial recurrences that severely degrade
performance for processors with a high number of registers and long-latency
operations. On the other hand, this paper demonstrates that longer SIMD
media extensions such as MOM can take great advantage of accumulators
by exploiting the associative parallelism implicit in reductions.
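Why accumulators help reductions can be seen in a scalar analogue. This illustration is ours, not MDMX's actual semantics: instead of one serial accumulation that chains every addition, each SIMD lane keeps its own partial sum, and a single cross-lane reduction happens only once at the end, exploiting the associativity mentioned above.

```python
# Scalar analogue of per-lane (packed) accumulation: independent partial
# sums break the serial dependence chain of a naive reduction, leaving
# one cross-lane combine at loop exit.

def lane_reduce(data, lanes):
    partial = [0] * lanes
    for i, x in enumerate(data):
        partial[i % lanes] += x     # each lane accumulates independently
    return sum(partial)             # final cross-lane reduction

print(lane_reduce(range(10), 4))  # 45, same result as a serial sum
```

The abstract's caveat also shows up here: each lane's accumulator is itself a recurrence, so long-latency accumulate instructions still serialize within a lane.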
Session 4: Prediction and Recovery
Boolean Formula-based Branch Prediction for Future Technologies. Daniel Jimenez (1), Heather Hanson (2) and Calvin Lin (1). (1) Department of Computer Sciences, The University of Texas at Austin. (2) Department of Electrical & Computer Engineering, The University of Texas at Austin.
Abstract:
We present a new method for branch prediction
that encodes in the branch instruction a formula, chosen by profiling,
that is used to perform history-based prediction. By using a special
class of Boolean formulas, our encoding is extremely concise. By replacing
the large tables found in current predictors with a small, fast circuit,
our scheme is ideally suited to future technologies that will have large
wire delays. In a projected 70 nm technology and an aggressive clock
rate of about 5 GHz, an implementation of our method that uses an 8-bit
formula encoding has a misprediction rate of 6.0%, 42% lower than that
of the best gshare predictor implementable in that same technology.
In today's technology, a 16-bit version of our predictor can replace
bias bits in an 8K-entry agree predictor to achieve a 2.86% misprediction
rate, which is slightly lower than the 2.93% misprediction rate of the
Alpha 21264 hybrid predictor, even though the Alpha predictor has almost
twice the hardware budget. Our predictor also consumes much less power
than table-based predictors. This paper describes our predictor, explains
our profiling algorithm, and presents experimental results using the
SPEC 2000 integer benchmarks.
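The shape of the scheme above can be sketched concretely. The encoding below is a hypothetical example of ours, not the paper's formula class: each static branch carries a small profiled Boolean formula over global-history bits, and prediction is just formula evaluation by a tiny circuit, with no large table lookup.

```python
# Hypothetical sketch of formula-based branch prediction: a branch's
# encoded formula (here, a simple conjunction of possibly negated
# history bits) is evaluated over the global history register.

def predict(formula, history):
    """formula: list of (bit_index, negate) literals, ANDed together."""
    return all(history[i] != neg for i, neg in formula)

# "Predict taken when history bit 0 is 1 AND history bit 2 is 0",
# as a profiler might have chosen for one static branch.
formula = [(0, False), (2, True)]
print(predict(formula, [1, 1, 0]))  # True  (taken)
print(predict(formula, [1, 1, 1]))  # False (not taken)
```

An n-bit encoding of such literals fits in the branch instruction itself, which is what removes the wire-delay-bound table from the critical path.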
Using Dataflow Based Context for Accurate Value Prediction. Renju Thomas and Manoj Franklin (University of Maryland)
Abstract:
We explore the reasons behind the rather
low prediction accuracy of existing data value predictors. Our studies
show that contexts formed only from the outcomes of the last several
instances of a static instruction do not always encapsulate all of the
information required for correct prediction. Complex interactions between
data flow and control flow change the context in ways that result in
predictability loss for a significant number of dynamic instructions.
To improve prediction accuracy, we propose the concept of using
contexts derived from the predictable portions of the data flow graph.
That is, the predictability of hard-to-predict instructions can be improved
by taking advantage of the predictability of the easy-to-predict instructions
that precede them in the data flow graph. We propose and investigate a
run-time scheme for producing such an improved context from the predicted
values of previous instructions. We also propose a novel predictor, based
on dynamic dataflow-inherited speculative contexts (DDISC), specifically
for predicting hard-to-predict instructions. Simulation
results verify that the use of dataflow-based contexts yields significant
improvements in prediction accuracies, ranging from 35% to 99%. This
translates to an overall prediction accuracy of 68% to 99.9%.
Recovery Mechanism for Latency Misprediction. Enric Morancho, Jose Maria Llaberia and Angel Olive (Computer Architecture Department, UPC)
Abstract:
Signalling result availability from the functional
units to the instruction scheduler can increase the cycle time and/or
the effective latency of the instructions. Knowledge of all instruction
latencies would allow the instruction scheduler to operate without the
need for external signalling. However, the latency of some instructions
is unknown; still, the scheduler can optimistically predict the latency
of these instructions and speculatively issue their dependent instructions.
Although prediction techniques have great performance potential, their
gain can vanish due to misprediction handling. For instance, holding
speculatively scheduled instructions in the issue queue reduces its
capacity to look ahead for independent instructions. This paper evaluates
a recovery mechanism for latency mispredictions that retains the speculatively
issued instructions in a structure apart from the issue queue: the recovery
buffer. When data becomes available after a latency misprediction, the
dependent instructions will be re-issued from the recovery buffer. Moreover,
in order to simplify the re-issue logic of the recovery buffer, the
instructions will be recorded in issue order. On mispredictions, the
recovery buffer increases the effective capacity of the issue queue
to hold instructions waiting for operands. Our evaluations on integer
benchmarks show that the recovery-buffer mechanism reduces issue-queue
size requirements by about 20-25%. Also, this mechanism is less sensitive
to the verification delay than the recovery mechanism that retains the
instructions in the issue queue.
Session 5: Memory Optimization
A Cost Framework for Evaluating Integrated Restructuring Optimizations. Bharat Chandramouli, John Carter, Wilson Hsieh and Sally McKee (University of Utah)
Abstract:
Loop transformations and array restructuring
optimizations usually improve performance by increasing the memory locality
of applications, but not always. For instance, loop and array restructuring
can either complement or compete with one another. Previous research
has proposed integrating loop and array restructuring, but no analytic
framework existed for determining how best to combine the optimizations
for a given program. Since the choice of which optimizations to apply,
alone or in combination, is highly application- and input-dependent,
a cost framework is needed if integrated restructuring is to be automated
by an optimizing compiler. To this end, we develop a cost model that
considers standard loop optimizations along with two potential forms
of array restructuring: conventional copying-based restructuring and
remapping-based restructuring that exploits a smart memory controller.
We simulate eight applications on a variety of input sizes and with
a variety of hand-applied restructuring optimizations. We find that
employing a fixed strategy does not always deliver the best performance.
Finally, our cost model accurately predicts the best combination of
restructuring optimizations among those we examine, and yields performance
within a geometric mean of 5% of the best combination across all benchmarks
and input sizes.
Compiling for the Impulse Memory Controller. Xianglong Huang, Zhenlin Wang and Kathryn McKinley (Computer Science Dept., University of Massachusetts, Amherst)
Abstract:
The Impulse memory controller provides an
interface for remapping irregular or sparse memory accesses into dense
accesses in the cache memory. This capability significantly increases
processor cache and system bus utilization, and previous work shows
performance improvements from a factor of 1.2 to 5 with current technology
models for hand-coded kernels in a cycle-level simulator. Attaining
widespread use of any specialized hardware feature requires automating
its use in a compiler. In this paper, we present compiler cost models
using dependence and locality analysis that determine when to use Impulse
to improve performance based on the reduction in misses, the additional
cost for misses in Impulse, and the fixed cost for setting up a remapping.
We implement the cost models and generate the appropriate Impulse system
calls in the Scale compiler framework. Our results demonstrate that
our cost models correctly choose when and when not to use Impulse. We
also combine and compare Impulse with our implementation of loop permutation
for improving locality. If loop permutation can achieve the same dense
access pattern as Impulse, we prefer it, since it has no overheads,
but we show that the combination can yield better performance.
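The decision the cost models make can be written as a simple inequality. The formulation and all the parameter names below are our own illustration of the three cost terms the abstract lists (reduced misses, extra per-miss cost under Impulse, fixed remapping setup cost), not the Scale compiler's actual model.

```python
# Hedged sketch of the compiler's use-Impulse decision: remap only when
# the cycles saved by fewer misses outweigh the higher per-miss cost of
# remapped accesses plus the one-time remapping setup cost.

def use_impulse(misses_orig, misses_impulse, miss_cost,
                impulse_extra_miss_cost, setup_cost):
    saved = misses_orig * miss_cost
    spent = (misses_impulse * (miss_cost + impulse_extra_miss_cost)
             + setup_cost)
    return saved > spent

# Illustrative numbers: remapping turns 10,000 scattered misses into
# 1,000 dense ones, at 30 extra cycles per Impulse miss and a 50,000
# cycle setup cost.
print(use_impulse(10_000, 1_000, 100, 30, 50_000))  # True
```

The same inequality explains the abstract's preference for loop permutation when it achieves the same dense pattern: permutation has zero setup and zero extra per-miss cost.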
On the Stability of Temporal Data Reference Profiles. Trishul Chilimbi (Microsoft Research)
Abstract:
Growing computer system complexity has made
program optimization based solely on static analyses increasingly difficult.
Consequently, many code optimizations incorporate information from program
execution profiles. Most memory system optimizations go further and
rely primarily on profiles. This reliance on profiles makes off-line
optimization effectiveness dependent on profile stability across multiple
program runs. While code profiles, such as basic block, edge, and branch
profiles, have been shown to satisfy this requirement, the stability
of data reference profiles, especially temporal data reference profiles
that are necessary for cache-level optimizations, has neither been studied
nor established. This paper shows that temporal data reference profiles
expressed in terms of hot data streams, which are data reference sequences
that frequently repeat, are quite stable; an encouraging result for
memory optimization research. Most hot data streams belong to one of
two categories: those that appear in multiple runs with their data elements
referenced in the same order, and those with the same set of elements
referenced in a different order. This category membership is extremely
stable. In addition, the fraction of hot data streams that belong to
the first category is quite large.
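A hot data stream, as defined above, is a data reference sequence that frequently repeats. A crude way to surface them, simplified from the paper's actual extraction machinery and using fixed-length windows of our own choosing, is to count repeated subsequences of the reference trace:

```python
from collections import Counter

# Simplified hot-data-stream extraction: count fixed-length windows of
# the data reference trace and keep those that repeat often enough.

def hot_streams(refs, length, min_count):
    windows = Counter(tuple(refs[i:i + length])
                      for i in range(len(refs) - length + 1))
    return {w for w, c in windows.items() if c >= min_count}

refs = ["a", "b", "c", "a", "b", "c", "x", "a", "b", "c"]
print(hot_streams(refs, 3, 3))  # {('a', 'b', 'c')} repeats three times
```

Stability, in the paper's sense, means the set returned for one run's trace largely reappears, with the same or permuted element order, in other runs' traces.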
Session 6: Program Optimization
Code Reordering and Speculation Support for Dynamic Optimization Systems. Erik Nystrom, Ronald Barnes, Matthew Merten and Wen-mei Hwu (University of Illinois)
Abstract:
For dynamic optimization systems, success
is limited by two difficult problems arising from instruction reordering.
Following optimization within and across basic block boundaries, both
the ordering of exceptions and the observed processor register contents
at each exception point must be consistent with the original code. While
compilers traditionally utilize global data-flow analysis to determine
which registers require preservation, this analysis is often infeasible
in dynamic optimization systems due to both strict time/space constraints
and incomplete code discovery. This paper presents an approach called
Precise Speculation that addresses these problems. The proposed mechanism
is a component of our vision for Run-time Optimization ARchitecture,
or ROAR, to support aggressive dynamic optimization of programs. It
utilizes a hardware mechanism to automatically recover the precise register
state when a deferred exception is reported, using the original
unoptimized code to perform all recovery. We observe that Precise Speculation
enables a dynamic optimization system to achieve a large performance
gain over aggressively optimized base code, while preserving precise
exceptions. For an 8-issue EPIC processor, the dynamic optimizer achieves
between 3.6% and 57% speedup over a full-strength optimizing compiler
that employs profile-guided optimization.
A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors. Josep Codina, Jesus Sanchez and Antonio Gonzalez (Computer Architecture Department, UPC)
Abstract:
This work presents a modulo scheduling framework
for clustered ILP processors that integrates the cluster assignment,
instruction scheduling and register allocation steps in a single phase.
This unified approach is more effective than traditional approaches
based on sequentially performing some (or all) of the three steps, since
it allows optimizing the global code generation problem instead of searching
for optimal solutions to each individual step. Besides, it avoids the
iterative nature of traditional approaches, which require repeated applications
of the three steps until a valid solution is found. The proposed framework
includes a mechanism to insert spill code on-the-fly and heuristics
to evaluate the quality of partial schedules considering simultaneously
inter-cluster communications, memory pressure and register pressure.
Transformations that allow trading pressure on one type of resource for
another are also included. We show that the proposed technique
outperforms previously proposed techniques. For instance, the average
speed-up for SPECfp95 is 36% for a 4-cluster configuration.
Cache-Friendly Implementations of Transitive Closure. Michael Penner and Viktor Prasanna (University of Southern California)
Abstract:
In this paper we show cache-friendly
implementations of the Floyd-Warshall algorithm for the All-Pairs Shortest-Path
problem. We first compare the best commercial compiler optimizations
available with standard cache-friendly optimizations and a simple improvement
involving a block layout, which reduces TLB misses. We show approximately
15% improvements using these optimizations. We also develop a general
representation, the Unidirectional Space Time Representation, which
can be used to generate cache-friendly implementations for a large class
of algorithms. We show analytically and experimentally that this representation
can be used to minimize level-1 and level-2 cache misses and TLB misses
and therefore exhibits the best overall performance. Using this representation
we show a 2x improvement in performance with respect to the compiler-optimized
implementation. Experiments were conducted on Pentium III,
Alpha, and MIPS R12000 machines using problem sizes between 1024 and
2048 vertices. We used the Simplescalar simulator to demonstrate improved
cache performance.
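For reference, the baseline kernel that the cache-friendly variants reorganize is the classic Floyd-Warshall triple loop below. This is the textbook algorithm, not the paper's blocked layout or Unidirectional Space Time Representation, which we do not reproduce here; the cache-friendly versions compute the same result while tiling these loops so the working set stays resident.

```python
# Classic (unblocked) Floyd-Warshall all-pairs shortest paths:
# dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j]) for each k.
# The paper's optimizations restructure this k-i-j traversal into
# cache- and TLB-friendly blocked orders without changing the result.

def floyd_warshall(dist):
    n = len(dist)
    for k in range(n):
        dk = dist[k]
        for i in range(n):
            di = dist[i]
            dik = di[k]
            for j in range(n):
                if dik + dk[j] < di[j]:
                    di[j] = dik + dk[j]
    return dist

INF = float("inf")
g = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
print(floyd_warshall(g))  # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]
```

The poor locality is visible in the inner loops: row k and row i are both streamed for every k, which is exactly what blocking and the block data layout address.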
Session 7: Technology Implications
Exploring the Design Space of Future CMPs. Jaehyuk Huh, Doug Burger and Stephen Keckler (University of Texas at Austin)
Abstract:
In this paper, we study the space of chip
multiprocessor (CMP) organizations. We compare the area and performance
trade-offs for CMP implementations to determine how many processing
cores future server CMPs should have, whether the cores should have
in-order or out-of-order issue, and how big the per-processor on-chip
caches should be. We find that, contrary to some conventional wisdom,
out-of-order processing cores will maximize job throughput on future
CMPs. As technology shrinks, limited off-chip bandwidth will begin to
curtail the number of cores that can be effective on a single die. Current
projections show that the transistor/signal pin ratio will increase
by a factor of 45 between 180 and 35 nanometer technologies. That disparity
will force increases in per-processor cache capacities as technology
shrinks, from 128KB at 100nm, to 256KB at 70nm, and to 1MB at 50 and
35nm, reducing the number of cores that would otherwise be possible.
Area and System Clock Effects on SMT/CMP Processors. James Burns (Intel) and Jean-Luc Gaudiot (USC)
Abstract:
Two approaches to high throughput processors
are Chip Multi-Processing (CMP) and Simultaneous Multi-Threading (SMT).
CMP increases layout efficiency, which allows more functional units
and a faster clock rate. However, CMP suffers from hardware partitioning
of functional resources. SMT increases functional unit utilization by
issuing instructions simultaneously from multiple threads. However,
a wide-issue SMT suffers from layout and technology implementation problems.
We use silicon resources as our basis for comparison and find that area
and system clock have a large effect on the optimal SMT/CMP design trade-off.
We show the area overhead of SMT on each processor and how it scales
with the width of the processor pipeline and the number of SMT threads.
The wide-issue SMT delivers the highest single-thread performance with
improved multi-thread throughput. However, multiple smaller cores deliver
the highest overall throughput.
Session 8: Parallel Machines
Limits on Speculative Module-level Parallelism in Imperative and Object-oriented Programs on CMP Platforms. Fredrik Warg and Per Stenstrom (Chalmers University of Technology)
Abstract:
This paper considers program modules, e.g.,
procedures, functions, and methods, as the basic unit for exploiting speculative
parallelism in existing codes. We analyze how much inherent and exploitable
parallelism exists in a set of C and Java programs on a set of chip-multiprocessor
architecture models, and identify the inherent program features, as
well as architectural deficiencies, that limit the speedup. Our data
complement previous limit studies by indicating that the programming
style -- object-oriented versus imperative -- does not seem to have
any noticeable impact on the achievable speedup. Further, we show that
as few as eight processors are enough to exploit all of the inherent
parallelism. However, memory-level data dependence resolution and thread
management mechanisms of recent CMP proposals may impose overheads that
severely limit the speedup obtained.
Compiler and Runtime Analysis for Efficient Communication in Data Intensive Applications. Renato Ferreira (1), Gagan Agrawal (2) and Joel Saltz (1). (1) University of Maryland, (2) University of Delaware
Abstract:
Processing and analyzing large volumes of
data plays an increasingly important role in many domains of scientific
research. We are developing a compiler that processes data intensive
applications written in a dialect of Java and compiles them for efficient
execution on distributed memory parallel machines. In this paper, we
focus on the problem of generating correct and efficient communication
for data intensive applications. We present static analysis techniques
for 1) extracting a global reduction function from a data parallel loop,
and 2) determining if a subscript function is monotonic. We also present
a runtime technique for reducing the volume of communication during
the global reduction phase. We have experimented with two data intensive
applications to evaluate the efficacy of our techniques. Our results
show that 1) our techniques for extracting global reduction functions
and establishing monotonicity of subscript functions can successfully
handle these applications, 2) significant reductions in communication
volume and execution times are achieved through our runtime analysis
technique, and 3) runtime communication analysis is critical for achieving
speedups on parallel configurations.
Architectural Support for Parallel Reductions in Scalable Shared-Memory Multiprocessors. Maria Jesus Garzaran (1), Milos Prvulovic (2), Ye Zhang (2), Alin Jula (3), Hao Yu (3), Lawrence Rauchwerger (3) and Josep Torrellas (2). (1) Universidad de Zaragoza, Spain, (2) University of Illinois at Urbana-Champaign, (3) Texas A&M University
Abstract:
Reductions are important and time-consuming
operations in many scientific codes. Effective parallelization of reductions
is a critical transformation for loop parallelization, especially for
sparse, dynamic applications. Unfortunately, conventional reduction
parallelization algorithms are not scalable. In this paper, we present
new architectural support that significantly speeds up parallel reductions
and makes them scalable in shared-memory multiprocessors. The required
architectural changes are mostly confined to the directory controllers.
Experimental results based on simulations show that the proposed support
is very effective. While conventional software-only reduction parallelization
delivers average speedups of only 2.7 for 16 processors, our scheme
delivers average speedups of 7.6.
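The conventional software-only scheme that the abstract compares against can be sketched as privatization plus a final merge. This is our own minimal version, not the paper's code: each thread accumulates into a private copy of the reduction variable, and the copies are combined afterwards. The merge step is what the proposed directory-controller support accelerates.

```python
from threading import Thread

# Conventional software reduction parallelization: privatize the
# accumulator per thread, then merge the private copies at the end.
# The final merge (and the memory traffic it implies) is the
# scalability bottleneck the paper's hardware support targets.

def parallel_sum(data, nthreads):
    private = [0] * nthreads                 # one private accumulator each

    def work(t):
        private[t] = sum(data[t::nthreads])  # strided slice of the work

    threads = [Thread(target=work, args=(t,)) for t in range(nthreads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sum(private)                      # cross-thread merge phase

print(parallel_sum(list(range(100)), 4))  # 4950
```

For sparse, dynamic applications the private copies can be as large as the reduction array itself, which is why the merge cost grows with processor count and the software scheme stops scaling.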
Session 9: Data Prefetching
Optimizing Software Data Prefetches with Rotating Registers. Gautam Doshi, Rakesh Krishnaiyer and Kalyan Muthukumar (Intel Corporation)
Abstract:
Software data prefetching is a well-known
technique to improve the performance of programs that suffer many cache
misses at several levels of the memory hierarchy. However, it has significant
overhead in terms of increased code size, additional instructions, and
possibly increased memory bus traffic due to redundant prefetches. This
paper presents two novel methods to reduce the overhead of software
data prefetching and improve the program performance by optimized prefetch
scheduling. These methods exploit the availability of rotating registers
and predication in architectures such as the Itanium(TM) architecture.
The methods (1) minimize redundant prefetches, (2) reduce the number
of issue slots needed for prefetch instructions, and (3) avoid branch
mispredict penalties - all with minimal code size increase. Compared
to traditional data prefetching techniques, these methods (i) do not
require loop unrolling, (ii) do not require predicate computations and
(iii) require fewer machine resources. One of these methods has been
implemented in the Intel Production Compiler for the Itanium(TM) processor.
This technique is compared with traditional approaches for software
prefetching and experimental results are presented based on the floating-point
benchmark suite of CPU2000.
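The conventional baseline the paper optimizes can be sketched in portable C. The example below is an illustrative assumption, not the paper's Itanium implementation: it uses the GCC/Clang `__builtin_prefetch` intrinsic and an arbitrary prefetch distance, and it exhibits exactly the overheads the abstract lists (an extra issue slot every iteration, and redundant prefetches to the same cache line):

```c
#include <stddef.h>

#define PF_DIST 16   /* prefetch distance in elements: an illustrative tuning choice */

/* Sum an array while prefetching PF_DIST elements ahead of the demand
 * stream.  This is conventional software prefetching: every iteration
 * spends an issue slot on the prefetch, and consecutive elements on the
 * same cache line generate redundant prefetches. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* keep cached */);
        s += a[i];
    }
    return s;
}

/* Small driver summing 0..99. */
long demo_sum(void) {
    long a[100];
    for (int i = 0; i < 100; i++)
        a[i] = i;
    return sum_with_prefetch(a, 100);   /* 0 + 1 + ... + 99 = 4950 */
}
```

The rotating-register methods in the paper remove these per-iteration costs without the loop unrolling or predicate computations that traditional fixes require.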
Multi-Chain
Prefetching: Effective Exploitation of Inter-Chain Memory Parallelism
for Pointer-Chasing Codes. Nicholas Kohout (1), Seungryul Choi (2),
Dongkeun Kim (3), Donald Yeung (3). (1) Intel Corp., (2) Department
of Computer Science, University of Maryland at College Park, (3) Department
of Electrical and Computer Engineering, University of Maryland at College
Park
Abstract:
Pointer-chasing applications tend to traverse
composed data structures consisting of multiple independent pointer
chains. While the traversal of any single pointer chain leads to the
serialization of memory operations, the traversal of independent pointer
chains provides a source of memory parallelism. This paper presents
multi-chain prefetching, a technique that utilizes offline analysis
and a hardware prefetch engine to prefetch multiple independent pointer
chains simultaneously, thus exploiting inter-chain memory parallelism
for the purpose of memory latency tolerance. This paper makes three
contributions. First, we introduce a scheduling algorithm that identifies
independent pointer chains in pointer-chasing codes and computes a prefetch
schedule that overlaps serialized cache misses across separate chains.
Our analysis focuses on static traversals. We also propose using speculation
to identify independent pointer chains in dynamic traversals. Second,
we present the design of a prefetch engine that traverses pointer-based
data structures and overlaps multiple pointer chains according to our
scheduling algorithm. Finally, we conduct an experimental evaluation
of multi-chain prefetching and compare its performance against two existing
techniques, jump pointer prefetching and prefetch arrays. Our results
show multi-chain prefetching improves execution time by 40% across six
pointer-chasing kernels from the Olden benchmark suite, and by 8% across
four SPECInt CPU2000 benchmarks. Multi-chain prefetching also outperforms
jump pointer prefetching and prefetch arrays by 28% on Olden, and by
12% on SPECInt. Furthermore, speculation can enable multi-chain prefetching
for some dynamic traversal codes, but our technique loses its effectiveness
when the pointer-chain traversal order is unpredictable. Finally, we
also show that combining multi-chain prefetching with prefetch arrays
can potentially provide higher performance than either technique alone.
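The core idea, overlapping the serialized misses of one chain with independent misses from other chains, can be illustrated in software. The paper uses a hardware prefetch engine driven by a computed schedule, so the round-robin traversal below is only an illustrative analogue (using the GCC/Clang `__builtin_prefetch` intrinsic):

```c
#include <stddef.h>

struct node { long val; struct node *next; };

/* Traverse nchains independent lists in lock-step.  Within one chain the
 * loads serialize (each next pointer depends on the previous load), but
 * across chains the misses are independent, so prefetching every chain's
 * next node each round lets the misses overlap. */
long sum_chains(struct node *head[], int nchains) {
    long s = 0;
    int live = nchains;
    while (live > 0) {
        live = 0;
        for (int c = 0; c < nchains; c++) {
            struct node *n = head[c];
            if (!n)
                continue;
            if (n->next)
                __builtin_prefetch(n->next, 0, 3);   /* overlap the next miss */
            s += n->val;
            head[c] = n->next;
            live++;
        }
    }
    return s;
}

/* Small driver: two independent chains, 1->2->3 and 10->20. */
long demo_chains(void) {
    static struct node a2 = {3, 0}, a1 = {2, &a2}, a0 = {1, &a1};
    static struct node b1 = {20, 0}, b0 = {10, &b1};
    struct node *heads[2] = {&a0, &b0};
    return sum_chains(heads, 2);   /* 1+2+3+10+20 = 36 */
}
```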
Data
Flow Analysis for Software Prefetching Linked Data Structures in Java.
Brendon Cahoon and Kathryn McKinley (University of Massachusetts)
Abstract:
In this paper, we describe an effective compile-time analysis for
software prefetching in Java. Previous work in software data prefetching
for pointer-based codes uses simple compiler algorithms and does not
investigate prefetching for object-oriented language features that make
compile-time analysis difficult. We develop a new data flow analysis
to detect regular accesses to linked data structures in Java programs.
We use intra- and interprocedural analysis to identify profitable prefetching
opportunities for greedy and jump-pointer prefetching, and we implement
these techniques in a compiler for Java. Our results show that both
prefetching techniques improve the performance of four of our ten programs. The largest
performance improvement is 48% with jump-pointers, but consistent improvements
are difficult to obtain.
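For reference, jump-pointer prefetching (one of the two schemes the analysis targets) augments each node with a pointer several links ahead, giving a later traversal a prefetch address without chasing the intermediate links. The sketch below is a minimal C analogue under that assumption; the paper itself targets Java, and `__builtin_prefetch` is a GCC/Clang intrinsic:

```c
#include <stddef.h>

enum { JUMP_DIST = 3 };   /* how far ahead each jump pointer reaches (illustrative) */

struct jnode { long val; struct jnode *next; struct jnode *jump; };

/* One-time pass that installs jump pointers: node i points JUMP_DIST
 * nodes ahead of itself (NULL near the end of the list). */
void install_jump_pointers(struct jnode *head) {
    struct jnode *lead = head;
    for (int i = 0; i < JUMP_DIST && lead; i++)
        lead = lead->next;
    for (struct jnode *n = head; n; n = n->next) {
        n->jump = lead;
        if (lead)
            lead = lead->next;
    }
}

/* Traversal that prefetches through the jump pointer, hiding the miss
 * latency of the node JUMP_DIST links ahead. */
long sum_with_jump_prefetch(struct jnode *head) {
    long s = 0;
    for (struct jnode *n = head; n; n = n->next) {
        if (n->jump)
            __builtin_prefetch(n->jump, 0, 3);
        s += n->val;
    }
    return s;
}

/* Driver: list 1->2->3->4->5. */
long demo_jump(void) {
    static struct jnode n4 = {5, 0, 0}, n3 = {4, &n4, 0}, n2 = {3, &n3, 0},
                        n1 = {2, &n2, 0}, n0 = {1, &n1, 0};
    install_jump_pointers(&n0);
    return sum_with_jump_prefetch(&n0);   /* 1+2+3+4+5 = 15 */
}
```

The compiler analysis in the paper decides where installing and using such pointers is profitable in the presence of object-oriented language features.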
Comparing
and Combining Read Miss Clustering and Software Prefetching. Vijay
Pai (1) and Sarita Adve (2). (1) Rice University, (2) University of
Illinois 
Abstract:
A recent latency tolerance technique, read
miss clustering, restructures code to send demand miss references in
parallel to the underlying memory system. An alternate, widely-used
latency tolerance technique is software prefetching, which initiates
data fetches ahead of expected demand miss references by a certain distance.
Since both techniques seem to target the same types of latencies and
use the same system resources, it is unclear which technique is superior
or if both can be combined. This paper shows that these two techniques
are actually mutually beneficial, each helping to overcome limitations
of the other. We perform our study for uniprocessor and multiprocessor
configurations, in simulation and on a real machine (Convex Exemplar).
Compared to prefetching alone (the state-of-the-art implemented in systems
today), the combination of the two techniques reduces execution time
an average of 21% across all cases studied in simulation and an average
of 16% for 5 out of 10 cases on the Exemplar. The combination sees execution
time reductions relative to clustering alone averaging 15% for 6 out
of 11 cases in simulation and 20% for 6 out of 10 cases on the Exemplar.
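Read miss clustering can be illustrated with a simple unroll-and-jam: restructuring a loop so several independent reads issue back-to-back, letting a non-blocking cache keep their misses in flight simultaneously. The sketch below is an illustrative analogue, not the paper's transformation:

```c
#include <stddef.h>

/* Before clustering, each iteration touches one new location and its miss
 * serializes behind the previous one.  Unrolling by four and jamming the
 * bodies together issues four independent reads in quick succession, so a
 * non-blocking cache can overlap their misses (assuming the stride spans
 * at least a cache line). */
long sum_clustered(const long *a, size_t n, size_t stride) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 3 * stride < n; i += 4 * stride) {
        s0 += a[i];               /* four independent loads: their */
        s1 += a[i + stride];      /* cache misses can all be       */
        s2 += a[i + 2 * stride];  /* outstanding at the same time  */
        s3 += a[i + 3 * stride];
    }
    for (; i < n; i += stride)    /* remainder iterations */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

/* Small driver summing 0..15 at unit stride. */
long demo_clustered(void) {
    long a[16];
    for (int i = 0; i < 16; i++)
        a[i] = i;
    return sum_clustered(a, 16, 1);   /* 0+1+...+15 = 120 */
}
```

Clustering parallelizes the misses that still occur, while prefetching moves misses earlier; the paper's finding is that the two compose rather than compete.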