The growing demand for Decision Support Systems (DSS) and Online
Analytical Processing (OLAP) in business increases the complexity of
database design, query construction, and query optimization. The
challenges for a Database Management System (DBMS) in supporting data
warehouse implementations, with their millions (or billions) of rows of
stored information, are being tackled aggressively by database vendors as
they attempt to assert dominance in this growing field. Data warehouses
often adopt the star schema data model to minimize the overhead of
high-volume data redundancy, normalizing descriptive data into
dimensions surrounding a large fact table. This presentation will
address the IBM DB2 UDB implementation of star schema support and our
current and future work.
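The fact-and-dimension layout described above can be sketched in miniature. The following Python toy (table names and values are invented, standing in for real DB2 tables) shows a large, narrow fact table of foreign keys and measures joined to a small dimension table, the way a star-join query with a GROUP BY would be evaluated:

```python
# A minimal sketch of the star schema idea, using plain Python
# dictionaries in place of real database tables.

# Dimension tables: small, descriptive data normalized out of the facts.
store_dim = {1: {"city": "San Jose"}, 2: {"city": "Austin"}}
product_dim = {10: {"name": "widget"}, 20: {"name": "gadget"}}

# Fact table: many narrow rows holding foreign keys plus measures.
sales_fact = [
    {"store_id": 1, "product_id": 10, "revenue": 120.0},
    {"store_id": 1, "product_id": 20, "revenue": 80.0},
    {"store_id": 2, "product_id": 10, "revenue": 45.0},
]

def revenue_by_city(fact, stores):
    """Join the fact table to a dimension and aggregate, as a
    star-join query (SELECT city, SUM(revenue) ... GROUP BY) would."""
    totals = {}
    for row in fact:
        city = stores[row["store_id"]]["city"]
        totals[city] = totals.get(city, 0.0) + row["revenue"]
    return totals

print(revenue_by_city(sales_fact, store_dim))
```

Descriptive attributes such as the city name are stored once per dimension row rather than once per fact row, which is the redundancy saving the schema is designed for.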
The speaker will give an overview of database systems research projects
conducted at the IBM Almaden Research Center with the goal of making the
DB2 system more autonomic: by introducing several autonomic feedback
loops, the system learns how to improve its performance over time.
Database systems let users specify queries in a declarative language
like SQL.
Most modern DBMS optimizers rely upon a cost model to choose the best
query execution plan (QEP) for any given query. Cost estimates are
heavily dependent upon the optimizer's estimates for the number of
rows that will result at each step of the QEP for complex queries
involving many predicates and/or operations. These estimates, in
turn, rely upon statistics on the database and modeling assumptions
that may or may not be true for a given database. In this talk, we
present research on learning in query optimization that has been
carried out at the IBM Almaden Research Center. We introduce LEO,
DB2's LEarning Optimizer, as a comprehensive way to repair incorrect
statistics and cardinality estimates of a query execution plan. By
monitoring executed queries, LEO compares the optimizer's estimates
with actuals at each step in a QEP, and computes adjustments to cost
estimates and statistics that may be used during the current and
future query optimizations. This analysis can be done either on-line
or off-line on a separate system, and either incrementally or in
batches. In this way, LEO introduces a feedback loop to query
optimization that enhances the available information about the parts of
the database where the most queries have occurred, allowing the optimizer
to actually learn from its past mistakes. Our technique is general and
can be applied to any operation in a QEP (not just selection
predicates on base tables), including joins, derived results after
several predicates have been applied, and even to DISTINCT and
GROUP-BY operators.
The runtime overhead of LEO's monitoring is insignificant, whereas the
potential benefit to response time from more accurate cardinality and
cost estimates can be orders of magnitude.
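The feedback loop described above can be sketched abstractly. The Python toy below (data structures and the example predicate are invented, not IBM's actual code) records the ratio of actual to estimated cardinality for a QEP step and applies it as a correction during later optimizations:

```python
# A hedged sketch of a LEO-style feedback loop: compare the
# optimizer's cardinality estimate for a step with the row count
# observed at runtime, and keep an adjustment factor for the future.

adjustments = {}  # predicate -> learned correction factor

def monitor(predicate, estimated_rows, actual_rows):
    """After execution, record the ratio of actual to estimated rows."""
    if estimated_rows > 0:
        adjustments[predicate] = actual_rows / estimated_rows

def adjusted_estimate(predicate, estimated_rows):
    """During later optimizations, apply any learned correction."""
    return estimated_rows * adjustments.get(predicate, 1.0)

# The optimizer guessed 1000 rows for a predicate; execution saw 50000.
monitor("age > 65", estimated_rows=1000, actual_rows=50000)
print(adjusted_estimate("age > 65", 1000))  # corrected to 50000.0
```

The monitoring side only records two numbers per step, which is why its runtime overhead can stay insignificant while the corrected estimates change plan choices dramatically.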
After a couple of decades of excitement, multiprocessors have
played second fiddle to instruction-level parallelism and uniprocessors
since the early 1990s. A significant reason for the recent excitement
in uniprocessors was the revolution in uniprocessor microarchitecture
made possible by the increasing transistor resources.
Once technology allowed a simple uniprocessor
to be implemented on a single chip (in the mid 1980s), researchers started
looking at radically different ways of implementing uniprocessor microarchitectures.
The net result is that the microarchitecture of a high-performance
uniprocessor today barely resembles that of one from 20 years ago.
Today we stand at the doorstep of a similar revolution for multiprocessors.
Semiconductor technology currently allows us to build a small chip multiprocessor,
but the "multiarchitecture"--the microarchitecture of the multiprocessor--is no
different from that of a traditional multiprocessor built with multiple chips.
Now is the time to start to rethink the multiarchitecture of a chip multiprocessor
and to come up with novel ways of building and using chip multiprocessors.
This talk is an attempt to provide some initial directions towards
this overall research agenda.
BAFL: Bottleneck Analysis of Fine-Grain Parallelism
Rastislav Bodik
Thanks to the RISC revolution, processors became analyzable. Thanks
to Moore's Law, they came to baffle us all over again. The culprit, fed
by transistors, is parallelism, introduced all over the processor.
Parallelism baffles us because linear performance models---popularized
on RISC pipelines---break down on out-of-order processors, leaving
architects without a solid handle on how to spend future transistors.
To identify bottlenecks in modern processors, we use the natural
method: finding the critical path in a dependence graph. I will first
show how to model the radically diverse constraints in the
processor---data dependences, queuing delays, and mispredictions---in
a single dependence graph. I will then follow by describing how to
compute the model, using novel probabilistic algorithms that find the
critical path directly in hardware, very efficiently, without actually
building the graph.
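The critical-path computation on such a dependence graph can be sketched as follows (the graph shape, edge weights, and node numbering below are invented for illustration; BAFL's actual hardware algorithms are probabilistic and avoid building the graph at all):

```python
# A minimal sketch of finding the bottleneck as the longest (critical)
# path in a DAG whose weighted edges model data dependences, queuing
# delays, and misprediction penalties.

def critical_path(edges, n):
    """Longest-path length in a DAG whose nodes 0..n-1 are already in
    topological (program) order; edges are (src, dst, latency)."""
    dist = [0] * n
    for src, dst, latency in sorted(edges, key=lambda e: e[0]):
        dist[dst] = max(dist[dst], dist[src] + latency)
    return max(dist)

# 0->1 data dep (2 cycles), 0->2 queue delay (5), 2->3 mispredict (10)
edges = [(0, 1, 2), (0, 2, 5), (1, 3, 1), (2, 3, 10)]
print(critical_path(edges, 4))  # the 0->2->3 chain dominates: 15
```

Edges off the critical path (here the 0->1->3 chain) have slack, so resources spent shortening them buy no performance, which is exactly the intuition the methodology is meant to restore.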
The result is BAFL: a methodology, both quantitative and qualitative,
thanks to which students can regain intuition, architects can reduce
circuit complexity, and processors can reconfigure themselves to save
power.
Future large computing systems will be developed on the basis of average
performance, productivity, ease of use, and cost of ownership. The PERCS
(Productive, Easy-to-use, Reliable Computing System) project, led by IBM, is
attempting to address this multidisciplinary challenge. Our current
research recognizes that feasible solutions to this multi-pronged problem
will require system design at multiple levels of abstraction, and
implementation of architectures that adapt to the application workload.
With overall computing bandwidths approaching petabytes per second before
the end of this decade, it is no surprise that networking and
communications hardware in these systems has become increasingly critical.
Communications energy consumption is quickly becoming a large portion of
total computing system power, and increasingly determines system cost,
volume, and time-to-market. We have been investigating energy-efficient,
productive design methods for interconnect networks in large computing
systems. First, our methodology optimizes multiple levels of the
communications design hierarchy. Second, adaptive policies are explicitly
designed whereby the low-level characteristics of communications hardware
aggressively adapt themselves to their application requirements.
Preliminary results indicate that systems that are vastly more
energy-efficient than existing approaches are possible, while allowing
unified communications chip designs to work across a wide set of protocols,
applications, and manufacturing conditions.
A Compiler Framework to Support Control and Data Speculation
Pen-Chung Yew, University of Minnesota, Twin Cities
Recent high-performance microprocessors, such as Intel/HP's
Itanium (IA-64), support both control and data speculation to enhance
instruction-level parallelism (ILP) in application programs.
These new architectural features allow more aggressive compiler
optimizations.
However, current compilers only exploit these new features
in a very limited way. In this talk, we address compiler
issues related to these new speculation features, and
present a general compiler framework that can exploit
control and data speculation for many more compiler
optimizations. It also supports the generation of recovery
code for potential mis-speculation at runtime. We will
also present efficient alias and data dependence profiling
schemes that are critical to the framework.
This framework has been implemented on Intel's
Open Research Compiler (ORC). We will show some
experimental results on Itanium using SPEC benchmarks.
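The data-speculation-with-recovery pattern the framework targets can be illustrated abstractly. The Python sketch below (memory contents and addresses are invented) mimics an Itanium-style advanced load hoisted above a possibly-aliasing store, with a check that falls into recovery code when the speculation was wrong:

```python
# A hedged sketch of data speculation with recovery code: a load is
# hoisted above a store that may alias it, and a check at the load's
# original position runs recovery code if the store did alias.

def run(store_addr):
    memory = {100: 7, 200: 9}     # fresh memory state for the example
    speculative = memory[100]     # advanced load, hoisted above the store
    memory[store_addr] = 42       # a store that may alias address 100
    if store_addr == 100:         # check: did the store alias the load?
        speculative = memory[100]  # recovery code: redo the load
    return speculative

print(run(200))  # no alias: the speculative value 7 was safe to use
print(run(100))  # alias detected: recovery reloads and yields 42
```

The compiler's job, as described in the talk, is to decide when such hoisting is profitable and to generate the recovery code automatically, guided by alias and dependence profiles.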
Computer systems have evolved significantly in recent years, leading to high-performance systems. This, however, has come at the cost of large power dissipation, resulting in high operating temperatures and large energy consumption. Power-awareness has become a major factor in processor design. It is therefore important to have a complete understanding of the power and performance behavior of all processor components across different workloads. This work analyses the efficiency of three different types of workloads: multimedia, scientific, and database. The first part of this presentation will focus on the efficient configuration of a high-performance microprocessor. The objectives of this work are:
(1) to identify which microarchitecture parameters are more relevant in a power-aware design;
(2) to present the trends for the microarchitecture configuration which make it more efficient in terms of power-performance;
(3) to show how different workloads affect the power-aware design; and
(4) to propose power-performance efficient configurations for each workload.
The simulation results show that carefully tuned configurations may achieve gains of up to 125% over the power-performance efficiency of the baseline configuration. The second part of this presentation will focus on identifying which resources are most important for adaptation. Four criteria are defined to limit the adaptation range, considering the cost and/or the efficiency of the system. The results of our experiments showed that the resource to be adapted depends on:
(1) the operating mode (e.g., low-power or high-performance),
(2) the workload, and
(3) the criteria.
Overall, for our baseline processor configuration, the dominant resources to be adapted are the voltage-frequency setting and the first-level instruction cache. Adapting resources may lead to an increase in performance of up to 44% or a reduction in power of up to 93% with no restrictions. For configurations of the same cost and efficiency as the baseline, adaptation of resources may improve performance by 33% or reduce power by 34%.
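Comparing configurations requires a combined power-performance metric; the talk's exact metric is not specified here, but the energy-delay-squared product (ED²P) is one common choice, sketched below with invented numbers:

```python
# An illustrative sketch of ranking processor configurations by a
# power-performance efficiency metric; the energy-delay-squared
# product (ED^2P) weights performance more heavily than energy,
# and lower values mean a more efficient configuration.

def ed2p(energy_joules, delay_seconds):
    """Energy-delay-squared product: lower is better."""
    return energy_joules * delay_seconds ** 2

baseline = ed2p(energy_joules=50.0, delay_seconds=2.0)  # 200.0
tuned = ed2p(energy_joules=40.0, delay_seconds=1.5)     # 90.0

# Relative efficiency gain of the tuned configuration over baseline.
gain = baseline / tuned - 1.0
print(f"{gain:.0%} improvement")
```

Different exponents on the delay term (EDP, ED²P, ED³P) encode different low-power versus high-performance operating modes, which is one way the "criteria" above change which resource is worth adapting.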
This presentation analyzes the potential for improving conditional branch prediction using isomorphic information.
A dynamic instruction instance is said to be isomorphic if its component
(information derived from the instruction and the dynamic dependence
graph of a program) is identical to the component of an instruction
executed earlier.
The talk will introduce various safe and unsafe transformations that can
change the isomorphic behavior of conditional branch instructions. Safe
transformations may only increase isomorphism and predictability, whereas
unsafe transformations may also result in pseudo-isomorphism and
misprediction.
The presentation includes an empirical analysis of the performance potential limit of a hybrid prediction architecture that combines an alias-free history-based branch predictor with an ideal isomorphic predictor. The results show that the potential misprediction reduction over the alias-free history-based predictor is insignificant when the entire dependence graph is included in branch components, but it becomes considerable, on average 84%, when components are safely transformed. This indicates that the alias-free branch predictor frequently mispredicts on branches that have, after transformations, the same dependence structure as previously executed branches. The potential with unsafe transformations is significant but lower, due to the presence of pseudo-isomorphism. Analysis revealed that when considering only components that exhibit very low pseudo-isomorphism, the potential misprediction reduction is on average 35% when components ignore input values, 15% when components disregard values and memory dependences but include address dependences, and 7% when values, memory dependences, and address dependences are all ignored. The performance with unsafe transformations suggests that the inclusion of input values in components and the bypassing of memory dependences are essential for a large misprediction reduction. Overall, the experimental analysis shows that future conditional branch predictors can be improved by incorporating isomorphic information.
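The core mechanism of an ideal isomorphic predictor can be sketched abstractly (the component encoding below is invented for illustration): summarize a branch's instruction and dependence-graph context into a hashable component, and predict a repeat of the outcome seen for an identical earlier component.

```python
# A hedged sketch of isomorphic branch prediction: identical
# components (instruction plus dependence-graph summary) are assumed
# to repeat their earlier outcome.

history = {}  # component -> previously observed outcome

def predict(component):
    """Predict from an isomorphic earlier instance, falling back to a
    default direction when the component has not been seen."""
    return history.get(component, "not-taken")

def update(component, outcome):
    """Record the resolved outcome for this component."""
    history[component] = outcome

# Component = (opcode, producer instructions, input values); dropping
# the values field is one of the unsafe transformations discussed.
comp = ("beq", ("load r1", "add r2"), (0, 4))
update(comp, "taken")
print(predict(comp))
```

The safe/unsafe distinction maps onto how much of the component tuple is kept: coarser components match more often (more isomorphism) but risk merging branches that only appear identical (pseudo-isomorphism).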
Efficient implementation of DSP applications is critical for many
embedded systems. Optimising compilers for application programs
written in C largely focus on code generation and scheduling which,
with their growing maturity, are providing diminishing returns. As DSP
applications typically make extensive use of pointer arithmetic, the
alternative use of high-level transformations has been largely
ignored. This paper develops an array recovery technique that
automatically converts pointers to arrays, enabling the empirical
evaluation of high-level source-to-source transformations. High-level
techniques were applied to the DSPstone benchmarks on 3 platforms:
TriMedia TM-1000, Texas Instruments TMS320C6201 and the Analog SHARC
ADSP-21160. On average, the best transformation gave a factor of 2.43
improvement across the platforms. In certain cases a speedup of 5.48
was found for the SHARC, 7.38 for the TM-1 and 2.3 for the
C6201. These preliminary results justify pointer-to-array conversion
and further investigation into the use of high-level techniques
for embedded compilers.
High scalability in peer-to-peer (P2P) networks has been
achieved with the emergence of the Distributed Hash Table (DHT).
Most DHTs can be regarded as exponential networks, where the
network size evolves exponentially while the minimal distance between
two nodes as well as the routing table size at each node evolve
linearly.
In this talk we present a framework to better characterize most of the
current P2P exponential networks. Depending on the nodes' view of the
network, we express the network as an absolute or relative exponential
network. We then propose the Tango approach to better structure
relative exponential networks to increase their scalability. In these
networks, e.g., Chord and DKS, if all nodes are reachable in at most H
hops, then the number of paths of length less than or equal to H
between two nodes grows exponentially with the network size. We reduce this
redundancy to improve other properties, such as reducing the number of
hops between two nodes. We show that Tango is much more scalable
than the current DHT-based P2P networks. We have chosen Tango
as the algorithm underlying our P2P middleware.
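The "exponential network" structure can be made concrete with Chord, which the talk cites. In the sketch below (ring size and node identifiers are invented), each node keeps one finger per power of two, so the routing table grows linearly in m while the identifier space grows as 2^m:

```python
# A minimal sketch of Chord-style exponential structure: fingers at
# exponentially spaced offsets give O(log N) greedy routing.

M = 6  # identifier space of size 2^M = 64
nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(key):
    """First node clockwise from key on the identifier ring."""
    for n in nodes:
        if n >= key:
            return n
    return nodes[0]  # wrap around the ring

def fingers(n):
    """Finger i points at successor(n + 2^i): exponentially spaced,
    so each routing hop can halve the remaining distance."""
    return [successor((n + 2 ** i) % 2 ** M) for i in range(M)]

print(fingers(8))
```

The path redundancy Tango exploits is visible here: several distinct finger sequences connect the same pair of nodes within H hops, and restructuring that redundancy is what shortens routes.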
Clustered Programmable-Reconfigurable Processors
Nicholas P. Carter
University of Illinois at Urbana-Champaign
Computing systems based on reconfigurable logic have achieved impressive performance on a number of compute-intensive applications. However, reconfigurable computing systems have not made significant inroads into the general-purpose market because of their limited application set and the effort required to develop applications for them. To address these limitations, we have developed a new class of reconfigurable computing systems: clustered programmable-reconfigurable processors. Clustered programmable-reconfigurable processors integrate multiple programmable processors and blocks of reconfigurable logic onto a chip to provide high performance. Processing resources are divided into independent clusters that communicate with each other over an on-chip network to limit wire lengths and allow the processor to operate at high clock rates. A register-based inter-cluster communication mechanism allows clusters to exchange data without the overhead of shared-memory communication and provides an efficient abstraction barrier between the different computing resources. In this talk, we describe the architecture of the Amalgam programmable-reconfigurable processor, including our reconfigurable cluster design. We present results that argue that clustered processors will deliver high performance in upcoming fabrication processes, and show that a set of benchmarks implemented on Amalgam execute up to 13x faster than they would on a simple programmable processor.
A new breed of applications challenges established assumptions on computer
architectures. These so-called "stateful processing" applications have
expanded beyond the original packet processing in routers/switches to data
processing in servers, search engines, and data mining. In the network-processing
scenario, the term "stateful" refers to the need to use contextual, historic
information to process packets in advanced applications (intrusion detection,
statistics gathering, etc.) in contrast to the original routing protocol
that treated each packet as a history-less entity. Several features characterize
stateful applications, namely: a) the need to keep track of contextual
information; b) high rate of memory interactions; c) poor locality; d)
large number of concurrent threads. Superscalar architectures excel in
processing applications that conform to the standard assumptions of good
locality and a relatively low rate of memory interactions. To address the
challenge of stateful applications, new architectures can be proposed
that offer substantial performance improvements. In this talk I will
briefly present some of the stateful-processing characteristics and
discuss several architecture schemes that address these challenges.
In this talk, I present a simple and efficient method, called ARIES
(Algorithm for Recovery and Isolation Exploiting Semantics), which supports
partial rollbacks of transactions, fine-granularity (e.g., record) locking
and recovery using write-ahead logging (WAL). I introduce the paradigm
of repeating history to redo all missing updates before performing the
rollbacks of the loser transactions during restart after a system failure.
ARIES uses a log sequence number in each page to correlate the state of
a page with respect to logged updates of that page. All updates of a transaction
are logged, including those performed during rollbacks. By appropriate
chaining of the log records written during rollbacks to those written during
forward progress, a bounded amount of logging is ensured during rollbacks
even in the face of repeated failures during restart or of nested rollbacks.
I deal with a variety of features that are very important in building and
operating an "industrial-strength" transaction processing system. ARIES
supports fuzzy checkpoints, selective and deferred restart, fuzzy backups,
media recovery, and high concurrency lock modes (e.g., increment/decrement)
which exploit the semantics of the operations and which require the ability
to perform operation logging. ARIES is flexible with respect to the kinds
of buffer management policies that can be implemented. It supports varying
length objects efficiently. By permitting parallelism during restart, page-oriented
redo and logical undo, it enhances concurrency and performance. ARIES is
applicable not only to database management systems but also to persistent
object-oriented languages, recoverable file systems and transaction-based
operating systems. I show why some of the System R paradigms for logging
and recovery, which were based on the shadow page technique, need to be
changed in the context of WAL. I compare ARIES to the WAL-based recovery
methods of DB2/MVS V1, IMS and Tandem systems. ARIES has been implemented,
to varying degrees, in IBM's DB2 family of products, MQSeries, ADSM, Lotus
Domino/Notes, Starburst and QuickSilver, in Transarc's Encina product suite,
in many non-IBM products, and in the University of Wisconsin's Gamma, EXODUS,
Shore and Paradise. Much more detail on ARIES can be found in a paper
in the March 1992 issue of the ACM Transactions on Database Systems and
also in IBM Research Report RJ6649. This is joint work with D. Haderle,
B. Lindsay, H. Pirahesh and P. Schwarz.
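The "repeating history" redo rule built on per-page log sequence numbers can be sketched as follows (the page and log-record layouts below are invented for illustration, not ARIES's actual formats):

```python
# A hedged sketch of ARIES-style redo: during restart, a logged update
# is reapplied only if the page's LSN shows it is not yet reflected,
# which makes redo idempotent across repeated crashes during restart.

pages = {"P1": {"lsn": 10, "value": "old"}}

log = [  # every update is logged, including those made during rollback
    {"lsn": 10, "page": "P1", "value": "old"},  # already on disk
    {"lsn": 20, "page": "P1", "value": "new"},  # lost at the crash
]

def redo(log, pages):
    """Repeat history: reapply each missing update in LSN order."""
    for rec in log:
        page = pages[rec["page"]]
        if page["lsn"] < rec["lsn"]:  # update not yet on this page
            page["value"] = rec["value"]
            page["lsn"] = rec["lsn"]

redo(log, pages)
print(pages["P1"])  # the lost update is restored before any undo runs
```

Because the LSN comparison skips already-applied updates, running redo a second time changes nothing, which is the property that survives a crash during restart itself.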
I will present the compiler technology implemented in the Intel Itanium(TM)
product compilers. The Intel compiler has produced leading performance
on many benchmarks and applications on the Itanium(TM) architecture. I will
discuss the overall compiler architecture, and techniques in profiling,
interprocedural optimizations, disambiguation, high-level optimizations,
parallelization, global scalar optimizations, scheduling, and code generation.