Seminar 03/04 - Abstracts



Star Joins in DB2 UDB
Adriana Zubiri
DB2 UDB Team, IBM Toronto


The increasing demand for Decision Support Systems (DSS) and Online Analytical Processing (OLAP) in business adds complexity to database design, query construction, and query optimization. The challenges a Database Management System (DBMS) faces in supporting data warehouse implementations, with their millions (or billions) of rows of information, are being tackled aggressively by database vendors as they attempt to assert dominance in this growing field. Data warehouses often adopt the star schema data model to minimize the overhead of high-volume data redundancy, normalizing descriptive data into dimensions surrounding a large fact table. This presentation will address the IBM DB2 UDB implementation of star schema support and our current and future work.
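To make the star schema idea concrete, here is a minimal, hedged sketch (table names, columns, and the join strategy are illustrative assumptions, not DB2 internals): a large fact table references small descriptive dimension tables, and a star join filters the dimensions first, then scans the fact table once.

```python
# Tiny stand-ins for dimension tables (hypothetical names and contents).
dim_store = {1: "Toronto", 2: "Ottawa"}
dim_product = {10: "laptop", 11: "monitor"}

# Fact rows: (store_id, product_id, sales_amount)
fact_sales = [(1, 10, 1200), (1, 11, 300), (2, 10, 1500), (2, 11, 250)]

def star_join(fact, stores, products, city, item):
    """Filter each dimension first, then scan the fact table once,
    keeping only rows whose keys survive both dimension filters."""
    store_keys = {k for k, v in stores.items() if v == city}
    product_keys = {k for k, v in products.items() if v == item}
    return [amt for s, p, amt in fact
            if s in store_keys and p in product_keys]

print(star_join(fact_sales, dim_store, dim_product, "Toronto", "laptop"))  # [1200]
```

The point of the strategy is that the cheap dimension filters shrink the key sets before the expensive scan of the large fact table.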



Learning in Query Optimization
Volker Markl
IBM Almaden Research Center


The speaker will give an overview of database systems research projects conducted at the IBM Almaden Research Center with the goal of making the DB2 system more autonomic, by having it learn how to improve its performance over time through several autonomic feedback loops. Database systems let users specify queries in a declarative language like SQL. Most modern DBMS optimizers rely upon a cost model to choose the best query execution plan (QEP) for any given query. For complex queries involving many predicates and/or operations, cost estimates are heavily dependent upon the optimizer's estimates for the number of rows that will result at each step of the QEP. These estimates, in turn, rely upon statistics on the database and modeling assumptions that may or may not be true for a given database. In this talk, we present research on learning in query optimization that has been carried out at the IBM Almaden Research Center. We introduce LEO, DB2's LEarning Optimizer, as a comprehensive way to repair incorrect statistics and cardinality estimates of a query execution plan. By monitoring executed queries, LEO compares the optimizer's estimates with the actual values at each step in a QEP, and computes adjustments to cost estimates and statistics that may be used during current and future query optimizations. This analysis can be done either on-line or off-line on a separate system, and either incrementally or in batches. In this way, LEO introduces a feedback loop to query optimization that enhances the available information on the parts of the database queried most often, allowing the optimizer to actually learn from its past mistakes. Our technique is general and can be applied to any operation in a QEP (not just selection predicates on base tables), including joins, derived results after several predicates have been applied, and even DISTINCT and GROUP-BY operators.
The runtime overhead of LEO's monitoring is insignificant, whereas the potential benefit to response time from more accurate cardinality and cost estimates can be orders of magnitude.
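The feedback loop the abstract describes can be sketched in a few lines. This is a deliberately simplified illustration of the idea, not DB2's or LEO's actual code: compare the optimizer's cardinality estimate for a plan step with the row count observed at runtime, and keep a multiplicative correction to apply the next time a similar predicate is optimized.

```python
adjustments = {}  # predicate signature -> learned correction factor

def record_actual(predicate, estimated_rows, actual_rows):
    """Learn a multiplicative correction from one executed query step."""
    if estimated_rows > 0:
        adjustments[predicate] = actual_rows / estimated_rows

def adjusted_estimate(predicate, estimated_rows):
    """Apply any previously learned correction to a fresh estimate."""
    return estimated_rows * adjustments.get(predicate, 1.0)

# The optimizer guessed 1000 rows, but only 250 actually qualified...
record_actual("city = 'Toronto'", estimated_rows=1000, actual_rows=250)
# ...so the next optimization of the same predicate is scaled down.
print(adjusted_estimate("city = 'Toronto'", 2000))  # 500.0
```

Monitoring here is just a dictionary write per step, which mirrors the abstract's claim that the runtime overhead is small relative to the benefit of better estimates.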



Single-Chip Multiprocessors: The Rebirth of Parallel Processing
Guri Sohi
Computer Sciences Department, University of Wisconsin-Madison


After a couple of decades of excitement, multiprocessors have played second fiddle to instruction-level parallelism and uniprocessors since the early 1990s. A significant reason for the recent excitement in uniprocessors was the revolution in uniprocessor microarchitecture made possible by the increasing transistor resources. Once technology allowed a simple uniprocessor to be implemented on a single chip (in the mid 1980s), researchers started looking at radically different ways of implementing uniprocessor microarchitectures. The net result is that the microarchitecture of a high-performance uniprocessor today barely resembles that of one from 20 years ago. Today we stand at the doorstep of a similar revolution for multiprocessors. Semiconductor technology currently allows us to build a small chip multiprocessor, but the "multiarchitecture"--the microarchitecture of the multiprocessor--is no different from that of a traditional multiprocessor built with multiple chips. Now is the time to start to rethink the multiarchitecture of a chip multiprocessor and to come up with novel ways of building and using chip multiprocessors. This talk is an attempt to provide some initial directions towards this overall research agenda.



BAFL: Bottleneck Analysis of Fine-Grain Parallelism
Rastislav Bodik
UC Berkeley


Thanks to the RISC revolution, processors became analyzable. Thanks to Moore's Law, they came to baffle us all over again. The culprit, fed by transistors, is parallelism, introduced all over the processor. Parallelism baffles us because linear performance models---popularized on RISC pipelines---break down on out-of-order processors, leaving architects without a solid handle on how to spend future transistors. To identify bottlenecks in modern processors, we use the natural method: finding the critical path in a dependence graph. I will first show how to model the radically diverse constraints in the processor---data dependences, queuing delays, and mispredictions---in a single dependence graph. I will then describe how to compute the model, using novel probabilistic algorithms that find the critical path directly in hardware, very efficiently, without actually building the graph. The result is BAFL: a methodology, both quantitative and qualitative, thanks to which students can regain intuition, architects can reduce circuit complexity, and processors can reconfigure themselves to save power.
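The critical-path idea underlying this analysis can be illustrated offline (the hardware algorithms in the talk avoid building the graph at all; this sketch builds it explicitly, and the node names and latencies are invented for the example): model instruction events as nodes of a dependence DAG with edge latencies, and the bottleneck is the longest path.

```python
edges = {            # node -> list of (successor, latency)
    "fetch_a": [("exec_a", 1)],
    "fetch_b": [("exec_b", 1)],
    "exec_a":  [("exec_c", 2)],   # data dependence a -> c
    "exec_b":  [("exec_c", 5)],   # slower producer: the bottleneck
    "exec_c":  [],
}

def critical_path_length(graph):
    """Longest path over a DAG whose dict insertion order is topological."""
    dist = {n: 0 for n in graph}
    for node in graph:                      # dicts preserve insertion order
        for succ, lat in graph[node]:
            dist[succ] = max(dist[succ], dist[node] + lat)
    return max(dist.values())

print(critical_path_length(edges))  # 6: fetch_b -> exec_b -> exec_c
```

Speeding up anything off this path (say, the 2-cycle edge from exec_a) leaves total latency unchanged, which is exactly why linear models mislead on out-of-order machines.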



Multi-Level Design of Energy-Efficient Adaptive Networks for Large Computing Systems
Juan-Antonio Carballo
IBM Research


Future large computing systems will be developed on the basis of average performance, productivity, ease of use, and cost of ownership. The PERCS (Productive, Easy-to-use, Reliable Computing System) project led by IBM is attempting to address this multidisciplinary challenge. Our current research recognizes that feasible solutions to this multi-pronged problem will require system design at multiple levels of abstraction, and implementation of architectures that adapt to the application workload. With overall computing bandwidths approaching petabytes per second before the end of this decade, it is no surprise that networking and communications hardware in these systems has become increasingly critical. Communications energy consumption is quickly becoming a large portion of total computing system power, and increasingly determines system cost, volume, and time-to-market. We have been investigating energy-efficient, productive design methods for interconnect networks in large computing systems. First, our methodology optimizes multiple levels of the communications design hierarchy. Second, adaptive policies are explicitly designed whereby the low-level characteristics of communications hardware aggressively adapt themselves to their application requirements. Preliminary results indicate that systems vastly more energy-efficient than existing approaches are possible, while allowing unified communications chip designs to work across a wide set of protocols, applications, and manufacturing conditions.



A Compiler Framework to Support Control and Data Speculation
Pen-Chung Yew
University of Minnesota


Recent high-performance microprocessors, such as Intel/HP's Itanium (IA-64), support both control and data speculation to enhance instruction-level parallelism (ILP) in application programs. These new architectural features allow more aggressive compiler optimizations. However, current compilers exploit these new features only in a very limited way. In this talk, we address compiler issues related to these new speculation features, and present a general compiler framework that can exploit control and data speculation for many more compiler optimizations. It also supports the generation of recovery code for potential mis-speculation at runtime. We will also present efficient alias and data dependence profiling schemes that are critical to the framework. This framework has been implemented on Intel's Open Research Compiler (ORC). We will show some experimental results on Itanium using SPEC benchmarks.



Watt Matters Most? High-Performance Microprocessor Design Space Exploration for Power-Performance Efficiency
Pedro Trancoso
Department of Computer Science, University of Cyprus


Computer systems have evolved significantly in recent years, delivering ever higher performance. This, however, has come at the cost of large power dissipation, resulting in high operating temperatures and large energy consumption. Power-awareness has become a major factor in processor design. Therefore, it is important to have a complete understanding of the power and performance behavior of all processor components and for the different workloads. This work analyses the efficiency for three different types of workloads: multimedia, scientific and database. The first part of this presentation will focus on the efficient configuration of a high-performance microprocessor. The objectives of this work are: (1) to identify which microarchitecture parameters are most relevant in a power-aware design; (2) to present the trends for the microarchitecture configuration which make it more efficient in terms of power-performance; (3) to show how different workloads affect the power-aware design; and (4) to propose power-performance efficient configurations for each workload. The simulation results show that carefully tuned configurations may achieve gains of up to 125% over the power-performance efficiency of the baseline configuration. The second part of this presentation will focus on identifying which resources are most important for adaptation. Four criteria are defined to limit the adaptation range considering the cost and/or the efficiency of the system. The results of our experiments showed that the resource to be adapted depends on: (1) the operating mode (e.g. low-power or high-performance), (2) the workload, and (3) the criteria. Overall, for our baseline processor configuration, the dominant resources to be adapted are the voltage-frequency setting and the first-level instruction cache. Adapting resources may lead to an increase in performance by up to 44% or a reduction in power by up to 93% with no restrictions.
For configurations of the same cost and efficiency as the baseline, adaptation of resources may improve the performance by 33% or reduce the power by 34%.



The Isomorphism of Dynamic Conditional Branch Instructions
Yiannakis Sazeides
Department of Computer Science, University of Cyprus


This presentation analyzes the potential for improving conditional branch prediction using isomorphic information. A dynamic instruction instance is said to be isomorphic if its component (information derived from the instruction and the dynamic dependence graph of a program) is identical to the component of an instruction executed earlier. The talk will introduce various safe and unsafe transformations that can change the isomorphic behavior of conditional branch instructions. Safe transformations may only increase isomorphism and predictability, whereas unsafe transformations may also result in pseudo-isomorphism and misprediction. The presentation includes an empirical analysis of the performance potential limit of a hybrid prediction architecture that combines an alias-free history-based branch predictor with an ideal isomorphic predictor. The results show that the potential misprediction reduction over the alias-free history-based predictor is insignificant when the entire dependence graph is included in branch components, but it becomes considerable, on average 84%, when components are safely transformed. This indicates that the alias-free branch predictor frequently mispredicts on branches that have, after transformations, the same dependence structure as previously executed branches. The potential with unsafe transformations is significant but lower, due to the presence of pseudo-isomorphism. Analysis revealed that when considering only components that exhibit very low pseudo-isomorphism, the potential misprediction reduction is on average 35% when components ignore input values, 15% when components disregard values and memory dependences but include address dependences, and 7% when values, memory dependences and address dependences are ignored. The performance with unsafe transformations suggests that the inclusion of input values in components and the bypassing of memory dependences are essential for a large misprediction reduction.
Overall, the experimental analysis shows that future conditional branch predictors can be improved by incorporating isomorphic information.
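The core mechanism can be caricatured in a few lines. This is a hedged sketch of the general idea only, not the paper's predictor: treat a dynamic branch as isomorphic to an earlier one when its "component" (here just a tuple standing in for the instruction plus its dependence structure) matches exactly, and reuse that branch's outcome.

```python
history = {}  # component -> last observed outcome

def predict(component):
    """Return the remembered outcome for an isomorphic branch, else None."""
    return history.get(component)

def resolve(component, taken):
    """After execution, remember the outcome for future isomorphic instances."""
    history[component] = taken

# A stand-in component: the branch instruction and its producer chain.
b = ("cmp r1, r2", ("load r1", "add r2"))
resolve(b, True)
print(predict(b))                    # True: an isomorphic instance recurs
print(predict(("cmp r3, r4", ())))   # None: no isomorphic match yet
```

The transformations discussed in the talk amount to deciding what goes into the component tuple: dropping values or memory dependences makes more branches match (more isomorphism) at the risk of matching branches that only look alike (pseudo-isomorphism).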



Combining Program Recovery, Auto-parallelisation and Locality Analysis for C programs on Multi-processor Embedded Systems
Mike O'Boyle
Edinburgh University


Efficient implementation of DSP applications is critical for many embedded systems. Optimising compilers for application programs written in C largely focus on code generation and scheduling, which, with their growing maturity, are providing diminishing returns. As DSP applications typically make extensive use of pointer arithmetic, the alternative use of high-level transformations has been largely ignored. This paper develops an array recovery technique that automatically converts pointers to arrays, enabling the empirical evaluation of high-level source-to-source transformations. High-level techniques were applied to the DSPstone benchmarks on three platforms: TriMedia TM-1000, Texas Instruments TMS320C6201 and the Analog SHARC ADSP-21160. On average, the best transformation gave a factor of 2.43 improvement across the platforms. In certain cases a speedup of 5.48 was found for the SHARC, 7.38 for the TM-1 and 2.3 for the C6201. These preliminary results justify pointer-to-array conversion and further investigation into the use of high-level techniques for embedded compilers.



Improving the Scalability of Exponential Peer-to-Peer Networks
Peter Van Roy
Dept. of Computing Science


High scalability in peer-to-peer (P2P) networks has been achieved with the emergence of the Distributed Hash Table (DHT). Most DHTs can be regarded as exponential networks, where the network size evolves exponentially while the minimal distance between two nodes as well as the routing table size at each node evolve linearly. In this talk we present a framework to better characterize most of the current P2P exponential networks. Depending on the nodes' view of the network, we express the network as an absolute or relative exponential network. We then propose the Tango approach to better structure relative exponential networks to increase their scalability. In these networks, e.g., Chord and DKS, if all nodes are reachable in at most H hops, then the number of paths of length less than or equal to H between two nodes grows exponentially with the network size. We reduce this redundancy to improve other properties, such as reducing the number of hops between two nodes. We show that Tango is much more scalable than the current DHT-based P2P networks. We have chosen Tango as the algorithm underlying our P2P middleware.
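The "exponential" property can be illustrated with a toy Chord-style overlay (identifier details are simplified, and this is not the Tango algorithm itself): with a ring of 2^b identifiers, each node keeps about b fingers at power-of-two distances, and greedy routing reaches any key in at most b hops, so the network size grows exponentially in the linear routing-table size.

```python
import math

def chord_hops(src, dst, n_bits):
    """Greedy clockwise routing on a full ring of 2**n_bits nodes:
    each hop jumps by the largest power of two not exceeding the
    remaining clockwise distance (a finger every node maintains)."""
    size, hops, cur = 2 ** n_bits, 0, src
    while cur != dst:
        dist = (dst - cur) % size
        cur = (cur + 2 ** (dist.bit_length() - 1)) % size
        hops += 1
    return hops

n_bits = 10                      # a 1024-node ring
worst = max(chord_hops(0, d, n_bits) for d in range(2 ** n_bits))
print(worst, math.ceil(math.log2(2 ** n_bits)))  # 10 10
```

The redundancy Tango targets is visible here too: many distinct hop sequences of length at most log2(N) connect the same pair of nodes, and pruning that redundancy is what buys the improved scalability claimed in the talk.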



Clustered Programmable-Reconfigurable Processors
Nicholas P. Carter
University of Illinois at Urbana-Champaign


Computing systems based on reconfigurable logic have achieved impressive performance on a number of compute-intensive applications. However, reconfigurable computing systems have not made significant inroads into the general-purpose market because of their limited application set and the effort required to develop applications for them. To address these limitations, we have developed a new class of reconfigurable computing systems: clustered programmable-reconfigurable processors. Clustered programmable-reconfigurable processors integrate multiple programmable processors and blocks of reconfigurable logic onto a chip to provide high performance. Processing resources are divided into independent clusters that communicate with each other over an on-chip network to limit wire lengths and allow the processor to operate at high clock rates. A register-based inter-cluster communication mechanism allows clusters to exchange data without the overhead of shared-memory communication and provides an efficient abstraction barrier between the different computing resources. In this talk, we describe the architecture of the Amalgam programmable-reconfigurable processor, including our reconfigurable cluster design. We present results that argue that clustered processors will deliver high performance in upcoming fabrication processes, and show that a set of benchmarks implemented on Amalgam execute up to 13x faster than they would on a simple programmable processor.



Stateful applications and their architectural challenges
Mario Nemirovski
Kayamba Inc.


A new breed of applications challenges established assumptions about computer architectures. These so-called "stateful processing" applications have expanded beyond the original packet processing in routers/switches to data processing in servers, search engines, and data mining. In the network-processing scenario, the term "stateful" refers to the need to use contextual, historic information to process packets in advanced applications (intrusion detection, statistics gathering, etc.), in contrast to the original routing protocol, which treated each packet as a history-less entity. Several features characterize stateful applications, namely: a) the need to keep track of contextual information; b) a high rate of memory interactions; c) poor locality; d) a large number of concurrent threads. Superscalar architectures excel at processing applications that conform to the standard assumptions of good locality and a relatively low rate of memory interactions. To address the challenge of stateful applications, new architectures can be proposed with substantial performance improvements. In this talk I will briefly present some of the characteristics of stateful processing and discuss several architectural schemes to address these challenges.



ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging
C Mohan
IBM Almaden Research Center


In this talk, I present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine-granularity (e.g., record) locking and recovery using write-ahead logging (WAL). I introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logging is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks. I deal with a variety of features that are very important in building and operating an "industrial-strength" transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy backups, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics of the operations and which require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports varying length objects efficiently. By permitting parallelism during restart, page-oriented redo and logical undo, it enhances concurrency and performance. ARIES is applicable not only to database management systems but also to persistent object-oriented languages, recoverable file systems and transaction-based operating systems. I show why some of the System R paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL. I compare ARIES to the WAL-based recovery methods of DB2/MVS V1, IMS and Tandem systems. 
ARIES has been implemented, to varying degrees, in IBM's DB2 family of products, MQSeries, ADSM, Lotus Domino/Notes, Starburst and QuickSilver, in Transarc's Encina product suite, in many non-IBM products, and in the University of Wisconsin's Gamma, EXODUS, Shore and Paradise. Many more details on ARIES can be found in a paper in the March 1992 issue of the ACM Transactions on Database Systems and also in the IBM Research Report RJ6649. This is joint work with D. Haderle, B. Lindsay, H. Pirahesh and P. Schwarz.
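The "repeating history" redo rule and the per-page log sequence number can be sketched as follows. This is an illustrative simplification only (redo-only records, no undo phase, no buffer pool): each page carries the LSN of the last logged update applied to it, so restart redo reapplies a log record exactly when the page has not yet seen it.

```python
log = [  # (lsn, page_id, new_value) -- simplified redo-only records
    (10, "P1", "a"), (20, "P2", "x"), (30, "P1", "b"),
]

def redo(pages, log_records):
    """Replay every update whose LSN is newer than the page's LSN,
    repeating history up to the state at the time of the crash."""
    for lsn, pid, value in log_records:
        page = pages.setdefault(pid, {"lsn": 0, "data": None})
        if page["lsn"] < lsn:          # this update is missing from the page
            page["data"], page["lsn"] = value, lsn
    return pages

# P1 made it to disk through LSN 10; P2 was never written before the crash.
disk = {"P1": {"lsn": 10, "data": "a"}}
redo(disk, log)
print(disk["P1"], disk["P2"])
```

The LSN comparison is what makes redo idempotent: running recovery again after another failure skips updates already on the page, which is one reason a bounded amount of work is done even across repeated restarts.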



Building a Product Compiler for the Itanium(TM) Architecture
Wei Li
Intel Software and Solutions Group


I will present the compiler technology implemented in the Intel Itanium(TM) product compilers. The Intel compiler has produced leading performance on many benchmarks and applications on the Itanium(TM) Architecture. I will discuss the overall compiler architecture, and techniques in profiling, interprocedural optimization, disambiguation, high-level optimization, parallelization, global scalar optimization, scheduling and code generation.



 

That's all folks!!!!!