Seminar 01/02 - Abstracts


Delay Sensitive Branch Predictors

Daniel Jimenez
University of Texas at Austin

Modern pipelined microprocessors consult branch predictors to
speculatively fetch and execute instructions beyond conditional branches.
A branch predictor must operate within a single cycle, since it is on the
critical path for fetching instructions.  Branch predictors use large
tables to record correlations between branch histories and outcomes;
larger tables yield higher accuracies.  However, as feature sizes shrink
and clock rates increase, wire delay will significantly decrease the
size of branch prediction tables that can be accessed in a single cycle.

We propose methods for addressing latency in branch predictors.
We describe hierarchical organizations that extend traditional
predictors.  We then describe a highly accurate branch predictor based
on a neural learning technique.  Using a hierarchical organization,
this complex multi-cycle predictor can be used as a component of a
fast, delay-sensitive predictor.
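The hierarchical organization can be sketched as a small one-cycle predictor paired with a larger predictor whose answer arrives a few cycles later and may override the first guess. The following is a minimal illustrative sketch; the two-bit-counter components, table sizes, and class names are assumptions for illustration, not the design described in the talk:

```python
# Hedged sketch of a hierarchical (overriding) predictor organization:
# a small single-cycle predictor supplies an immediate guess, and a
# larger multi-cycle predictor later confirms or overrides it.

class TwoBitCounter:
    """A simple table of 2-bit saturating counters."""
    def __init__(self, size):
        self.table = [1] * size          # start weakly not-taken

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.table)
        self.table[i] = min(3, self.table[i] + 1) if taken \
            else max(0, self.table[i] - 1)

class OverridingPredictor:
    def __init__(self):
        self.fast = TwoBitCounter(64)    # small enough for one cycle
        self.slow = TwoBitCounter(4096)  # larger, multi-cycle table

    def predict(self, pc):
        quick = self.fast.predict(pc)       # used to fetch immediately
        confirmed = self.slow.predict(pc)   # arrives cycles later
        # if the predictions disagree, the front end re-steers to the
        # slow predictor's answer, paying a small override penalty
        return confirmed, quick != confirmed

    def update(self, pc, taken):
        self.fast.update(pc, taken)
        self.slow.update(pc, taken)
```

In a real design the slow component would be a larger, more accurate multi-cycle predictor; both levels are plain counters here purely to keep the sketch short.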



Compiler Optimizations for Low Power and Low Energy

Ulrich Kremer
Rutgers University

Effective power and energy management is important to prolong
battery life and to reduce heat dissipation. Developing compile-time
techniques for application specific power and energy management is an exciting
new challenge. In this talk, I will discuss several optimization techniques
together with an assessment of their potential benefits, namely remote task mapping,
resource hibernation, and dynamic voltage and frequency scaling.

Preliminary versions of these optimizations have been implemented as part of
the SUIF2 compiler infrastructure. The initial benefit study is based on actual power
measurements and simulation results. The actual measurements used a
single-board StrongARM based system developed at Compaq's CRL lab (Skiff), and
Compaq's iPAQ 3600 handheld PC, both running the Linux operating system. The simulation
results were obtained using modified versions of the SimpleScalar tool set. The results
show that significant power and energy savings can be achieved with minimal
performance penalties.



An Efficient and Flexible Programming Environment for SCI-Based PC-Clusters

Wolfgang Karl
Technische Universität München,
Institut für Informatik

With the rise of PC clusters based on high-speed networks like the
Scalable Coherent Interface (SCI), Myrinet, or GigaNet, the question
of how to program such systems effectively has become increasingly
important. While the traditional approaches are based on message
passing, the more convenient shared-memory model, previously
restricted to tightly coupled systems like SMPs, is gaining acceptance
in the form of Distributed Shared Memory (DSM) systems. They provide a
global address space, allow inherent data distribution, and support
incremental parallelization, and therefore offer a smooth migration
path for parallelizing an application. However, this ease of use comes
at a price. Because the shared-memory layer hides memory resources
that are in reality distributed, the optimization process for such
systems is more complex. In particular, the most important performance
factor of DSM systems, spatial and temporal data locality, cannot be
observed directly. New concepts for detecting, acquiring, and
evaluating the memory access patterns responsible for poor locality
have to be developed.

The SMiLE project at LRR-TUM (Shared Memory in a LAN-like Environment)
targets these research issues with both hardware and software efforts
for SCI-based clusters. The Scalable Coherent Interface (SCI) provides
fast communication through a global address space in a hardware DSM
fashion allowing for direct remote memory access at user level.  This
mechanism is the base for the extensive SMiLE software infrastructure,
enabling both message passing and shared memory
programming. The latter, in particular, directly benefits from the
existence of hardware DSM mechanisms, which allow the construction of
a hybrid hardware/software DSM system instead of relying completely
on software efforts. On the hardware side, this framework is completed
by a monitoring system, which enables the user to observe any memory
transaction at the SCI HW-DSM level.

The presentation will describe the SMiLE software infrastructure and
the SMiLE monitoring approach for data locality optimizations.



A Microprocessor Design for 2014

Doug Burger
University of Texas at Austin

Silicon devices have at least 15 years of physical scaling left.
It is unclear, however, whether we will be able to exploit the properties
and densities of those devices effectively.  The one-two punch of
clocking limits and slower wires will make performance scaling more
difficult than ever before, as will the growing unreliability of these
devices.  In this talk, I will present a new class of microprocessor
architectures, called Grid Processor Architectures (GPAs).  GPAs are
intended to show an order of magnitude higher performance, scale with
technology, adapt to numerous application classes, and tolerate
radiation or noise-induced soft errors.  Finally, I will describe the
prototype TRIPS chip being designed at UT-Austin, which contains
multiple GPA cores on a single die.



Computer Architecture From Many Perspectives

Peter Hsu
Peter Hsu Consulting, Inc.

Computer architecture in industry encompasses a wide variety
of issues such as semiconductor circuit technology, power
supply and heat dissipation, project scheduling and manpower
projections, cost and price estimation, and, of course, design.
With the prevalence of System-on-a-Chip (SOC) designs, there is
great demand for versatile computer architects.  In this talk
I hope to stimulate your interest by exposing you to a variety
of issues from my career experience: constraints from the
material world, design-tool implications, and financial matters.
I will show examples of how seemingly diverse issues interact,
and how sometimes very simple estimations can help you
understand complex situations.



Computing with FPGAs

Oskar Mencer
Bell Labs &
Imperial College, University of London

Field-Programmable Gate Arrays (FPGAs) can outperform microprocessors on certain tasks by many orders of
magnitude. The open research problems of computing with FPGAs are: (1) understanding the limitations of FPGAs
when competing with microprocessors, and (2) providing a useful programming methodology.

First, I will show how FPGAs can be utilized to accelerate certain algorithms by up to three orders of magnitude.
Examples of methods achieving these speedups are:

   a. exploiting parallelism and pipelining at the bit level,
   b. optimizing the encoding of data values (number representation), and/or
   c. reducing the required memory bandwidth by implementing data structures and algorithms directly on the
       FPGA.

In addition, the speedup could be translated into savings in power consumption.

Second, I suggest a programming methodology for FPGAs based on Domain Specific Compilers. Domain specific
compilers implement a divide-and-conquer, bottom-up approach to programming FPGAs. The vast space of
possible architectures fragments into architecture families, which indirectly defines application domains. A domain
specific compiler targets one architecture family and thus focuses on a single application domain. The StReAm
compiler, under development at Bell Labs and Imperial College, targets pipelined data-flow graphs mapped
directly from object-oriented C++ to hardware. The goal is to provide a simple abstraction for programming
FPGAs analogous to the abstraction of a microprocessor provided by the C programming language.



Research at the Computer Engineering Laboratory of Delft University of Technology

Ben Juurlink
Computer Engineering Laboratory
Electrical Engineering Department
Delft University of Technology

In this talk I will give an overview of the research conducted
at the Computer Engineering Laboratory of Delft University of
Technology. After some general information about the group, I
will describe some ongoing research projects like the Molen
project, in which architectures for multimedia applications and
a Java processor are being developed, and the $\Delta$-Iliad
project, in which new architectural paradigms are being developed
for general purpose computing. In the last part of this talk, I
will describe the Paderborn University BSP (PUB) library. This is
a library of communication primitives based on the Bulk-Synchronous
Parallel (BSP) model.



Retargetable Binary Translation

Cristina Cifuentes
Sun Microsystems Laboratories

Binary translation, the automatic translation of executable programs
from one machine to another, has traditionally been limited to
hand-crafted techniques for a given pair of machines.  Our approach
to binary translation, which is designed to allow both source and
target machines to be changed at low cost, is based on a combination
of machine descriptions, binary-interface descriptions, and machine-
independent analyses.  This approach is producing components that are
suitable for static binary translators, as well as for other binary
manipulation tools.

This seminar will describe the design and implementation of the
University of Queensland Binary Translator (UQBT), a retargetable
framework for constructing binary translators.  Preliminary results
obtained with several static translators instantiated from this
framework will be presented.  Retargetability is achieved by
means of specifications of features of the machine and OS conventions,
allowing a binary translator writer to concentrate on such features
and reuse the rest of the framework.

More information about the UQBT project can be found at
        http://www.csee.uq.edu.au/csm/uqbt.html



The impact of grid computing on UK research

Ron Perrott
Queen's University, Belfast
 

In November 2000 the UK research councils launched an e-Science initiative.
This initiative is concerned with the development of the key IT infrastructure to
support the increasingly global research collaborations that are emerging in many
areas of science and engineering. Such e-Science collaborations will be based on
the shared use of some combination of very large computing resources, enormous
data collections and remote access to specialised facilities or sensor data. The
need for such experiments to access extreme computing resources and/or
multi-petabyte datasets, together with their associated visualisation requirements,
will drive the development of the next generation IT infrastructure.  An important
component of the initiative is the development of the grid middleware, which
underpins the infrastructure.

The talk will describe how the UK has set up its infrastructure to support
e-Science and the various projects that have been launched.



The Pro64 Compiler

Eduard Santamaria
Computer Architecture Dept., UPC

SGI Pro64 is a suite of optimizing compilers for Linux/Itanium
systems. It includes compilers for C, C++, and Fortran 90/95 that
follow the Linux IA-64 ABI and API standards. The compiler was
developed by SGI and is based on MIPSPro. The source files are
distributed under the GNU GPL license.

The talk describes the component-level organization of the compiler,
the intermediate representation it uses, and some practical aspects
of getting started with it.



Distributed (high performance & collaborative) data mining

José Maria Peña
DATSI, Universidad Politecnica de Madrid

Parallel data mining and collaborative data mining are two new lines
of work that have emerged from a series of needs within the field of
data analysis and knowledge extraction. Parallel (or high-performance)
data mining has its origin in the need for processing power to solve
high-dimensionality data mining problems (very large databases).
Collaborative data mining, for its part, represents the application of
analysis techniques to information sources that are distributed in
nature and for which no centralized data model can be integrated.
Although they address different problems, these two lines converge in
the need for data mining mechanisms, tools, algorithms, etc. that can
be exploited in distributed environments.

This talk reviews the state of the art in distributed data mining,
laying out the needs and requirements of this area. It then presents
the related research areas that must contribute solutions to these
problems, such as load balancing, parallel I/O, multi-agent systems,
etc.

Finally, the current state of our group's work in this field will be
presented, along with the application of these techniques to the
analysis of bio-genetic data (BioMining) and Web information
(WebMining).



Architectural trade-offs in building a network processor for layers 4-7

Enric Musoll
Clearwater Networks, Inc.

Network processors are becoming, and will become, critical components
of network equipment for the expanding service-based Internet
infrastructure.

Until recently, the bottleneck in the Internet was the bandwidth
of the interconnections among the nodes. Now the bottleneck is the
processing power of the nodes themselves. This shift in the hot spot
of networking systems is due to advances in optical networks, which
have increased the bandwidth of the links, and to the increasing
demand for higher quality of service. The result is a packet-processing
bottleneck, fuelled by the limited memory bandwidth and the scarce
and usually under-utilized computing resources of the nodes.

In this talk, we will provide a definition of what a network processor
is.  The term 'network processor' covers a wide spectrum of devices,
with different target applications in each part of this spectrum.
We will focus on the higher-end portion, and we will see that,
due to the special characteristics of the application workloads, this
class of network processors deserves its own benchmarks for evaluation.

We will go over several architectures used by different existing
high-end network processors, and we will conclude that an SMT-based
design is the most suitable engine for a network processor. We will
see that a packet management unit or coprocessor is desirable to
offload expensive packet bookkeeping from the processor.

Finally, we will show the solution chosen by Clearwater Networks to the
problem of architecting a processor for layers 4-7.



SMT Architectures

Ronaldo Gonçalves
State University of Maringá, Informatics Department

Simultaneous Multithreading (SMT) is becoming one of the major trends in the design of future generations of microarchitectures. Its ability to exploit both intra- and inter-thread parallelism makes it possible to exploit the potential instruction-level parallelism (ILP) that will be offered by future processor designs. SMT architectures can hide long instruction latencies, taking better advantage of the hardware resources through the simultaneous execution of many diverse instructions from different threads. In order to provide detailed and accurate information about the performance of this approach, an SMT simulator has been developed on top of the SimpleScalar Tool Set.

Professor Ronaldo Gonçalves and other researchers have analyzed and evaluated SMT architectures using that SMT simulator on workloads composed of some SPEC95 benchmarks. The performance of SMT architectures has been investigated considering different instruction cache topologies, shared and per-thread distributed buffer topologies, different decode depths, and different branch prediction accuracies. In this talk, Professor Gonçalves will discuss his experiences in this area, showing some results and conclusions.



The MultiView Method for high performance Software Distributed Shared Memory

Assaf Schuster,
Computer Science Department, Technion

In this talk I will review the basic obstacles inherent in high performance implementations of Software Distributed Shared
Memory (SDSM) systems, and the traditional approaches for their solution. I will then present the MultiView method and how
it copes with these problems. I will also mention issues that need further consideration when using MultiView, such as cache
size, zero-copy communication, and protocol simplicity. Time permitting, I will describe extensions to SDSMs that were made
possible using MultiView, including transparent dynamic adaptation of coherence granularity, on-the-fly data-race detection,
and garbage collection.



rePLay: A Hardware Framework for Dynamic Optimization

Sanjay J. Patel,
University of Illinois at Urbana-Champaign

The dynamic behavior of an application speaks volumes about its future behavior. Programs tend to have stable patterns of execution.
Microprocessor techniques such as branch prediction, trace caches, and computation reuse attempt to capitalize on these stable
patterns in order to reduce a program's running time.

In the same spirit, the rePLay Framework uses a program's dynamic behavior to optimize its instruction stream. The rePLay
Framework is a set of microarchitectural features that enable aggressive and safe dynamic optimization of an executing program.
rePLay couples mechanisms to identify and optimize repetitive and stable regions of code with a hardware rollback mechanism. The
ability to roll back architectural state enables the optimizer to make speculative optimizations without requiring recovery code.

In this talk, I will describe the work done on rePLay by my research group at the University of Illinois (the Advanced Computing
Systems Group), including our recent development and performance characterizations of rePLay. I will also describe our work in
progress beyond rePLay, including a mechanism that uses rePLay to provide fault-tolerant operation.



Web Mining Applications

Ricardo Baeza-Yates,
Dept. of Computer Science, Engineering School, University of Chile.

Web data has great potential for a variety of applications, from user-guided site design to
improvements in search engine performance and ranking. In this talk we present quantitative
data on the Chilean Web and its use in ranking algorithms, hierarchical indexes, and distributed indexes.



Supercomputing for videogames

Jesus Corbal,
DAC, UPC

Computer entertainment is an emerging market that stands as one
of the most attractive segments of computer development in the coming
years. New home video game systems are envisioned as platforms able to
deliver a range of media services, such as DVD playback, on-line
gaming and 'virtual-reality'-like applications.

Current home videogame systems rival the performance of the
most advanced high-end desktop PC systems. This talk analyses the
potential evolution of this market segment and the current state of
the art, and describes the challenges for next-generation games,
which will require supercomputing-level performance.



Power - The Next Frontier

Ronny Ronen,
Microprocessor Research Labs, Intel

In the past decades the world of computers has witnessed phenomenal
advances. Computers have exhibited ever-increasing performance and
decreasing costs, making them more affordable, and in turn, accelerating
additional software and hardware development that fueled this trend even
more.

While the pace of this progress has been quite impressive over the last
two decades, it is becoming harder and harder to maintain it.
Microarchitecture is now exposed to a new set of challenges and has to
consider and explicitly manage the limits of semiconductor technology -
such as power dissipation, wire delays, and soft errors.

This talk addresses the power challenge. The talk starts by looking at the
historical power trends and explaining why continuing "business as usual"
will bring the power consumption and the power density to unmanageable
levels. The talk later explains how microarchitecture affects power and
energy and will demonstrate recent strategies and tactics to achieve more
power-efficient microprocessors.



Feedback directed optimization in Compaq's compilation tools for Alpha

Robert Cohn,
Alpha Development Group, Compaq Computer Corporation

This talk describes and evaluates the feedback directed optimizations (FDO)
that are used in the Compaq C compiler tool chain for Alpha. The optimizations
include superblock formation, inlining, commando loop optimization, register
allocation, code layout, and switch statement optimization. The optimizations
either are extensions of classical optimizations or are restructuring transformations
that enable classical optimizations.

Feedback directed optimization is highly effective, achieving a 17% speedup
over aggressive classical optimization. Inlining contributes the most performance;
code layout, superblock formation, and loop restructuring are also important.
The compiler achieves large speedups with FDO, but only a small percentage
of the code and complexity is specific to FDO.



Areas for Innovations in VLSI Architecture

Uri Weiser,
Intel

Microprocessor performance was, and probably will remain, one of the
driving forces of VLSI technology. Until recently, VLSI CPU architects
adopted mainframe computing concepts and integrated them into silicon.
Today, general-purpose VLSI CPUs have become the forefront of CPU innovation.

CPU architects should be guided by the basic unique strengths of VLSI:
frequency of operation, on-die bandwidth, and short latencies. On the
other hand, general-purpose microprocessors have become complex, and
CPU performance is not uniform across a range of different applications.
What is the solution to narrow this large deviation of performance from the
average?

The lecture will cover some initial thoughts in this direction.



Introduction to Automatic Differentiation

Enric Fontdecaba

Most numerical algorithms used to solve problems in physics and
engineering make use of the derivatives of mathematical models.

Since obtaining the analytical expression of these derivatives is a
tedious and error-prone operation, approximations of the derivatives
tend to be used instead. These approximations are computationally
expensive and can suffer from numerical stability problems.

This presentation introduces a procedure that, starting from a
program that computes a function, produces another program that
computes its derivatives with full precision and at a cost comparable
to that of evaluating the function. As will be shown, this technique
is related to automatic parallelization tools.
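A common realization of such a procedure is forward-mode automatic differentiation with dual numbers: each value carries its derivative alongside it, so computing the derivative costs roughly as much as evaluating the function. This is a minimal generic sketch, not the specific tool discussed in the talk:

```python
# Forward-mode automatic differentiation with dual numbers (generic
# illustration). Each Dual holds a value and its derivative; arithmetic
# propagates both exactly, with no finite-difference approximation.

class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val   # function value
        self.dot = dot   # derivative with respect to the input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f and its exact derivative at x in a single pass."""
    out = f(Dual(x, 1.0))
    return out.val, out.dot
```

For f(x) = x*x + 3*x at x = 2, a single evaluation yields both the value 10 and the exact derivative 7, at a cost comparable to evaluating f alone.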



VLSI: Is it all about integration and performance? Trends and directions

Uri Weiser,
Intel

The increasing demand for "specialized" MIPS, and the difficulty of
increasing general-purpose architecture performance, call for
alternative solutions. How can we achieve these specialized MIPS?
Integration is one solution.

Two major integration spaces will be presented:  the WAY we integrate into a
CPU die, and WHAT we should integrate. The potential to provide
specialized, power-efficient MIPS is there.

The presentation will cover several new ideas that will enable future CPUs
to reach the future required performance.



Building Better Branch Predictors

Daniel Jimenez
University of Texas at Austin

Modern pipelined microprocessors consult branch predictors to speculatively
fetch and execute instructions beyond conditional branches.  We present two
new branch prediction methods that address different aspects of the branch
prediction problem.  The key idea is to replace the commonly used two-bit
counter with another mechanism.  The first method uses the perceptron, one
of the simplest possible neural networks.  Perceptrons provide better predictive
capabilities than counters and allow our predictor to consider longer branch
histories.  The hardware resources needed for our method scale linearly with
the history length, in contrast with other purely dynamic schemes that
require exponential memory.  Using a hierarchical organization, this complex
multi-cycle predictor can be used as a component of a fast delay sensitive
predictor.  The second method uses a succinct encoding of Boolean functions
known as read-once monotone Boolean formulas.  By replacing a large branch
predictor component with a tiny circuit, we maintain accuracy while
decreasing branch predictor access delay and power consumption.
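As a concrete illustration of the first method, the following is a hedged sketch of a perceptron predictor: one weight vector per table entry, trained against the global history. The table size and history length are arbitrary here, and the threshold formula follows the published perceptron-predictor literature rather than being the authors' exact design:

```python
# Hedged sketch of a perceptron branch predictor. Parameters are
# illustrative, not the exact published design.

class PerceptronPredictor:
    def __init__(self, num_entries=64, history_len=16):
        self.history_len = history_len
        # one weight vector per table entry; weights[e][0] is the bias.
        # Storage grows linearly with history length, unlike schemes
        # that would need table space exponential in the history.
        self.weights = [[0] * (history_len + 1) for _ in range(num_entries)]
        self.history = [1] * history_len       # outcomes encoded as +1/-1
        self.theta = int(1.93 * history_len + 14)  # training threshold

    def _output(self, pc):
        w = self.weights[pc % len(self.weights)]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0           # True means predict taken

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.weights[pc % len(self.weights)]
        # train on a misprediction, or while confidence is below theta
        if (y >= 0) != taken or abs(y) <= self.theta:
            w[0] += t
            for i in range(self.history_len):
                w[i + 1] += t * self.history[i]
        self.history = self.history[1:] + [t]  # shift in the new outcome
```

Unlike a two-bit counter, this predictor learns a correlation with each bit of a long history; for instance, it quickly learns a strictly alternating branch that a counter would mispredict about half the time.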



Speculative Multithreading: from Multiscalar to MSSP

Gurindar S. Sohi,
University of Wisconsin-Madison

My research group at Wisconsin has been working on speculative
multithreading techniques for over a decade.  This talk will overview some of what we
have learned over the years.  We will start with our early work on
multiscalar, continue with data-driven multithreading and speculative slices, and
then on to our most recent work on master-slave speculative parallelization
(MSSP). In the master-slave speculative parallelization model, a master
processor executes a distilled version of the program which forks slave
computations that run on slave processors.  The distilled program is derived from an
original program using speculative transformations and is intended to be much
smaller than the original program.  The slave threads then verify the actions of the
master, in parallel.  Most of the talk will focus on the MSSP model.



Exploiting Value Locality in Physical Register File Design

Gurindar S. Sohi,
University of Wisconsin-Madison

The physical register file is an important component of a dynamically-scheduled processor.
Increasing the amount of ILP (issue width, instruction window size) places increasing demands
on the physical register file, calling for alternative physical register file organizations
and management strategies.  In this talk we consider the use of value locality to optimize
the operation of a physical register file.

We observe that the value produced by an instruction is often the same as a value produced
by another recent instruction, resulting in the same value present in multiple physical registers
at the same time.  By allocating a single physical register for a value, and by altering
the register rename mapping accordingly, we can reduce the physical register requirements.
We further observe that the number of writes to the physical register file can be reduced by
suppressing the writes of values already present in the register file. Furthermore, the number
of reads can be reduced by using other means to obtain the value in special cases.

We present optimizations for the special cases of the values 0 and 1. These are the most
frequently-occurring values across the spectrum of programs and also the most suitable ones for
optimization. We show how the physical register file size, as well as the number of read and write
accesses, can be reduced with trivial microarchitectural modifications by optimizing for 0's and 1's.
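The sharing idea can be sketched in a few lines: keep an inverse map from value to physical register, and bump a reference count instead of allocating and writing when the value is already present. The structure below is a simplified illustration with assumed names, not the paper's design; in particular, a real rename stage reclaims registers at instruction commit rather than eagerly:

```python
# Illustrative sketch of value-based physical register sharing: a new
# result that duplicates a live value reuses the existing physical
# register, saving both a register and a write.

class SharingRegisterFile:
    def __init__(self, size=8):
        self.values = {}          # physical reg -> value
        self.value_to_preg = {}   # value -> physical reg holding it
        self.refcount = {}        # physical reg -> number of mappings
        self.rename_map = {}      # architectural reg -> physical reg
        self.free = list(range(size))
        self.writes = 0           # count of actual register-file writes

    def write(self, arch_reg, value):
        self._release(arch_reg)
        preg = self.value_to_preg.get(value)
        if preg is not None:
            self.refcount[preg] += 1    # share: no new write needed
        else:
            preg = self.free.pop()
            self.values[preg] = value
            self.value_to_preg[value] = preg
            self.refcount[preg] = 1
            self.writes += 1            # only distinct values cost a write
        self.rename_map[arch_reg] = preg

    def _release(self, arch_reg):
        preg = self.rename_map.pop(arch_reg, None)
        if preg is None:
            return
        self.refcount[preg] -= 1
        if self.refcount[preg] == 0:    # last mapping gone: free it
            del self.value_to_preg[self.values.pop(preg)]
            del self.refcount[preg]
            self.free.append(preg)

    def read(self, arch_reg):
        return self.values[self.rename_map[arch_reg]]
```

Writing the value 0 to two architectural registers allocates and writes only one physical register; the second definition simply remaps to it, which is the mechanism the abstract exploits for the frequent values 0 and 1.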



Parallelism and Computational Chemistry

Prof. Enrico Clementi,
University of Strasbourg, France

The development of Computational Chemistry is strictly related to the
development of parallel computers. This parallelism will be illustrated
following the evolution of both Computational Chemistry and Computers.



Automatic processor specialisation using ad-hoc functional units

Paolo Ienne,
Processor Architecture laboratory,
Ecole Polytechnique Federale de Lausanne

Many systems-on-chip embedded processors can be specialised for a
given application-domain by adding ad-hoc functional units. In their
simplest form, these functional units can be used to map sections of
the dataflow graph, that is, clusters of elementary arithmetic or
logic operations; more complex hardware add-ons could attack the
control flow and, for instance, map loops onto sequential functional
units. For automatic processor specialisation (i.e., to identify and
generate automatically such ad-hoc functional units), the two
strategies imply different methods and have a markedly different
complexity.  Before taking on the complexities of control flow, a
key question appears so far unanswered: how much can one improve the
performance of an embedded processor on specific algorithms by
automatically mapping only dataflow sections of code onto special
functional units?  This talk presents analysis results on some
programs of the MediaBench suite. The idea is to assess the basic
scope for performance improvement and break it down into its different
sources: hardware parallelism; avoided quantization of each simple
instruction into an integral number of cycles; and simplification of the
logic due to constants. The elementary scope for speedup is increased
through additional manual optimisations (amenable to automation):
bit-width analysis and arithmetic implementation optimisation.
Finally, classic ILP techniques such as loop unrolling and predication
are used to increase the size of the basic blocks and give more scope
for the above sources of improvement. The results show that some
significant improvements in speed can be targeted without necessarily
mapping the control flow onto hardware.
 



EXPERT Performance Analysis Tools for Mixed-Mode Parallel Systems

Dr. Bernd Mohr ,
Research Centre Juelich
Prof. Dr. Allen Malony,
University of Oregon

The architecture design space for scalable, high-performance computing
systems is as rich as ever, with clusters of shared-memory multiprocessor
(SMP) systems (including integrated vector processors, a la the Earth
Simulator) blazing the scalability trail.  Trying to keep pace, parallel
programming environments continue to evolve language and runtime technology
to more efficiently access available parallelism, while at the same time,
providing the user with abstractions to manage parallelism complexity.
Recently, the combination of shared-memory and distributed-memory
architecture in SMP clusters has motivated the use of multi-threading plus
message passing as the latest "mixed-mode" parallel programming fashion.
However, success of mixed-mode parallelism will depend not only on
available programming tools, but also on their ability to achieve good
performance.  Unfortunately, the interplay of shared- and distributed-memory
execution leads to greater performance complexity, complexity that current
performance analysis technology does not adequately address.

In this talk, we present the EXPERT performance-analysis environment.
EXPERT provides a complete tracing-based solution for automatic performance
analysis of MPI, OpenMP, or mixed-mode applications running on parallel
computers with SMP nodes.  EXPERT describes performance problems using a
high level of abstraction in terms of common situations that result from an
inefficient use of the underlying programming model(s).  The set of
supported problems is extensible and can be custom tailored to
application-specific needs.  The analysis is carried out along three
interconnected dimensions: class of performance behavior, call tree
position, and thread of execution.  Each dimension is arranged in a
hierarchy, so that the user can investigate the behavior on varying levels
of detail.  All three dimensions are interactively accessible using a
single integrated view.

The talk reviews the EXPERT system in detail and discusses future work to
integrate its technology with the TAU performance system.



Research and Technology Programs at The E-Business Technology Institute

Dr. Chung-Jen Tan,
E-Business Technology Institute,
Hong Kong University

In this talk we will provide an overview of the organization, mission, and
research programs at ETI, an R&D organization established in September 1999
at the University of Hong Kong (HKU) as a partnership between IBM and the
university. In a little over two and a half years, ETI has established itself
as a pre-eminent e-business technology research institute in the greater
China region. With over 40 full-time staff it has produced technologies in
e-commerce, securities, content delivery, and wireless applications. ETI
has also established a satellite research center in Shanghai, China. We will
also describe how ETI staff work with faculties and industry partners in
developing solutions most relevant to the region.



Dynamic Recurrence Mappings

Prof. Graham Megson,
School of Computer Science, Cybernetics & Electronic Engineering,
University of Reading
 

We look at the problem of synthesising recurrence equations onto
regular architectures, focusing in particular on cases where the data
dependencies are not known until run time.



Solving large scale problems and grid computing

Prof. Vassil Alexandrov,
School of Computer Science, Cybernetics & Electronic Engineering,
University of Reading

In particular, we focus on how to use Monte Carlo methods for solving such
large-scale problems.  Examples will be drawn from air pollution modelling
and information retrieval.



Exploring Improved Cache Organizations Based on Page-Level Access Behavior

Sriram Vajapeyam,
Independent Consultant, India

Effective and efficient caches continue to be important given not only
the increasing processor-memory speed disparity but also the newer
requirements of low power, localized communication, etc. A rich body of
work has addressed issues such as cache placement (conflicts), cache
tag overheads, cache latency, etc.

We approach caches from a different angle than previous work: we look
for patterns in the page addresses (tag bits) of mutually contending
cache accesses.  We find that half a dozen or fewer page-number bits (tag
bits) account for a large majority of the virtual-address differences
between contending cache accesses in the SPECint2000 benchmarks.  In this
talk, we first describe these results, and also sketch a potential
hardware method for further reducing the number of such conflicting tag
bits.  We then identify several directions for exploring improved cache
organizations that exploit this behavior, including improving the choice
of cache index bits, reducing tag overhead (and thus power) for
set-associative and decoupled-sector caches, and a new sub-tagged cache
organization.
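The observation can be illustrated with a small sketch, assuming a hypothetical cache with 64-byte lines and 1024 sets (parameters chosen for illustration, not taken from the talk): two addresses contend when their set indices match, and we ask which tag bits distinguish them.

```python
# Hedged sketch: for a pair of addresses mapping to the same cache set,
# find which page-number (tag) bits differ.  Geometry is assumed.
OFFSET_BITS = 6       # 64-byte lines (illustrative)
INDEX_BITS = 10       # 1024 sets (illustrative)

def tag(addr):
    return addr >> (OFFSET_BITS + INDEX_BITS)

def set_index(addr):
    return (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def differing_tag_bits(a, b):
    """Positions of the tag bits that differ between two addresses."""
    x = tag(a) ^ tag(b)
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

a, b = 0x40013040, 0x40093040        # same set index, different tags
assert set_index(a) == set_index(b)  # they contend for the same set
print(differing_tag_bits(a, b))      # → [3]: a single tag bit separates them
```

If conflicting accesses typically differ in only a handful of tag-bit positions, as the talk reports, those few bits are attractive candidates for alternative index-bit choices or reduced tags.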

Some of the research described in this talk is to appear in TCCA's
Computer Architecture Letters of July 2002, and the rest is work in
progress, some of it jointly with Siddhartha Tambat. The talk aims to
trigger discussions of possible collaborative work.


NUCA: Non-Uniform Cache Architectures for Wire-Dominated On-Chip Caches

Doug Burger,
University of Texas at Austin

As on-chip global wire delays increase, cache design will be
affected as profoundly as processor design.  Within three generations,
the time to access a cache will be a function of where in the cache a
datum resides, not the time to actually access that datum.  We propose
a new class of cache designs, called NUCA, that make the cache access
time a function of where in the cache a datum resides.  Cache access
latencies thus become a continuum of latencies, rather than a single,
discrete, worst-case delay.  In our dynamic NUCA implementation, we
show how to permit important data to migrate within the cache,
allowing most of the cache accesses to be serviced from the cache's
closest, and therefore fastest, sub-banks.  We show that these schemes
outperform all other cache organizations, including a multi-level
cache, using the same area.
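A minimal sketch of the dynamic-NUCA idea, assuming an illustrative linear bank-latency model and a promote-one-bank-per-hit migration policy (neither taken from the actual design):

```python
# Hedged sketch of dynamic NUCA: access latency depends on which bank a
# line lives in, and each hit migrates the line one bank closer to the
# processor.  Bank count and latencies are illustrative assumptions.
BANKS = 8
LATENCY = [3 + 2 * b for b in range(BANKS)]   # closer banks are faster

location = {}   # cache line -> bank index holding it

def access(line):
    bank = location.setdefault(line, BANKS - 1)   # new lines enter far away
    cycles = LATENCY[bank]
    if bank > 0:                                  # promote toward bank 0
        location[line] = bank - 1
    return cycles

# Repeated accesses to a hot line get cheaper as it migrates inward:
print([access("hot") for _ in range(8)])   # → [17, 15, 13, 11, 9, 7, 5, 3]
```

The point of the sketch is the continuum: a frequently used line converges to the closest, fastest sub-bank instead of always paying a single worst-case delay.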



In-Memory Parallelism for Database Workloads

Pedro Trancoso,
University of Cyprus

In this work we analyze the parallelization of database workloads for
an emerging memory technology: Processing-In-Memory (PIM) chips. While
most previous studies have used scientific workloads to evaluate PIM
architectures, we focus on database applications as they are a dominant
class of applications.
For our experiments we built a simple DBMS prototype, which contains
modified parallel algorithms, an in-chip data movement algorithm, and a
simple query optimizer.  Compared to single-processor execution,
the average speedup for a PIM with 32 processing elements is 43 times.
Other results show that an n-way multiprocessor of similar cost cannot
perform as well. Overall, the results obtained indicate that PIM chips
are an architecture with large potential for database workloads.



Polymorphic Mechanisms in the UT-Austin TRIPS Processor

Doug Burger,
University of Texas at Austin

As designers stretch to deepen pipelines and increase clock
rates, current processor designs are becoming more fragile; their
performance varies more widely across different application classes.
Polymorphic architectures hope to reverse this trend: by providing a
sea of ALUs and memory banks, and mapping application classes to these
hardware resources based on the application's needs, unprecedented
performance and flexibility may be realized.  In this talk, I will
describe a set of mapping mechanisms for the TRIPS processor that is
intended to permit the processor to run single-threaded,
multi-threaded, vector, and streaming codes efficiently.  With these
mechanisms, we are building three "major morphs" into the processor,
for desktop code (the D-morph), streaming codes (the S-morph), and
threaded/server codes (the T-morph).
 



Google: Finding Needles in Terabyte Haystacks

Luiz Barroso,
Google

Hiding behind a fairly simple web user interface lies a
formidable collection of systems, technologies, and infrastructure
that enables Google to serve nearly 150 million user queries
per day from an index of over three billion documents.
Achieving such scale while constantly improving on the quality
of our service is a significant challenge, and requires expertise
from virtually every discipline of Computer Science. In this talk
I will focus on the software and hardware architecture that has
enabled Google to meet such a challenge.



The Optimum Pipeline Depth

Ronny Ronen,
Microprocessor Research Labs, Intel

Determining the target frequency of the processor is one of the
fundamental decisions facing a microprocessor architect.  While the
historical debate over pushing frequency versus IPC to improve performance
continues, many argue that modern processors have pushed pipelines beyond
their optimal depth.  Is that so?

Three papers at ISCA 2002 (from IBM, UT Austin/HP, and Intel) address this
issue.  They try to identify what the optimal pipeline depth should be and
how it is affected by various microarchitectural characteristics such as
cache sizes, branch misprediction rate, and ALU timing.  All of them
seem to agree that pipelines can be made deeper than they are today.

In my talk I will present the problem, give slightly modified versions
of these three talks, and add some commentary of my own, including, of
course, the power implications that were neglected in all three papers.
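The trade-off these papers analyze can be sketched with the textbook model: per-stage time falls as logic is divided across more stages, while fixed latch overhead and the branch-misprediction penalty grow with depth. All constants below are illustrative assumptions, not figures from the papers:

```python
# Hedged sketch of the optimal-pipeline-depth trade-off.
T_LOGIC = 40.0     # total logic delay per instruction (arbitrary units)
T_LATCH = 1.0      # latch/clocking overhead per stage (assumed)
MISS_RATE = 0.05   # branch mispredictions per instruction (assumed)

def time_per_instruction(depth):
    cycle = T_LOGIC / depth + T_LATCH   # shorter cycle as depth grows...
    flush = MISS_RATE * depth           # ...but deeper pipes flush more work
    return cycle * (1.0 + flush)

# Sweep depths and pick the minimum of the convex trade-off curve.
best = min(range(1, 60), key=time_per_instruction)
print(best)
```

With these made-up constants the optimum lands near depth sqrt(T_LOGIC / MISS_RATE); the ISCA papers' contribution is calibrating such a model against real microarchitectural parameters, where (as noted above) power would shift the answer further.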



Realizing High IPC Through a Scalable, Multipath Microarchitecture

David Kaeli,
Dept. of  Electrical and Computer Engineering,
Northeastern University

This work explores a microarchitecture that achieves high execution
performance on conventional single-threaded program codes without
compiler assistance.  Microarchitectures that can have several hundreds
of instructions simultaneously in execution can provide a means to extract
larger amounts of instruction level parallelism, even from programs
that are very sequential in nature.  However, several problems are
associated with such microarchitectures, including scalability
issues related to control flow and memory latency.

This talk will present a basic overview of our microarchitecture and discuss how
it addresses scalability as we attempt to execute many instructions in parallel.
We will also describe some of the more novel features of the machine including:
Active Stations (a Tomasulo-like intelligent reservation station),
timetags (used to enforce program order), and
disjoint execution (used to limit the impact of unpredictable program
control flow).  We provide simulation results for several
geometries of our microarchitecture that illustrate how high IPC
can be realized from integer programs.  We also explore algorithms
that dynamically reassign speculative paths, reallocating hardware
resources to higher priority paths.




That's all folks!!!!!