Seminar 00/01 - Abstracts


DYNAMO: A Transparent Dynamic Optimizer

Evelyn Duesterwald
Hewlett-Packard Laboratories
 

This talk describes the design and implementation of DYNAMO, a prototype
dynamic optimizer that is capable of accelerating the performance of a
native program binary at runtime. DYNAMO identifies and extracts the
dynamically hot code regions in the executing program binary and
achieves a performance boost by optimizing and laying out these
regions in a software code cache. Initially, DYNAMO observes the program
behavior through interpretation and uses a very low overhead online
profiling scheme to identify the dynamically hot program paths.  Copies of these
hot paths are optimized using light-weight optimization techniques and
emitted into the software code cache. Subsequent execution of these traces
causes the cached version to be executed. Over time, the program's working
set materializes inside the software code cache, and execution increasingly
takes place in the optimized, cached version of the input binary.
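The hot-path selection described above can be sketched in a few lines. This is a toy model, not HP's implementation; the threshold value and the class interface are assumptions for illustration only:

```python
# Toy sketch of Dynamo-style hot-path selection (not HP's actual code).
# The interpreter counts executions of potential trace heads; once a
# counter crosses a threshold, the path starting there is "translated"
# into a software code cache and executed from there on later visits.

HOT_THRESHOLD = 50           # assumed value, for illustration only

class ToyDynamo:
    def __init__(self):
        self.counters = {}    # trace-head address -> execution count
        self.code_cache = {}  # trace-head address -> cached (optimized) trace

    def execute(self, addr, path):
        """One pass starting at addr; 'path' stands in for the dynamic
        sequence of blocks that would be recorded as a trace."""
        if addr in self.code_cache:
            return ("cached", self.code_cache[addr])
        self.counters[addr] = self.counters.get(addr, 0) + 1
        if self.counters[addr] >= HOT_THRESHOLD:
            # Hot: emit an "optimized" copy of the path into the code cache.
            self.code_cache[addr] = tuple(path)
            return ("translated", tuple(path))
        return ("interpreted", None)

dynamo = ToyDynamo()
for _ in range(60):                       # repeatedly execute a loop head
    mode, _ = dynamo.execute(0x1000, ["blockA", "blockB"])
print(mode)                               # "cached" once the loop is warm
```

After the threshold is reached, every subsequent execution hits the code cache, which is the working-set effect the abstract describes.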

DYNAMO is written entirely in user-level software, and runs on a
PA-RISC machine under the HP-UX operating system.
Our experimental evaluation of DYNAMO demonstrates that it is possible
to use a piece of software to improve the performance of a native,
statically optimized program binary, while it is executing.



Compiling and Optimizing Image Processing Algorithms for FPGA-based Reconfigurable Computers

Walid Najjar
University of California Riverside

This talk presents a high-level language for expressing image processing
algorithms, and an optimizing compiler that targets FPGA-based
reconfigurable computing systems. The main features of this language,
called SA-C, are: 1) to support the simple expression of image processing
algorithms and applications, and 2) to enable efficient compilation to
FPGAs.  SA-C is a subset of and an extension to C. It is based on C syntax
and semantics. The optimizing compiler translates  SA-C programs into
non-recursive data flow graphs, which in turn are translated into VHDL
using commercial synthesis tools. Performance numbers are reported for
some well-known image processing routines, written in SA-C and
automatically compiled to an Annapolis Microsystems reconfigurable
system.



Adaptive Operating Systems: An Architecture for Evolving Systems

Barton P. Miller
University of Wisconsin, Computer Sciences Department

Operating systems used to be viewed as static entities, changing almost as
slowly as the underlying hardware.  Recent commercial systems have provided
for a small amount of run-time change by allowing device drivers to be
installed while the system is running.  We are developing "adaptive operating
systems" whose code can change and evolve while the system is running.  This
adaptation can be used to instrument the code for profiling or debugging
purposes, or to modify and extend the operating system to adapt to changing
workloads, application demands, and configurations.

Our work differs from other efforts in this area in two ways.  First, we
can instrument and modify a stock, commercial operating system (Solaris) as
it was delivered to the customer.  We operate directly on the executable code
while it is running.  Second, we can modify the operating system at almost
any point in its code.  We are not constrained to system call or procedure
call replacements.

Our research is embodied in a facility called KernInst.  We will describe
the basic KernInst mechanism and several uses of KernInst, including
performance profiling and dynamically modifying and customizing the operating
system in response to its work load.  We will also present a case study
using KernInst to profile the Solaris kernel under a Web proxy server (Squid)
workload.


RWC Omni project: a portable OpenMP compiler and cluster-enabled OpenMP

Mitsuhisa Sato
Real World Computing Partnership, Japan

OpenMP is an emerging standard API for parallel programming on shared
memory multi-processors.  We are developing a portable OpenMP
compiler, the Omni OpenMP compiler, for SMPs and clusters of SMPs.  The
objectives of the Omni OpenMP compiler project include:

 - A portable implementation of OpenMP for SMPs
 - Cluster-enabled OpenMP, which supports the seamless porting of OpenMP
   programs from SMPs to SMP/PC clusters.

The Omni OpenMP compiler translates C and Fortran77 programs with OpenMP
pragmas into C code suitable for compilation with a native compiler and
linking with the runtime library. The compiler has been ported to several
platforms, including Sun Solaris, SGI IRIX, and Linux (x86 and Alpha). The
recent release 1.2 includes a StackThreads implementation for nested
irregular parallelism, and a cluster-enabled OpenMP implementation using
SCASH, a page-based software distributed shared memory system, on a
cluster of PCs.
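The essence of such a source-to-source translation is that the compiler outlines the loop body and the runtime hands each thread a chunk of iterations. The following Python model illustrates the idea only; the function names and the static scheduling formula are assumptions, not Omni's actual runtime interface:

```python
# A toy model of what an OpenMP "parallel for" becomes after translation:
# the runtime gives each thread a contiguous chunk of iterations (static
# scheduling) and joins all threads at the loop's implicit barrier.
import threading

def static_chunk(n_iters, tid, n_threads):
    """Iteration range [lo, hi) assigned to thread 'tid' under a simple
    static schedule (remainder iterations go to the first threads)."""
    base, extra = divmod(n_iters, n_threads)
    lo = tid * base + min(tid, extra)
    hi = lo + base + (1 if tid < extra else 0)
    return lo, hi

def parallel_for(n_iters, body, n_threads=4):
    """Stand-in for the runtime call a translated OpenMP loop would make."""
    threads = []
    for tid in range(n_threads):
        lo, hi = static_chunk(n_iters, tid, n_threads)
        t = threading.Thread(target=lambda lo=lo, hi=hi:
                             [body(i) for i in range(lo, hi)])
        threads.append(t)
        t.start()
    for t in threads:
        t.join()                  # implicit barrier at the end of the loop

# Usage: double every element of a list in parallel.
data = list(range(10))
parallel_for(len(data), lambda i: data.__setitem__(i, data[i] * 2))
print(data)
```

A real cluster-enabled implementation would additionally place the shared data in a software DSM such as SCASH rather than in one process's memory.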

In this talk, we will present the implementation and a performance
evaluation of our compiler, and our approach to cluster-enabled
OpenMP.



Validation of Dimemas communication model for MPI collective operations

Sergi Girona
DAC/CEPBA, UPC

This talk presents an extension of Dimemas to enable accurate performance
prediction of message passing applications with collective communication
primitives. The main contribution is a simple model for collective
communication operations that can be user-parameterized. The experiments
performed with a set of MPI benchmarks demonstrate the utility of the model.
 



Spec CPU2000 characterization for Superscalar Architectures

Smail NIAR
University of Valenciennes, FRANCE

The objective of this study is to perform performance measurements of the new
SPEC CPU2000 benchmark programs on a superscalar processor simulator.
The results of these experiments will allow us:

  1. To show which architectural parameters have the greatest
     influence on the performance of the SPEC CPU2000.
  2. To spot the capabilities and limitations of modern superscalar
     micro-architectures, and where the "hot spots" are.
  3. Finally, to help designers in proposing new, more powerful
     micro-architectures.


Dynamically Reconfigurable Architectures: An overview

Juan Jose Noguera
DAC, UPC

The continued progress of semiconductor technology has enabled the
appearance of new reconfigurable logic devices and architectures,
making possible the development of a new research field, Reconfigurable
Computing, which offers an intermediate solution between general-purpose
systems (microprocessors) and application-specific systems (ASICs).
The aim of the talk is to introduce the basic concepts of reconfigurable
computing. In particular, dynamically reconfigurable devices and
architectures, which are run-time reconfigurable, are presented.
Some examples of how this kind of reconfigurable device and architecture
is coupled with standard microprocessors are explained. It is also explained
how reconfigurable computing can, in certain applications, help to increase
the performance of standard microprocessors.



What Are Teaching Surveys Good For (and to Us)?

Miguel Valero
DAC, UPC

There is no doubt that the use of student surveys as an instrument for
evaluating teaching is a controversial topic. It was controversial when
it began to be used at UPC, and it is now that it has been decided to
conduct surveys less frequently. It is controversial here, and also at
other universities where the system could not be established because of
opposition from the faculty.

Given so much controversy, the question in the title of this presentation
seems relevant. The talk will argue that the teaching survey can be an
essential instrument for improving teaching. For that, we only need a
genuine will to improve, and a bit of a spirit of collaboration among
ourselves. We will see how an improvement process can be organized around
the survey (the current process, certainly, does not work), and a concrete
proposal for action in this area will be made.



The ABCs of Cryptography and Its Algorithms

Fernando Martinez
MAII, UPC

The aim of the talk is to present the cryptographic algorithms and
protocols most widely used in applications that employ cryptography.

Secret-key (DES and AES) and public-key (RSA) encryption algorithms
will be presented. While the former are faster and use shorter keys,
the latter have the advantage that no prior key agreement is needed
to initiate a secure communication. Moreover, public-key cryptography
makes it possible to define the concept of a digital signature, which
is fundamental to the development of electronic commerce.

The cryptographic algorithms presented are considered computationally
secure; that is, with current knowledge and resources it is not
possible to break them.
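The public-key mechanics can be shown with a classic textbook-sized RSA example. The primes below are tiny and purely illustrative; real RSA uses keys of a thousand bits or more plus proper padding:

```python
# Toy RSA with tiny textbook primes, purely to illustrate the mechanics;
# utterly insecure at this key size.
p, q = 61, 53
n = p * q                      # public modulus: 3233
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent: modular inverse of e

def encrypt(m):                # c = m^e mod n  (anyone can do this)
    return pow(m, e, n)

def decrypt(c):                # m = c^d mod n  (only the key holder can)
    return pow(c, d, n)

m = 65
c = encrypt(m)
print(decrypt(c))              # recovers 65
```

No prior secret needs to be shared: the pair (n, e) is published, while d stays private. Signing is the same operation with the roles of e and d swapped.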



Alto: A Link Time Optimizer for the Compaq Alpha

Manel Fernandez
DAC, UPC

Traditional optimizing compilers are limited in the scope of their
optimizations by the fact that only a single function, or possibly a
single module, is available for analysis and optimization. In
particular, this means that library routines cannot be optimized to
specific calling contexts. A possible solution is to carry out
whole-program optimization at link time. The goal of the alto project is
to develop whole-program dataflow analyses and code optimization
techniques for link time program optimization; the current system is
targeted to the DEC Alpha architecture, and produces code that is
typically considerably faster than that produced using DEC's Om
link-time optimizer (with or without profile-guided inter-file
optimization carried out by the compiler).



Inside the Pentium 4 Processor

Doug Carmean
Intel

The Pentium 4 Processor is an all new processor from Intel.  When the
processor is introduced in November, it will deliver the highest performance
of any desktop processor in the world.  This talk will walk through some of
the details of the new microarchitecture, and give some insight into the
philosophy that was used to design the processor.  The details include a
tour through the branch misprediction pipeline and data speculation.



Differential FCM: Increasing Value Prediction Accuracy by Improving Table Usage Efficiency

Bart Goeman
Ghent University

Value prediction is a relatively new technique to increase the
Instruction Level Parallelism (ILP) in future microprocessors.
An important problem when designing a value predictor is efficiency: an
accurate predictor requires huge prediction tables. This is especially the
case for the finite context method (FCM) predictor, the most accurate one.

In this presentation, we show that the prediction accuracy of the FCM can be
greatly improved by making the FCM predict strides instead of values.
This new predictor is called the differential finite context method
(DFCM) predictor. The DFCM predictor outperforms a similar FCM predictor
by as much as 6% to 17%, depending on the prediction table size.

We use several metrics to show that the key to this success is
reduced aliasing in the level-2 table between stride and non-stride
patterns. We also show that the DFCM is superior to hybrid predictors
based on FCM and stride predictors, since its prediction accuracy is
as high as that of a hybrid one using a perfect meta-predictor.
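The difference between predicting values and predicting strides can be seen in a small simulation. The classes below are a deliberately simplified model (order-2 context, unbounded tables), not the paper's exact predictor configuration:

```python
# Minimal sketch of FCM vs. DFCM (simplified, order-2 context, unbounded
# tables). FCM's level-2 table maps a history of recent values to the next
# value; DFCM maps a history of recent *strides* to the next stride, so one
# learned pattern covers every arithmetic sequence with those strides.

class FCM:
    def __init__(self, order=2):
        self.hist, self.order, self.table = [], order, {}
    def predict_and_update(self, actual):
        key = tuple(self.hist[-self.order:])   # level-1 context
        pred = self.table.get(key)             # level-2 lookup
        self.table[key] = actual               # learn: context -> value
        self.hist.append(actual)
        return pred

class DFCM(FCM):
    def __init__(self, order=2):
        super().__init__(order)
        self.last = None
    def predict_and_update(self, actual):
        if self.last is None:
            self.last = actual
            return None
        stride = actual - self.last
        key = tuple(self.hist[-self.order:])   # context of recent strides
        pred_stride = self.table.get(key)
        pred = None if pred_stride is None else self.last + pred_stride
        self.table[key] = stride               # learn: context -> stride
        self.hist.append(stride)
        self.last = actual
        return pred

# A strided stream: DFCM learns the stride pattern once and keeps
# predicting correctly, while plain FCM keeps seeing brand-new values.
stream = list(range(100, 200, 4))              # 100, 104, 108, ...
for name, p in (("FCM", FCM()), ("DFCM", DFCM())):
    hits = sum(p.predict_and_update(v) == v for v in stream)
    print(name, hits)
```

On this stream FCM never hits, because each value is novel, while DFCM predicts correctly after a short warm-up; the same stride entry also serves any other arithmetic sequence, which is the table-usage efficiency the title refers to.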



Performance-Driven Processor Allocation

Julita Corbalan
DAC, UPC

This work is focused on processor allocation in shared-memory multiprocessor
systems, where no knowledge of the application is available when applications are
submitted. We perform the processor allocation taking into account the characteristics
of the application measured at run-time. We want to demonstrate the importance
of an accurate performance analysis and the criteria used to distribute the
processors. With this aim, we present the SelfAnalyzer, an approach to
dynamically analyze the performance of applications (speedup and execution
time), and the
Performance-Driven Processor Allocation (PDPA), a new scheduling policy which
distributes processors considering both the global conditions of the system and
the particular characteristics of running applications. This work also defends
the importance of the interaction between the medium-term and the long-term
scheduler to control the multiprogramming level in the case of the clairvoyant
scheduling policies.

We have implemented our proposal on an SGI Origin2000 with 64 processors,
and we have compared its performance with that of some scheduling policies
proposed so far and with the native IRIX scheduling policy. Results show that
the combination of the SelfAnalyzer+PDPA with the medium/long-term scheduling
interaction outperforms the rest of the scheduling policies evaluated. The evaluation
shows that in workloads where a simple equipartition performs well, the PDPA
also performs well, and in extreme workloads where all the applications have a
bad performance, our proposal can achieve a speedup of 3.9 with respect to an
equipartition and 11.8 with respect to the native IRIX scheduling policy.



OpenMP Extensions for Thread Groups and Their Run-Time Support

Marc Gonzalez
CEPBA

We present a set of proposals for the OpenMP shared-memory
programming model oriented towards the definition of thread groups
in the framework of nested parallelism. The presentation also describes
the additional functionalities required in the runtime library
supporting the parallel execution. The extensions have been
implemented in the OpenMP NanosCompiler. The experimental results
show the usefulness of the proposed extensions in a set of real
applications (two SPEC95FP benchmarks and a generic multi-block
code).



A Whole New Ballgame: Supercomputing on Two AA Batteries

Dr. David Baker
Vice President of Product Development, BOPS Inc.

This talk addresses the challenges of designing high performance DSPs for
embedded applications. The challenges in embedded processor design are very
different from the challenges of general-purpose processor design which has
received so much attention. Traditional benchmarks like SPEC do not
represent workloads that dominate the embedded marketplace. Benchmarks like
EEMBC better characterize signal processing and media-centric applications
within consumer appliances. Many embedded applications demand ultra high
performance on problems that are embarrassingly parallel. This performance
is not achieved by simply increasing clock frequency especially in
environments where energy management is key to success. This talk will
describe architectural innovation that provides new levels of performance at
decreased levels of power consumption in embedded environments.
 



The Influence of Both the Hardware Architecture and the OS on IP Router Performance

Oscar-Ivan Lepe-Aldama
DAC, UPC

The continued growth of the Internet is imposing enormous workloads
on the nodes in charge of sustaining it: IP routers. Basically, these
nodes, installed both in small offices and at the intersections of the
backbones at the heart of the Internet, are in charge of finding routes
and forwarding along them the packets that cross the planet over this
network of networks. With the growth of the Internet, not only has the
number of packets per second that these systems must process increased,
but so have the amount of work per packet and the need to provide
different levels of service. The IP routers currently in operation are
"losing this battle".

The aim of this talk is to discuss the influence that both the hardware
and the software architecture (namely, the operating system) have on the
performance of IP routers. In doing so, we seek to identify possible
sources of poor performance and to sketch possible solutions. The talk
is supported by a qualitative and quantitative evaluation of the
performance of a particular kind of IP router: one built from PC
hardware and software. Studying this kind of router is important because
of the telematic potential of its technology. Moreover, a large number
of routers currently in operation have an architecture (hardware and
software) with more similarities than differences with respect to that
of a PC.



HPC research activities at Nara Women's University

Kazuki Joe
Department of Information and Computer Sciences Nara Women's University, JAPAN

In this talk, two HPC research topics are introduced.

  1. Collaboration of Parafrase-2 and NaraView for Effective Parallelization
     Support. NaraView is a visualization tool for the support of parallel
     programming. A case study of the parallelization of an extended Hückel
     calculation program by the collaboration of Parafrase-2 and NaraView
     is introduced.

  2. Reduction of Train Noise from Telluric Current Data by Neural Networks.
     Short-term earthquake prediction is an important and emerging research
     topic in Japan. In this talk, an application of neural networks to
     short-term earthquake prediction is introduced.


An Overview of Virtual Machine Architectures, Implementations and Applications

Jim Smith
University of Wisconsin-Madison

A Virtual Machine (VM) provides a program execution environment which,
from the perspective of a program, is identical to the one provided
by a conventional machine environment.  However, the real underlying
machine may be quite different from the VM;
a layer of software performs any required conversions and translations
from virtual machine to real machine.  A concealed VM software layer adds
considerable flexibility by permitting cross-platform software portability.
Furthermore, if VM software is made available to the hardware designer,
it enables implementation-dependent optimizations for improved performance,
power saving, and fault tolerance.  There are a number of virtual machine
architectures, implementations and applications.  This talk is
essentially a "mini-tutorial" that will survey implemented (and
proposed) virtual machines.  Emphasis will be placed on the
important underlying technologies, and a VM taxonomy will be presented.



Transistor Count and Chip-Space Estimation of Simulated Microprocessors

Marc Steinhaus
University of Karlsruhe

This talk proposes a chip-space and transistor-count model of
microprocessors at the register-transfer level. The estimation tool,
which is based on the model, receives its inputs from the baseline
architecture and the configuration file of a microarchitecture
simulator, such as the SimpleScalar tool set or the Karlsruhe SMT
multimedia simulator. The estimation tool gives a pre-silicon
complexity estimate and allows different microprocessor configurations
to be compared with respect to their anticipated hardware complexity.
The estimation method is validated using the configuration parameters
of a real processor, yielding a transistor count and a chip-space
estimate that are very close to the real processor's numbers.
 



Micro-Operation Cache: A Power-Aware Frontend for Variable Instruction Length ISAs

Ronny Ronen
Microprocessor Research Labs, Intel

Novel instruction fetch structures improve processor performance by
providing higher fetch bandwidth, but at the cost of increased complexity
and higher power consumption. In this talk we introduce a simpler mechanism
- the Micro-Operation Cache (Uop Cache - UC) aimed toward reducing power.
The UC is designed to provide competitive frontend performance relative to
existing instruction fetch mechanisms, based on conventional instruction
caches, while significantly decreasing power and energy consumption. The UC
caches basic blocks of instructions; instructions are pre-decoded into
micro-operations (uops). The UC can fetch a single basic-block of uops per
cycle. Fetching complete pre-decoded basic-blocks eliminates the need to
repeatedly decode variable length instructions and simplifies the process of
predicting, fetching, rotating and aligning fetched instructions. This
elimination of power-hungry operations is the main source of the UC's power
savings. The UC coexists with the Instruction Cache (IC) and is designed to
work in concert with the IC so as not to lose performance on a UC miss. The design
enables even a small UC, which features only a moderate hit rate, to be
quite effective. We propose several optional UC related mechanisms to
improve performance and reduce power.
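The power argument can be made concrete with a toy model. The structure below is an assumed simplification for illustration, not Intel's design; the decode cost and cache organization are invented parameters:

```python
# Toy model of the micro-operation cache idea: cache pre-decoded uops
# keyed by basic-block start address. A hit skips the variable-length
# decode entirely; a miss falls back to decoding, modeled here as a
# per-instruction energy cost. Parameters are illustrative only.

DECODE_COST = 3               # arbitrary energy units per instruction

class UopCache:
    def __init__(self, capacity=4):
        self.capacity, self.cache, self.decode_work = capacity, {}, 0

    def fetch(self, block_addr, instructions):
        if block_addr in self.cache:
            return self.cache[block_addr]        # pre-decoded uops, no decode
        self.decode_work += DECODE_COST * len(instructions)
        uops = [f"uop({i})" for i in instructions]   # stand-in for decoding
        if len(self.cache) < self.capacity:
            self.cache[block_addr] = uops
        return uops

uc = UopCache()
loop_block = ["add", "cmp", "jne"]
for _ in range(100):                             # a hot loop body
    uc.fetch(0x400, loop_block)
print(uc.decode_work)                            # decoded once, then all hits
```

Even this crude model shows why a small UC with a moderate hit rate pays off: the hot loop's instructions are decoded once instead of a hundred times.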



Some Power Observations - Food for Thought

Ronny Ronen
Microprocessor Research Labs, Intel

In this short talk I will discuss very rough power and performance trade-offs
and possible high-level strategies to address them in various segments.
I will also discuss how microarchitecture can address power.



Clocked Timing Elements:
Master-Slave Latches and Flip-Flops for High-Performance and Low-Power Systems

Vojin G. Oklobdzija
ACSEL Director, ECE Dept.,
University of California Davis

Clocked storage elements are among the most analyzed and debated circuit structures in modern microprocessors.
Their importance lies in the fact that they provide a boundary between the ever-shrinking pipeline stages. The demand for
high performance mandates a detailed understanding of timing issues and the intricate inner workings of timing elements.

The techniques known as "time borrowing", "slack passing", or "cycle stealing" are based on the fact that extra time
needed in one cycle can be traded against the time allowed for the next. These techniques are increasingly used, and they are
intimately related to the inner workings of timing elements. This talk will present a number of clocked storage elements used
in modern microprocessors and discuss the timing issues and design guidelines. We also discuss the rules for consistent
estimation of the real performance and power features of Flip-Flop and Master-Slave latch structures. A new simulation
and optimization approach is presented, targeting both high-performance and power-budget issues. The analysis approach
reveals the sources of performance and power-consumption bottlenecks in different design styles. Certain misleading
parameters have been properly modified and weighted to reflect the real properties of the compared structures. Furthermore,
the results of a comparison of representative Master-Slave latches and Flip-Flops illustrate the advantages of the presented
approach and the suitability of different design styles for high-performance and low-power applications.



Psi - Pervasive Services Infrastructure

Dejan S. Milojicic
HP Labs, Palo Alto

Future systems have been characterized as ubiquitous, pervasive, and
invisible. They will consist of devices that are diverse in size,
performance, and power consumption. Some of these devices will be
mobile, posing additional requirements on system software and
applications. With these changes, our focus will need to move from
technology to deployment and ease of use of services. Consequently,
traditional paradigms for reasoning about, designing, and implementing
software systems and services will no longer be sufficient alone.

We believe that this future vision will need to rely on a three-
tier infrastructure consisting of back-end servers, infrastructure
servers, and front-end clients (mobile or static, handheld or
embedded). The critical question for future systems will be how
to deliver services on-demand from back-end servers to resource-
constrained clients. If we can handle the new requirements of
these systems, we can enable this computing infrastructure to
offer significantly more services to users in a more pervasive way.

This talk will present two main Psi subprojects: "services-on-demand"
and "adaptive offloaded services". It will outline the high level
architecture, present some preliminary performance measurements, and
compare the project to related work in academia and industry.



Thirty years of research on caches ...and it is not over!

Jean-Loup Baer
University of Washington

Caches were introduced over 30 years ago. They have evolved from a
single-level sectored cache, whose presence was invisible to the Instruction
Set Architecture (ISA), to a multi-level hierarchy of caches, of various
sizes and associativities, that are exposed to the ISA, and that are
accompanied by a variety of hardware and software assists.

Caches have been a great success for enhancing the performance of computer
systems and in this talk, we briefly review some of the progress made on
cache design and performance. However, in spite of the abundance of literature
on the subject, caches are not as efficient as they could be and they will
remain an active area of research as long as the challenge of the "memory wall"
is still present.  We will describe succinctly our current methodology for the design of
cache assists, a methodology that borrows from paradigms used in branch and
value prediction, and will show its application to enhancing the performance of
some features of on-chip and off-chip caches.



Post processing to correct MPEG artifacts

C. Miro-Sorolla
Laboratoires d'Electronique Philips
 

This presentation shows the work we are performing at LEP on the correction of MPEG artifacts in video sequences.
Expected applications are high-end television, set top boxes and DVD storage. We will present the MPEG artifacts
we deal with by means of an example: blocking, ringing and mosquito noise. MPEG artifacts are the consequence of
a coarse quantization of the DCT coefficients in the MPEG encoding process. The MPEG blocking artifact correction
algorithm we have developed at LEP will be discussed, as well as the DCT-domain filtering that allows artifact removal in
textured and other regions with a minimum of degradation. Complexity estimates of a hardware implementation of
the algorithm that meets real time constraints have been performed. Results obtained on silicon area and Input/Output
bandwidth will be presented. To apply this algorithm, the size of the MPEG-grid in the decoded video sequence must
be known. Studies about a universal grid detection algorithm are also being performed in our team. Finally, subjective
tests were necessary for the evaluation of the algorithm as none of the available objective quality measures is able
to deal with post-processed images correctly. We are also investigating the development of an objective quality
assessment method.



DB2 Query Rewrite and Optimizer overview

Calisto Zuzarte
Centre for Advanced Studies,IBM Toronto Lab

In this talk we trace the path of an SQL statement through the compiler
and discuss in a little more detail the Query Rewrite and Optimizer
components.



Some Results on Trace Caches

Hans Vandierendonk

Wide-issue superscalar processors need to fetch multiple basic blocks
each cycle. The trace cache solves this problem by grouping multiple
basic blocks into one consecutive trace and storing this trace in a
cache. Because of the way traces are constructed, the same instruction
can be present in more than one trace, causing redundancy between
traces. Redundancy potentially harms performance, since more traces are
needed to represent the program.  In this talk, we present some results
on the forms of redundancy present in different trace cache organisations.
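The redundancy at issue can be illustrated with a deliberately tiny model (the paths and block names below are invented for illustration):

```python
# Small sketch of how trace construction duplicates instructions:
# traces are built from dynamically consecutive basic blocks, so a block
# reached from several different predecessors ends up stored in several
# traces. Paths and block names here are invented for illustration.
from collections import Counter

# Two dynamic paths through the same code: A->C and B->C.
# Each path is captured as its own trace, so block C is stored twice.
paths = [("A", "C"), ("B", "C")]
trace_cache = set(paths)

copies = Counter(block for trace in trace_cache for block in trace)
redundant = {b: n for b, n in copies.items() if n > 1}
print(redundant)                      # block C appears in two traces
```

Scaled up, such duplication means more distinct traces are needed to cover the program, which is why redundancy can hurt the effective capacity of a trace cache.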



Dependence Based Value Prediction

Yiannakis Sazeides

In this talk we will introduce dependence-based value prediction:
prediction of values produced by program instructions based on
dependence information. A number of dependence-based value prediction
methods will be presented:

  1. prediction using a combination of dependence information from
     outstanding instructions and the latest known architected state,
  2. prediction based on values of recurring predecessors,
  3. prediction using dependence distance from earliest predecessors.

The talk will mainly focus on the first method.

Dependence information propagated through register and memory
dependences of instructions selects the history used for obtaining a
prediction. The history is (a) architected register state, and (b)
architected memory state from a small table, memory history table,
that maintains recently stored values. Examples of dependence
information propagated are (a) register names and
memory-history-table-indices of latest predecessors with known state,
and (b) information about the outstanding instructions on the
dependence path reaching the predicted instruction, such as optypes,
pcs, immediates, etc. The predictor employs memory dependence
prediction for relaying the information used to predict memory
dependent instructions.

It is shown experimentally that:
(a) when considering only register dependences, an average prediction
accuracy of 75% is achieved with a 64K-entry direct-mapped prediction table,
(b) the use of the memory history table and memory dependence prediction
increases average prediction accuracy to 82% for a 64K-entry
direct-mapped table, with several benchmarks having prediction
accuracy over 90%, and
(c) the proposed predictor with a 64K-entry direct-mapped table can
achieve similar or higher accuracy as compared to an unbounded
context-based predictor.
 



Speculative Shared-Memory Architectures

Jose F. Martinez
University of Illinois at Urbana-Champaign

Speculative shared-memory operations enable aggressive parallelisation
in situations that prove challenging to compilers and developers.
Applications execute optimistically in parallel, and the system
continuously monitors for memory consistency violations, squashing
and restarting offending threads as needed.

We present solutions that make use of speculative shared-memory
operations to improve the performance of sequential and parallel codes.
We describe ways to successfully extract parallelism from loops that
cannot be handled by current parallelising compilers. We also
propose mechanisms to achieve higher concurrency in highly synchronised
parallel codes, in particular lock-based critical sections.
Our solutions require relatively simple hardware, and are efficiently
integrated in a cache-coherent NUMA system.



New Methods for Exploiting Program Structure and Behavior in Computer Architecture

Gurindar S. Sohi
University of Wisconsin-Madison

Processor and system performance have grown at a phenomenal
rate (60% per year) for many years; a sizeable portion
of this improvement has come from architectural and
microarchitectural techniques used to make
productive use of the available semiconductor resources.
Many of the techniques used by architects (e.g., caches
and branch predictors) exploit program behavior -- the
observed empirical characteristics of program execution.

In the next decade, advances in semiconductor technology
will provide us with lots of transistors with which to
build processing engines.  The job of a computer architect
will be to make productive use of these transistors:
carry out processing functions in more powerful
ways than in previous generations.
To do so, it is likely that program behavior would
need to be understood/captured/exploited in heretofore unknown ways.
While current hardware techniques reason about program behavior
by observing events, future hardware techniques are likely
to reason about program behavior by learning about the program
structure (relationships between program instructions) that
causes the observed behavior, and exploit these relationships.

In this talk, we will look at recently-proposed hardware
techniques to solve several problems that arise in the design
of computing systems.  These novel techniques exploit some
knowledge about the dependence relationships amongst the
instructions of a program.  We will see how program
structure-based techniques can be applied to the problems of
scheduling out-of-order memory operations, streamlining
communication through memory, managing memory hierarchies,
prefetching linked structures, and optimizing communication in
shared memory multiprocessors.

Having seen the benefits of having access to relevant program
structure information, an obvious question is how such information
should be gathered, made available to the
execution hardware, and maintained.  There are several possibilities,
ranging from purely hardware solutions to solutions that
make use of compile-time information.  This issue has
several open research problems, some
of which we are investigating at Wisconsin.



Speculative Multithread Processors

Gurindar S. Sohi
University of Wisconsin-Madison

Architects of future generation processors will have hundreds of
millions of transistors with which to build computing chips.
At the same time, it is becoming clear that naive scaling of
conventional (superscalar) designs will increase complexity
and cost while not meeting performance goals.
Consequently, many computer architects are advocating a shift in
focus from high-performance to high-throughput with a corresponding
shift to multithreaded architectures.
Multithreaded architectures provide new opportunities for
extracting parallelism from a single program via thread level
speculation.  We expect to see two major forms of thread-level
speculation: control-driven and data-driven.
We believe that future processors will not only be multithreaded,
but will also support thread-level speculation, giving them
the flexibility to operate in either multiple-program/high-throughput
or single-program/high-performance capacities.  Deployment of
such processors will require innovations in means to convey
multithreading information from software to hardware,
algorithms for thread selection and management, as well as
hardware structures to support the simultaneous execution of
collections of speculative and non-speculative threads.
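
The control-driven form of thread-level speculation can be caricatured in a few
lines. This is a sketch only: the loop body, the read-set/write-buffer
bookkeeping, and the in-order commit check are illustrative assumptions, not
hardware described in the abstract:

```python
# Toy sketch of control-driven thread-level speculation: loop
# iterations execute speculatively against the pre-loop state,
# buffering their writes; at in-order commit, an iteration that read
# a location written by an earlier iteration is squashed and
# re-executed with up-to-date values.

def body(i, state):
    """One loop iteration: returns (read set, write buffer)."""
    if i == 3:                       # the only cross-iteration dependence
        return ({2, i}, {i: state[2] + state[i]})
    return ({i}, {i: state[i] + 1})

def run_loop(data, n_iters):
    state = list(data)               # architectural state
    snapshot = list(data)            # what speculative threads see
    spec = [body(i, snapshot) for i in range(n_iters)]  # "parallel" pass
    written, squashes = set(), 0
    for i in range(n_iters):         # commit in program order
        reads, writes = spec[i]
        if reads & written:          # dependence violation detected
            squashes += 1
            reads, writes = body(i, state)  # re-execute non-speculatively
        for loc, val in writes.items():
            state[loc] = val
        written |= writes.keys()
    return state, squashes

print(run_loop([10, 20, 30, 40, 50, 60], 6))
# Sequential semantics are preserved; only the one dependent
# iteration is squashed, the rest commit speculatively.
```

The payoff is exactly the one the abstract claims for speculative
multithreading: mostly-independent iterations run concurrently without the
compiler having to prove independence, at the cost of squash-and-replay
hardware for the rare violation.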



Commodity Computing

Miron Livny
University of Wisconsin-Madison

The recent dramatic decrease in the cost-performance ratio of processing,
storage and communication hardware has turned computing into a commodity.
Computers and disks are considered "supplies" and are purchased under the
same budget category as pencils and erasers. As a result of this trend, we can
find today powerful computing capabilities resting on office desks, piled on
laboratory shelves or mounted on racks in machine rooms. These abundant
computing and storage resources are managed by off-the-shelf software and
are interconnected by high-speed networks. Individuals and small groups own
these resources and exercise full control over their usage. Researchers and
engineers in academia, research laboratories and industry are looking for
frameworks and software tools that will enable them to harness this power.

In the talk we will discuss the challenges we face in transforming "communities"
of loosely coupled and distributively owned commodity hardware and software
into effective computing environments. We will present what we believe to be
the key mechanisms required to turn such communities into dependable systems
capable of delivering large amounts of computing cycles over very long time periods.
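
One key mechanism behind systems of this kind is matchmaking between resource
requests and resource offers, which Condor realizes with its ClassAd language.
The sketch below is a deliberate caricature under assumed names and attributes,
not the real ClassAd mechanism:

```python
# Caricature of opportunistic matchmaking: machines advertise their
# attributes and current availability, jobs advertise requirements,
# and a matchmaker pairs each job with an idle machine that
# satisfies it.  (Illustrative only; not Condor's ClassAd language.)

machines = [
    {"name": "desk1", "mem_mb": 512,  "idle": True},
    {"name": "lab7",  "mem_mb": 2048, "idle": True},
    {"name": "rack3", "mem_mb": 4096, "idle": False},  # owner is using it
]

jobs = [
    {"id": 1, "req": lambda m: m["mem_mb"] >= 1024},
    {"id": 2, "req": lambda m: m["mem_mb"] >= 256},
]

def matchmake(jobs, machines):
    placements = {}
    for job in jobs:
        for m in machines:
            if m["idle"] and job["req"](m):
                placements[job["id"]] = m["name"]
                m["idle"] = False     # resource is claimed
                break
    return placements

placements = matchmake(jobs, machines)
print(placements)  # {1: 'lab7', 2: 'desk1'}
```

The essential property, preserved even in this sketch, is that owners retain
control: a machine whose owner is active is simply not matched, which is what
makes harnessing idle desktop cycles socially acceptable.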

The talk is based on our decade long experience with the Condor high throughput
computing system, close interaction with a wide range of domain scientists and our
recent involvement in national efforts to develop and build computational and data Grids.



Scalable Parallel Systems at IBM Research

Marc Snir
IBM T. J. Watson Research Center

The talk will review 10 years of research on scalable parallel systems at
IBM Research, both from the perspective of technical content, and from the
perspective of the interplay between research and development in an
industrial environment.



Software Performance Engineering for Web Applications

Marin Litoiu
Centre for Advanced Studies, IBM Toronto Laboratory

Web-based applications are distributed across many tiers and involve a large
palette of technologies.  Such applications have millions of registered users and
tens of thousands of concurrent users that interact with the system through a
set of classes of requests. Furthermore, the relative load of these classes can
shift throughout the day, causing changes to system behavior and bottlenecks.

This talk will introduce the major challenges a performance engineer faces when
dealing with web-based applications: performance modeling and evaluation,
capacity planning, problem determination and stress testing. Several
quantitative performance-engineering techniques will also be discussed.
The techniques presented consider all workload conditions, are iterative in
nature, and are hybrid mathematical-programming and analytic performance
evaluation methods.
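
The shifting-bottleneck phenomenon mentioned above can be shown with the
simplest analytic building block, the utilization law. The two request classes,
two tiers, and all the service-demand numbers below are invented for
illustration; they are not results from the talk:

```python
# Minimal analytic capacity model: utilization of a tier is the sum
# over request classes of (arrival rate * service demand).  The tier
# with the highest utilization is the bottleneck, and it can move
# when the class mix shifts during the day.

# assumed service demand, seconds per request, per (class, tier)
demand = {
    ("browse", "web"): 0.004, ("browse", "db"): 0.001,
    ("buy",    "web"): 0.002, ("buy",    "db"): 0.010,
}

def utilizations(rates):
    """rates: requests/sec per class -> utilization per tier."""
    tiers = {"web": 0.0, "db": 0.0}
    for (cls, tier), d in demand.items():
        tiers[tier] += rates.get(cls, 0.0) * d
    return tiers

morning = utilizations({"browse": 200, "buy": 10})  # browse-heavy mix
evening = utilizations({"browse": 50, "buy": 60})   # buy-heavy mix
print(morning)  # web tier is the bottleneck
print(evening)  # the bottleneck shifts to the database tier
```

Even this back-of-the-envelope model makes the abstract's point concrete: total
request volume can fall while the database tier still saturates, which is why
per-class workload characterization matters for capacity planning.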



That's all folks!!!!!