Evelyn Duesterwald
Hewlett-Packard Laboratories
This talk describes the design and implementation
of DYNAMO, a prototype
dynamic optimizer that is capable of improving
the performance of a
native program binary at runtime. DYNAMO
identifies and extracts the
dynamically hot code regions in the executing
program binary and
achieves a performance boost by optimizing
and laying out these
regions in a software code cache. Initially,
DYNAMO observes the program
behavior through interpretation and uses
a very low overhead online
profiling scheme to identify the dynamically
hot program paths. Copies of these
hot paths are optimized using light-weight
optimization techniques and
emitted into the software code cache.
Subsequent executions of these paths cause
the cached version to be executed. Over
time, the program's working set
materializes inside the software code
cache, and execution increasingly takes
place in the optimized cached version
of the input binary.
DYNAMO is written entirely in user level
software, and runs on a
PA-RISC machine under the HP-UX operating
system.
Our experimental evaluation of DYNAMO
demonstrates that it is possible
to use a piece of software to improve
the performance of a native,
statically optimized program binary, while
it is executing.
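The hot-path mechanism described above can be sketched in miniature (the threshold, structures, and names are illustrative, not DYNAMO's actual design):

```python
# Toy model of hot-path selection in a dynamic optimizer: interpret and
# count candidate trace heads; once a head gets hot, record the path taken
# and "emit" it into a software code cache for subsequent executions.

HOT_THRESHOLD = 3          # illustrative; a real threshold is carefully tuned

class DynamicOptimizer:
    def __init__(self):
        self.counters = {}    # trace-head address -> execution count
        self.code_cache = {}  # trace-head address -> recorded hot path

    def execute(self, head, path):
        """head: address starting a path; path: basic blocks executed."""
        if head in self.code_cache:
            return "cached"                 # run the optimized cached trace
        self.counters[head] = self.counters.get(head, 0) + 1
        if self.counters[head] >= HOT_THRESHOLD:
            # Path is hot: optimize a copy and place it in the code cache.
            self.code_cache[head] = list(path)
        return "interpreted"

opt = DynamicOptimizer()
loop = ["load", "add", "branch"]
modes = [opt.execute(0x1000, loop) for _ in range(5)]
# early executions are interpreted; once hot, execution hits the code cache
```

Over many such paths, the program's working set accumulates in the cache, which is the effect the abstract describes.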
Walid Najjar
University of California Riverside
This talk presents a high-level language
for expressing image processing
algorithms, and an optimizing compiler
that targets FPGA-based
reconfigurable computing systems. The
main features of this language,
called SA-C, are: 1) support for the
simple expression of image processing
algorithms and applications, and 2)
efficient compilation to FPGAs.
SA-C is a subset of and an extension
to C, based on C syntax and
semantics. The optimizing compiler
and semantics. The optimizing compiler
translates SA-C programs into
non-recursive data flow graphs, which
in turn are translated into VHDL
using commercial synthesis tools. Performance
numbers are reported for
some well-known image processing routines,
written in SA-C and
automatically compiled to an Annapolis
Microsystems reconfigurable
system.
Barton P. Miller
University of Wisconsin, Computer Sciences
Department
Operating systems used to be viewed as
static entities, changing almost as
slowly as the underlying hardware.
Recent commercial systems have provided
for a small amount of run-time change
by allowing device drivers to be
installed while the system is running.
We are developing "adaptive operating
systems" whose code can change and evolve
while the system is running. This
adaptation can be used to instrument the
code for profiling or debugging
purposes, or to modify and extend the
operating system to adapt to changing
workloads, application demands, and configurations.
Our work differs from other efforts in
this area in two ways. First, we
can instrument and modify a stock, commercial
operating system (Solaris) as
it was delivered to the customer.
We operate directly on the executable code
while it is running. Second, we
can modify the operating system at almost
any point in its code. We are not
constrained to system call or procedure
call replacements.
Our research is embodied in a facility
called KernInst.
We will describe
the basic KernInst mechanism and several
uses of KernInst, including
performance profiling and dynamically
modifying and customizing the operating
system in response to its workload.
We will also present a case study
using KernInst to profile the Solaris
kernel under a Web proxy server (Squid)
workload.
RWC Omni project: a portable OpenMP compiler and cluster-enabled OpenMP
Mitsuhisa Sato
Real World Computing Partnership, Japan
OpenMP is an emerging standard API for
parallel programming on shared
memory multi-processors. We are
developing a portable OpenMP
compiler, the Omni OpenMP compiler, for SMPs
and clusters of SMPs. The
objectives of the Omni OpenMP compiler project
include:
- Portable implementation of OpenMP
for SMPs
- Cluster-enabled OpenMP, which
supports seamless programming from SMPs
to SMP/PC clusters in OpenMP.
The Omni OpenMP compiler translates C and Fortran77
programs with OpenMP
pragmas into C code suitable for compiling
with a native compiler and
linking with the runtime library. The compiler
was ported to several
platforms including Sun Solaris, SGI IRIX,
Linux (x86 and Alpha). The
recent release 1.2 includes a StackThreads
implementation for nested
irregular parallelism, and a cluster-enabled
OpenMP implementation using
SCASH, a page-based software distributed
shared memory system, on a
cluster of PCs.
In this talk, we will present the implementation
and performance
evaluation of our compiler, and our approach
to cluster-enabled
OpenMP.
This talk presents an extension of Dimemas
to enable accurate performance
prediction of message passing applications
with collective communication
primitives. The main contribution is a
simple model for collective
communication operations that can be user-parameterized.
The experiments
performed with a set of MPI benchmarks
demonstrate the utility of the model.
Smail NIAR
University of Valenciennes, France
The objective of this study is to carry out
performance measurements of the new
SPEC CPU2000 benchmark programs
on a superscalar processor simulator.
The results of these experiments
will allow us:
- To show which architectural parameters have the greatest
influence on the performance of the SPEC CPU2000.
- To spot the capabilities and limitations of modern superscalar
micro-architectures, and where the "hot spots" are.
- And finally, to help designers in proposing new, powerful
micro-architectures.
Juan Jose Noguera
DAC, UPC
The continued progress of semiconductor
technology has enabled the appearance of new
reconfigurable logic devices and architectures,
making possible the development of a new
research field, Reconfigurable
Computing, which offers an intermediate
solution between general-purpose
systems (microprocessors) and application-specific
systems (ASICs).
The aim of the talk is to introduce the
basic concepts of reconfigurable
computing. Specifically, dynamically reconfigurable
devices and architectures,
which are run-time reconfigurable, are
presented.
Some examples of how these reconfigurable
devices and architectures
are coupled with standard microprocessors
are explained, as is how
reconfigurable computing can, in certain
applications, help to increase the
performance of standard microprocessors.
Miguel Valero
DAC, UPC
There is no doubt that the use of student surveys
as an instrument for evaluating teaching
is a controversial topic. It was controversial when
it began to be used at UPC, and it remains so
now that it has been decided to run surveys
less frequently. It is controversial here, and also
at other universities, where the system
could not be established due to opposition
from the faculty.
Given so much controversy, the question
that gives this presentation its title seems
relevant. The talk will argue that the
student survey can be an essential instrument
for improving teaching. For that, we only
need a genuine will to improve and a little
spirit of collaboration among ourselves.
We will see how an improvement process
can be organized around the survey
(certainly, the current process does not work),
and a concrete proposal for action
in this area will be made.
Fernando Martinez
MAII, UPC
The aim of the talk is to present the
cryptographic algorithms and protocols
most widely used in applications
that employ cryptography.
Secret-key encryption algorithms (DES and AES)
and public-key algorithms (RSA) will be presented.
While the former are faster and use shorter keys,
the latter have the advantage that no prior
key agreement is needed to start a
secure communication. Moreover,
public-key cryptography makes it possible
to define the concept of the digital signature,
fundamental for the development of
electronic commerce.
The cryptographic algorithms presented
are considered computationally secure,
that is, with current knowledge and resources
it is not possible to break them.
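As a toy illustration of the public-key side of the talk, here is textbook RSA with deliberately tiny primes (real deployments use keys of hundreds of digits plus padding schemes; every number below is for demonstration only):

```python
# Toy RSA key generation, encryption and decryption with tiny primes.
# Illustration only: real RSA uses very large primes and proper padding.

def egcd(a, b):
    # Extended Euclid: returns (g, x, y) with a*x + b*y == g == gcd(a, b).
    if b == 0:
        return a, 1, 0
    g, x, y = egcd(b, a % b)
    return g, y, x - (a // b) * y

def make_keys(p, q, e=17):
    n = p * q
    phi = (p - 1) * (q - 1)
    g, d, _ = egcd(e, phi)
    assert g == 1, "e must be coprime with phi(n)"
    return (e, n), (d % phi, n)   # (public key, private key)

public, private = make_keys(61, 53)         # n = 3233, a classic toy example
m = 42                                      # message, must be < n
c = pow(m, public[0], public[1])            # encrypt: c = m^e mod n
assert pow(c, private[0], private[1]) == m  # decrypt: m = c^d mod n
```

Note how encryption needs only the public pair (e, n), which is exactly why no prior key agreement is required.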
Manel Fernandez
DAC, UPC
Traditional optimizing compilers are limited
in the scope of their
optimizations by the fact that only a
single function, or possibly a
single module, is available for analysis
and optimization. In
particular, this means that library routines
cannot be optimized to
specific calling contexts. A possible
solution is to carry out
whole-program optimization at link time.
The goal of the alto project is
to develop whole-program dataflow analyses
and code optimization
techniques for link time program optimization;
the current system is
targeted to the DEC Alpha architecture,
and produces code that is
typically considerably faster than that
produced using DEC's Om
link-time optimizer (with or without profile-guided
inter-file
optimization carried out by the compiler).
Doug Carmean
Intel
The Pentium 4 Processor is an all-new processor from Intel. When the
from Intel. When the
processor is introduced in November, it
will deliver the highest performance
of any desktop processor in the world.
This talk will walk through some of
the details of the new microarchitecture,
and give some insight into the
philosophy that was used to design the
processor. The details include a
tour through the branch misprediction
pipeline and data speculation.
Bart Goeman
Ghent University
Value prediction is a relatively new technique
to increase the
Instruction Level Parallelism (ILP) in
future microprocessors.
An important problem when designing a
value predictor is efficiency: an
accurate predictor requires huge prediction
tables. This is especially the
case for the finite context method (FCM)
predictor, the most accurate one.
In this presentation, we show that the
prediction accuracy of the FCM can be
greatly improved by making the FCM predict
strides instead of values.
This new predictor is called the differential
finite context method
(DFCM) predictor. The DFCM predictor outperforms
a similar FCM predictor
by as much as 6% to 17%, depending on
the prediction table size.
We use several metrics to show that the
key to this success is
reduced aliasing in the level-2 table
between stride and non-stride
patterns. We also show that the DFCM is
superior to hybrid predictors
based on FCM and stride predictors, since
its prediction accuracy is
as high as that of a hybrid one using
a perfect meta-predictor.
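The DFCM idea can be illustrated roughly as follows (table sizes, hashing, and structure are simplifications for illustration, not the presented design):

```python
# Minimal differential finite context method (DFCM) sketch.
# Level 1: per-PC last value plus a short history of recent strides.
# Level 2: stride-history hash -> predicted next stride.
# Prediction = last observed value + predicted stride.

class DFCM:
    def __init__(self, order=2, l2_size=256):
        self.order, self.l2_size = order, l2_size
        self.l1 = {}   # pc -> (last_value, [recent strides])
        self.l2 = {}   # hashed stride history -> next stride

    def _index(self, hist):
        return hash(tuple(hist)) % self.l2_size

    def predict(self, pc):
        last, hist = self.l1.get(pc, (0, []))
        stride = self.l2.get(self._index(hist), 0)
        return last + stride

    def update(self, pc, value):
        last, hist = self.l1.get(pc, (value, []))
        stride = value - last
        self.l2[self._index(hist)] = stride      # learn: context -> stride
        hist = (hist + [stride])[-self.order:]   # keep the newest strides
        self.l1[pc] = (value, hist)

p = DFCM()
for v in [10, 13, 16, 19, 22]:   # a constant stride-3 sequence
    guess = p.predict(0x400)
    p.update(0x400, v)
# after warm-up the predictor returns last value + learned stride
```

Because the level-2 table stores strides rather than full values, stride and non-stride patterns alias less, which is the effect the talk quantifies.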
Julita Corbalan
DAC, UPC
This work is focused on processor allocation
in shared-memory multiprocessor
systems, where no knowledge of the application
is available when applications are
submitted. We perform the processor allocation
taking into account the characteristics
of the application measured at run-time.
We want to demonstrate the importance
of an accurate performance analysis and
the criteria used to distribute the
processors. With this aim, we present
the SelfAnalyzer, an approach to dynamically
analyze the performance of applications
(speedup and execution time), and the
Performance-Driven Processor Allocation
(PDPA), a new scheduling policy which
distributes processors considering both
the global conditions of the system and
the particular characteristics of running
applications. This work also defends
the importance of the interaction between
the medium-term and the long-term
scheduler to control the multiprogramming
level in the case of the clairvoyant
scheduling policies.
We have implemented our proposal on an SGI
Origin2000 with 64 processors
and we have compared its performance with
that of some scheduling policies
proposed so far and with the native IRIX
scheduling policy. Results show that
the combination of the SelfAnalyzer+PDPA
with the medium/long-term scheduling
interaction outperforms the rest of the
scheduling policies evaluated. The evaluation
shows that in workloads where a simple
equipartition performs well, the PDPA
also performs well, and in extreme workloads
where all the applications
perform badly, our proposal can achieve
a speedup of 3.9 with respect to an
equipartition and 11.8 with respect to
the native IRIX scheduling policy.
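The performance-driven allocation idea might be sketched like this (the thresholds, policy details, and job data are invented for illustration and are not PDPA itself):

```python
# Illustrative sketch of performance-driven processor allocation: each
# application reports its measured speedup; processors are taken from
# poorly scaling jobs and handed to well-scaling ones, instead of a
# blind equipartition. All thresholds below are made up for the example.

def allocate(total_procs, jobs, low=0.5, high=0.8):
    """jobs: {name: (current_procs, measured_speedup)} -> new allocation."""
    alloc = {}
    freed = total_procs
    for name, (procs, speedup) in jobs.items():
        eff = speedup / procs          # measured efficiency at run-time
        if eff < low:                  # scales badly: shrink its partition
            alloc[name] = max(1, procs // 2)
        else:
            alloc[name] = procs
        freed -= alloc[name]
    # hand the spare processors to jobs with high measured efficiency
    for name, (procs, speedup) in sorted(
            jobs.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True):
        if freed <= 0:
            break
        if speedup / procs >= high:
            extra = min(freed, procs)  # at most double a good scaler
            alloc[name] += extra
            freed -= extra
    return alloc

jobs = {"fft": (8, 7.2), "sort": (8, 2.0)}   # fft scales well, sort poorly
print(allocate(16, jobs))                    # sort shrinks, fft grows
```

The point of the example is only the shape of the decision: allocation follows measured application behavior rather than a fixed share.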
Marc Gonzalez
CEPBA
We present a set of proposals for the
OpenMP shared-memory
programming model oriented towards the
definition of thread groups
in the framework of nested parallelism.
The presentation also describes
the additional functionalities required
in the runtime library
supporting the parallel execution. The
extensions have been
implemented in the OpenMP NanosCompiler.
The experimental results
show the usefulness of the proposed extensions
in a set of real
applications (two SPEC95FP benchmarks
and a generic multi-block
code).
Dr. David Baker
Vice President of Product Development, BOPS
Inc.
This talk addresses the challenges of designing high performance DSPs
for
embedded applications. The challenges in embedded processor design
are very
different from the challenges of general-purpose processor design which
has
received so much attention. Traditional benchmarks like SPEC do not
represent workloads that dominate the embedded marketplace. Benchmarks
like
EEMBC better characterize signal processing and media-centric applications
within consumer appliances. Many embedded applications demand ultra
high
performance on problems that are embarrassingly parallel. This performance
is not achieved by simply increasing clock frequency, especially in
environments where energy management is key to success. This talk will
describe architectural innovation that provides new levels of performance
at
decreased levels of power consumption in embedded environments.
Oscar-Ivan Lepe-Aldama
DAC, UPC
The continued growth of the Internet is imposing enormous
workloads on the nodes in charge of sustaining it: IP
routers. Basically, these nodes, installed both in small
offices and at the intersections of the backbones at the heart of the
Internet, are in charge of finding routes and forwarding along them
the packets that cross the planet using this network of networks.
With the growth of the Internet, not only has the number of
packets per second that these systems have to process grown, but
so have the amount of work per packet and the need
to provide different levels of service. The IP routers
currently in operation are "losing this battle".
The aim of this talk is to discuss the influence that its
architecture, both hardware and software (that is, the operating
system), has on the performance of IP routers. In doing so, the goal
is to identify possible sources of poor performance and to sketch
possible solutions. The talk is based on an evaluation, both
qualitative and quantitative, of the performance of a particular kind
of IP router: a router built from PC hardware and software.
Studying this kind of router is important because of the telematic
potential of its technology. Moreover, a large number of
routers currently in operation have an architecture (hardware and
software) with more similarities than differences with respect to that
of a PC.
Kazuki Joe
Department of Information and Computer Sciences Nara Women's University,
JAPAN
In this talk, two HPC research topics are introduced.
Jim Smith
University of Wisconsin-Madison
A Virtual Machine (VM) provides a program execution environment which,
from the perspective of a program, is identical to the one provided
by a conventional machine environment. However, the real underlying
machine may be quite different from the VM;
a layer of software performs any required conversions and translations
from virtual machine to real machine. A concealed VM software
layer adds
considerable flexibility by permitting cross-platform software portability.
Furthermore, if VM software is made available to the hardware designer,
it enables implementation-dependent optimizations for improved performance,
power saving, and fault tolerance. There are a number of virtual
machine
architectures, implementations and applications. This talk is
essentially a "mini-tutorial" that will survey implemented (and
proposed virtual machines). Emphasis will be placed on the
important underlying technologies, and a VM taxonomy will be presented.
Marc Steinhaus
University of Karlsruhe
This talk proposes a chip space and transistor count model of
microprocessors at register-transfer level. The estimation tool,
which is based on the model, receives its inputs from the baseline
architecture and the configuration file of a microarchitecture
simulator, such as the SimpleScalar toolset or the Karlsruhe SMT
multimedia simulator. The estimation tool gives a pre-silicon
complexity estimation and allows different microprocessor
configurations to be compared with respect to their anticipated
hardware complexity. The estimation method is validated using
configuration parameters of a real processor, yielding a transistor
count and a chip space estimation that are very close to the real
processor's numbers.
Ronny Ronen
Microprocessor Research Labs, Intel
Novel instruction fetch structures improve processor performance by
providing higher fetch bandwidth, but at the cost of increased complexity
and higher power consumption. In this talk we introduce a simpler
mechanism, the Micro-Operation Cache (Uop Cache, UC), aimed at reducing
power.
The UC is designed to provide competitive frontend performance relative
to
existing instruction fetch mechanisms, based on conventional instruction
caches, while significantly decreasing power and energy consumption.
The UC
caches basic blocks of instructions; instructions are pre-decoded into
micro-operations (uops). The UC can fetch a single basic-block of uops
per
cycle. Fetching complete pre-decoded basic-blocks eliminates the need
to
repeatedly decode variable length instructions and simplifies the process
of
predicting, fetching, rotating and aligning fetched instructions. This
elimination of power-hungry operations is the main source of the UC's
power savings.
The UC coexists with the Instruction Cache (IC) and is designed to
work in
concert with it so as not to lose performance on a UC miss. The design
enables even a small UC, which features only a moderate hit rate, to
be
quite effective. We propose several optional UC related mechanisms
to
improve performance and reduce power.
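A rough software model of the UC/IC interplay (the structure and names are illustrative, not the proposed hardware design):

```python
# Toy front-end model with a micro-op cache (UC) in front of the decoder.
# On a UC hit, pre-decoded uops are delivered directly; on a miss the
# block is fetched from the instruction cache, decoded (the "expensive"
# step), and the decoded uops are installed in the UC.

from collections import OrderedDict

def decode(block):
    # Stand-in for variable-length instruction decoding into uops.
    return [f"uop({insn})" for insn in block]

class UopCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # block start addr -> decoded uops
        self.decodes = 0               # how often the decoder had to run

    def fetch(self, addr, icache):
        if addr in self.entries:       # UC hit: no decoding needed
            self.entries.move_to_end(addr)
            return self.entries[addr]
        uops = decode(icache[addr])    # UC miss: fall back to IC + decode
        self.decodes += 1
        self.entries[addr] = uops
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict LRU basic block
        return uops

icache = {0x100: ["add", "cmp", "jne"], 0x200: ["mov", "ret"]}
uc = UopCache()
for addr in [0x100, 0x200, 0x100, 0x100, 0x200]:  # a loop's fetch stream
    uc.fetch(addr, icache)
# only the first fetch of each basic block paid the decode cost
```

Even this toy shows why a small UC with a moderate hit rate helps: in loopy fetch streams, nearly all fetches skip the decoder.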
Ronny Ronen
Microprocessor Research Labs, Intel
In this short talk I will discuss very rough power and performance
trade-offs
and possible high-level strategies to address them in various segments.
I will also discuss how microarchitecture can address power.
Vojin G. Oklobdzija
ACSEL Director, ECE
Dept.,
University of California Davis
Clocked storage elements are among the most analyzed and debated
circuit structures in modern microprocessors.
Their importance lies in the fact that they provide a boundary between
the ever-shrinking pipeline stages. The demand for
high-performance mandates detailed understanding of timing issues and
the intricate inner working of timing elements.
The techniques known as: "time borrowing", "slack passing" or "cycle
stealing" are based on the fact that the extra time
needed could be traded with the time allowed for the next cycle. Those
techniques are increasingly used and they are
intimately related to the inner workings of timing elements. This talk
will present a number of clocked storage elements used
in modern microprocessors and discuss the timing issues and design
guidelines. We also discuss the rules for consistent
estimation of the real performance and power features of the Flip-Flop
and Master-Slave latch structures. A new simulation
and optimization approach is presented, targeting both high-performance
and power budget issues. The analysis approach
reveals the sources of performance and power consumption bottlenecks
in different design styles. Certain misleading
parameters have been properly modified and weighted to reflect the
real properties of the compared structures. Furthermore,
the results of the comparison of representative Master-Slave latches
and Flip-Flops illustrate the advantages of the presented
approach and the suitability of different design styles for high-performance
and low-power applications.
Dejan S. Milojicic
HP Labs, Palo Alto
Future systems have been characterized as ubiquitous, pervasive, and
invisible. They will consist of devices that are diverse in size,
performance, and power consumption. Some of these devices will be
mobile, posing additional requirements to system software and
applications. With these changes, our focus will need to move from
technology to deployment and ease of use of services. Consequently,
traditional paradigms for reasoning about, designing, and implementing
software systems and services will no longer be sufficient alone.
We believe that this future vision will need to rely on a three-
tier infrastructure consisting of back-end servers, infrastructure
servers, and front-end clients (mobile or static, handheld or
embedded). The critical question for future systems will be how
to deliver services on-demand from back-end servers to resource-
constrained clients. If we can handle the new requirements of
these systems, we can enable this computing infrastructure to
offer significantly more services to users in a more pervasive way.
This talk will present two main Psi subprojects: "services-on-demand"
and "adaptive offloaded services". It will outline the high level
architecture, present some preliminary performance measurements, and
compare the project to related work in academia and industry.
Jean-Loup Baer
University of Washington
Caches were introduced over 30 years ago. They have evolved from a
single-level sectored cache, whose presence was invisible to the Instruction
Set Architecture (ISA), to a multi-level hierarchy of caches, of various
sizes and associativities, that are exposed to the ISA and that are
accompanied by a variety of hardware and software assists.
Caches have been a great success for enhancing the performance of computer
systems and in this talk, we briefly review some of the progress made
on
cache design and performance. However, in spite of the abundance of
literature
on the subject, caches are not as efficient as they could be and they
will
remain an active area of research as long as the challenge of the "memory
wall"
is still present. We will describe succinctly our current methodology
for the design of
cache assists, a methodology that borrows from paradigms used in branch
and
value prediction, and will show its application to enhancing the performance
of
some features of on-chip and off-chip caches.
C. Miro-Sorolla
Laboratoires d'Electronique Philips
This presentation shows the work we are performing at LEP on the correction
of MPEG artifacts in video sequences.
Expected applications are high-end television, set top boxes and DVD
storage. We will present the MPEG artifacts
we deal with by means of an example: blocking, ringing and mosquito
noise. MPEG artifacts are the consequence of
a coarse quantization of the DCT coefficients in the MPEG encoding
process. The MPEG blocking artifact correction
algorithm we have developed at LEP will be discussed as well as the
DCT domain filtering that allows removal in
textured and other regions with a minimum of degradation. Complexity
estimations of a hardware implementation of
the algorithm that meets real time constraints have been performed.
Results obtained on silicon area and Input/Output
bandwidth will be presented. To apply this algorithm, the size of the
MPEG-grid in the decoded video sequence must
be known. Studies about a universal grid detection algorithm are also
being performed in our team. Finally, subjective
tests were necessary for the evaluation of the algorithm as none of
the available objective quality measures is able
to deal with post-processed images correctly. We are also investigating
the development of an objective quality
assessment method.
Calisto Zuzarte
Centre for Advanced Studies, IBM Toronto Lab
In this talk we trace the path of an SQL statement through the
compiler
and discuss in a little more detail the Query Rewrite and Optimizer
components.
Hans Vandierendonk
Wide-issue superscalar processors need to fetch multiple basic blocks
each cycle. The trace cache solves this problem by grouping multiple
basic blocks into one consecutive trace and storing this trace in a
cache. Because of the way traces are constructed, the same instruction
can be present in more than one trace, causing redundancy between
traces. Redundancy potentially harms performance, since more traces
are
needed to represent the program. In this talk, we present some
results
on the forms of redundancy present in different trace cache organisations.
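As a toy illustration of the redundancy being measured (trace construction here is heavily simplified and is not the talk's methodology):

```python
# Toy trace construction from a dynamic basic-block stream, illustrating
# inter-trace redundancy: the same basic block captured in several traces.

def build_traces(block_stream, blocks_per_trace=3):
    # Group consecutive basic blocks into fixed-length traces, keyed by
    # the block sequence (a real trace cache also records branch
    # directions; omitted here for brevity).
    traces = {}
    for i in range(len(block_stream) - blocks_per_trace + 1):
        key = tuple(block_stream[i:i + blocks_per_trace])
        traces.setdefault(key, key)
    return list(traces.values())

def redundancy(traces):
    # Fraction of stored block slots holding a block that already
    # appears elsewhere in the trace storage.
    seen, dup, total = set(), 0, 0
    for trace in traces:
        for block in trace:
            total += 1
            if block in seen:
                dup += 1
            seen.add(block)
    return dup / total

stream = ["A", "B", "C", "A", "B", "D"]   # a loop body with two exits
traces = build_traces(stream)
print(f"{redundancy(traces):.2f} of stored blocks are redundant copies")
```

Even this tiny stream stores most blocks more than once, which is why redundancy matters: the same cache capacity represents less of the program.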
Yiannakis Sazeides
In this talk we will introduce dependence-based value prediction:
prediction of values produced by program instructions based on
dependence information. A number of dependence-based value prediction
methods will be presented:
Dependence information propagated through register and memory
dependences of instructions selects the history used for obtaining
a
prediction. The history is (a) architected register state, and (b)
architected memory state from a small table, memory history table,
that maintains recently stored values. Examples of dependence
information propagated are (a) register names and
memory-history-table-indices of latest predecessors with known state,
and (b) information about the outstanding instructions on the
dependence path reaching the predicted instruction such as optypes,
pcs, immediates etc. The predictor employs memory dependence
prediction for relaying the information used to predict memory
dependent instructions.
It is shown experimentally that:
(a) when considering only register dependences, an average prediction
accuracy of 75% is achieved with a 64K-entry direct-mapped prediction table,
(b) the use of the memory history table and memory dependence prediction
increases average prediction accuracy to 82% for a 64K-entry
direct-mapped table, with several benchmarks having prediction
accuracy over 90%, and
(c) the proposed predictor with a 64K-entry direct-mapped table can
achieve similar or higher accuracy compared to an unbounded
context-based predictor.
Jose F. Martinez
University of Illinois at Urbana-Champaign
Speculative shared-memory operations enable aggressive parallelisation
in situations that prove challenging to compilers and developers.
Applications execute optimistically in parallel, and the system
continuously monitors for memory consistency violations, squashing
and restarting offending threads as needed.
We present solutions that make use of speculative shared-memory
operations to improve the performance of sequential and parallel codes.
We describe ways to successfully extract parallelism from loops that
cannot be handled by current parallelising compilers. We also
propose mechanisms to achieve higher concurrency in highly synchronised
parallel codes, in particular lock-based critical sections.
Our solutions require relatively simple hardware, and are efficiently
integrated in a cache-coherent NUMA system.
Gurindar S. Sohi
University of Wisconsin-Madison
Processor and system performance have grown at a phenomenal
rate (60% per year) for many years; a sizeable portion
of this improvement has come from architectural and
microarchitectural techniques used to make
productive use of the available semiconductor resources.
Many of the techniques used by architects (e.g., caches
and branch predictors) exploit program behavior -- the
observed empirical characteristics of program execution.
In the next decade, advances in semiconductor technology
will provide us with lots of transistors with which to
build processing engines. The job of a computer architect
will be to make productive use of these transistors:
carrying out processing functions in more powerful
ways than in previous generations.
To do so, it is likely that program behavior will
need to be understood, captured, and exploited in heretofore unknown ways.
While current hardware techniques reason about program behavior
by observing events, future hardware techniques are likely
to reason about program behavior by learning about the program
structure (relationships between program instructions) that
causes the observed behavior, and exploit these relationships.
In this talk, we will look at recently-proposed hardware
techniques to solve several problems that arise in the design
of computing systems. These novel techniques exploit some
knowledge about the dependence relationships amongst the
instructions of a program. We will see how program
structure-based techniques can be applied to the problems of
scheduling out-of-order memory operations, streamlining
communication through memory, managing memory hierarchies,
prefetching linked structures, and optimizing communication in
shared memory multiprocessors.
Having seen the benefits of having access to relevant program
structure information, an obvious question is how such information
should be gathered, made available to the
execution hardware, and maintained. There are several possibilities,
ranging from purely hardware solutions to solutions that
make use of compile-time information. This issue has
several open research problems, some
of which we are investigating at Wisconsin.
Gurindar S. Sohi
University of Wisconsin-Madison
Architects of future generation processors will have hundreds of
millions of transistors with which to build computing chips.
At the same time, it is becoming clear that naive scaling of
conventional (superscalar) designs will increase complexity
and cost while not meeting performance goals.
Consequently, many computer architects are advocating a shift in
focus from high-performance to high-throughput with a corresponding
shift to multithreaded architectures.
Multithreaded architectures provide new opportunities for
extracting parallelism from a single program via thread level
speculation. We expect to see two major forms of thread-level
speculation: control-driven and data-driven.
We believe that future processors will not only be multithreaded,
but will also support thread-level speculation, giving them the
flexibility to operate in either multiple-program/high-throughput
or single-program/high-performance capacities. Deployment of
such processors will require innovations in means to convey
multithreading information from software to hardware,
algorithms for thread selection and management, as well as
hardware structures to support the simultaneous execution of
collections of speculative and non-speculative threads.
Miron Livny
University of Wisconsin-Madison
The recent dramatic decrease in the cost-performance ratio of processing,
storage and communication hardware has turned computing into a commodity.
Computers and disks are considered "supplies" and are purchased under
the
same budget category as pencils and erasers. As a result of this trend,
we can
find today powerful computing capabilities resting on office desks,
piled on
laboratory shelves or mounted on racks in machine rooms. These abundant
computing and storage resources are managed by off-the-shelf software
and
are interconnected by high-speed networks. Individuals and small groups
own
these resources and exercise full control over their usage. Researchers
and
engineers in academia, research laboratories and industry are looking
for
frameworks and software tools that will enable them to harness this
power.
In the talk we will discuss the challenges we face in transforming "communities"
of loosely coupled and distributively owned commodity hardware and
software
into effective computing environments. We will present what we believe
to be
the key mechanisms required to turn such communities into dependable
systems
capable of delivering large amounts of computing cycles over very long
time periods.
The talk is based on our decade-long experience with the Condor high
throughput
computing system, close interaction with a wide range of domain scientists
and our
recent involvement in national efforts to develop and build computational
and data Grids.
Marc Snir
IBM T. J. Watson Research Center
The talk will review 10 years of research on scalable parallel systems
at
IBM Research, both from the perspective of technical content, and from
the
perspective of the interplay between research and development in an
industrial environment.
Marin Litou
Centre for Advanced Studies, IBM Toronto Laboratory
Web-based applications are distributed across many tiers and involve
a large palette of technologies. Such applications have millions of
registered users and tens of thousands of concurrent users that
interact with the system through a set of classes of requests.
Furthermore, the relative load of these classes can shift
throughout the day, causing changes to system behavior and bottlenecks.
This talk will introduce the major challenges a performance engineer
faces when dealing with web-based applications: performance modeling
and evaluation, capacity planning, problem determination and stress
testing. Also, several quantitative performance-engineering techniques
will be discussed.
The techniques presented consider all workload conditions, are
iterative in nature, and are hybrid mathematical programming and
analytic performance evaluation methods.
That's all folks!!!!!