Per Stenstrom
Chalmers University of Technology, Gothenburg (Sweden)
While the increasing speed gap between processors and memory has gained significant attention, most research into high-performance memory systems has been driven by scientific applications. In particular, very little work has targeted database applications, a key market driver for server technology.
Our current project aims to understand how the complex interaction between database applications, other important server workloads, and memory systems in uni- and multiprocessor architectures affects performance. We have focused in particular on locality and coherence issues and have explored a wide range of techniques to improve performance.
This talk has three goals. The first is to give insight into the performance issues of memory systems for database applications, including how working sets scale with database size in DSS workloads and how OLTP workloads interact with coherence protocols. The second is to present improvement techniques that circumvent some of these performance problems, including novel prefetch approaches for pointer-intensive workloads and coherence schemes for the dominant access patterns in OLTP workloads. Finally, in studying these issues we have spent a great deal of effort developing a productive experimental methodology. Our approach has been to use complete-system simulation, sometimes in combination with analytical models, and the talk will give many examples of how these methodological approaches have been successfully used.
Wen-mei Hwu
University of Illinois at Urbana-Champaign, USA
?
Miron Livny
University of Wisconsin, USA
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. Most scientists are less concerned with the instantaneous performance of their environment, typically measured in floating-point operations per second (FLOPS), than with the amount of computing they can harness over a month or a year. FLOPS has long been the principal metric of High Performance Computing (HPC) efforts, and the computing community has devoted little attention to environments that can deliver large amounts of processing capacity over long periods of time. We refer to such environments as High Throughput Computing (HTC) environments.
The key to HTC is effective management and exploitation of all available computing resources. For more than a decade, we have been developing, implementing, and deploying HTC software tools that can effectively harness the capacity of hundreds of distributively owned computing resources. Using our software, the Condor resource management environment, scientists may simultaneously and transparently exploit the capacity of resources they are not even aware exist. The design of Condor follows a novel approach to HTC based on a layered Resource Allocation and Management architecture. In the talk we will outline the overall architecture of Condor and discuss the interaction between the different layers. We will outline the principles that have guided our work and the lessons we have learned from deploying Condor and from interacting with scientists and engineers from a wide range of disciplines. Special attention will be devoted to the role the master-worker computing paradigm plays in High Throughput Computing.
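The master-worker paradigm mentioned above can be sketched in a few lines: a master fills a queue with independent work units, and any number of workers drain it, so faster or more numerous workers simply finish more units. This is an illustrative sketch only, using plain Python threads rather than Condor's actual interface; `run_master_worker` and its parameters are names invented here.

```python
# Minimal master-worker sketch (illustrative; not Condor's API).
import queue
from threading import Thread

def run_master_worker(tasks, worker_fn, n_workers=4):
    """Distribute `tasks` over `n_workers` worker threads; return all results."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []  # list.append is atomic in CPython, so workers may share it
    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker retires
            results.append(worker_fn(t))
    threads = [Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

if __name__ == "__main__":
    # Workers pull work units dynamically, which is what makes the
    # paradigm robust to heterogeneous, opportunistically owned machines.
    print(sorted(run_master_worker(range(10), lambda x: x * x)))
```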
Thomas Sterling
California Institute of Technology and NASA Jet Propulsion Laboratory,
USA
Although recent trends in high end computer system development have
been focused on CMOS based COTS microprocessors, past advances often reflected
complementing innovation in both device technology and
computer architecture. Even as the MPP is stretched to incorporate
thousands of processors to reach a peak performance of a few Teraflops,
a Federally sponsored interdisciplinary research team is developing a new
latency-tolerant parallel computer architecture that integrates advanced synergistic technologies to achieve petaflops performance within a few years. The Hybrid Technology Multi-Threaded (HTMT) class of parallel computer architecture exploits the extreme capabilities of superconductor logic, optical communications, holographic storage, and memory-with-logic semiconductor technologies to make petaflops-scale computing possible within practical constraints of engineering complexity, power consumption, and cost. The new architecture will employ these building blocks
with high efficiency through a combination of multithreaded processor structure
and an innovative memory structure called "percolation". Together, these
new techniques revolutionize the relationship between processors and memory,
allowing simplicity of design, wide margins in operation, and high efficiency
of performance. These combined advances of technology and architecture
will deliver unprecedented performance and price-performance. This presentation
will describe the exceptional properties of the advanced technologies incorporated
by HTMT and the challenges being met by its revolutionary architecture.
Ronaldo A. de Lara Gonçalves
Universidade Estadual de Maringa (Brazil)
Professor Ronaldo Gonçalves is a Brazilian doctoral fellow who came to UPC to build part of an SMT simulator. He will give a short presentation of his research group and of the activities they carry out within the APSE (Superscalar Processor Architectures) project, and will then present his doctoral proposal.
Professor Ronaldo is working on the design of an SMT (simultaneous multithreaded) architecture called SEMPRE, which is able to manage processes (applications) and ease the work of the operating system. The SEMPRE architecture schedules processes and executes special instructions: create_process, kill_process, suspend_process and resume_process. In addition, this architecture includes a mechanism called Process Prefetching, which fetches processes ahead of time from the L2 memory into the L1 memory, according to the process-scheduling algorithm.
Mario Nemirovsky
Xstreamlogic
Microprocessor architecture has evolved significantly since the introduction of processors in the early 1970s. Despite substantial changes in processor workloads, the concept of a "general-purpose processor" has survived very successfully. Because the performance of these processors has increased so rapidly, and because specialized operations such as floating point, multimedia, and 3D extensions were incorporated, the need for special-purpose processors has not been felt. However, new requirements are quickly emerging on various fronts such as communications, games, and real-time applications. Are the days of the "general-purpose processor" numbered? In this talk we will discuss these new requirements and current trends in microprocessor technology from both an industry and an academic perspective.
Gordon Bell
Microsoft Bay Area Research Center
Moore's Law is the most common explanation of the rapid evolution of computing. In addition, Bell's Law (c. 1975), perhaps a corollary of Moore's Law, describes why the various computer classes form. Other laws are equally important: laws of demand, such as Metcalfe's Law describing the value of a network, and laws of supply, such as manufacturing learning curves and Bill's Law for software cost.
Dimitris Nikolopoulos
HPCLAB - Dept. of Computer Engineering and Informatics. University
of Patras
We present dynamic user-level page migration, a runtime technique that transparently enables parallel programs to self-tune their memory performance via feedback obtained through runtime performance monitoring on distributed shared memory multiprocessors. Runtime information from page reference counters is used to trigger page migrations that reduce the latency of remote memory accesses. Our technique exploits information available to the program both at compile time and at runtime to improve the accuracy and timeliness of page migrations, and to better amortize their cost, compared to a page migration engine implemented in the operating system. We present a competitive and a predictive algorithm for dynamic user-level page migration. These algorithms work in synergy to improve the responsiveness of the program to external events that may necessitate page migrations, such as preemptions and migrations of threads. We also present a new technique for preventing page ping-pong and a page forwarding mechanism for tuning applications with phase changes. Our experimental evidence on an SGI Origin2000 shows that unmodified OpenMP codes linked with our runtime system for dynamic page migration are effectively immune to the initial memory placement of the operating system. This result suggests that it might not be necessary to compromise the simplicity of the OpenMP shared-memory programming standard with data placement directives. Furthermore, our runtime system achieves solid performance improvements over the IRIX 6.5.5 page migration engine, for single parallel OpenMP codes and for multiprogrammed workloads.
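The competitive flavor of such a migration decision can be illustrated with a toy rule: per-node reference counters for a page are sampled, and the page moves only when remote references dominate local ones by enough margin that one migration pays for itself. The code below is a hypothetical sketch, not the authors' actual algorithm or the Origin2000 counter interface; `should_migrate` and its threshold are invented for illustration.

```python
# Hypothetical sketch of a competitive page-migration check.
def should_migrate(counters, home, threshold=2.0):
    """counters: dict mapping node id -> reference count for one page.
    Returns the node the page should migrate to, or None to stay put."""
    best = max(counters, key=counters.get)
    if best == home:
        return None
    # Competitive rule: migrate only if the dominant remote node's
    # references outweigh the home node's by `threshold`, so the saved
    # remote accesses amortize the (expensive) migration itself.
    if counters[best] >= threshold * max(counters[home], 1):
        return best
    return None

if __name__ == "__main__":
    print(should_migrate({0: 10, 1: 35, 2: 5}, home=0))  # node 1 dominates
    print(should_migrate({0: 30, 1: 35, 2: 5}, home=0))  # too close: stay
```

A predictive variant could instead act on scheduling events (e.g., a thread migration) before the counters ever show the imbalance.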
Enric Musoll
Xstreamlogic
This paper proposes a micro-architectural technique in which a
prediction is made for some power-hungry units of a processor. The
prediction consists of whether the result of a particular unit
or block of logic will be useful in order to execute the current
instruction. If it is predicted useless, then that block is
disabled.
It would be ideal if the predictions were totally accurate, so that the instructions-per-cycle (IPC) performance metric never decreased. However, this is not the case: the IPC might be degraded, and the extra cycles needed to complete the execution of the application may in turn offset the power savings obtained with the predictors.
In general, some logic may determine which of the blocks that have an associated predictor will be disabled, based on the outcome of the predictors and possibly other signals from the processor. The overall reduction in processor power consumption is a function of how accurate the predictors are, what percentage of the total power consumption corresponds to the predicted blocks, and how sensitive the IPC is to each block.
A case example is presented in which two blocks are predicted for low power: the on-chip L2 cache for instruction fetches, and the Branch Target Buffer. The IPC versus power-consumption design space is explored for a particular microprocessor architecture, targeting both average and peak power consumption. Although detailed power analysis is beyond the scope of this paper, high-level estimates show that the ideas described can plausibly produce a significant reduction in useless block accesses. Clearly, this reduction may be exploited to reduce the power consumption demands of high-performance processors.
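As one hedged illustration of such a usefulness predictor (not the mechanism this paper actually proposes), a 2-bit saturating counter can gate a block: when the counter predicts "useless", the block is disabled for that access, trading a possible IPC loss for power savings.

```python
# Illustrative 2-bit saturating usefulness predictor (invented example).
class UsefulnessPredictor:
    def __init__(self):
        self.counter = 3  # start strongly "useful" so IPC is never hurt at first

    def predict_useful(self):
        # Counter >= 2 means "enable the block for this access".
        return self.counter >= 2

    def update(self, was_useful):
        # Saturating update in [0, 3] based on the actual outcome.
        if was_useful:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

if __name__ == "__main__":
    p = UsefulnessPredictor()
    trace = [False, False, False, True, False]  # block is rarely useful
    decisions = []
    for outcome in trace:
        decisions.append(p.predict_useful())
        p.update(outcome)
    print(decisions)  # [True, True, False, False, False]
```

A mispredicted "useless" costs extra cycles (the IPC degradation above), which is exactly why prediction accuracy bounds the net power savings.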
Timothy Mark Pinkston
SMART Interconnects Group
University of Southern California
The development of fully-adaptive, cut-through networks is important
for achieving high performance in communication-critical parallel
processor systems. Increased flexibility in routing allows network
bandwidth to be used efficiently but, also, creates more opportunity
for cyclic resource dependencies to form which can result in deadlock.
If not guarded against, deadlocks in routing make packets block in the network indefinitely and could eventually bring the entire network to a complete standstill. This talk presents a simple, flexible and efficient routing approach for multicomputer/multiprocessor interconnection networks which is based on progressive deadlock recovery, as opposed to deadlock avoidance or regressive deadlock recovery. Performance is optimized by allowing the maximum routing freedom provided by network resources to be exploited. True fully-adaptive routing is supported, in which all physical and virtual channels at each node in the network are available to packets without regard for deadlocks. Deadlock cycles, upon forming, are efficiently broken in finite time by progressively routing one of the blocked packets through a connected, deadlock-free recovery path. This new routing approach enables the design of high-throughput networks that provide excellent performance. Simulations indicate that progressive deadlock recovery routing improves throughput by as much as 45% and 25% over leading deadlock-avoidance-based and regressive-recovery-based routing schemes, respectively.
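The detection step behind progressive recovery can be sketched as a walk over a channel wait-for graph: a cycle of channels each waiting on the next signals a potential deadlock, and one packet on that cycle is then diverted through the dedicated recovery path. Everything below (the `waits_for` map, the channel names) is an invented illustration; the recovery hardware itself is not modeled.

```python
# Sketch of deadlock detection as cycle-finding in a wait-for graph.
def find_deadlock_cycle(waits_for):
    """waits_for: dict mapping a blocked channel -> the channel it waits on.
    Returns a list of channels forming a cycle, or None if none exists."""
    for start in waits_for:
        seen, node = [], start
        # Follow the single outgoing wait edge until we fall off the
        # graph or revisit a channel on our own path.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]  # the cyclic suffix
    return None

if __name__ == "__main__":
    # c0 waits on c1, c1 on c2, c2 on c0: a cyclic resource dependency.
    print(find_deadlock_cycle({"c0": "c1", "c1": "c2", "c2": "c0"}))
    # No cycle: the chain of waits drains on its own.
    print(find_deadlock_cycle({"c0": "c1", "c1": "c2"}))
```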
Jose Duato
Universitat Politecnica de Valencia
This talk analyzes some critical issues in the design of
interconnects for networks of workstations (NOWs), describing
recently proposed techniques. In particular, this talk analyzes the
requirements to route messages in NOWs. Then, several techniques for
deadlock handling and their application to the design of adaptive
routing algorithms are presented. The talk will describe techniques
for deadlock avoidance and recovery, showing their application to
networks of workstations with possibly irregular topologies. Also,
implementation details will be presented, including flow control
protocols, channel pipelining, deadlock detection, and injection
limitation. Some performance evaluation results will show the
benefits of using sophisticated routing techniques.
The talk will also consider the requirements to support multimedia
traffic, briefly presenting a single-chip architecture for multimedia
routers as well as techniques to dynamically update routing
tables in the presence of topology changes without producing
deadlocks.
Dolors Royo
UPC-DAC
In this talk I will present the work carried out during my stay at Purdue University. During the stay I began the design and development of the resource management subsystem of the PUNCH system. PUNCH (Purdue University Network-Computing Hubs) is network software (over the Internet) that allows resources (basically machines and software) to be shared among all the users (students and faculty) of the different campuses (six in total) and technology centers of Purdue University.
Pau Bofill
UPC-DAC
Independent component analysis is by now a standard technique for the blind separation of sources from a set of linear mixtures, and in practice it amounts to solving a numerical optimization problem. The situation becomes more complicated, however, when the number of mixtures (sensors) is smaller than the number of original signals (sources). The work we present here is the result of analyzing the parallelism of one of the approaches proposed in the literature, which led to a decomposition of the algorithm and, ultimately, to its simplification. The resulting technique is illustrated with examples of separation of musical signals.
Sally A. McKee
University of Utah
Microprocessor speed is increasing much faster than memory system speed: both growth curves are exponential, but they represent diverging exponentials. The traditional approach to attacking the memory system bottleneck has been to build deeper and more complex cache hierarchies. Although caching may work well for parts of programs that exhibit high locality, many important commercial and scientific workloads lack the locality of reference that makes caching effective. A 1996 study by Sites and Perl on a commercial database workload shows that memory bus and DRAM latencies cause an 8x slowdown from peak performance to actual performance. Another 1996 study, by Burger, Goodman, and Kagi, finds that compared to an optimal cache the efficiency of current caching techniques is generally less than 20%, and that cache sizes are up to 2000 times larger than optimal. The evidence is clear: no matter how hard we push it, traditional caching cannot bridge the growing processor-memory performance gap.
This talk presents research that attacks the memory problem at a different level: the memory controller. We describe the Stream Memory Controller (SMC) system built at the University of Virginia, and the Impulse Adaptable Memory System being built at the University of Utah. The SMC dynamically reorders accesses to stream elements to avoid bank conflicts and bus turnaround delays and to exploit locality of reference within the page buffers of the DRAMs. Impulse virtualizes unused physical addresses and dynamically remaps data to improve cache and bus utilization. For instance, Impulse can gather sparse data into dense cache lines that have high locality and leave a smaller cache footprint. Gathering strided data improves performance on regular applications, and gathering data through an indirection vector improves performance on irregular applications (e.g., Impulse improves the NAS conjugate gradient benchmark's performance by 67%). By prefetching data within the memory controller, Impulse uses bus bandwidth more efficiently, transferring only the needed data when it is requested. Both of these systems require that the compiler or application writer supply the access pattern information that the memory controller exploits. The SMC optimizes low-level memory performance for regular applications. In contrast, Impulse optimizes performance within the cache hierarchy and the memory back end for both regular and irregular computations.
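The gather-through-indirection idea can be shown with a toy model. In the real system the remapping happens in the memory controller, below the caches, so the Python sketch below only illustrates the data movement, not the hardware: sparse elements reached through an indirection vector are packed into one dense "cache line" so a line fill carries only useful data.

```python
# Toy model of gathering sparse data into a dense line via an
# indirection vector (illustrative only; names invented here).
def gather(data, indirection):
    """Pack data[indirection[i]] into a dense sequence, in order."""
    return [data[i] for i in indirection]

if __name__ == "__main__":
    data = list(range(0, 1000, 10))   # a large, sparsely accessed array
    cols = [3, 17, 42, 99]            # e.g. column indices of one sparse row
    # Instead of four mostly wasted line fills, the controller could
    # deliver one dense line holding exactly these four elements.
    print(gather(data, cols))         # [30, 170, 420, 990]
```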
Binu K. Mathew, Sally A. McKee, John B. Carter, Al Davis
University of Utah
We are attacking the memory bottleneck by building a "smart" memory controller that improves effective memory bandwidth, bus utilization, and cache efficiency by letting applications dictate how their data is accessed and cached. This paper describes a Parallel Vector Access unit (PVA), the vector memory subsystem that efficiently "gathers" sparse, strided data structures in parallel on a multibank SDRAM memory. We have validated our PVA design via gate-level simulation, and have evaluated its performance via functional simulation and formal analysis. On unit-stride vectors, PVA performance equals or exceeds that of an SDRAM system optimized for cache line fills. On vectors with larger strides, the PVA is up to 32.8 times faster. Our design is up to 3.3 times faster than a pipelined, serial SDRAM memory system that gathers sparse vector data, and the gathering mechanism is two to five times faster than in other PVAs with similar goals. Our PVA only slightly increases hardware complexity with respect to these other systems, and the scalable design is appropriate for a range of computing platforms, from vector supercomputers to commodity PCs.
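A rough sketch of why parallel access helps: if banks are interleaved, the controller can split a strided vector's element addresses by bank and let each bank gather its share concurrently, so the gather time is bounded by the busiest bank rather than by one serial access per element. The mapping below is our own simplification, not the PVA hardware design; the function and parameter names are invented.

```python
# Simplified bank decomposition of a strided vector access.
def split_by_bank(base, stride, length, n_banks, line_words=8):
    """Map each vector element's word address to an SDRAM bank.
    Returns a dict: bank -> list of word addresses that bank must fetch."""
    per_bank = {b: [] for b in range(n_banks)}
    for i in range(length):
        addr = base + i * stride
        bank = (addr // line_words) % n_banks  # line-interleaved banks
        per_bank[bank].append(addr)
    return per_bank

if __name__ == "__main__":
    work = split_by_bank(base=0, stride=4, length=16, n_banks=4)
    # A serial gather needs 16 accesses; with all four banks working in
    # parallel, the time is set by the busiest bank's share.
    print(max(len(addrs) for addrs in work.values()))  # 4
```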
We begin with a brief review of the evolution of fine-grain multithreaded models and architectures, in particular the EARTH (Efficient Architecture For Running Threads) architecture model. We outline the program execution model and the architectural issues for multithreaded processing elements suitable for use as node processors facing the challenges of both regular and irregular applications. These include the choice of an appropriate program execution model, the organization of the processing element to achieve good utilization and speedup in the presence of irregular data and control flow patterns, the support for fine-grain interprocessor communication and dynamic load balancing, and related compilation issues. We have implemented the EARTH architecture model on a number of experimental high-performance multiprocessor platforms, and we will present some experimental results on the effectiveness of the EARTH architecture on irregular applications. The recent evolution of the EARTH model and its impact on the HTMT Petaflops architecture model will also be addressed.
IBM's premier chess system, based on an IBM RS/6000 SP
scalable parallel processor, made history by defeating
world chess champion Garry Kasparov. Deep Blue's chess
prowess stems from its capacity to examine over 200
million board positions per second, utilizing the computing
resources of a 32-node IBM RS/6000-SP, populated with 512
special purpose chess accelerators.
In this talk we describe some of the technology behind
Deep Blue and how chess knowledge was incorporated into its
software, as well as the attitude of the media and general
public during the match.
The Centre for Advanced Studies (CAS) was established by the IBM Toronto
Laboratory in 1991 to strengthen the links between academia and industry.
Since then, this research group has been helping to bridge the gap between industry's requirements for new technology and academic research.
CAS places an emphasis on solving real problems encountered by developers. These problems lead to long-term projects from which prototypes and publication-quality research can be delivered, all in collaboration with other members of the research community.
Delivering, and moreover continuing to sustain, the levels of computational performance required by the most demanding applications in a cost-effective manner requires a unique combination of technologies. Such supercomputers have always been at the vanguard of HPC and provide much of the impetus for the general-purpose computing industry as a whole. To address this high-technology area, QSW have created a generic system architecture based on an ultra-high-performance interconnect and an extensible resource management system. This combination enables commodity components to be integrated and scaled into the supercomputer class. QSW is now applying these technologies, in partnership, to large clustered SMP/UNIX configurations aimed at multi-TFLOP performance, and is now set to launch Linux configurations providing a more amenable entry-level capability.
Currently, every Internet search engine has its own interface. At the same time, there are several visualization metaphors that can be very useful in search engines. In this talk we present a model for visualizers and search engines, and possible software architectures for its implementation, based on an intermediate language that uses XML.
MPEG-2 is the video compression standard
that has been chosen for digital TV, DVD and HDTV.
The MPEG-2 video decompression combines algorithm
parts like the inverse discrete cosine transform
which exhibit high computational demands,
completely memory bound elements like the motion compensation,
and highly irregular program flows in the Huffman decoder.
We show how the MPEG-2 video decompression algorithm
can be made multithreaded resulting in a theoretical speedup of
about 6.5 compared to the sequential algorithm.
Our target architectures are simultaneous multithreaded (SMT)
processor models with multimedia enhancements.
We start with a wide-issue superscalar processor, enhance it by
the simultaneous multithreading technique, by multimedia units,
and by an additional on-chip RAM storage.
The simulations with the handcoded multithreaded MPEG-2 video
decompression algorithm as workload show that 8-threaded,
8-issue processor configurations reach a threefold speed-up
over comparable single-threaded, 8-issue models.
In this talk I will show how Loop Tiling (or Blocking) can be applied to numerical applications such as Jacobi, Swim, Mgrid, etc. These applications (of the "iterative stencil computation" kind) exhibit a high degree of data reuse, so transformations such as Loop Tiling, whose goal is to exploit the data reuse inherent in programs, can be very effective. However, because of the structure of this kind of application, Loop Tiling cannot be applied in the conventional way. In this work I will show how it can be done, and I will present some experimental results on two different microprocessors (the Alpha 21164 and the Alpha 21264). This work was carried out during my stay at Compaq.
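A minimal sketch of the conventional, spatial form of tiling on a single Jacobi sweep may help fix ideas; Python stands in here for the real compiled loops, and tiling across the time loop, which is where iterative stencils defeat the conventional transformation, is deliberately not attempted.

```python
# Spatial loop tiling of one 4-point Jacobi sweep (illustrative sketch).
def jacobi_sweep_tiled(a, tile=32):
    """One Jacobi relaxation sweep over the interior of 2-D grid `a`."""
    n = len(a)
    out = [row[:] for row in a]  # boundary values are carried over
    # Visit the grid tile by tile so a tile's data stays in cache while
    # it is reused by neighboring stencil points.
    for ti in range(1, n - 1, tile):
        for tj in range(1, n - 1, tile):
            for i in range(ti, min(ti + tile, n - 1)):
                for j in range(tj, min(tj + tile, n - 1)):
                    out[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                                        + a[i][j - 1] + a[i][j + 1])
    return out

if __name__ == "__main__":
    n = 8
    grid = [[float(i + j) for j in range(n)] for i in range(n)]
    print(jacobi_sweep_tiled(grid, tile=4)[1][1])  # 2.0
```

A single sweep reads only the previous grid, so spatial tiling is safe; reusing data across successive sweeps (time tiling) changes the dependences and needs the non-conventional scheme the talk presents.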
In anticipation of future developments, the possible architectural realizations of a billion-transistor chip are discussed. In particular:
1. Single-CPU Superspeculative Superscalar.
2. Simultaneous Multithreaded (SMT).
3. Chip Multiprocessor (CMP).
4. Intelligent RAM (IRAM).
5. RAW Processors.
The properties of the above solutions are compared both
qualitatively and quantitatively.
Subsequently the new trend of explicitly parallel instruction
computing (EPIC) and an introduction to the new IA-64 Intel and
HP architecture are covered.
The goal of the Custom-Fit Processor (CFP) project at HP Labs
is to produce a scalable and customizable technology platform
for embedded VLIW processors. The CFP project targets
System-On-Chip designs, where currently a combination of core
processors, DSP engines and specialized ASICs is required for
high-performance applications. Custom-designing a core
processor provides enough performance to absorbe some
ASIC/DSP functionality, reduce costs and time-to-market.
The first outcome of the CFP project is the Lx Family of
Embedded VLIW Cores, designed by STMicroelectronics and
Hewlett-Packard Laboratories. For Lx we developed the
architecture and software from the beginning to support both
scalability (variable numbers of identical processing
resources) and customizability (special purpose resources).
The Lx technology is based on
· A new clustered Very Long Instruction Word (VLIW) core
architecture and microarchitecture that ensures scalability
and customizability; and that we can customize to a specific
application domain
· A software toolchain based on aggressive
instruction-level parallel (ILP) compiler technology that
gives the user a uniform view of the technology platform at
the programming language level
The first Lx implementation shows that specialization for an application domain is very effective, yielding large gains in the price/performance ratio. Using the design of Lx as a
concrete example, the talk will address issues like: When is
scaling or customization beneficial? How can one determine
the degree of customization or scaling that will provide the
greatest payoff for a particular application domain? What
architectural compromises have to be made to contain the
complexity inherent in a customizable and scalable processor
family?
The current trend in HPC hardware is towards clusters of shared-memory
(SMP) compute nodes. For applications developers the major question
is
how best to program these SMP clusters. To address this we study an
algorithm from Discrete Element Modeling, parallelised using both the
message-passing and shared-memory models simultaneously (``hybrid''
parallelisation). The natural load-balancing methods are different
in
the two parallel models, the shared-memory method being in principle
more efficient for very load-imbalanced problems. It is therefore
possible that hybrid parallelism will be beneficial on SMP clusters.
We
benchmark MPP, SMP and cluster architectures, compare the performance
of
the pure OpenMP and MPI codes, and evaluate the effectiveness of hybrid
parallelism. Although we observe cases where OpenMP is more efficient
than MPI on a single SMP node, we conclude that our current OpenMP
implementation is not yet efficient enough for hybrid parallelism to
outperform pure MPI on an SMP cluster.
This talk presents the design of a set of packet routers specially conceived for ccNUMA architectures. Their simplicity of design makes them perfectly suited for integration into the new families of microprocessors that implement both the router and the message interface on chip. The basic characteristics of this family of routers rest on new packet flow-control techniques and new switch arbitration techniques. To demonstrate their superior performance over standard solutions in a quantitative way, an evaluation methodology has been developed that spans from the hardware designs of the routers to execution-driven simulation of the complete system running real applications.
One of the main concerns in today's processor design is the issue logic.
Instruction-level parallelism is usually favored by an out-of-order
issue mechanism where instructions can issue independently of the
program order. The out-of-order scheme yields the best performance
but
at the same time introduces important hardware costs such as an
associative look-up, which might be prohibitive for wide issue
processors with large instruction windows. This associative search may slow down the clock rate, and it has an important impact on power consumption. In this work, two new issue schemes are presented that reduce the hardware complexity of the issue logic with minimal impact on the average number of instructions executed per cycle.
Keywords: instruction issue logic, wide-issue superscalar, out-of-order
issue, in-order issue.
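The trade-off between the two issue disciplines can be seen in a toy model (our own illustration, not the schemes this work proposes): with single-cycle latencies and an issue width of two, out-of-order issue fills slots that in-order issue leaves empty behind a stalled instruction.

```python
# Toy comparison of in-order vs. out-of-order issue (invented example).
def cycles_to_issue(deps, width=2, in_order=False):
    """deps[i] lists the instructions i depends on (single-cycle latency).
    Returns the number of cycles needed to issue all instructions."""
    n = len(deps)
    done = [False] * n
    cycle = 0
    while not all(done):
        cycle += 1
        if in_order:
            # Issue only a ready prefix of program order: stop at a stall.
            ready = []
            for i in (i for i in range(n) if not done[i]):
                if all(done[d] for d in deps[i]):
                    ready.append(i)
                else:
                    break
        else:
            # Out-of-order: any not-yet-issued instruction whose sources
            # are complete may issue, regardless of program order.
            ready = [i for i in range(n)
                     if not done[i] and all(done[d] for d in deps[i])]
        for i in ready[:width]:
            done[i] = True
    return cycle

if __name__ == "__main__":
    # i1 depends on i0; i2 and i3 are independent.
    deps = [[], [0], [], []]
    print(cycles_to_issue(deps, width=2, in_order=True))   # 3: stalls behind i1
    print(cycles_to_issue(deps, width=2, in_order=False))  # 2: fills the slots
```

The out-of-order scheme wins on cycles, but selecting `ready` corresponds to the associative wakeup/select search whose hardware cost the paper targets.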
As microprocessors push beyond 1 GHz, processor architects are facing a number of challenges. Wire delays are becoming critical, power considerations temper the availability of billions of transistors, and reliability is becoming increasingly important. To accommodate these challenges (and opportunities), microarchitectures will process instruction streams in a distributed fashion: instruction-level distributed processing (ILDP). ILDP will be implemented in a variety of ways, including both homogeneous and heterogeneous elements. To help find run-time parallelism, orchestrate distributed hardware resources, implement power conservation strategies, and provide fault-tolerant features, an additional layer of abstraction, the virtual machine layer, will become an essential ingredient. Finally, new instruction set architectures may be developed to better focus on communication and dependence, rather than on computation and independence as is commonly done today.
The constant increase in IC integration and the higher operating frequencies of modern microarchitectures result in a significant increase in power consumption, even though technology and voltage scaling approaches are adopted to mitigate this effect. Issues such as the high cost of cooling systems, portability, reliability, and today's high demand for mobile devices call for new evaluation methods and techniques to attack this problem. A comprehensive analysis of models and techniques related to power consumption is given. We finally propose a detailed analysis of the power consumption of a typical superscalar processor and a run-time technique able to significantly reduce the power consumption of the issue logic.
The Pro64(TM) compilers for IA-64, recently released to open source
by
SGI, were developed by retargeting the MIPSpro compilers for the MIPS
processor line and adding new support for unusual IA-64 architectural
features. Due to strong influences going back to the similar
Cydra 5,
and a long history of addressing the issues IA-64 was designed to
address, these compilers are uniquely suited to take full advantage
of
the IA-64 performance features, while providing the robustness of an
established product. Because of their open source status and
structure
as a series of phases based on a powerful intermediate language, the
compilers are also an excellent basis for compiler and architecture
research.
In this talk, Dr. Dehnert will describe the structure and capabilities
of the Pro64 compilers, which include interprocedural analysis and
optimization, an unparalleled SSA-based global optimizer, loop nest
optimization and parallelization, and a hyperblock-based code generator
with software pipelining, global scheduling, and other low-level
optimizations. He will also summarize some of the key technologies
used by the compilers, including the Target Description Tables that
minimize target-specific coding in the compilers, feedback
capabilities, SSA, software pipelining, and global code motion.
Further information, including a list of publications related to the
Pro64 compilers, can be found on the SGI web site
http://oss.sgi.com/projects/Pro64
Software pipelining is a loop scheduling technique that extracts parallelism out of loops by overlapping the execution of several consecutive iterations. Because of this overlapping, schedules impose high register requirements during their execution. A schedule is valid if it requires at most the number of registers available in the target architecture. If not, its register requirements have to be reduced, either by decreasing the iteration overlap or by spilling registers to memory. In this paper we describe a set of heuristics to increase the quality of register-constrained modulo schedules. The heuristics decide between the two previous alternatives and define criteria for effectively selecting spilling candidates. The heuristics proposed for reducing the register pressure can be applied to any software pipelining technique. The proposals are evaluated using a register-conscious software pipeliner on a workbench composed of a large set of loops from the Perfect Club benchmarks and a set of processor configurations, and are compared against a previous proposal from the literature. For one of these processor configurations and the set of loops that do not fit in the available registers (32), a speed-up of 1.68 and a reduction of the memory traffic by a factor of 0.57 are achieved with an affordable increase in compilation time. Over all the loops, this represents a speed-up of 1.38 and a reduction of the memory traffic by a factor of 0.7.
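The core decision the abstract describes, choosing between relaxing iteration overlap and spilling, can be sketched as follows. This is a minimal illustration with an invented cost model and invented names (`constrain_schedule`, the pressure-drop constants), not the paper's actual heuristics.

```python
# Hypothetical sketch of register-constrained modulo scheduling: when a
# schedule's register requirement (MaxLive) exceeds the register file,
# either increase the initiation interval (II), reducing overlap, or
# spill a value to memory. Costs and pressure-drop amounts are
# illustrative assumptions, not measured values.

def constrain_schedule(max_live, available_regs, ii, spill_cost, ii_cost):
    """Return (actions, final II, final MaxLive) that fit the register file."""
    actions = []
    while max_live > available_regs:
        # Heuristic choice: spill when one spill is cheaper than slowing
        # every iteration by one cycle; otherwise relax the overlap.
        if spill_cost < ii_cost:
            actions.append("spill")
            max_live -= 1          # assume each spilled value frees one register
        else:
            actions.append("increase_II")
            ii += 1
            max_live -= 2          # assumed pressure drop per extra II cycle
    return actions, ii, max_live
```

A real pipeliner would recompute MaxLive after each change rather than use fixed decrements; the sketch only shows the shape of the trade-off.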
The trends toward network-based and service-oriented computing are evident, but the viability of these paradigms hinges on technology that can make software usable via a "universal client" that serves as a portal to a distributed computing system. Given the current state of the art, World Wide Web browsers are ideal as universal clients. On the other hand, designing a server-side infrastructure that provides Web access to arbitrary software is a challenging task: the design must meet the needs of its users, work within the constraints of available hardware, interoperate with existing Web standards and operating systems, support legacy software, and be able to accommodate new and emerging technologies.
The Purdue University Network Computing Hubs (PUNCH) is a prototype system of this type: it allows users to run unmodified software via standard WWW browsers. The infrastructure currently provides access to tools for semiconductor technology, VLSI design, computer architecture, and parallel programming. Forty tools developed by four vendors and eight universities are available (www.ece.purdue.edu/punch). To date, PUNCH has been utilized by more than 3,000 users, who have logged over 3,000,000 hits and have initiated more than 200,000 simulations.
This talk will focus on our approach to key issues in the design of such a system: 1) the interface to the external world, 2) the internal system design, 3) support for legacy applications, and 4) resource management across administrative domains. The interface to the external world is based on a unique network desktop that allows Web access to computing services by treating URLs as locations in a dynamic, virtual, and side-effect-based address space. The internal system architecture consists of a three-level hierarchy of cooperating servers; each server can be replicated, configured, and managed independently. Legacy applications are supported by way of a resource-description language that meets the needs of diverse software; a new application can be added in as little as thirty minutes. Resources in different administrative domains are controlled independently via a meta-programming environment that allows administrators to specify usage policies. The specified constraints are used by the system to make cost and performance tradeoffs at run time.
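The idea of URLs as locations in a side-effect-based address space can be illustrated with a toy dispatcher. This is our own sketch, not PUNCH code; the class, path, and tool names are all hypothetical.

```python
# Illustrative sketch of a "network desktop": dereferencing a URL path is
# like reading an address in a virtual address space, except that the read
# may have a side effect (launching a tool run) whose result then becomes
# the content at that location. Tool names below are invented.

class NetworkDesktop:
    def __init__(self):
        self.tools = {}       # path prefix -> callable that runs a tool
        self.results = {}     # dynamic part of the address space

    def register_tool(self, prefix, runner):
        self.tools[prefix] = runner

    def dereference(self, url_path, *args):
        """Resolve a URL path; side effect: may run a simulation."""
        for prefix, runner in self.tools.items():
            if url_path.startswith(prefix):
                self.results[url_path] = runner(*args)   # side effect
                return self.results[url_path]
        return self.results.get(url_path)                # plain lookup

desktop = NetworkDesktop()
desktop.register_tool("/tools/spice", lambda netlist: f"simulated {netlist}")
print(desktop.dereference("/tools/spice/run42", "inverter.cir"))
# prints "simulated inverter.cir"
```

The real system layers this behind a three-level server hierarchy; the sketch only shows the addressing idea.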
The purpose of this talk is to present several implementations of the IDEA algorithm developed in our laboratory. The first implementation uses the multimedia instruction set of the Itanium processor, the new VLIW processor from Intel and HP. The second, called Cryptobooster, is a cryptographic processor implemented entirely within an FPGA. This latter implementation gives better results than the Itanium software solution, from the standpoint of both speed and security.
A general overview of EPFL and of our laboratory will also be given, to present the broader context of our work.
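For readers unfamiliar with IDEA, its security rests on mixing three incompatible group operations on 16-bit words. The sketch below shows just those primitives; it is not a full IDEA implementation and is unrelated to the Itanium or FPGA versions discussed in the talk.

```python
# The three group operations of the IDEA block cipher, each acting on
# 16-bit words: bitwise XOR, addition modulo 2^16, and multiplication
# modulo 2^16 + 1 (a prime), where the value 0 stands in for 2^16 so
# that every 16-bit value has a multiplicative inverse.

def xor16(a, b):
    return a ^ b                      # XOR: group (Z_2)^16

def add16(a, b):
    return (a + b) % 0x10000          # addition modulo 2^16

def mul16(a, b):
    # Multiplication modulo 2^16 + 1, with 0 interpreted as 2^16.
    a = 0x10000 if a == 0 else a
    b = 0x10000 if b == 0 else b
    r = (a * b) % 0x10001
    return 0 if r == 0x10000 else r   # map 2^16 back to 0
```

The 16-by-16-bit multiply modulo 65537 is precisely the operation that multimedia instruction sets and FPGA datapaths can accelerate, which is why the two implementations in the talk differ so much in performance.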
The formal description and implementation of models and theories of the nervous system is carried out within a scheme of levels, each requiring different formal tools. These levels range from the level of neurotransmitters, membrane potentials, and action potentials, which can be treated with the tools of biochemistry and biophysics, up to the levels concerning the neural code in the cortex, cooperative processes, dependability, and the so-called "social interaction" between different parts of the central nervous system. Here, at a minimum, the formal tools of the symbolic methodology used in Artificial Intelligence are required.
At an intermediate level lie the processes of neural coding, simple and multiple, in the sensory and effector systems (glandular and motor systems), as well as the computational processes that take place within them. Here, the appropriate methods are those of Signal Processing and Classical Systems Theory. The most relevant case is the vertebrate retina, which possesses the typical layered computational structure of the cortex, yet projects toward the sensory periphery.
From this perspective, the talk addresses the problem of theorizing about and modeling the behavior of the retinal cells of different vertebrates along the evolutionary scale, from amphibians to the most advanced mammals. We study their description and present functional models to explain the variety of responses of the neurons, above all the "output" neurons, the ganglion cells, to the variety of visual stimuli. We will examine the type of neural computation these cells perform, which, curiously, becomes simpler as one advances along the evolutionary scale (the more elaborate computation moving toward the cortex). Models of multiple coding at the ganglion outputs (the optic nerve) will be shown: in short, something of what the eye "tells" the brain.
Speculative parallelization aggressively executes in parallel codes that the compiler cannot fully parallelize. Past proposals of hardware schemes have mostly focused on single-chip multiprocessors (CMPs), whose effectiveness is necessarily limited by their small size. Very few schemes have attempted this technique in the context of scalable shared-memory systems.
In this talk, I will present and evaluate a new hardware scheme for scalable speculative parallelization. This design needs relatively simple hardware and is efficiently integrated into a cache-coherent NUMA system. We have designed the scheme in a hierarchical manner that largely abstracts away the internals of the node, effectively using a speculative CMP as the building block for our scheme.
Simulations show that the proposed architecture delivers good speedups at a modest hardware cost. For a set of important non-analyzable scientific loops, we report average speedups of 4.2 for 16 processors. We show that support for per-word speculative state is required by our applications; otherwise, performance suffers greatly.
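Why per-word state matters can be shown with a toy dependence checker. This is our own simplification, not the talk's hardware design: tracking at cache-line granularity reports a violation whenever two iterations touch the same line, even if they touch different words.

```python
# Toy sketch of cross-iteration violation detection in speculative
# parallelization, at word vs. cache-line granularity. A violation
# occurs when a later (more speculative) iteration has read a word
# that an earlier iteration writes; per-line tracking also flags
# "false" violations between different words of one line.
# LINE_WORDS is an assumed line size, not from the talk.

LINE_WORDS = 8   # assumed words per cache line

def violations(reads, writes, granularity):
    """reads/writes: lists of (iteration, word_address) pairs."""
    def loc(addr):
        return addr if granularity == "word" else addr // LINE_WORDS
    viol = set()
    for r_it, r_addr in reads:
        for w_it, w_addr in writes:
            # A later iteration speculatively read data that an earlier
            # iteration writes -> that iteration must squash and restart.
            if w_it < r_it and loc(w_addr) == loc(r_addr):
                viol.add(r_it)
    return viol

# Iteration 1 reads word 1; iteration 0 writes word 2 (same line).
reads  = [(1, 1)]
writes = [(0, 2)]
print(violations(reads, writes, "word"))   # set()  - no true dependence
print(violations(reads, writes, "line"))   # {1}    - false violation, squash
```

Each false violation costs a squash and re-execution, which is how coarse-grained state can greatly hurt performance on these loops.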
The Cray MTA is a uniform shared memory multiprocessor that uses
fine-grain multithreading to tolerate memory latency, synchronization
latency, and even functional unit latency. The challenge in scaling
up such a system is to keep the interconnection network affordable.
This talk will discuss the unusual attributes of the MTA architecture
and a few ideas to address the scaling problem for high bandwidth
systems.
Thanks to its shared-memory view, OpenMP is a convenient standard for writing parallel applications, and many compilers are already available for shared-memory machines. However, given the attractive price/performance ratio of PC clusters, it is appealing to execute OpenMP programs on such machines as well.
This talk presents our work in porting Nanos, an OpenMP environment for shared-memory machines, to distributed-memory multiprocessors by means of software DSM (SDSM) systems. We will focus on the problems we had to face because of the limited functionality of existing SDSM systems.
Based on the experience of porting the OpenMP environment Nanos on top of the SDSMs JIAJIA and DSM-PM2, we will discuss functionalities that an SDSM should support, such as shared-memory allocation, shared-stack handling, data consistency, and synchronization.
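As background on the data-consistency point, page-based SDSMs typically propagate only the words modified between acquire and release, using a twin-and-diff technique. The sketch below is our own simplification of that idea under stated assumptions (word-granularity pages, a single writer); it is not Nanos, JIAJIA, or DSM-PM2 code.

```python
# Twin/diff sketch in the spirit of page-based SDSM protocols: before
# the first write to a page, the node snapshots a "twin"; at release
# time it sends only the words that differ from the twin, rather than
# the whole page. Pages are modeled as plain lists of words.

def make_twin(page):
    return list(page)                 # snapshot taken before first write

def diff(twin, page):
    """Word-level diff computed at release time."""
    return {i: w for i, (t, w) in enumerate(zip(twin, page)) if t != w}

def apply_diff(page, d):
    for i, w in d.items():
        page[i] = w                   # merge the diff into another copy

home = [0, 0, 0, 0]                   # home copy of the page
copy = make_twin(home)                # a node's working copy
twin = make_twin(copy)                # twin made before writing
copy[2] = 7                           # local write inside a critical section
d = apply_d = diff(twin, copy)        # only word 2 needs to be sent
apply_diff(home, d)
print(d, home)                        # prints {2: 7} [0, 0, 7, 0]
```

Supporting OpenMP's shared stacks and dynamic shared allocation on top of such a protocol is exactly where the missing SDSM functionality shows up.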
The SR8000 is Hitachi's latest state-of-the-art
supercomputer combining the benefits of both
vector and massively parallel processing (MPP)
architectures.
This new computer merges the two design paradigms of Hitachi's earlier supercomputer products, the vector-architecture S-3000 series and the MPP SR2201, to form a new generation of high-performance computers.
PROMIS is an extensible automatic parallelizing compiler that is both multilingual and retargetable. It has an integrated front end and back end operating on a single unified/universal intermediate representation. PROMIS exploits multiple levels of static and dynamic parallelism, ranging from task- and loop-level to instruction-level parallelism. The integrated front and back end gives PROMIS the unique ability to make tradeoffs between high- and low-level parallelism. In addition, PROMIS employs a symbolic analyzer based on conditional algebra to provide control-sensitive and interprocedural information. The symbolic analyzer works in conjunction with every other analysis and transformation pass to offer a powerful optimizer. PROMIS was designed with extensibility in mind, and many facilitating features are built directly into the compiler. It is also retargetable, using a universal machine descriptor that provides an abstract definition of the machine; this allows PROMIS to be easily retargeted to DSPs, NOWs, NUMA machines, and more.
With the growing demands that model building places on computer simulation, programming techniques for efficient algorithms and procedures must keep up with the rapid progress of the underlying computer architecture platforms. Beyond moderately parallel vector computers, machines of a decidedly "massively parallel" form have been pushing into the market for several years now. In many cases, the efficient use of such parallel computers is a challenge to parallel programming.
The talk will discuss aspects of programming and optimizing HPC applications on parallel computers. Special emphasis will be placed on supportive software to