Per Stenstrom
Chalmers University of Technology, Gothenburg (Sweden)
While the increasing speed gap between processors and memory has gained significant attention, most research into high-performance memory systems has been driven by scientific applications. In particular, very little work has targeted database applications, a key market driver for server technology.
Our current project aims to understand how the complex interaction between database applications, other important server workloads, and memory systems in uni- and multiprocessor architectures affects performance. We have focused in particular on locality and coherence issues and have explored a wide range of techniques to improve performance.
This talk has three goals. The first is to give insight into the performance issues of memory systems for database applications, including how working sets scale with database size in DSS workloads and how OLTP workloads interact with coherence protocols. The second is to present improvement techniques that circumvent some of these performance problems, including novel prefetch approaches for pointer-intensive workloads and coherence schemes for the dominant access patterns in OLTP workloads. Finally, in studying these issues we have spent a great deal of effort developing a productive experimental methodology. Our approach has been to use complete-system simulation, sometimes in combination with analytical models, and the talk will give many examples of how these methodological approaches have been successfully used.
Wen-mei Hwu
University of Illinois at Urbana-Champaign, USA
?
Miron Livny
University of Wisconsin, USA
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. Most scientists are less concerned with the instantaneous performance of their environment, typically measured in floating-point operations per second (FLOPS), than with the amount of computing they can harness over a month or a year. FLOPS has long been the principal metric of High Performance Computing (HPC) efforts, and the computing community has devoted little attention to environments that can deliver large amounts of processing capacity over long periods of time. We refer to such environments as High Throughput Computing (HTC) environments.
The key to HTC is effective management and exploitation of all available computing resources. For more than a decade, we have been developing, implementing, and deploying HTC software tools that can effectively harness the capacity of hundreds of distributively owned computing resources. Using our software, the Condor resource management environment, scientists may simultaneously and transparently exploit the capacity of resources they are not even aware exist. The design of Condor follows a novel approach to HTC based on a layered Resource Allocation and Management architecture. In the talk we will outline the overall architecture of Condor and discuss the interaction between the different layers. We will outline the principles that have guided our work and the lessons we have learned from deploying Condor and from interacting with scientists and engineers from a wide range of disciplines. Special attention will be devoted to the role the master-worker computing paradigm plays in High Throughput Computing.
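The master-worker paradigm mentioned above can be sketched in a few lines: a master fills a queue with independent work units, and any number of workers drain it, so faster or more numerous workers simply finish more units. This is an illustrative sketch only, using plain Python threads rather than Condor's actual interface; `run_master_worker` and its parameters are names invented here.

```python
# Minimal master-worker sketch (illustrative; not Condor's API).
import queue
from threading import Thread

def run_master_worker(tasks, worker_fn, n_workers=4):
    """Distribute `tasks` over `n_workers` worker threads; return all results."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []  # list.append is atomic in CPython, so workers may share it
    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker retires
            results.append(worker_fn(t))
    threads = [Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

if __name__ == "__main__":
    # Workers pull work units dynamically, which is what makes the
    # paradigm robust to heterogeneous, opportunistically owned machines.
    print(sorted(run_master_worker(range(10), lambda x: x * x)))
```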
Thomas Sterling
California Institute of Technology and NASA Jet Propulsion Laboratory,
USA
Although recent trends in high end computer system development have
been focused on CMOS based COTS microprocessors, past advances often reflected
complementing innovation in both device technology and
computer architecture. Even as the MPP is stretched to incorporate
thousands of processors to reach a peak performance of a few Teraflops,
a Federally sponsored interdisciplinary research team is developing a new
latency-tolerant parallel computer architecture that integrates advanced synergistic technologies to achieve petaflops performance within a few years. The Hybrid Technology Multi-Threaded (HTMT) class of parallel computer architecture exploits the extreme capabilities of superconductor logic, optical communications, holographic storage, and memory-with-logic semiconductor technologies to make petaflops-scale computing possible within practical constraints of engineering complexity, power consumption, and cost. The new architecture will employ these building blocks
with high efficiency through a combination of multithreaded processor structure
and an innovative memory structure called "percolation". Together, these
new techniques revolutionize the relationship between processors and memory,
allowing simplicity of design, wide margins in operation, and high efficiency
of performance. These combined advances of technology and architecture
will deliver unprecedented performance and price-performance. This presentation
will describe the exceptional properties of the advanced technologies incorporated
by HTMT and the challenges being met by its revolutionary architecture.
Ronaldo A. de Lara Gonçalves
Universidade Estadual de Maringa (Brazil)
Professor Ronaldo Gonçalves is a Brazilian doctoral fellow who came to UPC to build part of an SMT simulator. He will give a short presentation of his research group and of the activities they carry out within the APSE (Superscalar Processor Architectures) project, and will then present his doctoral proposal.
Professor Ronaldo is working on the design of an SMT (simultaneous multithreaded) architecture called SEMPRE, which is able to manage processes (applications) and ease the work of the operating system. The SEMPRE architecture schedules processes and executes special instructions: create_process, kill_process, suspend_process and resume_process. In addition, this architecture includes a mechanism called Process Prefetching, which fetches processes ahead of time from the L2 memory into the L1 memory, according to the process-scheduling algorithm.
Mario Nemirovsky
Xstreamlogic
Microprocessor architecture has evolved significantly since the introduction of processors in the early 1970s. Despite substantial changes in processor workloads, the concept of a "general-purpose processor" has survived very successfully. Because the performance of these processors has increased so rapidly, and because specialized operations such as floating point, multimedia, and 3D extensions were incorporated, the need for special-purpose processors has not been felt. However, new requirements are quickly emerging on various fronts such as communications, games, and real-time applications. Are the days of the "general-purpose processor" numbered? In this talk we will discuss these new requirements and current trends in microprocessor technology from both an industry and an academic perspective.
Gordon Bell
Microsoft Bay Area Research Center
Moore's Law is the most common explanation of the rapid evolution of computing. In addition, Bell's Law (c. 1975), perhaps a corollary of Moore's Law, describes why the various computer classes form. Other laws are equally important: laws of demand, such as Metcalfe's Law describing the value of a network, and laws of supply, such as manufacturing learning curves and Bill's Law for software cost.
Dimitris Nikolopoulos
HPCLAB - Dept. of Computer Engineering and Informatics. University
of Patras
We present dynamic user-level page migration, a runtime technique that transparently enables parallel programs to self-tune their memory performance via feedback obtained through runtime performance monitoring on distributed shared memory multiprocessors. Runtime information from page reference counters is used to trigger page migrations that reduce the latency of remote memory accesses. Our technique exploits information available to the program both at compile time and at runtime to improve the accuracy and timeliness of page migrations, and to better amortize their cost, compared to a page migration engine implemented in the operating system. We present a competitive and a predictive algorithm for dynamic user-level page migration. These algorithms work in synergy to improve the responsiveness of the program to external events that may necessitate page migrations, such as preemptions and migrations of threads. We also present a new technique for preventing page ping-pong and a page forwarding mechanism for tuning applications with phase changes. Our experimental evidence on an SGI Origin2000 shows that unmodified OpenMP codes linked with our runtime system for dynamic page migration are effectively immune to the initial memory placement of the operating system. This result suggests that it might not be necessary to compromise the simplicity of the OpenMP shared-memory programming standard with data placement directives. Furthermore, our runtime system achieves solid performance improvements over the IRIX 6.5.5 page migration engine, for single parallel OpenMP codes and for multiprogrammed workloads.
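The competitive flavor of such a migration decision can be illustrated with a toy rule: per-node reference counters for a page are sampled, and the page moves only when remote references dominate local ones by enough margin that one migration pays for itself. The code below is a hypothetical sketch, not the authors' actual algorithm or the Origin2000 counter interface; `should_migrate` and its threshold are invented for illustration.

```python
# Hypothetical sketch of a competitive page-migration check.
def should_migrate(counters, home, threshold=2.0):
    """counters: dict mapping node id -> reference count for one page.
    Returns the node the page should migrate to, or None to stay put."""
    best = max(counters, key=counters.get)
    if best == home:
        return None
    # Competitive rule: migrate only if the dominant remote node's
    # references outweigh the home node's by `threshold`, so the saved
    # remote accesses amortize the (expensive) migration itself.
    if counters[best] >= threshold * max(counters[home], 1):
        return best
    return None

if __name__ == "__main__":
    print(should_migrate({0: 10, 1: 35, 2: 5}, home=0))  # node 1 dominates
    print(should_migrate({0: 30, 1: 35, 2: 5}, home=0))  # too close: stay
```

A predictive variant could instead act on scheduling events (e.g., a thread migration) before the counters ever show the imbalance.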
Enric Musoll
Xstreamlogic
This paper proposes a micro-architectural technique in which a
prediction is made for some power-hungry units of a processor. The
prediction consists of whether the result of a particular unit
or block of logic will be useful in order to execute the current
instruction. If it is predicted useless, then that block is
disabled.
It would be ideal if the predictions were totally accurate, so that the instructions-per-cycle (IPC) performance metric never decreased. However, this is not the case: the IPC might be degraded, and the extra cycles needed to complete the execution of the application may in turn offset the power savings obtained with the predictors.
In general, some logic may determine which of the blocks that have an associated predictor will be disabled, based on the outcome of the predictors and possibly other signals from the processor. The overall reduction in processor power consumption is a function of how accurate the predictors are, what percentage of the total power consumption corresponds to the predicted blocks, and how sensitive the IPC is to each block.
A case example is presented in which two blocks are predicted for low power: the on-chip L2 cache for instruction fetches, and the Branch Target Buffer. The IPC versus power-consumption design space is explored for a particular microprocessor architecture, targeting both average and peak power consumption. Although detailed power analysis is beyond the scope of this paper, high-level estimates show that the ideas described can plausibly produce a significant reduction in useless block accesses. Clearly, this reduction may be exploited to reduce the power consumption demands of high-performance processors.
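As one hedged illustration of such a usefulness predictor (not the mechanism this paper actually proposes), a 2-bit saturating counter can gate a block: when the counter predicts "useless", the block is disabled for that access, trading a possible IPC loss for power savings.

```python
# Illustrative 2-bit saturating usefulness predictor (invented example).
class UsefulnessPredictor:
    def __init__(self):
        self.counter = 3  # start strongly "useful" so IPC is never hurt at first

    def predict_useful(self):
        # Counter >= 2 means "enable the block for this access".
        return self.counter >= 2

    def update(self, was_useful):
        # Saturating update in [0, 3] based on the actual outcome.
        if was_useful:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

if __name__ == "__main__":
    p = UsefulnessPredictor()
    trace = [False, False, False, True, False]  # block is rarely useful
    decisions = []
    for outcome in trace:
        decisions.append(p.predict_useful())
        p.update(outcome)
    print(decisions)  # [True, True, False, False, False]
```

A mispredicted "useless" costs extra cycles (the IPC degradation above), which is exactly why prediction accuracy bounds the net power savings.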
Timothy Mark Pinkston
SMART Interconnects Group
University of Southern California
The development of fully-adaptive, cut-through networks is important
for achieving high performance in communication-critical parallel
processor systems. Increased flexibility in routing allows network
bandwidth to be used efficiently but, also, creates more opportunity
for cyclic resource dependencies to form which can result in deadlock.
If not guarded against, deadlocks in routing make packets block in the network indefinitely and could eventually bring the entire network to a complete standstill. This talk presents a simple, flexible and efficient routing approach for multicomputer/multiprocessor interconnection networks which is based on progressive deadlock recovery, as opposed to deadlock avoidance or regressive deadlock recovery. Performance is optimized by allowing the maximum routing freedom provided by network resources to be exploited. True fully-adaptive routing is supported, in which all physical and virtual channels at each node in the network are available to packets without regard for deadlocks. Deadlock cycles, upon forming, are efficiently broken in finite time by progressively routing one of the blocked packets through a connected, deadlock-free recovery path. This new routing approach enables the design of high-throughput networks that provide excellent performance. Simulations indicate that progressive deadlock recovery routing improves throughput by as much as 45% and 25% over leading deadlock-avoidance-based and regressive-recovery-based routing schemes, respectively.
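The detection step behind progressive recovery can be sketched as a walk over a channel wait-for graph: a cycle of channels each waiting on the next signals a potential deadlock, and one packet on that cycle is then diverted through the dedicated recovery path. Everything below (the `waits_for` map, the channel names) is an invented illustration; the recovery hardware itself is not modeled.

```python
# Sketch of deadlock detection as cycle-finding in a wait-for graph.
def find_deadlock_cycle(waits_for):
    """waits_for: dict mapping a blocked channel -> the channel it waits on.
    Returns a list of channels forming a cycle, or None if none exists."""
    for start in waits_for:
        seen, node = [], start
        # Follow the single outgoing wait edge until we fall off the
        # graph or revisit a channel on our own path.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]  # the cyclic suffix
    return None

if __name__ == "__main__":
    # c0 waits on c1, c1 on c2, c2 on c0: a cyclic resource dependency.
    print(find_deadlock_cycle({"c0": "c1", "c1": "c2", "c2": "c0"}))
    # No cycle: the chain of waits drains on its own.
    print(find_deadlock_cycle({"c0": "c1", "c1": "c2"}))
```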
Jose Duato
Universitat Politecnica de Valencia
This talk analyzes some critical issues in the design of
interconnects for networks of workstations (NOWs), describing
recently proposed techniques. In particular, this talk analyzes the
requirements to route messages in NOWs. Then, several techniques for
deadlock handling and their application to the design of adaptive
routing algorithms are presented. The talk will describe techniques
for deadlock avoidance and recovery, showing their application to
networks of workstations with possibly irregular topologies. Also,
implementation details will be presented, including flow control
protocols, channel pipelining, deadlock detection, and injection
limitation. Some performance evaluation results will show the
benefits of using sophisticated routing techniques.
The talk will also consider the requirements to support multimedia
traffic, briefly presenting a single-chip architecture for multimedia
routers as well as techniques to dynamically update routing
tables in the presence of topology changes without producing
deadlocks.
Dolors Royo
UPC-DAC
In this talk I will present the work carried out during my stay at Purdue University. During the stay I began the design and development of the resource management subsystem of the PUNCH system. PUNCH (Purdue University Network-Computing Hubs) is network software (over the Internet) that allows resources (basically machines and software) to be shared among all the users (students and faculty) of the different campuses (six in total) and technology centers of Purdue University.
Pau Bofill
UPC-DAC
Independent component analysis is by now a standard technique for the blind separation of sources from a set of linear mixtures, and in practice it amounts to solving a numerical optimization problem. The situation becomes more complicated, however, when the number of mixtures (sensors) is smaller than the number of original signals (sources). The work we present here is the result of analyzing the parallelism of one of the approaches proposed in the literature, which led to a decomposition of the algorithm and, ultimately, to its simplification. The resulting technique is illustrated with examples of separation of musical signals.
Sally A. McKee
University of Utah
Microprocessor speed is increasing much faster than memory system speed: both growth curves are exponential, but they represent diverging exponentials. The traditional approach to attacking the memory system bottleneck has been to build deeper and more complex cache hierarchies. Although caching may work well for parts of programs that exhibit high locality, many important commercial and scientific workloads lack the locality of reference that makes caching effective. A 1996 study by Sites and Perl on a commercial database workload shows that memory bus and DRAM latencies cause an 8x slowdown from peak performance to actual performance. Another 1996 study, by Burger, Goodman, and Kagi, finds that compared to an optimal cache the efficiency of current caching techniques is generally less than 20%, and that cache sizes are up to 2000 times larger than optimal. The evidence is clear: no matter how hard we push it, traditional caching cannot bridge the growing processor-memory performance gap.
This talk presents research that attacks the memory problem at a different level: the memory controller. We describe the Stream Memory Controller (SMC) system built at the University of Virginia, and the Impulse Adaptable Memory System being built at the University of Utah. The SMC dynamically reorders accesses to stream elements to avoid bank conflicts and bus turnaround delays and to exploit locality of reference within the page buffers of the DRAMs. Impulse virtualizes unused physical addresses and dynamically remaps data to improve cache and bus utilization. For instance, Impulse can gather sparse data into dense cache lines that have high locality and leave a smaller cache footprint. Gathering strided data improves performance on regular applications, and gathering data through an indirection vector improves performance on irregular applications (e.g., Impulse improves the NAS conjugate gradient benchmark's performance by 67%). By prefetching data within the memory controller, Impulse uses bus bandwidth more efficiently, transferring only the needed data when it is requested. Both of these systems require that the compiler or application writer supply the access pattern information that the memory controller exploits. The SMC optimizes low-level memory performance for regular applications. In contrast, Impulse optimizes performance within the cache hierarchy and the memory back end for both regular and irregular computations.
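The gather-through-indirection idea can be shown with a toy model. In the real system the remapping happens in the memory controller, below the caches, so the Python sketch below only illustrates the data movement, not the hardware: sparse elements reached through an indirection vector are packed into one dense "cache line" so a line fill carries only useful data.

```python
# Toy model of gathering sparse data into a dense line via an
# indirection vector (illustrative only; names invented here).
def gather(data, indirection):
    """Pack data[indirection[i]] into a dense sequence, in order."""
    return [data[i] for i in indirection]

if __name__ == "__main__":
    data = list(range(0, 1000, 10))   # a large, sparsely accessed array
    cols = [3, 17, 42, 99]            # e.g. column indices of one sparse row
    # Instead of four mostly wasted line fills, the controller could
    # deliver one dense line holding exactly these four elements.
    print(gather(data, cols))         # [30, 170, 420, 990]
```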
Binu K. Mathew, Sally A. McKee, John B. Carter, Al Davis
University of Utah
We are attacking the memory bottleneck by building a "smart" memory controller that improves effective memory bandwidth, bus utilization, and cache efficiency by letting applications dictate how their data is accessed and cached. This paper describes a Parallel Vector Access unit (PVA), the vector memory subsystem that efficiently "gathers" sparse, strided data structures in parallel on a multibank SDRAM memory. We have validated our PVA design via gate-level simulation, and have evaluated its performance via functional simulation and formal analysis. On unit-stride vectors, PVA performance equals or exceeds that of an SDRAM system optimized for cache line fills. On vectors with larger strides, the PVA is up to 32.8 times faster. Our design is up to 3.3 times faster than a pipelined, serial SDRAM memory system that gathers sparse vector data, and the gathering mechanism is two to five times faster than in other PVAs with similar goals. Our PVA only slightly increases hardware complexity with respect to these other systems, and the scalable design is appropriate for a range of computing platforms, from vector supercomputers to commodity PCs.
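A rough sketch of why parallel access helps: if banks are interleaved, the controller can split a strided vector's element addresses by bank and let each bank gather its share concurrently, so the gather time is bounded by the busiest bank rather than by one serial access per element. The mapping below is our own simplification, not the PVA hardware design; the function and parameter names are invented.

```python
# Simplified bank decomposition of a strided vector access.
def split_by_bank(base, stride, length, n_banks, line_words=8):
    """Map each vector element's word address to an SDRAM bank.
    Returns a dict: bank -> list of word addresses that bank must fetch."""
    per_bank = {b: [] for b in range(n_banks)}
    for i in range(length):
        addr = base + i * stride
        bank = (addr // line_words) % n_banks  # line-interleaved banks
        per_bank[bank].append(addr)
    return per_bank

if __name__ == "__main__":
    work = split_by_bank(base=0, stride=4, length=16, n_banks=4)
    # A serial gather needs 16 accesses; with all four banks working in
    # parallel, the time is set by the busiest bank's share.
    print(max(len(addrs) for addrs in work.values()))  # 4
```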
We begin with a brief review of the evolution of fine-grain multithreaded models and architectures, in particular the EARTH (Efficient Architecture For Running Threads) architecture model. We outline the program execution model and the architectural issues for multithreaded processing elements suitable for use as node processors facing the challenges of both regular and irregular applications. These include the choice of an appropriate program execution model, the organization of the processing element to achieve good utilization and speedup in the presence of irregular data and control flow patterns, the support for fine-grain interprocessor communication and dynamic load balancing, and related compilation issues. We have implemented the EARTH architecture model on a number of experimental high-performance multiprocessor platforms, and we will present some experimental results on the effectiveness of the EARTH architecture on irregular applications. The recent evolution of the EARTH model and its impact on the HTMT Petaflops architecture model will also be addressed.
IBM's premier chess system, based on an IBM RS/6000 SP
scalable parallel processor, made history by defeating
world chess champion Garry Kasparov. Deep Blue's chess
prowess stems from its capacity to examine over 200
million board positions per second, utilizing the computing
resources of a 32-node IBM RS/6000-SP, populated with 512
special purpose chess accelerators.
In this talk we describe some of the technology behind
Deep Blue and how chess knowledge was incorporated into its
software, as well as the attitude of the media and general
public during the match.
The Centre for Advanced Studies (CAS) was established by the IBM Toronto
Laboratory in 1991 to strengthen the links between academia and industry.
Since then, this research group has been helping to bridge the gap between industry's requirements for new technology and academic research.
CAS places an emphasis on solving real problems encountered by developers. These problems lead to long-term projects from which prototypes and publication-quality research can be delivered, all in collaboration with other members of the research community.
Delivering, and moreover continuing to sustain, the levels of computational performance required by the most demanding applications in a cost-effective manner requires a unique combination of technologies. Such supercomputers have always been at the vanguard of HPC and provide much of the impetus for the general-purpose computing industry as a whole. To address this high-technology area, QSW have created a generic system architecture based on an ultra-high-performance interconnect and an extensible resource management system. This combination enables commodity components to be integrated and scaled into the supercomputer class. QSW is now applying these technologies, in partnership, to large clustered SMP/UNIX configurations aimed at multi-TFLOP performance, and is now set to launch Linux configurations providing a more amenable entry-level capability.
Currently, every Internet search engine has its own interface. At the same time, there are several visualization metaphors that can be very useful in search engines. In this talk we present a model for visualizers and search engines, and possible software architectures for its implementation, based on an intermediate language that uses XML.
MPEG-2 is the video compression standard
that has been chosen for digital TV, DVD and HDTV.
The MPEG-2 video decompression combines algorithm
parts like the inverse discrete cosine transform
which exhibit high computational demands,
completely memory bound elements like the motion compensation,
and highly irregular program flows in the Huffman decoder.
We show how the MPEG-2 video decompression algorithm
can be made multithreaded resulting in a theoretical speedup of
about 6.5 compared to the sequential algorithm.
Our target architectures are simultaneous multithreaded (SMT)
processor models with multimedia enhancements.
We start with a wide-issue superscalar processor, enhance it by
the simultaneous multithreading technique, by multimedia units,
and by an additional on-chip RAM storage.
The simulations with the handcoded multithreaded MPEG-2 video
decompression algorithm as workload show that 8-threaded,
8-issue processor configurations reach a threefold speed-up
over comparable single-threaded, 8-issue models.
In this talk I will show how Loop Tiling (or Blocking) can be applied to numerical applications such as Jacobi, Swim, Mgrid, etc. These applications (of the "iterative stencil computation" kind) exhibit a high degree of data reuse, so transformations such as Loop Tiling, whose goal is to exploit the data reuse inherent in programs, can be very effective. However, because of the structure of this kind of application, Loop Tiling cannot be applied in the conventional way. In this work I will show how it can be done, and I will present some experimental results on two different microprocessors (the Alpha 21164 and the Alpha 21264). This work was carried out during my stay at Compaq.
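A minimal sketch of the conventional, spatial form of tiling on a single Jacobi sweep may help fix ideas; Python stands in here for the real compiled loops, and tiling across the time loop, which is where iterative stencils defeat the conventional transformation, is deliberately not attempted.

```python
# Spatial loop tiling of one 4-point Jacobi sweep (illustrative sketch).
def jacobi_sweep_tiled(a, tile=32):
    """One Jacobi relaxation sweep over the interior of 2-D grid `a`."""
    n = len(a)
    out = [row[:] for row in a]  # boundary values are carried over
    # Visit the grid tile by tile so a tile's data stays in cache while
    # it is reused by neighboring stencil points.
    for ti in range(1, n - 1, tile):
        for tj in range(1, n - 1, tile):
            for i in range(ti, min(ti + tile, n - 1)):
                for j in range(tj, min(tj + tile, n - 1)):
                    out[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                                        + a[i][j - 1] + a[i][j + 1])
    return out

if __name__ == "__main__":
    n = 8
    grid = [[float(i + j) for j in range(n)] for i in range(n)]
    print(jacobi_sweep_tiled(grid, tile=4)[1][1])  # 2.0
```

A single sweep reads only the previous grid, so spatial tiling is safe; reusing data across successive sweeps (time tiling) changes the dependences and needs the non-conventional scheme the talk presents.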
In anticipation of future developments, the possible architectural realizations of a billion-transistor chip are discussed. In particular:
1. Single-CPU Superspeculative Superscalar.
2. Simultaneous Multithreaded (SMT).
3. Chip Multiprocessor (CMP).
4. Intelligent RAM (IRAM).
5. RAW Processors.
The properties of the above solutions are compared both
qualitatively and quantitatively.
Subsequently the new trend of explicitly parallel instruction
computing (EPIC) and an introduction to the new IA-64 Intel and
HP architecture are covered.
The goal of the Custom-Fit Processor (CFP) project at HP Labs
is to produce a scalable and customizable technology platform
for embedded VLIW processors. The CFP project targets
System-On-Chip designs, where currently a combination of core
processors, DSP engines and specialized ASICs is required for
high-performance applications. Custom-designing a core
processor provides enough performance to absorbe some
ASIC/DSP functionality, reduce costs and time-to-market.
The first outcome of the CFP project is the Lx Family of
Embedded VLIW Cores, designed by STMicroelectronics and
Hewlett-Packard Laboratories. For Lx we developed the
architecture and software from the beginning to support both
scalability (variable numbers of identical processing
resources) and customizability (special purpose resources).
The Lx technology is based on
· A new clustered Very Long Instruction Word (VLIW) core
architecture and microarchitecture that ensures scalability
and customizability; and that we can customize to a specific
application domain
· A software toolchain based on aggressive
instruction-level parallel (ILP) compiler technology that
gives the user a uniform view of the technology platform at
the programming language level
The first Lx implementation shows that specialization for an application domain is very effective, yielding large gains in the price/performance ratio. Using the design of Lx as a
concrete example, the talk will address issues like: When is
scaling or customization beneficial? How can one determine
the degree of customization or scaling that will provide the
greatest payoff for a particular application domain? What
architectural compromises have to be made to contain the
complexity inherent in a customizable and scalable processor
family?
The current trend in HPC hardware is towards clusters of shared-memory
(SMP) compute nodes. For applications developers the major question
is
how best to program these SMP clusters. To address this we study an
algorithm from Discrete Element Modeling, parallelised using both the
message-passing and shared-memory models simultaneously (``hybrid''
parallelisation). The natural load-balancing methods are different
in
the two parallel models, the shared-memory method being in principle
more efficient for very load-imbalanced problems. It is therefore
possible that hybrid parallelism will be beneficial on SMP clusters.
We
benchmark MPP, SMP and cluster architectures, compare the performance
of
the pure OpenMP and MPI codes, and evaluate the effectiveness of hybrid
parallelism. Although we observe cases where OpenMP is more efficient
than MPI on a single SMP node, we conclude that our current OpenMP
implementation is not yet efficient enough for hybrid parallelism to
outperform pure MPI on an SMP cluster.
This talk presents the design of a set of packet routers specially conceived for ccNUMA architectures. Their simplicity of design makes them perfectly suited for integration into the new families of microprocessors that implement both the router and the message interface on chip. The basic characteristics of this family of routers rest on new packet flow-control techniques and new switch arbitration techniques. To demonstrate their superior performance over standard solutions in a quantitative way, an evaluation methodology has been developed that spans from the hardware designs of the routers to execution-driven simulation of the complete system running real applications.
One of the main concerns in today's processor design is the issue logic.
Instruction-level parallelism is usually favored by an out-of-order
issue mechanism where instructions can issue independently of the
program order. The out-of-order scheme yields the best performance
but
at the same time introduces important hardware costs such as an
associative look-up, which might be prohibitive for wide issue
processors with large instruction windows. This associative search may slow down the clock rate, and it has an important impact on power consumption. In this work, two new issue schemes are presented that reduce the hardware complexity of the issue logic with minimal impact on the average number of instructions executed per cycle.
Keywords: instruction issue logic, wide-issue superscalar, out-of-order
issue, in-order issue.
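The trade-off between the two issue disciplines can be seen in a toy model (our own illustration, not the schemes this work proposes): with single-cycle latencies and an issue width of two, out-of-order issue fills slots that in-order issue leaves empty behind a stalled instruction.

```python
# Toy comparison of in-order vs. out-of-order issue (invented example).
def cycles_to_issue(deps, width=2, in_order=False):
    """deps[i] lists the instructions i depends on (single-cycle latency).
    Returns the number of cycles needed to issue all instructions."""
    n = len(deps)
    done = [False] * n
    cycle = 0
    while not all(done):
        cycle += 1
        if in_order:
            # Issue only a ready prefix of program order: stop at a stall.
            ready = []
            for i in (i for i in range(n) if not done[i]):
                if all(done[d] for d in deps[i]):
                    ready.append(i)
                else:
                    break
        else:
            # Out-of-order: any not-yet-issued instruction whose sources
            # are complete may issue, regardless of program order.
            ready = [i for i in range(n)
                     if not done[i] and all(done[d] for d in deps[i])]
        for i in ready[:width]:
            done[i] = True
    return cycle

if __name__ == "__main__":
    # i1 depends on i0; i2 and i3 are independent.
    deps = [[], [0], [], []]
    print(cycles_to_issue(deps, width=2, in_order=True))   # 3: stalls behind i1
    print(cycles_to_issue(deps, width=2, in_order=False))  # 2: fills the slots
```

The out-of-order scheme wins on cycles, but selecting `ready` corresponds to the associative wakeup/select search whose hardware cost the paper targets.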
As microprocessors push beyond 1 GHz, processor architects are facing a number of challenges. Wire delays are becoming critical, power considerations temper the availability of billions of transistors, and reliability is becoming increasingly important. To accommodate these challenges (and opportunities), microarchitectures will process instruction streams in a distributed fashion: instruction-level distributed processing (ILDP). ILDP will be implemented in a variety of ways, including both homogeneous and heterogeneous elements. To help find run-time parallelism, orchestrate distributed hardware resources, implement power conservation strategies, and provide fault-tolerant features, an additional layer of abstraction, the virtual machine layer, will become an essential ingredient. Finally, new instruction set architectures may be developed to better focus on communication and dependence, rather than on computation and independence as is commonly done today.
The constant increase in IC integration and the higher operating frequencies of modern microarchitectures result in a significant increase in power consumption, even though technology and voltage scaling approaches are adopted to mitigate this effect. Issues such as the high cost of cooling systems, portability, reliability, and today's high demand for mobile devices call for new evaluation methods and techniques to attack this problem. A comprehensive analysis of models and techniques related to power consumption is given. We finally propose a detailed analysis of the power consumption of a typical superscalar processor and a run-time technique able to significantly reduce the power consumption of the issue logic.
The Pro64(TM) compilers for IA-64, recently released to open source
by
SGI, were developed by retargeting the MIPSpro compilers for the MIPS
processor line and adding new support for unusual IA-64 architectural
features. Due to strong influences going back to the similar
Cydra 5,
and a long history of addressing the issues IA-64 was designed to
address, these compilers are uniquely suited to take full advantage
of
the IA-64 performance features, while providing the robustness of an
established product. Because of their open source status and
structure
as a series of phases based on a powerful intermediate language, the
compilers are also an excellent basis for compiler and architecture
research.
In this talk, Dr. Dehnert will describe the structure and capabilities
of the Pro64 compilers, which include interprocedural analysis and
optimization, an unparalleled SSA-based global optimizer, loop nest
optimization and parallelization, and a hyperblock-based code generator
with software pipelining, global scheduling, and other low-level
optimizations. He will also summarize some of the key technologies
used by the compilers, including the Target Description Tables that
minimize target-specific coding in the compilers, feedback
capabilities, SSA, software pipelining, and global code motion.
Further information, including a list of publications related to the
Pro64 compilers, can be found on the SGI web site
http://oss.sgi.com/projects/Pro64
Software pipelining is a loop scheduling technique that extracts parallelism out of loops by overlapping the execution of several consecutive iterations. Because of this overlapping, schedules impose high register requirements during their execution. A schedule is valid if it requires at most the number of registers available in the target architecture. If not, its register requirements have to be reduced, either by decreasing the iteration overlap or by spilling registers to memory. In this paper we describe a set of heuristics to increase the quality of register-constrained modulo schedules. The heuristics decide between the two previous alternatives and define criteria for effectively selecting spilling candidates. The heuristics proposed for reducing the register pressure can be applied to any software pipelining technique. The proposals are evaluated using a register-conscious software pipeliner on a workbench composed of a large set of loops from the Perfect Club benchmarks and a set of processor configurations, and are compared against a previous proposal from the literature. For one of these processor configurations and the set of loops that do not fit in the available registers (32), a speed-up of 1.68 and a reduction of the memory traffic by a factor of 0.57 are achieved with an affordable increase in compilation time. Over all the loops, this represents a speed-up of 1.38 and a reduction of the memory traffic by a factor of 0.7.
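The core decision the abstract describes, choosing between relaxing iteration overlap and spilling, can be sketched as follows. This is a minimal illustration with an invented cost model and invented names (`constrain_schedule`, the pressure-drop constants), not the paper's actual heuristics.

```python
# Hypothetical sketch of register-constrained modulo scheduling: when a
# schedule's register requirement (MaxLive) exceeds the register file,
# either increase the initiation interval (II), reducing overlap, or
# spill a value to memory. Costs and pressure-drop amounts are
# illustrative assumptions, not measured values.

def constrain_schedule(max_live, available_regs, ii, spill_cost, ii_cost):
    """Return (actions, final II, final MaxLive) that fit the register file."""
    actions = []
    while max_live > available_regs:
        # Heuristic choice: spill when one spill is cheaper than slowing
        # every iteration by one cycle; otherwise relax the overlap.
        if spill_cost < ii_cost:
            actions.append("spill")
            max_live -= 1          # assume each spilled value frees one register
        else:
            actions.append("increase_II")
            ii += 1
            max_live -= 2          # assumed pressure drop per extra II cycle
    return actions, ii, max_live
```

A real pipeliner would recompute MaxLive after each change rather than use fixed decrements; the sketch only shows the shape of the trade-off.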
The trends toward network-based and service-oriented computing are evident, but the viability of these paradigms hinges on technology that can make software usable via a "universal client" that serves as a portal to a distributed computing system. Given the current state of the art, World Wide Web browsers are ideal as universal clients. On the other hand, designing a server-side infrastructure that provides Web access to arbitrary software is a challenging task: the design must meet the needs of its users, work within the constraints of available hardware, interoperate with existing Web standards and operating systems, support legacy software, and be able to accommodate new and emerging technologies.
The Purdue University Network Computing Hubs (PUNCH) is a prototype system of this type: it allows users to run unmodified software via standard WWW browsers. The infrastructure currently provides access to tools for semiconductor technology, VLSI design, computer architecture, and parallel programming. Forty tools developed by four vendors and eight universities are available (www.ece.purdue.edu/punch). To date, PUNCH has been utilized by more than 3,000 users, who have logged over 3,000,000 hits and have initiated more than 200,000 simulations.
This talk will focus on our approach to key issues in the design of such a system: 1) the interface to the external world, 2) the internal system design, 3) support for legacy applications, and 4) resource management across administrative domains. The interface to the external world is based on a unique network desktop that allows Web access to computing services by treating URLs as locations in a dynamic, virtual, and side-effect-based address space. The internal system architecture consists of a three-level hierarchy of cooperating servers; each server can be replicated, configured, and managed independently. Legacy applications are supported by way of a resource-description language that meets the needs of diverse software; a new application can be added in as little as thirty minutes. Resources in different administrative domains are controlled independently via a meta-programming environment that allows administrators to specify usage policies. The specified constraints are used by the system to make cost and performance tradeoffs at run time.
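The idea of URLs as locations in a side-effect-based address space can be illustrated with a toy dispatcher. This is our own sketch, not PUNCH code; the class, path, and tool names are all hypothetical.

```python
# Illustrative sketch of a "network desktop": dereferencing a URL path is
# like reading an address in a virtual address space, except that the read
# may have a side effect (launching a tool run) whose result then becomes
# the content at that location. Tool names below are invented.

class NetworkDesktop:
    def __init__(self):
        self.tools = {}       # path prefix -> callable that runs a tool
        self.results = {}     # dynamic part of the address space

    def register_tool(self, prefix, runner):
        self.tools[prefix] = runner

    def dereference(self, url_path, *args):
        """Resolve a URL path; side effect: may run a simulation."""
        for prefix, runner in self.tools.items():
            if url_path.startswith(prefix):
                self.results[url_path] = runner(*args)   # side effect
                return self.results[url_path]
        return self.results.get(url_path)                # plain lookup

desktop = NetworkDesktop()
desktop.register_tool("/tools/spice", lambda netlist: f"simulated {netlist}")
print(desktop.dereference("/tools/spice/run42", "inverter.cir"))
# prints "simulated inverter.cir"
```

The real system layers this behind a three-level server hierarchy; the sketch only shows the addressing idea.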
The purpose of this talk is to present several implementations of the IDEA algorithm developed in our laboratory. The first implementation uses the multimedia instruction set of the Itanium processor, the new VLIW processor from Intel and HP. The second, called Cryptobooster, is a cryptographic processor implemented entirely within an FPGA. This latter implementation gives better results than the Itanium software solution, from the standpoint of both speed and security.
A general overview of EPFL and of our laboratory will also be given, to present the broader context of our work.
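For readers unfamiliar with IDEA, its security rests on mixing three incompatible group operations on 16-bit words. The sketch below shows just those primitives; it is not a full IDEA implementation and is unrelated to the Itanium or FPGA versions discussed in the talk.

```python
# The three group operations of the IDEA block cipher, each acting on
# 16-bit words: bitwise XOR, addition modulo 2^16, and multiplication
# modulo 2^16 + 1 (a prime), where the value 0 stands in for 2^16 so
# that every 16-bit value has a multiplicative inverse.

def xor16(a, b):
    return a ^ b                      # XOR: group (Z_2)^16

def add16(a, b):
    return (a + b) % 0x10000          # addition modulo 2^16

def mul16(a, b):
    # Multiplication modulo 2^16 + 1, with 0 interpreted as 2^16.
    a = 0x10000 if a == 0 else a
    b = 0x10000 if b == 0 else b
    r = (a * b) % 0x10001
    return 0 if r == 0x10000 else r   # map 2^16 back to 0
```

The 16-by-16-bit multiply modulo 65537 is precisely the operation that multimedia instruction sets and FPGA datapaths can accelerate, which is why the two implementations in the talk differ so much in performance.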
The formal description and implementation of models and theories of the nervous system is carried out within a scheme of levels, each requiring different formal tools. These levels range from the level of neurotransmitters, membrane potentials, and action potentials, which can be treated with the tools of biochemistry and biophysics, up to the levels concerning the neural code in the cortex, cooperative processes, dependability, and the so-called "social interaction" between different parts of the central nervous system. Here, at a minimum, the formal tools of the symbolic methodology used in Artificial Intelligence are required.
At an intermediate level lie the processes of neural coding, simple and multiple, in the sensory and effector systems (glandular and motor systems), as well as the computational processes that take place within them. Here, the appropriate methods are those of Signal Processing and Classical Systems Theory. The most relevant case is the vertebrate retina, which possesses the typical layered computational structure of the cortex, yet projects toward the sensory periphery.
From this perspective, the talk addresses the problem of theorizing about and modeling the behavior of the retinal cells of different vertebrates along the evolutionary scale, from amphibians to the most advanced mammals. We study their description and present functional models to explain the variety of responses of the neurons, above all the "output" neurons, the ganglion cells, to the variety of visual stimuli. We will examine the type of neural computation these cells perform, which, curiously, becomes simpler as one advances along the evolutionary scale (the more elaborate computation moving toward the cortex). Models of multiple coding at the ganglion outputs (the optic nerve) will be shown: in short, something of what the eye "tells" the brain.
Speculative parallelization aggressively executes in parallel codes that the compiler cannot fully parallelize. Past proposals of hardware schemes have mostly focused on single-chip multiprocessors (CMPs), whose effectiveness is necessarily limited by their small size. Very few schemes have attempted this technique in the context of scalable shared-memory systems.
In this talk, I will present and evaluate a new hardware scheme for scalable speculative parallelization. This design needs relatively simple hardware and is efficiently integrated into a cache-coherent NUMA system. We have designed the scheme in a hierarchical manner that largely abstracts away the internals of the node, effectively using a speculative CMP as the building block for our scheme.
Simulations show that the proposed architecture delivers good speedups at a modest hardware cost. For a set of important non-analyzable scientific loops, we report average speedups of 4.2 for 16 processors. We show that support for per-word speculative state is required by our applications; otherwise, performance suffers greatly.
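Why per-word state matters can be shown with a toy dependence checker. This is our own simplification, not the talk's hardware design: tracking at cache-line granularity reports a violation whenever two iterations touch the same line, even if they touch different words.

```python
# Toy sketch of cross-iteration violation detection in speculative
# parallelization, at word vs. cache-line granularity. A violation
# occurs when a later (more speculative) iteration has read a word
# that an earlier iteration writes; per-line tracking also flags
# "false" violations between different words of one line.
# LINE_WORDS is an assumed line size, not from the talk.

LINE_WORDS = 8   # assumed words per cache line

def violations(reads, writes, granularity):
    """reads/writes: lists of (iteration, word_address) pairs."""
    def loc(addr):
        return addr if granularity == "word" else addr // LINE_WORDS
    viol = set()
    for r_it, r_addr in reads:
        for w_it, w_addr in writes:
            # A later iteration speculatively read data that an earlier
            # iteration writes -> that iteration must squash and restart.
            if w_it < r_it and loc(w_addr) == loc(r_addr):
                viol.add(r_it)
    return viol

# Iteration 1 reads word 1; iteration 0 writes word 2 (same line).
reads  = [(1, 1)]
writes = [(0, 2)]
print(violations(reads, writes, "word"))   # set()  - no true dependence
print(violations(reads, writes, "line"))   # {1}    - false violation, squash
```

Each false violation costs a squash and re-execution, which is how coarse-grained state can greatly hurt performance on these loops.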
The Cray MTA is a uniform shared memory multiprocessor that uses
fine-grain multithreading to tolerate memory latency, synchronization
latency, and even functional unit latency. The challenge in scaling
up such a system is to keep the interconnection network affordable.
This talk will discuss the unusual attributes of the MTA architecture
and a few ideas to address the scaling problem for high bandwidth
systems.
Thanks to its shared-memory view, OpenMP is a convenient standard for writing parallel applications, and many compilers are already available for shared-memory machines. However, given the attractive price/performance ratio of PC clusters, it is appealing to execute OpenMP programs on such machines as well.
This talk presents our work in porting Nanos, an OpenMP environment for shared-memory machines, to distributed-memory multiprocessors by means of software DSM (SDSM) systems. We will focus on the problems we had to face because of the limited functionality of existing SDSM systems.
Based on the experience of porting the OpenMP environment Nanos on top of the SDSMs JIAJIA and DSM-PM2, we will discuss functionalities that an SDSM should support, such as shared-memory allocation, shared-stack handling, data consistency, and synchronization.
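As background on the data-consistency point, page-based SDSMs typically propagate only the words modified between acquire and release, using a twin-and-diff technique. The sketch below is our own simplification of that idea under stated assumptions (word-granularity pages, a single writer); it is not Nanos, JIAJIA, or DSM-PM2 code.

```python
# Twin/diff sketch in the spirit of page-based SDSM protocols: before
# the first write to a page, the node snapshots a "twin"; at release
# time it sends only the words that differ from the twin, rather than
# the whole page. Pages are modeled as plain lists of words.

def make_twin(page):
    return list(page)                 # snapshot taken before first write

def diff(twin, page):
    """Word-level diff computed at release time."""
    return {i: w for i, (t, w) in enumerate(zip(twin, page)) if t != w}

def apply_diff(page, d):
    for i, w in d.items():
        page[i] = w                   # merge the diff into another copy

home = [0, 0, 0, 0]                   # home copy of the page
copy = make_twin(home)                # a node's working copy
twin = make_twin(copy)                # twin made before writing
copy[2] = 7                           # local write inside a critical section
d = apply_d = diff(twin, copy)        # only word 2 needs to be sent
apply_diff(home, d)
print(d, home)                        # prints {2: 7} [0, 0, 7, 0]
```

Supporting OpenMP's shared stacks and dynamic shared allocation on top of such a protocol is exactly where the missing SDSM functionality shows up.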
The SR8000 is Hitachi's latest state-of-the-art
supercomputer combining the benefits of both
vector and massively parallel processing (MPP)
architectures.
This new computer merges the two design paradigms of Hitachi's earlier supercomputer products, the vector-architecture S-3000 series and the MPP SR2201, to form a new generation of high-performance computers.
PROMIS is an extensible automatic parallelizing compiler that is both multilingual and retargetable. It has an integrated front end and back end operating on a single unified/universal intermediate representation. PROMIS exploits multiple levels of static and dynamic parallelism, ranging from task- and loop-level to instruction-level parallelism. The integrated front and back end gives PROMIS the unique ability to make tradeoffs between high- and low-level parallelism. In addition, PROMIS employs a symbolic analyzer based on conditional algebra to provide control-sensitive and interprocedural information. The symbolic analyzer works in conjunction with every other analysis and transformation pass to offer a powerful optimizer. PROMIS was designed with extensibility in mind, and many facilitating features are built directly into the compiler. It is also retargetable, using a universal machine descriptor that provides an abstract definition of the machine; this allows PROMIS to be easily retargeted to DSPs, NOWs, NUMA machines, and more.
With the growing demands that model building places on computer simulation, programming techniques for efficient algorithms and procedures must keep up with the rapid progress of the underlying computer architecture platforms. Beyond moderately parallel vector computers, machines of a decidedly "massively parallel" form have been pushing into the market for several years now. In many cases, the efficient use of such parallel computers is a challenge to parallel programming.
The talk will discuss aspects of programming and optimizing HPC applications on parallel computers. Special emphasis will be placed on supportive software to