Abstracts of the presentations at the CAP Seminar, 1998-1999


Java for High Performance Computing

Jordi Garcia

Java has become one of the most popular languages in recent years. Java's main advantage is its portability, which has made it the language of choice for developers of Internet applications.

On the other hand, the future of High Performance Computing lies in exploiting the computing power distributed all over the world, with the Internet as the most widely used communication medium. However, Java was not designed to support scientific computing.

This talk aims to give an overview of the efforts being carried out by different research groups to steer Java towards the world of High Performance Computing.

These efforts can be classified into three levels:


Split Last-Address Predictor

Enric Morancho

Recent work has proposed the use of prediction techniques to speculatively execute true data-dependent operations. However, predictability is not spread uniformly among the operations. We therefore propose run-time classification of instructions to increase the efficiency of the predictors. At run time, the proposed mechanism classifies instructions according to their predictability, decoupling this classification from the prediction table. The classification is then used to prevent unpredictable instructions from becoming candidates for allocating an entry in the prediction table. This idea of run-time classification is applied to the last-address predictor (Split Last-Address Predictor). The goal of this predictor is to reduce the latency of load instructions: the memory access is performed after the effective address is predicted, concurrently with instruction fetch; after that, subsequent true data-dependent instructions can be executed speculatively. We show that our proposal, applied to the last-address predictor, captures the same predictability as the last-address predictor proposed in the literature, increases its accuracy, and reduces its area cost by 19%.
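The classification idea can be sketched in a few lines. The following toy model (names, thresholds, and counter widths are illustrative, not the paper's actual design) keeps a per-PC confidence counter decoupled from the prediction table, and only lets loads that have proven predictable occupy a table entry:

```python
class SplitLastAddressPredictor:
    """Toy last-address predictor with run-time classification."""

    def __init__(self, threshold=2, max_count=3):
        self.history = {}   # pc -> (last address seen, confidence counter)
        self.table = {}     # pc -> predicted address (predictable loads only)
        self.threshold = threshold
        self.max_count = max_count

    def predict(self, pc):
        # Predicted effective address, or None if the load holds no entry.
        return self.table.get(pc)

    def update(self, pc, addr):
        # Saturating counter tracks whether the load repeats its address.
        last, conf = self.history.get(pc, (None, 0))
        if addr == last:
            conf = min(conf + 1, self.max_count)
        else:
            conf = max(conf - 1, 0)
        self.history[pc] = (addr, conf)
        if conf >= self.threshold:
            self.table[pc] = addr      # classified predictable: keep an entry
        else:
            self.table.pop(pc, None)   # unpredictable: keep the table clean
```

A load that repeatedly touches the same address earns a table entry after a few confirmations, while a load whose address keeps changing never displaces a useful entry.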


Possibilities for a Physical Realization of the PRAM (Parallel Random Access Machine)

Michael Lindig

Topics to be covered in the talk:
a.- Memory access time in shared-memory parallel machines.
b.- Aspects of physical realization for these architectures.
c.- Access arbitration.
d.- A distributed arbitration scheme for crossbar networks.
e.- A distributed cache memory scheme for crossbar networks.
f.- Possibilities for physical construction.
g.- Conclusions.


Binary Instrumentation with DIXIE

Roger Espasa

The Dixie project seeks to provide a tool that allows flexible instrumentation of program binaries for computer architecture research. The major features of the Dixie tool are listed below:


Cache Sensitive Modulo Scheduling

Jesus Sanchez

This work focuses on the interaction between software prefetching (both binding and nonbinding) and software pipelining for VLIW machines. First, it is shown that evaluating software-pipelined schedules without considering memory effects can be rather inaccurate, due to stalls caused by dependences with memory instructions (even if a lockup-free cache is considered). It is also shown that the penalty of the stalls is higher than the effect of spill code. Second, we show that in general binding schemes are more powerful than nonbinding ones for software-pipelined schedules. Finally, the main contribution of this paper is a heuristic scheme that schedules some memory operations according to the locality estimated at compile time and other attributes of the dependence graph. The proposed scheme is shown to outperform other heuristic approaches, since it achieves a better trade-off between compute and stall time.


Introduction to Fuzzy Logic

Frederic Vila

This talk aims to give a brief idea of fuzzy set theory and the different fields where the use of fuzzy logic brings a notable improvement (artificial intelligence, systems control, etc.).

Likewise, the structure of a MISO FLC (Fuzzy Logic Controller, Multiple Input Single Output) will be briefly described, in order to see what the hardware implementation of a simple fuzzy-logic-based system would look like.
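To make the MISO structure concrete, here is a minimal two-input, single-output fuzzy controller sketch. The membership functions and rule base are invented for illustration (they are not the design discussed in the talk), and a Sugeno-style weighted average is used for defuzzification because it maps cleanly to hardware:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def flc(error, delta):
    """Two inputs (error and its derivative), one crisp output."""
    # Fuzzification: degree of membership in "negative" and "positive" sets.
    e_neg, e_pos = tri(error, -2, -1, 0), tri(error, 0, 1, 2)
    d_neg, d_pos = tri(delta, -2, -1, 0), tri(delta, 0, 1, 2)
    # Rule base: each rule fires with strength min(antecedents) and
    # proposes a crisp output level.
    rules = [
        (min(e_neg, d_neg), -1.0),   # both negative -> push output down
        (min(e_pos, d_pos), +1.0),   # both positive -> push output up
        (min(e_neg, d_pos),  0.0),   # mixed signs   -> hold
        (min(e_pos, d_neg),  0.0),
    ]
    # Defuzzification: weighted average of the rule outputs.
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0
```

Each stage (fuzzify, min for rule firing, weighted average) is a small combinational block, which is what makes this structure attractive for a simple hardware implementation.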


Widening Resources: A Cost-effective Technique for Aggressive ILP Architectures

David Lopez

The inherent instruction-level parallelism (ILP) of current applications (especially those based on floating-point computations) has driven hardware designers and compiler writers to investigate aggressive techniques for exploiting program parallelism at the lowest level. To execute more operations per cycle, many processors are designed with growing degrees of resource replication (buses and functional units); however, the high cost in terms of area and cycle time of this technique precludes the use of high degrees of replication. An alternative to resource replication is resource widening, which has also been used in some recent designs, in which the width of the resources is increased.

In this paper we evaluate a broad set of design alternatives that combine both replication and widening. For each alternative we perform an estimation of the ILP limits (including the impact of spill code for several register file configurations) and the cost in terms of area and access time of the register file. We also perform a technological projection for the next 10 years in order to foresee the possible implementable alternatives. From this study we conclude that the best performance is obtained when combining certain degrees of replication and widening in the hardware resources. The results have been obtained from a large number of inner loops from numerical programs scheduled for VLIW architectures.


Promenvir: A Statistical Data Analysis Tool Developed at CEPBA

Sergi Girona

During 1996 and 1997, CEPBA took part in the development of this tool and in its definition. Promenvir is a tool initially conceived for the statistical analysis of industrial-scale mechanical systems using commercial solvers.

Thanks to the tool's definition, what was originally aimed at mechanical systems has proven useful in any area that uses a piece of software (a solver) whose input parameters may not take fixed values but rather undergo small variations, which in turn affect the results of the system.

What Promenvir lets us study is how variations in the input parameters of the solver (simulator) affect its output. The resolution is carried out with Monte Carlo methods, launching different executions with different values; the system takes care of distributing the load while trying to reduce resource usage.
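The Monte Carlo scheme can be sketched as follows. The "solver" here is a made-up stand-in function (not part of Promenvir), and in the real tool each run would be an independent solver execution dispatched across the available machines:

```python
import random
import statistics

def solver(stiffness, load):
    # Hypothetical mechanical model: displacement of a spring under load.
    return load / stiffness

def monte_carlo(nominal, spread, runs=1000, seed=42):
    """Perturb each input around its nominal value and study the output."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        # Each input parameter takes a small random variation.
        params = {k: rng.gauss(v, spread[k]) for k, v in nominal.items()}
        results.append(solver(**params))
    return statistics.mean(results), statistics.stdev(results)

mean, stdev = monte_carlo({"stiffness": 100.0, "load": 5.0},
                          {"stiffness": 2.0, "load": 0.1})
```

The spread of the output (here, `stdev`) is exactly the kind of sensitivity information the tool reports back to the user.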

A short description of the tool's design will be given, along with the functionality it provides, and finally an example of its use.

It should be noted that Promenvir is a commercial tool already being used by several international companies, and that at UPC we can use it free of charge.


Optimizing Instruction Cache Performance for Commercial Applications

Alex Ramirez

Instruction fetch bandwidth is feared to be a major limiting factor to the performance of future wide-issue aggressive superscalars. Consequently, it is crucial to develop techniques to increase the number of useful instructions per cycle provided to the processor. Unfortunately, most of the past work in this area has largely focused on engineering workloads, rather than on the more challenging, badly-behaved popular commercial workloads.

In this talk, we focus on Database applications running Decision Support workloads. We characterize the locality patterns of database kernel code and find frequently executed paths. Using this information, we propose an algorithm to lay out the basic blocks of the database kernel for improved I-fetch. Finally, we evaluate the scheme via simulations.
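A greedy sketch of this kind of profile-guided basic-block layout (the actual algorithm in the talk may differ): starting from the entry block, follow the hottest not-yet-placed control-flow edge, so the frequently executed path becomes a sequential run in the instruction cache:

```python
def layout(entry, edge_counts):
    """Order basic blocks by chaining hot edges.

    edge_counts: {(src_block, dst_block): execution count} from profiling.
    """
    order, placed = [], set()
    block = entry
    while block is not None and block not in placed:
        order.append(block)
        placed.add(block)
        # Follow the hottest successor that has not been placed yet.
        succs = [(count, dst) for (src, dst), count in edge_counts.items()
                 if src == block and dst not in placed]
        block = max(succs)[1] if succs else None
    # Cold blocks left over are appended after the hot path.
    for src, dst in edge_counts:
        for b in (src, dst):
            if b not in placed:
                order.append(b)
                placed.add(b)
    return order
```

With a profile where A->B->D dominates, block C (the cold branch target) is pushed out of the hot sequence, so the frequent path falls through without taken branches.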

Our results show a miss reduction of 60-98% for realistic I-cache sizes and a doubling of the number of instructions executed between taken branches. As a consequence, we increase the fetch bandwidth provided by an aggressive sequential fetch unit from 5.8 instructions per cycle for the original code to 10.6 using our proposed layout. Our software scheme combines well with hardware schemes like a trace cache, providing up to 12.1 instructions per cycle and suggesting that commercial workloads may be amenable to the aggressive I-fetch of future superscalars.


A Quantitative Analysis of SPEC95

Agustin Fernandez

Everything you always wanted to know about SPEC95 but never dared to simulate.


Vector Microprocessors: A Case Study in VLSI Processor Design

Krste Asanovic

The seminar shows in detail the design process of a vector microprocessor, using public-domain tools as well as tools developed specifically for the purpose. The presentation is based on the author's experience designing the T0 microprocessor at the University of California, Berkeley.

T0 (Torrent-0) is a single-chip fixed-point vector microprocessor designed for multimedia, human-interface, neural network, and other digital signal processing tasks. T0 includes a MIPS-II compatible 32-bit integer RISC core, a 1KB instruction cache, a high performance fixed-point vector coprocessor, a 128-bit wide external memory interface, and a byte-serial host interface. Fabricated in a 1.0 micron CMOS process, the die measures 16.75mm square and contains 730,701 transistors. At the maximum clock frequency of 45MHz, T0 can simultaneously sustain 720 million arithmetic operations per second and 720MB/s of external memory bandwidth, with up to 30MB/s of DMA I/O. The first use of T0 is as the core of the SPERT-II workstation accelerator board.


Simplescalar, a Simulation Tool for Superscalar Architectures

Jose Gonzalez

This talk will give an introduction to the Simplescalar simulation tool, focusing on version 3.0. Each of the simulators that make up the tool will be explained, together with the purpose of each one. Tricks will also be covered for instrumenting programs "by hand" in a way similar to ATOM: obtaining a trace of loads and stores together with their effective addresses, a trace of conditional branches, how to obtain register values, and so on. Finally, the simulator of an out-of-order architecture will be explained in detail: how to get started, its strong points... and its weak points. The goal of this talk is to let those who want to use this tool avoid spending many hours understanding its basic operation.


Searching the Web: Challenges and Solutions

Ricardo Baeza

In this talk we describe the current characteristics of the Web and the problems involved in searching it. We then detail how AltaVista-style search engines work from both hardware and software points of view, as well as directories such as Yahoo! and the new Web query languages. We finish with new results that partially solve some problems, such as searching compressed text, approximate search, visual query languages, and visualization of answers.


Speculative Multithreaded Processors

Pedro Marcuello

We present a novel processor microarchitecture that relieves four of the most important bottlenecks of superscalar processors: the serialization imposed by true dependences, the instruction window size, the complexity of a wide issue machine and the instruction fetch bandwidth requirements. The new microarchitecture executes simultaneously multiple threads of control obtained from a single program by means of control speculation techniques that do not require any compiler/user support. In this way, it works on a large instruction window composed of multiple nonadjacent small windows. Multiple simultaneous threads execute different iterations of the same loop, which requires the same fetch bandwidth as a single thread since they share the same code. Dependences among different threads as well as the values that flow through them are speculated by means of data prediction techniques. The novel processor organization does not require any special feature in the instruction set architecture; its novel features are completely based on hardware mechanisms. The architecture is scalable in the sense that it consists of a number of processing units with separate hardware for issuing and executing instructions. The preliminary evaluation results show the potential of the new architecture to achieve a high IPC rate. For instance, a processor with 4 four-issue processing units achieves an IPC from 2.2 to 9.9 for the Spec95 benchmarks.


A Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors

Venkata Krishnan

Multiprocessor system evaluation has traditionally been based on direct-execution based Execution-Driven Simulations (EDS). In such environments, the processor component of the system is not fully modeled. With wide-issue superscalar processors being the norm in today's multiprocessor nodes, there is an urgent need for modeling the processor accurately. However, using direct-execution to model a superscalar processor has been considered an open problem. Hence, current approaches model the processor by interpreting the application executable. Unfortunately, this approach can be slow. In this paper, we propose a novel direct-execution framework that allows accurate simulation of wide-issue superscalar processors without the need for code interpretation. This is achieved with the aid of an {\em Interface Window} between the front-end and the architectural simulator, that buffers the necessary information. This eliminates the need for full-fledged instruction emulation. Overall, this approach enables detailed yet fast EDS of superscalar processors. Finally, we evaluate the framework and show good performance for uni- and multi-processor configurations.


The Ictineo Tool: Navigating between High Level and Low Level

Cristina Barrado

Within the High Performance Computing line, numerous research efforts are under way aimed at improving the efficiency of program execution, either by extracting the maximum performance from existing computers or by proposing new conceptions of their architecture. For many of these efforts, the final phases of compilation are a crucial point on the path to follow.

This seminar presents the status of a project born four years ago with the goal of supporting the work carried out in several of the department's theses. The project was christened Ictíneo, and it currently supports several more theses within the department as well as a European project, and it is the basis of several teaching projects. It is also being used outside this department, in other universities and research centers.


Swap compression: a way to increase application performance

Toni Cortes

The performance of out-of-core applications tends to be poor due to the high overhead added by the swapping mechanism. The same problem may be found in highly loaded multiprogramming systems, where all running applications have to use the swap space in order to execute at the same time. Furthermore, such out-of-core applications might not be able to run on laptop computers, as their resources are usually smaller than those found in a desktop system. In this paper, we present a solution to both problems. The idea consists of compressing the swapped pages and keeping them in a swap cache whenever possible. Although this is not the first time this solution has been proposed, we claim that our system is the first one to achieve significant performance gains. This happens even in cases where our predecessors ended up slowing down the applications. All the results presented in this paper have been obtained by running real applications on a modified system.
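The core idea can be sketched in a few lines: pages swapped out are compressed and, when they fit within the cache budget, kept in RAM so a later page-in avoids the disk. Names and sizes below are illustrative, and the "disk" is a plain dictionary standing in for the real swap device:

```python
import zlib

class CompressedSwapCache:
    """Toy model of a compressed in-memory swap cache with a disk fallback."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.cache = {}   # page id -> compressed bytes held in RAM
        self.disk = {}    # fallback "swap device"

    def swap_out(self, page_id, data):
        blob = zlib.compress(data)
        if self.used + len(blob) <= self.budget:
            self.cache[page_id] = blob       # keep the page in RAM, compressed
            self.used += len(blob)
        else:
            self.disk[page_id] = data        # cache full: pay the disk cost

    def swap_in(self, page_id):
        if page_id in self.cache:            # fast path: decompress from RAM
            blob = self.cache.pop(page_id)
            self.used -= len(blob)
            return zlib.decompress(blob)
        return self.disk.pop(page_id)        # slow path: read from "disk"
```

The win comes from the fast path: decompressing a page from RAM is far cheaper than a disk access, so even a modest cache hit rate can outweigh the compression cost.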


Instruction Set Architectures

Trevor Mudge

This talk is in two parts. First, I will provide a brief sketch of my research history and past work on computer architecture at the University of Michigan. Second, I will present a talk on instruction set architectures (ISAs), tracing the development of modern ISAs and their implementations. I will argue that the details of an ISA are not as important as many computer architects maintain, provided some essential ingredients are included. The overwhelming consideration for the near future is binary compatibility in the case of desktop systems, although this is not the case for embedded systems. With this as a constraint, I will suggest how future systems might develop.


Memories

Trevor Mudge

In this talk I make the point that memory systems are the limiting factor in today's computer systems. I will then examine three levels of the memory hierarchy: registers, caches, and main memory. At each level I will summarize the current state of research and suggest some new directions that may help to close the "memory gap."


Making Instruction Sets Irrelevant
and
Improving Performance in the Process

Jim E. Smith

The current hardware/software interface, i.e., the instruction set architecture (ISA), is very confining, and there are currently a number of approaches that attempt to bypass it, ranging from fine-grain CISC-to-RISC pipeline translation to coarse-grain binary translation. We propose extending this approach by co-designing a layer of software along with a hardware microarchitecture. All conventional software is then supported by a virtual ISA that sits on top of the co-designed virtual machine. Unlike conventional virtual machines targeted at platform independence, however, performance is our foremost goal, and the native ISA can be non-standard. If virtual machine co-design is used, a number of barriers to hardware advances come down. The cooperation between virtual machine software and hardware can provide a means for managing complexity while fulfilling the promise of proposed performance features.


Design and Evaluation of Architectures for Commercial Applications

Luiz Andre Barroso

An important new design target for high-performance computers is commercial workloads, including various database applications and Web servers. These workloads constitute the largest market segment for mid-range and high-end machines, and their dominance is expected to increase. However, these applications have not received due attention from the research community, given their size and complexity and some restrictions on publishing results using them. In this seminar I will describe some of the most relevant commercial application benchmarks, introduce a variety of tools and methods that are useful in studying them, and present a series of case studies that illustrate how these workloads are likely to shape the architecture of next-generation systems.


The Potential of Data Value Speculation to Boost ILP

Jose Gonzalez

Instruction Level Parallelism (ILP) is one of the key issues to boost the performance of future generation processors. Data dependences may become one of the main bottlenecks that limit the performance of such future architectures due to the serialization that they introduce in the execution of programs. Data value speculation is a novel mechanism that may avoid such serialization by predicting the values that flow among dependent instructions. This work presents a study of the limits of the performance potential of data value speculation to boost ILP. Both realistic and probabilistic value predictors are considered in this paper. The interaction with some other critical features of the microarchitecture is researched: instruction window size, branch prediction accuracy and instruction latency. Results show that data value speculation has a significant potential to boost ILP. However, for a relatively small instruction window, like the one that can currently be built with the best branch predictors, the achieved speedup is moderate. The speedup provided by data value speculation is also dependent on the criticality of the correctly predicted instructions. Finally, it is shown that the speedup that data value speculation may achieve in superscalar processors, in the way that it has been used so far, can be approximated by a linear function of the prediction accuracy.


An Area and Time Model for Multiported Register Files and Queue Files

Josep Llosa

In this presentation, an analytic model of the area and access time of multiported register files and queue files will be described. The inputs of the model are: the number of registers/queues, the width of the registers, the number of read and write ports, and the number of elements per queue. Another input used by the model is the set of technological parameters of the process.
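To give a flavor of what such a model captures, here is a generic textbook-style approximation (not the model presented in the talk): each port adds roughly one wordline and one bitline to every storage cell, so cell area, and hence total array area, grows quadratically with the port count.

```python
def regfile_area(num_regs, bits_per_reg, read_ports, write_ports,
                 wire_pitch=1.0):
    """Rough register-file array area in arbitrary units.

    Assumes each port contributes one wordline (horizontal) and one
    bitline (vertical) per cell, so a cell is ports x ports wire pitches.
    """
    ports = read_ports + write_ports
    cell_area = (wire_pitch * ports) ** 2
    return num_regs * bits_per_reg * cell_area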


The Synergy of Multithreading and Access/Execute Decoupling

Joan Manuel Parcerisa

This work presents and evaluates a novel processor microarchitecture which combines two paradigms: access/execute decoupling and simultaneous multithreading. We investigate how both techniques complement each other: while decoupling features excellent memory-latency-hiding efficiency, multithreading supplies the in-order issue stage with enough ILP to hide the functional unit latencies. Its partitioned layout, together with its in-order issue policy, makes it potentially less complex, in terms of critical path delays, than a centralized out-of-order design, and thus better able to support future growth in issue width and clock speed.

The simulations show that by adding decoupling to a multithreaded architecture, its miss latency tolerance is sharply increased and in addition, it needs fewer threads to achieve maximum throughput, especially for a large miss latency. Fewer threads result in a hardware complexity reduction and lower demands on the memory system, which becomes a critical resource for large miss latencies, since bandwidth may become a bottleneck.


Proyecto MHAOTEU: Memory Hierarchy Analysis and Optimization Tools for the End-User

Antonio Gonzalez

Memory latency is one of the main reasons for performance degradation of current computers. This problem is exacerbated by the fact that the relative memory latency increases by 52% per year. On the other hand, there is a lack of tools to help programmers to improve their applications from a memory perspective. We can therefore see a growing demand for solutions to this problem.

This project aims at developing a set of tools that will help program developers to tune their applications for a better use of the memory hierarchy. These tools will target both sequential and parallel machines.

The set of tools will help programmers to analyse the performance of their codes from a memory perspective, and will allow them to transform their programs in order to reduce the memory penalties.

The tools will be interactive and use both static and dynamic information since fully automatic transformations based on a static analysis of the program are rather limited in their performance.


Hardware and Software tools for Parallel Processing

Skevos Evripidou

The presentation will concentrate on two of his projects: (1) the Thread Synchronization Unit (TSU), a building block for multithreaded architectures built from conventional microprocessors, and (2) Net-dbx, a Java-powered tool for interactive debugging of MPI programs across the Internet (together with Net-Console, a related Java-powered tool).

The TSU is a hardware-based mechanism which provides data-driven synchronization, based on the D3-model, for conventional microprocessors in a large-scale network of workstations. During program execution, the TSU schedules each thread based on data availability. Scheduling based on data availability provides tolerance to the long memory and communication latencies inherent in large-scale multiprocessors, thus making the proposed architecture truly scalable and easily programmed. The TSU also provides significant performance improvements by independently managing the flow of data to a processing node's cache based on data availability. An overall development goal is to eliminate cache misses in each of the processing nodes through the use of the TSU design. The TSU is being built from high-speed, high-density, high-I/O-count field-programmable gate arrays (FPGAs). The TSU board is a plug-in board which fits in the L2 cache connector provided on Pentium motherboards.

Net-dbx is a tool that utilizes Java and other World Wide Web tools for the debugging of MPI programs from anywhere in the Internet. Net-dbx is a source-level interactive debugger with the full power of gdb (the GNU Debugger) augmented with the debug functionality of the public-domain MPI implementation environments. The main effort was on a low overhead, yet powerful, graphical interface supported by low-bandwidth connections. The portability of the tool is of great importance as well, because it enables the tool to be used on heterogeneous nodes that participate in an MPI multicomputer. The user of our system simply points his browser to the Net-dbx page, logs in to the destination system, and starts debugging by interacting with the tool, just as with any GUI environment. The user can dynamically select which MPI processes to view/debug. A special WWW-based environment has been designed and implemented to host the system prototype.


Command Vector Memory Systems: High Performance at Low Cost

Jesús Corbal

The focus of this paper is on designing a low-cost, high-performance, high-bandwidth vector memory system that takes advantage of modern commodity SDRAM memory chips. To successfully extract the full bandwidth from SDRAM parts, we propose a new memory system organization based on sending commands to the memory system as opposed to sending individual addresses. A command specifies, in a few bytes, a request for multiple independent memory words. A command is similar to a burst found in DRAM memories, but does not require the memory words to be consecutive. The command is sent to all sections of the memory array simultaneously, thus not requiring a crossbar in the proper sense. Our simulations show that this command-based memory system can improve performance over a traditional SDRAM-based memory system by factors that range between 1.15 and 1.54. Moreover, in many cases, the command memory system outperforms even the best SRAM memory system under consideration. Overall, the command-based memory system achieves similar or better results than a 10ns SRAM memory system while (a) using fewer banks and (b) using memory devices that are between 15 and 60 times cheaper.


Delivering Instruction Bandwidth using a Trace Cache

Sanjay Jeram Patel

One of the necessary elements for a high performance processor is a high bandwidth instruction fetch mechanism. Processors capable of executing four instructions each cycle are being built today; machines capable of executing up to sixteen are being investigated for future generations. To take advantage of such wide execution bandwidth requires equally wide fetch capability. Conventional instruction fetch mechanisms fall short at wide fetch because of the inability to easily deal with branches. The trace cache is a mechanism that can deliver many instructions per cycle, more than a single basic block, by caching segments of the dynamic instruction stream. In the trace cache, logically contiguous instructions are placed in physically contiguous storage.
In this talk, I will describe our work on developing the trace cache into an effective means of delivering instructions to a 16-wide superscalar processor. Partial Matching, Inactive Issue, Branch Promotion, and Trace Packing increase the effectiveness of the cache by systematically removing the limitations that inhibit its performance. Partial Matching and Inactive Issue add flexibility to the trace cache by allowing the branch predictor to select only a portion of a cache line for issue. Branch Promotion is a means of dynamically converting conditional branches that highly favor one outcome into unconditional branches with a faulting semantic. Trace Packing explores options that construct trace cache lines more effectively. I will also present our analysis of the trace cache, showing the trace cache's effect on branch resolution time and a measurement of instruction duplication within the cache.
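The core trace-cache mechanism can be sketched as a toy model (invented parameters; it stores only instruction addresses and ignores branch directions, partial matching, and the other refinements discussed in the talk): segments of the dynamic instruction stream, bounded by an instruction limit and a branch limit, are stored contiguously and indexed by their starting address.

```python
class TraceCache:
    """Toy trace cache: contiguous storage of dynamic instruction segments."""

    def __init__(self, max_insts=16, max_branches=3):
        self.max_insts = max_insts
        self.max_branches = max_branches
        self.lines = {}   # start address -> list of instruction addresses

    def fill(self, dynamic_stream):
        """dynamic_stream: list of (address, is_branch) in execution order."""
        i = 0
        while i < len(dynamic_stream):
            start = dynamic_stream[i][0]
            trace, branches = [], 0
            # A line ends at the instruction limit or the branch limit.
            while i < len(dynamic_stream) and len(trace) < self.max_insts:
                addr, is_branch = dynamic_stream[i]
                trace.append(addr)
                i += 1
                if is_branch:
                    branches += 1
                    if branches == self.max_branches:
                        break
            self.lines[start] = trace

    def fetch(self, start_addr):
        # One access delivers a whole trace, possibly spanning several
        # basic blocks (i.e., crossing more than one branch).
        return self.lines.get(start_addr)
```

A single `fetch` can return instructions from several basic blocks, which is precisely how the trace cache exceeds the one-basic-block-per-cycle limit of a conventional instruction cache.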


Smart Register Files

Trevor N. Mudge

This talk examines how the compiler can more efficiently use a large number of processor registers. The placement of data items into registers, called register allocation, is an important compiler optimization for high-speed computers because registers are the fastest storage devices in the computer system. However, register allocation has been limited in scope because of aliasing in the memory system. To break this limitation and allow more data to be placed into registers, new compiler and microarchitecture support is needed.
We propose the modification of register access semantics to include an indirect access mode. We call this optimization the Smart Register File. The smart register file allows the relaxation of overly-conservative assumptions in the compiler by having the hardware provide support for aliased data items in processor registers. As a result, the compiler can allocate data from a larger pool of candidates than in a conventional system. An attendant advantage is that the smart register file reduces the number of load and store operations executed by the processor. The simple addition of an indirect register access mode not only simplifies alias handling, but also provides opportunities for other optimizations. This talk examines several such optimizations.


A multithreaded runtime environment with thread migration for data-parallel compilers

Christian Perez

This talk studies the benefits of compiling data-parallel languages onto a multithreaded runtime environment providing dynamic thread migration facility. Each abstract process is mapped onto a thread, so that dynamic load balancing can be achieved by migrating threads among the processing nodes. The talk describes and evaluates an implementation of this idea in the Adaptor HPF and the UNH C* data-parallel compilers. It shows that no deep modifications of the compilers are needed, and that the overhead of managing threads can be kept small. The end is dedicated to present some preliminary experiments.


Architecture Alternatives for Single System Image Clusters

Rajkumar Buyya

The availability of high-speed networks and increasingly powerful commodity microprocessors is making the use of clusters, or networks, of computers an appealing vehicle for cost-effective parallel/dependable computing. Clusters, built using commodity off-the-shelf (COTS) hardware components as well as free, or commonly used, software, are playing a major role in redefining the concept of supercomputing. In this talk, we focus on the issues involved in making loosely coupled networked computers transparently appear and work as a single system. This property of a cluster is popularly called a "single system image", SSI for short.

A single-system-image property of clusters can be created by software and/or hardware. In this talk, we focus on techniques for achieving a single system image by software means: cluster underware (i.e., at the OS kernel level), cluster middleware (i.e., a layer between applications and operating systems), and the application level. We also discuss the use of these techniques in research and commercial software systems (such as MOSIX, UnixWare, PARMON, Nimrod, etc.) supporting a single system image. We bring out the merits and demerits of these SSI techniques and conclude the talk by highlighting possible directions in which cluster computing research and the market are heading.


The Mass Customization of Instruction-set Architectures

Joseph A. Fisher

Hewlett-Packard Laboratories

Cambridge, MA

Not so long ago, almost every computer manufacturer supported its own unique and proprietary computer architecture; some supported more than one. But now, due to the pressure of market efficiencies, the general-purpose computing world is standardizing on a very small number of processor architectures. Even in the embedded computing world, where the specialization of CPUs seems particularly attractive, the same pressures apply and architectures are becoming much more standardized. Thus it seems clear that computer architecture is a less attractive field for researchers than it once was, and that we will see little variety in future architectures.

In this talk I argue that we will soon see the trend going in the opposite direction: we will see far greater specialization of architectures to specific uses. I believe that the performance benefits are potentially great, especially in embedded processors. For both embedded and general-purpose processors, I'll discuss what the barriers to customization are, and the emerging technologies that break those barriers.


Upcoming Architectural Advances in Distributed Shared Memory Machines and Their Impact on Programmability

Josep Torrellas
Computer Science Department
University of Illinois at Urbana-Champaign

Major advances in the architecture of Distributed Shared-Memory (DSM) multiprocessors are in the offing. The major drive behind them is the increasing integration of transistors on a chip. In the first part of this talk, we will discuss some of the observed trends. One trend is progressive system integration: increasingly integrating the processor and main memory, using either a single package with several processor and memory chips, or chips that combine both processor and memory. Alternatively, we can integrate multiple compute engines on a chip. Another trend is to use the extra on-chip transistors for more sophisticated designs. One such approach is to devote many processor resources to predicting future events and, based on the prediction, speculatively executing future activity now. Overall, all these advances promise large performance improvements. However, because they enhance the memory hierarchy in non-trivial ways, truly exploiting them will often require even more effort in application performance tuning. In the second part of the talk, we will discuss some of these problems.

The paper corresponding to this talk is available from the website.


Compiler-Directed Elimination of Dynamic Computation Redundancy

Daniel A. Connors
University of Illinois, Urbana-Champaign

One of the major challenges to increasing processor performance is overcoming the fundamental dataflow limitation imposed by data dependences. By reusing previous computation results, the dataflow limit can be surpassed for sequences of operations that are otherwise redundantly executed. Although traditional compiler techniques eliminate program redundancy, these optimization techniques rely on the detection of static redundancy. Static redundancy requires a compile-time guarantee that the computations are redundant for all executions. As such, compiler techniques have no mechanism for capturing dynamic redundancy, redundancy that occurs over a temporal set of definitions. As a result, several empirical studies indicate that significant amounts of redundancy, or value locality, still exist in optimized programs. This talk will describe an integrated compiler and architecture approach to exploit value locality for large regions of code. In this approach, the compiler performs analysis to identify code regions whose computation can be reused during dynamic execution. The instruction set architecture provides a simple interface for the compiler to communicate the scope of each reuse region and its live-out register information to the hardware. At run time, the execution results of these reusable computation regions are recorded into hardware buffers for potential reuse. Each reuse can eliminate the execution of a large number of dynamic instructions. The rationale and initial results of the proposed approach to improve performance and exploit instruction repetition will be presented.




That's all folks!!!!!