Seminar 2006-07 abstracts

Seminar 06/07 - Abstracts

Techniques for Reduction of Misspeculation Penalties in Transactional Memory Systems
Per Stenström
Department of Computer Science and Engineering, Chalmers University of Technology

Transactional memory promises to reduce the programming efforts for multi-core processors by avoiding the serialization imposed by coarse-grain locking using critical sections. However, transactional memory schemes that rely on lazy conflict resolution squash one of the transactions involved in a data race and forces it to re-execute from the very beginning which for long-running transactions can waste lots of useful execution. Moreover, such schemes are prone to starvation which may cause repeated misspeculations to severely prolong the execution time of parallel programs.
In this talk, I will present a number of techniques that can help reducing the penalty associated with misspeculations in transactional memory systems using lazy conflict resolution. One of the techniques records what locations are involved in data races and uses this information in future transaction invocations to insert checkpoints to avoid re-executing the transaction from the very beginning. Another technique records the number of misspeculations that each transaction has encountered, where a high count is a symptom of a potentially starved transaction. By giving such transactions priority to commit, I show that starvation can be avoided. Experimental results demonstrate that proposed techniques can improve the performance of parallel programs designed for transactional memory systems significantly.

Graphic processors to speed-up simulations for the design of high performance solar receptors
Marc Daumas
CNRS, France

Graphics Processing Units (GPU) are now powerful and flexible systems adapted and used for other purposes than graphics calculations (General Purpose computation on GPU - GPGPU). We present here a prototype to be integrated into simulation codes that estimate temperatures, velocities and pressure to design next generations of solar receivers. Such codes will delegate to our contribution on GPUs the accurate computation of heat transfers due to radiations. We use Monte-Carlo line-by-line ray-tracing through finite volumes. That means data-parallel arithmetic transformations on large data structures. Our prototype is inspired on the source code of GPUBench. Our performances on two recent graphics cards (Nvidia 7800GTX and ATI RX1800XL) show some speed-up higher than 400 compared to CPU implementations leaving most of CPU computing resources available.

Utility-Oriented Grid Computing and the Gridbus Middleware
Rajkumar Buyya
Grid Computing and Distributed Systems Laboratory, The University of Melbourne, Australia

Grid computing, one of the latest buzzwords in the ICT industry, is emerging as a new paradigm for Internet-based parallel and distributing computing. It enables the sharing, selection, and aggregation of geographically distributed autonomous resources, such as computers (PCs, servers, clusters, supercomputers), databases, and scientific instruments, for solving large-scale problems in science, engineering, and commerce. It leverages existing IT infrastructure to optimize compute resources and manage data and computing workloads. The developers of Grids and Grid applications need to address numerous challenges: security, heterogeneity, dynamicity, scalability, reliability, service creation and pricing, resource discovery, resource management, application decomposition and service composition, and qualify of services. A number of projects around the world are developing technologies that help address one or more of these challenges. To address some these challenges, the Gridbus Project at the University of Melbourne has developed grid middleware technologies that support rapid creation and deployment of eScience and eBusiness applications on enterprise and global Grids.
In this seminar, we present technological evolution and key challenges in building and managing Utility Grids. We place emphasis on fundamental challenges of Grid economy, how to design and develop Grid technologies and applications capable of dynamically leasing services of distributed resources at runtime depending on their availability, capability, performance, cost, and users' quality of service requirements. We then introduce Gridbus Project R&D efforts with focus on distributed computational economy for effective management of resources. We briefly present various components of the Gridbus Toolkit and then discuss, in detail, the Gridbus service broker that supports composition and deployment of applications on utility Grids. Case studies on the use of Gridbus middleware in creation of various Grid applications (such as distributed molecular docking, high energy physics, and natural language processing) and their deployment on national and international Grids will also be highlighted.

Temperature-Aware Techniques
Israel Koren
Department of Electrical and Computer Engineering University of Massachusetts, Amherst

Power density of microprocessors is increasing with every new process generation resulting in higher maximum chip temperatures. High chip temperatures greatly affects its reliability, raises leakage power and makes cooling more expensive. Temperature projection at the design stage and during the lifetme of the chip are necessary for development of temperature aware techniques to control chip temperature. In this talk, we first present a new transient thermal simulation algorithm, TILTS, which is much faster than conventional simulation algorithms. Based on TILTS, we developed a lightweight runtime temperature monitoring tool, called Temptor, that uses internal performance counters to estimate runtime chip temperature distributions with negligible overhead and good precision. Then, we present a temperature-aware floorplanning algorithm that is capable of reducing the maximum chip block's temperature by modifying the placement of blocks. The floorplans of the Alpha and Pentium Pro microprocessors are used to illustrate the benefits of the floorplanning algorithm.

Reconfigurable computing, a new supercomputing paradigm
Walid A. Najjar
Department of Computer Science and Engineering at the University of California Riverside

The ROCCC (Riverside Optimizing Configurable Computing Compiler) is an optimizing C-to-VHDL compiler used to compile routines written in a subset of C to an application-specific circuit on an FPGA. ROCCC incorporates several powerful parallelizing transformations targeted towards code generation for FPGAs and can achieve performance comparable to hand-coded VHDL. We have demonstrated speedups ranging from 800x to several 1,000x over the Itanium 2 1.6 GHz on an SGI Altix 4700 with one RASC RC 100 blade.

Parallel Execution Models for Future Multicore Architectures
Guri Sohi
Computer Sciences Department, University of Wisconsin-Madison

With uniprocessor performance increases leveling off, and with the semiconductor industry moving towards multicore processors, novel ways of parallelizing the execution of a variety of computing applications will be needed. To be viable, a parallel execution model for future multicore architectures should not only mesh well with the programming styles that we expect in the future, but perhaps even leverage the characteristics of such programming styles.
This talk will review proposed parallel execution models and present Program Demultiplexing (PD), an execution model that creates concurrency in sequential programs by "demultiplexing" methods (functions or subroutines). Call sites of a demultiplexed method in the program are associated with handlers that allow the method to be separated from the sequential program and executed on an auxillary processor. The demultiplexed execution of a method (and its handler) is speculative and occurs when the inputs of the method are (speculatively) available, which is typically far in advance of when the method is actually called in the sequential execution. A trigger, composed of predicates that are based on program counters and memory write addresses, launches the (speculative) execution of the method on another processor.
Results from our initial experience with a simulation model for PD will be presented. For eight integer benchmarks from the SPEC2000 suite, programs written in C with no explicit concurrency and/or motivation to create concurrency, we achieve a harmonic mean speedup of 1.8x on four processors. We believe that PD can achieve further speedup when opportunities for concurrency are exposed to programmers and/or on applications that use modern object-oriented languages.
Given time, the talk will touch on some other hardware trends that are likely to have implications for how software is written in the future.

Reducing systems management complexity using server virtualization technology
Malgorzata Steinder
IBM T. J. Watson Research Center

Server virtualization opens up a range of new possibilities for datacenter management, through the availability of new automation mechanisms that can be exploited to manage software within virtual machines. Server virtualization permits innovative ways of software distribution inside pre-installed, pre-wired, and pre-configured freeze-dried disk images. This allows simplifying the process of software installation. Server virtualization also enables powerful and flexible autonomic control, through management software that maintains the system in a desired state in the face of changing workload and demand. In this talk, I will discuss a project that aims at simplifying systems management with the use of server virtualization technologies. I will introduce the concept of Virtual Software Appliance and discuss tools we build to manage the entire life-cycle of a Virtual Software Appliance.

Efficient Power Behavior Characterization
Daniel A. Jiménez
Department of Computer Science, Rutgers University

Fine-grained program power behavior is useful in both evaluating power optimizations and observing power optimization opportunities. Detailed power simulation is time consuming and often inaccurate. Physical power measurement is faster and objective. However, fine-grained measurement generates enormous amounts of data in which locating important features is difficult, while coarse-grained measurement sacrifices important detail.
We present a program power behavior characterization infrastructure that identifies program phases, selects a representative interval of execution for each phase, and instruments the program to enable precise power measurement of these intervals to get their time-dependent power behavior.
We show that the representative intervals accurately model the fine-grained time-dependent behavior of the program. They also accurately estimate the total energy of a program. Our compiler infrastructure allows for easy mapping between a measurement result and its corresponding source code. We improve the accuracy of our technique over previous work by using edge vectors, i.e., counts of traversals of control-flow edges, instead of basic block vectors, as well as incorporating event counters into our phase classification.
We validate our infrastructure through the physical power measurement of 10 SPEC CPU 2000 integer benchmarks on an Intel Pentium 4 system. We show that using edge vectors reduces the error of estimating total program energy by 35% over using basic block vectors, and using edge vectors plus event counters reduces the error of estimating the fine-grained time-dependent power profile by 22% over using basic block vectors.

Fusión Automática de Ontologías Usando Propiedades Semánticas
Adolfo Guzmán-Arenas
Centro de Investigación en Computación, Instituto Politécnico Nacional, México

El conocimiento de un ser humano se va acumulando conforme a lo que sucede en su entorno, las fuentes de información tienen un papel importante en este proceso; no se arranca de cero, inclusive un animal nace con conocimiento previo. Su conocimiento aumenta al agregar nuevos conceptos o asociarlos a los ya posee. Aunque existe información del exterior que puede contradecir o confundir a un ser humano, éste cuenta con las herramientas que le permite resolverlo de alguna manera. A éste cúmulo de información se le puede llamar su ontología.
Las ontologías también se pueden estructurar y definir en las computadoras. Este trabajo se centra en la unión de ontologías entre computadoras, durante esta unión pueden suceder los mismos casos que en una persona; la diferencia es que las máquinas carecen de sentido común y los desafíos son hacer la fusión de manera automática, que no se detenga ante los problemas (redundancias, distinto nivel de descripción, sinónimos, homónimos…) que se presenten y que el resultado sea lo más cercano a la fusión natural de conocimiento del ser humano.
Hay actualmente trabajos que realizan la unión de ontologías, mas lo hacen de manera semiautomática (un editor de ontologías que ofrece sugerencias al usuario). También, en principio, es posible escribir algoritmos que unan ontologías representadas en lenguajes formales (lógicos). Empero, las ontologías deben ser consistentes para poder ser representadas. Desafortunadamente, las ontologías reales son inconsistentes e incompletas.
Este trabajo presenta un proceso de unión de ontologías automático y robusto. Automático porque la computadora detecta y resuelve los problemas que se presentan durante el proceso de la unión, y robusto porque realiza la unión pese a que las ontologías son mutuamente inconsistentes o representan la información desde distintos ángulos, con distinto detalle, con homónimos, usando palabras sin definir... Se demuestra la eficiencia del algoritmo de fusión a través de varios ejemplos reales con documentos obtenidos de Internet cuyas ontologías se construyeron manualmente y se fusionaron de manera automática. Los resultados tuvieron un ligero margen de error en comparación con la fusión manual de un usuario experto en el tema del documento.

Más información.
Artículo
Tesis de doctorado

Network on a Chip (NoC): an alternative approach for the design of next generation System on a Chip (SoC)
Nader Bagherzadeh
Departament of Electronical Engineering and Computer Science, University of California Irvine

In this talk I will give an overview of NoC architectural concepts and then discuss our current research for modeling and improving all aspects of an NoC architecture. Our emphasize is to find an efficient replacement for bus based architectures for very large number of IPs for an SoC design where scalability of the design prohibits current approaches to be practical, because of the limitations of wire delay and other factors related to 65 nm technology and beyond.

Time for Change for Linux: from Metronomes to Alarm Clocks
Daniel P. Bovet and Marco Cesati
University of Rome "Tor Vergata"

The Linux timekeeping architecture is radically changing. In this talk we introduce briefly the timekeeping hardware devices and we show how two different versions of Linux 2.6, namely the current 2.6.20 and the forthcoming 2.6.21, are exploiting them. Furthermore, we present our personal ideas on how the Linux timekeeping architecture might evolve in a near future.

Virtual Private Caches
Jim Smith
Department of Electrical and Computer Engineering, University of Wisconsin- Madison

Virtual Private Machines (VPM) provide a framework for implementing Quality of Service (QoS) policies in CMP-based computer systems. VPMs incorporate microarchitecture mecha- nisms that allow shares of hardware resources to be allo- cated to concurrently executing threads. VPMs can thereby provide applications with an upper bound on execution time regardless of other thread activity. Virtual Private Caches (VPCs) are an important element of VPMs, and VPC hardware consists of two major components: a VPC Arbiter, which man- ages shared resource bandwidth, and a VPC Capacity Manager. Both the VPC Arbiter and VPC Capacity Manager provide mini- mum service guarantees that, when combined, achieve QoS within the cache. Simulation-based evaluation shows that conventional cache sharing policies allow concurrently executing threads to affect each other significantly. In contrast, VPCs meet their QoS performance objectives and have a fairness policy for distributing excess resources that is amenable to gen- eral purpose multi-threaded systems. On a CMP running het- erogeneous workloads, VPCs improve throughput by eliminating negative interference, i.e., VPCs improve average workload performance by 14% and the performance of the slowest pro- gram in a workload is improved by an average of 25%.

Cache Architecture and Mapping for Packet Forwarding Applications
Govind Ramaswamy
Supercomputer Education and Research Centre Indian Institute of Science, Bangalore (INDIA)

Packet forwarding is a memory-intensive application requiring multiple accesses through a trie structure. With the requirement to process packets at line rates high performance routers need to forward millions of packets every second. Even with an efficient lookup algorithm like the LC-trie, each packet needs up to 5 memory accesses. Earlier work shows that a single cache for the nodes of an LC-trie can reduce the number of external memory accesses. We propose a Heterogeneous Segmented Cache Architecture (HSCA) for packet forwarding application that exploits the significantly different locality characteristics of accesses to level-one and the lower-level nodes of an LC-trie. Further, we improve the hit rate of the cache for level-one nodes, which are accessed more than 80% of the time, by introducing a novel two-level mapping based cache placement scheme to reduce conflict misses. Performance results reveal that our base HSCA scheme reduces the number of misses incurred with a unified scheme by as much as 25%, leading to a 32% speedup in average memory access time. Two-level mapping further enhances the performance of the base HSCA by up to 13% leading to an overall improvement of up to 40% over the unified scheme.

Why NOT symmetric Chip Multiprocessing
Uri Weiser
Commex Technologies

Environment: Technology;
-Process technology roadmap is changing: Min Vcc does not scale anymore.
-Due to business reasons, products die size will stay constant while moving from process N to process N+1.
- The above technology trends will lead to a 1.5X increase in product power (assuming usage of all die functions) when moving from process N to process N+1. Using Dynamic Voltage Scaling (DVS) reached its limit already.
- The implications of power constrains ARE: Not all die functions can be active simultaneously.

Environment: Media applications
-Media (and off course graphics) will become a major thrust in future PC business
--The environment of performance hungry media applications
--The GainBandwidth Product and its Performance/Power implications

The environment implications:
-Homogenous/Symetric (e.g. CMP) cores will provide only a temporary solution to the power problem.
-Application specific heterogynous functions are highly power efficient.
- Non-simultaneously operation of these Heterogeneous functions can be used to overcome the power problem.

Conclusions

An Overview of High Performance Computing and Challenges for the Future
Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory

In this talk we examine how high performance computing has changed over the last 10-year and look toward the future in terms of trends. These changes have had and will continue to have a major impact on our software. A new generation of software libraries and algorithms are needed for the effective and reliable use of (wide area) dynamic, distributed and parallel environments. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile--time and run--time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run--time environment variability will make these problems much harder. We will focus on three themes: - Automatically tuned linear algebra software - Redesign of software to fit multicore architectures - The importance of fault tolerance

Rapidly Selecting Good Compiler Optimizations using Performance Counters
John Cavazos
Institute of Computing Systems Architecture (ISCA), School of Informatics, Edinburgh University

Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is non-trivial. There have been several proposed techniques that search the space of compiler options to find good solutions; however such approaches can be expensive. This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings. This is achieved by learning a model off-line which can then be used to determine good settings for any new program. We show that such an approach outperforms the state-of-the-art and is two orders of magnitude faster on average. Furthermore, we show that our performance counter-based approach outperforms techniques based on static code features. Using our technique we achieve a 17% improvement over the highest optimization setting of the commercial PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite on a recent AMD Athlon 64 3700+ platform.

Continuous run-time adaptation and optimization of statically compiled programs
Grigori Fursin
INRIA Futurs, France

In this talk I will present our on-going work on simple and practical automatic run-time adaptation technique for statically compiled programs with varying context and behavior. Unlike other complex dynamic recompilation frameworks, our technique relies on function/loop versioning and static low-overhead monitoring and adaptation routines. We extend our previous work on adaption techniques for regular numerical codes with stable phases (presented at HiPEAC 2005) to codes with any (irregular) behavior by using time-slot run-time performance monitoring and statistical selection of appropriate versions. We are currently implementing this technique in the gcc compiler and plan to make it publicly available. We use it for continuous program optimizations, for speeding up iterative optimizations and for auto-tuning of libraries. We also plan to use this technique for OS kernel adaptation and for program auto-parallelization.

Exploiting Multiple Cores Now: Scalability and Reliability for Off-the-shelf Software
Emery Berger
Department of Computer Science. University of Massachusetts Amherst

Multiple core CPUs are here today. The prevailing notion is that we need to rewrite applications not originally designed to support multiple processors to make them multithreaded. Because of the difficulty of programming correct and efficient multithreaded applications (e.g., race conditions, deadlocks, and scalability bottlenecks), this is a major challenge. This talk presents two alternative approaches that bring the power of multiple cores to today's software. The first approach focuses on building highly-concurrent client-server applications from legacy code. I present a system called Flux that allows users to take unmodified off-the-shelf *sequential* C and C++ code and build concurrent applications. The Flux compiler combines the Flux program and the sequential code to generate a deadlock-free, high-concurrency server. Flux also generates discrete event simulators that accurately predict actual server performance under load. While the Flux language was initially targeted at servers, we have found it to be a useful abstraction for sensor networks, and I will briefly talk about our use of an energy-aware variant of Flux in a deployment on the backs of endangered turtles. The second approach uses the extra processing power of multicore CPUs to make legacy C/C++ applications more reliable. I present a system called DieHard that uses randomization and replication to transparently harden programs against a wide range of errors, including buffer overflows and dangling pointers. Instead of crashing or running amok, DieHard lets programs continue to run correctly in the face of memory errors with high probability. This is joint work with Brendan Burns, Kevin Grimaldi, Alex Kostadinov, Jacob Sorber, and Mark Corner (University of Massachusetts Amherst), and Ben Zorn (Microsoft Research).

Potential of TLS to Exploit Unexploitable Parallelism
Alex Nicolau
Department of Computer Science, University of California

Recent research in thread-level speculation (TLS) has proposed several mechanisms for optimistic execution of difficult-to-analyze serial codes in parallel. Though it has been shown that TLS helps to achieve higher levels of parallelism, evaluation of the unique performance potential of TLS, i.e., performance gain that be achieved only through speculation, has not received much attention. In this paper, we evaluate this aspect, by separating the speedup achievable via true TLP (thread-level parallelism) and TLS, for the SPEC CPU20002006 benchmarks. Further, we dissect the performance potential of each type of speculation: control speculation, data dependence speculation and data value speculation. To the best of our knowledge, this is the first dissection study of its kind. Assuming an oracle TLS mechanism which corresponds to perfect speculation and zero threading overhead whereby the execution time of a candidate program region (for speculative execution) can be reduced to zero, our study shows that, at the loop-level, the upper bound on the arithmetic mean and geometric mean speedup achievable via TLS across SPEC CPU2000 is 39.16% (standard deviation = 31.23) and 18.18% respectively.

Strassen's Algorithm and ATLAS Hybridization
Alex Nicolau
Department of Computer Science, University of California

Strassen's algorithm has practical performance benefits for architectures with simple memory hierarchies, because it trades computationally expensive matrix multiplications (MM) with cheaper matrix additions (MA). However, it presents no advantages for modern high-performance architectures with deep memory hierarchies, because MAs exploit limited data reuse. We present an easy-to-use adaptive algorithm combining Strassen's recursion and a highly-tuned version of ATLAS MM. In fact, we introduce a last step in the ATLAS-installation process that determines whether Strassen's may achieve any speedup. We present a recursive algorithm achieving up to 30% speed-up versus ATLAS alone. We show experimental results for a variety of systems.

That's all folks!!!!!