Virtualization technology is driving profound changes in the way large data centers are designed and managed. Moving to a virtualized infrastructure solves many problems, such as poor hardware utilization and high power consumption, but it also creates new ones, such as virtual machine image sprawl and an additional management layer in the IT stack.
From a technology perspective, the evolution of virtualization from a server-centric (mainframe) abstraction to today's distributed, data-center-wide abstraction is also driving new requirements into the design of hardware platforms, OS/middleware platforms, and software/services processes. This is poised to disrupt many conventional models for systems and software design.
In this talk, I will give an overview of emerging trends in the industry adoption of virtualization, followed by an IBM Research perspective on the new technical challenges this creates and how we are addressing them.
Technology scaling has enabled tremendous growth in the computing
industry over the past few decades. However, recent trends in power
dissipation, reliability, thermal constraints, and device variability
threaten to limit the continued benefits of device scaling, curtail
performance improvements, and cause increased leakage power in future
technology generations. The temporal and spatial scales of these
effects motivate holistic solutions that span the circuit, architecture,
and software layers. In this talk, I will describe several ongoing
projects that seek to address technology scaling issues. These projects
include efforts in the areas of a) power and performance modeling and
design space optimization for future chip-multiprocessor systems, b)
variability-tolerant microarchitectures that are flexible in both
latency and localized supply voltage, and c) accelerator-based
architectures for power/performance efficiency. The talk will also
discuss our chip prototyping efforts that support this work.
Historically, technology has been the main driver of computer performance.
For many system generations, CMOS scaling has been leveraged to increase
clock speed and build increasingly complex microarchitectures. As
technology-driven performance gains are becoming increasingly harder to
achieve, innovative system architecture must step in. In the context of the
design of the Blue Gene/P supercomputer chip, we will discuss how we
adopted a holistic approach to optimization of the entire hardware and
software stack for a range of metrics: performance, power,
power/performance, reliability and ease of use.
The new Blue Gene/P chip multiprocessor (CMP) scales node performance using
a multi-core system-on-a-chip design. While in the past large symmetric
multiprocessor (SMP) designs were sized to handle large amounts of
coherence traffic, many modern CMP designs find this cost prohibitive in
terms of area, power dissipation, and design complexity. As multi-core
processors evolve to larger configurations, the performance loss due to
handling coherence traffic must be carefully managed. Thus, to ensure high
efficiency of each quad-processor node in Blue Gene/P, taming the cost of
coherence of traditional SMP designs was a key requirement.
The new Blue Gene/P chip multiprocessor exploits a novel way of reducing
coherence cost by filtering useless coherence actions. Each processor core
is paired with a snoop filter which identifies and discards unnecessary
coherence requests before they can reach the processor cores. Removing
unnecessary lookups reduces the interference of invalidate requests with L1
data cache accesses, and reduces power by eliminating expensive tag array
accesses. This approach results in improved power and performance
characteristics.
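As a rough illustration of the filtering idea (a software sketch only; the actual Blue Gene/P snoop filter is a hardware design combining several filtering mechanisms), consider a filter that tracks a conservative superset of the cache lines the local L1 may hold and discards snoops for everything else:

```python
class SnoopFilter:
    """Toy snoop filter: tracks a superset of the cache lines the local
    L1 may hold; coherence requests for untracked lines are discarded
    before they can interfere with the core's L1 tag arrays."""

    def __init__(self):
        self.may_hold = set()  # superset of lines cached by the local L1

    def local_load(self, line):
        self.may_hold.add(line)         # the core caches this line

    def snoop(self, line):
        """Return True iff the invalidate must be forwarded to the core."""
        if line not in self.may_hold:
            return False                # filtered: core cannot hold it
        self.may_hold.discard(line)     # line is invalidated in the L1
        return True

f = SnoopFilter()
f.local_load(0x80)
assert f.snoop(0x40) is False   # useless request, filtered
assert f.snoop(0x80) is True    # genuine invalidate, forwarded
assert f.snoop(0x80) is False   # already invalidated, filtered again
```

Every filtered snoop is one avoided tag lookup, which is where the power and performance savings described above come from.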
To optimize application performance, we exploit parallelism at multiple
levels: at the process-level, thread-level, data-level, and
instruction-level. Hardware supported coherence allows applications to
efficiently share data between threads on different processors for
thread-level parallelism, while the dual floating point unit and the
dual-issue out-of-order PowerPC 450 processor core exploit data and
instruction level parallelism, respectively. To exploit process-level
parallelism, special emphasis was put on efficient communication primitives
by including hardware support for the MPI protocol, such as low latency
barriers, and five highly optimized communication networks. A new high
performance DMA unit supports high throughput data transfers.
As a result of this deliberate design-for-scalability approach, Blue Gene
supercomputers offer unprecedented scalability, in some cases by orders of
magnitude, to a wide range of scientific applications. A broad range of
scientific applications on Blue Gene supercomputers have advanced
scientific discovery, which is the real merit and ultimate measure of
success of the Blue Gene system family.
Efficiently exploring exponential-size architectural design spaces with many
interacting parameters remains an open problem: the sheer number of experiments
required renders detailed simulation intractable. We attack this via an
automated approach that builds accurate predictive models. We simulate sampled
points, using results to teach our models the function describing relationships
among design parameters. The models can be queried and are very fast, enabling
efficient discovery of design tradeoffs. We validate our approach via two
uniprocessor sensitivity studies, predicting IPC with only 1-2% error. In an
experimental study using the approach, training on 1% of a 250K-point CMP
design space allows our models to predict performance with only 4-5% error. Our
predictive modeling combines well with techniques that reduce the time taken by
each simulation experiment, achieving net time savings of three to four orders of
magnitude. We have also used the approach to predict runtimes of HPC
applications with large parameter spaces and to predict the best number of
processors to use on a phase-by-phase basis (concurrency throttling).
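The sample-then-predict workflow can be sketched in a few lines. This is an illustrative toy, not the talk's actual models: the "simulator" below is a cheap linear stand-in for a detailed cycle-accurate simulation, and plain least squares stands in for the real predictive models.

```python
import random

# Toy stand-in for detailed simulation of one design point
# (cores, cache size in MB) -> performance. Purely illustrative.
def simulate(cores, cache_mb):
    return 10.0 * cores + 3.0 * cache_mb + 5.0

design_space = [(c, m) for c in range(1, 17) for m in (1, 2, 4, 8)]

random.seed(1)
sample = random.sample(design_space, 10)   # simulate ~15% of the space
X = [(c, m, 1.0) for c, m in sample]       # parameters plus intercept
y = [simulate(c, m) for c, m in sample]

# Fit perf ~ w0*cores + w1*cache + w2 via the normal equations
# (X^T X) w = X^T y, solved by Gaussian elimination with pivoting.
n = 3
A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
b = [sum(r[i] * yk for r, yk in zip(X, y)) for i in range(n)]
for i in range(n):
    p = max(range(i, n), key=lambda r: abs(A[r][i]))
    A[i], A[p] = A[p], A[i]
    b[i], b[p] = b[p], b[i]
    for r in range(i + 1, n):
        f = A[r][i] / A[i][i]
        A[r] = [a - f * ai for a, ai in zip(A[r], A[i])]
        b[r] -= f * b[i]
w = [0.0] * n
for i in reversed(range(n)):
    w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]

# The model now answers queries over the whole space with no further
# simulation.
predict = lambda c, m: w[0] * c + w[1] * m + w[2]
errors = [abs(predict(c, m) - simulate(c, m)) / simulate(c, m)
          for c, m in design_space]
assert max(errors) < 1e-6  # the toy target is linear, so the fit is exact
```

A real target function is of course not linear, which is why the abstract reports 1-5% prediction error rather than an exact fit; the structure of the workflow is the same.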
This talk addresses two classical problems that arise in many computing
applications: error-correcting codes and parallelism. On the one hand, we will
consider the construction of perfect error-correcting codes for alphabets with
multidimensional symbols. On the other hand, we will address the design of
topologies for interconnection networks in parallel computers. Both problems
have practical applications today. For example, ADSL connections use quadrature
amplitude modulation (QAM), which handles two-dimensional symbols. An example
from the second area is IBM's Blue Gene supercomputer, whose nodes are labeled
with three-coordinate symbols organized into a toroidal prism.
In the talk we will see how certain aspects of both problems can be approached
mathematically using rings of complex integers. In particular, we will consider
the Gaussian integers, the ring of complex numbers whose real and imaginary
parts are both integers. We will also show an example application of the
Eisenstein-Jacobi integers and briefly consider other complex-integer
structures of higher dimension.
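The Gaussian-integer machinery can be sketched in a few lines. This is an illustrative example, not taken from the talk; the modulus m = 2+3j is an arbitrary choice:

```python
# Gaussian integers represented as Python complex numbers with integral
# parts. The remainder of dividing a by m -- rounding the exact quotient
# to the nearest Gaussian integer -- is the basic operation behind both
# perfect-code constructions and network labelings over these rings.
def gmod(a, m):
    q = a / m
    qr = complex(round(q.real), round(q.imag))  # nearest Gaussian integer
    return a - qr * m

# Arithmetic modulo m = 2+3j partitions the Gaussian integers into
# |m|^2 = 13 residue classes: one label per node of a 13-node network.
m = 2 + 3j
norm = round(abs(m) ** 2)                       # = 13
residues = {gmod(complex(x, y), m)
            for x in range(-10, 11) for y in range(-10, 11)}
assert len(residues) == norm
assert gmod(m, m) == 0
```

The same residue classes serve double duty: as codewords of a perfect code over a 13-symbol two-dimensional alphabet, and as node labels of a 13-node interconnection topology.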
Certain theoretical elements underlying the talk may strike some listeners as
complicated or tedious. The speaker will, however, give an intuitive and not
overly formal view of the problems addressed and of how they are solved.
This talk will discuss the Dynamic Data Driven Applications Systems (DDDAS)
concept, driving novel directions in applications and in measurements, as well
as in computer sciences and cyber-infrastructure. DDDAS entails the ability to
dynamically incorporate additional data into an executing application (these
data can be archival or collected on-line) and, in reverse, the ability of the
application to dynamically steer the measurement process. The dynamic
environments of concern here encompass dynamic integration of real-time data
acquisition with compute- and data-intensive systems. Enabling DDDAS
requires advances in the application modeling methods and interfaces, in
algorithms tolerant to perturbations of dynamic data injection and steering, in
systems software, and in infrastructure support. Research and development of
such technologies requires synergistic multidisciplinary collaboration in the
applications, algorithms, software systems, and measurements systems areas, and
involving researchers in basic sciences, engineering, and computer
sciences. Such capabilities offer the promise of augmenting the analysis and
prediction capabilities of application simulations and the effectiveness of
measurement systems, with a potential major impact in many science and
engineering application areas. The concept has been characterized as
revolutionary and examples of areas of DDDAS impact include computer and
communication systems, information science and technologies, physical,
chemical, biological, medical and health systems, environmental (hazard
prediction, prevention, mitigation, response), and manufacturing,
transportation and critical infrastructure systems. The talk will address
technology advances enabled by and driving the DDDAS concept, as well as challenges
and opportunities, motivating the discussion with application examples from
ongoing research efforts.
Software is imperfect. Software errors cost the US economy alone an
estimated $59 billion a year due to downtime and software maintenance
costs. However, many of these errors are preventable. I will describe
our work on resilient runtime systems, which automatically protect C
and C++ programs from programmer errors that would otherwise lead to
crashes or security vulnerabilities. (Joint work with Microsoft Research.)
Hyperspectral image analysis is a task that demands a large computing capacity, and certain applications even impose real-time requirements. For such applications the market offers various HPC solutions, but few of them are viable for on-board analysis because of the payload and power constraints found on satellites or aircraft. For the latter, the only viable solutions are embedded manycore-style parallel systems (small size and controllable power consumption). The only viable options of this kind on the current market are the Cell processor and graphics processing units (GPUs). The talk will present the problem (hyperspectral image analysis), analyze an algorithm (automatic endmember extraction), and present its implementation both in CUDA for GPUs and in CellSs for the Cell BE.
A new technology is emerging which has the potential to revolutionise science and industry. It is already being used by world-leading research groups and companies to massively speed up their research and productivity. And it's based on a chip which was developed to play computer games.
In the past, graphics processors were special purpose hardwired application accelerators, suitable only for conventional rasterization-style graphics applications. Modern GPUs are now fully programmable, massively parallel floating point processors. NVIDIA, the company which invented the GPU, is unlocking this technology's potential to create a new generation of affordable, accessible supercomputers, putting an unprecedented level of computational power in the hands of scientists and programmers.
This talk will describe NVIDIA's massively multithreaded computing architecture and CUDA software for GPU computing. The architecture is scalable and highly parallel, delivering high throughput for data-intensive processing. Although not truly general-purpose processors, GPUs can now be used for a wide variety of compute-intensive applications beyond graphics.
Graph theoretic problems are representative of fundamental kernels in
traditional and emerging computational sciences such as chemistry,
biology, and medicine, as well as applications in national
security. Yet they pose serious challenges for parallel machines due
to non-contiguous, concurrent accesses to global data structures with
low degrees of locality. Few parallel graph algorithms outperform
their best sequential implementation due to long memory latencies and
high synchronization costs. In this talk, we consider several graph
theoretic kernels for connectivity and centrality and discuss how the
features of petascale architectures will affect algorithm development,
ease of programming, performance, and scalability.
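As a concrete (sequential) instance of a connectivity kernel of the kind discussed, here is connected components via breadth-first search over an adjacency list; the parallel versions of exactly this neighbor-chasing loop are where the non-contiguous global accesses and synchronization costs arise:

```python
from collections import deque

def connected_components(adj):
    """Label each vertex of an undirected graph (adjacency-list dict)
    with the id of its connected component, via repeated BFS."""
    comp, label = {}, 0
    for s in adj:
        if s in comp:
            continue
        comp[s] = label
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:        # irregular, low-locality accesses
                if v not in comp:
                    comp[v] = label
                    q.append(v)
        label += 1
    return comp

g = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
c = connected_components(g)
assert c[0] == c[1] == c[2]         # {0,1,2} form one component
assert c[3] == c[4] and c[3] != c[0]
assert c[5] not in (c[0], c[3])     # isolated vertex is its own component
```

The inner `for v in adj[u]` loop touches neighbors scattered across the whole data structure, which is precisely the access pattern that stresses memory latency and synchronization on parallel machines.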
Power consumption is the ultimate limiter to current and future processor
design, leading us to focus on more power efficient architectural features such
as multiple cores, more powerful vector units, and use of hardware
multi-threading (in place of relatively expensive out-of-order techniques). It
is (increasingly) well understood that developers face new challenges with
multi-core software development. The first of these challenges is a significant
productivity burden particular to parallel programming. A big contributor to
this burden is the relative difficulty of tracking down data races, which
manifest non-deterministically. The second challenge is parallelizing
applications so that they effectively scale with new core counts and the
inevitable enhancement and evolution of the instruction set. This is a new and
subtle qualifier to the benefits of backwards compatibility inherent in Intel®
Architecture (IA): performance may not scale forward with new
micro-architectures and, in some cases, actually regress. I assert that
forward-scaling is an essential requirement for new programming models, tools,
and methodologies intended for multi-core software development.
We are implementing a programming model called Ct (C for Throughput Computing)
that leverages the strengths of data parallel programming to help address these
challenges. Ct is a C++-hosted deterministic parallel programming model
integrating the nested data parallelism of Blelloch and bulk synchronous
processing of Valiant (with a dash of SISAL for good measure). Ct uses
meta-programming and dynamic compilation to essentially embed a pure functional
programming language in impure and unsafe C++. A key objective of the Ct
project is to create both high-level and low-level abstractions that
forward-scale across IA. I will describe the surface API and runtime
architecture that we've built to achieve this, as well as some performance
results.
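The nested data parallelism Ct inherits from Blelloch can be illustrated in miniature (function names here are illustrative, not the Ct API): an irregular nested collection is flattened into one value array plus a segment descriptor, and operators then act independently per segment.

```python
def segmented_sum(values, seg_lengths):
    """Reduce each segment of a flattened nested collection.
    Each segment is independent, which is what makes the nested
    data-parallel formulation deterministic and parallelizable."""
    out, i = [], 0
    for n in seg_lengths:
        out.append(sum(values[i:i + n]))
        i += n
    return out

# An irregular nested collection and its flat representation.
nested = [[1, 2], [3, 4, 5], [], [6]]
flat = [x for seg in nested for x in seg]       # [1, 2, 3, 4, 5, 6]
lens = [len(seg) for seg in nested]             # [2, 3, 0, 1]
assert segmented_sum(flat, lens) == [3, 12, 0, 6]
```

Because every segment reduction is side-effect free, the result is the same under any parallel schedule, which is the deterministic-by-construction property the abstract emphasizes.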
Modern FPGA platforms are an interesting alternative for the
implementation of complex computationally-intensive applications. In
this talk, after a short initial introduction to FPGA technology, we
will review the major features of modern platform FPGAs.
The second objective of this talk is to introduce the notion of FPGAs
as programmable platforms. We will show how the FPGA implementation can
be adapted (i.e., customized) to the different application scenarios. In
this context, we will explain how we have used the partial
reconfiguration capabilities of modern FPGAs to optimize the
system-level power consumption in networking applications.
In the final part of this talk, we will explain how to program FPGAs.
We will start with a brief introduction to the traditional FPGA design
flow. However, to extend the use of FPGAs to non-hardware engineers,
new design tools, at a higher-level of abstraction, have to be
developed. We will review the two main design flow paradigms (i.e.,
general-purpose and domain-specific) that can be found in the industry
and academia to address this open research issue.
This talk will start with an overview of the two largest grid infrastructures in the United States, the TeraGrid and the Open Science Grid, and the new community-driven projects, called science gateways, that are defining their requirements. Operating principles such as account management, security, and software distribution will be highlighted, and we will present the challenge of defining a framework for supporting virtual organizations on multi-purpose facilities. We will then present the two main research problems we are currently working on: early attempts at using system- and architecture-modeling languages to design and optimize grid systems before provisioning them on existing grids, and virtualization as a way to encapsulate virtual-organization-specific services. This talk should bring together a coherent view of what is now known in the US as Cyberinfrastructure and ground it from theoretical, experimental, and operational standpoints.
The impact of hurricanes is so devastating throughout different levels of society that there is a pressing need to provide a range of users with accurate and timely information that can enable effective planning for and response to potential hurricane landfalls. The Weather Research and Forecasting (WRF) code is the latest numerical model that has been adopted by meteorological services worldwide. The current version of WRF has not been designed to scale out of a single organization's local computing resources. However, the high resource requirements of WRF for fine-resolution and ensemble forecasting demand a large number of computing nodes, which typically cannot be found within one organization. Therefore, there is a pressing need for the Grid-enablement of the WRF code such that it can utilize resources available in partner organizations. In the past two years, we have been working somewhat separately on four different LA Grid projects, namely, Hurricane Mitigation, Job-Flow Management, Meta-Scheduling, and Resource Management. In this talk, I will propose a research roadmap for 2008 and will explain how we can utilize the findings and integrate the tools developed by these four projects to support on-demand multi-scale weather modeling.
Software image aging, or software aging for short, is the progressive degradation of application or system performance due to resource depletion (memory leaks, unreleased file handles, and the like), accumulation of rounding errors, and other causes. This phenomenon is frequently observed in always-on, long-running applications such as web services or enterprise systems. Because of the limited availability of source code, system complexity, and the cost of the search, the root causes are frequently unknown, and so the problem is usually resolved via time-based or adaptive rejuvenation (restart).
In this talk we look at a series of techniques for handling software aging in the context of SOA applications. First we discuss a method for low-overhead measurement of aging processes in a production environment. We then describe two approaches to modeling aging: spline-based models for deterministic aging profiles, and models using statistical learning for short-term prediction that are robust against transient anomalies. Such models can be applied to adaptive rejuvenation and performance optimization, which we discuss in the last part. In particular, we examine how virtualization techniques and optimization of rejuvenation schedules can guarantee an "any-time" performance level of an aging application.
The experimental data used for this work were obtained from a Java version of the TPC-W benchmark instrumented with a fault injector and from the Apache Axis application server suffering from natural aging.
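A minimal sketch of trend-based rejuvenation scheduling: fit a linear trend to sampled free-memory readings and plan a restart before the predicted exhaustion time. This is deliberately simpler than the spline and statistical-learning models discussed in the talk, and the safety margin is an illustrative choice.

```python
def schedule_rejuvenation(times, free_mem, safety=0.2):
    """Least-squares line through (time, free memory) samples;
    returns the recommended restart time, or None if no aging trend."""
    n = len(times)
    mt, mf = sum(times) / n, sum(free_mem) / n
    slope = (sum((t - mt) * (f - mf) for t, f in zip(times, free_mem))
             / sum((t - mt) ** 2 for t in times))
    if slope >= 0:
        return None                      # resources not being depleted
    t_exhaust = mt - mf / slope          # fitted line hits zero here
    # Restart a safety fraction of the remaining horizon early.
    return t_exhaust - safety * (t_exhaust - times[-1])

times = list(range(10))                  # seconds
free = [100 - 5 * t for t in times]      # a steady 5 MB/s leak
assert abs(schedule_rejuvenation(times, free) - 17.8) < 1e-9
assert schedule_rejuvenation([0, 1, 2], [10, 11, 12]) is None
```

A production version would, as the abstract notes, have to cope with transient anomalies in the measurements rather than trusting a single straight-line fit.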
In this talk, I present the Virtual Private Machine (VPM) framework.
In the VPM framework, tasks are assigned shares of a multi-core
system's shared hardware resources. A complete set of resource
assignments forms a Virtual Private Machine. A task assigned a VPM
achieves a minimum level of performance regardless of the other tasks
in the system, i.e., the VPM provides the task with performance
isolation. VPM policies, implemented primarily in software, translate
system-level performance requirements into VPM resource assignments,
and VPM mechanisms implemented in hardware enforce the VPM
assignments.
To illustrate the potential of the VPM framework, a set of VPM
policies are proposed. The policies translate applications'
system-level QoS requirements into VPM assignments and use the
system's excess service to optimize aggregate performance. I
illustrate that, in combination, the proposed VPM policies and
mechanisms can provide a high degree of QoS or significantly improve
aggregate performance. I show that QoS and aggregate performance
objectives often conflict, and through the use of the VPM abstraction
an OS developer or system administrator can tune the proposed system
to achieve the desired balance of both QoS and aggregate performance.
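A toy policy in the spirit of the above (the resource, names, and numbers are hypothetical, not from the talk): grant each task its stated minimum share of, say, 16 cache ways, which provides the isolation, and then distribute the excess service to improve aggregate performance.

```python
def assign_vpm(min_shares, total=16):
    """Grant each task its minimum resource share (performance
    isolation), then round-robin the excess ways among tasks."""
    need = sum(min_shares.values())
    assert need <= total, "infeasible QoS requirements"
    excess = total - need
    tasks = sorted(min_shares)           # deterministic ordering
    shares = dict(min_shares)
    for i in range(excess):
        shares[tasks[i % len(tasks)]] += 1
    return shares

reqs = {"db": 6, "web": 4, "batch": 2}   # hypothetical minimum ways
s = assign_vpm(reqs)
assert sum(s.values()) == 16             # every way is assigned
assert all(s[t] >= m for t, m in reqs.items())   # minima respected
```

A real policy would steer the excess toward whichever tasks benefit most, which is exactly where the QoS-versus-aggregate-performance tension described above appears.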
Performance analysis is a crucial step in program development for HPC systems. Due to the complexity of these architectures, optimizing program performance during the design process is impossible. Thus, programs go through a cyclic optimization process of performance bottleneck detection and program tuning. A severe limitation of most performance analysis tools used in this process is scalability. For machines such as the Altix 4700 supercomputer installed at the Leibniz Supercomputing Centre in Munich, with thousands of processors, and for future petaflop systems, this scalability problem has to be solved.
We will give an overview of the Periscope Automatic Performance Analysis Tool, which is currently under development at Technische Universität München. Periscope searches for performance bottlenecks during program execution. It consists of a set of distributed agents, each responsible for a subset of the application's processes. The agents' search is based on a formal specification of performance properties in the APART Specification Language (ASL).
Recent improvements in high-end GPUs have made it possible to perform real-time 3D visualization, such as volume rendering and 3D contour plots, of scientific data locally. Web-browser-based remote 3D visualization via visualization servers is attractive, but data transfer overhead prevents interactive operation. We propose an interactive remote 3D visualization model based on live streaming for geophysical fluids research. In this model, we use live-streamed Flash media for the browser-based operations while maintaining a minimum quality of data analysis and a minimum bit rate for the live stream. Preliminary experiments with a prototype system validate the effectiveness of the proposed model.
Partitioned Global Address Space (PGAS) languages offer an attractive,
high-productivity programming model for programming large-scale
parallel machines. PGAS languages, such as Unified Parallel C (UPC),
combine the simplicity of shared-memory programming with the
efficiency of the message-passing paradigm by allowing users control
over the data layout. PGAS languages distinguish between private,
shared-local, and shared-remote memory, with shared-remote accesses
typically much more expensive than shared-local and private accesses,
especially on distributed memory machines where shared-remote access
implies communication over a network.
This presentation will briefly describe the UPC language and show a
simple extension to the language that allows the programmer to
distribute multiple dimensions of a shared array among the threads. We
claim that this extension allows for better control of locality, and
therefore performance, in the language. This presentation will also
describe an analysis that allows the compiler to distinguish between
local shared array accesses and remote shared array accesses. Local
shared array accesses are then transformed into direct memory accesses
by the compiler, saving the overhead of a locality check at
runtime. The results demonstrate that the locality analysis is able to
significantly reduce the number of shared accesses.
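The runtime locality check that this analysis eliminates can be sketched as follows (a simplification of UPC's blocked-cyclic layout rules; the array shape and thread count are illustrative):

```python
# For a UPC shared array declared with block size B over T threads,
# element i has affinity to thread (i // B) % T. An access whose
# affinity provably equals MYTHREAD at compile time can be lowered
# to a direct memory access, skipping this check at runtime.
def affinity(i, block, threads):
    return (i // block) % threads

THREADS, BLOCK = 4, 2            # e.g. "shared [2] int a[16]" on 4 threads
assert affinity(0, BLOCK, THREADS) == 0
assert affinity(3, BLOCK, THREADS) == 1
assert affinity(8, BLOCK, THREADS) == 0      # wraps after one full cycle

MYTHREAD = 1                     # hypothetical thread asking the question
local = [i for i in range(16)
         if affinity(i, BLOCK, THREADS) == MYTHREAD]
assert local == [2, 3, 10, 11]   # only these accesses are direct loads
```

Every access the compiler can prove local by this rule saves both the affinity computation and the shared-pointer indirection, which is the source of the reduction in shared accesses reported above.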