Virtualization technology is driving profound changes in the way large data centers are designed and managed. Moving to a virtualized infrastructure solves many problems, such as poor hardware utilization and high power consumption, but it also creates new ones, such as virtual machine image sprawl and an additional management layer in the IT stack.
From a technology perspective, the evolution of virtualization from a server-centric (mainframe) abstraction to today's distributed, data-center-wide abstraction is also driving new requirements into the design of hardware platforms, OS/middleware platforms, and software/services processes. This is poised to disrupt many conventional models for systems and software design.
In this talk, I will give an overview of emerging trends in the industry adoption of virtualization, followed by an IBM Research perspective on the new technical challenges this creates and how we are addressing them.
Technology scaling has enabled tremendous growth in the computing
industry over the past few decades. However, recent trends in power
dissipation, reliability, thermal constraints, and device variability
threaten to limit the continued benefits of device scaling, curtail
performance improvements, and cause increased leakage power in future
technology generations. The temporal and spatial scales of these
effects motivate holistic solutions that span the circuit, architecture,
and software layers. In this talk, I will describe several ongoing
projects that seek to address technology scaling issues. These projects
include efforts in the areas of a) power and performance modeling and
design space optimization for future chip-multiprocessor systems, b)
variability-tolerant microarchitectures that are flexible in both
latency and localized supply voltage, and c) accelerator-based
architectures for power/performance efficiency. The talk will also
discuss our chip prototyping efforts that support this work.
Historically, technology has been the main driver of computer performance.
For many system generations, CMOS scaling has been leveraged to increase
clock speed and build increasingly complex microarchitectures. As
technology-driven performance gains are becoming increasingly harder to
achieve, innovative system architecture must step in. In the context of the
design of the Blue Gene/P supercomputer chip, we will discuss how we
adopted a holistic approach to optimization of the entire hardware and
software stack for a range of metrics: performance, power,
power/performance, reliability and ease of use.
The new Blue Gene/P chip multiprocessor (CMP) scales node performance using
a multi-core system-on-a-chip design. While in the past large symmetric
multiprocessor (SMP) designs were sized to handle large amounts of
coherence traffic, many modern CMP designs find this cost prohibitive in
terms of area, power dissipation, and design complexity. As multi-core
processors evolve to larger configurations, the performance loss due to
handling coherence traffic must be carefully managed. Thus, to ensure high
efficiency of each quad-processor node in Blue Gene/P, taming the cost of
coherence of traditional SMP designs was a key requirement.
The new Blue Gene/P chip multiprocessor exploits a novel way of reducing
coherence cost by filtering useless coherence actions. Each processor core
is paired with a snoop filter which identifies and discards unnecessary
coherence requests before they can reach the processor cores. Removing
unnecessary lookups reduces the interference of invalidate requests with L1
data cache accesses, and reduces power by eliminating expensive tag array
accesses. This approach results in improved power and performance
characteristics.
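As a rough illustration of the filtering idea (a software sketch only; the actual Blue Gene/P snoop filter is a hardware design combining several filtering mechanisms), consider a filter that tracks a conservative superset of the cache lines the local L1 may hold and discards snoops for everything else:

```python
class SnoopFilter:
    """Toy snoop filter: tracks a superset of the cache lines the local
    L1 may hold; coherence requests for untracked lines are discarded
    before they can interfere with the core's L1 tag arrays."""

    def __init__(self):
        self.may_hold = set()  # superset of lines cached by the local L1

    def local_load(self, line):
        self.may_hold.add(line)         # the core caches this line

    def snoop(self, line):
        """Return True iff the invalidate must be forwarded to the core."""
        if line not in self.may_hold:
            return False                # filtered: core cannot hold it
        self.may_hold.discard(line)     # line is invalidated in the L1
        return True

f = SnoopFilter()
f.local_load(0x80)
assert f.snoop(0x40) is False   # useless request, filtered
assert f.snoop(0x80) is True    # genuine invalidate, forwarded
assert f.snoop(0x80) is False   # already invalidated, filtered again
```

Every filtered snoop is one avoided tag lookup, which is where the power and performance savings described above come from.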
To optimize application performance, we exploit parallelism at multiple
levels: at the process-level, thread-level, data-level, and
instruction-level. Hardware supported coherence allows applications to
efficiently share data between threads on different processors for
thread-level parallelism, while the dual floating point unit and the
dual-issue out-of-order PowerPC 450 processor core exploit data and
instruction level parallelism, respectively. To exploit process-level
parallelism, special emphasis was put on efficient communication primitives
by including hardware support for the MPI protocol, such as low latency
barriers, and five highly optimized communication networks. A new high
performance DMA unit supports high throughput data transfers.
As a result of this deliberate design-for-scalability approach, Blue Gene
supercomputers offer unprecedented scalability, in some cases by orders of
magnitude, to a wide range of scientific applications. A broad range of
scientific applications on Blue Gene supercomputers have advanced
scientific discovery, which is the real merit and ultimate measure of
success of the Blue Gene system family.
Efficiently exploring exponential-size architectural design spaces with many
interacting parameters remains an open problem: the sheer number of experiments
required renders detailed simulation intractable. We attack this via an
automated approach that builds accurate predictive models. We simulate sampled
points, using results to teach our models the function describing relationships
among design parameters. The models can be queried and are very fast, enabling
efficient discovery of design tradeoffs. We validate our approach via two
uniprocessor sensitivity studies, predicting IPC with only 1-2% error. In an
experimental study using the approach, training on 1% of a 250K-point CMP
design space allows our models to predict performance with only 4-5% error. Our
predictive modeling combines well with techniques that reduce the time taken by
each simulation experiment, achieving net time savings of three to four orders of
magnitude. We have also used the approach to predict runtimes of HPC
applications with large parameter spaces and to predict the best number of
processors to use on a phase-by-phase basis (concurrency throttling).
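The sample-then-predict workflow can be sketched in a few lines. This is an illustrative toy, not the talk's actual models: the "simulator" below is a cheap linear stand-in for a detailed cycle-accurate simulation, and plain least squares stands in for the real predictive models.

```python
import random

# Toy stand-in for detailed simulation of one design point
# (cores, cache size in MB) -> performance. Purely illustrative.
def simulate(cores, cache_mb):
    return 10.0 * cores + 3.0 * cache_mb + 5.0

design_space = [(c, m) for c in range(1, 17) for m in (1, 2, 4, 8)]

random.seed(1)
sample = random.sample(design_space, 10)   # simulate ~15% of the space
X = [(c, m, 1.0) for c, m in sample]       # parameters plus intercept
y = [simulate(c, m) for c, m in sample]

# Fit perf ~ w0*cores + w1*cache + w2 via the normal equations
# (X^T X) w = X^T y, solved by Gaussian elimination with pivoting.
n = 3
A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
b = [sum(r[i] * yk for r, yk in zip(X, y)) for i in range(n)]
for i in range(n):
    p = max(range(i, n), key=lambda r: abs(A[r][i]))
    A[i], A[p] = A[p], A[i]
    b[i], b[p] = b[p], b[i]
    for r in range(i + 1, n):
        f = A[r][i] / A[i][i]
        A[r] = [a - f * ai for a, ai in zip(A[r], A[i])]
        b[r] -= f * b[i]
w = [0.0] * n
for i in reversed(range(n)):
    w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]

# The model now answers queries over the whole space with no further
# simulation.
predict = lambda c, m: w[0] * c + w[1] * m + w[2]
errors = [abs(predict(c, m) - simulate(c, m)) / simulate(c, m)
          for c, m in design_space]
assert max(errors) < 1e-6  # the toy target is linear, so the fit is exact
```

A real target function is of course not linear, which is why the abstract reports 1-5% prediction error rather than an exact fit; the structure of the workflow is the same.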
This talk addresses two classical problems that arise in many computing
applications: error-correcting codes and parallelism. On the one hand, we will
consider the construction of perfect error-correcting codes for alphabets with
multidimensional symbols. On the other hand, we will address the design of
topologies for interconnection networks in parallel computers. Both problems
have practical applications today. For example, ADSL connections use quadrature
amplitude modulation (QAM), which handles two-dimensional symbols. An example
from the second area is IBM's Blue Gene supercomputer, whose nodes are labeled
with three-coordinate symbols organized into a toroidal prism.
In the talk we will see how certain aspects of both problems can be approached
mathematically using rings of complex integers. In particular, we will consider
the Gaussian integers, the ring of complex numbers whose real and imaginary
parts are both integers. We will also show an example application of the
Eisenstein-Jacobi integers and briefly consider other complex-integer
structures of higher dimension.
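The Gaussian-integer machinery can be sketched in a few lines. This is an illustrative example, not taken from the talk; the modulus m = 2+3j is an arbitrary choice:

```python
# Gaussian integers represented as Python complex numbers with integral
# parts. The remainder of dividing a by m -- rounding the exact quotient
# to the nearest Gaussian integer -- is the basic operation behind both
# perfect-code constructions and network labelings over these rings.
def gmod(a, m):
    q = a / m
    qr = complex(round(q.real), round(q.imag))  # nearest Gaussian integer
    return a - qr * m

# Arithmetic modulo m = 2+3j partitions the Gaussian integers into
# |m|^2 = 13 residue classes: one label per node of a 13-node network.
m = 2 + 3j
norm = round(abs(m) ** 2)                       # = 13
residues = {gmod(complex(x, y), m)
            for x in range(-10, 11) for y in range(-10, 11)}
assert len(residues) == norm
assert gmod(m, m) == 0
```

The same residue classes serve double duty: as codewords of a perfect code over a 13-symbol two-dimensional alphabet, and as node labels of a 13-node interconnection topology.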
Certain theoretical elements underlying the talk may strike some listeners as
complicated or tedious. The speaker will, however, give an intuitive and not
overly formal view of the problems addressed and of how they are solved.
This talk will discuss the Dynamic Data Driven Applications Systems (DDDAS)
concept, driving novel directions in applications and in measurements, as well
as in computer sciences and cyber-infrastructure. DDDAS entails the ability to
dynamically incorporate additional data into an executing application (these
data can be archival or collected on-line) and, in reverse, the ability of the
application to dynamically steer the measurement process. The dynamic
environments of concern here encompass dynamic integration of real-time data
acquisition with compute- and data-intensive systems. Enabling DDDAS
requires advances in the application modeling methods and interfaces, in
algorithms tolerant to perturbations of dynamic data injection and steering, in
systems software, and in infrastructure support. Research and development of
such technologies requires synergistic multidisciplinary collaboration in the
applications, algorithms, software systems, and measurements systems areas, and
involving researchers in basic sciences, engineering, and computer
sciences. Such capabilities offer the promise of augmenting the analysis and
prediction capabilities of application simulations and the effectiveness of
measurement systems, with a potential major impact in many science and
engineering application areas. The concept has been characterized as
revolutionary and examples of areas of DDDAS impact include computer and
communication systems, information science and technologies, physical,
chemical, biological, medical and health systems, environmental (hazard
prediction, prevention, mitigation, response), and manufacturing,
transportation and critical infrastructure systems. The talk will address
technology advances enabled by and driving the DDDAS concept, as well as challenges
and opportunities, motivating the discussion with application examples from
ongoing research efforts.
Software is imperfect. Software errors cost the US economy alone an
estimated $59 billion a year due to downtime and software maintenance
costs. However, many of these errors are preventable. I will describe
our work on resilient runtime systems, which automatically protect C
and C++ programs from programmer errors that would otherwise lead to
crashes or security vulnerabilities. (Joint work with Microsoft Research.)
Hyperspectral image analysis is a task that demands a large computing capacity, and certain applications even impose real-time requirements. For such applications the market offers various HPC solutions, but few of them are viable for on-board analysis because of the payload and power constraints found on satellites or aircraft. For the latter, the only viable solutions are embedded manycore-style parallel systems (small size and controllable power consumption). The only viable options of this kind on the current market are the Cell processor and graphics processing units (GPUs). The talk will present the problem (hyperspectral image analysis), analyze an algorithm (automatic endmember extraction), and present its implementation both in CUDA for GPUs and in CellSs for the Cell BE.
A new technology is emerging which has the potential to revolutionise science and industry. It is already being used by world-leading research groups and companies to massively speed up their research and productivity. And it's based on a chip which was developed to play computer games.
In the past, graphics processors were special purpose hardwired application accelerators, suitable only for conventional rasterization-style graphics applications. Modern GPUs are now fully programmable, massively parallel floating point processors. NVIDIA, the company which invented the GPU, is unlocking this technology's potential to create a new generation of affordable, accessible supercomputers, putting an unprecedented level of computational power in the hands of scientists and programmers.
This talk will describe NVIDIA's massively multithreaded computing architecture and CUDA software for GPU computing. The architecture is scalable and highly parallel, delivering high throughput for data-intensive processing. Although not truly general-purpose processors, GPUs can now be used for a wide variety of compute-intensive applications beyond graphics.
Graph theoretic problems are representative of fundamental kernels in
traditional and emerging computational sciences such as chemistry,
biology, and medicine, as well as applications in national
security. Yet they pose serious challenges for parallel machines due
to non-contiguous, concurrent accesses to global data structures with
low degrees of locality. Few parallel graph algorithms outperform
their best sequential implementation due to long memory latencies and
high synchronization costs. In this talk, we consider several graph
theoretic kernels for connectivity and centrality and discuss how the
features of petascale architectures will affect algorithm development,
ease of programming, performance, and scalability.
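As a concrete (sequential) instance of a connectivity kernel of the kind discussed, here is connected components via breadth-first search over an adjacency list; the parallel versions of exactly this neighbor-chasing loop are where the non-contiguous global accesses and synchronization costs arise:

```python
from collections import deque

def connected_components(adj):
    """Label each vertex of an undirected graph (adjacency-list dict)
    with the id of its connected component, via repeated BFS."""
    comp, label = {}, 0
    for s in adj:
        if s in comp:
            continue
        comp[s] = label
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:        # irregular, low-locality accesses
                if v not in comp:
                    comp[v] = label
                    q.append(v)
        label += 1
    return comp

g = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
c = connected_components(g)
assert c[0] == c[1] == c[2]         # {0,1,2} form one component
assert c[3] == c[4] and c[3] != c[0]
assert c[5] not in (c[0], c[3])     # isolated vertex is its own component
```

The inner `for v in adj[u]` loop touches neighbors scattered across the whole data structure, which is precisely the access pattern that stresses memory latency and synchronization on parallel machines.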
Power consumption is the ultimate limiter to current and future processor
design, leading us to focus on more power efficient architectural features such
as multiple cores, more powerful vector units, and use of hardware
multi-threading (in place of relatively expensive out-of-order techniques). It
is (increasingly) well understood that developers face new challenges with
multi-core software development. The first of these challenges is a significant
productivity burden particular to parallel programming. A big contributor to
this burden is the relative difficulty of tracking down data races, which
manifest non-deterministically. The second challenge is parallelizing
applications so that they effectively scale with new core counts and the
inevitable enhancement and evolution of the instruction set. This is a new and
subtle qualifier to the benefits of backwards compatibility inherent in Intel®
Architecture (IA): performance may not scale forward with new
micro-architectures and, in some cases, actually regress. I assert that
forward-scaling is an essential requirement for new programming models, tools,
and methodologies intended for multi-core software development.
We are implementing a programming model called Ct (C for Throughput Computing)
that leverages the strengths of data parallel programming to help address these
challenges. Ct is a C++-hosted deterministic parallel programming model
integrating the nested data parallelism of Blelloch and bulk synchronous
processing of Valiant (with a dash of SISAL for good measure). Ct uses
meta-programming and dynamic compilation to essentially embed a pure functional
programming language in impure and unsafe C++. A key objective of the Ct
project is to create both high-level and low-level abstractions that
forward-scale across IA. I will describe the surface API and runtime
architecture that we've built to achieve this, as well as some performance
results.
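The nested data parallelism Ct inherits from Blelloch can be illustrated in miniature (function names here are illustrative, not the Ct API): an irregular nested collection is flattened into one value array plus a segment descriptor, and operators then act independently per segment.

```python
def segmented_sum(values, seg_lengths):
    """Reduce each segment of a flattened nested collection.
    Each segment is independent, which is what makes the nested
    data-parallel formulation deterministic and parallelizable."""
    out, i = [], 0
    for n in seg_lengths:
        out.append(sum(values[i:i + n]))
        i += n
    return out

# An irregular nested collection and its flat representation.
nested = [[1, 2], [3, 4, 5], [], [6]]
flat = [x for seg in nested for x in seg]       # [1, 2, 3, 4, 5, 6]
lens = [len(seg) for seg in nested]             # [2, 3, 0, 1]
assert segmented_sum(flat, lens) == [3, 12, 0, 6]
```

Because every segment reduction is side-effect free, the result is the same under any parallel schedule, which is the deterministic-by-construction property the abstract emphasizes.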
Modern FPGA platforms are an interesting alternative for the
implementation of complex computationally-intensive applications. In
this talk, after a short initial introduction to FPGA technology, we
will review the major features of modern platform FPGAs.
The second objective of this talk is to introduce the notion of FPGAs
as programmable platforms. We will show how the FPGA implementation can
be adapted (i.e., customized) to the different application scenarios. In
this context, we will explain how we have used the partial
reconfiguration capabilities of modern FPGAs to optimize the
system-level power consumption in networking applications.
In the final part of this talk, we will explain how to program FPGAs.
We will start with a brief introduction to the traditional FPGA design
flow. However, to extend the use of FPGAs to non-hardware engineers,
new design tools, at a higher-level of abstraction, have to be
developed. We will review the two main design flow paradigms (i.e.,
general-purpose and domain-specific) that can be found in the industry
and academia to address this open research issue.
This talk will start with an overview of the two largest grid infrastructures in the United States, the TeraGrid and the Open Science Grid, and the new community-driven projects, called science gateways, that are defining their requirements. Operating principles such as account management, security, and software distribution will be highlighted, and we will present the challenge of defining a framework for supporting virtual organizations on multi-purpose facilities. We will then present the two main research problems we are currently working on: early attempts at using system- and architecture-modeling languages to design and optimize grid systems before provisioning them on existing grids, and virtualization as a way to encapsulate virtual-organization-specific services. This talk should bring together a coherent view of what is now known in the US as Cyberinfrastructure and ground it from theoretical, experimental, and operational standpoints.
The impact of hurricanes is so devastating throughout different levels of society that there is a pressing need to provide a range of users with accurate and timely information that can enable effective planning for and response to potential hurricane landfalls. The Weather Research and Forecasting (WRF) code is the latest numerical model that has been adopted by meteorological services worldwide. The current version of WRF has not been designed to scale out of a single organization's local computing resources. However, the high resource requirements of WRF for fine-resolution and ensemble forecasting demand a large number of computing nodes, which typically cannot be found within one organization. Therefore, there is a pressing need for the Grid-enablement of the WRF code such that it can utilize resources available in partner organizations. In the past two years, we have been working somewhat separately on four different LA Grid projects, namely, Hurricane Mitigation, Job-Flow Management, Meta-Scheduling, and Resource Management. In this talk, I will propose a research roadmap for 2008 and will explain how we can utilize the findings and integrate the tools developed by these four projects to support on-demand multi-scale weather modeling.
Software image aging, or software aging for short, is the progressive degradation of application or system performance due to resource depletion (memory leaks, unreleased file handles, and the like), accumulation of rounding errors, and other causes. This phenomenon is frequently observed in always-on, long-running applications such as web services or enterprise systems. Because of the limited availability of source code, system complexity, and the cost of the search, the root causes are frequently unknown, and so the problem is usually resolved via time-based or adaptive rejuvenation (restart).
In this talk we look at a series of techniques for handling software aging in the context of SOA applications. First we discuss a method for low-overhead measurement of aging processes in a production environment. We then describe two approaches to modeling aging: spline-based models for deterministic aging profiles, and models using statistical learning for short-term prediction that are robust against transient anomalies. Such models can be applied to adaptive rejuvenation and performance optimization, which we discuss in the last part. In particular, we examine how virtualization techniques and optimization of rejuvenation schedules can guarantee an "any-time" performance level of an aging application.
The experimental data used for this work were obtained from a Java version of the TPC-W benchmark instrumented with a fault injector and from the Apache Axis application server suffering from natural aging.
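A minimal sketch of trend-based rejuvenation scheduling: fit a linear trend to sampled free-memory readings and plan a restart before the predicted exhaustion time. This is deliberately simpler than the spline and statistical-learning models discussed in the talk, and the safety margin is an illustrative choice.

```python
def schedule_rejuvenation(times, free_mem, safety=0.2):
    """Least-squares line through (time, free memory) samples;
    returns the recommended restart time, or None if no aging trend."""
    n = len(times)
    mt, mf = sum(times) / n, sum(free_mem) / n
    slope = (sum((t - mt) * (f - mf) for t, f in zip(times, free_mem))
             / sum((t - mt) ** 2 for t in times))
    if slope >= 0:
        return None                      # resources not being depleted
    t_exhaust = mt - mf / slope          # fitted line hits zero here
    # Restart a safety fraction of the remaining horizon early.
    return t_exhaust - safety * (t_exhaust - times[-1])

times = list(range(10))                  # seconds
free = [100 - 5 * t for t in times]      # a steady 5 MB/s leak
assert abs(schedule_rejuvenation(times, free) - 17.8) < 1e-9
assert schedule_rejuvenation([0, 1, 2], [10, 11, 12]) is None
```

A production version would, as the abstract notes, have to cope with transient anomalies in the measurements rather than trusting a single straight-line fit.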
In this talk, I present the Virtual Private Machine (VPM) framework.
In the VPM framework, tasks are assigned shares of a multi-core
system's shared hardware resources. A complete set of resource
assignments forms a Virtual Private Machine. A task assigned a VPM
achieves a minimum level of performance regardless of the other tasks
in the system, i.e., the VPM provides the task with performance
isolation. VPM policies, implemented primarily in software, translate
system-level performance requirements into VPM resource assignments,
and VPM mechanisms implemented in hardware enforce the VPM
assignments.
To illustrate the potential of the VPM framework, a set of VPM
policies are proposed. The policies translate applications'
system-level QoS requirements into VPM assignments and use the
system's excess service to optimize aggregate performance. I
illustrate that, in combination, the proposed VPM policies and
mechanisms can provide a high degree of QoS or significantly improve
aggregate performance. I show that QoS and aggregate performance
objectives often conflict, and through the use of the VPM abstraction
an OS developer or system administrator can tune the proposed system
to achieve the desired balance of both QoS and aggregate performance.
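A toy policy in the spirit of the above (the resource, names, and numbers are hypothetical, not from the talk): grant each task its stated minimum share of, say, 16 cache ways, which provides the isolation, and then distribute the excess service to improve aggregate performance.

```python
def assign_vpm(min_shares, total=16):
    """Grant each task its minimum resource share (performance
    isolation), then round-robin the excess ways among tasks."""
    need = sum(min_shares.values())
    assert need <= total, "infeasible QoS requirements"
    excess = total - need
    tasks = sorted(min_shares)           # deterministic ordering
    shares = dict(min_shares)
    for i in range(excess):
        shares[tasks[i % len(tasks)]] += 1
    return shares

reqs = {"db": 6, "web": 4, "batch": 2}   # hypothetical minimum ways
s = assign_vpm(reqs)
assert sum(s.values()) == 16             # every way is assigned
assert all(s[t] >= m for t, m in reqs.items())   # minima respected
```

A real policy would steer the excess toward whichever tasks benefit most, which is exactly where the QoS-versus-aggregate-performance tension described above appears.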
Performance analysis is a crucial step in program development for HPC systems. Due to the complexity of these architectures, optimizing program performance during the design process is impossible. Thus, programs go through a cyclic optimization process of performance bottleneck detection and program tuning. A severe limitation of most performance analysis tools used in this process is scalability. For machines such as the Altix 4700 supercomputer installed at the Leibniz Supercomputing Centre in Munich, with thousands of processors, and for future petaflop systems, this scalability problem has to be solved.
We will give an overview of the Periscope Automatic Performance Analysis Tool, which is currently under development at Technische Universität München. Periscope searches for performance bottlenecks during program execution. It consists of a set of distributed agents, each responsible for a subset of the application's processes. The agents' search is based on a formal specification of performance properties in the APART Specification Language (ASL).
Recent improvements in high-end GPUs have made it possible to perform real-time 3D visualization, such as volume rendering and 3D contour plots, of scientific data locally. Web-browser-based remote 3D visualization via visualization servers is attractive, but data transfer overhead prevents interactive operation. We propose an interactive remote 3D visualization model based on live streaming for geophysical fluids research. In this model, we use live-streamed Flash media for the browser-based operations while maintaining a minimum quality of data analysis and a minimum bit rate for the live stream. Preliminary experiments with a prototype system validate the effectiveness of the proposed model.
Partitioned Global Address Space (PGAS) languages offer an attractive,
high-productivity programming model for programming large-scale
parallel machines. PGAS languages, such as Unified Parallel C (UPC),
combine the simplicity of shared-memory programming with the
efficiency of the message-passing paradigm by allowing users control
over the data layout. PGAS languages distinguish between private,
shared-local, and shared-remote memory, with shared-remote accesses
typically much more expensive than shared-local and private accesses,
especially on distributed memory machines where shared-remote access
implies communication over a network.
This presentation will briefly describe the UPC language and show a
simple extension to the language that allows the programmer to
distribute multiple dimensions of a shared array among the threads. We
claim that this extension allows for better control of locality, and
therefore performance, in the language. This presentation will also
describe an analysis that allows the compiler to distinguish between
local shared array accesses and remote shared array accesses. Local
shared array accesses are then transformed into direct memory accesses
by the compiler, saving the overhead of a locality check at
runtime. The results demonstrate that the locality analysis is able to
significantly reduce the number of shared accesses.
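The runtime locality check that this analysis eliminates can be sketched as follows (a simplification of UPC's blocked-cyclic layout rules; the array shape and thread count are illustrative):

```python
# For a UPC shared array declared with block size B over T threads,
# element i has affinity to thread (i // B) % T. An access whose
# affinity provably equals MYTHREAD at compile time can be lowered
# to a direct memory access, skipping this check at runtime.
def affinity(i, block, threads):
    return (i // block) % threads

THREADS, BLOCK = 4, 2            # e.g. "shared [2] int a[16]" on 4 threads
assert affinity(0, BLOCK, THREADS) == 0
assert affinity(3, BLOCK, THREADS) == 1
assert affinity(8, BLOCK, THREADS) == 0      # wraps after one full cycle

MYTHREAD = 1                     # hypothetical thread asking the question
local = [i for i in range(16)
         if affinity(i, BLOCK, THREADS) == MYTHREAD]
assert local == [2, 3, 10, 11]   # only these accesses are direct loads
```

Every access the compiler can prove local by this rule saves both the affinity computation and the shared-pointer indirection, which is the source of the reduction in shared accesses reported above.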