Transactional memory promises to reduce the programming effort for multi-core
processors by avoiding the serialization that coarse-grain locking imposes on
critical sections. However, transactional memory schemes that rely on lazy
conflict resolution squash one of the transactions involved in a data race and
force it to re-execute from the very beginning, which for long-running
transactions can discard a large amount of useful work. Moreover, such schemes
are prone to starvation, in which repeated misspeculations severely prolong
the execution time of parallel programs.
In this talk, I will present a number of techniques that can help reduce the
penalty associated with misspeculations in transactional memory systems that
use lazy conflict resolution. One technique records which locations are
involved in data races and uses this information in future invocations of the
transaction to insert checkpoints, so that re-execution need not start from
the very beginning. Another technique records the number of misspeculations
each transaction has encountered, a high count being a symptom of a
potentially starved transaction. By giving such transactions priority to
commit, I show that starvation can be avoided. Experimental results
demonstrate that the proposed techniques can significantly improve the
performance of parallel programs designed for transactional memory systems.
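The commit-priority idea can be sketched in a few lines. The arbiter below is an illustrative toy of ours, not the actual hardware mechanism:

```python
# Illustrative sketch of misspeculation-count-based conflict arbitration
# in a lazy-conflict-resolution TM. All names are hypothetical.

class Transaction:
    def __init__(self, name):
        self.name = name
        self.misspeculations = 0  # times this transaction was squashed

def resolve_conflict(committer, conflicting):
    """On a data race detected at commit time, squash the transaction
    with the lower misspeculation count, so a repeatedly squashed
    (potentially starved) transaction eventually wins and commits."""
    loser = min(committer, conflicting, key=lambda t: t.misspeculations)
    winner = committer if loser is conflicting else conflicting
    loser.misspeculations += 1  # loser re-executes; its priority grows
    return winner, loser

a, b = Transaction("A"), Transaction("B")
b.misspeculations = 3  # B has already been squashed three times
winner, loser = resolve_conflict(a, b)
print(winner.name)  # B commits: its high count gives it priority
```

In a real design the counter would live in the transaction's hardware context and be consulted by the commit arbiter; the point of the sketch is only the priority rule.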
Graphics Processing Units (GPUs) are now powerful and flexible systems that
are being adapted to purposes other than graphics calculations (General
Purpose computation on GPUs, or GPGPU). We present a prototype to be
integrated into simulation codes that estimate temperatures, velocities,
and pressures in the design of next-generation solar receivers. Such codes
will delegate to our GPU contribution the accurate computation of heat
transfer due to radiation. We use Monte-Carlo line-by-line ray tracing
through finite volumes, which amounts to data-parallel arithmetic
transformations on large data structures. Our prototype is inspired by the
source code of GPUBench. Our results on two recent graphics cards (an Nvidia
7800GTX and an ATI RX1800XL) show speedups higher than 400x over CPU
implementations, while leaving most CPU computing resources available.
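A minimal CPU-side sketch of this kind of computation, assuming a 1-D stack of finite volumes with given absorption coefficients (a toy setup of ours, not the actual solar-receiver model); each ray is independent, which is what makes the computation map well to the GPU:

```python
# Monte-Carlo ray tracing through finite volumes (1-D toy): sample an
# optical depth per ray and march through cells until it is exhausted.

import math
import random

def absorbed_per_cell(kappa, dx, n_rays=50_000, seed=1):
    """Estimate the fraction of rays absorbed in each finite volume.
    kappa: absorption coefficient per cell; dx: cell width.
    Returns (per-cell absorbed fractions, escaped fraction)."""
    rng = random.Random(seed)
    counts = [0] * len(kappa)
    escaped = 0
    for _ in range(n_rays):
        tau = -math.log(rng.random())  # sampled optical depth to absorption
        for i, k in enumerate(kappa):
            cell_tau = k * dx          # optical thickness of this cell
            if tau <= cell_tau:
                counts[i] += 1         # ray absorbed in cell i
                break
            tau -= cell_tau
        else:
            escaped += 1               # ray left the domain
    return [c / n_rays for c in counts], escaped / n_rays

fracs, esc = absorbed_per_cell([0.5, 0.5], 1.0)
assert fracs[0] > fracs[1]  # the first cell shields the second
```

On the GPU, the outer per-ray loop becomes the data-parallel dimension, with each thread tracing one ray.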
Grid computing, one of the latest buzzwords in the ICT industry, is emerging as a new paradigm for Internet-based parallel and distributed computing. It enables the sharing, selection, and aggregation of geographically distributed autonomous resources, such as computers (PCs, servers, clusters, supercomputers), databases, and scientific instruments, for solving large-scale problems in science, engineering, and commerce. It leverages existing IT infrastructure to optimize compute resources and manage data and computing workloads. The developers of Grids and Grid applications need to address numerous challenges: security, heterogeneity, dynamicity, scalability, reliability, service creation and pricing, resource discovery, resource management, application decomposition and service composition, and quality of service. A number of projects around the world are developing technologies that help address one or more of these challenges. To address some of these challenges, the Gridbus Project at the University of Melbourne has developed grid middleware technologies that support rapid creation and deployment of eScience and eBusiness applications on enterprise and global Grids.
In this seminar, we present the technological evolution and key challenges in building and managing Utility Grids. We place emphasis on the fundamental challenges of the Grid economy and on how to design and develop Grid technologies and applications capable of dynamically leasing services of distributed resources at runtime depending on their availability, capability, performance, cost, and users' quality-of-service requirements. We then introduce the Gridbus Project's R&D efforts, with a focus on a distributed computational economy for effective management of resources. We briefly present the various components of the Gridbus Toolkit and then discuss, in detail, the Gridbus service broker, which supports composition and deployment of applications on utility Grids. Case studies on the use of Gridbus middleware in the creation of various Grid applications (such as distributed molecular docking, high energy physics, and natural language processing) and their deployment on national and international Grids will also be highlighted.
The power density of microprocessors is increasing with every new process generation, resulting in higher maximum chip temperatures.
High chip temperatures greatly affect reliability, raise leakage power, and make cooling more expensive.
Temperature projection, both at the design stage and during the lifetime of the chip, is necessary for the development of temperature-aware techniques to control chip temperature.
In this talk, we first present a new transient thermal simulation algorithm, TILTS, which is much faster than conventional simulation algorithms. Based on TILTS, we developed a lightweight runtime temperature monitoring tool, called Temptor, that uses internal performance counters to estimate runtime chip temperature distributions with negligible overhead and good precision.
Then, we present a temperature-aware floorplanning algorithm that reduces the maximum block temperature on the chip by modifying the placement of blocks. The floorplans of the Alpha and Pentium Pro microprocessors are used to illustrate the benefits of the floorplanning algorithm.
ROCCC (the Riverside Optimizing Configurable Computing Compiler) is an
optimizing C-to-VHDL compiler used to compile routines written in a
subset of C into application-specific circuits on an FPGA. ROCCC
incorporates several powerful parallelizing transformations targeted
at code generation for FPGAs and can achieve performance comparable
to hand-coded VHDL. We have demonstrated speedups ranging from 800x to
several thousand times over a 1.6 GHz Itanium 2 on an SGI Altix 4700
with one RASC RC 100 blade.
With uniprocessor performance increases leveling off, and with the semiconductor industry
moving towards multicore processors, novel ways of parallelizing the
execution of a variety of computing applications will be needed.
To be viable, a parallel execution model for future multicore architectures should not
only mesh well with the programming styles that we expect in the future,
but perhaps even leverage the characteristics of such programming styles.
This talk will review proposed parallel execution models and
present Program Demultiplexing (PD), an execution model that creates concurrency in
sequential programs by "demultiplexing" methods (functions or subroutines).
Call sites of a demultiplexed method in the program are associated with handlers that allow
the method to be separated from the sequential program and executed on an auxiliary processor.
The demultiplexed execution of a method (and its handler) is speculative and occurs when the
inputs of the method are (speculatively) available, which is typically far in advance of
when the method is actually called in the sequential execution.
A trigger, composed of predicates that are based on program counters and memory write addresses,
launches the (speculative) execution of the method on another processor.
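As a toy software model (the names and structure are our illustration, not the actual PD hardware), a trigger can be viewed as a predicate over observed writes to the method's input addresses:

```python
# Toy model of a PD trigger: once every input address of a
# demultiplexed method has been (speculatively) written, launch the
# method on an auxiliary processor (here, just record the launch).

class Trigger:
    def __init__(self, method_name, input_addrs):
        self.method_name = method_name
        self.pending = set(input_addrs)  # input writes still awaited
        self.fired = False

def observe_write(triggers, addr, launched):
    """Called for every memory write the trigger hardware watches."""
    for t in triggers:
        if not t.fired:
            t.pending.discard(addr)
            if not t.pending:            # all inputs available: fire early
                t.fired = True
                launched.append(t.method_name)

triggers = [Trigger("checksum", {0x10, 0x18})]
launched = []
observe_write(triggers, 0x10, launched)
assert launched == []                    # one input still missing
observe_write(triggers, 0x18, launched)
assert launched == ["checksum"]          # runs ahead of its call site
```

The real trigger also predicates on program counters; the sketch shows only the memory-write side of the idea.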
Results from our initial experience with a simulation model for PD will be presented.
For eight integer benchmarks from the SPEC2000 suite (programs written in C with no explicit concurrency,
and no incentive for the programmer to create any), we achieve a harmonic mean speedup of 1.8x on four processors.
We believe that PD can achieve further speedup when opportunities for concurrency are exposed to
programmers and/or on applications that use modern object-oriented languages.
Given time, the talk will touch on some other hardware trends that are likely to
have implications for how software is written in the future.
Server virtualization opens up a range of new possibilities for datacenter management through the availability of new automation mechanisms that can be exploited to manage software within virtual machines. Server virtualization permits innovative ways of distributing software inside pre-installed, pre-wired, and pre-configured "freeze-dried" disk images, which simplifies the process of software installation. Server virtualization also enables powerful and flexible autonomic control, through management software that maintains the system in a desired state in the face of changing workload and demand. In this talk, I will discuss a project that aims to simplify systems management with the use of server virtualization technologies.
I will introduce the concept of a Virtual Software Appliance and discuss the tools we have built to manage the entire life-cycle of a Virtual Software Appliance.
Fine-grained program power behavior is useful in both evaluating power
optimizations and observing power optimization opportunities. Detailed
power simulation is time consuming and often inaccurate. Physical power
measurement is faster and objective. However, fine-grained measurement
generates enormous amounts of data in which locating important features
is difficult, while coarse-grained measurement sacrifices important detail.
We present a program power behavior characterization infrastructure that
identifies program phases, selects a representative interval of execution
for each phase, and instruments the program to enable precise power
measurement of these intervals to get their time-dependent power behavior.
We show that the representative intervals accurately model the fine-grained
time-dependent behavior of the program. They also accurately estimate the
total energy of a program. Our compiler infrastructure allows for easy
mapping between a measurement result and its corresponding source code.
We improve the accuracy of our technique over previous work by using
edge vectors, i.e., counts of traversals of control-flow edges, instead
of basic block vectors, as well as incorporating event counters into our
phase classification.
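The distinction between basic block vectors and edge vectors can be illustrated with a toy block trace (the block identifiers are ours):

```python
# Edge vectors vs. basic block vectors over an interval of execution.

from collections import Counter

def basic_block_vector(trace):
    """Count executions of each basic block in a trace."""
    return Counter(trace)

def edge_vector(trace):
    """Count traversals of each control-flow edge (pairs of
    consecutively executed blocks); this preserves path information
    that block counts alone discard."""
    return Counter(zip(trace, trace[1:]))

# Two intervals with identical block counts but different paths:
t1 = ["A", "B", "C", "A", "B", "C"]
t2 = ["A", "B", "B", "C", "A", "C"]
assert basic_block_vector(t1) == basic_block_vector(t2)
assert edge_vector(t1) != edge_vector(t2)  # edges tell them apart
```

Phases that look identical under block counts can thus be separated by edge counts, which is the intuition behind the accuracy improvement.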
We validate our infrastructure through the physical power measurement of
10 SPEC CPU 2000 integer benchmarks on an Intel Pentium 4 system. We show
that using edge vectors reduces the error of estimating total program energy
by 35% over using basic block vectors, and using edge vectors plus event
counters reduces the error of estimating the fine-grained time-dependent
power profile by 22% over using basic block vectors.
A human being's knowledge accumulates according to what happens in their
environment, and sources of information play an important role in this
process; one does not start from zero, and even an animal is born with prior
knowledge. Knowledge grows by adding new concepts or by associating them with
those already held. Although outside information can contradict or confuse a
human being, humans have the tools to resolve such conflicts in some way. This
accumulated body of information can be called a person's ontology.
Ontologies can also be structured and defined in computers. This work focuses
on the merging of ontologies between computers. During such a merge, the same
situations can arise as in a person; the difference is that machines lack
common sense. The challenges are to perform the fusion automatically, to keep
it from halting in the face of the problems that arise (redundancies,
different levels of description, synonyms, homonyms...), and to make the
result as close as possible to the natural fusion of knowledge in a human
being.
There is existing work on merging ontologies, but it does so
semi-automatically (an ontology editor that offers suggestions to the user).
In principle, it is also possible to write algorithms that merge ontologies
represented in formal (logical) languages. However, the ontologies must be
consistent in order to be so represented, and unfortunately real ontologies
are inconsistent and incomplete.
This work presents an automatic and robust ontology merging process:
automatic because the computer detects and resolves the problems that appear
during the merge, and robust because it performs the merge even when the
ontologies are mutually inconsistent or represent information from different
angles, at different levels of detail, with homonyms, or using undefined
words. The efficiency of the fusion algorithm is demonstrated through several
real examples with documents obtained from the Internet, whose ontologies
were built manually and merged automatically. The results showed only a
slight margin of error compared with a manual fusion performed by a user
expert in the subject of the document.
In this talk I will give an overview of NoC architectural concepts and then discuss our current research on modeling and improving all
aspects of an NoC architecture. Our emphasis is on finding an efficient replacement for bus-based architectures for SoC designs with very
large numbers of IPs, where the scalability of the design renders current approaches impractical because of the limitations of wire delay
and other factors related to 65 nm technology and beyond.
The Linux timekeeping architecture is radically changing. In this talk we briefly introduce the timekeeping hardware devices and show how two different versions of Linux 2.6, namely the current 2.6.20 and the forthcoming 2.6.21, exploit them. Furthermore, we present our own ideas on how the Linux timekeeping architecture might evolve in the near future.
Virtual Private Machines (VPMs) provide a framework for implementing
Quality of Service (QoS) policies in CMP-based computer systems. VPMs
incorporate microarchitecture mechanisms that allow shares of hardware
resources to be allocated to concurrently executing threads. VPMs can
thereby provide applications with an upper bound on execution time
regardless of other thread activity. Virtual Private Caches (VPCs) are an
important element of VPMs, and VPC hardware consists of two major
components: a VPC Arbiter, which manages shared resource bandwidth, and a
VPC Capacity Manager. Both the VPC Arbiter and the VPC Capacity Manager
provide minimum service guarantees that, when combined, achieve QoS within
the cache.
Simulation-based evaluation shows that conventional cache sharing policies
allow concurrently executing threads to affect each other significantly. In
contrast, VPCs meet their QoS performance objectives and have a fairness
policy for distributing excess resources that is amenable to
general-purpose multi-threaded systems. On a CMP running heterogeneous
workloads, VPCs improve throughput by eliminating negative interference:
VPCs improve average workload performance by 14%, and the performance of
the slowest program in a workload improves by an average of 25%.
Packet forwarding is a memory-intensive application requiring multiple
accesses through a trie structure. To process packets at line rates,
high-performance routers need to forward millions of packets every second.
Even with an efficient lookup algorithm like the LC-trie, each packet needs
up to 5 memory accesses. Earlier work shows that a single cache for the
nodes of an LC-trie can reduce the number of external memory accesses. We
propose a Heterogeneous Segmented Cache Architecture (HSCA) for the packet
forwarding application that exploits the significantly different locality
characteristics of accesses to the level-one and lower-level nodes of an
LC-trie. Further, we improve the hit rate of the cache for level-one nodes,
which are accessed more than 80% of the time, by introducing a novel
two-level-mapping-based cache placement scheme that reduces conflict misses.
Performance results reveal that our base HSCA scheme reduces the number of
misses incurred with a unified scheme by as much as 25%, leading to a 32%
speedup in average memory access time. Two-level mapping further enhances
the performance of the base HSCA by up to 13% leading to an overall
improvement of up to 40% over the unified scheme.
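A toy model of the access pattern HSCA exploits (our illustration; a real LC-trie's level-one node is a large multibit array, not a binary node):

```python
# Toy longest-prefix-match trie that counts memory accesses, split into
# the two segments HSCA caches separately: level-one nodes versus
# lower-level nodes. Structure and names are illustrative only.

class Node:
    __slots__ = ("children", "nexthop")
    def __init__(self):
        self.children = {}
        self.nexthop = None

def insert(root, prefix_bits, nexthop):
    node = root
    for bit in prefix_bits:
        node = node.children.setdefault(bit, Node())
    node.nexthop = nexthop

def lookup(root, addr_bits, access_counts):
    """Walk the trie; each node visited is one memory access, counted
    in bucket 0 (level-one) or bucket 1 (all lower levels)."""
    node, best, level = root, None, 0
    for bit in addr_bits:
        if bit not in node.children:
            break
        node = node.children[bit]
        access_counts[min(level, 1)] += 1
        if node.nexthop is not None:
            best = node.nexthop  # longest match so far
        level += 1
    return best

root = Node()
insert(root, "10", "hop-A")
insert(root, "1011", "hop-B")
counts = [0, 0]
assert lookup(root, "101100", counts) == "hop-B"
assert counts == [1, 3]  # one level-one access, three lower-level
```

Keeping per-segment counts like this across a packet trace is how one measures the level-one versus lower-level access split that motivates segmenting the cache.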
Environment: Technology
- The process technology roadmap is changing: minimum Vcc does not scale anymore.
- For business reasons, product die size will stay constant when moving from process N to process N+1.
- The above technology trends will lead to a 1.5x increase in product power (assuming usage of all die functions) when moving from process N to process N+1. Dynamic Voltage Scaling (DVS) has already reached its limit.
- The implication of these power constraints: not all die functions can be active simultaneously.
Environment: Media applications
- Media (and of course graphics) will become a major thrust in future PC business.
-- The environment of performance-hungry media applications
-- The gain-bandwidth product and its performance/power implications
The implications of this environment:
- Homogeneous/symmetric (e.g., CMP) cores will provide only a temporary solution to the power problem.
- Application-specific heterogeneous functions are highly power efficient.
- Non-simultaneous operation of these heterogeneous functions can be used to overcome the power problem.
Conclusions
In this talk we examine how high performance computing has changed over the
last ten years and look toward the future in terms of trends. These changes
have had, and will continue to have, a major impact on our software. A new
generation of software libraries and algorithms is needed for the effective
and reliable use of (wide-area) dynamic, distributed, and parallel
environments. Some of the software and algorithm challenges have already
been encountered, such as management of communication and memory hierarchies
through a combination of compile-time and run-time techniques, but the
increased scale of computation, depth of memory hierarchies, range of
latencies, and increased run-time environment variability will make these
problems much harder.
We will focus on three themes:
- Automatically tuned linear algebra software
- Redesign of software to fit multicore architectures
- The importance of fault tolerance
Applying the right compiler optimizations to a particular program can have a
significant impact on program performance. Due to the non-linear interaction
of compiler optimizations, however, determining the best setting is
non-trivial. Several proposed techniques search the space of compiler
options to find good solutions; however, such approaches can be expensive.
This paper proposes a different approach that uses performance counters as a
means of determining good compiler optimization settings. This is achieved
by learning a model off-line, which can then be used to determine good
settings for any new program. We show that such an approach outperforms the
state of the art and is two orders of magnitude faster on average.
Furthermore, we show that our performance-counter-based approach outperforms
techniques based on static code features. Using our technique we achieve a
17% improvement over the highest optimization setting of the commercial
PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite on a
recent AMD Athlon 64 3700+ platform.
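As an illustration of the idea (with invented counter values and flag settings; the paper's actual model and features differ), a nearest-neighbor predictor over counter vectors might look like:

```python
# Predict good optimization flags for a new program by finding the
# previously-characterized program with the most similar performance
# counter readings. Data and flag strings below are made up.

def nearest(train, query):
    """train: list of (counter_vector, good_flags); return the flags of
    the training program whose counters are closest (squared Euclidean
    distance)."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, query))
    return min(train, key=lambda t: dist(t[0]))[1]

# counters: (cache_miss_rate, branch_miss_rate, ipc) -- illustrative
train = [
    ((0.20, 0.01, 0.8), "-O3 -funroll-loops"),    # memory-bound program
    ((0.01, 0.08, 1.5), "-O2 -fno-unroll-loops"), # branchy program
]
print(nearest(train, (0.18, 0.02, 0.9)))  # picks the memory-bound flags
```

The key point is that one profiling run of the new program replaces an expensive search over the optimization space.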
In this talk I will present our ongoing work on a
simple and practical automatic run-time adaptation
technique for statically compiled programs with varying context
and behavior. Unlike other, more complex dynamic recompilation frameworks,
our technique relies on function/loop versioning and on
static low-overhead monitoring and adaptation routines.
We extend our previous work on adaptation techniques for regular
numerical codes with stable phases (presented at HiPEAC 2005)
to codes with any (irregular) behavior by using time-slot
run-time performance monitoring and statistical selection of
appropriate versions. We are currently implementing this technique
in the gcc compiler and plan to make it publicly available. We use
it for continuous program optimization, for speeding up iterative
optimization, and for auto-tuning of libraries.
We also plan to use this technique for OS kernel adaptation
and for program auto-parallelization.
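A minimal sketch of the versioning-plus-monitoring idea, assuming a software-only setting with placeholder versions (the actual gcc implementation differs):

```python
# Function versioning with a low-overhead monitoring window: time each
# version a few times, then commit to the fastest and dispatch to it
# directly afterwards. Versions below are placeholders.

import time

def make_adaptive(versions, trials_per_version=3):
    timings = {f: [] for f in versions}
    state = {"chosen": None,
             "queue": [f for f in versions for _ in range(trials_per_version)]}

    def adaptive(*args):
        if state["chosen"] is not None:      # fast path after monitoring
            return state["chosen"](*args)
        f = state["queue"].pop()
        t0 = time.perf_counter()
        result = f(*args)
        timings[f].append(time.perf_counter() - t0)
        if not state["queue"]:               # monitoring window over
            state["chosen"] = min(
                timings, key=lambda g: sum(timings[g]) / len(timings[g]))
        return result

    return adaptive

# Example: two versions of the same computation
def loop_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

def closed_form(n):
    return n * (n - 1) // 2

summer = make_adaptive([loop_sum, closed_form])
assert all(summer(1000) == 499500 for _ in range(10))
```

The monitoring cost is a handful of timed calls per version; after that, dispatch is a single indirect call, which is what makes the scheme practical for statically compiled code.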
Multicore CPUs are here today. The prevailing notion is that, to exploit them, we need to rewrite applications that were not originally designed to support multiple processors and make them multithreaded. Because of the difficulty of programming correct and efficient multithreaded applications (e.g., race conditions, deadlocks, and scalability bottlenecks), this is a major challenge.
This talk presents two alternative approaches that bring the power of multiple cores to today's software. The first approach focuses on building highly-concurrent client-server applications from legacy code. I present a system called Flux that allows users to take unmodified off-the-shelf *sequential* C and C++ code and build concurrent applications. The Flux compiler combines the Flux program and the sequential code to generate a deadlock-free, high-concurrency server. Flux also generates discrete event simulators that accurately predict actual server performance under load.
While the Flux language was initially targeted at servers, we have found it to be a useful abstraction for sensor networks, and I will briefly talk about our use of an energy-aware variant of Flux in a deployment on the backs of endangered turtles.
The second approach uses the extra processing power of multicore CPUs to make legacy C/C++ applications more reliable. I present a system called DieHard that uses randomization and replication to transparently harden programs against a wide range of errors, including buffer overflows and dangling pointers. Instead of crashing or running amok, DieHard lets programs continue to run correctly in the face of memory errors with high probability.
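A greatly simplified toy of the randomized-heap idea behind this style of protection (our illustration, not DieHard's actual allocator):

```python
# Randomized, over-provisioned heap: objects land in random slots, so a
# small overflow past one object is unlikely to corrupt live data, and
# the miss probability can be driven down by adding slots.

import random

class RandomizedHeap:
    def __init__(self, slots, seed=0):
        self.slots = [None] * slots   # over-provisioned: slots >> objects
        self.rng = random.Random(seed)

    def alloc(self, value):
        free = [i for i, s in enumerate(self.slots) if s is None]
        i = self.rng.choice(free)     # random placement, not bump-pointer
        self.slots[i] = value
        return i

    def overflow_from(self, index):
        """Model a one-slot buffer overflow past slot `index`; return
        whether it hit a live object."""
        victim = (index + 1) % len(self.slots)
        hit_live = self.slots[victim] is not None
        self.slots[victim] = "corrupt"
        return hit_live
```

With a few live objects scattered over many slots, most overflows land on free space, which is the probabilistic sense in which programs "continue to run correctly with high probability". Replication then votes among heaps with independent layouts.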
This is joint work with Brendan Burns, Kevin Grimaldi, Alex Kostadinov, Jacob Sorber, and Mark Corner (University of Massachusetts Amherst), and Ben Zorn (Microsoft Research).
Recent research in thread-level speculation (TLS) has proposed several
mechanisms for optimistic parallel execution of difficult-to-analyze serial
codes. Though it has been shown that TLS helps to achieve higher levels of
parallelism, evaluation of the unique performance potential of TLS, i.e.,
the performance gain that can be achieved only through speculation, has
not received much attention. In this paper, we evaluate this aspect by
separating the speedup achievable via true TLP (thread-level parallelism)
from that achievable via TLS, for the SPEC CPU2000 benchmarks. Further, we dissect the
performance potential of each type of speculation: control speculation,
data dependence speculation and data value speculation. To the best
of our knowledge, this is the first dissection study of its kind. Assuming
an oracle TLS mechanism which corresponds to perfect speculation and zero
threading overhead whereby the execution time of a candidate program
region (for speculative execution) can be reduced to zero, our study shows
that, at the loop-level, the upper bound on the arithmetic mean and
geometric mean speedup achievable via TLS across SPEC CPU2000 is 39.16%
(standard deviation = 31.23) and 18.18% respectively.
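For concreteness, the two summary statistics can be computed from per-benchmark percentage speedups like this (the sample values are invented, not the paper's data):

```python
# Arithmetic vs. geometric mean of percentage speedups: the arithmetic
# mean averages the gains directly; the geometric mean averages the
# speedup ratios and is less swayed by one large outlier.

import math

def mean_speedups(gains_percent):
    am = sum(gains_percent) / len(gains_percent)
    ratios = [1 + g / 100 for g in gains_percent]
    gm = math.exp(sum(map(math.log, ratios)) / len(ratios)) * 100 - 100
    return am, gm

# One benchmark doubles, one is unchanged:
am, gm = mean_speedups([100.0, 0.0])
# am is 50%; gm is sqrt(2) - 1, about 41.4%
```

The gap between the reported 39.16% arithmetic and 18.18% geometric means, together with the large standard deviation, indicates that a few loops account for much of the TLS potential.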
Strassen's algorithm has practical performance
benefits for architectures with simple memory
hierarchies, because it trades computationally expensive
matrix multiplications (MMs) for cheaper matrix
additions (MAs). However, it offers no advantage
on modern high-performance architectures with deep memory
hierarchies, because MAs exhibit limited data reuse.
We present an easy-to-use adaptive algorithm combining
Strassen's recursion with a highly tuned version of
the ATLAS MM. In fact, we introduce a final step in the
ATLAS installation process that determines whether
Strassen's algorithm can achieve any speedup. We present a
recursive algorithm achieving up to 30% speedup
versus ATLAS alone, and we show experimental results
for a variety of systems.
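The adaptive scheme can be sketched as follows, with NumPy's multiply standing in for the tuned ATLAS kernel and an assumed install-time cutoff:

```python
# Strassen recursion above a cutoff, tuned library multiply below it.
# The cutoff would be chosen empirically at install time, as described.

import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff or n % 2:          # base case: tuned library multiply
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive multiplications instead of eight:
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])
```

Each recursion level replaces one multiply with extra additions; whether that pays off depends on how fast the base-case multiply already is, which is exactly what the install-time check decides.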