Modern computers increasingly use multi-core architectures, ranging from clusters of homogeneous cores for high-performance computing to the heterogeneous architectures typically found in embedded systems. To program such architectures efficiently, it is important to be able to partition programs and map them onto the cores of the architecture. We believe that communication patterns need to become explicit in the source code to make parallel programs easier to analyze and partition. Extracting these patterns is difficult to automate because compiler techniques are limited in determining the effects of pointers.
We have proposed an OpenMP extension that allows programmers to explicitly declare the pointer-based data sharing between coarse-grain program parts. We present a dependency directive, expressing the input and output relations between program parts and pointers to shared data, as well as a set of runtime operations that are necessary to enforce the declarations made by the programmer. The cost and scalability of the runtime operations are evaluated using micro-benchmarks and a benchmark from the NAS parallel benchmark suite. The measurements show that the overhead of the runtime operations is small; in fact, no performance degradation is observed when using them in the NAS benchmark.
Numerous schemes for extracting the secret key out of cryptographic devices using side channel attacks have been developed.
One of the most effective side channel attacks is through maliciously injecting faults into the device and observing the
erroneous results produced by the device. In some extreme cases, a single fault injection experiment has been shown to be
sufficient for retrieving the secret key.
In this talk we describe several fault injection attacks on symmetric key and public key ciphers and outline countermeasures
that have been developed to protect cryptographic devices against such attacks. We then show that some of these
countermeasures do not provide the desired protection, and even worse, they may make other side channel attacks easier to mount.
In this talk I will present two recent works on power- and thermal-aware load-balancing techniques for massive multi-core architectures.
(a) Hardware-based load balancing for massive multi-core architectures
implementing power gating
(To appear in the IEEE Transactions on Computer-Aided Design)
Abstract:
Many-core architectures provide a computation platform with high execution
throughput, enabling them to efficiently execute workloads with a significant
degree of thread-level parallelism. The burst-like nature of these workloads
allows large power savings by power gating the idle cores. In addition, the
load balancing of threads to cores also impacts the power and thermal behavior
of the processor.
Processor implementations of many-core architectures may choose to group several cores into clusters that share the power-gating area overhead, so that a whole cluster, rather than an individual core, is power gated. However, the potential for power savings is reduced due to the coarser granularity of power gating.
In this work, several hardware-based, stateless load-balancing schemes are
evaluated for these clustered homogeneous multi-core architectures in terms of
their power and thermal behavior. All these methods can be unified into a
parameterized technique that dynamically adjusts to obtain the desired goal
(lower power, higher performance, lower hotspot temperature).
(b) Trading off higher execution latency for increased reliability in
tile-based massive multi-core architectures
(Published at the 2009 IEEE International Symposium on Quality Electronic Design)
Abstract:
Massive multi-core architectures provide a computation platform with high
execution throughput, enabling the efficient execution of workloads with a
significant degree of thread-level parallelism, like networking, DSP and
e-commerce. The burst-like nature of these workloads renders most of the cores
idle most of the time. Therefore, there is a large potential for power savings
by power gating these idle cores.
The ideal scenario from a power dissipation point of view is to execute the
requests as fast as possible so that the cores can be power gated the longest.
However, due to the exponential dependence of (static) power on temperature, a cluster of spatially close cores may consume more power than the same cores placed farther apart. Keeping cores close may well be best for performance (since each core is nearer to its neighbors' caches), but when spare cores are available on the die, executing requests on distant cores may maintain overall throughput while reducing both power and hot spots, thus increasing the processor's reliability.
In this work, the power, performance and thermal behavior of a tile-based
massive multi-core architecture is modeled and evaluated under different
workload scenarios. Under a low ingress rate of requests or low inter-core communication traffic, both higher power savings and more uniform chip wear are obtained by assigning requests to physically distant cores.
Power efficiency has constrained the growth of single-threaded
performance, but will soon also constrain the scaling of multicore
chips. In this talk, I will project how Moore's Law will affect
multicore designs, and show that energy efficiency will determine the
number of cores that we can fit on a chip, leading to a model that I
call "pinhole processing." To address the efficiency of individual cores,
I will describe the TFlex microarchitecture, a class of ultra-adaptive EDGE-based cores that can
enable dynamic heterogeneity through composability, subsuming many of
the heterogeneous multicore design points. Finally, I will offer some
thoughts on what comes after multicore.
In this talk we present several recent results obtained in the design
of parallel algorithms for dense and sparse linear algebra. The
overall goal of this research is to reformulate and redesign linear
algebra algorithms so that they are optimal in the amount of communication they perform, while retaining numerical stability.
The work here involves both theoretical investigation and practical
coding on diverse computational platforms. In the theoretical
investigation we identified lower bounds on communication for
different operations in linear algebra, where communication refers to
data movement between processors in the parallel case, and to data
movement between different levels of memory hierarchy in the
sequential case. The results obtained to date concern the LU and QR
factorizations of dense matrices. We present new algorithms that
attain the communication lower bounds (up to polylogarithmic factors),
and thus greatly reduce the communication relative to conventional
algorithms as implemented in the widely used libraries LAPACK and
ScaLAPACK. The implementation of the new algorithms on distributed
memory computers leads to significant speedups over the algorithms in
ScaLAPACK. Our current research focuses on their adaptation to the
emerging hierarchical models of clusters of multi-core processors, as
used for example in future petascale machines.
This is joint work with J. Demmel and M. Hoemmen from UC Berkeley,
J. Langou from CU Denver and H. Xiang from University Paris 6.
A key challenge in the field of computer architecture is "balanced" system design, in which computational capability is well-adjusted to the supply of data. In both academe and industry, computer architects are increasingly drawing system roadmaps which predict many-fold increases in raw computational throughput per chip -- hundreds of cores within the next three technology generations. At the same time, CMOS technologists have been warning of the "end of scaling," particularly for six-transistor SRAM. This is a disturbing forecast, since easily 50% of microprocessor silicon area is commonly occupied by SRAM caches. Reconciling these two divergent paths is the topic of this talk.
A particularly long-standing debate has surrounded one dense, resilient, on-chip storage alternative: embedded DRAM. This talk will shed light on the technology causes of the infamous memory wall, provide a tutorial on the technology behind eDRAM, and frame the use of SRAM replacements in terms of the system-level metrics of performance, capacity, and availability.
The parallelization of non-doall loops requires explicit synchronization
between threads of execution. In this regard, efficient placement of the
synchronization primitives (e.g., post and wait) plays a key role in achieving a
high degree of thread-level parallelism (TLP). We propose novel compiler
techniques to optimize this placement. Specifically, given a control flow graph
(CFG), the proposed techniques place a post as early as possible and place a
wait as late as possible in the CFG, subject to dependences. We present
evidence of the efficacy of our techniques on a real machine, using real code
kernels from the SPEC CPU benchmarks, the Linux kernel and other widely used
open source codes. Our results show that the proposed techniques yield
significantly higher levels of TLP than the state-of-the-art.
Leakage power dissipation continues to be a problem
in L2 caches. Many circuit and architectural techniques
have been proposed to mitigate this. In particular,
memory cell leakage has been dealt with quite successfully.
However, considerable leakage power is still dissipated in
the so-called SRAM peripheral circuits, e.g., decoders,
wordline and I/O drivers. This talk will discuss peripheral
leakage and techniques to reduce it based on stacking sleep
transistors. Two "static" architectural techniques to control this circuit mechanism are described. An adaptive mechanism built on top of the static techniques is then proposed and shown to achieve a 52% average L2 leakage reduction on the SPEC2K benchmarks.
Multi-Processor Systems-on-Chip (MPSoCs) are increasingly penetrating the consumer electronics market as a powerful, yet commercially viable, solution to answer the strong and steadily growing demand for scalable and high performance systems, at limited design complexity. Nevertheless, MPSoCs are prone to alarming temperature variations on the die, which seriously decrease their expected reliability and lifetime. Thus, it is critical to develop dedicated design methodologies for multi-core architectures that seamlessly address their thermal modeling, analysis and management. In this seminar, I present modeling and analysis tools for MPSoC architectures. In particular, I describe a novel thermal exploration framework based on a combined HW-SW emulation approach exploiting Field-Programmable Gate Arrays (FPGAs), which enables the accurate characterization of the thermal behavior of MPSoCs, while being three orders of magnitude faster than state-of-the-art architectural and system simulators.
Then, using this novel thermal exploration framework, I will introduce different HW-based policies for controlling thermal runaway in MPSoCs, based on dynamic frequency and voltage scaling. Finally, I will show how thermal balancing policies can be developed for MPSoCs, combining HW-based temperature control mechanisms with task migration at the operating system level.
HP Labs' COTSon simulator, based on AMD's SimNow, is a full-system simulation infrastructure. It makes it possible to simulate complete systems, ranging from multicore nodes up to full clusters of multicore nodes with complete network simulation. It has a pluggable architecture in which most components can be replaced with your own implementations, allowing researchers to use it as their simulation platform.
There are tons of simulators, so why a new one? COTSon is not just another simulator; it is a simulation infrastructure into which you can plug your own simulation modules. Our holistic approach simulates the whole system at once, because we believe that the multicore, multithreaded architectures of the future cannot be understood without taking the whole system into account, including devices and the entire operating system. Something similar can be said about disk and network research.
As a design principle, COTSon dynamically trades off accuracy for speed and vice versa, allowing researchers to focus on the interesting parts of their application as well as to perform large design-space explorations at higher speeds. Why use many tools if one suffices?
We hope COTSon becomes the de facto standard simulation infrastructure for next-generation systems simulation, and that is why we are making it freely available upon request. If you belong to any kind of research lab or university and are interested in microarchitecture, disk, network or full-system simulation, COTSon may be perfect for you.
In this talk, we provide a general description of COTSon and explain the different research challenges and solutions behind the development of our simulation infrastructure. More information about COTSon can be found at http://sites.google.com/site/hplabscotson.
Shrinking transistor sizes and a trend toward low-power processors have
caused increased leakage, high per-device variation and a larger number
of hard and soft errors. Maintaining precise digital behavior on these
devices grows more expensive with each technology generation. In some
cases, replacing digital units with analog equivalents allows similar
computation to be performed at higher speed and lower power. The units
that can most easily benefit from this approach are those whose results do
not have to be precise, such as various types of predictors. We introduce
the Scaled Neural Predictor, a highly accurate prediction algorithm that
is infeasible in a purely digital implementation but can be implemented
using analog circuitry. Our predictor uses current summation to replace the
expensive digital dot-product computation required in perceptron predictors.
We show that the analog predictor can outperform digital neural predictors
because of the reduced cost, in power and latency, of the key computations.
The analog neural predictor circuit is able to produce an accuracy equivalent
to an infeasible digital neural predictor that requires 128 additions
per prediction. The analog version, however, can run in 200 picoseconds,
with the analog portion of the prediction computation requiring less than
0.4 milliwatts at a 45 nm technology, which is negligible compared to the
power required for the table lookups in this and conventional predictors.
The computer architecture world is currently shifting toward multi-core and many-core systems. This "new" trend already appeared in the past, with only partial success. As we discuss in the talk, a major part of the past failure was the inability to use parallel systems efficiently. To avoid repeating past mistakes, the research community needs to provide solutions to several critical issues.
In my talk I will provide a short historical perspective on similar past trends, and highlight a few critical directions that research must address in order to make the new trend a greater success.
The Blue Gene/P system is the current leading solution in IBM's line of
massively parallel supercomputers, architected for orders-of-magnitude
increases in system size with significant power efficiency. BG/P succeeds
BG/L in the Blue Gene supercomputer line,
and it comes with many enhancements to the machine design as well as
new architectural features at the hardware and software levels. In
this talk, I will give an overview of the BG/P messaging software
stack with a focus on the Deep Computing Messaging Framework (DCMF)
and on the Component Collective Messaging Interface (CCMI). DCMF and
CCMI have been designed to easily support several programming
paradigms such as the Message Passing Interface (MPI), the Aggregate
Remote Memory Copy Interface (ARMCI), Charm++ and others. Besides the
production message passing runtime system designed for HPC
applications, I will also discuss some research explorations for
utilizing BG/P in non-HPC domains such as financial streaming
applications.
The shift from single to multiple core architectures means that, in
order to increase application performance, programmers must write
concurrent, multithreaded programs. Unfortunately, multithreaded
applications are susceptible to numerous errors, including deadlocks,
race conditions, atomicity violations, and order violations. These
errors are notoriously difficult for programmers to debug.
This talk presents Grace, a runtime system for safe and efficient
multithreading. Grace replaces the standard pthreads library with a
new runtime system that eliminates concurrency errors while
maintaining good scalability and high performance. Grace works with
unaltered C/C++ programs, requires no compiler support, and runs on
standard hardware platforms. I will show how Grace can ensure the
correctness of otherwise-buggy multithreaded programs, and at the
same time, achieve high performance and scalability.