In 1984, three of my students (Wen-mei Hwu, Steve Melvin, and Mike
Shebanow) and I conceived of what was then a revolutionary idea for
microprocessor implementation: HPS. We advocated serious branch prediction,
wide issue, breaking complex instructions into micro-ops (we called them
nodes), executing them out of order, and retiring instructions in program
order so as to achieve precise exceptions. The objective then, as it is
now, was to maximize performance by maximizing instruction and data supply
while preventing unnecessary delays in the processing core. Our first
papers were published in MICRO-18 in 1985, followed soon after by one in ISCA 1986.
The first part of this talk is to take the audience through the thought
processes we went through then, since the world really has not changed much
since then, although tomorrow we will have 10 billion, not one million,
transistors on a chip, operating at frequencies in excess of 10 GHz, not
20 MHz. Still, the objective is to make instructions and data available
as quickly as possible, and operate without unnecessary delays. The second
part of the talk is to point to what I believe will be some of the
characteristics of the microprocessor ten years from now.
A multinomial is a (non-recursive) mathematical expression in several variables consisting of a sum of product terms.
While such multinomial evaluations are typically done in software on a general-purpose processor, there are situations where a custom hardware implementation may be called for. The main motivation for the current research is Bayesian Network Multinomial (BNM) evaluation. In recent proposals, Bayesian Networks are represented using a characteristic multinomial, and the most important operation performed on a Bayesian Network, probabilistic inference, requires evaluation of this multinomial. BNMs are exponential in size, and many real-time Bayesian Networks cannot be practically solved on a general-purpose processor.
The method to be presented in this talk, named MOLE (Multinomial On-Line Evaluation), mechanizes the procedure of building an arithmetic circuit (a hardware implementation consisting of interconnected arithmetic modules) starting from the E-graph of a multinomial (evaluation graphs are efficient representations of multinomials in a factored form).
MOLE consists of the following two steps:
1. Application of various linear-system operators to an E-graph representing a multinomial to convert it into a network of linear systems, and
2. Solving the linear systems in an online fashion with a cycle time independent of the precision.
A linear-system operator (LSO) transforms a subgraph of an E-graph into an equivalent linear system. By equivalence, we mean that one of the unknowns of the linear system (typically the unknown whose index is zero) evaluates to the same value (or a power-of-two scaled value) as the expression represented by the subgraph.
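As a software-only illustration of why the factored E-graph form matters, the sketch below evaluates a small multinomial represented as a DAG; the node encoding and function names are invented for this example and are not MOLE itself, which builds hardware rather than interpreting the graph.

```python
# Hypothetical sketch (not the MOLE hardware): evaluating a multinomial
# from a DAG in factored form, analogous to the E-graph representation.
import math

def evaluate(node, values, cache=None):
    """Evaluate a factored multinomial DAG bottom-up, memoizing shared
    subgraphs so each factor is computed only once."""
    if cache is None:
        cache = {}
    if isinstance(node, str):            # leaf: a variable
        return values[node]
    if isinstance(node, (int, float)):   # leaf: a constant
        return node
    if id(node) in cache:                # shared subexpression
        return cache[id(node)]
    op, children = node
    results = [evaluate(c, values, cache) for c in children]
    out = sum(results) if op == '+' else math.prod(results)
    cache[id(node)] = out
    return out

# a*b + a*c + b*c evaluated in factored form a*(b+c) + b*c:
# four arithmetic operations instead of five.
b_plus_c = ('+', ['b', 'c'])
graph = ('+', [('*', ['a', b_plus_c]), ('*', ['b', 'c'])])
print(evaluate(graph, {'a': 2, 'b': 3, 'c': 4}))   # 2*(3+4) + 3*4 = 26
```

The savings shown here are small, but for BNMs, whose expanded sum-of-products form is exponential in size, the factored graph is what makes evaluation feasible at all.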
Over the last decade, we have developed and implemented a compiler - called
Compaan - for multi-core embedded systems, and a mapper - called Laura -
for reconfigurable platforms.
The compiler translates sequentially specified nested-loop programs to
equivalent parallel specifications. The mapper transforms the parallel
specifications to FPGA specifications.
In this talk I will briefly sketch the methods and tools, and go through
some appealing examples.
The WISCONSIN MULTIFACET PROJECT,
which I co-lead with David Wood, seeks to improve the multiprocessor
servers that form the computational infrastructure for Internet web
servers, databases, and other demanding applications. Work focuses
on using the transistor bounty provided by Moore's Law to improve
multiprocessor performance, cost, and fault tolerance, while also making
these systems easier to design and program. This talk quickly summarizes our work and then focuses on one result: A Flight Data Recorder.
Debugging is easier when one can faithfully replay a buggy execution.
This is not the default for multithreaded workloads, due to race
non-determinism. To facilitate deterministic debugging, we have
developed a practical low-overhead hardware recorder for cache-coherent
multiprocessors, called Flight Data Recorder (FDR). Like an aircraft
flight data recorder, FDR continuously records selected execution
events, even on deployed systems, logging the execution for post-mortem
analysis. FDR piggybacks on cache-coherence hardware and can capture
the last one second of the execution with modest (less than 2%) slowdown.
References:
http://www.cs.wisc.edu/multifacet/papers/isca03_flight_data_recorder.pdf
http://www.cs.wisc.edu/multifacet/papers/pldi05_serializability_violation_detector.pdf
Fast and effective memory management is crucial for many applications, including web servers, database managers, and scientific codes. However, current memory managers do not provide adequate support for these applications on modern architectures, severely limiting their performance, scalability, and robustness.
In this talk, I describe how to design memory managers that support high-performance applications. First, I show how our Heap Layers infrastructure addresses the software engineering challenges of building efficient memory managers. Next, I show why current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. I describe Hoard, a fast, provably scalable general-purpose memory manager that solves these problems. As an example of its effectiveness, British Telecom uses Hoard to improve real application performance fivefold. Finally, I show how to support server applications that must deal with runaway modules, terminated connections, or aborted transactions. I present a generalization of regions and heaps called "reaps" that supports these applications with higher performance and lower memory consumption than current approaches.
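The interface a reap exposes can be illustrated with a toy model. This is not the actual reap implementation (which manages raw memory, not Python objects); it is only a sketch of the semantics: heap-like individual frees combined with region-like bulk teardown.

```python
# Illustrative sketch of "reap" semantics (a region/heap hybrid), not
# the real implementation: objects can be freed one at a time like a
# heap, or the whole reap can be torn down at once like a region.

class Reap:
    def __init__(self):
        self.live = {}        # object id -> payload
        self.next_id = 0

    def alloc(self, payload):
        """Heap-like allocation of one object inside the reap."""
        oid = self.next_id
        self.next_id += 1
        self.live[oid] = payload
        return oid

    def free(self, oid):
        """Heap-like individual free (a plain region cannot do this)."""
        del self.live[oid]

    def clear(self):
        """Region-like bulk free: drop every surviving object at once,
        e.g. when a server aborts a transaction."""
        self.live.clear()

r = Reap()
a = r.alloc("request buffer")
b = r.alloc("parse tree")
r.free(a)                 # individual free, as in a heap
r.clear()                 # bulk free on transaction abort, as in a region
print(len(r.live))        # 0
```

The point of the hybrid is that a server can reclaim everything belonging to a runaway module or aborted transaction in one cheap operation, without giving up ordinary per-object frees along the way.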
As computer systems become increasingly present in our lives, more human activities depend on their availability. Human intervention is too costly and unacceptably slow when monitoring and repairing computer systems must be done quickly and reliably, regardless of scale, network availability, or system impairment.
In this talk, I propose to build defensive computer systems by augmenting their hardware or software architecture with trusted intelligent backdoors. An intelligent backdoor can be programmed to perform automated observation and intervention on a computer system without involving its operating system, and can communicate with other backdoors over private networks. Backdoors can be realized in hardware using a programmable network interface or in software over a virtual machine monitor. In this talk, I will present our results in prototyping a backdoor-based architecture for remote repairing and recovery, and I will outline our plans to further exploit its potential.
The evolution of high-performance microprocessors is fast approaching
several significant inflection points.
First, the marginal utility of additional single-core complexity is now
rapidly
diminishing due to a number of factors. The increase in instructions per
cycle from increases in sizes and numbers of functional units has plateaued.
Meanwhile, the increasing sizes of functional units and cores are beginning
to have significant negative impacts on pipeline depths and the
scalability of processor clock cycle times.
Second, the power consumption of high-performance microprocessors has increased
rapidly over the last two decades, even as device switching energies
have been significantly reduced by supply voltage scaling. However future
voltage scaling will be limited by minimum practical threshold voltages.
Current high-performance microprocessors are already near market limits
of acceptable power dissipation. Thus scaling microprocessor performance
while maintaining or even reducing overall power dissipation
without the benefit of appreciable further voltage scaling
will prove especially challenging.
In this keynote talk we will discuss these issues and propose likely
scenarios for the future evolution of high-performance microprocessors.
We present RegionScout, a technique for reducing power and bandwidth demand
in small-scale, snoopy coherent symmetric multiprocessors (SMPs). We will
briefly discuss the technological and market tradeoffs that make
snoopy-coherent SMPs increasingly attractive for a variety of
applications ranging from general purpose servers to embedded systems.
Memory requests in SMPs are expensive in terms of power and bandwidth
since they are broadcast to all nodes. RegionScout exploits common program
behavior to dynamically eliminate many broadcasts thus reducing power
and bandwidth demand. We define a region to be a contiguous, aligned
memory area whose size is a power of two, and observe that many requests
find that no other node caches a block in the same region even for regions
as large as 16K bytes. RegionScout is a family of simple filter
mechanisms that dynamically detect most accesses to non-shared regions. A
node with a RegionScout filter can determine in advance that a request will
miss in all remote nodes and thus avoid broadcasting the request. RegionScout
filters are implemented as a layered extension over existing snoop-based
coherence and they utilize very simple structures. Their operation is
completely transparent to software and the operating system. These
characteristics are made possible by utilizing imprecise information
about the regions cached in each node.
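A simplified software model of the filtering idea follows. The region size and table size are illustrative choices, and the model queries remote filters directly, whereas the real design obtains this information through the existing snoop mechanism; the key property it does preserve is that imprecision is safe, since hash collisions can only over-report sharing, never hide it.

```python
# Simplified software model of a RegionScout-style region filter.
# REGION_BITS and TABLE_SIZE are illustrative, not from the paper.

REGION_BITS = 14          # 16 KB regions
TABLE_SIZE = 256          # entries in each node's imprecise region table

def region_of(addr):
    return addr >> REGION_BITS

class Node:
    def __init__(self):
        # Count of locally cached blocks per hashed region. Collisions
        # can only over-report sharing (safe imprecision).
        self.region_counts = [0] * TABLE_SIZE

    def cache_block(self, addr):
        self.region_counts[region_of(addr) % TABLE_SIZE] += 1

    def evict_block(self, addr):
        self.region_counts[region_of(addr) % TABLE_SIZE] -= 1

    def may_cache_region(self, addr):
        return self.region_counts[region_of(addr) % TABLE_SIZE] > 0

def needs_broadcast(requester, nodes, addr):
    """Skip the snoop broadcast only when every remote filter proves
    the region is not cached anywhere else."""
    return any(n is not requester and n.may_cache_region(addr)
               for n in nodes)

nodes = [Node(), Node(), Node()]
nodes[1].cache_block(0x10000)                     # node 1 caches a block
print(needs_broadcast(nodes[0], nodes, 0x10040))  # same region -> True
print(needs_broadcast(nodes[0], nodes, 0x90000))  # different region -> False
```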
In this paper, we present two approaches to improving the execution of OpenMP applications on the IBM Cyclops multithreaded architecture. The two solutions are independent, and both aim to obtain better performance through better management of cache locality. The first solution is based on software modifications to the OpenMP runtime library to balance stack accesses across all data caches. The second solution is a small hardware modification to change the data-cache mapping behavior, with the same goal. Both solutions help parallel applications improve scalability and obtain better performance on this kind of architecture. In fact, they could also be applied to future multi-core processors. We have executed (using simulation) some of the NAS benchmarks to evaluate these proposals. The results show how, with small changes in both the software and the hardware, we achieve very good scalability in parallel applications. Our results also show that standard execution environments oriented to multiprocessor architectures can be easily adapted to exploit multithreaded processors.
In this paper, we present a system capable of improving I/O performance in an automatic way. This system is able to learn the behavior of the applications running on top of it and find the best data placement on the disk in order to improve I/O performance. The system is built from three independent modules. The first one learns the behavior of a workload in order to reproduce that behavior later on, without a new execution. The second module is a drive modeler that learns how a storage drive works, taking it as a black box. Finally, the third module generates a set of placement alternatives and uses the aforementioned models to predict the performance each alternative will achieve. We tested the system with five benchmarks; the system was able to find better alternatives in most cases and improve performance significantly (up to 225%). Most important, the predicted performance was always very accurate (less than 10% error).
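The interplay of the three modules can be sketched as follows. The workload trace, the cost model, and the candidate placements below are invented toy stand-ins, not the learned models from the paper; the point is only the structure: replay the learned workload against each candidate placement under the drive model, and keep the cheapest.

```python
# Toy sketch of the three-module structure: a learned workload trace,
# a black-box drive model, and a placement searcher combining them.

def drive_model(position_sequence):
    """Stand-in black-box drive model: cost grows with seek distance."""
    cost = 0
    for prev, cur in zip(position_sequence, position_sequence[1:]):
        cost += 1 + abs(cur - prev)      # transfer cost + seek cost
    return cost

def predict(workload, placement):
    """Replay the learned workload trace under a candidate placement."""
    return drive_model([placement[block] for block in workload])

def best_placement(workload, candidates):
    """Pick the candidate placement with the lowest predicted cost."""
    return min(candidates, key=lambda p: predict(workload, p))

workload = ["a", "b", "a", "b", "c"]          # learned access trace
candidates = [
    {"a": 0, "b": 100, "c": 200},             # original layout
    {"a": 0, "b": 1, "c": 2},                 # co-locate hot blocks
]
print(best_placement(workload, candidates))   # the co-located layout wins
```

Because every alternative is evaluated against the models rather than by re-running the application, the search over placements is cheap even when each real execution would be expensive.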
SMT processors increase performance by executing instructions
from several threads simultaneously. These
threads use the resources of the processor better by sharing
them but, at the same time, threads are competing for
these resources. The way critical resources are distributed
among threads determines the final performance. Currently,
processor resources are distributed among threads as determined
by the fetch policy that decides which threads enter
the processor to compete for resources. However, current
fetch policies only use indirect indicators of resource usage
in their decision, which can lead to resource monopolization
by a single thread or to resource waste when no thread
can use them. Both situations can harm performance and
happen, for example, after an L2 cache miss.
Traditional branch predictors exploit correlations between pattern
history and branch outcome to predict branches, but there is a stronger
and more natural correlation between path history and branch outcome.
I exploit this correlation with piecewise linear branch prediction,
an idealized branch predictor that develops a set of linear functions,
one for each program path to the branch to be predicted, that separate
predicted taken from predicted not taken branches. Taken together, all
of these linear functions form a piecewise linear decision surface.
Disregarding implementation concerns beyond a 64.25 kilobit hardware budget,
I presented this idealized branch predictor for the first Championship
Branch Prediction competition. I describe the idea behind the algorithm
as well as the tricks used to squeeze it into 64.25 kilobits while
maintaining good accuracy.
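A simplified software sketch of the idea is below. The history length is invented, the training rule and threshold follow the standard perceptron recipe, and no attempt is made at the 64.25 kilobit budget; the essential point is that the weight chosen at each history position is indexed by the address of the branch that executed there, so each program path to the branch effectively gets its own linear function.

```python
# Simplified sketch of piecewise linear branch prediction. Parameters
# are illustrative; weights are kept in a dict rather than a fixed table.

from collections import defaultdict

H = 8                      # path/history length (invented for the sketch)
THETA = 2.14 * H + 20.58   # standard perceptron training threshold

weights = defaultdict(int) # (branch_pc, path_pc, position) -> weight
path = [0] * H             # addresses of the last H branches
hist = [False] * H         # outcomes of the last H branches

def predict(pc):
    # One linear function per program path: the weight used at position
    # i depends on which branch (path[i]) was executed there.
    out = weights[(pc, 0, 0)]                     # bias weight
    for i in range(H):
        w = weights[(pc, path[i], i + 1)]
        out += w if hist[i] else -w
    return out

def update(pc, taken):
    out = predict(pc)
    if (out >= 0) != taken or abs(out) < THETA:   # perceptron training
        weights[(pc, 0, 0)] += 1 if taken else -1
        for i in range(H):
            agree = (hist[i] == taken)
            weights[(pc, path[i], i + 1)] += 1 if agree else -1
    path.pop(); path.insert(0, pc)                # shift in this branch
    hist.pop(); hist.insert(0, taken)

# Train: branch 100 is taken iff the immediately preceding branch
# (pc 200) was taken, so a path-indexed weight learns the correlation.
for t in range(200):
    outcome = (t % 2 == 0)
    update(200, outcome)
    update(100, outcome)

update(200, True)          # branch 200 taken...
print(predict(100))        # ...so the output for branch 100 is positive
```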
In recent years, processor manufacturers have converged on two types of
register file architectures. Both IBM, with its POWER series, and Intel, with
its Pentium series, use central storage for all in-flight values, which
offers high performance potential. AMD, on the other hand, uses an optimized
implementation of the Future File for its line of Opteron processors.
Both approaches have limitations that may preclude their application in future
processor implementations. The centralized register file scales poorly
in terms of power-performance. The Future File may be limited by the
requirement of distributed reservation stations and by the branch misprediction
recovery scheme.
This paper proposes to give processor design teams another choice by
combining a traditional future file architecture with the concept of a central
physical register file. This new register file is used in the "front end"
in combination with value storage in the instruction queue. Further, it is
shown that, contrary to what happens with a centralized R10k-like architecture,
the proposed architecture can use banking that scales well with the size
of structures in the processor. Only a 0.3% IPC degradation was observed
when using this technique. The energy savings due to banking are fully
utilized in our proposal. Further, the implementation is shown to integrate
well with a technique for the early release of short-lived values that we
call writeback filtering. Overall, the resulting energy-delay product is
lower than in previous proposals.
Keywords: Register File, Reservation Stations, Register Distribution,
IPC, Low Power
The performance of current parallel and distributed systems continues
to disappoint. This is due partially to the inability of applications,
compilers and computer systems to adapt to their own dynamic changes.
It is also due to the current compartmentalized approach to optimization:
applications, compilers, OS and hardware configurations are designed and
optimized in isolation. No global optimization is attempted, and the needs
of each running application do not constitute the primary optimization
consideration.
To address these problems we propose application-centric computing, or
Smart Applications (SmartApps). In the SmartApps executable, the compiler
embeds most run-time system services and an optimizing feedback loop that
monitors the application's performance and adaptively reconfigures the
application and the OS/system platform. At run-time, after incorporating
the code's input and determining the system's resources and state, the
SmartApp performs an instance-specific optimization, which is more
tractable than a global generic optimization across application, OS and
system. The overriding philosophy of SmartApps is ``measure, compare, and
adapt if beneficial.''
SmartApps is being developed in the STAPL infrastructure. STAPL (the
Standard Template Adaptive Parallel Library) is a framework for developing
highly-optimizable, adaptable, and portable parallel and distributed
applications. It consists of a relatively new and still evolving collection
of generic parallel algorithms and distributed containers and a run-time
system (RTS) through which the application and compiler interact with the
OS and hardware.
SmartApps are developed for several important computational science
applications including a discrete-ordinates code simulating subatomic
particle transport, a molecular dynamics code and a program using a novel
method to simulate protein folding and ligand binding. We use current ASCI
shared and distributed machines, networks of workstations that run under
Linux and K42 operating systems and, as of late, the Blue Gene machine.
This talk will focus on the genesis of System X at Virginia Tech and the need to support the research and development of software for clusters, supercomputers and grids.
Silicon technology will continue to provide an exponential increase in the
availability of raw transistors. Effectively translating this resource into
application performance, however, is an open challenge. Ever increasing
wire-delay relative to switching speed and the exponential cost of circuit
complexity make simply scaling up existing processor designs futile. This talk
presents an alternative to superscalar design, WaveScalar. WaveScalar is a
dataflow instruction set architecture and execution model designed for
scalable, low-complexity/high-performance processors. WaveScalar is unique in
efficiently providing unified support for single- and multi-threaded imperative
programming models that require traditional memory semantics as well as
traditional dataflow models that can exploit fine-grained dataflow parallelism.
The WaveScalar ISA is designed to run on an intelligent memory system. Each
instruction in a WaveScalar binary executes in place in the memory system and
explicitly communicates with its dependents in dataflow fashion. WaveScalar
architectures cache instructions and the values they operate on in a WaveCache,
a simple grid of ``alu-in-cache'' nodes. By co-locating computation and data
in physical space, the WaveCache minimizes long-wire, high-latency
communication.
Results for single-threaded SPEC and Mediabench applications demonstrate that
the WaveCache outperforms an aggressively configured superscalar design by 2-7
times, with ample opportunities for future optimizations. The WaveCache
achieves up to 110 IPC on Splash2 parallel programming workloads and between
100-370 IPC on dataflow kernels improving performance by a factor of 2-20 over
optimized, parallelized imperative-language versions.
I will also present preliminary area and delay results for a synthesizable RTL
model of the WaveCache's key components.
By converting thread-level parallelism to instruction level
parallelism, Simultaneous Multithreaded (SMT) processors are emerging
as effective ways to utilize the resources of modern superscalar
architectures. However, the full potential of SMT has not yet been
reached as most modern operating systems use existing single-thread or
multiprocessor algorithms to schedule threads, neglecting contention
for resources between threads. To date, even the best SMT scheduling
algorithms simply try to group threads for co-residency based on each
thread's expected resource utilization but do not take into account
variance in thread behavior. As such, we introduce architectural
support that enables new thread scheduling algorithms to group threads
for co-residency based on fine-grain memory system activity
information. The proposed memory monitoring framework centers on the
concept of a cache activity vector, which exposes runtime cache resource
information to the operating system to improve job scheduling.
Using this scheduling technique, we experimentally evaluate the overall
performance improvement of workloads on an SMT machine compared against
the most recent Linux job scheduler. This work is first motivated with
experiments in a simulated environment, then validated on a
Hyperthreading-enabled Intel Pentium 4 Xeon microprocessor running a
modified version of the latest Linux kernel.
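One way an OS scheduler might consume cache activity vectors can be sketched as follows. The vector width, the thread set, and the greedy pairing rule are invented for illustration and are not the paper's algorithm; the sketch only shows the core idea of grouping threads for co-residency so that their active cache footprints overlap as little as possible.

```python
# Illustrative sketch: co-scheduling threads on a 2-context SMT by
# pairing those whose cache activity vectors overlap least.

# One bit per cache-set group: 1 = the thread touched that group
# recently (these vectors are made up for the example).
activity = {
    "A": 0b11110000,
    "B": 0b11100000,
    "C": 0b00001111,
    "D": 0b00000111,
}

def overlap(t1, t2):
    """Number of cache-set groups both threads are actively using."""
    return bin(activity[t1] & activity[t2]).count("1")

def pair_threads(threads):
    """Pick the two-thread groupings with the least cache conflict
    (exhaustive over the three pairings of four threads)."""
    best, best_cost = None, None
    a, b, c, d = threads
    for p1 in [(a, b), (a, c), (a, d)]:
        p2 = tuple(t for t in threads if t not in p1)
        cost = overlap(*p1) + overlap(*p2)
        if best_cost is None or cost < best_cost:
            best, best_cost = (p1, p2), cost
    return best

print(pair_threads(["A", "B", "C", "D"]))
# "A" is co-scheduled with "C": their footprints are disjoint, while
# pairing "A" with "B" would make them fight over the same sets.
```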
The BlueGene/L supercomputer will consist of 65,536 dual-processor compute
nodes interconnected by two high-speed networks: a three-dimensional
torus network and a tree topology network. Each compute node can only
address its own local memory, making message passing the natural
programming model for BlueGene/L. In this talk I will present our
implementation of MPI for BlueGene/L. I will discuss how we
leveraged the machine's architectural features to arrive at an
efficient implementation of MPI.
The presentation introduces the Microprocessor Technology Labs at Intel
as a start. Moore's Law has led Intel microprocessors to success for
decades. Transistor densities will continue to increase. However, device
speed scaling will not follow historical trends, and leakage power
remains an issue. Meanwhile, new usage models and workloads will demand
greater performance. We cannot run business as usual: we need to develop
uArchitecture features more effectively under power constraints. Software
will play an ever more important role in microprocessor design.
Emerging applications like recognition, mining, and synthesis will become
the dominant applications of the next generation. System software has been
changing its landscape as well. These are challenges and opportunities for
microprocessor designers and researchers to explore new uArchitecture
features that meet the needs of the new applications and system software.
The presentation will give a couple of examples to show how system software
enables HW design. It will show a potential direction of uArchitecture for
such emerging applications, and will describe the emerging paradigms of
programming systems and their impact on uArchitecture design.