Intelligent Architectures for Intelligent Machines Onur Mutlu (ETH Zurich and Carnegie Mellon University)
Computing is bottlenecked by data. Large amounts of application data
overwhelm the storage, communication, and computation capabilities of
the machines we design today. As a
result, many key applications' performance, efficiency and scalability
are bottlenecked by data movement. We describe three major
shortcomings of modern architectures in terms of 1) dealing with data,
2) taking advantage of the vast amounts of data, and 3) exploiting
different semantic properties of application data.
We argue that an
intelligent architecture should be designed to handle data well. We
show that handling data well requires designing architectures based on
three key principles: 1) data-centric, 2) data-driven, 3) data-aware.
We give several examples of how to exploit each of these
principles to design a much more efficient and higher-performance
computing system. We will especially discuss recent research that aims
to fundamentally reduce memory latency and energy, and practically
enable computation close to data, with at least two promising novel
directions: 1) performing massively-parallel bulk operations in memory
by exploiting the analog operational properties of memory, with
low-cost changes, 2) exploiting the logic layer in 3D-stacked memory
technology in various ways to accelerate important data-intensive
applications. We discuss how to enable adoption of such fundamentally
more intelligent architectures, which we believe are key to
efficiency, performance, and sustainability. We conclude with some
guiding principles for future computing architecture and system design.
Onur Mutlu is a Professor of Computer Science at ETH Zurich. He is
also a faculty member at Carnegie Mellon University, where he
previously held the Strecker Early Career Professorship. His current
broader research interests are in computer architecture, systems,
hardware security, and bioinformatics. Many techniques he has invented
over the years with his group and collaborators have influenced
industry and have been employed in commercial microprocessors and
memory/storage systems.
He obtained his PhD and MS
in ECE from the University of Texas at Austin and BS degrees in
Computer Engineering and Psychology from the University of Michigan,
Ann Arbor. He started the Computer Architecture Group at Microsoft
Research (2006-2009), and held various product and research positions
at Intel Corporation, Advanced Micro Devices, VMware, and Google. He
received the IEEE Computer Society Edward J. McCluskey Technical
Achievement Award, ACM SIGARCH Maurice Wilkes Award, the inaugural
IEEE Computer Society Young Computer Architect Award, the inaugural
Intel Early Career Faculty Award, US National Science Foundation
CAREER Award, Carnegie Mellon University Ladd Research Award, faculty
partnership awards from various companies, and a healthy number of
best paper or "Top Pick" paper recognitions at various computer
systems, architecture, and hardware security venues. He is an ACM
Fellow "for contributions to computer architecture research,
especially in memory systems", IEEE Fellow for "contributions to
computer architecture research and practice", and an elected member of
the Academy of Europe (Academia Europaea).
His computer architecture
and digital logic design course lectures and materials are freely
available on YouTube, and his research group makes a wide variety of
software and hardware artifacts freely available online. For more
information, please see his webpage.
Edge Computing: a New Disruptive Force Mahadev (Satya) Satyanarayanan (Carnegie Mellon University)
At the height of its success, Cloud Computing is yielding to Edge Computing. Why?
What is the unique value proposition of Edge Computing? As real-world deployments
of Edge Computing appear, how will the lives of end users be improved?
What new applications and capabilities will they see? Which are the
applications that run best at the edge, which run best in the cloud,
and which should straddle the edge and cloud? How do we build systems
that are seamless to the user, but leverage all the available tiers of
computing to best effect? Based on my team's decade-long exploration
of Edge Computing, I will share my insights on these questions.
Satya's multi-decade research career has focused on the challenges of performance, scalability, availability and trust in information systems that reach from the cloud to the mobile edge of the Internet. In the course of this work, he has pioneered many advances in distributed systems, mobile computing, pervasive computing, and the Internet of Things (IoT). As described in "How we created edge computing", Satya's seminal 2009 publication "The Case for VM-based Cloudlets in Mobile Computing" and the ensuing research has led to the emergence of Edge Computing (also known as "Fog Computing"). Satya is the Carnegie Group Professor of Computer Science at Carnegie Mellon University. He received the PhD in Computer Science from Carnegie Mellon, after Bachelor's and Master's degrees from the Indian Institute of Technology, Madras. He is a Fellow of the ACM and the IEEE.
Serverless in Seattle: Toward Making Serverless the Future of the Cloud Ricardo Bianchini (Microsoft Research)
The serverless computing paradigm has attractive properties, such as pay-per-use and fast scale-out. Unfortunately, it also has some key shortcomings that have so far limited its wide applicability. For example, current approaches for cold-start management incur either high latency or high resource overheads. As another example, running data-intensive applications in serverless platforms is currently slow or requires additional machinery. In this talk, I will describe our efforts towards understanding current serverless workloads, optimizing their cold-start performance and efficiency, and broadening the scope of applications that can run efficiently in serverless platforms. Some of these efforts have already started transitioning to production in Azure Functions. I will conclude the talk with some open challenges going forward.
Dr. Ricardo Bianchini received his PhD degree in Computer Science from
the University of Rochester. He then joined the faculty at the Federal
University of Rio de Janeiro, and later at Rutgers University. He is
currently a Distinguished Engineer at Microsoft, where he leads efforts
to improve the efficiency of the company's online services and
datacenters. He also manages the Systems Research Group at Microsoft
Research in Redmond. His main research interests include cloud computing,
datacenter efficiency, and leveraging machine learning to improve systems.
He has published nine award papers and received the CAREER award from
the National Science Foundation. He has given several conference keynote
talks and served on numerous program committees, including as Program
Co-Chair of ASPLOS, EuroSys, and ICDCS. He is an ACM Fellow and an IEEE Fellow.
Day 1: Tuesday, October 13, 2020
Keynote #1: Intelligent Architectures for Intelligent Machines - Onur Mutlu, ETH Zurich and Carnegie Mellon University
Session 1: Memory and AVX
Session 2: Potpourri
Day 1 Adjournment
Day 2: Wednesday, October 14, 2020
Keynote #2: Edge Computing: a New Disruptive Force - Mahadev Satyanarayanan (Satya), Carnegie Mellon University
Session 3: Filesystems and Non-Volatile Memory
Keynote #3: Serverless in Seattle: Toward Making Serverless the Future of the Cloud - Ricardo Bianchini, Microsoft Research
Session 1: Memory and AVX
Session Chair: Larry Rudolph (Two Sigma)
Memory Elasticity Benchmark Liran Funaro (Technion - Israel Institute of Technology); Orna Agmon Ben-Yehuda (University of Haifa and Technion); Assaf Schuster (Technion - Israel Institute of Technology)
Cloud computing handles a vast share of the world's computing, but it is not as efficient as it could be due to its lack of support for memory elasticity. An environment that supports memory elasticity can dynamically change the size of the application's memory while it's running, thereby optimizing the entire system's use of memory. However, this means at least some of the applications must be memory-elastic. A memory elastic application can deal with memory size changes enforced on it, making the most out of all of the memory it has available at any one time. The performance of an ideal memory-elastic application would not be hindered by frequent memory changes. Instead, it would depend on global values, such as the sum of memory it receives over time.
Memory elasticity has not been achieved thus far due to a circular dependency problem. On the one hand, it is difficult to develop computer systems for memory elasticity without proper benchmarking driven by actual applications. On the other hand, application developers have no incentive to make their applications memory-elastic when real-world systems neither support this property nor incentivize it economically.
To overcome this challenge, we propose a system of memory-elastic benchmarks and an evaluation methodology for an application's memory elasticity characteristics. We validate this methodology by using it to accurately predict the performance of an application, with a maximal deviation of 8% on average. The proposed benchmarks and methodology have the potential to help bootstrap computer systems and applications towards memory elasticity.
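As an illustration of what a memory-elastic application looks like, here is a minimal Python sketch (ours, not from the paper) of a component that adapts to enforced memory changes: a lookup cache whose capacity can be shrunk or grown while it is running, evicting in LRU order when shrunk.

```python
from collections import OrderedDict

class ElasticCache:
    """LRU cache whose capacity can change while it is running,
    modeling an application adapting to an enforced memory budget."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def resize(self, new_capacity):
        # Enforced memory change: evict least-recently-used entries if shrinking.
        self.capacity = new_capacity
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
```

An ideal memory-elastic application behaves like this cache under resizing: its throughput degrades gracefully with the memory it holds at each moment, so overall performance depends mostly on the total memory it receives over time.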
Due to the slowdown of Moore's Law, systems designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult - developers were forced to use programming models that exposed multiple memory regions, requiring developers to manually maintain memory consistency. Previous works proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and utilized complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are bottlenecks in the system. Additionally, many popular shared-memory programming models utilize memory consistency semantics similar to those proposed for DSM, leading to widespread adoption in mainstream programming.
In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.
Advanced Vector Extension (AVX) instructions operate on wide SIMD vectors. Due to the resulting high power consumption, recent Intel processors reduce their frequency when executing complex AVX2 and AVX-512 instructions. Non-AVX code is slowed down by this frequency reduction in two situations: when it executes in parallel on the sibling hyperthread of the same core, or, because restoring the non-AVX frequency is delayed, when it directly follows the AVX2/AVX-512 code. As a result, heterogeneous workloads consisting of AVX-512 and non-AVX code are slowed down by 10% on average.
In this work, we describe a method to mitigate the frequency reduction slowdown for workloads involving AVX-512 instructions in both situations. Our approach employs core specialization and partitions the CPU cores into AVX-512 cores and non-AVX-512 cores, and only the former execute AVX-512 instructions so that the impact of potential frequency reductions is limited to those cores. To migrate threads to AVX-512 cores, we configure the non-AVX-512 cores to raise an exception when executing AVX-512 instructions. We use a heuristic to determine when to migrate threads back to non-AVX-512 cores. Our approach is able to reduce the frequency reduction overhead by 70% for an assortment of common benchmarks.
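The back-migration heuristic can be pictured as a small state machine. The sketch below is our illustration of the idea, not the authors' implementation; the core sets and the MIGRATE_BACK_AFTER threshold are hypothetical.

```python
AVX512_CORES = {0, 1}      # hypothetical partition of a 4-core CPU
NON_AVX_CORES = {2, 3}
MIGRATE_BACK_AFTER = 3     # hypothetical threshold: quiet slices before moving back

class Thread:
    def __init__(self):
        self.core_set = NON_AVX_CORES   # threads start on non-AVX-512 cores
        self.quiet_slices = 0

    def on_avx512_fault(self):
        # Non-AVX-512 cores raise an exception on AVX-512 instructions;
        # the handler migrates the thread to the AVX-512 partition.
        self.core_set = AVX512_CORES
        self.quiet_slices = 0

    def on_time_slice(self, used_avx512):
        # Heuristic: after enough consecutive time slices without AVX-512
        # code, migrate the thread back to the non-AVX-512 partition.
        if self.core_set is AVX512_CORES:
            if used_avx512:
                self.quiet_slices = 0
            else:
                self.quiet_slices += 1
                if self.quiet_slices >= MIGRATE_BACK_AFTER:
                    self.core_set = NON_AVX_CORES
```

This captures the trade-off in the design: trapping and migrating is expensive, so the threshold keeps AVX-heavy threads pinned to the AVX-512 cores, while confining the frequency reduction to that partition.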
Session 2: Potpourri
Session Chair: Ethan Miller (University of California Santa Cruz and Pure Storage)
Genome sequences contain hundreds of millions of DNA base pairs. Finding the degree of similarity between two genomes requires executing a compute-intensive dynamic programming algorithm, such as Smith-Waterman. Traditional von Neumann architectures have limited parallelism and cannot provide an efficient solution for large-scale genomic data. Approximate heuristic methods (e.g. BLAST) are commonly used. However, they are suboptimal and still compute-intensive.
In this work, we present BioSEAL, a biological sequence alignment accelerator. BioSEAL is a massively parallel non-von Neumann processing-in-memory architecture for large-scale DNA and protein sequence alignment. BioSEAL is based on resistive content addressable memory, capable of energy-efficient and high-performance associative processing.
We present an associative processing algorithm for entire database sequence alignment on BioSEAL and compare its performance and power consumption with state-of-the-art solutions. We show that BioSEAL can achieve up to 57× speedup and 156× better energy efficiency, compared with existing solutions for genome sequence alignment and protein sequence database search.
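For reference, the Smith-Waterman dynamic program mentioned above can be sketched in a few lines of Python; the O(mn) table fill is what makes whole-genome comparison so compute-intensive on von Neumann machines (the scoring parameters below are illustrative).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment score via the Smith-Waterman dynamic program.
    Fills a (len(a)+1) x (len(b)+1) table; cost is O(len(a) * len(b))."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment can restart
                          H[i - 1][j - 1] + sub,  # match / mismatch
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best
```

For two genomes of hundreds of millions of base pairs, this table has on the order of 10^16 cells, which is why massively parallel associative processing is attractive for exact alignment.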
Defense techniques such as Data Execution Prevention (DEP) and Address Space Layout Randomization (ASLR) were role models in preventing early return-oriented programming (ROP) attacks: they kept performance and scalability at the forefront, which made them widely adopted. As code reuse attacks evolved in complexity, defenses have lost touch with pragmatic design, either being narrow in scope or imposing unrealistic overheads to ensure security.
We present MARDU, an on-demand, system-wide re-randomization technique that maintains strong security guarantees while providing better overall performance and the scalability most defenses lack. We achieve code sharing with diversification by implementing reactive and scalable diversification, rather than continuous or one-time diversification. Enabling code sharing further minimizes the needed tracking, patching, and memory overheads. Our evaluation of MARDU shows a low performance overhead of 5.5% on SPEC and minimal degradation of 4.4% on NGINX, proving its applicability to both compute-intensive and scalable real-world applications.
Modern applications use storage systems in complex and often surprising ways. Tracing system calls is a common approach to understanding applications' behavior, allowing offline analysis and enabling replay in other environments. But current system-call tracing tools have drawbacks: (1) they often omit some information---such as raw data buffers---needed for full analysis; (2) they have high overheads; (3) they often use non-portable trace formats; and (4) they may not offer useful and scalable analysis and replay tools.
We have developed Re-Animator, a powerful system-call tracing tool that focuses on storage-related calls and collects maximal information, capturing complete data buffers and writing all traces in the standard DataSeries format. We also created a prototype replayer that focuses on calls related to file-system state. We evaluated our system on long-running server applications such as key-value stores and databases. Our tracer has an average overhead of only 1.8-2.3×, but the overhead can be as low as 5% for I/O-bound applications. Our replayer verifies that its actions are correct, and faithfully reproduces the logical file system state generated by the original application.
Session 3: Filesystems and Non-Volatile Memory
Session Chair: Gala Yadgar (Technion - Israel Institute of Technology)
With the emergence of NVM (Non-Volatile Memory) technologies, NVMM-based (Non-Volatile Main Memory) file systems have attracted increasing attention. Compared to traditional file systems, most NVMM-based file systems bypass the page cache and the I/O software stack. With the new mmap interface known as DAX-mmap (DAX: direct access), the CPU can access the NVMM much faster by loading from and storing to it directly. However, existing file system benchmark tools are designed for traditional file systems and do not support the new features of NVMM-based file systems, so their results are often inaccurate. In this paper, a new benchmark tool called NVMFS-IOzone is proposed, whose behavior is redesigned to reflect the new features of NVMM-based file systems. The NVM-lib from Intel is used instead of traditional msync() to keep data consistent when evaluating the performance of the DAX-mmap interface. Experimental results show that the new benchmark tool can reveal a hidden improvement of 1.4-2.1 times in NVMM-based file systems, which cannot be seen by traditional evaluation tools. Data paths for direct load/store to NVMM and for bypassing the CPU cache are also provided to support the new features of NVMM-based file systems for multidimensional evaluation. Furthermore, embedded cleanup has been added to NVMFS-IOzone to support consistent and convenient evaluation, which benefits both NVMM-based and non-NVMM-based file system benchmarking, even for quick and casual tests. The entire experimental evaluation is based on real physical NVMs rather than simulated ones, and the results confirm the effectiveness of our design.
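The store-then-flush access pattern behind DAX-mmap can be illustrated on any file system with Python's mmap module; this is our sketch, not part of NVMFS-IOzone. On a DAX NVMM file system, the mapping would go straight to persistent memory, and the flush step would reduce to cache-line flushes rather than an msync-style page writeback.

```python
import mmap
import os
import tempfile

def persist_update(path, offset, data):
    """Map a file, update it with plain memory stores, then flush the
    mapping. On a DAX file system the stores would hit NVMM directly."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(data)] = data  # ordinary store instructions
            mm.flush()                            # force persistence (msync-like)

# usage: create a small file and update it in place through the mapping
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 16)
os.close(fd)
persist_update(path, 4, b"DATA")
with open(path, "rb") as f:
    content = f.read()
os.unlink(path)
```

A benchmark that times only read()/write() system calls misses this path entirely, which is the kind of blind spot the abstract argues makes traditional tools inaccurate for NVMM-based file systems.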
More applications nowadays use network and cloud storage, and modern network file system protocols support compounding operations: packing multiple operations into one request (e.g., NFSv4, SMB). This is known to improve overall throughput and latency by reducing the number of network round trips. It has been reported that by utilizing compounds, NFSv4 performance, especially in high-latency networks, can be improved by orders of magnitude. Alas, with more operations packed into a single message, partial failures become more likely: some server-side operations succeed while others fail to execute. This places a greater challenge on client-side applications to recover from such failures. To solve this and simplify application development, we designed and built TC-NFS, an NFSv4-based network file system with transactional compound execution. We evaluated TC-NFS with different workloads, compounding degrees, and network latencies. Compared to an existing NFSv4 system that fully utilizes compounds, our end-to-end transactional support adds as little as ~1.1% overhead but as much as ~25× overhead for some intense micro- and macro-workloads.
In-memory database systems adopting a columnar storage model play a crucial role in data analytics. While these systems keep data entirely in memory for efficiency, data must also be stored on a non-volatile medium for persistence and fault tolerance. Traditionally, slow block-level devices like HDDs or SSDs are used, but they can now be replaced by fast byte-addressable NVRAM. Thus, hybrid memory systems consisting of DRAM and NVRAM offer column-oriented database systems a great opportunity to persistently store and efficiently process columnar data exclusively in main memory. However, possible DRAM and NVRAM failures still necessitate the protection of primary data. While data replication is a suitable means, it aggravates the NVRAM endurance problem through increased write activity. To tackle that challenge and reduce the overhead of replication, we propose a novel Polymorphic Compressed Replication (PCR) mechanism that represents replicas using lightweight compression algorithms to reduce NVRAM writes, while supporting different compressed formats for the replicas of one column to facilitate different database operations during query processing. To show feasibility and applicability, we developed an in-memory column-store prototype that transparently employs PCR through an abstract user-space library. Experiments on this prototype show the effectiveness of the proposed PCR mechanism.
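The core idea of polymorphic replicas, one logical column kept in several compressed formats so that each operator reads the friendlier one, can be sketched as follows. This is our illustration, not the authors' prototype; the two formats chosen here (run-length and dictionary encoding) are common lightweight compression schemes.

```python
def rle_encode(column):
    """Run-length encoding: compact for scans and aggregates over runs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def dict_encode(column):
    """Dictionary encoding: small integer codes, convenient for lookups."""
    codes, mapping = [], {}
    for v in column:
        codes.append(mapping.setdefault(v, len(mapping)))
    return codes, mapping

class PolymorphicReplicas:
    """One logical column replicated in two compressed formats, so
    different operators can pick the representation that suits them."""
    def __init__(self, column):
        self.rle = rle_encode(column)      # replica 1: run-length encoded
        self.dict = dict_encode(column)    # replica 2: dictionary encoded

    def count(self, value):
        # answered from the RLE replica without decompressing
        return sum(n for v, n in self.rle if v == value)

    def distinct(self):
        # answered from the dictionary replica's mapping alone
        return set(self.dict[1])
```

Because both replicas are smaller than the raw column, writing them to NVRAM costs fewer byte writes than uncompressed replication, which is the endurance argument the abstract makes.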