Full Citation in the ACM Digital Library.
Day 1: Monday, May 22, 2017
09:00 Welcome and registration
09:30 Opening addresses
10:00 Keynote #1:
The Trouble with Hardware,
Timothy Roscoe (ETH Zürich)
Computer hardware, from datacenters through rackscale computing down to mobile device systems-on-chip, is increasingly easy to design. A combination of advanced CAD systems, the rise of Moore's law, and now the fall of Moore's law, has resulted in a huge diversity of hardware platforms whose complexity is immense.
One downside of this is the monumental software engineering challenge in building and maintaining correct, robust, and portable systems software. This is an open secret in many pockets of industry but receives little attention in research. We ran up against this problem full square (and continue to do so) while developing the Barrelfish research OS.
In part of my talk I'll discuss what can be done to address this in the design of systems software, by importing ideas from formal verification, knowledge representation, and program synthesis to the C-dominated world of low-level code.
This, however, begs broader questions: given that custom hardware is getting easier to design, what should it look like to system software? How can systems researchers influence such hardware? And in an age where more corporations are building custom hardware but academia is mostly restricted to commodity systems, what can be done to conduct relevant, impactful research in this space outside of industry? I'll try and suggest some answers.
Timothy Roscoe is a Full Professor in the Systems Group of the
Computer Science Department at ETH Zurich. He
received a PhD from the Computer Laboratory of the University of
Cambridge, where he was a principal designer and builder of the
Nemesis operating system, as well as working on the Wanda microkernel
and Pandora multimedia system. After three years working on web-based
collaboration systems at a startup company in North Carolina, Mothy
joined Sprint's Advanced Technology Lab in Burlingame, California,
working on cloud computing and network monitoring. He then joined
Intel Research at Berkeley in April 2002 as a principal architect of
PlanetLab, an open, shared platform for developing and deploying
planetary-scale services. In September 2006 he spent four months as a
visiting researcher in the Embedded and Real-Time Operating Systems
group at National ICT Australia in Sydney, before joining ETH Zurich
in January 2007. His current research interests include monitoring,
modelling, and managing complex enterprise datacenters, and system
software for modern hardware, including the Barrelfish research
operating system. He was recently elected Fellow of the ACM for
contributions to operating systems and networking research.
11:00 Coffee Break
11:30 Session 1: Resources
Session Chair: Orna Agmon Ben-Yehuda (Technion)
Multidimensional Resource Allocation in Practice
Authors: Danny Raz (Technion), Itai Segall (Nokia Bell Labs), Maayan Goldstein (Nokia Bell Labs)
One of the main motivations for the shift to the Cloud (and the more recent shift of telco operators into NFV) is cost reduction due to high utilization of infrastructure resources. However, achieving high utilization in practical scenarios is complex since the term “resources” covers different orthogonal aspects, such as server CPU, storage (or disk) usage and network capacity, and the workload characterization varies over time and over different users.
In this paper we study the placement of Virtual Machines (VMs) that implement services over the physical infrastructure, trying to understand what makes a placement scheme better than others in the overall utilization of the various resources. We show that the multidimensional case is inherently different from the single dimension case, and develop novel placement heuristics to address the specific challenges. We then show, by extensive evaluation over real data, that operators can significantly improve their resource utilization by selecting the most appropriate placement policy, according to their system specifications and the deployed services. In particular, two of our new heuristics that dynamically change the placement logic according to the amount of available (unused) resources are shown to perform very well in many practical scenarios.
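As an illustration of why multidimensional placement differs from the single-dimension case, here is a minimal sketch of two generic vector-packing heuristics. These are not the heuristics developed in the paper: the function names, the two-dimensional (CPU, storage) demand vectors, and the dot-product scoring rule are all assumptions chosen for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Tuple[int, ...]  # hypothetical resource vector, e.g. (cpu, storage)


def fits(demand: Vector, free: Vector) -> bool:
    """A VM fits only if every dimension fits, not just the total."""
    return all(d <= f for d, f in zip(demand, free))


def first_fit(demand: Vector, free_per_host: Sequence[Vector]) -> Optional[int]:
    """Pick the first host with enough room in every dimension."""
    for host, free in enumerate(free_per_host):
        if fits(demand, free):
            return host
    return None


def dot_product_fit(demand: Vector, free_per_host: Sequence[Vector]) -> Optional[int]:
    """Pick the feasible host whose free vector has the largest dot product
    with the demand (assumed scoring rule; ties go to the lowest host index)."""
    best_host, best_score = None, -1
    for host, free in enumerate(free_per_host):
        if fits(demand, free):
            score = sum(d * f for d, f in zip(demand, free))
            if score > best_score:
                best_host, best_score = host, score
    return best_host


def place(vms: Sequence[Vector], capacity: Vector, n_hosts: int,
          choose: Callable[[Vector, Sequence[Vector]], Optional[int]]) -> List[Optional[int]]:
    """Place VMs one by one; returns the chosen host per VM (None if rejected)."""
    free = [tuple(capacity) for _ in range(n_hosts)]
    placement: List[Optional[int]] = []
    for vm in vms:
        host = choose(vm, free)
        placement.append(host)
        if host is not None:
            free[host] = tuple(f - d for f, d in zip(free[host], vm))
    return placement
```

Note that a host with free resources (2, 8) cannot accept a VM demanding (3, 1), even though the VM's total demand is well below the host's total free capacity; a single-dimension policy has no analogue of this case.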
Heterogeneous- and NUMA-aware Scheduling for Many-core Architectures
Authors: Panayiotis Petrides (University of Cyprus), Pedro Trancoso (University of Cyprus)
As the number of cores increases in a single chip processor, several challenges arise: wire delays, contention for out-of-chip accesses, and core heterogeneity. In order to address these issues and the applications’ demands, future large-scale many-core processors are expected to be organized as a collection of NUMA clusters of heterogeneous cores. In this work we propose a scheduler that takes into account the non-uniform memory latency, the heterogeneity of the cores, and the contention at the memory controller to find the best matching core for the application’s memory and compute requirements. Scheduler decisions are based on an on-line classification process that classifies each application’s requirements as either memory- or compute-bound. We evaluate our proposed scheduler on the 48-core Intel SCC using applications from the SPEC CPU2006 benchmark suite. Our results show that even when all cores are busy, migrating processes to cores that better match the requirements of applications results in an overall performance improvement. In particular, we observed a reduction in execution time of 15% to 36% compared to a random static scheduling policy.
12:25 Session 2: SSDs and Flash
Session Chair: Gala Yadgar (Technion)
AutoStream: Automatic Stream Management for Multi-streamed SSDs
Authors: Jingpei Yang (Samsung Semiconductor), Rajinikanth Pandurangan (Samsung Semiconductor), Changho Choi (Samsung), Vijay Balakrishnan (Samsung Semiconductor)
Multi-stream SSDs can isolate data with different lifetimes into separate erase blocks, thus reducing garbage collection overhead and improving overall SSD performance. Applications are responsible for managing these device-level streams, including stream open/close and data-to-stream mapping. This requires application changes, and the engineer deploying the solution needs to be able to individually identify the streams in their workload. Furthermore, when multiple applications are involved, such as in VM or containerized environments, stream management becomes more complex due to the limited number of streams a device can support: for example, allocating streams to applications or sharing streams across applications causes additional overhead.
To address these issues and reduce the overhead of stream management, this paper proposes automatic stream management algorithms that operate under the application layer. Our stream assignment technique, called AutoStream, is based on run-time workload detection and is independent of the application(s). We implement our AutoStream prototype in the Linux NVMe device driver, and our performance evaluation shows up to a 60% reduction in WAF (Write Amplification Factor) and up to a 237% improvement in performance compared to a conventional SSD device.
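To make the idea of application-independent stream assignment concrete, here is a minimal sketch that maps writes to streams by how often a region of the device has been written, so that data with similar update rates shares a stream. This is an assumed, simplified policy, not the AutoStream algorithms themselves; the class name, the chunk size, and the logarithmic bucketing are hypothetical. The write amplification factor is computed with the usual definition, flash bytes written divided by host bytes written.

```python
from collections import defaultdict


class FrequencyStreamMapper:
    """Hypothetical mapper: hotter chunks get higher stream IDs."""

    def __init__(self, num_streams: int = 4, chunk_size: int = 1 << 20):
        self.num_streams = num_streams
        self.chunk_size = chunk_size          # bytes per tracked chunk (assumed)
        self.write_counts = defaultdict(int)  # chunk index -> writes seen so far

    def stream_for_write(self, byte_offset: int) -> int:
        """Record one write and return the stream ID it should be tagged with."""
        chunk = byte_offset // self.chunk_size
        self.write_counts[chunk] += 1
        # floor(log2(count)): 1 write -> 0, 2-3 -> 1, 4-7 -> 2, ...
        level = self.write_counts[chunk].bit_length() - 1
        return min(level, self.num_streams - 1)


def write_amplification(flash_bytes_written: int, host_bytes_written: int) -> float:
    """WAF = bytes physically written to flash / bytes written by the host."""
    return flash_bytes_written / host_bytes_written
```

Because the mapper only observes offsets, it could sit below the application layer, which is the property the abstract emphasizes; the choice of detection signal is where a real implementation would differ.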
Freewrite: Creating (Almost) Zero-Cost Writes to SSD in Applications (short paper)
Authors: Chunyi Liu (Northwestern Polytechnical University), Fan Ni (University of Texas at Arlington), Xingbo Wu (University of Texas at Arlington), Xiao Zhang (Northwestern Polytechnical University), Song Jiang (University of Texas at Arlington)
While flash-based SSDs have much higher access speed than hard disks, they have an Achilles heel, which is the service of write requests. Not only is writing slower than reading, but it can also incur expensive garbage collection operations and reduce SSDs’ lifetime. The deduplication technique can help to avoid writing data objects whose contents are already on the disk. A typical object is the disk block, for which a block-level deduplication scheme can help identify duplicate blocks and avoid writing them. For the technique to be effective, data written to the disk must not only be the same as data currently on the disk but also be block-aligned.
In this work, we will show that many deduplication opportunities are lost due to block misalignment, leading to a substantial number of unnecessary writes. As case studies, we develop a scheme to retain alignments of the data that are read from the disk in the file modifications by using small additional spaces for two important applications, a log-based key-value store (e.g., FAWN) and an LSM-tree based key-value store (e.g., LevelDB). Our experiments show that the proposed scheme can improve throughput by up to 4.5X and 26% for FAWN and LevelDB systems, respectively, with less than 5% space overhead.
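The effect of misalignment on block-level deduplication can be reproduced in a few lines. The sketch below is an illustration only, not the Freewrite scheme: the 4096-byte block size, the SHA-256 fingerprints, and the helper names are assumptions.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4096  # assumed block size in bytes


def block_fingerprints(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Fingerprint each fixed-size block, starting at offset 0."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def duplicate_blocks(new_data: bytes, stored: Set[bytes],
                     block_size: int = BLOCK_SIZE) -> int:
    """Count blocks of new_data whose fingerprint is already stored."""
    return sum(fp in stored for fp in block_fingerprints(new_data, block_size))


# Four distinct blocks that are already on disk.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
stored = set(block_fingerprints(original))

header = b"H" * 100                                   # small record header
shifted = header + original                           # data no longer block-aligned
padded = header + b"\0" * (BLOCK_SIZE - len(header)) + original  # alignment kept
```

Here `duplicate_blocks(shifted, stored)` finds no duplicates although every payload byte is already stored, while `duplicate_blocks(padded, stored)` recovers all four at the cost of some padding space, which is the trade-off the abstract describes.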
Relieving Self-Healing SSDs of Heal Storms (short paper)
Authors: Li-Pin Chang (National Chiao-Tung University), Sheng-Min Huang (National Chiao-Tung University), Kun-Lin Chou (National Chiao-Tung University)
Recent studies have shown that building self-healing SSDs is feasible. When the stress of a block becomes critical, it can be healed to remove part of the stress. However, with wear leveling, all blocks are evenly worn and have similar stress, and all blocks could undergo the healing process within a short period of time. The intensive heal operations, called heal storms, cause highly unpredictable I/O performance and storage reliability. Inspired by the even distribution of erase counts under wear leveling, we propose to operate wear leveling on virtual erase counts instead of real erase counts. When the balance among virtual erase counts is achieved through wear leveling, all real erase counts become evenly dispersed in a controlled interval. In this way, blocks will undergo healing at different times. Virtual erase counts are progressively adjusted such that all blocks reach their endurance limit when the SSD permanently retires. Our results show that our approach successfully resolves the heal storm problem without affecting the SSD lifespan.
13:20 Lunch Break
14:20 Session 3: Observation
Session Chair: Larry Rudolph (Two Sigma)
LoGA: Low-overhead GPU accounting using events
Authors: Jens Kehne (Karlsruhe Institute of Technology), Stanislav Spassov (Karlsruhe Institute of Technology), Marius Hillenbrand (Karlsruhe Institute of Technology), Marc Rittinghaus (Karlsruhe Institute of Technology), Frank Bellosa (Karlsruhe Institute of Technology)
Over the last few years, GPUs have become common in computing. However, current GPUs are not designed for a shared environment like a cloud, creating a number of challenges whenever a GPU must be multiplexed between multiple users. In particular, the round-robin scheduling used by today’s GPUs does not distribute the available GPU computation time fairly among applications. Most of the previous work addressing this problem resorted to scheduling all GPU computation in software, which induces high overhead. While there is a GPU scheduler called NEON which reduces the scheduling overhead compared to previous work, NEON’s accounting mechanism frequently disables GPU access for all but one application, resulting in considerable overhead if that application does not saturate the GPU by itself.
In this paper, we present LoGA, a novel accounting mechanism for GPU computation time. LoGA monitors the GPU’s state to detect GPU-internal context switches, and infers the amount of GPU computation time consumed by each process from the time between these context switches. This method allows LoGA to measure GPU computation time consumed by applications while keeping all applications running concurrently. As a result, LoGA achieves a lower accounting overhead than previous work, especially for applications that do not saturate the GPU by themselves. We have developed a prototype which combines LoGA with the pre-existing NEON scheduler. Experiments with that prototype have shown that LoGA induces no accounting overhead while still delivering accurate measurements of applications’ consumed GPU computation time.
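The accounting idea, attributing the time between two observed context switches to the process that was running in between, can be sketched as follows. This is a simplified illustration and not LoGA's implementation: the event format (timestamp, process ID switched to) and the function name are assumptions, and how the events are detected on real hardware is outside the sketch.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def account_gpu_time(switch_events: List[Tuple[float, Hashable]],
                     end_time: float) -> Dict[Hashable, float]:
    """Sum GPU time per process from a time-ordered list of context switches.

    Each event (t, pid) means: at time t the GPU switched to a context owned
    by pid. The interval until the next switch (or end_time for the last
    event) is charged to that pid.
    """
    usage: Dict[Hashable, float] = defaultdict(float)
    boundaries = switch_events[1:] + [(end_time, None)]
    for (start, pid), (stop, _) in zip(switch_events, boundaries):
        usage[pid] += stop - start
    return dict(usage)
```

For example, with switches to A at time 0, B at 3, A at 5, and B at 9, and an observation window ending at 10, A is charged 7 time units and B is charged 3, without ever pausing either application.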
Dexter: Faster Troubleshooting of Misconfiguration Cases Using System Logs
Author: Rukma Ameet Talwadker (NetApp)
Misconfigurations in storage systems can lead to business losses due to system downtime, with substantial staff time invested in troubleshooting. Hence, faster troubleshooting of software misconfigurations is critically important for customers as well as vendors.
This paper introduces a framework and a tool called Dexter, which embraces the recent trend of viewing systems as data to derive troubleshooting clues. Dexter provides quick insights into the problem root cause and possible resolution by solely using the storage system logs. This differentiates Dexter from other previously known approaches which complement log analysis with source code analysis, execution traces, etc. Furthermore, Dexter analyzes command history logs from the sick system after it has been healed and predicts the exact command(s) which resolved the problem. Dexter’s approach is simple and can be applied to other software systems with diagnostic logs for immediate problem detection without any pre-trained models.
Evaluation on 600 real customer support cases shows 90% accuracy in identifying the root cause and over 65% accuracy in finding an exact resolution for the misconfiguration problem. Results show up to 60% noise reduction in system logs and at least 10x savings in case resolution times, at times bringing troubleshooting down from days to minutes. Dexter runs 24x7 in NetApp’s support data center.
The paper also presents insights from a study of thousands of real customer support cases across thousands of deployed systems over a period of 1.5 years. These investigations uncover factors that can delay customer case resolutions and that influenced Dexter’s design.
Simulation-Based Tracing and Profiling for System Software Development (short paper)
Authors: Anselm Busse (Technische Universität Berlin), Reinhardt Karnapke (Technische Universität Berlin), Helge Parzyjegla (Universität Rostock)
Tracing and profiling low-level kernel functions (e.g., as found in the process scheduler) is a challenging task, though a necessary one in both research and production in order to acquire detailed insights and achieve peak performance. Several kernel functions are known to be untraceable because of architectural limitations, whereas tracking other functions causes side effects and skews profiling results.
In this paper, we present a novel, simulation-based approach to analyze the behavior and performance of kernel functions. Kernel code is executed on a simulated hardware platform, avoiding the bias caused by collecting the tracing data within the system under observation. From the flat call trace generated by the simulator, we reconstruct the entire call graph and enrich it with detailed profiling statistics. Specifying regions of interest enables developers to systematically explore the system behavior and identify performance bottlenecks. As a case study, we analyze the process scheduler of the Linux kernel. We are interested in quantifying the synchronization overhead caused by a growing number of CPU cores in a custom, semi-partitioned scheduler design. Conventional tracing methods were not able to obtain measurements with the required accuracy and granularity.
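Reconstructing a call graph from a flat trace is essentially a stack replay. The following minimal sketch assumes a hypothetical trace format of `(kind, function, timestamp)` tuples with `kind` being `"call"` or `"ret"`; the simulator's real output format and the paper's statistics are not specified in the abstract, so everything below is illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, int]  # (kind, function name, timestamp), assumed format


def build_call_graph(trace: Iterable[Event]):
    """Replay a flat call/return trace with an explicit stack.

    Returns (edges, inclusive_time): edges maps (caller, callee) to the
    number of calls, inclusive_time maps a function to the total time spent
    between its entries and exits. Recursive calls are counted once per
    activation, so recursion inflates inclusive time.
    """
    stack: List[Tuple[str, int]] = []
    edges: Dict[Tuple[str, str], int] = defaultdict(int)
    inclusive_time: Dict[str, int] = defaultdict(int)
    for kind, function, timestamp in trace:
        if kind == "call":
            caller = stack[-1][0] if stack else "<root>"
            edges[(caller, function)] += 1
            stack.append((function, timestamp))
        elif kind == "ret":
            name, entered = stack.pop()
            if name != function:
                raise ValueError(f"unbalanced trace: expected return from {name}")
            inclusive_time[function] += timestamp - entered
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return dict(edges), dict(inclusive_time)
```

A region of interest could then be expressed as a filter over the events or over the resulting edges, for instance keeping only the subtree rooted at a scheduler entry point.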
15:30 Coffee Break
15:55 Session 4: Security
Session Chair: Anselm Busse (Technische Universität Berlin)
SafeFS: A Modular Architecture for Secure User-Space File Systems (One FUSE to rule them all) [Best Student Paper]
Authors: Rogério Pontes (INESC TEC & University of Minho), Dorian Burihabwa (University of Neuchatel, Switzerland), Francisco Maia (INESC TEC & University of Minho), João Paulo (INESC TEC & University of Minho), Valerio Schiavoni (University of Neuchatel, Switzerland), Pascal Felber (University of Neuchatel, Switzerland), Hugues Mercier (University of Neuchatel, Switzerland), Rui Oliveira (INESC TEC & University of Minho)
The exponential growth of data produced, the ever faster and ubiquitous connectivity, and the collaborative processing tools lead to a clear shift of data stores from local servers to the cloud. This migration occurring across different application domains and types of users—individual or corporate—raises two immediate challenges. First, outsourcing data introduces security risks, hence protection mechanisms must be put in place to provide guarantees such as privacy, confidentiality and integrity. Second, there is no “one-size-fits-all” solution that would provide the right level of safety or performance for all applications and users, and it is therefore necessary to provide mechanisms that can be tailored to the various deployment scenarios.
In this paper, we address both challenges by introducing SafeFS, a modular architecture based on software-defined storage principles featuring stackable building blocks that can be combined to construct a secure distributed file system. SafeFS allows users to specialize their data store to their specific needs by choosing the combination of blocks that provide the best safety and performance tradeoffs. The file system is implemented in user space using FUSE and can access remote data stores. The provided building blocks notably include mechanisms based on encryption, replication, and coding. We implemented SafeFS and performed in-depth evaluation across a range of workloads. Results reveal that while each layer has a cost, one can build safe yet efficient storage architectures. Furthermore, the different combinations of blocks sometimes yield surprising tradeoffs.
Eleos: ExitLess OS Services for SGX Enclaves (Highlight Paper – EuroSys 2017)
Authors: Meni Orenbach (Technion), Pavel Lifshits (Technion), Marina Minkin (Technion), Mark Silberstein (Technion)
Intel Software Guard eXtensions (SGX) enable secure and trusted execution of user code in an isolated enclave to protect against a powerful adversary. Unfortunately, running I/O-intensive, memory-demanding server applications in enclaves leads to significant performance degradation. Such applications put a substantial load on the in-enclave system call and secure paging mechanisms, which turn out to be the main reason for the application slowdown. In addition to the high direct cost of thousands-of-cycles long SGX management instructions, these mechanisms incur the high indirect cost of enclave exits due to associated TLB flushes and processor state pollution.
We tackle these performance issues in Eleos by enabling exit-less system calls and exit-less paging in enclaves. Eleos introduces a novel Secure User-managed Virtual Memory (SUVM) abstraction that implements application-level paging inside the enclave. SUVM eliminates the overheads of enclave exits due to paging, and enables new optimizations such as sub-page granularity of accesses.
We thoroughly evaluate Eleos on a range of microbenchmarks and two real server applications, achieving notable system performance gains. memcached and a face verification server running in-enclave with Eleos achieve up to 2.2× and 2.3× higher throughput, respectively, while working on datasets up to 5× larger than the enclave’s secure physical memory.
Jumpstarting BGP Security with Path-End Validation (Highlight Paper – SIGCOMM 2016)
Authors: Avichai Cohen (Hebrew University), Yossi Gilad (Boston University and MIT), Amir Herzberg (Bar Ilan University), Michael Schapira (Hebrew University)
Extensive standardization and R&D efforts are dedicated to
establishing secure interdomain routing. These efforts focus
on two mechanisms: origin authentication with RPKI,
and path validation with BGPsec. However, while RPKI is
finally gaining traction, the adoption of BGPsec seems not
even on the horizon due to inherent, possibly insurmountable,
obstacles, including the need to replace today's routing
infrastructure and meagre benefits in partial deployment.
Consequently, secure interdomain routing remains a distant
dream. We propose an easily deployable, modest extension
to RPKI, called "path-end validation", which does not entail
replacing/upgrading today's BGP routers. We show, through
rigorous security analyses and extensive simulations on empirically
derived datasets, that path-end validation yields significant
benefits even in very limited partial adoption. We
present an open-source, readily deployable prototype implementation
of path-end validation.
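As a rough illustration of the filtering rule that the name suggests, the sketch below accepts a BGP route only if the AS adjacent to the origin on the path appears in a list of last hops the origin has approved. This is a simplified reading and not the prototype: the function names, the dictionary standing in for the RPKI-published records, and the decision to accept routes for origins with no record are all assumptions. AS numbers are taken from the range reserved for documentation.

```python
from typing import Dict, List, Optional, Set


def last_hop(as_path: List[int]) -> Optional[int]:
    """Return the AS adjacent to the origin, skipping origin prepending.

    The origin is the last element of the path. Returns None when the path
    contains only the origin.
    """
    origin = as_path[-1]
    for asn in reversed(as_path):
        if asn != origin:
            return asn
    return None


def path_end_valid(as_path: List[int],
                   approved_last_hops: Dict[int, Set[int]]) -> bool:
    """Check the final link of an AS path against the origin's approved list."""
    if not as_path:
        return False
    origin = as_path[-1]
    approved = approved_last_hops.get(origin)
    if approved is None:
        return True  # assumption: no record for this origin, so nothing to check
    hop = last_hop(as_path)
    return hop is None or hop in approved


# Hypothetical record: AS 64496 approves only AS 64499 as its neighbor.
records = {64496: {64499}}
```

With these records, the path `[64500, 64499, 64496]` passes, while `[64500, 64510, 64496]`, where AS 64510 falsely claims adjacency to the origin, is rejected. Only the final link is checked, which is why no change to the routers' BGP implementation is implied.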
17:15 Poster session and refreshments
Day 2: Tuesday, May 23, 2017
09:00 Welcome and registration
09:30 Keynote #2:
Emery Berger (University of Massachusetts Amherst)
Performance clearly matters to users. The most common software update on the AppStore *by far* is "Bug fixes and performance enhancements." Now that Moore's Law Free Lunch has ended, programmers have to work hard to get high performance for their applications. But why is performance so hard to deliver?
I will first explain why our current approaches to evaluating and optimizing performance don't work, especially on modern hardware and for modern applications. I will then present two systems that address these challenges. Stabilizer is a tool that enables statistically sound performance evaluation, making it possible to understand the impact of optimizations and conclude things like the fact that the -O2 and -O3 optimization levels are indistinguishable from noise (unfortunately true).
Since compiler optimizations have largely run out of steam, we need better profiling support, especially for modern concurrent, multi-threaded applications. Coz is a novel "causal profiler" that lets programmers optimize for throughput or latency, and which pinpoints and accurately predicts the impact of optimizations. Coz's approach unlocks numerous previously unknown optimization opportunities. Guided by Coz, we improved the performance of Memcached by 9%, SQLite by 25%, and accelerated six Parsec applications by as much as 68%; in most cases, these optimizations involved modifying under 10 lines of code.
This talk is based on work with Charlie Curtsinger published at ASPLOS 2013 (Stabilizer) and SOSP 2015 (Coz), which received a Best Paper Award and was selected as a CACM Research Highlight.
Emery Berger is a Professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst, the flagship campus of the UMass system. He graduated with a Ph.D. in Computer Science from the University of Texas at Austin in 2002. Professor Berger has been a Visiting Scientist at Microsoft Research and at the Universitat Politècnica de Catalunya (UPC) / Barcelona Supercomputing Center (BSC). Professor Berger's research spans programming languages, runtime systems, and operating systems, with a particular focus on systems that transparently improve reliability, security, and performance. He and his collaborators have created a number of influential software systems including Hoard, a fast and scalable memory manager that accelerates multithreaded applications (used by companies including British Telecom, Cisco, Crédit Suisse, Reuters, Royal Bank of Canada, SAP, and Tata, and on which the Mac OS X memory manager is based); DieHard, an error-avoiding memory manager that directly influenced the design of the Windows 7 Fault-Tolerant Heap; and DieHarder, a secure memory manager that was an inspiration for hardening changes made to the Windows 8 heap. His honors include a Microsoft Research Fellowship, an NSF CAREER Award, a Lilly Teaching Fellowship, the Distinguished Artifact Award for PLDI 2014, the Most Influential Paper Award at OOPSLA 2012, the Most Influential Paper Award at PLDI 2016, three CACM Research Highlights, a Google Research Award, a Microsoft SEIF Award, and Best Paper Awards at FAST, OOPSLA, and SOSP; he was named an ACM Senior Member in 2010. Professor Berger is currently serving as an elected member of the SIGPLAN Executive Committee; he served for a decade (2007-2017) as Associate Editor of the ACM Transactions on Programming Languages and Systems, and was Program Chair for PLDI 2016.
10:30 Session 5: GPUs
Session Chair: Binoy Ravindran (Virginia Tech)
GPrioSwap: Towards a Swapping Policy for GPUs
Authors: Jens Kehne (Karlsruhe Institute of Technology), Jonathan Metter (Karlsruhe Institute of Technology), Martin Merkel (Karlsruhe Institute of Technology), Marius Hillenbrand (Karlsruhe Institute of Technology), Mathias Gottschlag (Karlsruhe Institute of Technology), Frank Bellosa (Karlsruhe Institute of Technology)
Over the last few years, Graphics Processing Units (GPUs) have become popular in computing, and have found their way into a number of cloud platforms. However, integrating a GPU into a cloud environment requires the cloud provider to efficiently virtualize the GPU. While several research projects have addressed this challenge in the past, few of these projects attempt to properly enable sharing of GPU memory between multiple clients: To date, GPUswap is the only project that enables sharing of GPU memory without inducing unnecessary application overhead, while maintaining both fairness and high utilization of GPU memory. However, GPUswap includes only a rudimentary swapping policy, and therefore induces a rather large application overhead.
In this paper, we work towards a practicable swapping policy for GPUs. To that end, we analyze the behavior of various GPU applications to determine their memory access patterns. Based on our insights about these patterns, we derive a swapping policy that includes a developer-assigned priority for each GPU buffer in its swapping decisions. Experiments with our prototype implementation show that a swapping policy based on buffer priorities can significantly reduce the swapping overhead.
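A priority-driven victim selection can be sketched in a few lines. This is an assumed, simplified policy for illustration, not the GPrioSwap implementation: the tuple layout, the convention that lower numbers mean less important buffers, and the tie-break on size are hypothetical.

```python
from typing import List, Tuple

Buffer = Tuple[str, int, int]  # (name, size in bytes, developer-assigned priority)


def choose_victims(buffers: List[Buffer], bytes_needed: int) -> List[str]:
    """Pick buffers to swap out of GPU memory until enough space is freed.

    Buffers with the lowest priority are evicted first; among equal
    priorities, larger buffers go first so fewer evictions are needed.
    If the request cannot be satisfied, all buffers are returned.
    """
    victims: List[str] = []
    freed = 0
    for name, size, _priority in sorted(buffers, key=lambda b: (b[2], -b[1])):
        if freed >= bytes_needed:
            break
        victims.append(name)
        freed += size
    return victims
```

For instance, given buffers `scratch` (priority 0), `input` and `output` (priority 1), and `weights` (priority 3), a request that exceeds the size of `scratch` evicts `scratch` and then one priority-1 buffer, leaving the high-priority `weights` buffer resident.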
Crane: Fast and Migratable GPU Passthrough for OpenCL applications
Authors: James Gleeson (University of Toronto), Daniel Kats (University of Toronto), Charlie Mei (University of Toronto), Eyal de Lara (University of Toronto)
General purpose GPU (GPGPU) computing in virtualized environments leverages PCI passthrough to achieve GPU performance comparable to bare-metal execution. However, GPU passthrough prevents service administrators from performing virtual machine migration between physical hosts.
Crane is a new technique for virtualizing OpenCL-based GPGPU computing that achieves within 5.25% of passthrough GPU performance while supporting VM migration. Crane interposes a virtualization-aware OpenCL library that makes it possible to reclaim and subsequently reassign physical GPUs to a VM without terminating the guest or its applications. Crane also enables continued GPU operation while the VM is undergoing live migration by transparently switching between GPU passthrough operation and API remoting.
11:25 Coffee Break
11:55 Session 6: Storage Systems
Session Chair: Ioana Giurgiu (IBM Research - Zürich)
TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment
Authors: Kwangsung Oh (University of Minnesota Twin Cities), Abhishek Chandra (University of Minnesota Twin Cities), Jon Weissman (University of Minnesota Twin Cities)
Exploiting the cloud storage hierarchy both within and across data centers of different cloud providers empowers Internet applications to choose data centers (DCs) and storage services based on storage needs. However, using multiple storage services across multiple data centers brings a complex data placement problem that depends on a large number of factors including, e.g., desired goals, storage and network characteristics, and pricing policies. In addition, dynamics, e.g., changing user locations and access patterns, make it impossible to determine the best data placement statically. In this paper, we present TripS, a lightweight system that considers both data center locations and storage tiers to determine the data placement for geo-distributed storage systems. Such systems make use of TripS by providing inputs including SLA, consistency model, fault tolerance, latency information, and cost information. With the given inputs, TripS models and solves the data placement problem using mixed integer linear programming (MILP). In addition, to adapt quickly to dynamics, we introduce the notion of a Target Locale List (TLL), a proactive approach to avoid expensive re-evaluation of the optimal placement. The TripS prototype runs on Wiera, a policy-driven geo-distributed storage system, to show how a storage system can easily utilize TripS for data placement. We evaluate TripS/Wiera on multiple data centers of AWS and Azure. The results show that TripS/Wiera can reduce cost by 14.96% to 98.1%, depending on the workload, compared with other approaches and can handle both short- and long-term dynamics to avoid SLA violations.
Understanding Storage Traffic Characteristics on Enterprise Virtual Desktop Infrastructure
Authors: Chunghan Lee (Fujitsu Laboratories), Tatsuo Kumano (Fujitsu Laboratories), Tatsuma Matsuki (Fujitsu Laboratories), Hiroshi Endo (Fujitsu Laboratories), Naoto Fukumoto (Fujitsu Laboratories), Mariko Sugawara (Fujitsu Laboratories)
[Abstract], [Slides], [VDI FC Traces]
Despite the growing popularity of enterprise virtual desktop infrastructure (VDI), little is known about its storage traffic characteristics. In addition, no prior work has considered the detailed characteristics of virtual machine (VM) behavior on VDI. In this paper, we analyze the enterprise storage traffic on commercial office VDI using designated VMs. For 28 consecutive days, we gathered various types of traces, including a usage questionnaire and active and passive measurements. To characterize the storage traffic, we focused on two perspectives: fibre channel (FC) traffic and VM behavior. From the FC traffic perspective, we found that read traffic is dominant, although the applications are similar to those in a previous small-scale VDI. In particular, the write response time of large transactions, e.g., 128 KiB, is strongly affected by a slight decrease in cache hits during an update storm. From the VM behavior, we found that all active user VMs generate only 25% of traffic. Although a few VMs generate massive traffic, their impact is small. These characteristics are unique in comparison with the small-scale VDI. Our results have significant implications for designing the next generation of VDI and improving its performance.
vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O (Highlight Paper – FAST 2017)
Authors: Ming Chen (Stony Brook University), Dean Hildebrand (IBM Research-Almaden), Henry Nelson (Ward Melville High School), Jasmit Saluja (Stony Brook University), Ashok Sankar Harihara Subramony (Stony Brook University), Erez Zadok (Stony Brook University)
Modern systems use networks extensively, accessing both services and storage across local and remote networks. Latency is a key performance challenge, and packing multiple small operations into fewer large ones is an effective way to amortize that cost, especially after years of significant improvement in bandwidth but not latency. To this end, the NFSv4 protocol supports a compounding feature to combine multiple operations. Yet compounding has been underused since its conception because the synchronous POSIX file-system API issues only one (small) request at a time.
We propose vNFS, an NFSv4.1-compliant client that exposes a vectorized high-level API and leverages NFS compound procedures to maximize performance. We designed and implemented vNFS as a user-space RPC library that supports an assortment of bulk operations on multiple files and directories. We found it easy to modify several UNIX utilities, an HTTP/2 server, and Filebench to use vNFS. We evaluated vNFS under a wide range of workloads and network latency conditions, showing that vNFS improves performance even for low-latency networks. On high-latency networks, vNFS can improve performance by as much as two orders of magnitude.
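The latency argument can be made concrete with a back-of-the-envelope model: if each request costs one network round trip, packing several operations into one compound divides the number of round trips. The function below is an assumed, idealized model (it ignores bandwidth limits and server-side queuing), and its name and parameters are hypothetical rather than part of vNFS.

```python
import math


def total_latency_ms(num_ops: int, rtt_ms: float,
                     ops_per_compound: int = 1,
                     service_ms_per_op: float = 0.0) -> float:
    """Estimate completion time for num_ops sequential file operations.

    Each compound costs one round trip regardless of how many operations it
    carries; per-operation service time is paid for every operation.
    """
    round_trips = math.ceil(num_ops / ops_per_compound)
    return round_trips * rtt_ms + num_ops * service_ms_per_op
```

Under this model, 1000 small operations over a 10 ms link take 10 seconds when issued one at a time, but 100 ms when packed 100 per compound, a factor of 100. As the per-operation service time grows relative to the round-trip time, the benefit shrinks, which is consistent with smaller gains on low-latency networks.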
13:15 Lunch Break
14:00 Social Event: Visit to Caesarea National Park
Day 3: Wednesday, May 24, 2017
09:00 Welcome and registration
09:30 Keynote #3:
Research in an Open Cloud Exchange,
Orran Krieger (Boston University)
[Abstract], [Speaker Bio], [Slides]
While cloud computing is transforming society, today's public clouds are black boxes, implemented and operated by a single provider that makes all business and technology decisions. Can we architect a cloud that enables a broad industry and research community to participate in the business and technology innovation? Do we really need to blindly trust the provider for the security of our data and computation? Can we expose rich operational data to enable research and to help guide users of the cloud?
The Massachusetts Open Cloud (MOC) is a new public cloud project based on an alternative, marketplace-driven model of a public cloud, that of an Open Cloud eXchange (OCX), where many stakeholders (including the research community) participate in implementing and operating an open cloud. Our vision is to create an ecosystem that brings the innovation of a broader community to bear on a healthier and more efficient cloud marketplace, where anyone can stand up a new hardware or software service, and users can make informed decisions between them. The OCX model effectively turns the cloud into a production-scale laboratory for cloud research and innovation.
The MOC is a collaboration between the Commonwealth of Massachusetts, universities (Boston University, Northeastern, MIT, Harvard and UMass), and industry (in particular Brocade, Cisco, Intel, Lenovo, Red Hat, and TwoSigma). In this talk I will give an overview of the vision of this project, its enabling technologies and operational status, and some of the different research projects taking place.
Orran Krieger is the lead on the Massachusetts Open Cloud, Founding Director for the Cloud Computing Initiative (CCI) at BU, Resident Fellow of the Hariri Institute for Computing and Computational Science & Engineering, and a Professor of the Practice at the Department of Electrical and Computer Engineering at Boston University. Before coming to BU, he spent five years at VMware starting and working on vCloud. Prior to that he was a researcher and manager at IBM T. J. Watson, leading the Advanced Operating System Research Department. Orran did his PhD and MASc in Electrical Engineering at the University of Toronto.
10:30 Session 7: CPU and Memory
Session Chair: Gürkan Gür (Boğaziçi University, Istanbul)
Erasure Coding for Small Objects in In-Memory KV Storage
Authors: Matt M. T. Yiu (The Chinese University of Hong Kong), Helen H. W. Chan (The Chinese University of Hong Kong), Patrick P. C. Lee (The Chinese University of Hong Kong)
We present MemEC, an erasure-coding-based in-memory key-value (KV) store that achieves high availability and fast recovery while keeping low data redundancy across storage servers. MemEC is specifically designed for workloads dominated by small objects. By encoding objects in their entirety, MemEC is shown to incur 60% less storage redundancy for small objects than existing replication- and erasure-coding-based approaches. It also supports graceful transitions between decentralized requests in normal mode (i.e., no failures) and coordinated requests in degraded mode (i.e., with failures). We evaluate our MemEC prototype via testbed experiments under read-heavy and update-heavy YCSB workloads. We show that MemEC achieves high throughput and low latency in both normal and degraded modes, and supports fast transitions between the two modes.
Breaking the Boundaries in Heterogeneous-ISA Datacenters (Highlight Paper – ASPLOS 2017)
Authors: Antonio Barbalace (Virginia Tech), Rob Lyerly (Virginia Tech), Christopher Jelesnianski (Virginia Tech), Anthony Carno (Virginia Tech), Ho-ren Chuang (Virginia Tech), Binoy Ravindran (Virginia Tech)
Energy efficiency is one of the most important design considerations in running modern datacenters. Datacenter operating systems rely on software techniques such as execution migration to achieve energy efficiency across pools of machines. Execution migration is possible in datacenters today because they consist mainly of homogeneous-ISA machines. However, recent market trends indicate that alternate ISAs such as ARM and PowerPC are pushing into the datacenter, meaning current execution migration techniques are no longer applicable. How can execution migration be applied in future heterogeneous-ISA datacenters?
In this work we present a compiler, a runtime, and an operating system extension for enabling execution migration between heterogeneous-ISA servers. We present a new multi-ISA binary architecture and heterogeneous-OS containers for facilitating efficient migration of natively compiled applications. We build and evaluate a prototype of our design and demonstrate energy savings of up to 66% for a workload running on an ARM and an x86 server interconnected by a high-speed network.
11:25 Coffee Break
11:55 Session 8: More Flash
Session Chair: Song Jiang (University of Texas at Arlington)
FlashNet: Flash/Network Stack Co-Design [Best Paper]
Authors: Animesh Trivedi (IBM Research, Zürich), Nikolas Ioannou (IBM Research, Zürich), Bernard Metzler (IBM Research, Zürich), Patrick Stuedi (IBM Research, Zürich), Jonas Pfefferle (IBM Research, Zürich), Ioannis Koltsidas (IBM Research, Zürich), Kornilios Kourtis (IBM Research, Zürich), Thomas R. Gross (ETH, Zürich)
During the past decade, network and storage devices have undergone rapid performance improvements, delivering ultra-low latency and several Gbps of bandwidth. Nevertheless, current network and storage stacks fail to deliver this hardware performance to the applications, often due to the loss of IO efficiency from stalled CPU performance. While many efforts attempt to address this issue solely on either the network or the storage stack, achieving high performance for networked-storage applications requires a holistic approach that considers both.
In this paper, we present FlashNet, a software IO stack that unifies high-performance network properties with flash storage access and management. FlashNet builds on RDMA principles and abstractions to provide a direct, asynchronous, end-to-end data path between a client and remote flash storage. The key insight behind FlashNet is to co-design the stack’s components (an RDMA controller, a flash controller, and a file system) to enable cross-stack optimizations and maximize IO efficiency. In micro-benchmarks, FlashNet improves 4kB network IOPS by 38.6% to 1.22M, decreases access latency by 43.5% to 50.4 µsecs, and prolongs the flash lifetime by 1.6-5.9× for writes. We illustrate the capabilities of FlashNet by building a key-value store and by porting a distributed data store that uses RDMA onto it. The use of FlashNet’s RDMA API improves the performance of the KV store by 2×, and requires minimal changes for the ported data store to access remote flash devices.
NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation
Authors: Zvika Guz (Samsung), Harry Li (Samsung), Anahita Shayesteh (Samsung), Vijay Balakrishnan (Samsung)
Storage disaggregation separates compute and storage into different nodes to allow for independent resource scaling and thus better hardware resource utilization. While disaggregation of hard-drive storage is common practice, NVMe-SSD (i.e., PCIe-based SSD) disaggregation is considered more challenging. This is because SSDs are significantly faster than hard drives, so the latency overheads (due to both network and CPU processing) as well as the extra compute cycles needed for the offloading stack become much more pronounced.
In this work we characterize the overheads of NVMe-SSD disaggregation. We show that NVMe-over-Fabrics (NVMf), a recently released remote storage protocol specification, reduces the overheads of remote access to a bare minimum, thus greatly increasing the cost-efficiency of Flash disaggregation. Specifically, while recent work showed that SSD storage disaggregation via iSCSI degrades application-level throughput by 20%, we report negligible performance degradation with NVMf, both under stress tests and with a more realistic KV-store workload.
LightNVM: The Linux Open-Channel SSD Subsystem (Highlight Paper – FAST 2017)
Authors: Matias Bjørling (CNEX Labs, Inc. and IT University of Copenhagen), Javier Gonzalez (CNEX Labs, Inc.), Philippe Bonnet (IT University of Copenhagen)
As Solid-State Drives (SSDs) become commonplace in data-centers and storage arrays, there is a growing demand for predictable latency. Traditional SSDs, serving block I/Os, fail to meet this demand. They offer a high level of abstraction at the cost of unpredictable performance and suboptimal resource utilization. We propose that SSD management trade-offs should be handled through Open-Channel SSDs, a new class of SSDs that give hosts control over their internals. We present our experience building LightNVM, the Linux Open-Channel SSD subsystem. We introduce a new Physical Page Address I/O interface that exposes SSD parallelism and storage media characteristics. LightNVM integrates into traditional storage stacks, while also enabling storage engines to take advantage of the new I/O interface. Our experimental results demonstrate that LightNVM has modest host overhead, that it can be tuned to limit read latency variability, and that it can be customized to achieve predictable I/O latencies.
13:15 Closing Remarks
13:30 Conference adjourns. Lunch
14:30 – 16:30 Performance Evaluation and Analysis Meetup, co-located with SYSTOR. Talks by Prof. Erez Zadok (Stony Brook University) and Prof. Avi Mendelson (Technion).