We are delighted to announce the papers that have been accepted to ACM SYSTOR’23.
Major efforts on the design of a persistent hash table on a non-volatile byte-addressable memory focus on efficient support of crash consistency with fence/flush primitives as well on non-disruptive table rehashing operations. When a data entry in a hash bucket cannot be updated with one atomic write, out-of-place update, instead of in-place update, is required to avoid data corruption after a failure. This often causes extra fences/flushes. Meanwhile, when open addressing techniques, such as linear probing, are adopted for high load factor, the scope of search for a key can be large. Excessive use of fence/flush and extended key search paths are two major sources of performance degradation with hash tables in a persistent memory.
To address the issues, we design a persistent hash table, named TurboHash, for building high-performance key-value store. TurboHash has a number of much desired features all in one design. (1) It supports out-of-place update with a cost equivalent to that of an in-place write to provide lock-free reads. (2) Long-distance linear probing is minimized (only when necessary). (3) It conducts only shard resizing for expansion and avoids expensive directory-level rehashing; And (4) it exploits hardware features for high I/O and computation efficiency, including Intel’s Optane DC’s performance characteristics and Intel AVX instructions. In particular, all of the features are enabled with the consideration of a critical performance characteristic of the emergent Intel Optane DC persistent memory, which is its internal 256-byte block access. We have implemented TurboHash on the memory and conducted extensive evaluations. Experiment results show that TurboHash improves state-of-the-arts by 2x-8x in terms of throughput and latency.
Lei Lei, Decai Pan and Li Lin. Chongqing University; Peng Ouyang, TsingMicro Co. Ltd.; Xuelaing Du, Beijing Baidu Netcom Science Technology Co., Ltd.; Dajiang Liu, Chongqing University
Spatial accelerators, such as spatially-programmed Coarse-Grained Reconfigurable Arrays (CGRA), enable a pervasive use of high energy-efficient solutions for computation-intensive applications. In the mapping of spatial accelerators, a large kernel usually needs to be partitioned into multiple subgraphs to satisfy resource constraints, leading to more memory accesses and access conflicts. To minimize the access conflicts, existing works either neglect the interference of multiple subgraphs or put little attention to data’s life cycle along the execution order, leading to a sub-optimum. To this end, this paper proposes an optimized memory allocation approach for multi-subgraph mapping on spatial accelerators by constructing an optimization problem using Integer Linear Programming (ILP). The experimental results demonstrate that our work can find conflict-free solutions for most kernels and achieve 1.15X speedup, as compared to the state-of-the-art approach.
RDMA (Remote Direct Memory Access) is widely studied in building key-value stores to achieve ultra low latency. In RDMA-based key-value stores, the indexing time takes a large fraction of the overall operation latency as RDMA enables fast data accesses. However, the single index structure used in existing RDMA-based key-value stores, either hash-based or sorted index, fails to support range queries efficiently while achieving high performance for single-point operations. In this paper, we explore the adoption of hybrid index in the key-value stores based on RDMA, especially under the memory disaggregation architecture, to combine the benefits of hash table and sorted index. We propose HStore, an RDMA-based key-value store that uses hash table for single-point lookups and leverages skiplist for range queries to index the values stored in the memory pool. Guided by previous work on using RDMA for key-value services, HStore dedicatedly chooses different RDMA verbs to optimize the read and write performance. To efficiently keep the index structures within a hybrid index consistent, HStore asynchronously applies the updates to the sorted index by shipping the update log via two-sided verbs. Compared to Sherman and Clover, HStore improves the throughput by up to 54.5% and 38.5% respectively under the YCSB benchmark.
As we move towards hyper-converged cloud solutions, the efficiency and overheads of distributed file systems at the cloud tenant side (i.e., client) become of paramount importance. Often, the client-side driver of a cloud file system is complex and CPU intensive, deeply coupled with the backend implementation, and requires optimizing multiple intrusive knobs. In this work, we propose to decouple the file system client from its backend implementation by virtualizing it with an off-the-shelf DPU using the Linux virtio-fs software stack. The decoupling allows us to offload the file system client execution to a DPU, which is managed and optimized by the cloud provider, while freeing the host CPU cycles. DPFS, our proposed framework, is 4.4X more host CPU efficient per I/O, delivers comparable performance to a tenant with zero-configuration and without modification of their host software stack, while allowing workload and hardware specific backend optimizations. The DPFS framework and its artifacts are publically available at GitHub – IBM/DPFS: DPU-Powered File System Virtualization over virtio-fs
Anomalous events can signal a variety of problems in any system. As such, robust, fast detection of anomalies is important so issues can be fixed before they cascade to create larger problems. In this paper we focus on IBM Z systems, although most of the problems addressed and techniques used are broadly applicable. For example, anomalies can signal issues such as disk malfunctions, slow or unresponsive modules, crashes and latent bugs, lock contention, excessive retries, the need to allocate more resources to reduce contention, etc. Although there are specific techniques for addressing individual issues, anomaly detection is useful in its broad spectrum utility, and its ability to identify combinations of problems for which there may not be a specific approach implemented. In addition, anomaly detection serves as a backstop: truly anomalous events suggest that normal mechanisms did not work.
Our input for detecting anomalies is low-level, summarized information available in time series to the zOS operating system. Although such information lacks some high-level context, it does provide an operating system awareness that benefits from universal applicability to any zOS system and any code running on such a system. The data is also quite rich with 100 – 100,000 metrics per sample depending on how “metric” is defined. As might be expected the data contains metrics such as CPU utilization, execution priorities, internal locking behavior, bytes read and written by an executing process, etc. It also contains higher-level information such as the executing process names or transactional identifiers from online transactional processing facilities. Names are useful not only in detecting anomalies, but in conveying context to users trying to isolate and fix problems.
Our techniques build on KL divergence and learn continuously without supervision and with low overhead. Continuous learning is important. The first instance of an aberrant behavior is an anomaly. The 10th instance probably is not. This point also illustrates the utility of anomaly detection in pinpointing root cause: early detection is essential and broad-spectrum anomaly detection provides excellent capability to do just that.
This paper outlines these techniques and demonstrates their efficacy in detecting and resolving key problems.
Alex Merenstein, Stony Brook University; Vasily Tarasov, IBM Research; Ali Anwar, University of Minnesota; Scott Guthridge, IBM Research; Erez Zadok, Stony Brook University
Serverless platforms offer on-demand computation and represent a significant shift from previous platforms that typically required resources to be pre-allocated (e.g., virtual machines). As serverless platforms have evolved, they have become suitable for a much wider range of applications than their original use cases. However, storage access remains a pain point that holds serverless back from becoming a completely generic computation platform. Existing storage for serverless typically uses an object interface. Although object APIs are simple to use, they lack the richness, versatility, and performance of POSIX-based file system APIs. Additionally, there is a large body of existing applications that relies on file-based interfaces. The lack of file based storage options prevents these applications from being ported to serverless environments. In this paper, we present F3, a file system that offers features to improve file access in serverless platforms: (1) efficient handling of ephemeral data, by placing ephemeral and non-ephemeral data on storage that exists at a different points along the durability-performance tradeoff continuum, (2) locality-aware data scheduling, and (3) efficient reading while writing. We modified OpenWhisk to support attaching file-based storage and to leverage F3’s features using hints. Our prototype evaluation of F3 shows improved performance of up to 1.5–6.5x compared to existing storage systems.
Inho Song and Myounghoon Oh, Dankook University; Bryan S. Kim, Syracuse University; Seehwan Yoo, Jaedong Lee and Jongmoo Choi, Dankook University
The ZNS (Zoned NameSpace) interface shifts much of the storage maintenance responsibility to the host from the underlying SSDs (Solid-State Drives). In addition, it opens a new opportunity to exploit the internal parallelism of SSDs at both hardware and software levels. By orchestrating the mapping between zones and SSD-internal resources and by controlling zone allocation among threads, ZNS SSDs provide a distinct performance trade-off between parallelism and isolation. To understand and explore the design space of ZNS SSDs, we present ConfZNS (Configurable ZNS), an easy-to-configure and timing-accurate emulator based on QEMU. ConfZNS allows users to investigate a variety of ZNS SSD’s internal architecture and how it performs with existing host software. We validate the accuracy of ConfZNS using real ZNS SSDs and explore performance characteristics of different ZNS SSD designs with real-world applications such as RocksDB, F2FS, and Docker environment.
This paper studies how RAID (redundant array of independent disks) could take full advantage of modern SSDs (solid-state drives) with built-in transparent compression. In current practice, RAID users are forced to choose a specific RAID level (e.g., RAID 10 or RAID 5) with a fixed storage cost vs. speed performance trade-off. Commercial market is witnessing the emergence of a new family of SSDs that can internally perform hardware-based lossless compression on each 4KB LBA (logical block address) block, transparent to host OS and user applications. Beyond straightforwardly reducing the RAID storage cost, such modern SSDs make it possible to relieve RAID users from being locked into a fixed storage cost vs. speed performance trade-off. The key idea is simple: RAID systems opportunistically leverage higher-than-expected runtime user data compressibility to enable dynamic RAID level conversion to improve the speed performance without compromising the effective storage capacity. This paper presents design techniques to enable and optimize the practical implementation of such elastic RAID systems. For the purpose of demonstration, we implemented a Linux software-based elastic RAID prototype that supports dynamic conversion between RAID 5 and RAID 10. Compared with a baseline software-based RAID 5, under sufficient runtime data compressibility that enables the conversion from RAID 5 to RAID 10 over 60% user data, the elastic RAID could improve the 4KB random write IOPS (IO per second) by 42% and 4KB random read IOPS in degraded mode by 46%, while maintaining the same effective storage capacity.
The disregarded performance of erase operation in low-latency NAND flash memory relatively increases the impact of garbage collection interferences, which is a challenge when considering the use for storage class memory applications.
Conventional erase-related techniques do not sufficiently conceal the impact of erase operation and incur a performance trade-off between throughput and latency.
In this paper, we propose a boost and stop erase (BOOSTER) technique consisting of a multi-block erase technique and an adaptive erase suspension technique. The proposed technique improves both throughput and latency of low-latency NAND flash-based SSDs.
Our experiments show up to 2.31x higher throughput and up to 82.8x lower latency than the conventional techniques in the small random workload. The proposed technique also passes the certification test with 1.8x to 4.9x higher load in Aerospike Certification Tool evaluations.
Seungjin Lee, Chang-Gyu Lee and Donghyun Min, Sogang University; Inhyuk Park and Woosuk Chung, SK hynix; Anand Sivasubramaniam, Pennsylvania State University; Youngjae Kim, Sogang University
Key-Value SSD (KVSSD) has shown much potential for several important classes of emerging data stores with its high throughput and low latency. In designing a key-value store with range queries, LSM-tree is considered a better choice over a hash table due to its key ordering. However, design space for range queries in LSM-tree-based KVSSDs has yet to be explored despite the range query being one of the most demanding features. In this paper, we investigate design constraints in LSM-tree-based KVSSD from the perspective of the range query and propose three design principles. Keeping these principles, we present IterKVSSD, an Iterator interface extended LSM-tree-based KVSSD for range queries. IterKVSSD is implemented on OpenSSD Cosmos+, and its range query throughput increases at most x4.13 and x7.22 in random and sequential key distribution compared to existing KVSSDs.
Public cloud providers offer a diverse collection of storage types and configurations with different costs and performance SLAs. As a consequence, it is difficult to select the most cost-efficient allocations for storage backends, while satisfying a given workload’s performance requirements, when moving data-heavy applications to the cloud. We present Mimir, a tool for automatically finding a cost-efficient virtual storage cluster configuration for a customer’s storage workload and performance requirements. Importantly, Mimir considers all block storage types and configurations, and even heterogeneous mixes of them. In our experiments, compared to state-of-the-art approaches that consider only one storage type, Mimir finds configurations that reduce cost by up to 81% for substantial storage workloads.
Heting Liu and Zhichao Li, ByteDance Inc.; Cheng Tan, Northeastern University; Rongqiu Yang, ByteDance Inc.; Guohong Cao, Pennsylvania State University; Zherui Liu and Chuanxiong Guo, Bytedance Inc.
Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. In large-scale GPU clusters, GPU failures are inevitable and may cause severe consequences. For example, GPU failures disrupt distributed training, crash inference services, and result in service level agreement violations. In this paper, we study the problem of predicting GPU failures using machine learning (ML) models to mitigate their damages.
We train prediction models on a four-month production dataset with 350 million entries at CompanyXYZ. We observe that classic prediction models (GBDT, MLP, LSTM, and 1D-CNN) do not perform well — they are inaccurate for predictions and unstable over time. We propose several techniques to improve the precision and stability of predictions, including parallel and cascade model-ensemble mechanisms and a sliding training method. We evaluate the performance of our proposed techniques. The results show that our proposed techniques improve the prediction precision from 46.3% to 85.4% on production workloads.