Full Program

Full Citation in the ACM Digital Library.

Day 1: Monday, June 3, 2019

09:00 Welcome and Registration

09:30 Opening Session

09:45 Keynote #1: Caches Are Not Your Friends: Programming Non-Volatile Memory
James Larus (EPFL)

James Larus is Professor and Dean of the School of Computer and Communication Sciences (IC) at EPFL (École Polytechnique Fédérale de Lausanne). Prior to joining IC in October 2013, Larus was a researcher, manager, and director at Microsoft Research for over 16 years and an assistant and associate professor in the Computer Sciences Department at the University of Wisconsin, Madison.
Larus has been an active contributor to numerous communities. He has published over 100 papers (with 9 best and most influential paper awards) and received over 40 US patents. Larus received a National Science Foundation Young Investigator award in 1993 and became an ACM Fellow in 2006. Larus received his MS and PhD in Computer Science from the University of California, Berkeley in 1989, and an AB in Applied Mathematics from Harvard in 1980.

10:45 Break

11:05 Session 1: Caching
Session Chair: Eddie Bortnikov (Yahoo Research, Verizon Media)

CLOCK-Pro+: Improving CLOCK-Pro Cache Replacement with Utility-Driven Adaptation
Cong Li (Intel Corporation)


CLOCK-Pro is the low-overhead approximation of the state-of-the-art cache replacement policy, Low Inter-Reference Recency Set (LIRS). It also improves the static cache space allocation in LIRS with simple heuristics to adapt to LRU-friendly workloads. However, the heuristics do not perform well in certain cases. Inspired by the idea of utility-driven adaptation from another state-of-the-art policy, CLOCK for Adaptive Replacement (CAR), we propose a new CLOCK-Pro+ policy. The new policy directly evaluates the utility of growing the number of cold pages against that of growing hot pages. It then dynamically adjusts the cache space allocation driven by the utility comparison. Experiments are performed on traces from the UMass Trace Repository as well as a synthetic trace drawn from a stack-depth distribution. While sometimes CLOCK-Pro substantially outperforms CAR and sometimes vice versa, the new CLOCK-Pro+ policy consistently performs close to the winner between the two in all the cases.

Adaptive Software Cache Management, Middleware’18 (Highlight)
Gil Einziger (Ben Gurion University of the Negev); Ohad Eytan, Roy Friedman (Technion); Benjamin Manes (Independent)

11:40 Break

12:00 Session 2: Storage / Persistent Memory
Session Chair: Aviad Zuck (Technion)

Getting More Performance with Polymorphism from Emerging Memory Technologies
Iyswarya Narayanan (Penn State); Aishwarya Ganesan (UW-Madison); Anirudh Badam, Sriram Govindan (Microsoft); Bikash Sharma (Facebook); Anand Sivasubramaniam (Penn State)


Storage-intensive systems in data centers rely heavily on DRAM and SSDs for the performance of reads and persistent writes, respectively. These applications pose a diverse set of requirements, and are limited by fixed capacity, fixed access latency, and fixed function of these resources as either memory or storage. In contrast, emerging memory technologies like 3D XPoint, battery-backed DRAM, and ASIC-based fast memory-compression offer capabilities across several dimensions. However, existing proposals to use such technologies can only improve either read or write performance but not both without requiring extensive changes to the application and the operating system. We present PolyEMT, a system that places a cache based on emerging memory technology in front of the SSD, and transparently morphs the capabilities of this cache across several dimensions - persistence, capacity, latency - to jointly improve both read and write performance. We demonstrate the benefits of PolyEMT using several large-scale storage-intensive workloads from our datacenters.

Write Optimization of Log-structured Flash File System for Parallel I/O on Manycore Servers
Chang-Gyu Lee, Hyunki Byun, Sunghyun Noh, Hyeongu Kang, Youngjae Kim (Sogang University)


In manycore server environments, we observe performance degradation in parallel writes and identify the following causes: (i) When multiple threads write to a single file simultaneously, the current POSIX-based F2FS file system does not allow the writes to proceed in parallel, even when the threads write to distinct ranges. (ii) The high processing time of fsync at the file system layer degrades I/O throughput when multiple threads call fsync simultaneously. (iii) The file system periodically checkpoints to recover from system crashes. All incoming I/O requests are blocked while the checkpoint is running, which significantly degrades overall file system performance. To solve these problems, first, we propose that file systems employ a fine-grained file-level Range Lock that allows multiple threads to write to mutually exclusive ranges of a file, rather than the coarse-grained inode mutex lock. Second, we propose NVM Node Logging, which uses NVM as an extended storage space to store file metadata and file system metadata at high speed during fsync and checkpoint operations. In particular, NVM Node Logging consists of (i) a fine-grained inode structure to solve the write amplification problem caused by flushing the file metadata in block units and (ii) a Pin Point NAT (Node Address Table) Update, which allows flushing only modified NAT entries. We implemented Range Lock and NVM Node Logging for F2FS in Linux kernel 4.14.11. Our extensive evaluation on two different types of servers (a single-socket 10-core CPU server and a multi-socket 120-core NUMA CPU server) shows significant write throughput improvements in both real and synthetic workloads.

Fine-Grain Checkpointing with In Cache Line Logging, ASPLOS’19 (Highlight)
Nachshon Cohen (Amazon); David Aksun (EPFL); Hillel Avni (Huawei); James R. Larus (EPFL)

A Persistent Lock-Free Queue for Non-Volatile Memory, PPoPP’18 (Highlight)
Michal Friedman (Technion); Maurice Herlihy (Brown University CS); Virendra Marathe (Oracle); Erez Petrank (Technion)

13:20 Lunch Break

14:30 Session 3: SGX / Security
Session Chair: Danny Harnik (IBM Research)

Trust More, Serverless
Stefan Brenner, Rüdiger Kapitza (TU Braunschweig)


The increasingly popular and novel Function-as-a-Service (FaaS) clouds allow users to deploy single functions. Compared to Infrastructure-as-a-Service or Platform-as-a-Service, this enables even more aggressive and rigorous resource sharing by providers and liberates customers from tedious maintenance tasks. However, as a crucial factor of cloud adoption, FaaS clouds need to provide security and privacy guarantees in order to allow sensitive data processing.

In this paper, we investigate securing FaaS clouds for sensitive data processing, while respecting their new features, capabilities and benefits in a technology-aware manner. We start with the proposal of a generic approach for a JavaScript-based secure FaaS platform, then get more specific and discuss the implementation of two distinct approaches based on (a) a lightweight and (b) a high performance JavaScript engine. Our prototype implementation shows promising performance while efficiently utilising resources, thereby keeping the penalties of the added security low.

Clemmys: Towards Secure Remote Execution in FaaS
Bohdan Trach, Oleksii Oleksenko, Franz Gregor (TU Dresden); Pramod Bhatotia (University of Edinburgh); Christof Fetzer (TU Dresden)


We introduce Clemmys, a security-first serverless platform that ensures confidentiality and integrity of users' functions and data as they are processed on untrusted cloud premises, while keeping the cost of protection low. We provide a design for hardening FaaS platforms with Intel SGX, a hardware-based shielded execution technology. We explain the protocol that our system uses to ensure confidentiality and integrity of data, and integrity of function chains. To overcome performance and latency issues that are inherent in SGX applications, we apply several SGX-specific optimizations to the runtime system: we use SGXv2 to speed up the enclave startup and perform batch EPC augmentation. To evaluate our approach, we implement our design over Apache OpenWhisk, a popular serverless platform. Lastly, we show that Clemmys achieves the same throughput and similar latency as native Apache OpenWhisk, while allowing it to withstand several new attack vectors.

Apps Can Quickly Destroy Your Mobile's Flash: Why They Don't, and How to Keep It That Way, MobiSys’19 (Highlight)
Tao Zhang (The University of North Carolina at Chapel Hill); Aviad Zuck (Technion - Israel Institute of Technology); Donald E. Porter (The University of North Carolina at Chapel Hill); Dan Tsafrir (Technion - Israel Institute of Technology and VMware)

15:30 Break

15:50 Session 4: Potpourri
Session Chair: André Brinkmann (Universität Mainz)

Cross-ISA Execution of SIMD Regions for Improved Performance
Yihan Pang, Rob Lyerly, Binoy Ravindran (Virginia Tech)


We investigate the effectiveness of executing SIMD workloads on multiprocessors with heterogeneous Instruction Set Architecture (ISA) cores. Heterogeneous ISAs offer an intriguing clock speed/parallelism tradeoff for workloads with frequent usage of SIMD instructions. We consider dynamic migration of SIMD and non-SIMD workloads across ISA-different cores to exploit this trade-off. We present the necessary modifications for a general compiler/run-time infrastructure to transform the dynamic program state of SIMD regions at run-time from one ISA format to another for cross-ISA migration and execution. Additionally, we present a SIMD-aware scheduling policy that makes cross-ISA migration decisions that improve system throughput. We prototype a heterogeneous-ISA system using an Intel Xeon x86-64 server and a Cavium ThunderX ARMv8 server and evaluate the effectiveness of our infrastructure and scheduling policy. Our results reveal that cross-ISA execution migration within SIMD regions can yield throughput gains up to 36% compared to traditional homogeneous ISA systems.

x86-64 Instruction Usage among C/C++ Applications
Amogh Akshintala, Bhushan P. Jain (University of North Carolina at Chapel Hill); Chia-che Tsai (Texas A&M University); Michael Ferdman (Stony Brook University); Donald E. Porter (University of North Carolina at Chapel Hill)


This paper presents a study of x86-64 instruction usage across 9,337 C/C++ applications and libraries in the Ubuntu 16.04 GNU/Linux distribution. We present metrics for reasoning about the relative importance of instructions weighted by the popularity of applications that contain them. From this data, we systematize and empirically ground conventional wisdom regarding the relative importance of various components of an ISA, with particular focus on building binary translation tools. We also verify the representativity of two commonly used benchmark suites, and highlight areas for improvement.

Time-multiplexed Parsing in Marking-based Network Telemetry (short)
Alon Riesenberg, Yonnie Kirzon, Michael Bunin, Elad Galili (Technion - Israel Institute of Technology); Gidi Navon (Marvell); Tal Mizrahi (Huawei Network.IO Innovation Lab)


Network telemetry is a key capability for managing the health and efficiency of a large-scale network. Alternate Marking Performance Measurement (AM-PM) is a recently introduced approach that accurately measures the packet loss and delay in a network using a small overhead of one or two bits per data packet. This paper introduces a novel time-multiplexed parsing approach that enables a practical and accurate implementation of AM-PM in network devices, while requiring just a single bit per packet. Experimental results are presented, based on a hardware implementation, and a software P4-based implementation.

16:45 - 18:30 Poster Session

Day 2: Tuesday, June 4, 2019

09:00 Welcome and Registration

09:30 Keynote #2: Biological Data Is Coming to Destroy Your Storage System
Bill Bolosky (Microsoft Research)

Bill Bolosky started his professional career working on the Mach operating system at CMU in the mid-80s. Subsequently, he received a Computer Science Ph.D. from the University of Rochester and started working at Microsoft Research in 1992. His first quarter century of work involved various systems problems: virtual memory, NUMA, video file servers (when that was hard), storage, and distributed systems in general. Figuring that much time in one area was enough, a few years ago he started working on bioinformatics software, which turned out to be a gateway drug that led to doing hard-core cancer research. Recently he began studying the mechanisms that underlie acute myeloid leukemia, as well as the general underpinnings of all cancers.

10:30 Break

11:00 Session 5: Deduplication
Session Chair: Liuba Shrira (Brandeis University)

SS-CDC: A Two-stage Parallel Content-defined Chunking for Deduplicating Backup Storage
Fan Ni (University of Texas at Arlington); Xing Lin (NetApp); Song Jiang (University of Texas at Arlington)


Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its capability to detect and remove duplicate data in modified files. However, the CDC algorithm is very compute-intensive and inherently sequential. Efforts on accelerating it by segmenting a file and running the algorithm independently on each segment in parallel come at a cost of substantial degradation of deduplication ratio.

In this paper, we propose SS-CDC, a two-stage parallel CDC, that enables (almost) full parallelism on chunking of a file without compromising deduplication ratio. Further, SS-CDC exploits instruction-level SIMD parallelism available in today's processors. As a case study, by using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, compared to existing parallel CDC methods which only achieve up to a 7.7X speedup on an 8-core processor with the deduplication ratio degraded by up to 40%, SS-CDC can achieve up to a 25.6X speedup with no loss of deduplication ratio.

Sketching Volume Capacities in Deduplicated Storage, FAST’19 (Highlight)
Danny Harnik, Moshik Hershkovitch (IBM Research); Yosef Shatsky (IBM Systems); Amir Epstein (Citi Innovation Lab); Ronen Kat (IBM Research)

11:40 Break

12:00 Session 6: Distributed Systems
Session Chair: Roy Friedman (Technion)

Storm: A Fast Transactional Dataplane for Remote Data Structures
Stanko Novakovic (Microsoft); Yizhou Shan (Purdue University); Aasheesh Kolli (Penn State and VMware); Michael Cui (VMware); Yiying Zhang (Purdue University); Haggai Eran (Mellanox and Technion); Boris Pismenny, Liran Liss (Mellanox); Michael Wei (VMware); Dan Tsafrir (Technion and VMware); Marcos Aguilera (VMware)


RDMA technology enables a host to access the memory of a remote host without involving the remote CPU, improving the performance of distributed in-memory storage systems. Previous studies argued that RDMA suffers from scalability issues, because the NIC's limited resources are unable to simultaneously cache the state of all the concurrent network streams. These concerns led to various software-based proposals to reduce the size of this state by trading off performance.

We revisit these proposals and show that they no longer apply when using newer RDMA NICs in rack-scale environments. In particular, we find that one-sided remote memory primitives lead to better performance as compared to the previously proposed unreliable datagram and kernel-based stacks. Based on this observation, we design and implement Storm, a transactional dataplane utilizing one-sided read and write-based RPC primitives. We show that Storm outperforms eRPC, FaRM, and LITE by 3.3x, 3.6x, and 17.1x, respectively, on an InfiniBand cluster with Mellanox ConnectX-4 NICs.

Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud, VLDB’18 (Highlight)
Ohad Shacham, Yonatan Gottesman (Yahoo Research, Verizon Media); Aran Bergman (Technion); Edward Bortnikov, Eshcar Hillel (Yahoo Research, Verizon Media); Idit Keidar (Technion and Yahoo Research, Verizon Media)

Kurma: Secure Geo-distributed Multi-cloud Storage Gateways
Ming Chen, Erez Zadok (Stony Brook University)


Cloud storage is highly available, scalable, and cost-efficient. Yet, many cannot store data in the cloud due to security concerns and legacy infrastructure such as network-attached storage (NAS). We describe Kurma, a cloud storage gateway system that allows NAS-based programs to seamlessly and securely access cloud storage. To share files among distant clients, Kurma maintains a unified file-system namespace by replicating metadata across geo-distributed gateways. Kurma stores only encrypted data blocks in clouds, keeps file-system and security metadata on-premises, and can verify data integrity and freshness without any trusted third party. Kurma uses multiple clouds to guard against cloud outages and vendor lock-in. Kurma's performance is 52-91% that of a local NFS server while providing geo-replication, confidentiality, integrity, and high availability.

13:00 Lunch Break

14:00 Social Event

Day 3: Wednesday, June 5, 2019

09:00 Welcome and Registration

09:30 Keynote #3: Is There Virtualization Beyond Containers? And Is It Useful to the Cloud?
James Bottomley (IBM Research)

James Bottomley is a Distinguished Engineer at IBM Research, where he works on cloud and container technology. Bottomley is also the Linux kernel maintainer of the SCSI subsystem. He has served as Director on the Board of the Linux Foundation and Chair of its Technical Advisory Board. He went to university at Cambridge for both his undergraduate and doctoral degrees, after which he joined AT&T Bell Labs to work on distributed lock manager technology for clustering. In 2000 he helped found SteelEye Technology, a high availability company for Linux and Windows, becoming Vice President and CTO. He joined Novell in 2008 as a Distinguished Engineer at Novell's SUSE Labs, Parallels (later Odin) in 2011 as CTO of Server Virtualization, and IBM Research in 2016.

10:30 Break

11:00 Session 7: OS
Session Chair: Orna Agmon Ben-Yehuda (Technion)

MEGA: Overcoming Traditional Problems with OS Huge Page Management
Theodore Michailidis, Alex Delis, Mema Roussopoulos (University of Athens)


Modern computer systems now feature memory banks whose aggregate size ranges from tens to hundreds of GBs. In this context, contemporary workloads can and often do consume vast amounts of main memory. This upsurge in memory consumption routinely results in increased virtual-to-physical address translations and, more importantly, more translation misses. Together, these two effects hamper the performance of workload execution. A solution aimed at dramatically reducing the number of address translation misses has been to provide hardware support for pages with bigger sizes, termed huge pages. In this paper, we empirically demonstrate the benefits and drawbacks of using such huge pages. In particular, we show that it is essential for modern OSes to refine their software mechanisms to more effectively manage huge pages. Based on our empirical observations, we propose and implement MEGA, a framework for huge page support for the Linux kernel. MEGA deploys basic tracking mechanisms and a novel memory compaction algorithm that jointly provide for the effective management of huge pages. We experimentally evaluate MEGA using an array of both synthetic and real workloads and demonstrate that our framework tackles known problems associated with huge pages, including increased page fault latency, memory bloating, and memory fragmentation, while still delivering all the benefits of huge pages.

A Critical RCU Safety Property Is... Ease of Use!
Paul E. McKenney (IBM Linux Technology Center)


Some might argue that read-copy update (RCU) is too low-level to be targeted by hackers, but the advent of Row Hammer [19] demonstrated the naïveté of such views. After all, if black-hat hackers are ready, willing, and able to exploit hardware bugs such as Row Hammer, they are assuredly ready, willing, and able to exploit bugs in RCU. Nor is it any longer the case that RCU's involvement in exploitable Linux-kernel bugs is strictly theoretical. However, this bug involved not RCU's correctness, but rather its ease of use. Nevertheless, it was a real bug that really needed fixing. This paper describes this bug and the road to its eventual fix.

11:40 Break

12:00 Session 8: Storage / Persistent Memory
Session Chair: Ethan L. Miller (University of California Santa Cruz)

Towards Building a High-performance, Scale-in Key-value Storage System
Yangwook Kang, Rekha Pitchumani, Pratik Mishra, Yang-suk Kee, Francisco Londono, Sangyoon Oh, Jupyung Lee, Jongyeol Lee, Daniel D. G. Lee (Samsung Electronics)


Key-value stores are widely used as storage backends, due to their simple, yet flexible interface for cache, storage, file system, and database systems. However, when used with high performance NVMe devices, their high compute requirements for data management often leave the device bandwidth under-utilized. This leads to a performance mismatch between what the device is capable of delivering and what it actually delivers, and the gains derived from high speed NVMe devices are nullified. In this paper, we introduce KV-SSD (Key-Value SSD) as a key technology in a holistic approach to overcome such performance imbalance. KV-SSD provides better scalability and performance by simplifying the software storage stack and consolidating redundancy, thereby lowering the overall CPU usage and releasing the memory to user applications. We evaluate the performance and scalability of KV-SSDs over state-of-the-art software alternatives built for traditional block SSDs. Our results show that, unlike traditional key-value systems, the overall performance of KV-SSD scales linearly, and delivers 1.6 to 57x gains depending on the workload characteristics.

WARCIP: Write Amplification Reduction by Clustering I/O Pages
Jing Yang, Shuyi Pei, Qing Yang (University of Rhode Island; Shenzhen DAPU Microelectronics Co., Ltd)


The storage volume of SSDs has been greatly increased recently with emerging multi-layer 3D triple-level cell and quad-level cell technologies. However, one critical overhead of any flash memory SSD is the garbage collection (GC) process that is necessary due to the inherent physical property of flash memories. GC is a time-consuming process that slows down I/O performance and decreases SSD endurance. To minimize the negative impact of GC, we introduce Write Amplification Reduction by Clustering I/O Pages (WARCIP). The idea is to use a clustering algorithm to minimize the rewrite interval variance of pages in a flash block. As a result, pages in a flash block tend to have a similar lifetime, minimizing write amplification during a garbage collection. We have implemented WARCIP on an enterprise NVMe SSD. Both simulation and measurement experiments have been carried out. Real world I/O traces and standard I/O benchmarks are used in our experiments to assess the potential benefit of WARCIP. Experimental results show that WARCIP dramatically reduces write amplification and reduces the number of block erasures by 4.45 times on average, implying extended lifetimes of flash SSDs.

FADaC: A Self-adapting Data Classifier for Flash Memory
Kevin Kremer, André Brinkmann (Johannes Gutenberg University Mainz)


Solid state drives (SSDs) implement a log-structured write pattern, where obsolete data remains stored on flash pages until the flash translation layer (FTL) erases them. erase() operations, however, cannot erase a single page, but target entire flash blocks. Since these victim blocks typically store a mix of valid and obsolete pages, FTLs have to copy the valid data to a new block before issuing an erase() operation. This process therefore increases the latencies of concurrent I/Os and reduces the lifetime of flash memory.

Data classification schemes identify data pages with similar update frequencies and group them together. FTLs can use this grouping to design garbage collection strategies that find victim blocks holding less valid data than they would without data classification, and therefore significantly reduce the number of additional I/Os.

Previous data classification algorithms have been designed without leveraging special features of flash memory and often rely on workload-specific configurations. Our classifier FADaC tunes its parameters online and operates on any given amount of memory by storing additional information within the metadata of flash pages. Additional read() requests for the classification are so few that FADaC reduces the internal flash overhead by up to 45% compared to the best classifier from previous work.

On Fault Tolerance, Locality, and Optimality in Locally Repairable Codes, USENIX ATC’18 (Highlight)
Oleg Kolosov (Tel Aviv University); Gala Yadgar, Matan Liram (Technion - Israel Institute of Technology); Alexander Barg (University of Maryland); Itzhak Tamo (Tel Aviv University)

13:20 Closing Remarks

13:35 Conference Adjournment and Lunch

14:30 – 16:45
The Systems Behind AI
Meetup co-located with SYSTOR 2019. Talks by Ranit Aharonov and Yoav Katz (IBM), Edward Bortnikov (Yahoo Research, Verizon Media), Alon Lubin (Taboola), and Idan Levi (Red Hat), plus a panel discussion.
Participation is free.




