
SYSTOR 2009
The Israeli Experimental Systems Conference

May 4-6, 2009
Organized by IBM R&D Labs in Israel


Abstracts


Supercomputing: from Petascale to Exascale
Marc Snir (University of Illinois)
In a couple of years, the Blue Waters system will be installed at the University of Illinois. This system will provide sustained Petaflop/s performance on applications of interest, using many hundreds of thousands of cores. Meanwhile, research is gearing up for the design and implementation of the next generations of supercomputers, with an expectation that Exascale performance will be achieved in little more than a decade. The talk will survey the evolution of supercomputing technology; we shall discuss the software advances needed to maintain the rate of progress in supercomputing performance and survey the research pursued at Illinois in this area.


Optimistic Concurrency for Clusters: Transactional Memory vs. Speculative Locking
Konstantin Shagin (IBM, IL), Michael Factor (IBM, IL), Assaf Schuster (Technion, IL) and Tal Zamir (Technion, IL)
Transactional memory and speculative locking are optimistic concurrency control mechanisms whose goal is to enable highly concurrent execution while reducing the programming effort. The same basic idea lies at the heart of both methods: optimistically execute a critical code segment, determine whether there have been data conflicts, and roll back if validation fails. Transactional memory is widely considered to have advantages over lock-based synchronization on shared-memory multiprocessors. Several recent works suggest employing transactional memory in a distributed environment. However, being derived from the traditional shared-memory design space, these schemes do not seem "optimistic" enough for this setting. Each thread must validate the current transaction before proceeding to the next. Hence, blocking remote requests whose purpose is to detect or avoid data conflicts are placed on the critical path and thus delay execution.
In this paper, we investigate whether, in light of the above shortcomings, speculative locking can be a suitable alternative to transactional memory in a distributed environment. We present a distributed speculative locking scheme and compare its properties to those of existing distributed transactional memory protocols. Despite the conceptual similarity to transactional memory, the distributed implementation of speculative locking manages to overlap communication with computation. It allows a thread to speculatively acquire multiple locks simultaneously, which is analogous to executing one transaction before validating the previous one.
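A minimal, single-process sketch of the optimistic pattern both schemes share (versioned data, speculative execution, validation, rollback); the class and function names are illustrative and are not taken from the paper:

```python
# Sketch of optimistic concurrency: execute speculatively, validate, roll back.
class VersionedCell:
    def __init__(self, value):
        self.value = value
        self.version = 0

def speculative_update(cell, fn, max_retries=10):
    """Optimistically apply fn to cell.value; retry when validation fails."""
    for _ in range(max_retries):
        snapshot = cell.version
        new_value = fn(cell.value)          # speculative work, no lock held
        if cell.version == snapshot:        # validation: nobody wrote meanwhile
            cell.value = new_value
            cell.version += 1
            return new_value
        # conflict detected: discard the speculative result and retry
    raise RuntimeError("too many conflicts")

counter = VersionedCell(0)
speculative_update(counter, lambda v: v + 1)
print(counter.value)                         # 1
```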


Transactifying Apache's Cache Module
Haggai Eran, Ohad Lutzky, Zvika Guz and Idit Keidar (Technion, IL)
Apache is a large-scale industrial multi-process and multi-threaded application, which uses lock-based synchronization. We report on our experience in modifying Apache to employ transactional memory instead of locks, a process we refer to as transactification; we are not aware of any previous efforts to transactify legacy software of such a large scale. Along the way, we learned some valuable lessons about which tools one should use, which parts of the code one should transactify and which are better left untouched, as well as on the intricacy of commit handlers. We also stumbled across weaknesses of existing software transactional memory (STM) toolkits, leading us to identify desirable features they are currently lacking.

Finally, we present performance results from running Apache on a 32-core machine, showing that there are scenarios where the performance of the STM-based version is close to that of the lock-based version. These results suggest that there are applications for which the overhead of using a software-only implementation of transactional memory is insignificant.


SCSI Referral -- Adding Cluster Support to the SCSI Protocol
Andrew Spry and Ross Zwisler (LSI, US)
Networked storage provides significant benefits in storage consolidation and simplified management.  Initial networked storage was limited to the capacity and performance provided by a single storage system.  Many techniques including clustered file systems, object based storage, and clustered block storage have been developed to cluster storage systems to overcome these limits.  Although these techniques can create storage volumes larger than a single cluster node capacity, they all have tradeoffs in added complexity, cost, or performance.  All of these techniques have been developed in an effort to solve the basic problem in clustered storage of telling the initiator where data is located in the storage cluster.

We propose a simple change to SCSI operation, called SCSI Referrals, which directly solves the basic problem with clustered storage.  The operation of the SCSI Referral concept is explained.  The use of SCSI referrals to implement other storage cluster capabilities such as load balancing and multipath support is examined.  Initial prototype results are presented to show that SCSI Referrals have performance that is equal to or better than storage clusters created using initiator based virtual volume managers.
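As a rough illustration of the referral idea (the extents, ports, and sizes below are invented): when an I/O addresses blocks owned by another cluster node, the target can refer the initiator to the port that owns the extent, and the initiator re-issues the command there.

```python
# Hypothetical referral table: which cluster node/port owns which LBA range.
REFERRAL_TABLE = [
    (0,         999_999,   "port A"),    # extents owned by node A
    (1_000_000, 1_999_999, "port B"),    # extents owned by node B
]

def route_io(lba, length):
    """Return the port that owns the whole extent, mimicking a referral lookup."""
    for first, last, port in REFERRAL_TABLE:
        if first <= lba and lba + length - 1 <= last:
            return port
    raise ValueError("extent crosses nodes: split the I/O and follow each referral")

print(route_io(1_500_000, 8))   # -> port B
```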


Object-based pNFS in Linux
Benny Halevy (Panasas, IL)
Parallel NFS (pNFS) extends NFSv4 to allow clients to directly access file data on the storage used by the NFSv4 server.  This ability to bypass the server for data access can increase both performance and parallelism, but requires additional client functionality for data access, some of which is dependent on the class of storage used (a.k.a. the layout type). The specification of pNFS is part of minor version 1 of the NFSv4 protocol, which has now been approved by the IETF for publication as a Proposed Standard.
In the pNFS architecture, the NFSv4.1 server is used primarily as a metadata server (MDS), whose role, on top of its legacy NFSv4.1 functionality, is to provide layout and device information to the pNFS clients and to coordinate their access to storage.  The protocol specifies the operations used between the pNFS clients and the MDS; however, it does not specify the type of storage, how the client connects to it, or the protocol used to access it.
These attributes are part of the layout type. The main pNFS operations and data types in NFSv4.1 specify a layout-type-independent layer; layout-type-specific information is conveyed using opaque data structures whose internal structure is further defined by the particular layout type specification. At the moment, three distinct layout types are specified: files, using NFSv4.1 file servers as data servers; blocks, using SAN-attached block storage; and objects, using Object-based Storage Devices (OSDs).

The Linux pNFS developer team has prototyped extensible support for all three layout types.  This article will provide an overview of the generic infrastructure for pNFS in the Linux kernel and then focus on the object-based pNFS implementation, including the different components introduced, such as the OSD protocol stack, the exofs filesystem (based on IBM's OSDFS, written by Avishay Traeger), the OSD target, and the pnfs-obj layout driver.

At this time, we can present experimental results achieved over this architecture where a Panasas object-based system is exported over pNFS and the pNFS clients access the Panasas OSDs directly using the panlayout layout driver (which is the prototype for the pnfs-obj layout driver).

The end-to-end implementation of pNFS over objects in Linux, using the standard OSD protocol and available open-source components, is still in the works. We expect to have preliminary experimental results with the prototype implementation that will show the baseline performance achievable in this model and its scalability when large files are striped over multiple OSDs.
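The data path described above can be summarized schematically (this is not kernel code; the function and field names are placeholders): the client asks the MDS for a layout once, then issues I/O directly to the data servers named by that layout.

```python
def layoutget(mds, file_handle):
    # Stands in for the NFSv4.1 LAYOUTGET operation; returns a stub layout.
    return {"stripe_unit": 64 * 1024, "data_servers": mds["osds"]}

def pnfs_read(mds, file_handle, offset, length):
    layout = layoutget(mds, file_handle)                  # metadata path (MDS)
    stripe = offset // layout["stripe_unit"]
    osd = layout["data_servers"][stripe % len(layout["data_servers"])]
    # Data path: the client talks to the chosen OSD directly, bypassing the MDS.
    return f"read {length} bytes at offset {offset} from {osd}"

mds = {"osds": ["osd0", "osd1", "osd2"]}
print(pnfs_read(mds, "fh-1", 128 * 1024, 4096))
```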


Storage Modeling for Power Estimation

Miriam Allalouf, Yuriy Arbitman, Michael Factor, Ronen Kat, Kalman Meth and Dalit Naor (IBM, IL)
Power consumption is a major issue in today's datacenters. Storage typically comprises a significant percentage of datacenter power. Thus, understanding, managing, and reducing storage power consumption is an essential aspect of any efforts that address the total power consumption of datacenters. We developed a scalable power modeling method that estimates the power consumption of storage workloads. The modeling concept is based on identifying the major workload contributors to the power consumed by the disk arrays.

To estimate the power consumed by a given host workload, our method translates the workload into the primitive activities induced on the disks. In addition, we identified that I/O queues have a fundamental influence on power consumption. Our power estimation results are highly accurate, with only 2% deviation for typical random workloads with small transfer sizes (up to 8K), and a deviation of up to 8% for workloads with large transfer sizes. We successfully integrated our modeling tool, STAMP, into a power-aware capacity planning tool to predict system power requirements, and integrated it into an online storage system to provide online estimation of the power consumed.
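A toy linear model in the spirit of this approach (the coefficients below are invented for illustration and are not STAMP's): translate host-level counters into per-disk activity and weight each activity by a calibrated power cost.

```python
IDLE_WATTS_PER_DISK = 7.0     # spindle and electronics at rest (assumed)
WATTS_PER_KIOPS = 15.0        # incremental cost of random head movement (assumed)
WATTS_PER_100MBPS = 5.0       # incremental cost of sustained transfer (assumed)

def estimate_array_power(num_disks, iops, mb_per_sec):
    """Estimate average array power from host-visible workload counters."""
    per_disk = (IDLE_WATTS_PER_DISK
                + WATTS_PER_KIOPS * (iops / num_disks) / 1000.0
                + WATTS_PER_100MBPS * (mb_per_sec / num_disks) / 100.0)
    return per_disk * num_disks

print(round(estimate_array_power(num_disks=16, iops=4000, mb_per_sec=120), 1))
```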


Energy and Performance Evaluation of Lossless File Data Compression on Server Systems
Rachita Kothiyal, Vasily Tarasov, Priya Sehgal and Erez Zadok (Stony Brook University, US)
Data compression has been claimed to be an attractive solution for saving energy in high-end servers and data centers. However, this claim has not been studied in depth. In this paper, we present a comprehensive evaluation of the energy consumption of various file compression techniques implemented in software. We apply various compression tools available on Linux to a variety of data files, and we try them on server-class and workstation-class systems. We compare their energy and performance results against raw reads and writes. Our results reveal that software-based data compression cannot be considered a universal solution for reducing energy consumption. Various factors, such as the type of data file, the compression tool being used, the read-to-write ratio of the workload, and the hardware configuration of the system, impact the efficacy of this technique. In some cases, however, we found compression to save substantial energy and improve performance.


Designing a Power Efficiency Framework for Battery Powered Systems
Dacian Tudor and Marius Marcu (Politehnica University of Timisoara, RO)
With the number of battery-powered systems continuously increasing, the problem of extending battery life while supporting ever more resource-demanding features has been drawing a lot of attention. Besides the advances in power efficiency in the lower layers and hardware components, another dimension of power efficiency has been opened up by energy-aware higher layers. In this paper we briefly survey some of the most recent directions in supporting power efficiency for battery-powered devices. We propose a new power efficiency framework which addresses some of the major shortcomings of existing solutions. We describe the specific concepts of the new framework together with a new evaluation dimension at both low and high levels. Last but not least, we define the architecture of the framework, which we aim to realize and deploy in several battery-powered mobile systems.


The Next Generation Data Center Architecture
Gilad Shainer and Michael Kagan (Mellanox, IL)
A data center can be defined as a facility that contains concentrated equipment to perform one or more of the following functions: store, manage, process, and exchange digital data and information. Such digital data and information is typically applied in one of two ways: to support the informational needs of large institutions (such as corporations and educational institutions), or to provide application services or management for various types of data processing (such as web hosting, Internet, intranet, telecommunication, and information technology).

Recently, data centers have grown in size and numbers due to the increased use of online banking and electronic stock trading, the adoption of Internet-based communications and entertainment, the shift to using electronic medical records for healthcare, and due to government regulations requiring digital records retention.

With this growth in IT capacity, data centers have become major consumers of electrical energy to power and cool the equipment. Furthermore, data center networks have increased in size, and managing a growing multi-infrastructure has become a daunting task. Enterprise data centers currently use three different networks: Storage Area Networks using Fibre Channel transport for storage access, Local Area Networks using Ethernet transport for standard network access, and System Area Networks using InfiniBand transport for inter-process communication and high-performance clustering.

In order to reduce the energy, real-estate, management, and infrastructure costs of modern data centers, a new field of data center architecture was defined: the efficient data center. The new architecture leverages server and storage virtualization, together with I/O virtualization and consolidation, to enable green, simply managed, highly utilized modern data centers.

Consolidating data center I/O is not a simple task. On the one hand, the ability to run any application over any transport must be maintained; on the other hand, a “one wire” high-throughput approach is desired. Another requirement is the ability to support adaptive networking and adjustability to different application loads and services. This session will review the various data center interconnect technologies and recent research topics related to enabling efficient data center design and true I/O consolidation.


The Design of a Similarity Based Deduplication System
Lior Aronovich (IBM Diligent), Ron Asher (IBM Diligent), Eitan Bachmat (Ben Gurion University, IL), Haim Bitner (Marvell, IL), Michael Hirsch (IBM Diligent) and Shmuel Tomi Klein (Bar Ilan University, IL)
We describe some of the design choices that were made during the development of the IBM TS7650G ProtecTier, a fast, scalable, inline deduplication device. The system's design goals and how they were achieved are presented. This is the first and only deduplication device that uses similarity matching. The paper provides the following original research contributions: we show how similarity signatures can serve in a deduplication scheme; a novel type of similarity signature is presented and its advantages in the context of deduplication requirements are explained.
It is also shown how to combine similarity matching schemes with hash based identity schemes.


The Effectiveness of Deduplication on Virtual Machine Disk Images
Keren Jin and Ethan Miller (University of California, Santa Cruz, US)
Virtualization is becoming widely deployed in servers to efficiently provide many logically separate execution environments while reducing the need for physical servers. While this approach saves physical CPU resources, it still consumes large amounts of storage because each virtual machine instance requires its own multi-gigabyte disk image. Moreover, existing systems do not support ad hoc block sharing between disk images, instead relying on techniques such as overlays to build multiple virtual machines from a single "base" image.

Instead, we propose the use of deduplication to both reduce the total storage required for virtual machine disk images and increase the ability of virtual machines to share disk blocks. To test the effectiveness of deduplication, we conducted extensive evaluations on different sets of virtual machine disk images with different chunking strategies. Our experiments found that the amount of unique data grows very slowly after the first few virtual disk images if only the locale or software configuration is changed, with the rate of compression suffering when different versions of an operating system or different operating systems are included. We also show that fixed-length chunks work well, achieving nearly the same compression rate as variable-length chunks. Finally, we show that simply identifying zero-filled blocks, even in ready-to-use virtual machine disk images available online, can provide significant savings in storage.
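A minimal sketch of the two findings highlighted above, using fixed-size chunks hashed for exact-duplicate detection plus a special case for zero-filled chunks (the chunk size and sample data are arbitrary):

```python
import hashlib

CHUNK_SIZE = 4096

def chunk_stats(image_bytes):
    """Count total, zero-filled, and unique non-zero fixed-size chunks."""
    seen, unique, zero, total = set(), 0, 0, 0
    for off in range(0, len(image_bytes), CHUNK_SIZE):
        chunk = image_bytes[off:off + CHUNK_SIZE]
        total += 1
        if not chunk.strip(b"\x00"):          # zero-filled chunk
            zero += 1
            continue
        digest = hashlib.sha1(chunk).digest()
        if digest not in seen:
            seen.add(digest)
            unique += 1
    return {"total": total, "zero": zero, "unique": unique}

image = b"\x00" * 8192 + b"abcd" * 4096       # two zero chunks + repeated data
print(chunk_stats(image))                     # most chunks deduplicate away
```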


Multi-Level Comparison of Data Deduplication in a Backup Scenario
Dirk Meister and Andre Brinkmann (University of Paderborn, DE)
Data deduplication systems detect redundancies between data blocks to either reduce storage needs or to reduce network traffic. A class of deduplication systems splits the data stream into data blocks (chunks) and then finds exact duplicates of these blocks.

This paper compares the influence of different chunking approaches on multiple levels. On a macroscopic level, we compare the chunking approaches based on real-life user data in a weekly full backup scenario, both at a single point in time and over several weeks.

In addition, we analyze how small changes affect the deduplication ratio for different file types on a microscopic level for chunking approaches and delta encoding. An intuitive assumption is that small semantic changes on documents cause only small modifications in the binary representation of files, which would imply a high ratio of deduplication. We will show that this assumption is not valid for many important file types and that application specific chunking can help to further decrease storage capacity demands.
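For illustration, a simplified content-defined chunker in the spirit of the chunking approaches compared here (the window size, mask, and fingerprint choice are arbitrary): cut points depend on content rather than offsets, so a small edit leaves most chunks, and hence most duplicates, intact.

```python
import hashlib, random

WINDOW, MASK, MIN_CHUNK = 48, (1 << 11) - 1, 512     # expected chunk ~2.5 KiB

def cdc_chunks(data):
    """Cut wherever a fingerprint of the trailing WINDOW bytes hits the mask."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start < MIN_CHUNK or i < WINDOW:
            continue
        fp = int.from_bytes(hashlib.md5(data[i - WINDOW:i]).digest()[:4], "big")
        if fp & MASK == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

random.seed(0)
doc_v1 = bytes(random.getrandbits(8) for _ in range(32 * 1024))
doc_v2 = b"small insertion" + doc_v1                 # edit near the beginning
shared = set(cdc_chunks(doc_v1)) & set(cdc_chunks(doc_v2))
print(len(cdc_chunks(doc_v1)), "chunks,", len(shared), "unchanged after the edit")
```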


Tivoli Storage Manager Fast Back (FilesX) Technical Overview
Eran Raichstein (IBM FilesX, IL)
IBM has acquired FilesX, Inc., a storage software company specializing in continuous data protection and near-instant recovery of applications and data. The presentation will give an introduction to the acquisition and to the advanced technologies that exist inside the FastBack product suite.


DHIS: Discriminating Hierarchical Storage
Chaitanya Yalamanchili (Stony Brook University, US), Gopalan Sivathanu (Google, US), Kiron Vijayasankar (Stony Brook University, US) and Erez Zadok (Stony Brook University, US)
We propose and evaluate a hierarchical storage system, DHIS, that uses application-level hints to discriminate between data with different access characteristics, and then customizes its layout and caching policies for each type. DHIS provides a new design choice for making optimal data placement decisions in a storage hierarchy comprising components with varying performance and cost characteristics. The data placement decisions are made in an online fashion, as and when the data are created. Most existing solutions require moving data around based on access characteristics. DHIS uses two kinds of information to make its decisions.

First, it uses knowledge about higher-level pointers between blocks (for example, file system pointers) to understand the relationship between blocks and consequently, their importance. Second, DHIS defines a set of generic attributes that the higher layers can use to annotate data, conveying various properties such as importance, access pattern, etc. Based on these attributes, DHIS dynamically decides to place the data in the format best suited for its requirements. By doing so, DHIS solves a critical problem faced by storage vendors and developers of higher level storage software in terms of choosing the most efficient policy from among many alternatives. Through several benchmarks, we show that DHIS’s data placement decisions are optimal and have significant performance benefits.
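A toy placement policy in this spirit (the attribute names and tiers are invented, not DHIS's actual interface): higher layers annotate data with generic hints and the storage system picks a tier when the data is created.

```python
TIERS = ("ssd", "fast_hdd", "archive_hdd")     # fastest to cheapest (assumed)

def place(attrs):
    """Pick a tier from generic hints supplied by the layer above."""
    if attrs.get("importance") == "metadata":
        return "ssd"
    if attrs.get("access") == "random" and attrs.get("hot"):
        return "ssd"
    if attrs.get("access") == "sequential":
        return "fast_hdd"
    return "archive_hdd"

print(place({"access": "random", "hot": True}))     # -> ssd
print(place({"access": "sequential"}))              # -> fast_hdd
print(place({"importance": "backup"}))              # -> archive_hdd
```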


Write Amplification Analysis in Flash-Based Solid State Drives

Xiao-Yu Hu, Evangelos Eleftheriou, Robert Haas, Ilias Iliadis and Roman Pletka (IBM, SW)
Write amplification is a critical factor limiting the random write performance and write endurance in storage devices based on NAND-Flash memories such as solid-state drives (SSDs). In this paper, we present a novel probabilistic model of write amplification for log-structured Flash-based SSDs. The impact of garbage collection on write amplification is influenced by the level of over-provisioning and the choice of reclaiming policy: we quantify the impact of over-provisioning on write amplification analytically and by simulation, assuming workloads of uniformly-distributed random short writes; we propose modified versions of the greedy garbage-collection reclaiming policy and compare their performance. Finally, we analytically evaluate the benefits of separating static and dynamic data in reducing write amplification, and how to address endurance with proper wear leveling.
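A small Monte-Carlo sketch of the effect studied here (all sizes and the spare factor are arbitrary): uniform random page writes on a log-structured device with greedy garbage collection, reporting the resulting write amplification. This is a toy simulation, not the paper's analytical model.

```python
import random

def simulate_wa(user_pages=8192, spare_factor=0.25, pages_per_block=64,
                user_writes=100_000, seed=1):
    """Greedy garbage collection under uniform random writes; returns WA."""
    random.seed(seed)
    num_blocks = int(user_pages * (1 + spare_factor)) // pages_per_block
    valid = [set() for _ in range(num_blocks)]   # live logical pages per block
    loc = {}                                     # logical page -> physical block
    free_blocks = list(range(num_blocks))
    active, fill, flash_writes = free_blocks.pop(), 0, 0

    def gc():
        # Greedy policy: erase the non-active block with the fewest valid pages.
        victim = min((b for b in range(num_blocks) if b != active),
                     key=lambda b: len(valid[b]))
        survivors = list(valid[victim])
        valid[victim].clear()
        free_blocks.append(victim)
        for lpn in survivors:                    # relocations count as writes
            program(lpn)

    def program(lpn):
        nonlocal active, fill, flash_writes
        while fill == pages_per_block:           # need a fresh active block
            if free_blocks:
                active, fill = free_blocks.pop(), 0
            else:
                gc()
        valid[active].add(lpn)
        loc[lpn] = active
        fill += 1
        flash_writes += 1

    for _ in range(user_writes):
        lpn = random.randrange(user_pages)
        if lpn in loc:
            valid[loc[lpn]].discard(lpn)         # invalidate the old copy
        program(lpn)
    return flash_writes / user_writes

print(round(simulate_wa(spare_factor=0.25), 2))  # > 1; shrinks as spare grows
```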


IBM SAN Volume Controller (SVC)
Barry Whyte (IBM, UK) and Ian Jud (IBM, UK)
The IBM SVC was the first successful storage virtualization appliance. This session will cover the design decisions behind using commodity system hardware, the software stack and how it allows for flexible development and simple addition of advanced functions, and the scale-out clustered nature of the product.

SVC is unique in the industry, supporting heterogeneous SAN attachment of storage, and is key to IBM's SSD strategy moving into the future. The tiering, ILM, and HSM nature of SVC also allows for true online migration of data between tiers of storage.

This session will also cover how SVC was born, its key design ethos, and where it can take storage virtualization in the future.


Architecture of The Internet Archive
Elliot Jaffe and Scott Kirkpatrick (Hebrew University of Jerusalem, IL)
The Internet Archive is a live production system supporting close to a petabyte of data and delivering an average of 2.3 Gb/sec of data to Internet users. We describe the architecture of this system with an emphasis on its robustness and how it is managed by a very small team of systems personnel. Notably, the current system does not employ a cache. We analyze the reasons for this decision and show that an effective cache could not be built until now. However, new solid state disk technology may offer promising new cache implementations.


Blade Center Open Fabric Manager
Ben-Ami Yassour, Martin Tross, Zvi Dubitzky, David Kohen and Orit Wasserman (IBM, IL)
Reducing management costs is one of the major challenges of data centers today. In practice, a significant part of these costs comes from the need to update the configuration of existing servers and of network and storage switches. For example, configuration changes are required when servers are replaced, because some configurations depend on the addresses that are burned into the storage and network adapters of these servers during manufacturing. For example, the Fibre Channel (FC) zoning configuration and the access grants in the storage controller (LUN masking) are based on the World Wide Port Name (WWPN) that is burned into the FC HBA. If a server fails and needs to be replaced by a new server, the FC switches and storage controller need to be reconfigured to allow the new server to access the storage that was used by the server that failed.
Our innovative approach in the IBM BladeCenter Open Fabric Manager (BOFM) solution is to virtualize the adapter addresses and decouple the dependency between the addresses that are burned on the adapters and the configuration of the network switches, storage switches, and storage controllers. BOFM enables the administrator to define a virtual profile for each blade. The virtual profile defines the virtual addresses for each type of adapter on the blade. When a blade is inserted into the BladeCenter, the blade automatically inherits the virtual profile and uses it before the OS starts loading. With this capability, replacing a failed blade is as simple as taking out the failed blade and inserting a new one. The adapters on the new blade will have the same identities as those on the failed blade, as defined by the virtual profile.
By allowing these addresses to be changed under the control of management software, network and storage resources can be pre-provisioned before the server hardware/blades arrive on site, and servers/blades can be replaced easily without the need to reconfigure network and storage resources. Another important challenge in the data center is to provide availability of the services provided by the blade servers. Hardware availability, in the case of a server failure, requires an automatic failover to a standby server. Traditionally, granting access to the standby server, so that it can replace the failed one, requires changes to the network switches, storage switches, and storage configuration. These configuration changes are usually different for every type of switch and storage controller, and it is therefore very difficult to maintain an automatic tool that performs this reconfiguration. With BOFM this is no longer a problem: all it takes is a transition of the virtual profile from one blade to another. No changes need to be made in the switches or storage controller. Advanced BladeCenter Open Fabric Manager uses the basic BOFM functionality to provide automatic failover support.
The basic BOFM functionality consists of the technique used to apply a virtual profile to a slot of a BladeCenter. The profile itself is stored on the BladeCenter Advanced Management Module, a management utility that controls all resources of the BladeCenter and resides within it. When a blade is inserted into a slot, the virtual profile is sent from the management module to the blade's service processor. When the blade is powered up, the blade BIOS retrieves the profile from the service processor and assigns the relevant address to each of the adapters. This technique enables an out-of-band method for controlling the addresses that does not require any changes to the OS. It is also compatible with any type of network or storage switch. BOFM supports virtualization of Ethernet, Fibre Channel, and SAS addresses.

BOFM introduces a new slot-based approach, by which the identity of the server is predefined by the BladeCenter slot; when a blade is moved from one slot to another, it swaps its identity. With such an approach, replacing blades and failing over from one blade to another becomes an easy task.
BOFM provides a method to define the configuration of up to 100 chassis from a single BladeCenter. The administrator can define the virtual profile for each blade in the domain in a single configuration file and apply the configuration to the entire domain. The profile of each BladeCenter is stored locally on its Advanced Management Module.

The process of changing the BOFM configuration requires some delicate steps to validate that there is no duplication of addresses at any time, even if the process of changing the configuration partially fails. The danger of duplicate addresses is that they can disrupt routing in the case of Ethernet, and might lead to data corruption in the case of Fibre Channel if two blades access the same LUN.

When applying a new configuration file, the first step in preventing address duplication, taken by the management module, is to verify that there are no duplications in the configuration file of the entire domain. This check alone is not enough: for example, if an address is transferred from blade A to blade B while blade A is still using it, both blades A and B can end up using the same address. Therefore, the process must validate that an address that is in use cannot be transferred. Finally, special care must be taken to ensure that even if there is a failure during the process of applying a new configuration, at no time can there be duplicate addresses. This is achieved by reclaiming the address from a blade before transferring it to a different blade.
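A hypothetical sketch of the two checks just described (the data layout and names are invented): reject a domain configuration that contains a duplicate address, and allow an address to move to another slot only after the current user has released it.

```python
def validate_domain_config(profiles):
    """profiles: {slot: list of virtual addresses}; reject any duplicates."""
    seen = {}
    for slot, addresses in profiles.items():
        for addr in addresses:
            if addr in seen:
                raise ValueError(f"{addr} assigned to both {seen[addr]} and {slot}")
            seen[addr] = slot

def transfer_address(addr, src_slot, dst_slot, in_use):
    """Move an address only after the source slot has released (reclaimed) it."""
    if addr in in_use.get(src_slot, set()):
        raise RuntimeError(f"{addr} still in use by {src_slot}; reclaim it first")
    in_use.setdefault(dst_slot, set()).add(addr)

validate_domain_config({"slot1": ["20:00:00:00:00:00:00:AA"],
                        "slot2": ["20:00:00:00:00:00:00:BB"]})
```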

Another potential cause of address duplication is the case in which a blade is moved to a different chassis. In this case, if the chassis supports BOFM and a virtual profile is defined for that slot, then the blade will switch to the new profile. However, if the chassis does not support BOFM or no virtual profile is defined for that slot, the blade will revert to using its original burned-in addresses.

In summary, BOFM provides a method to virtualize the Ethernet, Fibre Channel, and SAS addresses of blades, which significantly increases management efficiency for blade replacement and enables automatic failover. This valuable functionality removes the constraints enforced by the traditional approach, which relies on addresses burned into the adapters, and leads the way to better and more flexible management possibilities.

BOFM has been available as an option for IBM BladeCenter users as of December 2007. More product information can be found here: http://www-03.ibm.com/systems/bladecenter/hardware/openfabric/openfabricmanager.html


Design Principles of the Neocleus Client Hypervisor

Etay Bogner (Neocleus, IL)
When considering the design principles of client hypervisors and client virtualization solutions, one needs to examine the variety of available solutions to understand how to make the quantum leap in order to bring a better product to the market.  When considering all use cases, most companies are aware of client-hosted, Type 2, client virtualization solutions.  Yet these solutions suffer from architectural limitations that prevent them from being used as a ubiquitous client hypervisor.

The alternative is a Type 1, “bare metal” virtualization solution that offers a commonly agreed-upon advantage in its security architecture.  Other not-so-easy to realize advantages of this approach are performance and hardware compatibility.  But a Type 2 client-hosted hypervisor has one architectural advantage over Type 1 client hypervisors: it is non-intrusive.  The primary disadvantage associated with Type 1 is that it is intrusive, in that it must be permanently installed on the bare-metal hardware, below the OS.

From an architectural point of view, Type 2 client-hosted hypervisors’ shortcomings cannot be resolved, and a Type 1 Client Hypervisor is intrusive by design. The question is then, how to design a best-of-breed client hypervisor? It is evident that architecturally, a Type 1 client hypervisor has the most promise. Therefore, the critical success factor would be to overcome the intrusiveness and add “transparency” to the mix.

This presentation will overview the design philosophy behind the Neocleus client-hosted hypervisor, which supports full device pass-through to the “guest” virtual machines while also supporting dynamic assignment and “switching” of devices between different guests.   Etay Bogner, chief technology officer for Neocleus, will outline why providing full device pass-through capabilities enables the company to provide a best-of-breed Type 1 client hypervisor that allows end-users and organizations the freedom to mix and match different device models in order to best fit their expectations and requirements.


Towards High-Quality I/O Virtualization

Eddie Dong (Intel, CH), Jason Dai (Intel, CH), Zhiteng Huang (Intel, CH), Alei Liang (Shanghai Jiao Tong University, CH), Kevin Tian (Intel, CH) and Yunhong Jiang (Intel, CH)
High-quality I/O virtualization (that is, complete device semantics, a full feature set, close-to-native performance and real-time response) is critical to both server and client virtualization. Existing solutions for I/O virtualization (e.g., full device emulation, paravirtualization and direct I/O) cannot meet the requirements of high-quality I/O virtualization due to high overheads or lack of complete semantics or full feature set support.
We have developed new techniques for high-quality I/O virtualization (including device semantic preservation, essential principles for avoiding device virtualization holes, and real-time VMM scheduler extensions), using direct I/O with a hardware IOMMU. Our approach not only meets the requirements of high-quality I/O virtualization, but is also the basis for PCI-SIG I/O Virtualization (IOV). Experimental results show that our implementation can achieve up to 98% of native performance and up to 3.6X the paravirtualization performance. In addition, it can improve the real-time responsiveness of latency-sensitive applications by up to 4.8X with the scheduler extensions.


The Effect of Unrolling and Inlining for Python Bytecode Optimizations

Nadav Rotem and Yosi Ben Asher (University of Haifa, IL)
In this study, we consider bytecode optimizations for Python, a programming language which combines object-oriented concepts with features of scripting languages, such as dynamic dictionaries. Due to its design nature, Python is relatively slow compared to other languages. It operates by compiling the code into powerful bytecode instructions that are executed by an interpreter. Python's speed is limited due to its interpreter design, and thus there is a significant need to optimize the language. In this paper, we discuss one possible approach to optimizing Python based on bytecode transformations, and its limitations. In the first stage of the proposed optimizer, the bytecode is expanded using function inlining and loop unrolling. The second stage of transformations simplifies the bytecode by applying a complete set of data-flow optimizations, including constant propagation, algebraic simplifications, dead code elimination, copy propagation, common subexpression elimination, loop-invariant code motion and strength reduction. While these optimizations are known and their implementation mechanism (data flow analysis) is well developed, they have not been successfully implemented in Python due to its dynamic features, which prevent their use.

In this work we attempt to understand the dynamic features of Python and how these features affect and limit the implementation of these optimizations. In particular, we consider the significant effects of first unrolling and then inlining on the ability to apply the remaining optimizations. The results of our experiments indicate that these optimizations can indeed be implemented and dramatically improve execution times.
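A source-level illustration of the enabling transformations discussed above (the paper itself operates on bytecode; this toy merely shows the idea): inlining a small callee and unrolling a fixed-trip loop expose constants that data-flow passes can then fold away.

```python
def scale(x):                 # a small callee, a natural inlining candidate
    return 3 * x

def before():
    total = 0
    for i in range(4):        # fixed trip count, a natural unrolling candidate
        total += scale(i)
    return total

def after():
    # After inlining, unrolling, and constant propagation the loop reduces to a
    # constant expression, which CPython's peephole pass folds at compile time.
    return 0 + 3 + 6 + 9

assert before() == after() == 18
```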


Cache-Aware Scheduling of Multi-Dimensional Iteration Spaces for Efficient Load Balancing on Multi-Core Systems

Arun Kejariwal (Yahoo!, US) and Alexandru Nicolau (University of California, Irvine, US)
The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores, which in turn has a direct impact on performance. In light of the fact that parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel Xeon based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.
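A toy version of the core scheduling idea (the cost profile below is fabricated): weight each iteration by its profiled cost, cache misses plus computation, and cut the iteration space so that every core receives roughly equal total cost rather than an equal number of iterations.

```python
def partition(costs, cores):
    """Split iterations so each core receives roughly equal total cost."""
    target, bounds, acc, start = sum(costs) / cores, [], 0.0, 0
    for i, c in enumerate(costs):
        acc += c
        if acc >= target and len(bounds) < cores - 1:
            bounds.append((start, i + 1))
            start, acc = i + 1, 0.0
    bounds.append((start, len(costs)))
    return bounds

# Profiled per-iteration cost: the second half of the space misses in cache.
costs = [1.0] * 100 + [4.0] * 100
print(partition(costs, 4))     # later chunks are shorter because they cost more
```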


Improving Communication-Phase Completion Times in HPC Clusters Through Congestion Mitigation

Vladimir Zdornov and Yitzhak Birk (Technion, IL)
In large computer clusters, e.g., most supercomputers, congestion arising from oversubscription of communication resources is an inherent problem. This paper presents novel mechanisms for adaptive routing and rate control, and studies them under performance goals specific to cluster networks. Our adaptive routing uses virtual circuits to guarantee in-order packet delivery. The path setup process is performed by switches locally, and moreover avoids blocking and backtracking. For random permutations in a slightly enriched fat-tree topology, maximum contention is reduced by up to 50%. Unlike most current rate control schemes such as those used in TCP and in InfiniBand, our proposed rate control scheme relies on explicit rate calculation. For a single phase-based application, we then present distributed rate-calculation algorithms that minimize the duration of the communication phase. Simulation results for synthetic benchmarks moreover establish that the calculated rates can be implemented even with very small buffers, and that joint application of the two schemes reduces communication time by up to 50%.


Hardware-less Testing for RAS Software

Aviad Zlotnick and Orna Raz (IBM, IL)
RAS (Reliability, Availability and Serviceability) software has to deal with hardware-related processes that typically include manual operations such as the replacement of components. The necessity to perform manual operations inhibits automated tests, reduces the scope of unit testing, and makes it challenging to create a regression test suite for RAS. We describe our experience in creating, deploying, using, and maintaining a small-scale simulation system for testing the RAS subsystem of an enterprise storage controller. By replacing physical operations with logical commands, this small-scale simulation system enabled early release of code related to new hardware features, and the creation of an automatic regression test suite.


IP Mobility to Support Live Migration of Virtual Machines Across Subnets
Ezra Silvera, Dean Lorenz, Gilad Sharaby and Inbar Shapira (IBM, IL)
User-transparent live migration is one of the most interesting features of Virtual Machine (VM) environments. Current live-migration technologies require that the VM retains its IP network address and are therefore typically restricted to movement within an IP subnet. The growing number of portable computing devices has led to the development of IP mobility solutions that enable uninterrupted network connectivity while moving between different IP subnets. In this paper we study the application of current network mobility approaches to VM cross-subnet live migration. We show that although the core problems are similar, there are significant differences between these domains, in terms of both assumptions and requirements. We present a specific solution for live migration of a VM across IP subnets, and introduce a new framework for synchronizing migration and network configuration, which allows better optimization for different scenarios of live migration.


Virtualization Security
Erez Berkner (Check Point, IL)
As server virtualization increases in popularity, the question of how to secure such environments becomes critical. In the presentation we will review the different security threats in virtualization environments, present Check Point’s solution for securing virtual networks, discuss the architecture of next-generation security applications in the virtualized world, and take an architectural deep dive into Check Point’s hypervisor-integrated security solution, which is currently under development.


Noninvasive Java Concurrency with Deuce STM
Guy Korland (TAU), Pascal Felber (U of Neuchatel) and Nir Shavit (TAU)
In the last few years multicore machines have become a commodity, making multithreaded programming a standard for harnessing these new processing capabilities.
One of the most fruitful research topics in this area is Software Transactional Memory (STM), a paradigm for programming concurrent applications using simple yet scalable mechanisms based on optimistic synchronization. Most of the research efforts in this area have focused on defining the right semantics and constructing the most efficient algorithms.
In this work we present a complete Java STM framework implementation, called Deuce, designed to be generally available as a platform for developing scalable concurrent applications and as a research tool for designing new STM algorithms.
Deuce provides several benefits over existing Java STM frameworks: it avoids any changes or additions to the JVM, does not require language extensions or intrusive APIs, and it does not impose any memory footprint or GC overhead. To support legacy libraries, Deuce dynamically instruments classes at load time and uses an original "field-based" locking strategy to improve concurrency.
Deuce also provides a simple internal API allowing different STMs algorithms to be plugged in. We show empirical results that highlight the scalability of our framework running benchmarks with hundreds of concurrent threads.


Self-* Programming: Run-Time Parallel Control Search for Reflection Box
Olga Brukman and Shlomi Dolev (BGU)
Real life situations may require an automatic fast update of the control of a plant, whether the plant is an airplane that needs to overcome an emergency situation due to drastic environment change, or a process that needs to continue executing an application in spite of a change in operating system behavior. Settings for run-time control search are defined, assuming the environment may be totally dynamic, but is reentrant and history oblivious for long enough periods. On-line experiments on the environment assist in the implementation of unrealizable specifications given the behavior restriction of the current environment. A successful search for a control implicitly identifies the weakly realizable specifications, and explicitly identifies the implementation that respects the specifications. Various settings and capabilities of the plant are investigated. In particular, (i) plant state reflection that allows observation of the current state of the plant, (ii) plant state set that generalizes the reset capability, allowing setting the plant to each of its states, and (iii) (static or dynamic) plant replication that allows instantiation of plant replicas or use of pre-existing plant replicas for parallelizing testing algorithms. The presented algorithms show that the above capabilities enable a polynomial search for a new control upon a drastic change of the environment.


Low-Overhead Error Detection for Networks-on-Chip
Amit Berman and Idit Keidar (Technion)
In the current deep sub-micron age, interconnect reliability is a subject of major concern, and is crucial for a successful product. Coding is a widely-used method to achieve communication reliability, which can be very useful in a Network-on-Chip (NoC). A key challenge for NoC error detection is to provide a defined detection level, while minimizing the number of redundant parity bits, using small encoder and decoder circuits, and ensuring shortest path routing.
We present Parity Routing (PaR), a novel method to reduce the number of redundant bits transmitted. PaR exploits NoC path diversity to reduce the number of redundant parity bits. Our analysis shows that, for example, on a 4x4 NoC with a demand of one parity bit, PaR reduces the redundant information transmitted by 75%, and the savings increase asymptotically to 100% with the size of the NoC. Furthermore, PaR utilizes low complexity, small-area circuits.


Distributed Clustering for Robust Aggregation in Large Networks

Ittay Eyal, Idit Keidar and Raphael Rom (Technion)
Large-scale networked services are now being increasingly deployed in computation clouds and Internet-based overlay networks. Such networks need constant monitoring for management purposes and in order to detect failures and other anomalous situations, e.g., poor load balance. In years to come, we can also expect to see sensor networks with thousands of light-weight nodes monitoring conditions like seismic activity or temperature. The sizes of these networks, together with bandwidth limitations, prohibit a centralized solution in which the monitored data is accumulated at a single location. Instead, there is a need for more succinct data aggregation.
Most existing solutions are restricted to aggregating scalar values such as means and sums. However, it is often necessary to remove some erroneous samples. Consider for example a temperature monitoring sensor network in a datacenter where a single erroneous data read of 1000 degrees centigrade may outweigh thousands of correct ones. Other problems require learning a more detailed picture of the data distribution. For example, a DDoS attack on machines in a grid computer network like PlanetLab may cause load probes to show that one third of the computers are working at full load while the rest are idle. Knowing the average load of 33% is useless. Instead, we would like to learn that there are two clusters of load values, with averages of 1% and 99%.
We present a robust protocol that aggregates the distribution of samples by describing them as data clusters. This enables both the removal of data errors and the aggregation of elaborate data. We use a gossip communication scheme to achieve a fast yet robust dissemination.
Initial simulations show promising results. Run with a complete topology of n nodes, convergence time is O(1) for some inputs, and at most O(log n). Elaborate data is efficiently summarized and aggregated. Node crashes do not impair the result and outliers are removed successfully, providing a significantly improved accuracy compared to previous protocols.
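A toy illustration of the aggregation step (one-dimensional values, invented numbers): each node keeps a handful of weighted cluster summaries, and a gossip exchange merges two nodes' summaries by repeatedly joining the closest centroids, so the bimodal load example above survives aggregation instead of collapsing into a useless mean.

```python
def merge(summaries_a, summaries_b, max_clusters=2):
    """Merge two nodes' (centroid, weight) summaries, keeping max_clusters."""
    clusters = summaries_a + summaries_b
    while len(clusters) > max_clusters:
        # Join the two closest centroids into one weight-averaged cluster.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: abs(clusters[p[0]][0] - clusters[p[1]][0]))
        (c1, w1), (c2, w2) = clusters[i], clusters[j]
        merged = ((c1 * w1 + c2 * w2) / (w1 + w2), w1 + w2)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Two nodes gossiping about the bimodal load example above (idle vs. saturated):
print(merge([(0.01, 3), (0.99, 1)], [(0.02, 2), (0.98, 4)]))
```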


Many-Core vs. Many-Thread Machines:  Stay Away From the Valley
Zvika Guz (Technion), Evgeny Bolotin (Intel), Idit Keidar, Avinoam Kolodny (Technion), Avi Mendelson (Microsoft) and Uri Weiser (Technion)
This work studies the tradeoffs between Many-Core (MC) machines like Intel's Larrabee and Many-Thread (MT) machines like Nvidia and AMD GPGPUs. These two paradigms represent two separate paths along which current processors progress: on one end of the spectrum, MC machines continue the natural line of progress from uni-processors to multi-cores, using large caches to mask the latency of memory access and reduce off-die bandwidth. At the other end, MT machines, which evolved from graphics processors, usually do not employ caches but instead use thread-level parallelism to mask memory latency. These machines rely on a very high thread count and are able to switch to other threads when some are stalled, waiting for memory.
To date, the tradeoffs (and even the boundaries) between the two approaches are not well formalized. In this work, we take a step towards understanding the tradeoffs between MC and MT machines, and the domains where each is more appropriate. To this end, we define a simple unified model describing a superposition of the two architectures. The model captures an imaginary hybrid machine, comprised of many simple processing elements and a large shared cache. When instantiated with a relatively modest number of independent threads (say, up to a few hundred), the model approximates MC machines, where the cache is large enough to cater to all threads. With a very large number of independent threads (in the thousands), the same model more closely describes an MT machine, since the cache is no longer effective and the memory access latency is masked by the increased thread-level parallelism.
We use the unified model to identify operation zones for which each machine is more suitable. Moreover, we find that while both MC and MT machines may shine when given a suitable workload, there is an intermediate zone in which both machines deliver inferior performance: in this performance valley, both MC and MT machines perform poorly compared to their achievable peak. We discuss how several key characteristics of both the workload and the hardware impact performance, and study the shape of the performance valley. Finally, we present insights on how processor designers can stay away from the valley.
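A toy analytical model in the spirit of the unified model (all constants are invented) that reproduces the qualitative shape: two good operating regimes with a valley in between.

```python
def throughput(threads, cache_lines=100_000, working_set=4_000,
               hit_time=1.0, miss_time=200.0):
    """Aggregate throughput when all threads share one fixed-size cache."""
    hit_rate = min(1.0, cache_lines / (threads * working_set))
    avg_mem = hit_rate * hit_time + (1.0 - hit_rate) * miss_time
    return threads / avg_mem          # miss latency hidden by running more threads

for t in (8, 32, 128, 512, 4096):
    print(t, round(throughput(t), 2))
# Few threads: the cache covers every working set (MC regime). Thousands of
# threads: parallelism hides the misses (MT regime). In between lies the dip.
```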


qmCover - Code Coverage Analysis and Reporting
Dacian Tudor (UPT)
qmCover is an easy-to-use solution which addresses code coverage acquisition, analysis, and reporting. qmCover targets gcc-compiled projects and requires the gcov tool, which is part of the GNU tool-chain. Being architecture independent, qmCover supports code coverage analysis on a large spectrum of platforms (x86, MIPS, ARM, etc.). Last but not least, by overriding the glibc library, qmCover supports coverage data collection from closed embedded systems via platform-specific transport layers (trace, JTAG).


Integrated Functional Solutions for Multi-core Programming
Nathaniel Azuelos (Technion), Warren Gross and Zeljko Zilic (McGill)
Recent efforts in microprocessor development tend toward the coexistence of several Central Processing Units (CPUs) on a single chip. The Cell Broadband Engine (CBE) integrates IBM's legacy PowerPC CPU with a new set of simple cores, all of which communicate through a high-speed bus. The multiple cores on the CBE allow users to exploit the parallel nature of their programs. However, it is often difficult to efficiently extract the parallelism from an application and to distribute tasks in a suitable fashion. We introduce a dataflow approach to CBE computing where the compiler is in charge of task partitioning and of the infrastructure for runtime distribution of tasks. We then elaborate on other areas where this paradigm may be applicable.
We thus present the NCC programming language, the Squid compiler, and its runtime environment. NCC is a strict functional dataflow language that forces explicit variable dependencies in order to exploit parallelism in the application. NCC code is thus written by the user without specifying parallelism explicitly. The Squid compiler draws a virtual data flow graph from the NCC source. This graph is then partitioned, according to implementation-specific criteria, into tasks and supertasks. In our CBE-oriented implementation, the individual tasks are then translated to ANSI C, and supertasks are analyzed and transformed into scheduling structures. The Squid Runtime Environment (SRE) interacts with the generated scheduler to order task execution, run supertask scheduling, and manage garbage collection. The SRE runs on the CBE's PowerPC core as a separate thread to implement a host-device paradigm, and as resident code on the simple cores.


Capacity Planning with Flash Drives

Elad Shmidov, Hadar Mor, Adi Littman and Tal Ron (BGU)
Recently, flash drives have entered the arena of enterprise storage systems. There have been several studies that evaluate the use of flash drives; however, these studies require storage system traces, which are very rare. We will present a software project which uses very common counter data, produced by generic performance monitoring tools, to analyze and suggest configurations involving SATA, SCSI, and flash drives. The software uses an analytic model to assess performance, allowing users to make storage purchasing and management decisions based on historical usage data from their own sites.
In addition, as secondary applications, the software takes advantage of the same data to suggest power savings in remotely mirrored systems, to cluster and organize LUNs and to assess the performance aspects of using thin provisioning. We will present the basic design of the software project, which is currently being implemented.


Implementing Generalization Sets
Moshe Weinstock and Guy Wiener (BGU)
Generalization Sets are UML's way of specifying parallel states of an object. For example, a clock can be a wrist-watch or a tower clock, while at the same time having a digital or analog display. It can also be an alarm or a non-alarm clock. A digital-display alarm wrist-watch is a legal instance of this specification, as is a tower clock with an analog display and no alarm settings.
The problem: most software engineering literature does not offer design patterns that map gensets to code in object-oriented programming languages. A noteworthy exception is "Object-Oriented Analysis and Design" by Martin & Odell, which offers a few patterns for genset implementation. Currently, there are no automatic tools that map gensets to code.
Our proposed solution: a search-based agent that uses the patterns from Martin & Odell to generate design code for gensets. Using search techniques allows us to adhere to certain preferences and best practices through the use of custom evaluation functions in the search itself, while making stateful calculations over the design model simpler. Deciding on implementations for the types in the gensets is in fact a grounded search problem: at each step in the search, a type and an implementation for it are chosen.


Authenticity and Provenance in Long Term Digital Preservation: Modeling and Implementation in Preservation Aware Storage

Michael Factor, Ealan Henis, Dalit Naor, Simona Rabinovici-Cohen, Petra Reshef, Shahar Ronen (IBM, IL), Giovanni Michetti, Maria Guercio (University of Urbino, IT)
A growing number of digital objects are designated for long-term preservation - a time scale during which technologies, formats and communities are very likely to change. Specialized approaches, models and technologies are needed to guarantee the long-term understandability of the preserved data. Maintaining the authenticity (trustworthiness) and provenance (history of creation, ownership, accesses and changes) of the preserved objects for the long term is of great importance, since users must be confident that the objects in the changed environment are authentic. We present a novel model for managing authenticity in long-term digital preservation systems and a supporting archival storage component. The model and archival storage build on OAIS, the leading standard in the area of long-term digital preservation. The preservation-aware storage layer handles provenance data and documents the relevant events. It collocates provenance data (and other metadata) together with the preserved data in a secure environment, thus enhancing the chances of their co-survival. Handling authenticity and provenance at the storage layer reduces both threats to authenticity and computation times. This work addresses core issues in long-term digital preservation in a novel and practical manner. We present an example of managing authenticity of data objects during data transformation at the storage component.

Proceedings Publication

In Cooperation With

ACM

The proceedings of the conference will be published as a volume in the ACM International Conference Proceedings Series made available through the ACM Digital Library.

Sponsors






