Research Interests

Web-scale computing

Yuehai Xu, Eitan Frachtenberg, Song Jiang and Mike Paleczny. Characterizing Facebook's Memcached Workload. In IEEE Internet Computing 18 (4) : 41–49, March 2014.
Abstract
Key-value caches play an important role in the infrastructure of many Web sites. Analyzing their workloads can lead not only to a better understanding of how these unique tools are used in production, but can also lead to better design decisions in future implementations. Abstract This paper analyzes the workload of Memcached at Facebook, one of the world's largest key-value deployments. We look at server-side performance, request composition, caching efficacy, and key locality. Our observations lead to several design insights and new research directions for key value caches, such as the relative inadequacy of the least-recently-used replacement policy.
Preliminary version (PDF 310KB) Reviews (Text 15KB) Source (22836KB) BibTeX

Maayan Zhitomirsky-Geffet, Dror Feitelson, Eitan Frachtenberg and Yair Wiseman. A Unified Strategy for Search And Result Representation for an Online Bibliographical Catalog. In Online Information Review 33 (3) : 511–536, 2009.
Abstract
One of the biggest concerns of modern information retrieval systems is reducing the user effort required for manual traversal and filtering of long matching document lists. Thus, the first goal of this research is to propose an improved scheme for representation of search results. Further, it aims to explore the impact of various user information needs on the searching process with the aim of finding a unified searching approach well suited for different query types and retrieval tasks.
Preprint (PDF 434KB) Reviews (Text 1256KB) Source (MS-Word 914KB) BibTeX

Eitan Frachtenberg. Reducing Query Latencies in Web Search Using Fine-Grained Parallelism. In World Wide Web 12 (4) : 441–460, December 2009.
Abstract
Semantic Web search is a new application of recent advances in information retrieval (IR), natural language processing, artificial intelligence, and other fields. Our group (Powerset) develops a semantic search engine that aims to answer queries not only by matching keywords, but by actually matching meaning in queries to meaning in Web documents. Compared to typical keyword search, semantic search can pose additional engineering challenges for the back-end and infrastructure designs. Of these, the main challenge addressed in this paper is how to lower query latencies to acceptable, interactive levels. Index-based semantic search can include numerous synonyms, hypernyms, multiple linguistic readings, and other semantic information, both on queries and in the index. In addition, some of the algorithms can be super-linear, such as matching co-references across a document. Consequently, many semantic queries can run significantly slower than the same keyword query. Users, however, have grown to expect Web search engines to provide near-instantaneous results, and a slow search engine could be deemed unusable even if it provides highly relevant results. It is therefore imperative for any search engine to meet its users' interactivity expectations, or risk losing them. Our approach to tackle this challenge to exploit data parallelism in slow search queries to reduce their latency in multi-core systems. Although all search engines are designed to exploit parallelism, at the single-node level this usually translates to throughput-oriented task parallelism. This paper focuses on the engineering of two latency-oriented approaches (coarse- and fine-grained) and compares them to the task-parallel approach. We evaluate on Powerset's deployed search engine the various factors that affect parallel performance: workload, overhead, load balancing, and resource contention. We also discuss heuristics to selectively control the degree of parallelism and resulting overhead on a query-by-query level. Our experimental results show that using fine-grained parallelism with these dynamic heuristics can significantly reduce query latencies compared to fixed, coarse-granularity parallelization schemes. Although these results were obtained on, and optimized for, Powerset's semantic search, they can be readily generalized to a wide class of inverted-index search engines.
Preprint (PDF 273KB) Reviews (Text) Source (Compressed tar 163KB) BibTeX

Maayan Zhitomirsky-Geffet, Eitan Frachtenberg, Yair Wiseman and Dror Feitelson. Seeking Optimal Search Strategy and Result Representation in BoW, 2008.
Abstract
One of the biggest concerns of modern information retrieval systems is reducing the user effort required for manual traversal and ﬁltering of long matching document lists. In this paper we propose an alternative approach for compact and concise representation of search results, which we implemented in the BoW on-line bibliographical repository. The BoW repository is based on an hierarchical concept index to which entries are linked. The key idea is that searching in the hierarchical repository should take advantage of the repository structure and return matching topics from the hierarchy, rather than just a long list of entries. Likewise, when new entries are inserted, a search for relevant topics to which they should be linked is required. Therefore, a similar hierarchical scheme for query-topic matching can be applied for both tasks. However, our experiments show that different query types used for these tasks are best treated by different topic ranking functions. For example, keyword search which is typically based on short (1-3 word) queries requires a weight-based (rather than Boolean) ranking approach. The underlying rationale of weight-based ranking is that for a truly relevant topic all (or almost all) the query terms should appear in its vector representation and with approximately even high weights. Applying this reasoning to the topic ranking method is shown to signiﬁcantly increase the precision and the F1 (by over 30%) for short keyword queries compared to the baseline Boolean ranking metric.
BibTeX

Dror G. Feitelson, Eitan Frachtenberg and Kent L. Beck. Development and Deployment at Facebook. In IEEE Internet Computing 17 (4) : 8–16, July 2013.
Abstract
More than one billion users log in to Facebook at least once a month to connect and share content with each other. Among other activities, these users upload over 2.5 billion content items every day. In this article we describe the development and deployment of the software that supports all this activity, focusing on the site's primary codebase for the Web front-end.
Preliminary version (PDF 131KB) Reviews (Text 22KB) Source (52208KB) BibTeX

Caden Corontzos and Eitan Frachtenberg. Direct-Coding DNA with Multilevel Parallelism. In IEEE Computer Architecture Letters 2 (2) : 81–90, January 2024. Github repo with sources.
Preprint (PDF 229KB) Reviews (Text) Source (167594KB) BibTeX

Rolando P. Hong-Enriquez, Rosa Badia, Barbara Chapman, Kirk Bresniker, Aditya Dhakal, Eitan Frachtenberg, Ninad Hogade, Gourav Rattihalli, Pedro Bruel, Alok Mishra and Dejan Milojicic. Estimating energy-efficiency in quantum optimization algorithms, In Cray User Group (CUG'23), May 2023.
Abstract
Since the dawn of Quantum Computing (QC), theoretical developments like Shor’s algorithm, proved the conceptual superiority of QC over traditional computing. However, such quantum supremacy claims are difficult to achieve in practice due to the technical challenges of realizing noiseless qubits. In the near future, QC applications will need to rely on noisy quantum devices that offload part of their work to classical devices. A way to achieve this is by using Parameterized Quantum Circuits (PQCs) in optimization or even machine learning tasks. The energy consumption of quantum algorithms has been poorly studied. Here we explore several optimization algorithms using both, theoretical insights and numerical experiments, to understand their impact on energy consumption. Specifically, we highlight why and how algorithms like Quantum Natural Gradient Descent, Simultaneous Perturbation Stochastic Approximations or Circuit Learning methods, are at least 2× to 4× more energy efficient than their classical counterparts. Why Feedback-Based Quantum Optimization is energy-inefficient and how a technique like Rosalin, could boost the energy-efficiency of other algorithms by a factor of ≥ 20×
Source (ZIP 2467KB) BibTeX

Eitan Frachtenberg. High Efficiency at Web Scale. In The 9th ACM International Conference on Autonomic Computing ICAC'12, San Jose, CA, September 2012. Keynote.
Abstract
Every day, over half a billion people log in to Facebook to communicate with their contacts. They exchange more than 300 million photos and more than 3 billion likes and comments each day. And almost every day, Facebook releases new code with new features and products to all these users. This staggering amount of information and processing is served from dozens of clusters in four geographical regions. The keys to operating successfully at this almost incomprehensibly large scale are efficiency and automation. Efficiency starts at the hundreds Facebook engineers and the processes they use to develop, test, and deploy code; it continues with scalable models of distributing and constantly monitoring the software on tens of thousands of servers on a daily basis; and ends at the very hardware and datacenters that serves this data, bringing capital and operational expenditures down to make the economic model viable. Automation is the leverage behind each of these relatively few engineers. It lets them focus on quick iteration and experimentation, catching problems early and solving many automatically. This talk will describe the challenges of developing and operating a product that serves a significant percentage of the worldwide internet population. Through several examples, we will see how efficiency and automation drive and enable operation at Web scale.
Source (85285KB)

Eitan Frachtenberg. Holistic Datacenter Design in the Open Compute Project. In IEEE Computer 45 (7) : 83–85, July 2012.
Abstract
Facebook's Open Compute Project lets the community benefit from and contribute to improvements in power and water usage effectiveness, cost, and operation.
Preliminary version (PDF 305KB) Source (1734KB) BibTeX

Mateusz Berezecki, Eitan Frachtenberg, Mike Paleczny and Kenneth Steele. Power and Performance Evaluation of Memcached on the TILEPro64 Architecture. In Sustainable Computing: Informatics and Systems 2 (2) : 81–90, June 2012.
Abstract
Power consumption of data centers had become an important factor in the economy and sustainability of large-scale Web services. Researchers and practitioners are spending considerable effort to characterize Web-scale workloads and evaluating their applicability to alternative, more power-efficient architectures. One such workload in particular is the caching layer, which stores expensive-to-regenerate data in fast storage to reduce service times. In this paper we look at one such application, the Memcached key-value store, which is widely deployed at Facebook and other Web services, and one alternative architecture, the TILEPro64 many-core system-on-chip. We explore the performance and power characteristics of Memcached under a variety of workload variations, implementation choices, and communication protocol, and compare them to a traditional implementation on x86-based servers. Our results suggest that the TILEPro64 architecture can significantly outperform x86-based architectures in terms of throughput per Watt for the evaluated version of Memcached.
Preprint (PDF 829KB) Reviews (Text) Source (9460KB) BibTeX

Eitan Frachtenberg, Dan Lee, Marco Magarelli, Veerendra Mulay and Jay Park. Thermal Design in the Open Compute Datacenter, In 13th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems ITHERM'12, May 2012.
Abstract
The advent of Web-based services and cloud computing has instigated an explosive growth in demand for datacenters. Traditionally, Internet companies would lease datacenter space and servers from vendors that often emphasize flexibility over efficiency. But as these companies grew larger, they sought to reduce acquisition and operation costs by building their own datacenters. Facebook reached this stage earlier in 2011 when it unveiled its first customized datacenter in Prineville, Oregon. In designing this datacenter, Facebook took a blank-slate approach where all aspects were rethought for maximum efficiency. Although the resulting datacenter is optimized for Facebook's workload, it is general enough to be appeal to a wide variety of applications. This paper describes our choices and innovations in the thermal design of the datacenter building, which employs 100% outside-air economization. The efficiency of this design is manifest in an average infrastructure energy use reduction of 86% compared to leased space, and an overall energy use reduction of 29%. This reduction in turn translates to a power usage efficiency of 1.08, measured over the summer of 2011.
Paper (PDF 15143KB) Reviews (Text) Source (14200KB) BibTeX

Eitan Frachtenberg. Design Principles in the Open Compute Project, In IEEE Optical Fiber Communication Conference OFC/NFOEC, March 2012.
Abstract
The Open Compute Project aims to capture the best principles in datacenter design and open them for third-party implementation and discussion. This paper summarizes them in the areas of electrical, thermal, building, and server design.
Paper (PDF 480KB) Source (29927KB) BibTeX

Steven Hart, Eitan Frachtenberg and Mateusz Berezecki. Predicting Memcache Throughput using Simulation and Modeling, In SCS/IEEE Symposium on Theory of Modeling and Simulation TMS'12, March 2012.
Abstract
The current work introduces a method for predicting Memcached throughput on single-core and multi-core processors. The method is based on traces collected from a full system simulator running Memcached. A series of microarchtectural simulators consume these traces and the results are used to produce a CPI model composed of a baseline issue rate, cache miss rates, and branch mispredictions rate. Simple queueing models are used to produce througput predictions with accuracy in the range of 8% to 17%.
Paper (PDF 160KB) Reviews (Text) Source (648KB) BibTeX

Eitan Frachtenberg, Ali Heydari, Harry Li, Amir Michael, Jacob Na, Avery Nisbet and Pierluigi Sarti. High-Efficiency Server Design, In IEEE/ACM Conference on Supercomputing SC'11, November 2011.
Abstract
Large-scale datacenters consume megawatts in power and cost hundreds of millions of dollars to equip. Reducing the energy and cost footprint of servers can therefore have substantial impact. Web, Grid, and cloud servers in particular can be hard to optimize, since they are expected to operate under a wide range of workloads. For our upcoming datacenter, we set out to significantly improve its power efficiency, cost, reliability, serviceability, and environmental footprint. To this end, we redesigned many dimensions of the datacenter and servers in conjunction. This paper focuses on our new server design, combining aspects of power, motherboard, thermal, and mechanical design. We calculate and confirm experimentally that our custom-designed servers can reduce power consumption across the entire load spectrum while at the same time lower acquisition and maintenance costs. Importantly, our design does not decrease the servers' performance or portability, which would otherwise limit its applicability.
Paper (PDF 1822KB) Reviews (PDF 94KB) Source (Compressed tar 35386KB) BibTeX

Mateusz Berezecki, Eitan Frachtenberg, Mike Paleczny and Kenneth Steele. Many-Core Key-Value Store, In Second International Green Computing Conference IGCC'11, July 2011.
Abstract
Scaling data centers to handle task-parallel workloads requires balancing the cost of hardware, operations, and power. Low-power, low-core-count servers reduce costs in one of these dimensions, but may require additional nodes to provide the required quality of service or increase costs by underutilizing memory and other resources. We show that the throughput, response time, and power consumption of a high-core-count processor operating at a low clock rate and very low power consumption can perform well when compared to a platform using faster but fewer commodity cores. Specific measurements are made for a key-value store, Memcached, using a variety of systems based on three different processors: the 4-core Intel Xeon L5520, 8-core AMD Opteron 6128 HE, and 64-core Tilera TILEPro64.
Paper (PDF 652KB) Reviews (Text) Source (Compressed tar 6324KB) BibTeX

Anand Madhavan and Eitan Frachtenberg. Combinatorial Set Matching using GPUs. In NVision'08, San Jose, CA, August 2008.
Poster image (PDF 565KB) Source (PDF 919KB) BibTeX

Ram Srinivasan, Eitan Frachtenberg, Olaf Lubeck, Scott Pakin and Jeanine Cook. An Idealistic Neuro-PPm Branch Prediction. In Journal of Instruction-Level Parallelism 9, May 2007.
Abstract
Historically, Markovian predictors have been very successful in predicting branch outcomes. In this work we propose a hybrid scheme that employs two Prediction by Partial Matching (PPM) Markovian predictors, one that predicts based on local branch histories and one based on global branch histories. The two independent predictions are combined using a neural network. On the CBP-2 traces the proposed scheme achieves over twice the prediction accuracy of the gshare predictor.
Preprint (PDF 449KB) Reviews (Text 610KB) Source (Compressed tar 617KB) BibTeX

Ram Srinivasan, Eitan Frachtenberg, Olaf Lubeck, Scott Pakin and Jeanine Cook. Neuro-PPm Branch Prediction, In The Second JILP Championship Branch Prediction Competition CBP'02, In conjunction with The 39th Annual IEEE/ACM International Symposium on Microarchitecture Micro-39, December 2006. Competition Finalist.
Abstract
Historically, Markovian predictors have been very successful in predicting branch outcomes. In this work we propose a hybrid scheme that employs two Prediction by Partial Matching (PPM) Markovian predictors, one that predicts based on local branch histories and one based on global branch histories. The two independent predictors are combined using a neural network. On the CBP-2 traces the proposed scheme acheives over twice the prediction accuracy of the gshare predictor.
Paper (PDF 205KB) Reviews (Text) Source (Compressed tar 428KB) BibTeX

Viyom Mittal, Pedro Bruel, Dejan Milojicic and Eitan Frachtenberg. Adaptive Stopping Rule for Performance Measurements, In 14th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS'23), November 2023.
Abstract
Performance variability in complex computer systems is a major challenge for accurate benchmarking and per- formance characterization, especially for tightly coupled large-scale high-performance computing systems. Point summaries of performance may be both uninformative, if they do not capture the full richness of its behavior, and inaccurate, if they are derived from an inadequate sam- ple set of measurements. Determining the correct sample set—and in particular, its size—requires balancing trade- offs of computation, methodology, and statistical power. In this paper, we treat the performance distribution as the primary target of the performance evaluation, from which all other metrics can be derived. We propose a meta-heuristic that characterizes the performance distri- bution as it is being measured, dynamically determining when enough samples have been collected to approximate the true distribution. Compared to predetermined fixed stopping criteria, this dynamic and adaptive method can be more efficient in resource use, since it can stop as early as the desired certainty level is obtained, and more accurate, since it does not stop prematurely. Importantly, it requires no advance knowledge or assumptions about the system under test or its performance characteristics. We evaluate a prototype of our proposal using a mix of synthetic and real benchmarks. For synthetic distribu- tions, this approach closely matches the true distribution. For actual benchmarks, the heuristic is overly conser- vative for some applications and overly lax for others, especially those using GPUs. But it still matches the overall shape of the distribution for benchmarks with very diverse distributions, which suggests that it is a viable approach for an adaptive stopping rule.
Paper (PDF 748KB) Reviews (Text) Source (1833KB) BibTeX

Eytan Bakshy and Eitan Frachtenberg. Statistics and Optimal Design for Benchmarking Experiments Involving User Traffic, In 24th International World Wide Web Conference (WWW'15), May 2015.
Abstract
The successful development and deployment of large-scale Internet services depends critically on performance. Even small changes in processing time, bandwidth, and memory usage can translate directly into large financial and user experience costs. Despite the widespread use of traffic-based benchmarks, there is little research on how they should be run in order to obtain valid and precise inferences with minimal data collection costs. Correctly A/B testing Internet services can be surprisingly difficult because interdependencies between user requests (e.g., for search results, social media streams, ads) and hosts can lead to failures in estimating the significance and magnitude of performance differences. We develop multilevel models of Internet service performance that take in to account dependence due to user requests and hosts, and use them to design benchmarking routines that maximize precision subject to time and resource constraints. This design is then validated experimentally on a production system that is used to vet thousands of changes every day.
Paper (PDF 1729KB) Reviews (Text 13KB) Source (48351KB) BibTeX

Yuehai Xu, Eitan Frachtenberg and Song Jiang. Building a High-Performance Key-Value Cache as an Energy-Efficient Appliance, In 32nd International Symposium on Computer Performance, Modeling, Measurements, and Evaluation Performance'14, October 2014. Best Student Paper Award. Archival version in Performance Evaluation vol. 79: 24--37.
Abstract
Key-value (KV) stores have become a critical infrastructure component supporting various services in the cloud. Long considered an application that is memory-bound and network-bound, recent KV-store implementations on multicore servers grow increasingly CPU-bound instead. This limitation often leads to under-utilization of available bandwidth and poor energy efficiency, as well as long response times under heavy load. To address these issues, we present Hippos, a high-throughput, low-latency, and energy-efficient key-value store implementation. Hippos moves the KV store into the operating system's kernel and thus removes most of the overhead associated with the network stack and system calls. Hippos uses the Netfilter framework to quickly handle UDP packets, removing the overhead of UDP-based GET requests almost entirely. Combined with lock-free multithreaded data access, Hippos removes several performance bottlenecks both internal and external to the KV-store application. We prototyped Hippos as a Linux loadable kernel module and evaluated it against the ubiquitous Memcached using various micro-benchmarks and workloads from Facebook's production systems. The experiments show that Hippos provides some 20--200% throughput improvements on a 1Gbps network (up to 590% improvement on a 10Gbps network) and 5--20% saving of power compared with Memcached.
Paper (PDF 403KB) Presentation (804KB) Reviews (Text) Source (95KB) BibTeX

Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang and Mike Paleczny. Workload Analysis of a Large-Scale Key-Value Store, In 12th ACM/IFIP Joint Conference on Measurement and Modeling of Computer Systems SIGMETRICS/Performance'12, June 2012. 2022 SIGMETRICS Test-of-Time Award. Archival version in Performance Evaluation Review vol. 40(1): 53--64.
Abstract
Key-value stores are a vital component in many scale-out enterprises, including social networks, online retail, and risk analysis. Accordingly, they are receiving increased attention from the research community in an effort to improve their performance, scalability, reliability, cost, and power consumption. To be effective, such efforts require a detailed understanding of realistic key-value workloads. And yet little is known about these workloads outside of the companies that operate them. This paper aims to address this gap. To this end, we have collected detailed traces from Facebook's Memcached deployment, arguably the world's largest. The traces capture over 284 billion requests from five different Memcached use cases over several days. We analyze the workloads from multiple angles, including: request composition, size, and rate; cache efficacy; temporal patterns; and application use cases. We also propose a simple model of the most representative trace to enable the generation of more realistic synthetic workloads by the community. Our analysis details many characteristics of the caching workload. It also reveals a number of surprises: a GET/SET ratio of 30:1 that is higher than assumed in the literature; some applications of Memcached behave more like persistent storage than a cache; strong locality metrics, such as keys accessed many millions of times a day, do not always suffice for a high hit rate; and there is still room for efficiency and hit rate improvements in Memcached's implementation. Toward the last point, we make several suggestions that address the exposed deficiencies.
Paper (PDF 645KB) Presentation (1090KB) Reviews (PDF 17KB) Source (124736KB) BibTeX

Parallel Job Scheduling

Dr. Dror Feitelson from the CS dept. at Hebrew University brought me into the field of supercomputing, with his vast knowledge and meticulous research methods. In our work for my MSc thesis, we developed a new method of coscheduling for supercomputers to deal with load imbalances. This work has been extended to a PhD. In this work, we've shown ways to enhace large-scale systems by improving application performance, improve fault-tolerance, reduce system load, and improve resource utilization.
Related publications:

Phillip Raith, Gourav Rattihalli, Aditya Dhakal, Sai R. Chalamalasetti, Dejan Milojicic, Eitan Frachtenberg, Stefan Nastic and Schahram Dustdar. Opportunistic Energy-Aware Scheduling for Container Orchestration Platforms Using Graph Neural Networks, In 24th IEEE/ACM international Symposium on Cluster, Cloud and Internet Computing (CCGRID'24), May 2024.
Reviews (Text 11KB) BibTeX

Gourav Rattihalli, Ninad Hogade, Aditya Dhakal, Eitan Frachtenberg, Rolando P. Hong-Enriquez, Pedro Bruel, Alok Mishra and Dejan Milojicic. Fine-Grained Heterogeneous Execution Framework with Energy Aware Scheduling, In IEEE International Conference on Cloud Computing (CLOUD'23), July 2023.
Abstract
The growing convergence of high-performance, data analytics, and machine learning applications is increasingly pushing computing systems toward heterogeneous processors and specialized hardware accelerators. Hardware heterogeneity, in turn, leads to finer-grained workflows. State-of-the-art server- less computing resource managers do not currently provide efficient scheduling of such fine-grained tasks on systems with heterogeneous CPUs and specialized hardware accelerators (e.g., GPUs and FPGAs). Working with fine-grained tasks presents an opportunity for more efficient energy use via new scheduling models. Our proposed scheduler enables technologies like Nvidia’s Multi-Process Service (MPS) to pack multiple fine-grained tasks on GPUs efficiently. Its advantages include better co-location of jobs and better sharing of hardware resources such as GPUs that were not previously possible on container orchestration systems. We propose a Kubernetes-native energy-aware scheduler that integrates with our heterogeneous framework. Combining fine- grained resource scheduling on heterogeneous hardware and energy-aware scheduling results in up to 17.6% improvement in makespan, up to 20.16% reduction in energy consumption for CPU workloads, and up to 58.15% improvement in makespan, and up to 28.92% reduction in energy consumption for GPU workloads.
Reviews (Text 11KB) Source (ZIP 1579KB) BibTeX

Walfredo Cirne, Narayan Desai, Eitan Frachtenberg and Uwe Schwiegelshohn (ed.) Job Scheduling Strategies for Parallel Processing: 16th International Workshop, JSSPP 2012, Revised Selected Papers (Lecture Notes in Computer Science 7698), Springer-Verlag, January 2013 (ISBN: 978-3642358661). Available from Springer-Verlag and Amazon.
BibTeX

Walfredo Cirne and Eitan Frachtenberg. Web-Scale Job Scheduling, In Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science (7698), January 2013.
Abstract
Web datacenters and clusters can be larger than the world's largest supercomputers, and run workloads that are at least as heterogeneous and complex as their high-performance computing counterparts. And yet little is known about the unique job scheduling challenges of these environments. This article aims to ameliorate this situation. It discusses the challenges of running web infrastructure and describes several techniques to address them. It also presents some of the problems that remain open in the field.
Chapter text (PDF 629KB) Source (446KB) BibTeX

Eitan Frachtenberg and Uwe Schwiegelshohn (ed.) Job Scheduling Strategies for Parallel Processing: 15th International Workshop, JSSPP 2010, Revised Selected Papers (Lecture Notes in Computer Science 6253), Springer-Verlag, October 2010 (ISBN: 978-3642165047). Available from Springer-Verlag and Amazon.
BibTeX

Eitan Frachtenberg and Uwe Schwiegelshohn (ed.) Job Scheduling Strategies for Parallel Processing: 14th International Workshop, JSSPP 2009, Revised Selected Papers (Lecture Notes in Computer Science 5798), Springer-Verlag, October 2009 (ISBN: 978-3642046322). Available from Springer-Verlag and Amazon.
BibTeX

Eitan Frachtenberg and Uwe Schwiegelshohn (ed.) Job Scheduling Strategies for Parallel Processing: 13th International Workshop, JSSPP 2007, Revised Selected Papers (Lecture Notes in Computer Science 4942), Springer-Verlag, May 2008 (ISBN: 978-3540786986). Available from Springer-Verlag and Amazon.
BibTeX

Eitan Frachtenberg and Uwe Schwiegelshohn. New Challenges of Parallel Job Scheduling, In Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science (4942), May 2008.
Abstract
The workshop on job scheduling strategies for parallel processing (JSSPP) studies the myriad aspects of managing resources on parallel and distributed computers. These studies typically focus on large-scale computing environments, where allocation and management of computing resources present numerous challenges. Traditionally, such systems consisted of massively parallel supercomputers, or more recently, large clusters of commodity processor nodes. These systems are characterized by architectures that are largely homogeneous and workloads that are dominated by both computation and communication-intensive applications. Indeed, the large majority of the articles in the first ten JSSPP workshops dealt with such systems and addressed issues such as queuing systems and supercomputer workloads. In this paper, we discuss some of the recent developments in parallel computing technologies that depart from this traditional domain of problems. In particular, we identify several recent and influential technologies that could have a significant impact on the future of research on parallel scheduling. We discuss some of the more specific research challenges that these technologies introduce to the JSSPP community, and propose to enhance the scope of future JSSPP workshops to include these topics.
Chapter text (PDF 228KB) Source (247KB) BibTeX

Eitan Frachtenberg and Uwe Schwiegelshohn (ed.) Job Scheduling Strategies for Parallel Processing: 12th International Workshop, JSSPP 2006, Revised Selected Papers (Lecture Notes in Computer Science 4376), Springer-Verlag, March 2007 (ISBN: 978-3540710349). Available from Springer-Verlag and Amazon.
BibTeX

Eitan Frachtenberg and Yoav Etsion. Hardware Parallelism: Are Operating Systems Ready? (Case Studies in Mis-Scheduling), In Second Workshop on the Interaction between Operating Systems and Computer Architecture WIOSCA'06, In conjunction with ISCA-33, June 2006.
Abstract
Commodity parallel computers are no longer a technology predicted for some indistinct future: they are becoming ubiquitous. In the absence of significant advances in clock speed, chip-multiprocessors (CMPs) and symmetric multithreading (SMT) are the modern workhorses that keep Moore's Law still relevant. On the software side, we are starting to observe the adaptation of some codes to the new commodity parallel hardware. While in the past, only complex professional codes ran on parallel computers, the commoditization of parallel computers is opening the door for many desktop applications to benefit from parallelization. We expect this software trend to continue, since the only apparent way of obtaining additional performance from the hardware will be through parallelization. Based on the premise that the average desktop workload is growing more parallel and complex, this paper asks the question: Are current desktop operating systems appropriate for these trends? Specifically, we are interested in parallel process scheduling, which has been a topic of significant study in the supercomputing community, but so far little of this research has trickled down to the desktop. In this paper, we demonstrate, using several case studies, that contemporary general-purpose operating systems are inadequate for the emerging parallel desktop workloads. We suggest that schedulers designed with an understanding of the requirements of all process classes and their mixes, as well the abilities of the underlying architecture, might be the solution to this inadequacy.
Paper (PDF 173KB) Presentation materials (PDF 289KB) Reviews (Text 17KB) Source (Compressed tar 1826KB) BibTeX

Dror Feitelson, Eitan Frachtenberg, Larry Rudolph and Uwe Schwiegelshohn (ed.) Job Scheduling Strategies for Parallel Processing: 11th International Workshop, JSSPP 2005, Revised Selected Papers (Lecture Notes in Computer Science), Springer-Verlag, January 2006 (ISBN: 978-3540310242). Available from Springer-Verlag and Amazon.
BibTeX

Eitan Frachtenberg. Process Scheduling for the Parallel Desktop, In International Symposium on Parallel Architectures, Algorithms, and Networks I-SPAN'05, December 2005.
Abstract
Commodity hardware and software are growing increasingly more complex, with advances such as chip heterogeneity and specialization, deeper memory hierarchi es, fine-grained power management, and most importantly, chip parallelism. Similarly, workloads are growing more concurrent and diverse. With this new complexity in hardware and software, process scheduling in the operating system (OS) becomes more challenging. Nevertheless, most commodity OS schedulers are based on design principles that are 30 years old. This disparity may soon lead to significant performance degradation. Most significantly, parallel architectures such as multicore chips require more than scalable OSs: parallel programs require parallel-aware scheduling. This paper posits that imminent changes in hardware and software warrant reevaluating the scheduler's policies in the commodity OS. We discuss and demonstrate the main issues that the emerging parallel desktops are raising for the OS scheduler. We propose that a new approach to scheduling is required, applying and generalizing lessons from different domain-specific scheduling algorithms, and in particular, parallel job scheduling. Future architectures can also assist the OS by providing better information on process scheduling requirements.
Paper (PDF 255KB) Presentation (PowerPoint 2788KB) Reviews (Text) Source (Compressed tar 402KB) BibTeX

Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini and Juan Fernandez. Adaptive Parallel Job Scheduling with Flexible CoScheduling. In IEEE Transactions on Parallel and Distributed Systems 11 (16) : 1066–01077, November 2005.
Abstract
Many scientific and high-performance computing applications consist of multiple processes running on different processors that communicate frequently. Because of their synchronization needs, these applications can suffer severe performance penalties if their processes are not all coscheduled to run together. Two common approaches to coscheduling jobs are batch scheduling, wherein nodes are dedicated for the duration of the run, and gang scheduling, wherein time slicing is coordinated across processors. Both work well when jobs are load-balanced and make use of the entire parallel machine. However, these conditions are rarely met and most realistic workloads consequently suffer from both internal and external fragmentation, in which resources and processors are left idle because jobs cannot be packed with perfect efficiency. This situation leads to reduced utilization and suboptimal performance. Flexible CoScheduling (FCS) addresses this problem by monitoring each job's computation granularity and communication pattern and scheduling jobs based on their synchronization and load-balancing requirements. In particular, jobs that do not require stringent synchronization are identified, and are not coscheduled; instead, these processes are used to reduce fragmentation. FCS has been fully implemented on top of the STORM resource manager on a 256-processor Alpha cluster and compared to batch, gang, and implicit coscheduling algorithms. This paper describes in detail the implementation of FCS and its performance evaluation with a variety of workloads, including large-scale benchmarks, scientific applications, and dynamic workloads. The experimental results show that FCS saturates at higher loads than other algorithms (up to 54% higher in some cases), and displays lower response times and slowdown than the other algorithms in nearly all scenarios. Keywords: Cluster computing, load balancing, job scheduling, gang scheduling, parallel architectures, Flexible coscheduling
Preprint (PDF 312KB) Reviews (Text 20KB) Source (Compressed tar 791KB) BibTeX

Eitan Frachtenberg and Dror G. Feitelson. Pitfalls in Parallel Job Scheduling Evaluation, In 11th Workshop on Job Scheduling Strategies for Parallel Processing JSSPP'05, In conjunction with ICS-19, June 2005.
Abstract
There are many choices to make when evaluating the performance of a complex system. In the context of parallel job scheduling, one must decide what workload to use and what measurements to take. These decisions sometimes have subtle implications that are easy to overlook. In this paper we document numerous pitfalls one may fall into, with the hope of providing at least some help in avoiding them. Along the way, we also identify topics that could benefit from additional research. Keywords: parallel job scheduling, performance evaluation, experimental methodology, dynamic workload, static workload, simulation
Paper (PDF 142KB) Presentation materials (PowerPoint 2386KB) Reviews (Text 13KB) Source (Compressed tar 3709KB) BibTeX

Eitan Frachtenberg. Process Coordination for Commodity Systems, December 2004.
Abstract
Parallel and distributed processing are no longer the exclusive realm of supercomputers. The growing prevalence of systems with multiple processing units brings parallel hardware to commodity computers. Parallel hardware cannot be fully utilized unless running parallel software, which in turn depends on the operating system's ability to support various, often conflicting scheduling requirements. This proposal describes research toward achieving a fully flexible autonomous operating system (OS), that seamlessly supports the entire range of current and future applications: serial, multimedia, interactive, distributed, and parallel.
Report (PDF 240KB) Source (Compressed tar 15KB) BibTeX

Eitan Frachtenberg. Toward Realistic Evaluation of Job Scheduling Strategies. In Seminar talk, given at the Computer Science departments of Hebrew University (Jerusalem), Tel Aviv University (Tel Aviv), Technion, Israel Institute of Technology (Haifa), Interdisciplinary Center (Herzlia), and Ben Gurion University, (Beer Sheva), Israel, December 2004.
Abstract
Evaluating parallel job scheduling algorithms is a challenging task. Many factors affect the outcome of the evaluation, including workload and application choices, metrics, choice of scheduling algorithms and their parameters, and the hardware used or assumed. The large, non-continuous parameter space renders analytical evaluation extremely difficult, while simulation evaluations are very sensitive to the assumptions undertaken, sometimes resulting in contradicting results. Experimental evaluations are even rarer, due to the complexity of implementation, and the difficulty in obtaining a dedicated large machine for long periods of time. This talk will describe our efforts to evaluate various scheduling strategies in a dynamic, realistic environment. Of the various parameters that affect job scheduling performance, workload and implementation play a pivotal role. Most studies either employ simulations and/or simplistic workloads, which contain many assumptions, including unknown ones. Instead, we developed a scheduling framework that implements several existing and novel algorithms on various cluster architectures of up to hundreds of nodes. This framework was used to produce the first experimental evaluation of several job scheduling strategies in a dynamic workload environment, using synthetic and scientific MPI applications. This talk will discuss the challenges involved in evaluating job scheduling strategies, and the approaches we chose to address them. An analysis will be presented of three factors affecting scheduling systems running dynamic workloads: multiprogramming level, time quantum, and the use of backfilling for queue management -- and how they depend on offered load. Joint work with Dror Feitelson (Hebrew U.), Fabrizio Petrini (LANL), and Juan Fernandez (Murcia U.)
Presentation (PowerPoint 1934KB) Source (PowerPoint 1934KB)

Eitan Frachtenberg. Leveraging Modern Interconnects for Parallel System Software, Ph.D. dissertation. Hebrew University, Jerusalem, Israel. December 2003.
Abstract
The use of clusters of independent compute nodes as high capability and capacity computers is rapidly growing in industry, academia, and government. This growth is accompanied by fast-paced progress in cluster-aware hardware, and in particular in interconnection technology. Contemporary networks offer not only excellent performance as expressed by latency and bandwidth, but also advanced architectural features, such as programmable network interface cards, hardware support for collective communication operations, and support for modern communication protocols such as MPI and RDMA. The rapid progress in cluster hardware and usage is unfortunately not matched by similar progress in system software. This software consists of the middleware: the operating system, user libraries, and utilities that interface between the hardware and the user applications, allowing them to make use of the machine's resources. In fact, most of these clusters use common workstation operating systems such as Linux running on each of the cluster's nodes, with a collection of loosely-related libraries, utilities, and scripts to access the cluster's resources. Such solutions are hardly adequate for large-scale clusters and/or high-performance computing applications. The problems they cause include (but are not limited to): (1) poor performance and scalability of applications and system software; (2) reduced utilization of the machine due to suboptimal resource allocation; (3) reliability problems caused by the multitude of independent software modules, and the redundancy in their operation, and (4) difficulty in operating and making full use of these machines. The premise behind this dissertation is that system software can be dramatically improved in terms of performance, scalability, reliability, and simplicity by making use of the features offered by modern interconnects. Unlike single-node operating systems, most of a cluster's system software tasks involve efficient global synchronization of resources. As such, parallel system software can be designed to benefit from the novel hardware features offered by contemporary interconnection technology. This dissertation promotes the idea of treating a cluster's operating system as any other high-performance parallel application, and increasing its reliance on synchronization abilities while reducing its per-node complexity and redundancy. This dissertation makes the following primary contributions. First, a set of necessary network mechanisms to support this system software model is described. A prototype implementation of system software based on these mechanisms is then discussed. This system currently tackles three main aspects of parallel computers: resource management, communication libraries, and job scheduling methods. This model was implemented on three different cluster architectures. Extensive performance and scalability evaluations with real clusters and applications show significant improvements over previous work in all three areas. In particular, this research focuses primarily on job scheduling strategies, and demonstrates that through advanced algorithms, the system's throughput and responsiveness can be improved over a wide spectrum of workloads.
Thesis (PDF 800KB) Presentation (PDF 1193KB) Source (Compressed tar 1483KB) BibTeX

Eitan Frachtenberg, Dror G. Feitelson, Juan Fernandez and Fabrizio Petrini. Parallel Job Scheduling under Dynamic Workloads, In 9th Workshop on Job Scheduling Strategies for Parallel Processing JSSPP'03, In conjunction with HPDC-12 / GGF-8, June 2003.
Abstract
Jobs that run on parallel systems that use gang scheduling for multiprogramming may interact with each other in various ways. These interactions are affected by system parameters such as the level of multiprogramming and the scheduling time quantum. A careful evaluation is therefore required in order to find parameter values that lead to optimal performance. We perform a detailed performance evaluation of three factors affecting scheduling systems running dynamic workloads: multiprogramming level, time quantum, and the use of backfilling for queue management --- and how they depend on offered load. Our evaluation is based on synthetic MPI applications running on a real cluster that actually implements the various scheduling schemes. Our results demonstrate the importance of both components of the gang-scheduling plus backfilling combination: gang scheduling reduces response time and slowdown, and backfilling allows doing so with a limited multiprogramming level. This is further improved by using flexible coscheduling rather than strict gang scheduling, as this reduces the constraints and allows for a denser packing.
Paper (PDF 261KB) Presentation materials (PDF 404KB) Reviews (Text) Source (Compressed tar 245KB) BibTeX

Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini and Juan Fernandez. Flexible CoScheduling: Dealing with Load Imbalance and Heterogeneous Resources, In International Parallel and Distributed Processing Symposium IPDPS'03, April 2003. Best Paper Award.
Abstract
Fine-grained parallel applications require all their processes to run simultaneously on distinct processors to achieve good efficiency. This is typically accomplished by space slicing, wherein nodes are dedicated for the duration of the run, or by gang scheduling, wherein time slicing is coordinated across processors. Both schemes suffer from fragmentation, where processors are left idle because jobs cannot be packed with perfect efficiency. Obviously, this leads to reduced utilization and sub-optimal performance. Flexible coscheduling (FCS) solves this problem by monitoring each job's granularity and communication activity, and using gang scheduling only for those jobs that require it. Processes from other jobs, which can be scheduled without any constraints, are used as filler to reduce fragmentation. In addition, inefficiencies due to load imbalance and hardware heterogeneity are also reduced because the classification is done on a per-process basis. FCS has been fully implemented as part of the STORM resource manager, and shown to be competitive with gang scheduling and implicit coscheduling.
Paper (PDF 127KB) Presentation (PDF 634KB) Reviews (Text) Source (Compressed tar 1557KB) BibTeX

Juan Fernandez, Fabrizio Petrini and Eitan Frachtenberg. Buffered Coscheduled (BCS) MPI. In The conference on High-Speed Computing, Glenden Beach, OR, April 2003. Invited work.
Poster image (PDF 355KB) BibTeX

Eitan Frachtenberg. Flexible Coscheduling, M.Sc. thesis. Hebrew University, Jerusalem, Israel. December 2001.
Abstract
In this thesis a novel technique is introduced for job scheduling in clusters and supercomputers with the goal of increasing the efficiency and utilization of these machines. In particular, the problems arising from heterogeneous architecture clusters and software load imbalances are addressed. The suggested technique is a variation on gang scheduling and other coscheduling methods, where several parallel jobs time-share and space-share the same machine, using varying degrees of coordination among processes. The main idea behind this thesis is that a distributed/parallel scheduling system can gather dynamic information on the synchronization behavior of processes, and use this information to identify their different coscheduling needs. Using this information, a scheduler can make better scheduling decisions, to increase the overall system utilization and decrease the runtime of applications in a multiprogramming environment. The contribution of this thesis is threefold: (1) addressing the problems that heterogeneous architectures and load imbalances pose to coscheduling systems; (2) a methodological system of gathering job communication information and subsequent process classification for the making of better scheduling choices; and (3) experimental results that verify the usefulness of applying dynamic communication statistics to scheduling decisions. In addition, this work includes the implementation of an efficient and flexible scheduler, with the ability to use many of the scheduling algorithms found in the literature. The main result of this thesis is the design and development of a new approach to the identification of different process scheduling requirements and their scheduling according to these requirements. This approach is shown to be both feasible and performance-wise promising, and may also prove to be useful when integrated with other approaches. Another accomplishment of this work is the development of an extensive scheduler system that is both very efficient and flexible, and allows for testing real application behavior on real clusters, measuring real scheduling issues. This work was done partly at the parallel systems laboratory of the Hebrew university in Jerusalem partly at the Modeling, Algorithms and Informatics group of the Computer and Computational Sciences division (CCS-3) of the Los Alamos national laboratory.
Thesis (PDF 707KB) Presentation (PDF 634KB) Source (Compressed tar 1014KB) BibTeX

Eitan Frachtenberg, Fabrizio Petrini, Salvador Coll and Wu-chun Feng. Gang Scheduling with Lightweight User-Level Communication, In Workshop on Scheduling and Resource Management for Cluster Computing SRMCC'01, In conjunction with the International Conference on Parallel Processing ICPP'01, September 2001.
Abstract
In this paper, we explore the performance of gang scheduling on a cluster using the Quadrics interconnection network. In such a cluster, the scheduler can take advantage of this network's unique capabilities, including a network interface card-based processor and memory and efficient user-level communication libraries. We developed a micro-benchmark to test the scheduler's performance under various aspects of parallel job workloads: memory usage, bandwidth and latency-bound communication, number of processes, timeslice quantum, and multiprogramming levels. Our experiments show that the gang scheduler performs relatively well under most workload conditions, is largely insensitive to the number of concurrent jobs in the system and scales almost linearly with number of nodes. On the other hand, the scheduler is very sensitive to the timeslice quantum, and values under 30 seconds can incur large overheads and fairness problems.
Paper (PDF 83KB) Presentation materials (PDF 789KB) Source (Compressed tar 190KB) BibTeX

Eitan Frachtenberg and Fabrizio Petrini. Overlapping Communication and Computation in the Quadrics Network, August 2001.
Report (PDF 83KB) Source (LyX 26KB) BibTeX

Eitan Frachtenberg and Fabrizio Petrini. Scheduler Testbed System Design, August 2001.
Report (PDF 351KB) Source (Compressed tar 259KB) BibTeX

Scalable system software and STORM

One of the main research questions our team is trying to address is how to develop scalable, high-performance system software. As part of this effort, we developed and advanced resource management tool, called STORM (Scalable TOol for Resource Management). This environment was measured to have unprecedented performance in typical resource-management tasks such as job launching in large clusters. STORM is also an excellent platform for studying, implementing and evaluating various job scheduling algorithms, and many of these are already incorporated in STORM.
STORM and system software related publications:

Tobias Pfandzelter, Aditya Dhakal, Eitan Frachtenberg, Sai R. Chalamalasetti, Darel Emmot, Ninad Hogade, Hong-Enriquez R. P. and, Gourav Rattihalli, David Bermbach and Dejan Milojicic. Kernel-as-a-Service: A Serverless Programming Model for Heterogeneous Hardware Accelerators, In 24th ACM/IFIP International Middleware Conference (Middleware'23), December 2023.
Abstract
With the slowing of Moore’s law and decline of Dennard scaling, computing systems increasingly rely on specialized hardware ac- celerators in addition to general-purpose compute units. Increased hardware heterogeneity necessitates disaggregating applications into workflows of fine-grained tasks that run on a diverse set of CPUs and accelerators. Current accelerator delivery models cannot support such applications efficiently, as (1) the overhead of manag- ing accelerators erases performance benefits for fine-grained tasks; (2) exclusive accelerator use per task leads to underutilization; and (3) specialization increases complexity for developers. We propose adopting concepts from Function-as-a-Service (FaaS), which has solved these challenges for general-purpose CPUs in cloud computing. Kernel-as-a-Service (KaaS) is a novel serverless programming model for generic compute accelerators that aids heterogeneous workflows by combining the ease-of-use of higher- level abstractions with the performance of low-level hand-tuned code. We evaluate KaaS with a focus on the breadth of the idea and its generality to diverse architectures rather than on an in-depth implementation for a single accelerator. Using proof-of-concept prototypes, we show that this programming model provides per- formance, performance efficiency, and ease-of-use benefits across a diverse range of compute accelerators. Despite increased levels of abstraction, when compared to a naive accelerator implementa- tion, KaaS reduces completion times for fine-grained tasks by up to 96.0% (GPU), 68.4% (FPGA), 98.6% (TPU), and 34.9% (QPU) in our experiments.
Paper (PDF 1031KB) Reviews (Text 13KB) Source (456KB) BibTeX

Eitan Frachtenberg, Fabrizio Petrini, Juan Fernandez and Scott Pakin. STORM: Scalable Resource Management for Large-Scale Parallel Computers. In IEEE Transactions on Computers 55 (12) : 1572–1587, December 2006.
Abstract
Although clusters are a popular form of high-performance computing, they remain more difficult to manage than sequential systems—or even symmetric multiprocessors. In this paper, we identify a small set of primitive mechanisms that are sufﬁciently general to be used as building blocks to solve a variety of resource-management problems. We then present STORM, a resource-management environment that embodies these mechanisms in a scalable, low-overhead, and efﬁcient implementation. The key innovation behind STORM is a modular software architecture that reduces all resource management functionality to a small number of highly scalable mechanisms. These mechanisms simplify the integration of resource management with low-level network features. As a result of this design, STORM can launch large, parallel applications an order of magnitude faster than the best time reported in the literature and can gang-schedule a parallel application as fast as the node OS can schedule a sequential application. This paper describes the mechanisms and algorithms behind STORM and presents a detailed performance model that shows that STORM's performance can scale to thousands of nodes.
Preprint (PDF 264KB) Reviews (Text 16KB) Source (Compressed tar 2022KB) BibTeX

Juan Fernandez, Eitan Frachtenberg, Fabrizio Petrini and Jose C. Sancho. An Abstract Interface for System Software on Large-Scale Clusters. In The Computer Journal 49 (4) : 454–469, July 2006.
Abstract
Scalable management of distributed resources is one of the major challenges when building large-scale clusters for high-performance computing. This task includes transparent fault tolerance, efficient deployment of resources and support for all the needs of parallel applications: parallel I/O, deterministic behavior and responsiveness. These challenges may seem daunting with commodity hardware and operating systems, since they were not designed to support a global, single management view of a large-scale system. In this paper we propose and demonstrate an abstract network interface in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, which can be thought of as a coarse-grain SIMD operating system, can allow commodity clusters to grow to thousands of nodes, while still retaining the usability and performance of the single-node workstation.
Preprint (PDF 416KB) BibTeX

Juan Fernandez, Fabrizio Petrini and Eitan Frachtenberg. Achieving Predictable and Scalable Performance with BCS-MPI, In Engineering the Grid: Status and Perspective, January 2006.
Abstract
Demand for increasingly-higher computing capability is driving a similar growth in compute cluster sizes, soon to be reaching tens of thousands of processors. This growth is not matched however by system software, which has remained largely unchanged from the advent of clusters. The failure of system software to scale and develop in the same rate as the underlying hardware constrains the productivity of these machines by severely limiting their utilization, reliability, and responsiveness. The traditional approach to system software, namely, the use of loosely-coupled independent daemons on each node, is inadequate for the management of large-scale clusters, a problem which is inherently tightly-coupled and requires a high degree of synchronization. One model for large-scale system software is Buffered Coscheduling (BCS), wherein synchronization and scalability are obtained by means of global scheduling of all system activities and collective network operations. BCS represents a new methodology for the design of system software as a single, parallel program using traditional parallel constructs. As such, system software can be made orders of magnitude more scalable, simple, and easy to debug than the existing distributed solutions. The most important aspect of the BCS model and the overlying system software is the buffering and scheduling of all communication, resulting in highly controllable and deterministic system behavior. This chapter describes in detail the implementation of BCS-MPI, an MPI library designed after this model, and shows that the benefits of determinism need not come at a significant performance cost. Furthermore, BCS-MPI comes with a sophisticated monitoring and debugging subsystem that simplifies the analysis of system and application performance, and is covered in detail in this chapter. keywords: Cluster computing, system software, buffered coscheduling, MPI, communication protocol, parallel monitoring and debugging, Quadrics, QsNet.
Chapter text (PDF 264KB) Source (945KB) BibTeX

Juan Fernandez, Fabrizio Petrini and Eitan Frachtenberg. Monitoring and Debugging Parallel Software with BCS-MPI on Large-Scale Clusters, In First Workshop on System Management Tools for Large-Scale Parallel Systems SMTPS'05, April 2005.
Abstract
Buffered CoScheduled (BCS) MPI is a novel implementation of MPI based on global synchronization of all system activities. BCS-MPI imposes a model where all processes and their communication are tightly scheduled at a very fine granularity. Thus, BCS-MPI provides a system that is much more controllable and deterministic. BCS-MPI leverages this regular behavior to provide a simple yet powerful monitoring and debugging subsystem that streamlines the analysis of parallel software. This subsystem, called Monitoring and Debugging System (MDS), provides exhaustive process and communication scheduling statistics. This paper covers in detail the design and implementation of the MDS subsystem, and demonstrates how the MDS can be used to monitor and debug not only parallel MPI applications but also the BCS-MPI runtime system itself. Additionally, we show that this functionality need not come at a significant performance loss.
Paper (PDF 208KB) Presentation materials (PDF 571KB) Reviews (Text) Source (849KB) BibTeX

Eitan Frachtenberg. SchedMark: Evaluating Scheduler Performance. In Usenix Annual Technical Conference USENIX'05, Anaheim, CA, April 2005.
Poster image (PDF 444KB) Source (Adobe Illustrator 454KB) BibTeX

Eitan Frachtenberg. SchedMark: Evaluating Scheduler Performance. In Work-in-Progress session of Usenix Annual Technical Conference USENIX'05, Anaheim, CA, April 2005.
Abstract
SchedMark: Evaluating Scheduler Performance In the last decade we have witnessed many new developments in process scheduling---developments driven by innovations in processor technology and by ever more versatile workloads. Many innovative schedulers are emerging in the scientific literature as well as enhancements and patches to production operating systems. However, one difficulty persists in the evaluation and comparison of scheduling schemes, even between those aiming to address the same problem: the lack of an agreed-upon benchmark for scheduling. Many studies come up with their own measures, workloads, and metrics, thus making their comparison an apples-and-oranges situation. The problem will probably grow worse with the emergence of commodity multicore and multithreaded chips, which are likely to result in even more scheduling work. The lack of a reliable, reproducible, and portable way to measure scheduling will grow even more critical. To address this need, we are developing SchedMark, a benchmark suite to evaluate the scheduler's impact on applications and workloads of various representative classes. SchedMark contrasts with previous approaches that measured either specific workloads (e.g., multimedia), or focused on system metrics (e.g., context-switch overhead). SchedMark will include synthetic desktop and server applications that represent a range of parameters and classes, such as continuous media, interactive, parallel, and sequential computation applications. The suite will measure not only global metrics, such as throughput, but also metrics that are of specific relevance to each application, such as dropped frames for continuous media, and response time for interactive applications. The set is designed to be portable, self-calibrating, and self-scaling, so that its results will remain comparable across a large set of architectures and operating systems. Workloads will be both static and dynamic, containing various combinations of classes of applications to capture the effect of scheduling decisions on co-interference and cache performance.
Presentation (HTML) Source (Compressed tar 34KB)

Fabrizio Petrini, Jose C. Sancho, Eitan Frachtenberg, Juan Fernandez and Kei Davis. On the Design of System Software for Large-Scale Clusters, In XV Jornadas de Paralelismo (pp. 237–242), September 2004. (ISBN: 84-8240-714-7).
BibTeX

Eitan Frachtenberg. Designing Parallel Operating Systems using Modern Computer Networks. In MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachussetes Institute of Technology, Cambridge MA, September 2004. Invited talk
Download slides.

Juan Fernandez, Eitan Frachtenberg, Fabrizio Petrini, Kei Davis and Jose C. Sancho. Architectural Support for System Software on Large-Scale Clusters, In International Conference on Parallel Processing ICPP'04, August 2004.
Abstract
Scalable management of distributed resources is one of the major challenges in deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel applications: parallel I/O, deterministic behavior, and responsiveness. These requirements are daunting with commodity hardware and operating systems since they were not designed to support a global, single management view of a large-scale system. In this paper we propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, which can be thought of as a coarse-grain SIMD operating system, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of the single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms. Keywords: Cluster computing, cluster operating system, fault tolerance, network hardware, debuggability, resource management.
Paper (PDF 149KB) Presentation (PowerPoint 554KB) Reviews (Text 10KB) Source (Compressed tar 425KB) BibTeX

Eitan Frachtenberg, Kei Davis, Fabrizio Petrini, Jose C. Sancho and Juan Fernandez. Designing Parallel Operating Systems via Parallel Programming, In Euro-Par Europar'04, August 2004.
Abstract
Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, soon to be reaching tens of thousands of processors. Many functionalities of system software have failed to scale accordingly---systems are becoming more complex, less reliable, and less efficient. Our premise is that these deficiencies arise from a lack of global control and coordination of the processing nodes. In practice, current parallel machines are loosely-coupled systems that are used for solving inherently tightly-coupled problems. This paper demonstrates that existing and future systems can be made more scalable by using BSP-like parallel programming principles in the design and implementation of the system software, and by taking full advantage of the latest interconnection network hardware. Moreover, we show that this approach can also yield great improvements in efficiency, reliability, and simplicity.
Paper (PDF 61KB) Presentation (PowerPoint 992KB) Reviews (Text) Source (Compressed tar 34KB) BibTeX

Jose C. Sancho, Fabrizio Petrini, Greg Johnson, Juan Fernandez and Eitan Frachtenberg. On the Feasibility of Incremental Checkpointing for Scientific Computing, In International Parallel and Distributed Processing Symposium IPDPS'04, April 2004.
Abstract
In the near future large-scale parallel computers will feature hundreds of thousands of processing nodes. In such systems, fault tolerance is critical as failures will occur very often. Checkpointing and rollback recovery has been extensively studied as an attempt to provide fault tolerance. However, current implementations do not provide the total transparency and full flexibility that are necessary to support the new paradigm of autonomic computing -- systems able to self-heal and self-repair. In this paper we provide an in-depth evaluation of incremental checkpointing for scientific computing. The experimental results, obtained on a state-of-the art cluster running several scientific applications, show that efficient, scalable, automatic and user-transparent incremental checkpointing is within reach with current technology.
Paper (PDF 165KB) Presentation (PowerPoint 2033KB) Reviews (Text) Source (Compressed tar 430KB) BibTeX

Juan Fernandez, Fabrizio Petrini and Eitan Frachtenberg. Buffered Coscheduled (BCS) MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers. In CCS Division Review Committee, Los Alamos National Laboratory, April 2004.
Poster image (PDF 446KB) Source (Adobe Illustrator 446KB) BibTeX

Juan Fernandez, Eitan Frachtenberg and Fabrizio Petrini. BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers, In IEEE/ACM Conference on Supercomputing SC'03, November 2003.
Abstract
Buffered CoScheduled MPI (BCS-MPI) introduces a new approach to design the communication layer for large-scale parallel machines. The emphasis of BCS-MPI is on the global coordination of a large number of communicating processes rather than on the traditional optimization of the point-to-point performance. BCS-MPI delays the interprocessor communication in order to schedule globally the communication pattern and it is designed on top of a minimal set of collective communication primitives. In this paper we describe a prototype implementation of BCS-MPI and its communication protocols. Several experimental results, executed on a set of scientific applications, show that BCS-MPI can compete with a production-level MPI implementation, but is much simpler to implement, debug and model.
Paper (PDF 203KB) Presentation (PDF 941KB) Reviews (HTML) Source (Compressed tar 201KB) BibTeX

Eitan Frachtenberg, Fabrizio Petrini, Juan Fernandez, Scott Pakin and Salvador Coll. STORM: Lightning-Fast Resource Management, In IEEE/ACM Conference on Supercomputing SC'02, November 2002.
Abstract
Although clusters are a popular form of high-performance computing (HPC), they remain more difficult to manage than sequential systems, or even symmetric multiprocessors. Furthermore, as cluster sizes increase, resource management---essentially, everything that runs on a cluster other than the applications---becomes an increasingly large impediment to application efficiency. In this talk we present STORM, a resource-management framework designed for scalability and performance. The key innovation behind STORM is a software architecture that enables resource management to exploit low-level network features. As a result of this HPC-application-like design, STORM is orders of magnitude faster than the best reported results in the literature on two sample resource-management functions: job launching and process scheduling. Further, we identify a small set of network primitives that is sufficient for a scalable implementation of a resource manager if implemented itself in a scalable manner.
Paper (PDF 265KB) Presentation (PDF 1082KB) Reviews (Postscript 72KB) Source (Compressed tar 190KB) BibTeX

Eitan Frachtenberg, Fabrizio Petrini, Juan Fernandez and Salvador Coll. Scalable Resource Management in High-Performance Computers, In IEEE International Conference on Cluster Computing Cluster'02, September 2002.
Abstract
Clusters of workstations have emerged as an important platform for building cost-effective, scalable, and highly-available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource management tool designed to provide scalability, low overhead, and the flexibility necessary to efficiently support and analyze a wide range of job-scheduling algorithms. STORM achieves these feats by using a small set of primitive mechanisms that are common in modern high-performance interconnects. The architecture of STORM is based on three main technical innovations. First, a part of the scheduler runs in the thread processor located on the network interface. Second, we use hardware collectives that are highly scalable both for implementing control heartbeats and to distribute the binary of a parallel job in near-constant time. Third, we use an I/O bypass protocol that allows fast data movements from the file system to the communication buffers in the network interface and vice versa. The experimental results show that STORM can launch a job with a binary of 12MB on a 64-processor, 32-node cluster in less than 250ms. This paper provides experimental and analytical evidence that these results scale to a much larger number of nodes. To the best of our knowledge, STORM significantly outperforms existing production schedulers in launching jobs, performing resource management tasks, and gang-scheduling tasks. Keywords: Cluster Computing, Resource Management, Job Scheduling, Gang Scheduling, Parallel Architectures, Quadrics Interconnect, I/O bypass
Paper (PDF 132KB) Presentation (PDF 1401KB) Reviews (Text) Source (Compressed tar 604KB) BibTeX

Eitan Frachtenberg, Juan Fernandez, Fabrizio Petrini and Scott Pakin. STORM: A Scalable TOol for Resource Management. In The conference on High-Speed Computing, Glenden Beach, OR, April 2002. Invited work A more recent (and nicer) version of the poster can be found here. Yet another version was shown at the Los Alamos National Laboratory booth at IEEE/ACM Conference on Supercomputing SC'02 here.
Poster image (1093KB) Source (Compressed tar 5150KB) BibTeX

Network Protocols and interconnects

We have investigated and proposed several network protocols for advanced networks that offer multiple rails, that is, a redundancy of networks (interfaces and switches). Multiple rails allow for increased network performance, but hard to exploit efficiently with current bus technology. The papers in this list try to address this with by static or dynamic allocation of rails to messages:

Fabrizio Petrini, Juan Fernandez, Eitan Frachtenberg and Salvador Coll. Scalable Collective Communication on the ASCI Q Machine, In 11th Hot Interconnects conference HOTi11, August 2003.
Abstract
Scientific codes spend a considerable part of their run time executing collective communication operations. Such operations can also be critical for efficient resource management in large-scale machines. Therefore, scalable collective communication is a key factor to achieve good performance in large-scale parallel computers. In this paper we describe the performance and scalability of some common collective communication patterns on the ASCI~Q machine. Experimental results conducted on a 1024-node/4096-processor segment show that the network is fast and scalable. The network is able to barrier-synchronize in a few tens of $\mu$s, perform a broadcast with an aggregate bandwidth of more than 100 GB/s and sustain heavy hot-spot traffic with a limited performance degradation.
Paper (PDF 214KB) Presentation (PDF 1162KB) Reviews (Text) Source (Compressed tar 1494KB) BibTeX

Fabrizio Petrini, Eitan Frachtenberg, Adolfy Hoisie and Salvador Coll. Performance Evaluation of the Quadrics Interconnection Network. In Journal of Cluster Computing 6 (2) : 125–142, April 2003.
Abstract
In this paper we present an in-depth description of the Quadrics interconnection network (QsNET) and an experimental performance evaluation on a 64-node Alphaserver cluster. We expose the performance and the scaling properties of the network by using a collection of benchmarks, using permutation patterns, congested traffic with several types of hotspots and I/O traffic. The experimental results indicate that the QsNET provides excellent performance in most cases, with excellent contention resolution mechanisms.
Preprint (PDF 597KB) Reviews (Text) Source (Compressed tar 882KB) BibTeX

Salvador Coll, Fabrizio Petrini, Eitan Frachtenberg and Adolfy Hoisie. Performance Evaluation of I/O Traffic and Placement of I/O Nodes on a High Performance Network, In Workshop on Communication Architecture for Clusters CAC'02, In conjunction with the International Parallel and Distributed Processing Symposium IPDPS'02, April 2002.
Abstract
A common trend in the design of large-scale clusters is to use a high-performanc e data network to integrate the processing nodes in a single parallel computer. In these systems the performance of the interconnect can be a limiting factor for the input/output (I/O), which is traditionally bottlenecked by the disk bandwidth. In this paper we present an experimental analysis on a 64-node AlphaServer cluster based on the Quadrics network (QsNET) of the behavior of the interconne ct under I/O traffic, and the influence of the placement of the I/O servers on the overall performance. The effects of using dedicated I/O nodes or overlapping I/O and computation on the I/O nodes are also analyzed. In addition, we evaluate how background I/O traffic interferes with other parallel applications running concurrently. Our experimental results show that a correct placement of the I/O servers can provide up to 20% increase in the available I/O bandwidth. Moreover, some important guidelines for applications and I/O servers mapping on large-scale clusters are given.
Paper (PDF 141KB) Presentation materials (PDF 2833KB) Reviews (Text) Source (Compressed tar 431KB) BibTeX

Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll and Eitan Frachtenberg. The Quadrics Network (QsNet): High-Performance Clustering Technology. In IEEE Micro 22 (1) : 46–57, February 2002.
Abstract
The Quadrics interconnection network (QsNet) contributes two important innovations to the field of high-performance interconnects: (1) integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets. QsNet achieves these feats by extending the native operating system in the nodes with a network operating system and specialized hardware support in the network interface. As these and other important features of QsNet can be found in the InfiniBand specification, QsNet can be viewed as a precursor to InfiniBand. In this paper, we present an initial performance evaluation of QsNet. We first describe the main hardware and software features of QsNet, followed by the results of benchmarks that we ran on our experimental, Intel-based, Linux cluster built around QsNet. Our initial analysis indicates that QsNet performs remarkably well, e.g., user-level latency under 2 mus and bandwidth over 300 MB/s.
Preliminary version (PDF 127KB) Source (Compressed tar 162KB) BibTeX

Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg and Adolfy Hoisie. Hardware- and Software-Based Collective Communication on the Quadrics Network, In 1st IEEE International Symposium on Network Computing and Applications NCA'01, October 2001.
Abstract
The efficient implementation of collective communication patterns in a parallel machine is a challenging design effort, that requires the solution of many problems. In this paper we present an in-depth description of how the Quadrics network supports both hardware- and software-based collectives. We describe the main features of the two building blocks of this network, a network interface that can perform zero-copy user-level communication and a wormhole routing switch. We also focus our attention on the routing and flow control algorithms, deadlock avoidance and on how the processing nodes are integrated in a global, virtual shared memory. Experimental results conducted on 64-node AlphaServer cluster indicate that the time to complete the hardware-based barrier synchronization on the whole network is as low as 6 mus, with very good scalability. Good latency and scalability are also achieved with the software-based synchronization, which takes about 15 mus. With the broadcast, similar performance is achieved by the hardware- and software-based implementations, which can deliver messages of up to 256 bytes in 13 mus and can get a sustained asymptotic bandwidth of 288 Mbytes/sec on all the nodes. The hardware-based barrier is almost insensitive to the network congestion, with 93% of the synchronizations taking less than 20 when the network is flooded with a background traffic of unicast messages. On the other hand, the software-based implementation suffers from a significant performance degradation. With high load the hardware broadcast maintains a reasonably good latency, delivering messages up to 2KB in 200 mus, while the software broadcast suffers from slightly higher latencies inherited from the synchronization mechanism. Both broadcast algorithms experience a significative performance degradation of the sustained bandwidth with large messages.
Paper (PDF 168KB) Presentation (PDF 2473KB) Reviews (Text) Source (Compressed tar 182KB) BibTeX

Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll and Eitan Frachtenberg. The Quadrics Network (QsNet): High-Performance Clustering Technology, In 9th Hot Interconnects conference HOTi9, August 2001.
Abstract
The Quadrics interconnection network (QsNet) contributes two novel innovations to the field of high-performance interconnects: (1) integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets. QsNet achieves these feats by extending the native operating system in the nodes with a network operating system and specialized hardware support in the network interface. As these and other important features of QsNet can be found in the InfiniBand specification, QsNet can be viewed as a precursor to InfiniBand. In this paper, we present an initial performance evaluation of QsNet. We first describe the main hardware and software features of QsNet, followed by the results of benchmarks that we ran on our experimental, Intel-based, Linux cluster built around QsNet. Our initial analysis indicates that QsNet performs remarkably well, e.g., user-level latency under 2 mus and bandwidth over 300 MB/s.
Paper (PDF 93KB) Presentation (PDF 435KB) Source (Compressed tar 166KB) BibTeX

Eitan Frachtenberg and Fabrizio Petrini. Overlapping Communication and Computation in the Quadrics Network, August 2001.
Report (PDF 83KB) Source (LyX 26KB) BibTeX