The use of clusters of independent compute nodes as high capability
and capacity computers is rapidly growing in industry, academia, and
government. This growth is accompanied by fast-paced progress in cluster-aware
hardware, and in particular in interconnection technology. Contemporary
networks offer not only excellent performance as expressed by latency
and bandwidth, but also advanced architectural features, such as programmable
network interface cards, hardware support for collective communication
operations, and support for modern communication interfaces and
mechanisms such as MPI and remote direct memory access (RDMA).
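
To make these architectural features concrete, the following minimal
C/MPI sketch exercises two collective operations, a barrier and a
global reduction, of the kind that such interconnects can execute
largely in hardware; it uses only standard MPI calls and assumes no
particular network:

    /* collectives.c: collective operations that modern interconnects
     * can offload to hardware.
     * Compile: mpicc collectives.c -o collectives
     * Run:     mpirun -np 4 ./collectives
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Global synchronization point: hardware multicast/combine
         * support can implement this in logarithmic time or better. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Global reduction: each node contributes one value and every
         * node receives the sum; NICs with collective support can
         * combine the values inside the network fabric itself. */
        int local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", size - 1, sum);

        MPI_Finalize();
        return 0;
    }
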
The rapid progress in cluster hardware and usage is unfortunately
not matched by similar progress in system software. This software
comprises the middleware layer: the operating system, user libraries,
and utilities that mediate between the hardware and the user applications,
allowing applications to make use of the machine's resources. In fact, most
of these clusters use common workstation operating systems such as
Linux running on each of the cluster's nodes, with a collection of
loosely-related libraries, utilities, and scripts to access the cluster's
resources. Such solutions are hardly adequate for large-scale clusters
or high-performance computing applications. The problems they
cause include (but are not limited to): (1) poor performance and scalability
of applications and system software; (2) reduced utilization of the
machine due to suboptimal resource allocation; (3) reliability problems
caused by the multitude of independent software modules and the redundancy
in their operation; and (4) difficulty in operating and making full
use of these machines.
The premise behind this dissertation is that system software can be
dramatically improved in terms of performance, scalability, reliability,
and simplicity by making use of the features offered by modern interconnects.
Unlike the tasks of a single-node operating system, most of a cluster's
system-software tasks involve the efficient global synchronization of resources. As such,
parallel system software can be designed to benefit from the novel
hardware features offered by contemporary interconnection technology.
This dissertation promotes the idea of treating a cluster's operating
system like any other high-performance parallel application: increasing
its reliance on the interconnect's synchronization capabilities while
reducing its per-node complexity and redundancy.
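
As a purely illustrative sketch of this premise (not the actual
implementation described later), the following C/MPI fragment expresses
one resource-management decision, choosing the least-loaded node for an
incoming job, as a single global reduction rather than a chain of
point-to-point messages; node_load() is a hypothetical stand-in for a
real load metric:

    /* Hypothetical sketch: a global scheduling step expressed as one
     * collective operation. Each node contributes its current load; a
     * single MINLOC reduction tells every node which node should
     * receive the next job. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed placeholder for a real load metric (run-queue length,
     * memory pressure, etc.); here it returns a random value. */
    static double node_load(void) { return (double)(rand() % 100); }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        srand(rank + 1);

        struct { double load; int node; } local, global;
        local.load = node_load();
        local.node = rank;

        /* One collective replaces a gather/decide/broadcast exchange;
         * on hardware with collective support it completes in a few
         * microseconds, independent of the software on each node. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("next job goes to node %d (load %.0f)\n",
                   global.node, global.load);

        MPI_Finalize();
        return 0;
    }
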
This dissertation makes the following primary contributions. First,
it describes a set of network mechanisms necessary to support this
system-software model. It then discusses a prototype implementation of
system software based on these mechanisms. This system currently
tackles three main aspects of parallel computers: resource management,
communication libraries, and job scheduling methods. This model was
implemented on three different cluster architectures. Extensive performance
and scalability evaluations with real clusters and applications show
significant improvements over previous work in all three areas. In
particular, this research focuses on job scheduling strategies and
demonstrates that, with advanced scheduling algorithms, the system's
throughput and responsiveness can be improved over a wide spectrum of workloads.