Historically, Markovian predictors have been very successful in predicting branch outcomes. In this work we propose a hybrid scheme that employs two Prediction by Partial Matching (PPM) Markovian predictors, one that predicts based on local branch histories and one based on global branch histories. The two independent predictors are combined using a neural network. On the CBP-2 traces the proposed scheme achieves over twice the prediction accuracy of the gshare predictor.
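To make the combination concrete, here is a minimal Python sketch of the general idea, not the submitted predictor: two simple context-based counter tables stand in for the local- and global-history PPM components, and a single perceptron learns how to weigh their votes. All table sizes and the training threshold are illustrative assumptions.

```python
# Minimal sketch (not the paper's design): two context predictors combined by a perceptron.
from collections import defaultdict

class ContextPredictor:
    """Stand-in for a PPM component: one 2-bit saturating counter per history context."""
    def __init__(self, history_bits):
        self.mask = (1 << history_bits) - 1
        self.counters = defaultdict(lambda: 2)        # start weakly taken

    def predict(self, history):
        return 1 if self.counters[history & self.mask] >= 2 else -1

    def update(self, history, taken):
        key = history & self.mask
        c = self.counters[key]
        self.counters[key] = min(3, c + 1) if taken else max(0, c - 1)

class HybridPredictor:
    def __init__(self, local_bits=8, global_bits=12):
        self.local_mask = (1 << local_bits) - 1
        self.global_mask = (1 << global_bits) - 1
        self.local = defaultdict(int)                 # per-branch local history registers
        self.ghist = 0                                # global history register
        self.p_local = ContextPredictor(local_bits)
        self.p_global = ContextPredictor(global_bits)
        self.w = [0.0, 0.0, 0.0]                      # perceptron weights: bias, local, global

    def predict(self, pc):
        xl = self.p_local.predict(self.local[pc])
        xg = self.p_global.predict(self.ghist)
        y = self.w[0] + self.w[1] * xl + self.w[2] * xg
        return y >= 0, (xl, xg, y)

    def update(self, pc, taken, state):
        xl, xg, y = state
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) < 1.0:         # perceptron training rule with threshold
            self.w[0] += t
            self.w[1] += t * xl
            self.w[2] += t * xg
        self.p_local.update(self.local[pc], taken)
        self.p_global.update(self.ghist, taken)
        self.local[pc] = ((self.local[pc] << 1) | int(taken)) & self.local_mask
        self.ghist = ((self.ghist << 1) | int(taken)) & self.global_mask

# Usage: pred, state = hp.predict(pc); ...branch resolves...; hp.update(pc, taken, state)
```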
The advent of Web-based services and cloud computing has instigated an explosive growth in demand for datacenters. Traditionally, Internet companies would lease datacenter space and servers from vendors that often emphasize flexibility over efficiency. But as these companies grew larger, they sought to reduce acquisition and operation costs by building their own datacenters. Facebook reached this stage earlier in 2011 when it unveiled its first customized datacenter in Prineville, Oregon. In designing this datacenter, Facebook took a blank-slate approach where all aspects were rethought for maximum efficiency. Although the resulting datacenter is optimized for Facebook's workload, it is general enough to appeal to a wide variety of applications. This paper describes our choices and innovations in the thermal design of the datacenter building, which employs 100% outside-air economization. The efficiency of this design is manifest in an average infrastructure energy use reduction of 86% compared to leased space, and an overall energy use reduction of 29%. This reduction in turn translates to a power usage efficiency of 1.08, measured over the summer of 2011.
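For reference, the 1.08 figure is the ratio of total facility energy to the energy delivered to IT equipment. In LaTeX (the decomposition into cooling and power-distribution overheads is the standard definition of the metric, not a measured breakdown from this paper):

\[
\mathrm{PUE} \;=\; \frac{E_{\mathrm{total\ facility}}}{E_{\mathrm{IT}}}
\;=\; 1 + \frac{E_{\mathrm{cooling}} + E_{\mathrm{power\ distribution}} + \cdots}{E_{\mathrm{IT}}},
\qquad
\mathrm{PUE} = 1.08 \;\Longrightarrow\; \text{infrastructure overhead} \approx 8\%\ \text{of IT energy}.
\]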
Since the dawn of Quantum Computing (QC), theoretical developments like Shor's algorithm have proved the conceptual superiority of QC over traditional computing. However, such quantum supremacy claims are difficult to achieve in practice due to the technical challenges of realizing noiseless qubits. In the near future, QC applications will need to rely on noisy quantum devices that offload part of their work to classical devices. One way to achieve this is by using Parameterized Quantum Circuits (PQCs) in optimization or even machine learning tasks. The energy consumption of quantum algorithms has been poorly studied. Here we explore several optimization algorithms, using both theoretical insights and numerical experiments, to understand their impact on energy consumption. Specifically, we highlight why and how algorithms like Quantum Natural Gradient Descent, Simultaneous Perturbation Stochastic Approximation, and Circuit Learning methods are at least 2× to 4× more energy efficient than their classical counterparts. We also explain why Feedback-Based Quantum Optimization is energy-inefficient, and how a technique like Rosalin could boost the energy efficiency of other algorithms by a factor of 20× or more.
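As a rough, hedged illustration of one reason evaluation counts drive energy on hybrid quantum-classical optimizers (treating circuit executions per optimization step as a crude energy proxy is our simplification, not the paper's energy model): parameter-shift gradients scale with the number of parameters, while SPSA does not.

```python
# Back-of-the-envelope comparison (assumption: energy per step is roughly
# proportional to the number of circuit executions, each repeated `shots` times).

def evals_parameter_shift(num_params):
    # Parameter-shift gradient: two evaluations per parameter per step.
    return 2 * num_params

def evals_spsa(num_params):
    # SPSA: two evaluations per step, independent of the number of parameters.
    return 2

if __name__ == "__main__":
    shots = 1000
    for p in (4, 16, 64, 256):
        print(f"{p:4d} params | shift-rule circuits/step: {evals_parameter_shift(p) * shots:8d}"
              f" | SPSA circuits/step: {evals_spsa(p) * shots:6d}")
```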
Scaling data centers to handle task-parallel workloads requires balancing the cost of hardware, operations, and power. Low-power, low-core-count servers reduce costs in one of these dimensions, but may require additional nodes to provide the required quality of service or increase costs by underutilizing memory and other resources. We show that a high-core-count processor operating at a low clock rate and very low power can compare well in throughput, response time, and power consumption with a platform using faster but fewer commodity cores. Specific measurements are made for a key-value store, Memcached, using a variety of systems based on three different processors: the 4-core Intel Xeon L5520, 8-core AMD Opteron 6128 HE, and 64-core Tilera TILEPro64.
Large-scale datacenters consume megawatts in power and cost hundreds of millions of dollars to equip. Reducing the energy and cost footprint of servers can therefore have substantial impact. Web, Grid, and cloud servers in particular can be hard to optimize, since they are expected to operate under a wide range of workloads. For our upcoming datacenter, we set out to significantly improve its power efficiency, cost, reliability, serviceability, and environmental footprint. To this end, we redesigned many dimensions of the datacenter and servers in conjunction. This paper focuses on our new server design, combining aspects of power, motherboard, thermal, and mechanical design. We calculate and confirm experimentally that our custom-designed servers can reduce power consumption across the entire load spectrum while at the same time lowering acquisition and maintenance costs. Importantly, our design does not decrease the servers' performance or portability, which would otherwise limit its applicability.
The current work introduces a method for predicting Memcached throughput on single-core and multi-core processors. The method is based on traces collected from a full system simulator running Memcached. A series of microarchitectural simulators consume these traces and the results are used to produce a CPI model composed of a baseline issue rate, cache miss rates, and branch misprediction rate. Simple queueing models are used to produce throughput predictions that are accurate to within 8% to 17%.
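A minimal sketch of this kind of model follows, with placeholder coefficients rather than the paper's measured values; the additive CPI decomposition follows the abstract, and an M/M/1 queue is assumed here as the "simple queueing model".

```python
# Illustrative sketch only: penalties, instruction counts, and the M/M/1 queue
# are assumptions standing in for the paper's measured parameters.

def cpi_model(cpi_base, miss_rates, miss_penalties, branch_mpki, branch_penalty):
    """Additive CPI: baseline + cache-miss contributions + branch-misprediction penalty."""
    cpi = cpi_base
    for level in miss_rates:
        cpi += miss_rates[level] * miss_penalties[level]    # misses/instr * cycles/miss
    cpi += (branch_mpki / 1000.0) * branch_penalty           # mispredictions/instr * cycles
    return cpi

def throughput(cpi, clock_hz, instr_per_request, load_requests_per_s):
    """Open M/M/1 approximation for a single core serving Memcached requests."""
    service_time = cpi * instr_per_request / clock_hz        # seconds per request
    capacity = 1.0 / service_time                            # max requests/s
    rho = load_requests_per_s / capacity
    resp = service_time / (1.0 - rho) if rho < 1.0 else float("inf")
    return capacity, resp

cpi = cpi_model(cpi_base=0.6,
                miss_rates={"L2": 0.01, "L3": 0.002},        # misses per instruction
                miss_penalties={"L2": 12, "L3": 150},        # cycles per miss
                branch_mpki=5.0, branch_penalty=15)
cap, resp = throughput(cpi, clock_hz=2.0e9, instr_per_request=25_000,
                       load_requests_per_s=40_000)
print(f"CPI={cpi:.2f}  capacity={cap:,.0f} req/s  mean response={resp*1e6:.1f} us")
```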
Using multiple independent networks (also known as rails) is an emerging technique to overcome bandwidth limitations and enhance fault-tolerance of current high-performance clusters. We present and analyze various avenues for exploiting multiple rails. Different rail access policies are presented and compared, including static and dynamic allocation schemes. An analytical lower bound on the number of networks required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various allocation schemes in terms of bandwidth and latency. It is also shown that striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load and allocation scheme. The allocation methods compared include a static rail allocation, a round-robin rail allocation, a dynamic allocation based on local knowledge, and a rail allocation that reserves both end-points of a message before sending it. The latter is shown to perform better than other methods at higher loads: up to 49% better than local-knowledge allocation and 37% better than the round-robin allocation. This allocation scheme also shows lower latency and saturates at higher loads (for sufficiently large messages). Most importantly, this proposed allocation scheme scales well with the number of rails and message sizes. This in turn suggests that the performance obtained from the network can be increased through the use of multiple rails if an appropriate rail allocation scheme is used.
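The following toy Python comparison (not the paper's testbed or simulator, and with a deliberately simplified "static" policy that fixes the rail per source node) illustrates why the allocation policy matters: under skewed traffic, a static assignment concentrates load on a few rails, while round-robin and a least-loaded ("local knowledge") policy spread it more evenly.

```python
# Toy comparison of rail-allocation policies: messages are assigned to rails
# and we report per-rail load imbalance, a crude proxy for achievable bandwidth.
import random

def assign(messages, num_rails, policy):
    load = [0] * num_rails
    rr = 0
    for src, size in messages:
        if policy == "static":            # rail fixed by source node (one simple static pre-assignment)
            rail = src % num_rails
        elif policy == "round-robin":     # cycle through rails per message
            rail, rr = rr, (rr + 1) % num_rails
        else:                             # dynamic: pick the least-loaded rail (local knowledge)
            rail = min(range(num_rails), key=lambda r: load[r])
        load[rail] += size
    return load

random.seed(0)
# Skewed traffic: one hot source sends most of the bytes.
messages = [(random.choice([0, 0, 0, 1, 2, 3]), random.randint(1, 64) * 1024)
            for _ in range(10_000)]
for policy in ("static", "round-robin", "dynamic"):
    load = assign(messages, num_rails=4, policy=policy)
    print(f"{policy:12s} max/avg rail load = {max(load) / (sum(load) / len(load)):.2f}")
```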
Using multiple independent networks (also known as rails) is an emerging technique to overcome bandwidth limitations and enhance fault-tolerance of current high-performance clusters. This report presents the limitations and performance of static rail-allocation approaches, where each rail is pre-assigned a direction for communication. An analytical lower bound on the number of networks required for rail allocation is shown. We present an extensive experimental comparison of the behavior of various allocation schemes in terms of bandwidth and latency, compared to static rail allocation. We also compare the ability of static and dynamic rail-allocation mechanisms to stripe messages over multiple rails. Scalability issues of static and dynamic rail allocation are also compared. We find that not only does static rail allocation necessarily consume many resources, it also performs poorly compared to dynamic rail-allocation schemes in all the tested aspects.
Scientific codes spend a considerable part of their run time executing collective communication operations. Such operations can also be critical for efficient resource management in large-scale machines. Therefore, scalable collective communication is a key factor to achieve good performance in large-scale parallel computers. In this paper we describe the performance and scalability of some common collective communication patterns on the ASCI Q machine. Experimental results conducted on a 1024-node/4096-processor segment show that the network is fast and scalable. The network is able to barrier-synchronize in a few tens of µs, perform a broadcast with an aggregate bandwidth of more than 100 GB/s and sustain heavy hot-spot traffic with a limited performance degradation.
A common trend in the design of large-scale clusters is to use a high-performance data network to integrate the processing nodes in a single parallel computer. In these systems the performance of the interconnect can be a limiting factor for the input/output (I/O), which is traditionally bottlenecked by the disk bandwidth. In this paper we present an experimental analysis on a 64-node AlphaServer cluster based on the Quadrics network (QsNET) of the behavior of the interconnect under I/O traffic, and the influence of the placement of the I/O servers on the overall performance. The effects of using dedicated I/O nodes or overlapping I/O and computation on the I/O nodes are also analyzed. In addition, we evaluate how background I/O traffic interferes with other parallel applications running concurrently. Our experimental results show that a correct placement of the I/O servers can provide up to 20% increase in the available I/O bandwidth. Moreover, some important guidelines for applications and I/O servers mapping on large-scale clusters are given.
The efficient implementation of collective communication patterns in a parallel machine is a challenging design effort that requires the solution of many problems. In this paper we present an in-depth description of how the Quadrics network supports both hardware- and software-based collectives. We describe the main features of the two building blocks of this network, a network interface that can perform zero-copy user-level communication and a wormhole routing switch. We also focus our attention on the routing and flow control algorithms, deadlock avoidance and on how the processing nodes are integrated in a global, virtual shared memory. Experimental results conducted on a 64-node AlphaServer cluster indicate that the time to complete the hardware-based barrier synchronization on the whole network is as low as 6 µs, with very good scalability. Good latency and scalability are also achieved with the software-based synchronization, which takes about 15 µs. With the broadcast, similar performance is achieved by the hardware- and software-based implementations, which can deliver messages of up to 256 bytes in 13 µs and can get a sustained asymptotic bandwidth of 288 Mbytes/sec on all the nodes. The hardware-based barrier is almost insensitive to the network congestion, with 93% of the synchronizations taking less than 20 µs when the network is flooded with a background traffic of unicast messages. On the other hand, the software-based implementation suffers from a significant performance degradation. With high load the hardware broadcast maintains a reasonably good latency, delivering messages up to 2KB in 200 µs, while the software broadcast suffers from slightly higher latencies inherited from the synchronization mechanism. Both broadcast algorithms experience a significant performance degradation of the sustained bandwidth with large messages.
The Quadrics interconnection network (QsNet) contributes two novel innovations to the field of high-performance interconnects: (1) integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets. QsNet achieves these feats by extending the native operating system in the nodes with a network operating system and specialized hardware support in the network interface. As these and other important features of QsNet can be found in the InfiniBand specification, QsNet can be viewed as a precursor to InfiniBand. In this paper, we present an initial performance evaluation of QsNet. We first describe the main hardware and software features of QsNet, followed by the results of benchmarks that we ran on our experimental, Intel-based, Linux cluster built around QsNet. Our initial analysis indicates that QsNet performs remarkably well, e.g., user-level latency under 2 µs and bandwidth over 300 MB/s.
Heterogeneous computing is growing as an important hardware and software paradigm, both in high-performance computing and in application computing in general. Nevertheless, the topic is a relative newcomer to undergraduate curricula, and there is a dearth of guidance on suitable syllabi and lesson plans. The educational challenge of teaching this topic is exacerbated by the rapid pace of heterogeneous-hardware innovation and adoption, which can render parts of current textbooks obsolete. To help other educators facing these challenges, and to promote a conversation about a standardized approach toward teaching heterogeneous computing, this paper presents a case study for one semester-long class on the topic. It describes the goals, structure, challenges, and lessons learned from the introduction of a diverse heterogeneous hardware and software environment to computer science majors at Reed College, a small liberal-arts school. This paper also includes suggestions and ideas for future adoption, adaptation, and expansion of this class.
Given a cloud-native application, how do we accurately estimate its performance, such as run time or memory consumption? Accurate estimation is necessary to ensure that the application meets performance goals without resorting to overprovisioning of resources. Additionally, in practice, performance estimation needs to be meaningful and reproducible. Unfortunately, modern HPC systems come with numerous factors affecting performance estimation, such as heterogeneous accelerators, multilevel networks, millions of cores, layered software abstractions, and specialized middleware. Each of these factors adds a degree of variability to empirical performance results. The approaches currently being taught and practiced limit performance evaluation in three ways: (1) usage of incomplete performance descriptions/metrics such as point summaries (e.g., mean, 99th-percentile or median) which hide the rich behavioral patterns in different scenarios; (2) measuring insufficient performance samples, leading to inaccurate performance description; and (3) measuring excessive performance samples, leading to waste of precious computing resources. To overcome these limitations, we propose a new approach to evaluate and reason about application performance in modern HPC in a meaningful way. Our contribution is threefold: (a) we show the difficulty of estimating performance in realistic scenarios: one performance measurement is not enough; (b) we propose to use distributions as the true measure of performance; and (c) we propose several practices and concepts to be taught to HPC students and practitioners, so that they may produce rich and accurate performance evaluations. We see our work having an impact both on educators and on practitioners.
Genetic and Evolutionary Algorithms (GEAs) rely on operators such as mutation and recombination to introduce variation to the genotypes. Because of their crucial role and effect on GEA performance, several studies have attempted to model and quantify the variation induced by different operators on various genotypic representations and GEAs. One metric of particular interest is the locality of genetic operators and representations, or how sensitive the phenotype is to small changes in genotype. Consequently, there is a considerable body of empirical work on the effects that different representations have on locality, with an emphasis on several popular representations, such as Gray encoding, and popular variation operators, such as single-bit mutation and single-point crossover. Here, we compute and prove tight upper and lower bounds on locality. We first precisely define our locality metrics for the single-point mutation and generic crossover operators by reformulating Rothlauf's seminal definitions specific to the binary-to-integer domain of representations. We then prove lower and upper bounds for single-point mutation locality by reducing the problem to mappings on hypercubes, and present constructive algorithms to generate representations of both optimal and pessimal locality. We also compute asymptotic bounds for generalized locality under any crossover operator. Our primary result is that the single-point locality of standard binary encoding is provably as good as Binary-Reflected Gray encoding, while other Gray encodings, which we construct, have worse locality. Another important result is that the generalized locality of any nonredundant binary-to-integer representation quickly converges to the same value, meaning that this metric cannot discriminate among representations and may therefore lose its usefulness for binary-integer representations.
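A small sketch of a single-bit-mutation locality computation in the binary-to-integer domain (a generic Rothlauf-style formulation; the paper's reformulated metrics and normalization may differ). It also reproduces the flavor of the main result above: standard binary and Binary-Reflected Gray encoding score identically under this measure.

```python
# Sketch of a point-mutation locality measure for binary-to-integer encodings.
# Lower is better: 0 would mean every bit flip changes the integer by exactly 1.

def gray_decode(g):
    """Integer whose binary-reflected Gray code is g."""
    x = 0
    while g:
        x ^= g
        g >>= 1
    return x

def point_locality(phenotype, ell):
    """Mean (|phenotypic distance| - 1) over all genotypes and all single-bit flips."""
    total = 0
    for g in range(2 ** ell):
        for bit in range(ell):
            total += abs(phenotype(g) - phenotype(g ^ (1 << bit))) - 1
    return total / (ell * 2 ** ell)

for ell in (4, 8, 10):
    std = point_locality(lambda g: g, ell)      # standard binary: genotype == integer
    gray = point_locality(gray_decode, ell)     # BRGC genotype decoded to integer
    print(f"ell={ell:2d}  standard binary: {std:.3f}   BRGC: {gray:.3f}")
```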
One of the biggest concerns of modern information retrieval systems is reducing the user effort required for manual traversal and filtering of long matching document lists. In this paper we propose an alternative approach for compact and concise representation of search results, which we implemented in the BoW on-line bibliographical repository. The BoW repository is based on a hierarchical concept index to which entries are linked. The key idea is that searching in the hierarchical repository should take advantage of the repository structure and return matching topics from the hierarchy, rather than just a long list of entries. Likewise, when new entries are inserted, a search for relevant topics to which they should be linked is required. Therefore, a similar hierarchical scheme for query-topic matching can be applied for both tasks. However, our experiments show that different query types used for these tasks are best treated by different topic ranking functions. For example, keyword search, which is typically based on short (1-3 word) queries, requires a weight-based (rather than Boolean) ranking approach. The underlying rationale of weight-based ranking is that for a truly relevant topic all (or almost all) the query terms should appear in its vector representation and with approximately even high weights. Applying this reasoning to the topic ranking method is shown to significantly increase the precision and the F1 (by over 30%) for short keyword queries compared to the baseline Boolean ranking metric.
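A toy scoring function illustrating the weight-based rationale (not the BoW implementation; the combination of coverage, strength, and evenness below is our own simplification): a topic scores highly only when all or almost all query terms appear in its vector, with high and roughly even weights.

```python
# Toy illustration of weight-based topic ranking (not BoW's actual ranking function).

def weight_based_score(query_terms, topic_vector):
    """topic_vector maps term -> weight (e.g., tf-idf), assumed normalized to [0, 1]."""
    weights = [topic_vector.get(t, 0.0) for t in query_terms]
    coverage = sum(w > 0 for w in weights) / len(weights)     # fraction of query terms present
    if coverage == 0:
        return 0.0
    present = [w for w in weights if w > 0]
    strength = sum(present) / len(present)                    # how high the weights are
    evenness = min(present) / max(present)                    # how balanced they are
    return coverage * strength * evenness

topic = {"parallel": 0.9, "scheduling": 0.8, "cluster": 0.2}
print(weight_based_score(["parallel", "scheduling"], topic))  # high: all terms, even, high weights
print(weight_based_score(["parallel", "quantum"], topic))     # low: one query term is missing
```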
Commodity parallel computers are no longer a technology predicted for some indistinct future: they are becoming ubiquitous. In the absence of significant advances in clock speed, chip-multiprocessors (CMPs) and symmetric multithreading (SMT) are the modern workhorses that keep Moore's Law still relevant. On the software side, we are starting to observe the adaptation of some codes to the new commodity parallel hardware. While in the past, only complex professional codes ran on parallel computers, the commoditization of parallel computers is opening the door for many desktop applications to benefit from parallelization. We expect this software trend to continue, since the only apparent way of obtaining additional performance from the hardware will be through parallelization. Based on the premise that the average desktop workload is growing more parallel and complex, this paper asks the question: Are current desktop operating systems appropriate for these trends? Specifically, we are interested in parallel process scheduling, which has been a topic of significant study in the supercomputing community, but so far little of this research has trickled down to the desktop. In this paper, we demonstrate, using several case studies, that contemporary general-purpose operating systems are inadequate for the emerging parallel desktop workloads. We suggest that schedulers designed with an understanding of the requirements of all process classes and their mixes, as well as the abilities of the underlying architecture, might be the solution to this inadequacy.
Commodity hardware and software are growing increasingly more complex, with advances such as chip heterogeneity and specialization, deeper memory hierarchies, fine-grained power management, and most importantly, chip parallelism. Similarly, workloads are growing more concurrent and diverse. With this new complexity in hardware and software, process scheduling in the operating system (OS) becomes more challenging. Nevertheless, most commodity OS schedulers are based on design principles that are 30 years old. This disparity may soon lead to significant performance degradation. Most significantly, parallel architectures such as multicore chips require more than scalable OSs: parallel programs require parallel-aware scheduling. This paper posits that imminent changes in hardware and software warrant reevaluating the scheduler's policies in the commodity OS. We discuss and demonstrate the main issues that the emerging parallel desktops are raising for the OS scheduler. We propose that a new approach to scheduling is required, applying and generalizing lessons from different domain-specific scheduling algorithms, and in particular, parallel job scheduling. Future architectures can also assist the OS by providing better information on process scheduling requirements.
The growing convergence of high-performance, data analytics, and machine learning applications is increasingly pushing computing systems toward heterogeneous processors and specialized hardware accelerators. Hardware heterogeneity, in turn, leads to finer-grained workflows. State-of-the-art serverless computing resource managers do not currently provide efficient scheduling of such fine-grained tasks on systems with heterogeneous CPUs and specialized hardware accelerators (e.g., GPUs and FPGAs). Working with fine-grained tasks presents an opportunity for more efficient energy use via new scheduling models. Our proposed scheduler enables technologies like Nvidia’s Multi-Process Service (MPS) to pack multiple fine-grained tasks on GPUs efficiently. Its advantages include better co-location of jobs and better sharing of hardware resources such as GPUs that were not previously possible on container orchestration systems. We propose a Kubernetes-native energy-aware scheduler that integrates with our heterogeneous framework. Combining fine-grained resource scheduling on heterogeneous hardware and energy-aware scheduling results in up to 17.6% improvement in makespan, up to 20.16% reduction in energy consumption for CPU workloads, and up to 58.15% improvement in makespan, and up to 28.92% reduction in energy consumption for GPU workloads.
Jobs that run on parallel systems that use gang scheduling for multiprogramming may interact with each other in various ways. These interactions are affected by system parameters such as the level of multiprogramming and the scheduling time quantum. A careful evaluation is therefore required in order to find parameter values that lead to optimal performance. We perform a detailed performance evaluation of three factors affecting scheduling systems running dynamic workloads: multiprogramming level, time quantum, and the use of backfilling for queue management --- and how they depend on offered load. Our evaluation is based on synthetic MPI applications running on a real cluster that actually implements the various scheduling schemes. Our results demonstrate the importance of both components of the gang-scheduling plus backfilling combination: gang scheduling reduces response time and slowdown, and backfilling allows doing so with a limited multiprogramming level. This is further improved by using flexible coscheduling rather than strict gang scheduling, as this reduces the constraints and allows for a denser packing.
Fine-grained parallel applications require all their processes to run simultaneously on distinct processors to achieve good efficiency. This is typically accomplished by space slicing, wherein nodes are dedicated for the duration of the run, or by gang scheduling, wherein time slicing is coordinated across processors. Both schemes suffer from fragmentation, where processors are left idle because jobs cannot be packed with perfect efficiency. Obviously, this leads to reduced utilization and sub-optimal performance. Flexible coscheduling (FCS) solves this problem by monitoring each job's granularity and communication activity, and using gang scheduling only for those jobs that require it. Processes from other jobs, which can be scheduled without any constraints, are used as filler to reduce fragmentation. In addition, inefficiencies due to load imbalance and hardware heterogeneity are also reduced because the classification is done on a per-process basis. FCS has been fully implemented as part of the STORM resource manager, and shown to be competitive with gang scheduling and implicit coscheduling.
In this paper, we explore the performance of gang scheduling on a cluster using the Quadrics interconnection network. In such a cluster, the scheduler can take advantage of this network's unique capabilities, including a network interface card-based processor and memory and efficient user-level communication libraries. We developed a micro-benchmark to test the scheduler's performance under various aspects of parallel job workloads: memory usage, bandwidth and latency-bound communication, number of processes, timeslice quantum, and multiprogramming levels. Our experiments show that the gang scheduler performs relatively well under most workload conditions, is largely insensitive to the number of concurrent jobs in the system and scales almost linearly with number of nodes. On the other hand, the scheduler is very sensitive to the timeslice quantum, and values under 30 seconds can incur large overheads and fairness problems.
Web datacenters and clusters can be larger than the world's largest supercomputers, and run workloads that are at least as heterogeneous and complex as their high-performance computing counterparts. And yet little is known about the unique job scheduling challenges of these environments. This article aims to ameliorate this situation. It discusses the challenges of running web infrastructure and describes several techniques to address them. It also presents some of the problems that remain open in the field.
The workshop on job scheduling strategies for parallel processing (JSSPP) studies the myriad aspects of managing resources on parallel and distributed computers. These studies typically focus on large-scale computing environments, where allocation and management of computing resources present numerous challenges. Traditionally, such systems consisted of massively parallel supercomputers, or more recently, large clusters of commodity processor nodes. These systems are characterized by architectures that are largely homogeneous and workloads that are dominated by both computation and communication-intensive applications. Indeed, the large majority of the articles in the first ten JSSPP workshops dealt with such systems and addressed issues such as queuing systems and supercomputer workloads. In this paper, we discuss some of the recent developments in parallel computing technologies that depart from this traditional domain of problems. In particular, we identify several recent and influential technologies that could have a significant impact on the future of research on parallel scheduling. We discuss some of the more specific research challenges that these technologies introduce to the JSSPP community, and propose to enhance the scope of future JSSPP workshops to include these topics.
There are many choices to make when evaluating the performance of a complex system. In the context of parallel job scheduling, one must decide what workload to use and what measurements to take. These decisions sometimes have subtle implications that are easy to overlook. In this paper we document numerous pitfalls one may fall into, with the hope of providing at least some help in avoiding them. Along the way, we also identify topics that could benefit from additional research. Keywords: parallel job scheduling, performance evaluation, experimental methodology, dynamic workload, static workload, simulation
Parallel and distributed processing are no longer the exclusive realm of supercomputers. The growing prevalence of systems with multiple processing units brings parallel hardware to commodity computers. Parallel hardware cannot be fully utilized unless running parallel software, which in turn depends on the operating system's ability to support various, often conflicting scheduling requirements. This proposal describes research toward achieving a fully flexible autonomous operating system (OS) that seamlessly supports the entire range of current and future applications: serial, multimedia, interactive, distributed, and parallel.
Performance variability in complex computer systems is a major challenge for accurate benchmarking and performance characterization, especially for tightly coupled large-scale high-performance computing systems. Point summaries of performance may be both uninformative, if they do not capture the full richness of its behavior, and inaccurate, if they are derived from an inadequate sample set of measurements. Determining the correct sample set—and in particular, its size—requires balancing trade-offs of computation, methodology, and statistical power. In this paper, we treat the performance distribution as the primary target of the performance evaluation, from which all other metrics can be derived. We propose a meta-heuristic that characterizes the performance distribution as it is being measured, dynamically determining when enough samples have been collected to approximate the true distribution. Compared to predetermined fixed stopping criteria, this dynamic and adaptive method can be more efficient in resource use, since it can stop as early as the desired certainty level is obtained, and more accurate, since it does not stop prematurely. Importantly, it requires no advance knowledge or assumptions about the system under test or its performance characteristics. We evaluate a prototype of our proposal using a mix of synthetic and real benchmarks. For synthetic distributions, this approach closely matches the true distribution. For actual benchmarks, the heuristic is overly conservative for some applications and overly lax for others, especially those using GPUs. But it still matches the overall shape of the distribution for benchmarks with very diverse distributions, which suggests that it is a viable approach for an adaptive stopping rule.
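A minimal sketch of an adaptive stopping rule in this spirit (the paper's meta-heuristic differs in its convergence test and parameters): sample in batches and stop once the two halves of the collected data agree, as measured by the two-sample Kolmogorov-Smirnov distance. The batch size and tolerance below are placeholder assumptions.

```python
# Minimal sketch of an adaptive stopping rule: stop when the empirical
# distribution stabilizes (first half vs. second half of the samples).
import bisect
import random

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic between samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect.bisect_right(a, x) / len(a) -
                   bisect.bisect_right(b, x) / len(b)) for x in a + b)

def measure_until_stable(run_benchmark, batch=50, tolerance=0.05, max_samples=5000):
    samples = [run_benchmark() for _ in range(2 * batch)]
    while len(samples) < max_samples:
        half = len(samples) // 2
        if ks_distance(samples[:half], samples[half:]) < tolerance:
            break                                    # distribution has stopped moving
        samples += [run_benchmark() for _ in range(batch)]
    return samples

# Synthetic bimodal "benchmark" standing in for a real run-time measurement:
random.seed(1)
bimodal = lambda: random.gauss(1.0, 0.05) if random.random() < 0.8 else random.gauss(1.5, 0.1)
result = measure_until_stable(bimodal)
print(f"stopped after {len(result)} samples")
```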
The successful development and deployment of large-scale Internet services depends critically on performance. Even small changes in processing time, bandwidth, and memory usage can translate directly into large financial and user experience costs. Despite the widespread use of traffic-based benchmarks, there is little research on how they should be run in order to obtain valid and precise inferences with minimal data collection costs. Correctly A/B testing Internet services can be surprisingly difficult because interdependencies between user requests (e.g., for search results, social media streams, ads) and hosts can lead to failures in estimating the significance and magnitude of performance differences. We develop multilevel models of Internet service performance that take into account dependence due to user requests and hosts, and use them to design benchmarking routines that maximize precision subject to time and resource constraints. This design is then validated experimentally on a production system that is used to vet thousands of changes every day.
Key-value stores are a vital component in many scale-out enterprises, including social networks, online retail, and risk analysis. Accordingly, they are receiving increased attention from the research community in an effort to improve their performance, scalability, reliability, cost, and power consumption. To be effective, such efforts require a detailed understanding of realistic key-value workloads. And yet little is known about these workloads outside of the companies that operate them. This paper aims to address this gap. To this end, we have collected detailed traces from Facebook's Memcached deployment, arguably the world's largest. The traces capture over 284 billion requests from five different Memcached use cases over several days. We analyze the workloads from multiple angles, including: request composition, size, and rate; cache efficacy; temporal patterns; and application use cases. We also propose a simple model of the most representative trace to enable the generation of more realistic synthetic workloads by the community. Our analysis details many characteristics of the caching workload. It also reveals a number of surprises: a GET/SET ratio of 30:1 that is higher than assumed in the literature; some applications of Memcached behave more like persistent storage than a cache; strong locality metrics, such as keys accessed many millions of times a day, do not always suffice for a high hit rate; and there is still room for efficiency and hit rate improvements in Memcached's implementation. Toward the last point, we make several suggestions that address the exposed deficiencies.
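A toy workload generator seeded only by the abstract's headline GET/SET ratio; the Zipf-like key popularity and the value-size range below are placeholder assumptions, not the models fitted from the traces.

```python
# Toy synthetic Memcached-style workload generator (placeholder parameters).
import random

def generate_requests(n, num_keys=100_000, get_set_ratio=30, zipf_s=1.0):
    # Zipf-like popularity: probability of key rank r proportional to 1/r^s (assumption).
    weights = [1.0 / (r ** zipf_s) for r in range(1, num_keys + 1)]
    keys = random.choices(range(num_keys), weights=weights, k=n)
    for key in keys:
        if random.random() < 1.0 / (get_set_ratio + 1):
            yield ("SET", key, random.randint(16, 1024))   # value size in bytes (assumption)
        else:
            yield ("GET", key, None)

random.seed(42)
ops = [op for op, _, _ in generate_requests(10_000, num_keys=10_000)]
print("GET/SET ratio ~", ops.count("GET") / max(1, ops.count("SET")))
```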
Key-value (KV) stores have become a critical infrastructure component supporting various services in the cloud. Long considered an application that is memory-bound and network-bound, recent KV-store implementations on multicore servers grow increasingly CPU-bound instead. This limitation often leads to under-utilization of available bandwidth and poor energy efficiency, as well as long response times under heavy load. To address these issues, we present Hippos, a high-throughput, low-latency, and energy-efficient key-value store implementation. Hippos moves the KV store into the operating system's kernel and thus removes most of the overhead associated with the network stack and system calls. Hippos uses the Netfilter framework to quickly handle UDP packets, removing the overhead of UDP-based GET requests almost entirely. Combined with lock-free multithreaded data access, Hippos removes several performance bottlenecks both internal and external to the KV-store application. We prototyped Hippos as a Linux loadable kernel module and evaluated it against the ubiquitous Memcached using various micro-benchmarks and workloads from Facebook's production systems. The experiments show that Hippos provides some 20--200% throughput improvements on a 1Gbps network (up to 590% improvement on a 10Gbps network) and 5--20% saving of power compared with Memcached.
Demand for increasingly-higher computing capability is driving a similar growth in compute cluster sizes, soon to be reaching tens of thousands of processors. This growth is not matched however by system software, which has remained largely unchanged since the advent of clusters. The failure of system software to scale and develop at the same rate as the underlying hardware constrains the productivity of these machines by severely limiting their utilization, reliability, and responsiveness. The traditional approach to system software, namely, the use of loosely-coupled independent daemons on each node, is inadequate for the management of large-scale clusters, a problem which is inherently tightly-coupled and requires a high degree of synchronization. One model for large-scale system software is Buffered Coscheduling (BCS), wherein synchronization and scalability are obtained by means of global scheduling of all system activities and collective network operations. BCS represents a new methodology for the design of system software as a single, parallel program using traditional parallel constructs. As such, system software can be made orders of magnitude more scalable, simple, and easy to debug than the existing distributed solutions. The most important aspect of the BCS model and the overlying system software is the buffering and scheduling of all communication, resulting in highly controllable and deterministic system behavior. This chapter describes in detail the implementation of BCS-MPI, an MPI library designed after this model, and shows that the benefits of determinism need not come at a significant performance cost. Furthermore, BCS-MPI comes with a sophisticated monitoring and debugging subsystem that simplifies the analysis of system and application performance, and is covered in detail in this chapter. Keywords: Cluster computing, system software, buffered coscheduling, MPI, communication protocol, parallel monitoring and debugging, Quadrics, QsNet.
Buffered CoScheduled (BCS) MPI is a novel implementation of MPI based on global synchronization of all system activities. BCS-MPI imposes a model where all processes and their communication are tightly scheduled at a very fine granularity. Thus, BCS-MPI provides a system that is much more controllable and deterministic. BCS-MPI leverages this regular behavior to provide a simple yet powerful monitoring and debugging subsystem that streamlines the analysis of parallel software. This subsystem, called Monitoring and Debugging System (MDS), provides exhaustive process and communication scheduling statistics. This paper covers in detail the design and implementation of the MDS subsystem, and demonstrates how the MDS can be used to monitor and debug not only parallel MPI applications but also the BCS-MPI runtime system itself. Additionally, we show that this functionality need not come at a significant performance loss.
Buffered CoScheduled MPI (BCS-MPI) introduces a new approach to designing the communication layer for large-scale parallel machines. The emphasis of BCS-MPI is on the global coordination of a large number of communicating processes rather than on the traditional optimization of the point-to-point performance. BCS-MPI delays interprocessor communication in order to schedule the communication pattern globally, and is designed on top of a minimal set of collective communication primitives. In this paper we describe a prototype implementation of BCS-MPI and its communication protocols. Several experimental results, executed on a set of scientific applications, show that BCS-MPI can compete with a production-level MPI implementation, but is much simpler to implement, debug and model.
In the near future, large-scale parallel computers will feature hundreds of thousands of processing nodes. In such systems, fault tolerance is critical as failures will occur very often. Checkpointing and rollback recovery has been extensively studied as an attempt to provide fault tolerance. However, current implementations do not provide the total transparency and full flexibility that are necessary to support the new paradigm of autonomic computing -- systems able to self-heal and self-repair. In this paper we provide an in-depth evaluation of incremental checkpointing for scientific computing. The experimental results, obtained on a state-of-the-art cluster running several scientific applications, show that efficient, scalable, automatic and user-transparent incremental checkpointing is within reach with current technology.
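A user-level sketch of the basic mechanism follows (the system evaluated in the paper tracks dirty pages transparently below the application; here, purely for illustration, dirty "pages" are detected by hashing in user space).

```python
# Sketch of incremental checkpointing: only pages that changed since the last
# checkpoint are written. Hash-based dirty detection is an illustrative stand-in
# for transparent, kernel-level dirty-page tracking.
import hashlib

PAGE = 4096

def pages(data):
    return [data[i:i + PAGE] for i in range(0, len(data), PAGE)]

def incremental_checkpoint(data, previous_hashes):
    """Return (dirty_pages, new_hashes); only dirty pages need to be written to stable storage."""
    dirty, hashes = {}, []
    for idx, page in enumerate(pages(data)):
        digest = hashlib.sha256(page).digest()
        hashes.append(digest)
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            dirty[idx] = page
    return dirty, hashes

memory = bytearray(64 * PAGE)                            # stand-in for application state
_, hashes = incremental_checkpoint(memory, [])           # first checkpoint writes every page
memory[5 * PAGE] = 1                                     # application dirties one page
memory[40 * PAGE + 7] = 9                                # and another
dirty, hashes = incremental_checkpoint(memory, hashes)
print(f"pages written in incremental checkpoint: {sorted(dirty)}")   # [5, 40]
```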
With the slowing of Moore’s law and decline of Dennard scaling, computing systems increasingly rely on specialized hardware accelerators in addition to general-purpose compute units. Increased hardware heterogeneity necessitates disaggregating applications into workflows of fine-grained tasks that run on a diverse set of CPUs and accelerators. Current accelerator delivery models cannot support such applications efficiently, as (1) the overhead of managing accelerators erases performance benefits for fine-grained tasks; (2) exclusive accelerator use per task leads to underutilization; and (3) specialization increases complexity for developers. We propose adopting concepts from Function-as-a-Service (FaaS), which has solved these challenges for general-purpose CPUs in cloud computing. Kernel-as-a-Service (KaaS) is a novel serverless programming model for generic compute accelerators that aids heterogeneous workflows by combining the ease-of-use of higher-level abstractions with the performance of low-level hand-tuned code. We evaluate KaaS with a focus on the breadth of the idea and its generality to diverse architectures rather than on an in-depth implementation for a single accelerator. Using proof-of-concept prototypes, we show that this programming model provides performance, performance efficiency, and ease-of-use benefits across a diverse range of compute accelerators. Despite increased levels of abstraction, when compared to a naive accelerator implementation, KaaS reduces completion times for fine-grained tasks by up to 96.0% (GPU), 68.4% (FPGA), 98.6% (TPU), and 34.9% (QPU) in our experiments.
Although clusters are a popular form of high-performance computing (HPC), they remain more difficult to manage than sequential systems, or even symmetric multiprocessors. Furthermore, as cluster sizes increase, resource management---essentially, everything that runs on a cluster other than the applications---becomes an increasingly large impediment to application efficiency. In this talk we present STORM, a resource-management framework designed for scalability and performance. The key innovation behind STORM is a software architecture that enables resource management to exploit low-level network features. As a result of this HPC-application-like design, STORM is orders of magnitude faster than the best reported results in the literature on two sample resource-management functions: job launching and process scheduling. Further, we identify a small set of network primitives that is sufficient for a scalable implementation of a resource manager if implemented itself in a scalable manner.
Clusters of workstations have emerged as an important platform for building cost-effective, scalable, and highly-available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource management tool designed to provide scalability, low overhead, and the flexibility necessary to efficiently support and analyze a wide range of job-scheduling algorithms. STORM achieves these feats by using a small set of primitive mechanisms that are common in modern high-performance interconnects. The architecture of STORM is based on three main technical innovations. First, a part of the scheduler runs in the thread processor located on the network interface. Second, we use hardware collectives that are highly scalable both for implementing control heartbeats and to distribute the binary of a parallel job in near-constant time. Third, we use an I/O bypass protocol that allows fast data movements from the file system to the communication buffers in the network interface and vice versa. The experimental results show that STORM can launch a job with a binary of 12MB on a 64-processor, 32-node cluster in less than 250ms. This paper provides experimental and analytical evidence that these results scale to a much larger number of nodes. To the best of our knowledge, STORM significantly outperforms existing production schedulers in launching jobs, performing resource management tasks, and gang-scheduling tasks. Keywords: Cluster Computing, Resource Management, Job Scheduling, Gang Scheduling, Parallel Architectures, Quadrics Interconnect, I/O bypass
Scalable management of distributed resources is one of the major challenges in deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel applications: parallel I/O, deterministic behavior, and responsiveness. These requirements are daunting with commodity hardware and operating systems since they were not designed to support a global, single management view of a large-scale system. In this paper we propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, which can be thought of as a coarse-grain SIMD operating system, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of the single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms. Keywords: Cluster computing, cluster operating system, fault tolerance, network hardware, debuggability, resource management.
Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, soon to be reaching tens of thousands of processors. Many functionalities of system software have failed to scale accordingly---systems are becoming more complex, less reliable, and less efficient. Our premise is that these deficiencies arise from a lack of global control and coordination of the processing nodes. In practice, current parallel machines are loosely-coupled systems that are used for solving inherently tightly-coupled problems. This paper demonstrates that existing and future systems can be made more scalable by using BSP-like parallel programming principles in the design and implementation of the system software, and by taking full advantage of the latest interconnection network hardware. Moreover, we show that this approach can also yield great improvements in efficiency, reliability, and simplicity.
Women are acutely underrepresented in the HPC workforce. Addressing this gap requires accurate metrics on the representation of women and its associated factors. The goal of this paper is to provide current, broad, and reproducible data on this gender gap. Specifically, this study provides in-depth statistics on women’s representation in HPC conferences, especially for authors of peer-reviewed papers, who serve as the keystone for future advances in the field. To this end, we analyzed participant data from nine HPC and HPC-related peer-reviewed conferences. In addition to gender distributions, we looked at post-publication citation statistics of the papers and authors’ research experience, country, and work sector. Our main finding is that women represent only 10% of all HPC authors, with large geographical variations and small variations by sector. Representation is particularly low at higher experience levels. This 10% ratio is lower than even the 20–30% ratio in all computer science.
Predicting the future is hard and risky. Predicting the future in the computer industry is even harder and riskier due to dramatic changes in technology and limitless challenges to innovation. Only a small fraction of innovations truly disrupt the state of the art. Some are not practical or cost-effective, some are ahead of their time, and some simply do not have a market. There are numerous examples of superior technologies that were never adopted because others arrived on time or fared better in the market. Therefore, this document is only an attempt to better understand where technologies are going. The book The Innovator's Dilemma and its sequels best describe the process of innovation and disruption. Nine technical leaders of the IEEE Computer Society joined forces to write a technical report, entitled IEEE CS 2022, symbolically surveying 23 potential technologies that could change the landscape of computer science and industry by the year 2022. In particular, this report focuses on 3D printing, big data and analytics, open intellectual property movement, massively online open courses, security cross-cutting issues, universal memory, 3D integrated circuits, photonics, cloud computing, computational biology and bioinformatics, device and nanotechnology, sustainability, high-performance computing, the Internet of Things, life sciences, machine learning and intelligent systems, natural user interfaces, networking and inter-connectivity, quantum computing, software-defined networks, multicore, and robotics for medical care.