Jackson will present this on 2022-06-13
JPEG is a common encoding format for digital images. Applications that process large numbers of images can be accelerated by decoding multiple images concurrently. We examine the suitability of a large array of in-memory processors (PIM) for achieving high decoding throughput. The main drawback of PIM processors is that they lack architectural features commonly found on CPUs, such as floating-point units, vector units, and hardware-managed caches. Despite these limitations, we demonstrate that it is feasible to build a JPEG decoder for PIM, and we evaluate its quality and potential speedup. We show that the quality of decoded images is sufficient for real applications, and that there is significant potential for accelerating image decoding in those applications. We share our experiences in building such a decoder and the challenges we faced along the way.
USENIX ATC '21
We evaluate a new processing-in-memory (PIM) architecture from UPMEM that was built and deployed in an off-the-shelf server. Systems designed to perform computing in or near memory have been proposed for decades to overcome the proverbial memory wall, yet most never made it past blueprints or simulations. When the hardware is actually built and integrated into a fully functioning system, it must address realistic constraints that may be overlooked in a simulation. Evaluating a real implementation can reveal valuable insights. Our experiments on five commonly used applications highlight the main strength of this architecture: computing capability and internal memory bandwidth scale with memory size. This property helps some applications defy the von Neumann bottleneck, while for others, architectural limitations stand in the way of reaching the hardware's potential. Our analysis explains why.
Network-connected accelerators (NCAs), such as programmable switches, ASICs, and FPGAs, can speed up data-analytics operations manyfold. So far, however, integrating NCAs into data-analytics systems has required manual effort: orchestrating their execution over the network and converting data to and from the formats that NCAs understand. Clearly, this manual approach is neither scalable nor sustainable. We present JumpGate, a system that simplifies the integration of existing NCA code into data-analytics systems such as Apache Spark or Presto. JumpGate places most of the integration code into the analytics system, where it needs to be written only once, leaving NCA programmers to write just a couple hundred lines of code to integrate a new NCA. JumpGate relies on the dataflow graphs that most analytics systems already use internally for query processing, and it takes care of invoking NCAs, performing the necessary format conversions, and orchestrating their execution via novel staged network pipelines.
Our implementation of JumpGate in Apache Spark made it possible, for the first time, to study the benefits and drawbacks of using NCAs across the entire range of queries in the TPC-DS benchmark. Since we lack hardware that can accelerate all analytics operations, we implemented NCAs in software. We report insights on how and when analytics workloads will benefit from NCAs to motivate future designs.
With the end of Dennard scaling and the foreseeable end of Moore's Law, specialized hardware has become the focus for continued scaling of application performance. Programmable accelerators such as smart memory, smart disks, and SmartNICs are now being integrated into our systems. Many accelerators can be programmed to process their data autonomously and require little or no intervention during normal operation. In this way, entire applications are offloaded, leaving the CPU with only the minimal responsibilities of initialization, coordination, and error handling. We claim that these responsibilities can also be handled by simple hardware other than the CPU, and that it is wasteful to use a CPU for these purposes. We explore the role and structure of the OS in a system that has no CPU and demonstrate that all necessary functionality can be moved to other hardware. We show that almost all of the pieces for such a system design are already available today. The responsibilities of the operating system must be split between self-managing devices and a system bus that handles privileged operations.
USENIX ;Login: Magazine
Optane™ storage-class memory (SCM) and new processing-in-memory (PIM) hardware are on the verge of becoming commercial products. We believe that combining PIM with SCM, that is, processing in storage-class memory, is the right way to tap into the potential of modern storage. But first, a little history to help make the problems obvious.
Outstanding New Research Direction Award Finalist
Storage and memory technologies are experiencing unprecedented transformation. Storage-class memory (SCM) delivers near-DRAM performance in non-volatile storage media and became commercially available in 2019. Unfortunately, software is not yet able to fully benefit from such high-performance storage. Processing-in-memory (PIM) aims to overcome the notorious memory wall; at the time of writing, PIM hardware is close to being commercially available. This paper takes the position that PIM will become an integral part of future storage-class memories, so that data processing can be performed in-storage, saving memory bandwidth and CPU cycles. Under that assumption, we identify typical data-processing tasks poised for in-storage processing, such as compression, encryption, and format conversion. We present evidence supporting our assumption and report feasibility experiments on new PIM hardware that show its potential.