[hpc-announce] JCST--Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics

Wed Apr 8 14:02:54 CDT 2020

Dear All,

With the explosive growth of colossal data from various academic and industrial sectors, many High-Performance Computing (HPC) and data analytics systems have been developed to meet the needs of data collection, processing and analysis. Accordingly, many research groups around the world have explored unconventional and cutting-edge ideas for the management of storage and I/O.

To give the I/O research community a global picture of the current state of the art and its vibrant progress, the Journal of Computer Science and Technology (JCST, http://jcst.ict.ac.cn <http://jcst.ict.ac.cn/>) invited Prof. Xian-He Sun of Illinois Institute of Technology and Prof. Weikuan Yu of Florida State University to organize the Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics, which consists of the following eight high-quality papers from China, Europe, Japan, and the United States.

Due to COVID-19, we have made this special issue free to access. We hope that many readers and users find this special section interesting and useful for their respective needs and endeavors. Many thanks to the authors for their contributions and to all the reviewers for their valuable time and effort.

Thank you.

Journal of Computer Science and Technology
05 January 2020, Volume 35 Issue 1
Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics

Preface <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-0001-9> 
Xian-He Sun, Weikuan Yu
Journal of Computer Science and Technology, 2020, 35 (1): 1-3.  DOI: 10.1007/s11390-020-0001-9 <http://dx.doi.org/10.1007/s11390-020-0001-9>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-1-preface.pdf>

Ad Hoc File Systems for High-Performance Computing <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9801-1> 
André Brinkmann, Kathryn Mohror, Weikuan Yu, Philip Carns, Toni Cortes, Scott A. Klasky, Alberto Miranda, Franz-Josef Pfreundt, Robert B. Ross, Marc-André Vef
Journal of Computer Science and Technology, 2020, 35 (1): 4-26.  DOI: 10.1007/s11390-020-9801-1 <http://dx.doi.org/10.1007/s11390-020-9801-1>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-2-9801.pdf>    Highlights <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/9801-Highlights.pdf>    Chinese Summary <http://jcst.ict.ac.cn/CN/Y2020/V35/I1/4>
Abstract: Storage backends of parallel compute clusters are still based mostly on magnetic disks, while newer and faster storage technologies such as flash-based SSDs or non-volatile random access memory (NVRAM) are deployed within compute nodes. Including these new storage technologies into scientific workflows is unfortunately today a mostly manual task, and most scientists therefore do not take advantage of the faster storage media. One approach to systematically include node-local SSDs or NVRAMs into scientific workflows is to deploy ad hoc file systems over a set of compute nodes, which serve as temporary storage systems for single applications or longer-running campaigns. This paper presents results from the Dagstuhl Seminar 17202 "Challenges and Opportunities of User-Level File Systems for HPC" and discusses application scenarios as well as design strategies for ad hoc file systems using node-local storage media. The discussion includes open research questions, such as how to couple ad hoc file systems with the batch scheduling environment and how to schedule stage-in and stage-out processes of data between the storage backend and the ad hoc file systems. Also presented are strategies to build ad hoc file systems by using reusable components for networking and how to improve storage device compatibility. Various interfaces and semantics are presented, for example those used by the three ad hoc file systems BeeOND, GekkoFS, and BurstFS. Their presentation covers a range from file systems running in production to cutting-edge research focusing on reaching the performance limits of the underlying devices.

Design and Implementation of the Tianhe-2 Data Storage and Management System <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9799-4> 
Yu-Tong Lu, Peng Cheng, Zhi-Guang Chen
Journal of Computer Science and Technology, 2020, 35 (1): 27-46.  DOI: 10.1007/s11390-020-9799-4 <http://dx.doi.org/10.1007/s11390-020-9799-4>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-3-9799.pdf>    Highlights <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/9799-Highlights.pdf>    Chinese Summary <http://jcst.ict.ac.cn/CN/Y2020/V35/I1/27>
Abstract: With the convergence of high-performance computing (HPC), big data and artificial intelligence (AI), the HPC community is pushing for "triple use" systems to expedite scientific discoveries. However, supporting these converged applications on HPC systems presents formidable challenges in terms of storage and data management due to the explosive growth of scientific data and the fundamental differences in I/O characteristics among HPC, big data and AI workloads. In this paper, we discuss the driving force behind the converging trend, highlight three data management challenges, and summarize our efforts in addressing these data management challenges on a typical HPC system at the parallel file system, data management middleware, and user application levels. As HPC systems are approaching the border of exascale computing, this paper sheds light on how to enable application-driven data management as a preliminary step toward the deep convergence of exascale computing ecosystems, big data, and AI.

Lessons Learned from Optimizing the Sunway Storage System for Higher Application I/O Performance <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9798-5> 
Qi Chen, Kang Chen, Zuo-Ning Chen, Wei Xue, Xu Ji, Bin Yang
Journal of Computer Science and Technology, 2020, 35 (1): 47-60.  DOI: 10.1007/s11390-020-9798-5 <http://dx.doi.org/10.1007/s11390-020-9798-5>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-4-9798.pdf>    Highlights <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/9798-Highlights.pdf>    Chinese Summary <http://jcst.ict.ac.cn/CN/Y2020/V35/I1/47>
Abstract: It is hard for applications to make full utilization of the peak bandwidth of the storage system in high-performance computers because of I/O interferences, storage resource misallocations and complex long I/O paths. We performed several studies to bridge this gap in the Sunway storage system, which serves the supercomputer Sunway TaihuLight. To locate these issues and connections between them, an end-to-end performance monitoring and diagnosis tool was developed to understand I/O behaviors of applications and the system. With the help of the tool, we were able to find out the root causes of such performance barriers at the I/O forwarding layer and the parallel file system layer. An application-aware I/O forwarding allocation framework was used to address the I/O interferences and resource misallocations at the I/O forwarding layer. A performance-aware data placement mechanism was proposed to mitigate the impact of I/O interferences and performance variations of storage devices in the PFS. Together, applications obtained much better I/O performance. During the process, we also proposed a lightweight storage stack to shorten the I/O path of applications with N-N I/O pattern. This paper summarizes these studies and presents the lessons learned from the process.

Gfarm/BB—Gfarm File System for Node-Local Burst Buffer <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9803-z> 
Osamu Tatebe, Shukuko Moriwake, Yoshihiro Oyama
Journal of Computer Science and Technology, 2020, 35 (1): 61-71.  DOI: 10.1007/s11390-020-9803-z <http://dx.doi.org/10.1007/s11390-020-9803-z>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-5-9803.pdf>    Highlights <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/9803-Highlights.pdf>    Chinese Summary <http://jcst.ict.ac.cn/CN/Y2020/V35/I1/61>
Abstract: Burst buffer has become a major component to meet the I/O performance requirement of HPC bursty traffic. This paper proposes Gfarm/BB that is a file system for a burst buffer efficiently exploiting node-local storage systems. Although node-local storages improve storage performance, they are only available during the job allocation. Gfarm/BB should have better access and metadata performance while it should be constructed on-demand before the job execution. To improve the read and write performance, it exploits the file descriptor passing and remote direct memory access (RDMA). It improves the metadata performance by omitting the persistency and the redundancy since it is a temporal file system. Using RDMA, writes and reads bandwidth are improved by 1.7x and 2.2x compared with IP over InfiniBand (IPoIB), respectively. It achieves 14 700 operations per second in the directory creation performance, which is 13.4x faster than the fully persistent and redundant case. The construction of Gfarm/BB takes 0.31 seconds using 2 nodes. IOR benchmark and ARGOT-IO application I/O benchmark show the scalable performance improvement by exploiting the locality of node-local storages. Compared with BeeOND, Gfarm/BB shows 2.6x and 2.4x better performance in IOR write and read benchmarks, respectively, and it shows 2.5x better performance in ARGOT-IO.

GekkoFS—A Temporary Burst Buffer File System for HPC Applications <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9797-6> 
Marc-André Vef, Nafiseh Moti, Tim Süß, Markus Tacke, Tommaso Tocci, Ramon Nou, Alberto Miranda, Toni Cortes, André Brinkmann
Journal of Computer Science and Technology, 2020, 35 (1): 72-91.  DOI: 10.1007/s11390-020-9797-6 <http://dx.doi.org/10.1007/s11390-020-9797-6>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-6-9797.pdf>    Highlights <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/9797-Highlights.pdf>    Chinese Summary <http://jcst.ict.ac.cn/CN/Y2020/V35/I1/72>
Abstract: Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data while storage systems in today's HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, or randomized file I/O, while general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters and offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly-scalable file system which has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics which only offers features which are actually required by most (not all) applications. GekkoFS is, therefore, able to provide scalable I/O performance and reaches millions of metadata operations already for a small number of nodes, significantly outperforming the capabilities of common parallel file systems.

I/O Acceleration via Multi-Tiered Data Buffering and Prefetching <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9781-1> 
Anthony Kougkas, Hariharan Devarajan, Xian-He Sun
Journal of Computer Science and Technology, 2020, 35 (1): 92-120.  DOI: 10.1007/s11390-020-9781-1 <http://dx.doi.org/10.1007/s11390-020-9781-1>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-7-9781.pdf>    Highlights <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/9781-Highlights.pdf>    Chinese Summary <http://jcst.ict.ac.cn/CN/Y2020/V35/I1/92>
Abstract: Modern High-Performance Computing (HPC) systems are adding extra layers to the memory and storage hierarchy, named deep memory and storage hierarchy (DMSH), to increase I/O performance. New hardware technologies, such as NVMe and SSD, have been introduced in burst buffer installations to reduce the pressure for external storage and boost the burstiness of modern I/O systems. The DMSH has demonstrated its strength and potential in practice. However, each layer of DMSH is an independent heterogeneous system and data movement among more layers is significantly more complex even without considering heterogeneity. How to efficiently utilize the DMSH is a subject of research facing the HPC community. Further, accessing data with a high-throughput and low-latency is more imperative than ever. Data prefetching is a well-known technique for hiding read latency by requesting data before it is needed to move it from a high-latency medium (e.g., disk) to a low-latency one (e.g., main memory). However, existing solutions do not consider the new deep memory and storage hierarchy and also suffer from under-utilization of prefetching resources and unnecessary evictions. Additionally, existing approaches implement a client-pull model where understanding the application's I/O behavior drives prefetching decisions. Moving towards exascale, where machines run multiple applications concurrently by accessing files in a workflow, a more data-centric approach resolves challenges such as cache pollution and redundancy. In this paper, we present the design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense, extends I/O buffering to fully integrate into the DMSH. We introduce three novel data placement policies to efficiently utilize all layers and we present three novel techniques to perform memory, metadata, and communication management in hierarchical buffering systems. Additionally, we demonstrate the benefits of a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching. Our evaluation shows that, in addition to automatic data movement through the hierarchy, Hermes can significantly accelerate I/O and outperforms by more than 2x state-of-the-art buffering platforms. Lastly, results show 10%-35% performance gains over existing prefetchers and over 50% when compared to systems with no prefetching.

Mochi: Composing Data Services for High-Performance Computing Environments <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9802-0> 
Robert B. Ross, George Amvrosiadis, Philip Carns, Charles D. Cranor, Matthieu Dorier, Kevin Harms, Greg Ganger, Garth Gibson, Samuel K. Gutierrez, Robert Latham, Bob Robey, Dana Robinson, Bradley Settlemyer, Galen Shipman, Shane Snyder, Jerome Soumagne, Qing Zheng
Journal of Computer Science and Technology, 2020, 35 (1): 121-144.  DOI: 10.1007/s11390-020-9802-0 <http://dx.doi.org/10.1007/s11390-020-9802-0>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-8-9802.pdf>    Highlights <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/9802-Highlights.pdf>    Chinese Summary <http://jcst.ict.ac.cn/CN/Y2020/V35/I1/121>
Abstract: Technology enhancements and the growing breadth of application workflows running on high-performance computing (HPC) platforms drive the development of new data services that provide high performance on these new platforms, provide capable and productive interfaces and abstractions for a variety of applications, and are readily adapted when new technologies are deployed. The Mochi framework enables composition of specialized distributed data services from a collection of connectable modules and subservices. Rather than forcing all applications to use a one-size-fits-all data staging and I/O software configuration, Mochi allows each application to use a data service specialized to its needs and access patterns. This paper introduces the Mochi framework and methodology. The Mochi core components and microservices are described. Examples of the application of the Mochi methodology to the development of four specialized services are detailed. Finally, a performance evaluation of a Mochi core component, a Mochi microservice, and a composed service providing an object model is performed. The paper concludes by positioning Mochi relative to related work in the HPC space and indicating directions for future work.

ExaHDF5: Delivering Efficient Parallel I/O on Exascale Computing Systems <http://jcst.ict.ac.cn/EN/10.1007/s11390-020-9822-9> 
Suren Byna, M. Scot Breitenfeld, Bin Dong, Quincey Koziol, Elena Pourmal, Dana Robinson, Jerome Soumagne, Houjun Tang, Venkatram Vishwanath, Richard Warren
Journal of Computer Science and Technology, 2020, 35 (1): 145-160.  DOI: 10.1007/s11390-020-9822-9 <http://dx.doi.org/10.1007/s11390-020-9822-9>
PDF <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/2020-1-9-9822.pdf>    Highlights <http://jcst.ict.ac.cn/fileup/1000-9000/PDF/9822-Highlights.pdf>    Chinese Summary <http://jcst.ict.ac.cn/CN/Y2020/V35/I1/145>
Abstract: Scientific applications at exascale generate and analyze massive amounts of data. A critical requirement of these applications is the capability to access and manage this data efficiently on exascale systems. Parallel I/O, the key technology that enables moving data between compute nodes and storage, faces monumental challenges from new applications, memory, and storage architectures considered in the designs of exascale systems. As the storage hierarchy is expanding to include node-local persistent memory, burst buffers, etc., as well as disk-based storage, data movement among these layers must be efficient. Parallel I/O libraries of the future should be capable of handling file sizes of many terabytes and beyond. In this paper, we describe new capabilities we have developed in Hierarchical Data Format version 5 (HDF5), the most popular parallel I/O library for scientific applications. HDF5 is one of the most used libraries at the leadership computing facilities for performing parallel I/O on existing HPC systems. The state-of-the-art features we describe include: Virtual Object Layer (VOL), Data Elevator, asynchronous I/O, full-featured single-writer and multiple-reader (Full SWMR), and parallel querying. In this paper, we introduce these features, their implementations, and the performance and feature benefits to applications and other libraries.

Best Regards,

Editorial Office
Journal of Computer Science and Technology
P.O.Box 2704, Beijing 100190
P.R.China
Tel: (8610)62610746; 62600340
Online Submission: https://mc03.manuscriptcentral.com/jcst <https://mc03.manuscriptcentral.com/jcst>
E-mail: jcst at ict.ac.cn
http://jcst.ict.ac.cn <http://jcst.ict.ac.cn/>
——————————————

--
Weikuan (Will) Yu, Computer Sci.
Florida State University
Office: +1 850-644-5442