Introduction

Cloud datacenters have become the most important IT infrastructure that people use every day. Building efficient future datacenters will require the collective efforts of the entire global community. To provide a platform that brings together the most important and forward-looking work in the area for stimulating and productive discussions, the 5th Workshop on Hot Topics on Data Centers (HotDC 2020) will be held in Beijing, China on January 18th, 2021.

HotDC 2020 consists of by-invitation-only presentations from top academic and industrial groups around the world. The topics cover a wide range of datacenter-related issues, including state-of-the-art technologies for server architecture, storage systems, datacenter networks, and resource management. In addition, HotDC 2020 features a special session of invited talks presenting recent research from the datacenter team at the Institute of Computing Technology, Chinese Academy of Sciences.
The HotDC workshop aims to provide a forum for cutting-edge datacenter research, where researchers and engineers can exchange ideas and engage in discussions with colleagues from around the world. Welcome to HotDC 2020!


Join the Workshop with ZOOM
Time: GMT+8 8:50am ~ 5:45pm, 18th Jan, 2021
Meeting id: 663 9026 3532
Passcode: 965582
Online streaming: https://live.bilibili.com/22391464



 


 

Organizing Committee

General Chair

Yungang Bao, Institute of Computing Technology, Chinese Academy of Sciences

TPC Chair

Biwei Xie, Institute of Computing Technology, Chinese Academy of Sciences
Wanling Gao, Institute of Computing Technology, Chinese Academy of Sciences

Publicity Committee

Zhiwei Lai, Institute of Computing Technology, Chinese Academy of Sciences
Zijian Wang, Institute of Computing Technology, Chinese Academy of Sciences
He Liu, Peking University

Organization Committee

Ke Zhang, Institute of Computing Technology, Chinese Academy of Sciences
Ke Liu, Institute of Computing Technology, Chinese Academy of Sciences
Sa Wang, Institute of Computing Technology, Chinese Academy of Sciences
Wenya Hu, Institute of Computing Technology, Chinese Academy of Sciences
Di Li, Institute of Computing Technology, Chinese Academy of Sciences
Yangyang Zhao, Institute of Computing Technology, Chinese Academy of Sciences
Dejun Jiang, Institute of Computing Technology, Chinese Academy of Sciences

 

Workshop Schedule

08:50 - 09:00 Opening remarks

Session A: System and Network (Chair: Ke Liu)

09:00 - 09:30

Topic: High Performance Graph Mining Systems
Speaker: Xuehai Qian, University of Southern California

09:30 - 10:00

Topic: Network protocols are dead, long live networking abstractions!
Speaker: Theophilus A. Benson, Brown University

10:00 - 10:30

Topic: Programmable In-network Security: A Vision for Network Security in the Next Generation
Speaker: Ang Chen, Rice University

10:30 - 10:40

Break

Session B: Heterogeneous Systems for DC (Chair: Ke Zhang)

10:40 - 11:10

Topic: Terminus – Disaggregate The Cloud with Hardware Virtualization
Speaker: Ran Shu, Microsoft Research Asia

11:10 - 11:40

Topic: Efficient Heterogeneous Computing for Interactive Applications in Datacenter
Speaker: Shuo Wang, Tsinghua University 

Session C: Systems for Graph and AI (Chair: Sa Wang)

14:00 - 14:30

Topic: Parallel Graph Processing Systems on Heterogeneous Architectures
Speaker: Bingsheng He, National University of Singapore

14:30 - 15:00

Topic: Efficient and Highly Reliable AI System Architecture Based on Hardware-Software Co-Design
Speaker: Jingwen Leng, Shanghai Jiao Tong University

15:00 - 15:30

Topic: Understanding and Optimizing Hierarchical Dataflow Scheduling for Scalable NN Accelerators
Speaker: Mingyu Gao, Tsinghua University 

15:30 - 15:40

Break

Ph.D. Student Session (Chair: Yangyang Zhao)

15:40 - 16:05

Topic: NfvInsight: A Framework for Automatically Deploying and Benchmarking VNF Chains
Speaker: Tianni Xu, ICT, CAS

16:05 - 16:30

Topic: LSP: Collective Cross-Page Prefetching for NVM
Speaker: Haiyang Pan, ICT, CAS

16:30 - 16:55

Topic: NUMA-Aware Thread Migration for High Performance NVMM File Systems
Speaker: Ying Wang, ICT, CAS

16:55 - 17:20

Topic: AIBench: AI Scenario, Training, and Inference Benchmarks across Datacenter, HPC, IoT and Edge
Speaker: Fei Tang, ICT, CAS

17:20 - 17:45

Topic: Toward Nearly-Zero-Error Sketching via Compressive Sensing
Speaker: Siyuan Sheng, ICT, CAS

 

 

Keynote Speakers


Keynote 1: Xuehai Qian, University of Southern California
Topic: High Performance Graph Mining Systems

Bio: Xuehai Qian is an assistant professor at the University of Southern California. His research interests include domain-specific systems and architectures for emerging applications such as machine learning and graph analytics, and, more recently, hardware security and quantum computing. He received his Ph.D. from UIUC. He is the recipient of the W.J. Poppelbaum Memorial Award at UIUC, the NSF CRII and CAREER Awards, and the inaugural ACSIC (American Chinese Scholar In Computing) Rising Star Award. He has been inducted into the "Hall of Fame" of ASPLOS and HPCA, and the Computer Architecture Aggregated Hall of Fame. For more details, please visit his research group at: http://alchem.usc.edu/.

Abstract: Graph mining, which finds all embeddings matching specific patterns, is a fundamental task in many applications. In this talk, I will present the first graph mining system that decomposes the target pattern into several subpatterns and then computes the count of each. The system addresses several key challenges, including: a partial-embedding-centric programming model supporting advanced graph mining applications; an accurate and efficient cost model based on approximate graph mining; an efficient search method to jointly determine the decomposition of all concrete patterns of an application; and a partial symmetry breaking technique to eliminate redundant enumeration. Our experiments show that the system is significantly faster than all existing state-of-the-art systems and provides a novel and viable path to scale to large patterns.

 


Keynote 2: Theophilus A. Benson, Brown University
Topic: Network protocols are dead, long live networking abstractions!

Bio: Theo is an assistant professor in the Department of Computer Science at Brown University. His group works on designing frameworks and algorithms for solving networking problems, speeding up the Internet, improving network reliability, and simplifying network management. He has won multiple awards, including best paper awards, the Applied Networking Research Prize, Yahoo!, Google, and Facebook Faculty Awards, and an NSF CAREER award.

Abstract: The ossification of the networking layer has long limited the evolution of networking services and applications. The emergence of programmable data planes and their inherent flexibility has enabled the broader community to revisit the network's role.  However, this flexibility is limited, and we lack sufficient primitives to harness and manage this flexibility effectively. 
In this talk, I will discuss challenges that arise when the network is extended to support rich distributed systems abstractions, i.e., in-network computing, and sketch out a broad set of primitives for enabling in-network computing effectively. I will also describe ongoing work to extend our abstractions to manage traditional accelerators, e.g., GPUs and FPGAs.


Keynote 3: Ang Chen, Rice University
Topic: Programmable In-network Security: A Vision for Network Security in the Next Generation

Bio: Ang Chen is an assistant professor in the Department of Computer Science at Rice University. His research interests span networking, security, and systems, with a particular focus on making networked systems more reliable, efficient, and secure. Ang loves life and hopes that you do, too!

Abstract: Network attacks are on the rise, and many of them can be traced to a common root cause: the Internet does not have security support in its architecture. Existing proposals either need to make intrusive changes to the Internet, or resort to bolt-on protection for each discovered attack. In the Poise (Programmable In-network Security) project, we are rethinking how to develop a secure foundation for the next-generation Internet. Poise leverages technological advances in emerging programmable networking hardware, and it takes a three-step approach. First, Poise transforms a programmable switch into a defense platform by developing a suite of defenses that reside in the switch. Next, Poise transforms a programmable network into a defense fleet by synchronizing distributed defenses across the network. Finally, Poise reasons about the in-network defenses to ensure that they are themselves secure, both individually and in composition.

 


Keynote 4: Ran Shu, Microsoft Research Asia
Topic: Terminus – Disaggregate The Cloud with Hardware Virtualization

Bio: Ran Shu is a Senior Researcher at Microsoft Research Asia. He received his Ph.D. from Tsinghua University in 2018. His research interests include datacenter networks and network hardware. Ran is currently focusing on designing and implementing next-generation datacenter network systems using hardware capabilities.

Abstract: From a customer perspective, an ideal public cloud provides private cloud security, performance isolation, and customizability that is instantaneously rentable at a minute granularity. Cloud providers use virtualization to provide such public cloud capabilities. Today’s virtualization capabilities, however, lack one or more of the following characteristics: generality, efficiency, isolation, sharing capability, security, and scalability. We propose Terminus, universal virtualization for computation cores, memory, and I/O that depends on hardware support that enables sub-component-level access control, network accessibility, and quality of service. Prototypes built on commodity devices demonstrate Terminus’s advantages compared to existing techniques.

 


Keynote 5: Shuo Wang, Tsinghua University
Topic: Efficient Heterogeneous Computing for Interactive Applications in Datacenter

Bio: Shuo Wang is a postdoctoral researcher in the Storage Research Group of Tsinghua University. He received his Ph.D. in Computer Science from the Center for Energy-Efficient Computing and Application (CECA), Peking University. His current research interests include compilation techniques for heterogeneous computing platforms and near-storage computing. His work has been published at DAC, FPGA, HPCA, and TC.

Abstract: QoS-sensitive workloads, common in warehouse-scale datacenters, require a guaranteed, stable tail latency (percentile response latency) of the service. Unfortunately, the system load (e.g., RPS) fluctuates drastically during daily datacenter operations. To meet the maximum system RPS requirement, datacenters tend to overprovision hardware accelerators, which leaves the datacenter underutilized. As a result, scaling the throughput and energy efficiency of current accelerator-outfitted datacenters is very expensive for QoS-sensitive workloads. To overcome this challenge, this work introduces Poly, an OpenCL-based heterogeneous system optimization framework that aims to improve overall throughput scalability and energy proportionality while guaranteeing QoS by efficiently utilizing GPU- and FPGA-based accelerators within the datacenter. Experiments using a variety of cloud QoS-sensitive applications show that Poly improves energy proportionality by 23% (17%) without sacrificing QoS compared to the state-of-the-art GPU (FPGA) solution.

 


Keynote 6: Bingsheng He (何丙胜), National University of Singapore
Topic: Parallel Graph Processing Systems on Heterogeneous Architectures

Bio: Dr. Bingsheng He is currently an Associate Professor and Vice-Dean (Research) at the School of Computing, National University of Singapore. Before that, he was a faculty member at Nanyang Technological University, Singapore (2010-2016), and held a research position in the System Research group of Microsoft Research Asia (2008-2010), where his major research was building high performance cloud computing systems for Microsoft. He received his Bachelor's degree from Shanghai Jiao Tong University (1999-2003) and his Ph.D. degree from Hong Kong University of Science & Technology (2003-2008). His current research interests include cloud computing, database systems, and high performance computing. His papers are published in prestigious international journals (such as ACM TODS and IEEE TKDE/TPDS/TC) and proceedings (such as ACM SIGMOD, VLDB/PVLDB, ACM/IEEE SuperComputing, ACM HPDC, and ACM SoCC). He has been awarded the IBM Ph.D. Fellowship (2008), the NVIDIA Academic Partnership (2011), the Adaptive Compute Research Cluster from Xilinx (2020), and ACM Distinguished Member (class of 2020). Since 2010, he has (co-)chaired a number of international conferences and workshops, including IEEE CloudCom 2014/2015, BigData Congress 2018, and ICDCS 2020. He has served on the editorial boards of international journals, including IEEE Transactions on Cloud Computing (IEEE TCC), IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), IEEE Transactions on Knowledge and Data Engineering (TKDE), Springer Journal of Distributed and Parallel Databases (DAPD), and ACM Computing Surveys (CSUR).

Abstract: Graphs are de facto data structures for many data processing applications, and their volume is ever growing. Many graph processing tasks are computation intensive and/or memory intensive. Therefore, we have witnessed a significant amount of effort in accelerating graph processing tasks with heterogeneous architectures like GPUs, FPGAs, and even ASICs. In this talk, we will first review the literature on large graph processing systems on heterogeneous architectures. Next, we present our research efforts and demonstrate the significant performance impact of hardware-software co-design on designing high performance graph computation systems and applications. Finally, we outline the research agenda on challenges and opportunities in the system and application development of future graph processing. More details about our research can be found at http://www.comp.nus.edu.sg/~hebs/.

 


Keynote 7: Jingwen Leng, Shanghai Jiao Tong University
Topic: Efficient and Highly Reliable AI System Architecture Based on Hardware-Software Co-Design

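Bio: Jingwen Leng is a tenure-track associate professor in the Department of Computer Science and Technology and the John Hopcroft Center for Computer Science at Shanghai Jiao Tong University. His main research interests are the design of novel computing systems for artificial intelligence and the optimization of their performance, energy efficiency, and reliability. He has published more than twenty papers in top international conferences and journals and holds related domestic and international patents. He received his Ph.D. from the Department of Electrical and Computer Engineering at the University of Texas at Austin in December 2016, where he mainly studied architectural optimization of GPU processors; GPUWattch, the GPU power model whose design he led, is currently the most widely used power model in academia. He received his bachelor's degree from Shanghai Jiao Tong University in July 2010.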

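Abstract: With the rapid development of artificial intelligence and computing technology, the convergence of AI and computing systems has become one of the major trends in the information field. Computing systems provide the compute engine for the rapid development of AI, and AI has become an important application of computing systems. Their convergence brings new challenges and opportunities for the architecture, system software, and development methodology of computing systems. This talk will present our recent work on using hardware-software co-optimization to design efficient and highly reliable AI system architectures.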

 


Keynote 8: Mingyu Gao, Tsinghua University
Topic: Understanding and Optimizing Hierarchical Dataflow Scheduling for Scalable NN Accelerators

Bio: Mingyu Gao is a tenure-track assistant professor of computer science in the Institute for Interdisciplinary Information Sciences (IIIS) at Tsinghua University in Beijing, China. Mingyu received his PhD and Master of Science degrees in Electrical Engineering at Stanford University, and Bachelor of Science degree in Microelectronics at Tsinghua University. His research interests lie in the fields of computer architecture and systems, including efficient memory architectures, scalable data processing, and hardware system security, with a special emphasis on data-intensive applications like artificial intelligence and big data analytics. He won the IEEE Micro Top Picks paper award in 2016 and was the recipient of Forbes China 30 Under 30 in 2019. He also served in the program committees of MICRO 2019, MICRO 2020, SoCC 2020, ASPLOS 2021, and ISPASS 2021.

Abstract: The use of increasingly larger and more complex neural networks (NNs) makes it critical to scale the capabilities and efficiency of NN accelerators. Tiled and multi-chip/chiplet architectures provide a scalable hardware solution that supports many types of parallelism in NNs, including data parallelism, intra-layer parallelism, and inter-layer pipelining. In this talk, I will discuss software-level dataflow scheduling for such highly parallel hardware, with a comprehensive hierarchical taxonomy for NN dataflow scheduling composed of a rich set of tightly coupled dataflow levels. I will zoom into some of these levels in more detail. First, we use the Halide scheduling language to understand the dataflow choices in the loop transformation and spatial unrolling levels. Then, we introduce several intra-layer and inter-layer dataflow optimizations. The results show that these optimizations significantly improve the performance and energy efficiency of tiled NN accelerators across a wide range of NNs. This talk is mainly based on our recent work published in ASPLOS 2019 and 2020.

 


Talk 1: Tianni Xu 
Topic: NfvInsight: A Framework for Automatically Deploying and Benchmarking VNF Chains

Bio: Tian-Ni Xu is a Ph.D. candidate at the University of Chinese Academy of Sciences (UCAS), Institute of Computing Technology, Chinese Academy of Sciences (ICT-CAS), Beijing. She received her B.S. degree in network engineering from Beijing University of Posts and Telecommunications, Beijing, in 2013. Her research interests include computer networks, network function virtualization, operating systems, and system performance modeling and evaluation.

Abstract: With the advent of virtualization techniques and software-defined networks, network function virtualization (NFV) shifts network functions (NFs) from hardware implementations to software appliances, between which a performance gap exists. How to narrow this gap is an essential issue in current NFV research. However, the cumbersomeness of deployment, the water pipe effect of virtual network function (VNF) chains, and the complexity of the system software stack together make it difficult to identify the cause of low performance in an NFV system.
To pinpoint NFV system performance issues, we propose NfvInsight, a framework for automatically deploying and benchmarking VNF chains. Our framework tackles the challenges in NFV performance analysis. Its components include chain graph generation, automatic deployment, and fine-grained measurement, and the design and implementation of each component has its own advantages. To the best of our knowledge, we make the first attempt to collect rules forming a knowledge base for generating reasonable chain graphs. NfvInsight deploys the generated chain graphs automatically, which frees network operators from executing at least 391 lines of bash commands for a single test. To diagnose performance bottlenecks, NfvInsight collects metrics from multiple layers of the software stack. Specifically, we collect the network stack latency distribution while introducing less than 2.2% overhead. We showcase the convenience and usability of NfvInsight in finding bottlenecks for both VNF chains and the underlying system. Leveraging our framework, we find several design flaws in the network stack that make it unsuitable for packet forwarding inside a single server under NFV. Our optimizations for these flaws achieve up to a 3x performance improvement.

 


Talk 2: Haiyang Pan
Topic: LSP: Collective Cross-Page Prefetching for NVM

Bio: Haiyang Pan is a Ph.D. candidate at the Institute of Computing Technology, Chinese Academy of Sciences. He received his B.S. from Huazhong University of Science and Technology. His research interests include memory systems and non-volatile memory. His paper has been published at DATE’21.

Abstract: As an emerging technology, non-volatile memory (NVM) provides valuable opportunities for boosting the memory system, which is vital for computing system performance. However, one challenge preventing NVM from replacing DRAM as main memory is that NVM row activation latency is much longer (by approximately 10x) than that of DRAM. To address this issue, we present a collective cross-page prefetching scheme that can accurately open an NVM row in advance and then prefetch the data blocks from the opened row with low overhead. We identify a memory access pattern (referred to as a ladder stream) that facilitates prefetching across page boundaries, and propose the ladder stream prefetcher (LSP) for NVM. LSP has two crucial components. The Collective Prefetch Table reduces the interference with demand requests caused by prefetching by speculatively scheduling prefetches according to the state of the memory queue. It is implemented with low overhead by using a single entry to track multiple prefetches. The Memory Mapping Table accurately prefetches future pages by maintaining the mapping between physical and virtual addresses. Experimental evaluations show that LSP improves memory system performance by 66% over a baseline with no prefetching, and that the improvement over the state-of-the-art prefetchers Access Map Pattern Matching Prefetcher (AMPM), Best-Offset Prefetcher (BOP), and Signature Path Prefetcher (SPP) is 26.6%, 21.7%, and 27.4%, respectively.

 


Talk 3: Ying Wang
Topic: NUMA-Aware Thread Migration for High Performance NVMM File Systems

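Bio: Ying Wang, Ph.D., Institute of Computing Technology, Chinese Academy of Sciences. Main research interest: storage systems based on emerging storage devices.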

Abstract: Emerging Non-Volatile Main Memories (NVMMs) provide persistent storage and can be directly attached to the memory bus, which allows building file systems on non-volatile main memory (NVMM file systems). Since these file systems are built on memory, the NUMA architecture has a large impact on their performance due to remote memory access and imbalanced resource usage. Existing works migrate threads and thread data on DRAM to solve these problems. Unlike DRAM, NVMM introduces extra latency and lifetime limitations, which makes data migration expensive for NVMM file systems on a NUMA architecture. In this paper, we argue that NUMA-aware thread migration without migrating data is desirable for NVMM file systems. We propose NThread, a NUMA-aware thread migration module for NVMM file systems. NThread applies what-if analysis to determine the node on which each thread would access data locally and to evaluate what the resource contention would be if all threads accessed data locally. NThread then performs priority-based migration to reduce NVMM and CPU contention. In addition, NThread considers CPU cache sharing between threads when migrating threads. We implement NThread in a state-of-the-art NVMM file system and compare it against the existing NUMA-unaware NVMM file systems ext4-dax, PMFS, and NOVA. For filebench, NThread improves throughput by 166.5%, 872.0%, and 78.2% on average, respectively. For RocksDB, NThread improves performance by 111.3%, 57.9%, and 32.8% on average.

 


Talk 4: Fei Tang
Topic: AIBench: AI Scenario, Training, and Inference Benchmarks across Datacenter, HPC, IoT and Edge

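Bio: Fei Tang is a Ph.D. student at the Research Center for Advanced Computer Systems, Institute of Computing Technology, Chinese Academy of Sciences, advised by Professor Jianfeng Zhan. Research interests include benchmarking, large-scale system simulation and verification, and deep learning systems.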

Abstract: AI has been widely used in various application fields, and thus there is an urgent need for a comprehensive and representative benchmark suite to evaluate AI systems and architectures fairly. However, given the different AI benchmarking requirements for various applications like datacenter and HPC, different stages like training and inference, and different purposes like workload characterization and market ranking, achieving representativeness and comprehensiveness is a major challenge. In cooperation with 17 industrial partners, we propose AIBench, the most comprehensive AI benchmark suite, covering AI Scenario, Training, and Inference Benchmarks across Datacenter, HPC, IoT and Edge. Meanwhile, we regularly publish AI performance rankings, aiming to promote the development of AI systems and architectures.

 


Talk 5: Siyuan Sheng
Topic: Toward Nearly-Zero-Error Sketching via Compressive Sensing

Bio: I graduated from SJTU in 2018 with a bachelor's degree. I am now in the third year of the master's program at UCAS. My advisors are Yungang Bao and Qun Huang. I mainly focus on computer networks, especially network measurement.

Abstract: Sketch algorithms have been extensively studied in the area of network measurement, given their limited resource usage and theoretically bounded errors. However, error bounds provided by existing algorithms remain too coarse-grained: in practice, only a small number of flows (e.g., heavy hitters) actually benefit from the bounds, while the remaining flows still suffer from serious errors. In this paper, we aim to design a nearly-zero-error sketch that achieves negligible per-flow error for almost all flows. We base our study on a technique named compressive sensing, which we exploit in two ways. First, we incorporate the near-perfect recovery of compressive sensing to boost sketch accuracy. Second, we leverage compressive sensing as a novel and uniform methodology to analyze various design choices of sketch algorithms. Guided by the analysis, we propose two sketch algorithms that seamlessly embrace compressive sensing to reach nearly zero errors. We implement our algorithms in OpenVSwitch and P4. Experimental results show that the two algorithms incur less than 0.1% per-flow error for more than 99.72% of flows, while preserving the resource efficiency of sketch algorithms. The efficiency demonstrates the power of our new methodology for sketch analysis and design.

 

Contacts

xiebiwei@ict.ac.cn
gaowanling@ict.ac.cn