Cloud data centers have become the most important IT infrastructure that people use every day. Building efficient future data centers will require the collective effort of the entire global community. To initiate a platform that brings together the most important and forward-looking work in the area for stimulating and productive discussion, the 3rd Workshop on Hot Topics on Data Centers (HotDC 2018) will be held in Beijing, China on October 20th, 2018. HotDC 2018 consists of by-invitation-only presentations from top academic and industrial groups around the world. The topics cover a wide range of data center issues, including state-of-the-art technologies for server architecture, storage systems, data center networks, and resource management. In addition, HotDC 2018 includes a special session of invited talks presenting recent research from the data center team at the Institute of Computing Technology, Chinese Academy of Sciences. The HotDC workshop aims to provide a forum for cutting-edge data center research, where researchers and engineers can exchange ideas and engage in discussions with colleagues from around the world. Welcome to HotDC 2018!
Yungang Bao, Institute of Computing Technology, Chinese Academy of Sciences
Qun Huang, Institute of Computing Technology, Chinese Academy of Sciences
Dejun Jiang, Institute of Computing Technology, Chinese Academy of Sciences
Sa Wang, Institute of Computing Technology, Chinese Academy of Sciences
Yuqing Zhu, Institute of Computing Technology, Chinese Academy of Sciences
Yisong Chang, Institute of Computing Technology, Chinese Academy of Sciences
Biwei Xie, Institute of Computing Technology, Chinese Academy of Sciences
Wenya Hu, Institute of Computing Technology, Chinese Academy of Sciences
Conference Venue: Jingzhihu Holiday Hotel, Beijing, China
Dates: October 20th, 2018
|9:00 – 9:10||Opening remarks|
|Keynote Session 1, Chair: Sa Wang|
|9:10 – 9:50||Keynote 1: Accelerated Machine Intelligence: An Edge to Cloud Continuum. Speaker: Hadi Esmaeilzadeh, UCSD|
|9:50 – 10:30||Keynote 2: Making Cloud Systems Reliable and Dependable: Challenges and Opportunities. Speaker: Lidong Zhou, MSRA|
|Keynote Session 2, Chair: Yuqing Zhu|
|10:50 – 11:30||Keynote 3: Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better! Speaker: Haibo Chen, SJTU|
|11:30 – 12:10||Keynote 4: RDMA in Data Centers: from Cloud Computing to Machine Learning. Speaker: Chuanxiong Guo, Bytedance|
|12:10 – 13:30||Lunch|
|Highlighted Research Session, Chair: Biwei Xie|
|13:30 – 14:00||Congestion Control Mechanisms in Data Center Networks. Speaker: Wei Bai, MSRA|
|14:00 – 14:30||Understanding the Challenges of Scaling Distributed DNN Training. Speaker: Cheng Li, USTC|
|14:30 – 15:00||Information Leakage in Encrypted Deduplication via Frequency Analysis. Speaker: Jingwei Li, UESTC|
|15:00 – 15:30||Octopus: an RDMA-enabled Distributed Persistent Memory File System. Speaker: Youyou Lu, Tsinghua|
|15:30 – 15:50||Coffee break|
|Short Talk Session, Chair: Yisong Chang|
|15:50 – 17:00||Short talks: Computer Organization and Design Course with FPGA Cloud, Ke Zhang (ICT, CAS); ActionFlow: A Framework for Fast Multi-Robots Application Development, Jimin Han (UCAS); Labeled Network Stack, Yifan Shen (ICT, CAS); Caching or Not: Rethinking Virtual File System for Non-Volatile Main Memory, Ying Wang (ICT, CAS); Data Motif-based Proxy Benchmarks for Big Data and AI Workloads, Chen Zheng (ICT, CAS)|
|Panel Session, Chair: Yuhang Liu|
|17:00 – 17:40||Panel. Topic: Experience in system and networking research – from young scholars’ perspective. Panel members: Wei Bai (MSRA), Cheng Li (USTC), Jingwei Li (UESTC), Linpeng Tang (Moqi Inc.), Youyou Lu (Tsinghua)|
|17:40 – 17:50||Closing|
Keynote 1: Accelerated Machine Intelligence: An Edge to Cloud Continuum
Speaker: Hadi Esmaeilzadeh
Abstract: This talk presents Project PHI (Accelerated System Design for Pervasive Hierarchical Intelligence), a holistic effort to provide a comprehensive solution for making immersive machine intelligence a reality. Our guiding principle is to retain as much generality and automation as possible while delivering large performance and efficiency gains through specialization and acceleration for a wide range of learning and intelligence workloads. As the first milestones of Project PHI, we have developed Tabla and DnnWeaver, which are publicly available (http://act-lab.org/artifacts/tabla/ and http://dnnweaver.org/). DnnWeaver is the very first open-source hardware acceleration stack for deep neural networks. Tabla is a cross-stack solution, spanning from the programming language to the hardware, that rethinks the hardware/software abstraction by delving into the theory of machine learning. It leverages the insight that many learning algorithms can be solved using stochastic gradient descent that minimizes an objective function. The solver is fixed while the objective function changes with the learning algorithm. Therefore, Tabla uses stochastic optimization as the abstraction between hardware and software. Consequently, programmers specify the learning algorithm by merely expressing the gradient of the objective function in our domain-specific language. Tabla then automatically generates the synthesizable implementation of the accelerator and the system software for scale-out FPGA realization using a set of template designs. Real hardware measurements show orders of magnitude higher performance and power efficiency while the programmer writes only 60 lines of code. Next, the talk ventures into the edge domain and shows how utilizing algorithmic insights enables us to match server-grade GPU performance for DNN acceleration within the milliwatt regime, and extends the discussion to our very recent work on a complete stack for motion planning and control in robotics, dubbed RoboX.
The encouraging results from these full-stack solutions in Project PHI show that rethinking the hardware/software abstractions from an algorithmic perspective can open new dimensions in system design for Pervasive Hierarchical Intelligence.
Bio: Dr. Esmaeilzadeh was awarded early tenure at the University of California, San Diego (UCSD), where he is the inaugural holder of the Halicioglu Chair in Computer Architecture with the rank of associate professor in Computer Science and Engineering. Prior to UCSD, he was an assistant professor in the School of Computer Science at the Georgia Institute of Technology from 2013 to 2017. There, he was the inaugural holder of the Catherine M. and James E. Allchin Early Career Professorship. Hadi is the founding director of the Alternative Computing Technologies (ACT) Lab, where his team is developing new technologies and cross-stack solutions to build the next generation of computer systems. He is also the associate director of the Center for Machine Integrated Computing and Security (MICS) at UCSD. Dr. Esmaeilzadeh obtained his Ph.D. from the Department of Computer Science and Engineering at the University of Washington in 2013, where his Ph.D. work received the 2013 William Chan Memorial Best Dissertation Award. Prof. Esmaeilzadeh received the IEEE Technical Committee on Computer Architecture (TCCA) “Young Architect” Award in 2018 and was inducted into the ISCA Hall of Fame in the same year. He has received the Air Force Office of Scientific Research Young Investigator Award (2017), College of Computing Outstanding Junior Faculty Research Award (2017), Qualcomm Research Award (2017 and 2016), Google Research Faculty Award (2016 and 2014), Microsoft Research Award (2017 and 2016), and Lockheed Inspirational Young Faculty Award (2016). His teams were awarded the Qualcomm Innovation Fellowship in 2014 and 2018, one of his students was a Microsoft Research Fellow, and another won the 2017 National Center for Women & IT (NCWIT) Collegiate Award. Four of his undergraduate students have been awarded the Georgia Tech President’s Undergraduate Research Award (PURA).
His research has been recognized by four Communications of the ACM Research Highlights, four IEEE Micro Top Picks, a nomination for Communications of the ACM Research Highlights, an honorable mention in IEEE Micro Top Picks, and a Distinguished Paper Award at HPCA 2016. Hadi’s work on dark silicon has also been profiled in the New York Times. More information is available on his webpage, http://cseweb.ucsd.edu/~hadi/.
Keynote 2: Making Cloud Systems Reliable and Dependable: Challenges and Opportunities
Speaker: Lidong Zhou
Abstract: As our society increasingly depends on cloud services, the reliability of the underlying cloud systems has become a significant challenge. The well-established foundations of system reliability, such as consensus protocols, prove insufficient to protect cloud systems from catastrophic outages. In this talk, we share our experiences with real failures observed in production cloud systems, reveal that a new category of subtle failures, referred to as gray failures, is the source of the major availability breakdowns and performance anomalies we see in cloud systems, and discuss how a new foundation can be laid to address the challenges of gray failures.
Bio: Lidong Zhou is an Assistant Managing Director of Microsoft Research Asia, responsible for research in the System and Networking area. Previously, he was a Principal Researcher managing the Systems Research Group at Microsoft Research Redmond (2014-2017), a Principal Researcher managing the Systems Research Group at Microsoft Research Asia (2008-2014), and a Researcher at Microsoft Research Silicon Valley (2002-2008). His research has advanced the state of the art in both the theory and practice of scalable and reliable distributed systems powering online cloud services, while making key technical contributions to large-scale production platforms and services at Microsoft. Lidong is on the editorial board of ACM Transactions on Storage and has served on the program committees of top systems conferences such as SOSP and OSDI. He was the general co-chair of SOSP 2017 in Shanghai, after years of effort to bring the top systems conference to the Asia-Pacific region. Lidong received his Ph.D. and M.S. in Computer Science from Cornell University and his B.S. in Computer Science from Fudan University.
Keynote 3: Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!
Speaker: Haibo Chen
Abstract: In this talk, I will present a systematic comparison of different RDMA primitives, combined with various optimizations, using representative OLTP workloads. More specifically, we first implement and compare different RDMA primitives, with both existing optimizations and our new ones, on a single well-tuned execution framework. This gives us insights into the performance characteristics of different RDMA primitives. Then we investigate the implementation of optimistic concurrency control (OCC) by comparing different RDMA primitives using a phase-by-phase approach with various transactions from TPC-C, SmallBank, and TPC-E. Our results show that no single primitive (one-sided or two-sided) wins over the other in all phases. We further conduct an end-to-end comparison of prior designs on the same codebase and find that none of them is optimal. Based on these studies, we build DrTM+H, a new hybrid distributed transaction system that always embraces the optimal RDMA primitive at each phase of transactional execution. Evaluations using popular OLTP workloads including TPC-C and SmallBank show that DrTM+H achieves over 7.3 and 90.4 million transactions per second, respectively, on a 16-node RDMA-capable cluster (ConnectX-4), without assuming locality. For TPC-C, this outperforms pure one-sided and two-sided systems by up to 1.89X and 2.96X, with over 49% and 65% latency reduction.
Bio: Haibo Chen is a Professor at the School of Software, Shanghai Jiao Tong University, where he co-founded and leads the Institute of Parallel and Distributed Systems (IPADS) (http://ipads.se.sjtu.edu.cn). He currently also serves as the Chief Scientist for OS and directs the OS Kernel Lab. Haibo’s main research interest is building scalable and dependable systems software by leveraging cross-layer approaches spanning computer hardware, system virtualization, and operating systems. He is currently the steering committee co-chair of ACM APSys and Chair of ACM SIGOPS ChinaSys, and serves on the program committees of ASPLOS 2019, EuroSys 2019, USENIX ATC 2019, and SOSP 2019, as well as the editorial board of ACM Transactions on Storage.
Keynote 4: RDMA in Data Centers: from Cloud Computing to Machine Learning
Speaker: Chuanxiong Guo
Abstract: Data centers are being built around the world to meet the exponentially increasing demand for cloud computing. The same demand drives the increase in networking speed from 10Gb/s to 100Gb/s or higher and the reduction of end-to-end latency from milliseconds to microseconds. Traditional software-based TCP/IP, however, cannot keep up with this demand. Consequently, RDMA (Remote Direct Memory Access), first introduced in HPC, is now experiencing a renaissance in Ethernet-based data centers, at a much larger scale. In this talk, we will discuss the safety and performance challenges that we have addressed in deploying RDMA at scale. We will also discuss the new opportunities brought by RDMA and explore the new role that RDMA will play in a more heterogeneous system and networking infrastructure for machine learning. Specifically, we will study how RDMA helps accelerate various DNN models in distributed DNN training.
Bio: Chuanxiong Guo is a director of the AI Lab of Bytedance Inc. Before that, he was a Principal Researcher at Microsoft Research. He is currently working on data center networking and machine learning systems. Several of his ideas, including DCN virtualization, DCN monitoring, and ServerSwitch, have had both academic and industrial impact. Several of the systems that he designed and implemented, including Pingmesh and RDMA/RoCEv2, have been widely adopted by industry.