Interactive Large-Scale Data and Graph Analytics

Abstract

There is an ever-growing need for data analytics tools that can handle massive data sets. Arkouda is a Python framework with a Chapel back-end, created to scale NumPy operations to datasets that exceed tens of terabytes in size. The Python front-end lets data scientists use Arkouda to run expensive high-performance computing (HPC) kernels that require large distributed arrays. Arkouda is not intended to replace libraries like Pandas or NumPy outright, but rather to provide the capability to handle massive datasets in a highly scalable environment. The goal is an environment that supports exploratory data and graph analysis (EDA) while staying simple enough for any data scientist to pick up easily. Recently, our group at NJIT created a new graph analysis library built on Arkouda, named Arachne. The purpose of this tutorial is to provide a comprehensive view of typical pipelines that can be built and integrated with Arkouda. We will begin with an overview of Arkouda and then move to Arachne. Examples will cover questions and problems that data scientists may want to answer and show how Arkouda and Arachne can help solve them. We will conclude with questions and the further work our group is planning for Arachne. Both Arkouda and Arachne are open-source and available on GitHub.

Date
Feb 26, 2023 1:20 PM — 5:40 PM

Outline

Note: Outline timings and slides are subject to change.

  1. Introduction (1:20pm - 2:00pm) [slides]
  2. Break (2:00pm - 2:20pm)
  3. Data Analytics (2:20pm - 3:20pm) [slides]
  4. Break (3:20pm - 3:40pm)
  5. Graph Analytics (3:40pm - 4:40pm) [slides]
  6. Break (4:40pm - 5:00pm)
  7. Conclusion (5:00pm - 5:20pm) [slides]
  8. Q&A (5:20pm - 5:40pm)

Author Biographies

  1. Oliver Alvarado Rodriguez - Oliver Alvarado Rodriguez is currently a computer science Ph.D. student at New Jersey Institute of Technology in Newark, NJ. He performs research under the supervision of Dr. David Bader. He received his B.S. in computer science with a minor in mathematics from William Paterson University in Wayne, NJ in May 2020 with summa cum laude honors. During his undergraduate studies, he was a member of the Honors College and of the Upsilon Pi Epsilon honor society for computing and information disciplines, and was also awarded the Omicron Omega award for excellence in computer science. His research interests involve the design and implementation of algorithms in the areas of high-performance analytics, machine learning, and graph theory. He also pursued some cryptography and computer security research during his undergraduate studies. He received a best paper presentation award at the 2020 BDML/ICAIP conference for his presentation of the paper titled “A Study of Machine Learning Inference Benchmarks,” done in collaboration with Dev Dave and under the tutelage of Dr. Weihua Liu and Dr. Bogong Su. Oliver served as the student keynote speaker at the Spring 2022 meeting of the Academic Data Science Alliance, where he presented the keynote talk “Enabling Exploratory Large Scale Graph Analytics through Arkouda.”
  2. Naren Khatwani - Naren Khatwani is a graduate student majoring in computer science at NJIT in Newark, NJ. He works as a research assistant in Dr. David Bader’s research group. Naren completed his B.E. in computer engineering at the University of Mumbai, India. His research interests lie in the domains of high-performance computing and data analytics.
  3. Zhihui Du - Zhihui Du received the BE degree in 1992 from the computer department of Tianjin University. He received the MS and PhD degrees in computer science from Peking University in 1995 and 1998, respectively. From 1998 to 2000, he worked at Tsinghua University as a postdoctoral researcher. From 2001 to 2019, he worked at Tsinghua University as an associate professor in the Department of Computer Science and Technology. In 2008, he visited Georgia Tech for one year. His research areas include cluster system design, parallel algorithm design, task and message scheduling, and resource and QoS management in grid and cloud computing. He has authored or co-authored two books, translated two books, and edited three books in parallel computing or related fields. As PI, he has completed more than 10 parallel computing related projects and published more than 100 papers in parallel computing or related fields. As a major contributor, he designed and built the “DeepSuper-21C” supercomputer, which was included in the TOP500 list (Nov. 2003, Rank 163). His book on MPI programming is widely used in China in the parallel programming field. He has served as Vice Chair or PC member of more than 10 parallel processing or related conferences. He is an IEEE/ACM member.
  4. David A. Bader - David A. Bader is a Distinguished Professor and founder of the Department of Data Science and inaugural Director of the Institute for Data Science at New Jersey Institute of Technology. Prior to this, he served as founding Professor and Chair of the School of Computational Science and Engineering, College of Computing, at Georgia Institute of Technology. Dr. Bader is a Fellow of the IEEE, ACM, AAAS, and SIAM, and a recipient of the IEEE Computer Society Sidney Fernbach Award. He advises the White House, most recently on the National Strategic Computing Initiative (NSCI) and Future Advanced Computing Ecosystem (FACE). Dr. Bader is a leading expert in solving global grand challenges in science, engineering, computing, and data science. His interests are at the intersection of high-performance computing and real-world applications, including cybersecurity, massive-scale analytics, and computational genomics. He has co-authored over 300 scholarly papers and has received best paper awards from ISC, IEEE HPEC, and IEEE/ACM SC.