Delta: System Research

Check back often to learn about research the Delta Project is conducting in these areas and others that may arise as the project progresses. 

GPU Computing

Delta continues the hard work of adapting and optimizing applications for GPU-based accelerated computing. Delta staff members will be surveying the research community for both existing and emerging research domains that could benefit from optimizations and/or application porting for GPU-accelerated computing.

Data: Moving Beyond POSIX

Strict adherence to POSIX semantics is challenging in parallel systems and unnecessary for most applications. Delta will be a pioneer in reducing reliance on POSIX file systems to help improve system uptime and performance. We will work with our users and experts to further refine Delta’s non-POSIX file system, which presents a POSIX-like interface and is offered alongside a POSIX-compliant file system. Most applications will be able to take advantage of the new file system without modification.

Usability and Accessibility

Delta offers a rich variety of interfaces, from command-line to science gateways. To ensure a great user experience despite the varied system workload, we will develop leading practices for blending interactive and batch computing with visualization. We will further work with experts at the University of Illinois Urbana-Champaign’s School of Information Sciences and Disability Resources & Educational Services to advance our practices in providing accessible interfaces for advanced computing and data resources, both for Delta and the broader community.

The accessibility of advanced computing and data resources to individuals with visual or other impairments is an often overlooked aspect of system design. NCSA and Delta are partnering with researchers at the University of Illinois Urbana-Champaign to evaluate Delta’s interfaces and advise us on ways both to improve the overall accessibility of the resource and to share these advancements with the broader cyberinfrastructure community.

System Operations Data

The Delta team collects a great deal of information on the operations of the system. This data includes fault data, performance data, power data, and much more.

The Delta project is happy to make this data available to system researchers, subject to some restrictions, including:

  • Any resulting publication must include the Delta acknowledgement.
  • The Delta data set may not be redistributed.
  • The Delta project must be allowed to review publications prior to submission to ensure that statements made about Delta and data provided by Delta are accurate.
  • Data that includes user-identifiable items (including job data) will be subject to additional restrictions.

The Delta data set is currently available only by request. To request data, send an email to help@ncsa.illinois.edu with the subject “Delta systems data access” and include a description of the types of data you are interested in and how you plan to use them.

ExaComm

ExaComm is a GPU-aware communication library developed by NCSA and UIUC collaborators. Its compositional API separates the design of a collective pattern from machine-specific optimizations. It offers hyper primitives for designing complicated collective communication patterns. ExaComm then optimizes the primitive pattern for a specific machine, which the user describes. The implementation uses the point-to-point functions of the native MPI or NCCL libraries.

  • ExaComm is available on Delta as a module. To learn how to add and enable ExaComm with your code, run:
      module show exacomm
  • Please see the ExaComm git repository for more information.
  • A paper about the work is available at …
  • If you have questions about ExaComm, please submit an issue to the ExaComm git repository.

Large-scale applications often suffer from communication bottlenecks. Our research specializes in data movement across GPUs to make sure that Delta performs well for large-scale scientific and AI applications. We work with Stanford researchers to generalize these techniques so that they work on other systems. The target network architecture, like Delta’s, comprises multiple levels of sub-networks with bandwidth that varies across the levels. Generalizing these optimizations is challenging because the optimal data routing pattern can differ drastically depending on the specific hierarchical network topology. Our team has the expertise to address this challenge. We develop portable software that generalizes hierarchical communications for exascale systems so that the same optimized code will work on various vendors’ GPU system architectures and software.
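
To make the idea of hierarchical communication concrete, here is a minimal sketch in plain Python. It is an illustration only: it does not use ExaComm, MPI, or NCCL, and the function names, the two-level layout, and the naive leader exchange are all assumptions made for this example. It simulates summing one value per GPU by first reducing inside each node and then exchanging only one partial sum per node across the slower network.

```python
# Hypothetical illustration only: simulates a two-level hierarchical
# all-reduce in plain Python. It does not use ExaComm, MPI, or NCCL.

def hierarchical_allreduce(values, gpus_per_node):
    """Sum one value per GPU using a node-local stage and a cross-node stage.

    values: list with one number per GPU, ordered node by node.
    gpus_per_node: assumed number of GPUs sharing the fast intra-node links.
    Returns (per_gpu_results, inter_node_messages).
    """
    if len(values) % gpus_per_node != 0:
        raise ValueError("GPU count must be a multiple of gpus_per_node")

    # Stage 1: reduce inside each node over the fast local links.
    nodes = [values[i:i + gpus_per_node]
             for i in range(0, len(values), gpus_per_node)]
    node_sums = [sum(node) for node in nodes]

    # Stage 2: only one leader per node talks across the slower network.
    # A naive all-to-all exchange among leaders is assumed here.
    num_nodes = len(node_sums)
    inter_node_messages = num_nodes * (num_nodes - 1)
    total = sum(node_sums)

    # Stage 3: each leader broadcasts the total back to its local GPUs.
    return [total] * len(values), inter_node_messages


def flat_allreduce_messages(num_gpus, gpus_per_node):
    """Inter-node messages if every GPU sent directly to every remote GPU."""
    remote_peers = num_gpus - gpus_per_node
    return num_gpus * remote_peers


# Example: 2 nodes with 4 GPUs each, GPU i holds the value i.
results, hier_msgs = hierarchical_allreduce(list(range(8)), gpus_per_node=4)
flat_msgs = flat_allreduce_messages(8, 4)
print(results[0], hier_msgs, flat_msgs)  # prints: 28 2 32
```

In this toy setup the hierarchical version sends 2 messages across nodes where the flat version sends 32. Real systems have more levels and uneven bandwidths, which is why the best routing pattern depends on the specific topology.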

We continually develop software for performance testing and verification, such as CommBench. CommBench is a micro-benchmarking tool with a portable API for testing GPU communications. We use CommBench to stress-test Delta and to compare it with five other systems, including exascale systems such as Frontier, Aurora, and the upcoming El Capitan. This testing helps keep Delta’s communication software performing well.
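
As a rough sketch of what a communication micro-benchmark measures, the Python snippet below times repeated transfers and reports the achieved bandwidth. It is not CommBench and does not reflect its API: the helper names are hypothetical, and a host-memory copy stands in for a GPU-to-GPU transfer so that the example runs anywhere.

```python
import time


def bandwidth_gb_per_s(num_bytes, seconds):
    """Convert bytes moved and elapsed time into GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / seconds / 1e9


def measure(transfer, num_bytes, warmup=2, iters=10):
    """Time `iters` calls of `transfer` after `warmup` untimed calls."""
    for _ in range(warmup):
        transfer()  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for _ in range(iters):
        transfer()
    elapsed = time.perf_counter() - start
    return bandwidth_gb_per_s(num_bytes * iters, elapsed)


# Stand-in for a GPU transfer: copy a 16 MiB buffer in host memory.
src = bytearray(16 * 1024 * 1024)
dst = bytearray(len(src))


def host_copy():
    dst[:] = src  # slice assignment copies the bytes into dst in place


rate = measure(host_copy, len(src))
print(f"{rate:.2f} GB/s")
```

The warm-up iterations and the averaging over several timed runs are common micro-benchmarking practice; the reported number will vary from machine to machine.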