COPA 2023: Papers with Abstracts

Papers
Accelerating Scientific and Engineering Applications through Cloud-based GPU Computing Dipesh Rawat, Kopal Chakravarty, Neelaksh Singh, Vijaya Laxmi Pachva and Rakshit Anand Bhootham Abstract. As the extensibility of GPU computing rapidly increases, GPUs are proving useful for a range of applications in science and engineering. Libraries written for engineering tasks, such as CULA (CUDA Linear Algebra), cuFFT (CUDA Fast Fourier Transform), and cuBLAS (CUDA Basic Linear Algebra Subprograms), have made it easier for programmers to achieve a significant performance increase when solving problems in engineering and mathematics. In signal processing, the GPU can be used to perform discrete Fourier transforms on time-domain signal-strength data to represent the data in the frequency domain. With the data in this format, the signal strength at various frequencies can be calculated very efficiently, which in turn shows whether a transmission on a particular frequency has taken place. Speedups in excess of a factor of 70 were achievable using a GPU-based implementation built on the cuFFT library, compared with a CPU implementation built on FFTW, the most performance-optimized CPU-based FFT library.
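The frequency-domain detection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a naive O(n^2) discrete Fourier transform in pure Python rather than cuFFT or FFTW, and the function names, the synthetic 50 Hz test signal, and the detection threshold are all assumptions made for illustration.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return |X[k]| / n for every frequency bin k of a naive O(n^2) DFT."""
    n = len(samples)
    magnitudes = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        magnitudes.append(abs(acc) / n)
    return magnitudes


def transmission_detected(samples, sample_rate_hz, freq_hz, threshold):
    """Report whether the bin nearest freq_hz exceeds a (hypothetical) threshold."""
    n = len(samples)
    k = round(freq_hz * n / sample_rate_hz)  # nearest bin; freq_hz must be below Nyquist
    return dft_magnitudes(samples)[k] > threshold


# Synthetic stand-in for measured data: a unit-amplitude 50 Hz tone,
# sampled at 400 Hz for 128 samples, so 50 Hz falls exactly on bin 16.
SAMPLE_RATE_HZ = 400
signal = [math.sin(2 * math.pi * 50 * t / SAMPLE_RATE_HZ) for t in range(128)]

tone_present = transmission_detected(signal, SAMPLE_RATE_HZ, 50, threshold=0.25)
tone_absent = transmission_detected(signal, SAMPLE_RATE_HZ, 100, threshold=0.25)
```

A library FFT computes the same bins in O(n log n) time, which is where the GPU and CPU libraries named in the abstract differ in speed; the detection logic on top of the magnitudes stays the same.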
Slurm Scheduling From Rules-Based Systems Mark Blomqvist and David Marchant Abstract. This paper explores how a rules-based scheduling system can be integrated with a traditional workload manager such as Slurm. The integration aims to require as little additional setup from the user as is feasible, while maintaining the security of the network and of any machines involved. To this end, the processing component known as the Conductor in the rules-based scheduling system MEOW has been extended with a remote option. This enables MEOW to transmit jobs to a remote system, with the option of using Slurm to orchestrate them. The new option is evaluated by comparing the execution time of the remote solution without Slurm with that of existing components. Furthermore, the overhead of using various Slurm methods for scheduling jobs is compared with that of running the jobs remotely without Slurm. It is shown that the remote solution adds a flat overhead to the execution time as both the number of jobs and the size of the transmitted data increase, and that the overhead associated with using Slurm on top of it is largely insignificant. This is considered to be an acceptable result given the test environment that was used.
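The choice between running a job remotely with or without Slurm can be sketched as a small command builder. This is a hypothetical illustration, not MEOW's Conductor: the abstract does not state the transport or the exact Slurm methods used, so the use of ssh, the function name, the host name, and the script path are all assumptions. Only the standard `sbatch --wait` and `srun` commands are relied on.

```python
import shlex


def build_remote_command(host, script_path, scheduler="none"):
    """Build an argv list that runs script_path on host, optionally via Slurm.

    scheduler is one of "none", "sbatch" or "srun" (names chosen for this sketch).
    """
    script = shlex.quote(script_path)  # ssh joins arguments into one remote shell string
    if scheduler == "none":
        remote = ["bash", script]
    elif scheduler == "sbatch":
        remote = ["sbatch", "--wait", script]  # --wait blocks until the job terminates
    elif scheduler == "srun":
        remote = ["srun", "bash", script]
    else:
        raise ValueError(f"unknown scheduler: {scheduler}")
    return ["ssh", host] + remote


# Hypothetical host and job script, for illustration only.
plain = build_remote_command("cluster.example.org", "jobs/job_1.sh")
batched = build_remote_command("cluster.example.org", "jobs/job_1.sh", "sbatch")
```

The returned list could be passed to `subprocess.run`; timing that call for each scheduler value is one way to measure the kind of overhead the abstract compares.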
Concurrency and Models of Abstraction: Past, Present and Future Jeremy Martin Abstract. I will present a personal view of some of the key historical developments in Concurrency Theory and how abstract models have been used to make it easier to develop concurrent systems. I will then provide an assessment of certain concurrency issues facing us today and make predictions as to how these will be solved in the future.
Evaluation of FPGA Acceleration of Neural Networks Emil Stevnsborg, Sture Oksholm, Carl-Johannes Johnsen and James Emil Avery Abstract. This paper explores real-time Convolutional Neural Network inference on Field Programmable Gate Arrays (FPGAs) implemented in Synchronous Message Exchange (SME). We compare SME to the widespread FPGA tool, High-Level Synthesis (HLS), and compare both the SME and HLS implementations of CNNs with the PyTorch implementation for CNN on CPU/GPU. We find that the SME implementation is more flexible than the HLS implementation as it allows for more customization of the hardware. Programming with SME is more difficult than HLS, although easier than traditional Hardware Description Languages. Finally, for a test use case, we find that the SME implementation on FPGA is approximately 2.8/1.4/2.0 times more energy efficient than CPU/GPU/ARM at larger batch sizes, with the HLS implementation on FPGA falling in between CPU/ARM and GPU in terms of energy efficiency. At a batch size of 1, appropriate for edge-device inference, the gap in energy efficiency between the FPGA and CPU/GPU/ARM implementations becomes more pronounced, with the SME implementation on FPGA being approximately 83/47/8 times more energy efficient than the CPU/GPU/ARM implementations, and with the HLS implementation on FPGA being approximately 40/23/4 times more energy efficient than the CPU/GPU/ARM implementations.
Race-Condition-Robust Hardware-Software Equivalence in nx Larry Dickson Abstract. Classic CSP communication channels behave equivalently whether connecting software processes or hardware devices. The static OCCAM language and the Transputer processor, both based on finite CSP, exhibit this property, which results in Hardware-Software Equivalence (HSE) between implementations of the same programs (including binaries). Programs written in OCCAM and run on the Transputer can be proven correct and their behavior characterized down to cycle count. This Fringe presentation shows a technique for extending this HSE to nx processes and devices in certain applications. Wherever certain capabilities of the ssh suite of programs are found, the technique allows communication using sockets as the communicators known to the programs. An ssh tunnel between sockets on separate devices allows this to work in the hardware case, without any change in the programs, including their binaries. Among the nx varieties we have shown to work are Linux, Mac BSD, and Termux over Android. The technique passes short messages via a server program. An investigation of race conditions, using the property of select() that it always finds a winner among file descriptor communication races, proves that the communication is robust across all possible timing differences between communicating client pairs. The current code works if two of the clients are communicating at a time. It requires further development before handling three independent racing systems.
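The select() property this abstract relies on can be shown in a minimal sketch: when several descriptors race, select() always reports at least one ready descriptor, so a server can serve one winner at a time and no message is lost whatever the arrival order. This is not the author's server program and proves nothing about it. The function name, the 64-byte message limit, and the use of local socket pairs in place of ssh-tunnelled sockets are assumptions.

```python
import select
import socket


def serve_until_idle(client_socks, timeout=0.1):
    """Drain short messages from racing clients, one select() winner at a time.

    Returns (client_index, message) tuples in the order they were served.
    """
    served = []
    while True:
        readable, _, _ = select.select(client_socks, [], [], timeout)
        if not readable:
            return served  # nothing arrived within the timeout: all races resolved
        winner = readable[0]  # select() reported this descriptor as ready
        served.append((client_socks.index(winner), winner.recv(64)))


# Two local socket pairs stand in for two communicating clients.
server_a, client_a = socket.socketpair()
server_b, client_b = socket.socketpair()
client_b.send(b"from-b")  # the two sends race from the server's point of view
client_a.send(b"from-a")

messages = serve_until_idle([server_a, server_b])

for sock in (server_a, client_a, server_b, client_b):
    sock.close()
```

The serving order depends on which descriptor select() reports first, but every message is delivered exactly once, which is the property that matters for robustness against timing differences.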
Is it feasible to identify outputs of an arbitrary process at run time without excessively slowing down workflows? Philip Shun Jensen, Iben Lilholm and David Marchant Abstract. In this study, we explore the feasibility of identifying file events for any process in real time without significant workflow slowdowns, to aid in generating a data provenance report for the dynamic workflow manager, MEOW. Unlike in traditional workflow managers, a job's output location in MEOW isn’t pre-defined, and output can initiate another job. We established criteria and examined four Linux tools: strace, perf script, inotify, and fanotify. Our findings suggest that strace meets our requirements, and that integrating an strace-based tracer into MEOW is both theoretically and practically viable. The implemented tracer slows the workflow by a factor of approximately 1.3, although worst-case scenarios show the slowdown could reach a factor of 5. This research forms the base for constructing MEOW’s data provenance report.
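How an strace-based tracer might recover a process's output files can be sketched by parsing `openat` lines from a trace. This is an illustrative sketch, not the tracer built for MEOW: the function name and regular expression are assumptions, the sample lines are hand-written in the style of typical strace output rather than captured from a real run, and only successful opens with write access are treated as outputs.

```python
import re

# Matches lines such as:
#   [pid 4242] openat(AT_FDCWD, "result.csv", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
OPENAT_RE = re.compile(
    r'openat\(AT_FDCWD, "(?P<path>[^"]+)", (?P<flags>[A-Z_|]+)(?:, \d+)?\)'
    r"\s+= (?P<ret>-?\d+)"
)
WRITE_FLAGS = {"O_WRONLY", "O_RDWR"}


def written_paths(trace_lines):
    """Return paths successfully opened with write access, in first-seen order."""
    paths = []
    for line in trace_lines:
        match = OPENAT_RE.search(line)
        if match is None or int(match.group("ret")) < 0:
            continue  # not an openat call, or the call failed
        flags = set(match.group("flags").split("|"))
        if flags & WRITE_FLAGS and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths


# Hand-written sample lines (hypothetical paths and pid).
SAMPLE_TRACE = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    '[pid 4242] openat(AT_FDCWD, "result.csv", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3',
    'openat(AT_FDCWD, "missing.txt", O_RDONLY) = -1 ENOENT (No such file or directory)',
    'openat(AT_FDCWD, "/root/denied.log", O_WRONLY|O_CREAT, 0644) = -1 EACCES (Permission denied)',
]

outputs = written_paths(SAMPLE_TRACE)
```

A real tracer would also need to resolve relative paths against the traced process's working directory and handle other calls that create or rename files; those cases are left out of this sketch.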