PRISM2015
Latest revision as of 18:50, 8 June 2015
Unified Cache: A Case for Low-Latency Communication
Khalid Al-Hawaj, Simone Campanoni, Gu-Yeon Wei, David Brooks
International Workshop on Parallelism in Mobile Platforms (PRISM), June 2015
Increasing computational demand on mobile devices calls for energy-friendly solutions for accelerating single programs. In the multicore era, thread-level parallelism (TLP) can accelerate single-threaded programs without requiring power-hungry cores. HELIX-RC, a recently proposed co-design between the HELIX parallelizing compiler and its target architecture, shows that substantial TLP can be extracted from loops with small bodies by optimizing core-to-core communication. Previously, the effectiveness of the HELIX-RC approach has been demonstrated through simulation. In this paper, we evaluate a HELIX-RC-like solution on a real platform.
We have developed a simplified version of the HELIX-RC architecture that we call unified cache, and we have implemented it on an FPGA board. Our design augments a multicore platform with a simplified ring cache—the architectural component of the HELIX-RC co-design. With the aid of microbenchmarks, our FPGA prototype confirms the HELIX-RC findings.
After describing both the ring cache and the parallel code generated by the HELIX compiler, we sketch the design of the unified cache and we evaluate its implementation on a Xilinx VC707 FPGA board.
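To illustrate the kind of parallel loop the abstract refers to, the following is a minimal sketch, not code produced by the HELIX compiler. It assumes the general HELIX execution model: loop iterations are assigned round-robin to cores, code with loop-carried dependences runs in a sequential segment that executes in iteration order, and the rest of each iteration runs in parallel. The function name `helix_style_loop`, the running-sum workload, and the use of Python threads and events as stand-ins for cores and inter-core signals are all assumptions made for illustration. Python threads do not provide real speedup here; the sketch only shows the ordering and the per-iteration handoff, which is the core-to-core communication that becomes critical when loop bodies are small.

```python
import threading


def helix_style_loop(data, num_workers=4):
    """Compute running sums of squares with round-robin iteration assignment.

    Hypothetical illustration: each worker thread stands in for a core.
    """
    n = len(data)
    # may_enter[i] is set once iteration i may run its sequential segment.
    may_enter = [threading.Event() for _ in range(n + 1)]
    may_enter[0].set()  # iteration 0 has no predecessor

    squares = [0] * n
    prefix = [0] * n
    state = {"total": 0}  # loop-carried value shared across iterations

    def worker(worker_id):
        # Iterations worker_id, worker_id + num_workers, ... run on this worker.
        for i in range(worker_id, n, num_workers):
            # Parallel segment: depends only on this iteration's input.
            squares[i] = data[i] * data[i]

            # Sequential segment: wait for iteration i - 1, update the
            # loop-carried value, then signal iteration i + 1.
            may_enter[i].wait()
            state["total"] += squares[i]
            prefix[i] = state["total"]
            may_enter[i + 1].set()

    threads = [
        threading.Thread(target=worker, args=(w,)) for w in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return prefix


if __name__ == "__main__":
    print(helix_style_loop([1, 2, 3, 4, 5], num_workers=3))
```

In this sketch every iteration performs one wait and one signal, so with a short parallel segment the handoff cost dominates the iteration time. That is the situation in which low-latency communication between cores matters most.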
[ Paper ]