Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays and processor caches reduce access latency without explicit programmer instructions.

LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent, memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management. Two uses of FPGA scratchpads are analyzed: buffer management in an H.264 decoder and memory management within a processor microarchitecture timing model.

Programmers are turning to radical architectures such as reconfigurable hardware (FPGAs) to achieve performance. But such systems, programmed at a very low level in languages with impoverished abstractions, are orders of magnitude more complex to use than conventional CPUs. The continued exponential increase in transistors, combined with the desire to implement ever more sophisticated algorithms, makes it imperative that such systems be programmed at much higher levels of abstraction. One of the fundamental high-level language features is automatic memory management in the form of garbage collection.

We present the first implementation of a complete garbage collector in hardware (as opposed to previous "hardware-assist" techniques), using an FPGA and its on-chip memory. Using a completely concurrent snapshot algorithm, it provides single-cycle access to the heap, and never stalls the mutator for even a single cycle, achieving a deterministic mutator utilization (MMU) of 100%. We have synthesized the collector to hardware and show that it never consumes more than 1% of the logic resources of a high-end FPGA. For comparison we also implemented explicit (malloc/free) memory management, and show that real-time collection is about 4% to 17% slower than malloc, with comparable energy consumption. Surprisingly, in hardware real-time collection is superior to stop-the-world collection on every performance axis, and even for stressful micro-benchmarks can achieve 100% MMU with heaps as small as 1.01 to 1.4 times the absolute minimum.

FPGAs are known to be very effective at accelerating certain classes of algorithms. High-level abstractions separate algorithm design from platform implementation, allowing programmers to focus on algorithms while building increasingly complex systems. This separation also provides system programmers and compilers an opportunity to optimize platform services for each application. In FPGAs, this platform-level malleability extends to the memory system: unlike general-purpose processors, in which memory hardware is fixed at design time, the capacity, associativity, and topology of FPGA memory systems may all be tuned to improve application performance. Since application kernels often use few memory resources, substantial memory capacity may be available to the platform for use on behalf of the user program. In this work, we perform an initial exploration of methods for automating the construction of these application-specific memory hierarchies. Although exploiting spare resources can be beneficial, naïvely consuming all memory resources may cause frequency degradation. To relieve timing pressure in large BRAM structures, we provide microarchitectural techniques to trade memory latency for design frequency. We demonstrate, by examining both hand-assembled and HLS-compiled benchmarks, that our application-optimized memory system can improve pre-existing application runtime by 25% on average.
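The plug-in-replacement property described in the scratchpad abstract can be illustrated with a small software model. This is a hypothetical sketch, not the LEAP API: a kernel written against a common read/write interface runs unchanged whether it is backed by a small on-die RAM block or by a cached scratchpad over a large backing store.

```python
# Illustrative model of the scratchpad design idea: a scratchpad
# exposes the same read/write interface as an on-die RAM block, so
# one can substitute for the other. All names are hypothetical --
# this is not the LEAP API.

class BlockRAM:
    """Small, fast on-die memory (models an FPGA RAM block)."""
    def __init__(self, size):
        self.data = [0] * size
    def read(self, addr):
        return self.data[addr]
    def write(self, addr, value):
        self.data[addr] = value

class Scratchpad:
    """Large virtualized array in a backing store, fronted by a small
    cache -- same interface as BlockRAM, so it is a drop-in swap."""
    def __init__(self, size, cache_lines=256):
        self.size = size                  # capacity of the virtual array
        self.backing = {}                 # models off-chip storage
        self.cache = {}                   # addr -> value
        self.cache_lines = cache_lines
    def read(self, addr):
        if addr not in self.cache:
            if len(self.cache) >= self.cache_lines:
                self.cache.pop(next(iter(self.cache)))  # simple FIFO eviction
            self.cache[addr] = self.backing.get(addr, 0)
        return self.cache[addr]
    def write(self, addr, value):
        self.backing[addr] = value
        self.cache[addr] = value

def kernel(mem):
    # Written against the common interface; the kernel does not care
    # which implementation backs it.
    for i in range(8):
        mem.write(i, i * i)
    return sum(mem.read(i) for i in range(8))

assert kernel(BlockRAM(1024)) == kernel(Scratchpad(1 << 20))  # both 140
```

Because both classes honor the same interface, swapping a RAM block for a scratchpad requires no change to the kernel, which is the productivity point the abstract makes.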
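The 100% MMU figure in the garbage-collection abstract refers to mutator utilization: the fraction of any time window in which the application (mutator) runs rather than waiting on the collector. A minimal sketch of the standard minimum-mutator-utilization computation, assuming a trace of collector pauses (illustrative code, not from the paper):

```python
# Minimal model of minimum mutator utilization (MMU) over a pause
# trace. Illustrative only -- not code from the paper. MMU(w) is the
# worst-case fraction of any window of length w in which the mutator
# runs (i.e., is not stalled by the collector).

def mmu(pauses, total_time, window, steps=1000):
    """pauses: list of (start, duration) collector stalls, in seconds."""
    worst = 1.0
    step = (total_time - window) / steps
    for i in range(steps + 1):          # coarse scan of window positions
        t = i * step
        stalled = 0.0
        for start, dur in pauses:
            lo, hi = max(t, start), min(t + window, start + dur)
            if hi > lo:
                stalled += hi - lo      # overlap of pause with window
        worst = min(worst, 1.0 - stalled / window)
    return worst

# A collector that never stalls the mutator yields 100% MMU:
assert mmu([], total_time=10.0, window=1.0) == 1.0
# A stop-the-world pause of 0.2 s caps MMU at ~80% for 1 s windows:
assert abs(mmu([(5.0, 0.2)], 10.0, 1.0) - 0.8) < 1e-6
```

A collector that never stalls the mutator, as the abstract claims for the hardware collector, makes the stalled term identically zero, so MMU is 100% for every window size.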
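The latency-for-frequency trade mentioned in the memory-hierarchy abstract can be made concrete with a back-of-the-envelope model: pipelining a large BRAM structure adds cycles of access latency but shortens the critical path, raising the achievable clock. The `runtime_ns` function, its `overlap` parameter, and all numbers below are illustrative assumptions, not results or methods from the paper.

```python
# Back-of-the-envelope model of trading memory latency for design
# frequency. Hypothetical numbers: pipelined reads take more cycles,
# but the shorter critical path allows a faster clock.

def runtime_ns(accesses, latency_cycles, fmax_mhz, overlap=0.9):
    """Crude runtime model: 'overlap' is the fraction of access
    latency hidden by pipelining / outstanding requests."""
    cycle_ns = 1000.0 / fmax_mhz
    exposed = latency_cycles * (1.0 - overlap)   # visible stall per access
    return accesses * (1.0 + exposed) * cycle_ns

unpiped = runtime_ns(1_000_000, latency_cycles=1, fmax_mhz=100)
piped   = runtime_ns(1_000_000, latency_cycles=3, fmax_mhz=150)

# Despite 3-cycle reads, the higher clock wins when most of the
# added latency is hidden:
assert piped < unpiped
```

The model only captures the qualitative point: extra register stages pay off when the frequency gain outweighs the exposed fraction of the added latency, which is why naïvely building the largest possible BRAM structure can degrade overall performance.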