Here are the results of porting the STREAM memory-bandwidth benchmark to Nallatech H101 hardware using the DIME-C compiler tools. FPGAsMaxwellDIMEC describes the actual compilation process.
The timings for the hardware running at 100MHz are:
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:           2948.7014     0.0055     0.0054     0.0055
Scale:          2954.2701     0.0055     0.0054     0.0056
Add:            4420.7964     0.0055     0.0054     0.0055
Triad:          4420.7297     0.0055     0.0054     0.0055
which compares reasonably well (roughly 63% to 83% of the rate, depending on the function) with the software version running purely on the 2.8GHz Xeons:
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:           4691.9527     0.0034     0.0034     0.0034
Scale:          4412.6311     0.0036     0.0036     0.0037
Add:            5321.5477     0.0045     0.0045     0.0045
Triad:          5388.3006     0.0045     0.0045     0.0045
The Nallatech hardware results in Phase I on mini looked like this:
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:           2319.4962     0.0069     0.0069     0.0069
Scale:            67.1885     0.2382     0.2381     0.2382
Add:              79.8295     0.3007     0.3006     0.3007
Triad:            62.8453     0.3819     0.3819     0.3820
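Compared with Phase I, Copy is about 1.3 times faster on the H101, while Scale, Add and Triad are roughly 44 to 70 times faster. Some of this increase is probably due to the H101 having 4 banks of SRAM: we were able to put each of the 3 STREAM arrays in its own memory bank, which increases the available bandwidth. It has also simplified the inner loop, which may mean DIME-C is able to pipeline it better.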
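For reference, the four STREAM kernels and the way the Rate column is derived can be sketched in plain C as below. This is a minimal illustration of the standard benchmark loops, not the DIME-C source used for the port: the array length STREAM_N, the function names and the helper stream_rate_mb_s are all assumptions made for the example. Each loop touches at most the three arrays a, b and c, which is what makes a one-array-per-bank layout possible.

```c
#include <stddef.h>

/* Illustrative array length only; not the size used for the runs above. */
#define STREAM_N 1000

/* The three STREAM arrays. On the H101 each one would sit in its own
   SRAM bank; here they are ordinary globals. */
double a[STREAM_N], b[STREAM_N], c[STREAM_N];

/* Copy: one read (a), one write (c) per element. */
void stream_copy(void)
{
    for (size_t j = 0; j < STREAM_N; j++)
        c[j] = a[j];
}

/* Scale: one read (c), one write (b) per element. */
void stream_scale(double scalar)
{
    for (size_t j = 0; j < STREAM_N; j++)
        b[j] = scalar * c[j];
}

/* Add: two reads (a, b), one write (c) per element. */
void stream_add(void)
{
    for (size_t j = 0; j < STREAM_N; j++)
        c[j] = a[j] + b[j];
}

/* Triad: two reads (b, c), one write (a) per element. */
void stream_triad(double scalar)
{
    for (size_t j = 0; j < STREAM_N; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Rate in MB/s as STREAM reports it: bytes moved divided by elapsed time.
   arrays_touched is 2 for Copy and Scale, 3 for Add and Triad. */
double stream_rate_mb_s(int arrays_touched, size_t n, double seconds)
{
    double bytes = (double)arrays_touched * (double)sizeof(double) * (double)n;
    return 1.0e-6 * bytes / seconds;
}
```

Because Add and Triad move three arrays where Copy and Scale move two, equal run times give a rate 1.5 times higher. That matches the H101 table, where all four functions take about the same time and the Add and Triad rates are about 1.5 times the Copy and Scale rates.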