Bad GPU performance
Posted: Mon Apr 01, 2019 2:04 pm
Hello everyone!
I have a NanoPC-T4 running Ubuntu Bionic.
We are trying to run the TVM OpenCL-accelerated framework (link). We have spent roughly a couple of weeks trying to find out why the NanoPC-T4 is so slow with TVM (and with PlaidML as well). It is actually 2x to 3x slower than the Firefly-RK3399 for computations on the GPU.
We finally found the clpeak tool, which benchmarks OpenCL hardware and reports GFLOPS and memory bandwidth. Here is the result on the T4:
Code (clpeak output, NanoPC-T4):
Platform: ARM Platform
Device: Mali-T860
Driver version : 1.2 (Linux ARM64)
Compute units : 4
Clock frequency : 800 MHz
Global memory bandwidth (GBPS)
float : 3.76 | 3.73
float2 : 6.15 | 5.99
float4 : 7.26 | 6.98
float8 : 6.00 | 5.82
float16 : 5.30 | 5.14
Single-precision compute (GFLOPS)
float : 23.98 | 24.63
float2 : 45.76 | 46.73
float4 : 45.23 | 46.33
float8 : 40.22 | 41.17
float16 : 46.41 | 46.45
half-precision compute (GFLOPS)
half : 23.09 | 23.12
half2 : 48.87 | 49.25
half4 : 95.32 | 95.45
half8 : 93.11 | 93.32
half16 : 87.80 | 89.06
Double-precision compute (GFLOPS)
double : 11.59 | 11.62
double2 : 3.27 | 3.50
double4 : 20.35 | 20.71
double8 : 20.01 | 20.40
double16 : 19.77 | 19.95
Integer compute (GIOPS)
int : 22.66 | 18.44
int2 : 47.67 | 31.38
int4 : 46.97 | 30.97
int8 : 34.30 | 23.05
int16 : 47.66 | 30.71
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 1.04 | 0.76
enqueueReadBuffer : 1.03 | 0.85
enqueueMapBuffer(for read) : 4.70 | 3.96
memcpy from mapped ptr : 1.50 | 1.68
enqueueUnmap(after write) : 4.75 | 3.73
memcpy to mapped ptr : 1.68 | 1.73
Kernel launch latency : 96.29 us | 110.73 us
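A side note on the "Clock frequency" line in a report like the one above: as far as I know, clpeak prints the clock the OpenCL driver advertises, so it can help to cross-check it against what the kernel reports. Below is a minimal sketch, assuming the GPU driver on the board registers a devfreq device under /sys/class/devfreq. That directory is the standard Linux devfreq location, but whether the Mali GPU shows up there depends on the kernel build. `read_devfreq` is a hypothetical helper name, not part of any tool mentioned here.

```python
from pathlib import Path

# Attributes exposed by the Linux devfreq sysfs interface.
ATTRS = ("cur_freq", "min_freq", "max_freq", "governor")


def read_devfreq(root="/sys/class/devfreq"):
    """Return {device_name: {attribute: value}} for each devfreq device.

    Assumption: the GPU is registered as a devfreq device. If the
    directory does not exist, an empty dict is returned.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    result = {}
    for dev in sorted(base.iterdir()):
        info = {}
        for attr in ATTRS:
            try:
                info[attr] = (dev / attr).read_text().strip()
            except OSError:
                # Attribute missing or not readable: skip it.
                continue
        result[dev.name] = info
    return result


if __name__ == "__main__":
    for name, info in read_devfreq().items():
        print(name, info)
```

Running this on both boards while the benchmark is active would show whether the current frequency and governor really differ between them.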
We then compared the T4 benchmark with the Firefly-RK3399. It reports about the same GFLOPS across the different types, but enqueueMapBuffer and enqueueUnmap are a lot faster than on the T4. What is really important is that this RK3399 was locked to 200 MHz.
Why is the T4 so much worse? What can I do about it?
Code (clpeak output, Firefly-RK3399):
Platform: ARM Platform
Device: Mali-T860
Driver version : 1.2 (Linux ARM)
Compute units : 4
Clock frequency : 200 MHz
Global memory bandwidth (GBPS)
float : 3.17
float2 : 6.07
float4 : 7.88
float8 : 6.55
float16 : 6.26
Single-precision compute (GFLOPS)
float : 25.09
float2 : 45.51
float4 : 46.22
float8 : 41.67
float16 : 46.40
half-precision compute (GFLOPS)
half : 23.11
half2 : 50.19
half4 : 98.30
half8 : 93.48
half16 : 93.94
Double-precision compute (GFLOPS)
double : 3.59
double2 : 3.30
double4 : 20.97
double8 : 20.65
double16 : 20.39
Integer compute (GIOPS)
int : 20.15
int2 : 49.64
int4 : 47.12
int8 : 49.17
int16 : 41.47
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 4.61
enqueueReadBuffer : 2.60
enqueueMapBuffer(for read) : 475.11
memcpy from mapped ptr : 2.50
enqueueUnmap(after write) : 2790.39
memcpy to mapped ptr : 1.92
Kernel launch latency : 190.64 us
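To make the gap easier to see, here is a small sketch that puts the transfer bandwidth figures from the two reports side by side. The numbers are copied by hand from the outputs in this post (for the NanoPC-T4 only the first of the two printed columns is used), and `speedups` is just an illustrative helper name.

```python
# Transfer bandwidth figures (GBPS) copied from the two clpeak reports.
# NanoPC-T4: first of the two printed columns.
T4 = {
    "enqueueWriteBuffer": 1.04,
    "enqueueReadBuffer": 1.03,
    "enqueueMapBuffer(for read)": 4.70,
    "memcpy from mapped ptr": 1.50,
    "enqueueUnmap(after write)": 4.75,
    "memcpy to mapped ptr": 1.68,
}

FIREFLY = {
    "enqueueWriteBuffer": 4.61,
    "enqueueReadBuffer": 2.60,
    "enqueueMapBuffer(for read)": 475.11,
    "memcpy from mapped ptr": 2.50,
    "enqueueUnmap(after write)": 2790.39,
    "memcpy to mapped ptr": 1.92,
}


def speedups(slow, fast):
    """Return the fast/slow ratio for every test present in both reports."""
    return {name: fast[name] / slow[name] for name in slow if name in fast}


if __name__ == "__main__":
    for name, ratio in speedups(T4, FIREFLY).items():
        print(f"{name:28s} {ratio:8.1f}x")
```

The plain copies (write, read, memcpy) differ by small factors, while map and unmap differ by two orders of magnitude. Figures of hundreds of GBPS are far above what the DRAM can deliver, so they probably mean that the map on the Firefly does not copy any data. That is an interpretation of the numbers, not something the reports prove.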