Tuesday, 20 August 2013

Running several streams (instead of threads/blocks) in parallel

I have a kernel that I want to launch with a "1 block x 32 threads" configuration. To increase parallelism, I want to use several streams instead of batching the work into something bigger than "1 block x 32 threads". I use the GPU in a program where data arrives from the network, and I don't want to wait until a bigger "work package" is available. The code looks like this:
for(i=0; i < 15; i++) {
- copy data Host -> GPU [cudaMemcpyAsync(.., stream i)]
- run kernel(stream i)
- copy data GPU -> Host [cudaMemcpyAsync(.., stream i)]
}
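
The loop above can be sketched as a minimal CUDA program. Names like myKernel and CHUNK are placeholders, and the pinned host allocation (cudaMallocHost) is an assumption I am adding: cudaMemcpyAsync only overlaps with other work when the host buffers are page-locked, so pageable memory alone can serialize the streams.

```cuda
// Sketch: 15 independent copy -> kernel -> copy pipelines, one per stream.
// myKernel and CHUNK are hypothetical stand-ins for the real work.
#include <cuda_runtime.h>

#define N_STREAMS 15
#define N_THREADS 32
#define CHUNK     (N_THREADS * sizeof(float))

__global__ void myKernel(float *data) {
    int i = threadIdx.x;
    data[i] *= 2.0f;            // placeholder work
}

int main(void) {
    cudaStream_t stream[N_STREAMS];
    float *h_buf[N_STREAMS], *d_buf[N_STREAMS];

    for (int i = 0; i < N_STREAMS; i++) {
        cudaStreamCreate(&stream[i]);
        cudaMallocHost(&h_buf[i], CHUNK);  // pinned memory: required for
                                           // truly asynchronous copies
        cudaMalloc(&d_buf[i], CHUNK);
    }

    for (int i = 0; i < N_STREAMS; i++) {
        cudaMemcpyAsync(d_buf[i], h_buf[i], CHUNK,
                        cudaMemcpyHostToDevice, stream[i]);
        myKernel<<<1, N_THREADS, 0, stream[i]>>>(d_buf[i]);
        cudaMemcpyAsync(h_buf[i], d_buf[i], CHUNK,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaDeviceSynchronize();   // wait for all streams to finish

    for (int i = 0; i < N_STREAMS; i++) {
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_buf[i]);
        cudaFree(d_buf[i]);
    }
    return 0;
}
```

Even with this structure, how many streams actually overlap depends on the device's copy-engine count and kernel-concurrency limits, which is what the profiler observations below are about.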
The real code is much more complex, but I want to keep it simple here (the body of the for-loop usually runs in different CPU threads).
The code works, but the streams don't run concurrently as expected. The GTX 480 has 15 SMs, each with 32 shader processors. I expected that launching the kernel 15 times would let all 15 streams run in parallel, but that is not the case. The NVIDIA Visual Profiler shows at most 5 streams running in parallel, and often only one stream runs. The performance is really bad.
I get the best results with a "64 blocks x 1024 threads" configuration. If I instead use a "32 blocks x 1024 threads" configuration with two streams, the streams execute one after the other and performance drops. I am using CUDA Toolkit 5.5.
Can somebody explain why this is the case and give me some background information? Should it work better on newer GPUs? What is the best way to use the GPU in time-critical applications where you don't want to buffer data? This is probably not possible, but I am looking for techniques that bring me closer to a solution.
