CuBERT not utilizing all threads with multi-CPU #68
Comments
CuBERT seems to be running on all threads of all CPUs now; it was apparently an issue with the KMP flags. However, it is actually slower in the benchmark when utilizing all threads, so it seems I have to experiment with the flags a bit to get it running properly. CuBERT is still slower than in the benchmarks, though, when I use it in my other application with multi-processing.
What CPU do you use? Do you run cuBERT inside Docker with a limited CPU quota? Does the caller have many threads and call cuBERT concurrently? Could you provide the running times of benchmark_tf.cpp and benchmark_cu.cpp?
I am running both TF-BERT and CuBERT in Python at the moment, because my server is also implemented in Python. I included TF-BERT in the Python benchmark script by loading the frozen graph into a TF session. Here are the results for seq_len=32, bsz=128:
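(For context, the frozen-graph loading mentioned above looks roughly like this. A minimal sketch assuming TF 1.x; the file path and tensor names are illustrative, not taken from the actual benchmark script.)

```python
import numpy as np
import tensorflow as tf  # assumes TF 1.x, as used around the time of this issue

# Minimal sketch of loading a frozen TF-BERT graph into a session.
# The file path and tensor names are illustrative.
with tf.gfile.GFile("bert_frozen.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

sess = tf.Session(graph=graph)

# Dummy batch matching the benchmark shape (bsz=128, seq_len=32).
batch_size, seq_len = 128, 32
feed = {
    "input_ids:0":   np.zeros((batch_size, seq_len), dtype=np.int32),
    "input_mask:0":  np.ones((batch_size, seq_len), dtype=np.int32),
    "segment_ids:0": np.zeros((batch_size, seq_len), dtype=np.int32),
}
pooled = sess.run("bert/pooler/dense/Tanh:0", feed_dict=feed)
```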
So in this case, CuBERT is indeed faster than the TF version. I am running the test on 2 * Intel® Xeon® E5-2637 v4 processors (16 threads in total).

The problem I have right now seems to be related to thread scheduling. When I set KMP_AFFINITY=compact in my Python server (I run my BERT worker in a separate Python process), inference gets terribly slow and CuBERT seems to utilize only 1 thread (out of the 16 available). When I set KMP_AFFINITY=none, CuBERT actually utilizes all available threads, but it is still slower than TF-BERT, so the thread scheduling strategy probably affects performance significantly.

I am using your suggested flags: KMP_BLOCKTIME=0, KMP_AFFINITY=granularity=fine,verbose,compact,1,0, MKL_NUM_THREADS=16. I would really appreciate your input.
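(For completeness, here is roughly how these flags have to be set from Python. They are read when the OpenMP/MKL runtime initializes, so they must be exported before the library is first imported; the `cuBERT` import name is an assumption for whatever Python binding is in use.)

```python
import os

# KMP/MKL flags are read when the OpenMP runtime initializes, so they
# must be set before the first import that pulls in MKL.
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["MKL_NUM_THREADS"] = "16"
os.environ["OMP_NUM_THREADS"] = "16"

import cuBERT  # assumed import name for the Python binding; must come last
```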
"Do you run cuBERT inside Docker with a limited CPU quota?" No, I am not running inside a Docker container; I use the same conda environment and the same CPU server for both benchmarking and the inference server.

"Does the caller have many threads and call cuBERT concurrently?" At the moment I am testing without concurrency.
Hi there,

I was running cuBERT_benchmark.py and noticed that CuBERT does not utilize all threads when using multiple CPUs, even when setting MKL_NUM_THREADS and OMP_NUM_THREADS. It seems that only CPU#1 is fully utilized in my case, while CPU#2 is almost idle (see attached image). Is there a reason for this behaviour? For comparison, TF-BERT utilizes all threads of both CPUs.
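(One thing worth ruling out is whether the Python process is even allowed to run on the second socket. A quick Linux-only sketch for inspecting the CPU affinity mask, using only the standard library:)

```python
import os

# Linux-only: which logical CPUs may this process be scheduled on?
# If the mask only lists the cores of one socket, no thread-count
# setting will push work onto the second CPU.
print("affinity mask:", sorted(os.sched_getaffinity(0)))
print("logical CPUs on the machine:", os.cpu_count())
```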
Also, I am trying to use CuBERT in another application where I use multi-processing as well. Is it possible that Python's multiprocessing is interfering with CuBERT's multi-threading? Somehow CuBERT runs slower in that application than TF-BERT (and it utilizes only some threads, quite irregularly), while it is faster when I run the benchmark.
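(The worker setup looks roughly like this. A heavily simplified sketch; the worker function, the queue, and the `cuBERT` import name are illustrative rather than the actual server code.)

```python
import multiprocessing as mp
import os

def bert_worker(task_queue):
    # Set the threading flags in the child, before cuBERT/MKL is first
    # imported, so each worker initializes its own OpenMP runtime
    # instead of inheriting state from the parent process.
    os.environ["OMP_NUM_THREADS"] = "16"
    os.environ["MKL_NUM_THREADS"] = "16"
    import cuBERT  # assumed import name for the Python binding
    while True:
        batch = task_queue.get()
        if batch is None:
            break
        # ... run cuBERT inference on the batch here ...

if __name__ == "__main__":
    # "spawn" starts the worker in a fresh interpreter, which avoids
    # fork-related issues with already-initialized threaded libraries.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    worker = ctx.Process(target=bert_worker, args=(queue,))
    worker.start()
    queue.put(None)  # shut the worker down in this toy example
    worker.join()
```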
Thanks for your help