Performance needs to be evaluated #16
Thanks! Just as a reference, what is the overall time for running these sentences through CluProcessor? This is not exactly an apples-to-apples comparison because CluProcessor is doing more things, but it should be close. |
These are also without native BLAS. Times for Linux server, first 100 sentences, serial:
Same, but parallel on 32 processors:
Times for Linux server, first 10000 sentences, serial:
Same, but parallel on 32 processors:
Parallelism isn't helping a great deal. |
The updated BLAS, which issues a message that the native version is being used, doesn't make much of a difference... Times for Linux server, first 10000 sentences, serial:
Same, but parallel on 32 processors:
|
Good to know! |
These values are for processors: the time to mkDocument from existing tokens and then to annotate. Times for Linux server, first 100 sentences, serial:
Times for Linux server, all 10000 sentences, serial:
|
BLAS was helping then. In order to get the Scala 2.11 version to build, I lowered the Breeze version, so it may not have the native library support. I'll check on that and then only lower it when necessary. |
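For reference, here is a build.sbt sketch of how that version split could look; the version numbers and conditional layout are illustrative, not necessarily what the project does, and the native BLAS bindings come from the separate breeze-natives artifact:

```scala
// Hypothetical build.sbt fragment: use an older Breeze line for Scala 2.11 and
// only pull in breeze-natives (the native BLAS bindings) where it is wanted.
libraryDependencies ++= {
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 11)) =>
      Seq("org.scalanlp" %% "breeze" % "0.13.2")        // an older line that still supports 2.11
    case _ =>
      Seq(
        "org.scalanlp" %% "breeze"         % "2.1.0",
        "org.scalanlp" %% "breeze-natives" % "2.1.0"    // optional native backend
      )
  }
}
```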
On CluProcessor: nice to see the same ballpark numbers. |
@MihaiSurdeanu, this measurement is being taken almost a year later. This is for 100 sentences running on a server with 32 processors for the parallel case. The balaur branch is 10 times slower for serial, but parallelizes better. I don't see direct comparisons above, so it's hard to tell. I'm going to go back a few commits and see if something changes.
|
This is WAY slower than it used to be... |
Interesting... I don't think I changed much there... It would be good to revert to an older version to see when things began to change. |
Here are more timing measurements. I do not know why it was taking 1:20 min. before and will look into that. I did update some BLAS-related packages in the meantime, and maybe they can all talk to each other now. In general, Breeze is hurting us fairly significantly in the single-threaded case, but it can make up for it when multi-threaded. Also, the NativeBLAS, which is only supported on Linux anyway, is not helping and generally hurts. Unfortunately, there is no standard way to disable it in Breeze; I had to hack the netlib jar file. We are not using much of Breeze's functionality, and I wonder whether all the conversion to DenseMatrix and DenseVector is worth it.
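To make the concern concrete, here is a minimal sketch of the kind of wrap/unwrap that happens when plain arrays are routed through Breeze just for a matrix-vector product. The names and sizes are illustrative, not the encoder's actual code:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

object BreezeConversionSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical raw outputs, e.g. weights and an embedding from the encoder.
    val rawWeights: Array[Float] = Array.fill(50 * 100)(0.1f)
    val rawInput: Array[Float] = Array.fill(100)(1.0f)

    // Wrap the raw arrays in Breeze structures (DenseMatrix is column-major).
    val weights = new DenseMatrix(50, 100, rawWeights)
    val input = DenseVector(rawInput)

    // The actual math is a single matrix-vector product...
    val output: DenseVector[Float] = weights * input

    // ...and the result is unwrapped again for downstream consumers.
    val rawOutput: Array[Float] = output.toArray
    println(rawOutput.length) // 50
  }
}
```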
|
This is very very informative! Thank you!! |
Unless something is done, they will probably have Linux, balaur, T, because it can't readily be made F. I'll check on the split times. |
The slow times weren't all imagined. The new balaur version runs very slowly on our servers, particularly when there is native support. It's 8x slower than the current version.
|
This was run on aorus, Linux, balaur, native blas, serial. Deps Label takes 88% of the time.
|
Oh, I finally get it! Thanks for this! The old tests you did were performed before the stable version of Deps Label was deployed. The current one has higher overhead because it's the only one that runs in what I called "dual mode", i.e., it pairs the embeddings of both tokens involved in a syntactic dependency. This involves a vector concatenation plus a linear layer that operates on the resulting vector, which is twice as large as in the other case. Before I start messing with it, can you please check what the cost of the vector concatenation is? That is happening in this method: Thanks! |
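For readers following along, here is a rough sketch of that dual mode, assuming Breeze; the names and sizes are illustrative and this is not the actual LinearLayer code:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

object DualModeSketch {
  val hiddenSize = 768 // assumed encoder width
  val numLabels = 50   // roughly the Deps Label class count

  // Weight matrix for dual mode: numLabels x (2 * hiddenSize).
  val dualWeights = DenseMatrix.zeros[Float](numLabels, 2 * hiddenSize)

  def scoreDependency(modifier: DenseVector[Float], head: DenseVector[Float]): DenseVector[Float] = {
    // Vector concatenation: length hiddenSize + hiddenSize.
    val paired = DenseVector.vertcat(modifier, head)
    // Linear layer over the doubled-size input; this is the operation under suspicion.
    dualWeights * paired
  }
}
```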
The concatenation doesn't seem to take all that long:
|
Hmm, then I am not sure why Deps Label takes so much longer... Perhaps because the input to its linear layer is a vector that's twice as large as for Deps Head? As an aside, it can't be the classifier's output, because Deps Head has more classes than Deps Label (~100 vs. ~50). Can you please time its forward() pass here: https://github.com/clulab/scala-transformers/blob/main/encoder/src/main/scala/org/clulab/scala_transformers/encoder/LinearLayer.scala#L183 |
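A minimal way to get that timing (a generic System.nanoTime wrapper, not project code) could be:

```scala
object TimeIt {
  // Run the block, print how long it took, and return its result.
  def time[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label%-12s $elapsedMs%10.3f ms")
    result
  }
}

// Hypothetical usage around the call in question:
//   val scores = TimeIt.time("forward") { linearLayer.forward(inputBatch) }
```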
Isn't this the n^2 code that is looping through all head candidates in LinearLayer.predictDualWithScores? Since the native code is running slower than the Scala code, it may make sense to look at other Java/Scala implementations. There are some benchmarks here: https://lessthanoptimal.github.io/Java-Matrix-Benchmark/. |
This is with the Java implementation, so the overall time is lower.
|
The implementation in Deps Label is linear because for each word (as the modifier) we use the head predicted by Deps Head. Thanks to this, we explore just one head for each word in the sentence. So, linear. |
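In other words, something along these lines (an illustrative sketch, not the actual predictDual code):

```scala
import breeze.linalg.DenseVector

object LinearLabelingSketch {
  // Each token (the modifier) is paired only with the single head already
  // predicted by Deps Head, so labeling does one dual forward per token,
  // i.e., O(n) in sentence length rather than O(n^2).
  def labelDependencies(
      embeddings: IndexedSeq[DenseVector[Float]],
      predictedHeads: IndexedSeq[Int], // one head index per token, from Deps Head
      scoreDependency: (DenseVector[Float], DenseVector[Float]) => DenseVector[Float]
  ): IndexedSeq[DenseVector[Float]] =
    embeddings.indices.map { modifierIndex =>
      val headIndex = predictedHeads(modifierIndex)
      scoreDependency(embeddings(modifierIndex), embeddings(headIndex))
    }
}
```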
I will do some surgery on the Deps Label model soon. |
These times are with roberta, first with JavaBLAS (faster) and then with NativeBLAS (slower). That dual forward is taking the most time, which is to be expected with example values of headCandidatesPerSentence at 72, 47, 18, 47, 24, 34, 13, 48. For JavaBLAS, the ratio of Deps Label to Deps Head is 45, which seems reasonable. For NativeBLAS, it is 122, which might mean that the slow native version is doing more work in Deps Label than in Deps Head. JavaBLAS
NativeBLAS
|
It looks like ONNX itself can do math. |
Maybe... ONNX is doing all sorts of other smart things in the background, and I wonder what the overhead of those is...
In any case, I suspect Breeze pays a high cost for Deps Label due to matrix operations where one dimension is twice as large as in Deps Head (because everything else seems the same). The good news is that Deps Label performs nearly as well when using just one of the embeddings (that of the modifier). I will train a model with this setting and see if it makes a difference. |
@myedibleenso, in case you are interested in this discussion. |
The OnnxTensors in Java do not seem to have any mathematical operations associated with them. I did not run into any library that does offer them. However, in the branch kwalcock/math, I have abstracted all the math that Breeze supplies and reimplemented it once with EJML for comparison. The runtimes are favorable, at least on Jenny, where today
On other machines that don't offer native support, they were about even. The major problem is that the native implementation shipped with Breeze does not run well on lots of computers. There isn't a good way to disable it or substitute a locally optimized version. I would like to try one other library to be sure. The library may be reused for the clustering in habitus. |
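As a sketch of the abstraction idea (the actual kwalcock/math branch may use different names and cover more operations), the handful of operations the encoder needs can sit behind a small trait with an EJML-backed implementation:

```scala
import org.ejml.simple.SimpleMatrix

// Minimal interface over the matrix operations actually used.
trait MathMatrix {
  def multiply(that: MathMatrix): MathMatrix
  def get(row: Int, col: Int): Double
}

// EJML-backed implementation; a Breeze-backed one would implement the same trait.
class EjmlMatrix(val underlying: SimpleMatrix) extends MathMatrix {
  def multiply(that: MathMatrix): MathMatrix = that match {
    case other: EjmlMatrix => new EjmlMatrix(underlying.mult(other.underlying))
    case _ => throw new IllegalArgumentException("Mixed matrix backends")
  }
  def get(row: Int, col: Int): Double = underlying.get(row, col)
}

object EjmlMatrix {
  def apply(data: Array[Array[Double]]): EjmlMatrix = new EjmlMatrix(new SimpleMatrix(data))
}
```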
Nice! Thanks @kwalcock! |
I started training a model that uses sum instead of concat for Deps Label. If our hypothesis is correct, this should speed it up. I'll send the model in a couple of days. |
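Roughly the intended change (an illustrative sketch, not the trained model's code): summing keeps the linear layer input at hiddenSize instead of 2 * hiddenSize, halving the width of the weight matrix:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

object SumInsteadOfConcatSketch {
  val hiddenSize = 768 // assumed encoder width
  val numLabels = 50   // roughly the Deps Label class count

  // Half the columns of the dual-mode weight matrix.
  val weights = DenseMatrix.zeros[Float](numLabels, hiddenSize)

  def scoreDependency(modifier: DenseVector[Float], head: DenseVector[Float]): DenseVector[Float] =
    weights * (modifier + head) // element-wise sum replaces the concatenation
}
```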
Because they are asking about the size of the matrices being multiplied, I note that one sentence that has 42 words and maybe 57 subword tokens involves the following matrix products: |
The above is for predictWithScores. With just predict, the multiplications are the same |
I don't understand where 9128 comes from... Where in the code did you measure this? |
I had the matrix multiplier print out what it was doing and it printed that line 9128 times in a row for the sentence. I can add a breakpoint and check the stack. |
My count was a little off. It was 9120, which is 57 x 160. That's the product of the inputBatch size and headCandidatesPerToken.
The code in question is here: scala-transformers/encoder/src/main/scala/org/clulab/scala_transformers/encoder/LinearLayer.scala, lines 166 to 196 at commit f16b818
|
These are times without native BLAS.
Times for Windows laptop, first 100 sentences, serial:
Same, but parallel on 8 processors. Times other than Elapsed were overlapping times, so they should be divided by approximately 8: