
Performance needs to be evaluated #16

Open
kwalcock opened this issue Nov 21, 2022 · 36 comments

@kwalcock
Member

These are times without native BLAS.

Times for Windows laptop, first 100 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:00:24.913 24913910400
Encoder.forward 0:00:00:23.917 23917464300
Encoder.predict 0 0:00:00:00.379 379726600
Encoder.predict 1 0:00:00:00.478 478235800
Tokenizer 0:00:00:00.097 97104000

Same, but parallel on 8 processors. Times other than Elapsed were overlapping, so they should be divided by approximately 8:

Name Time Nanoseconds
Elapsed 0:00:00:14.390 14390361900
Encoder.forward 0:00:01:31.702 91702800300
Encoder.predict 0 0:00:00:00.707 707846200
Encoder.predict 1 0:00:00:00.993 993069100
Tokenizer 0:00:00:00.407 407733000
@MihaiSurdeanu
Contributor

Thanks!

Just as a reference, what is the overall time for running these sentences through CluProcessor? This is not exactly an apples-to-apples comparison because CluProcessor is doing more things, but it should be close.

@kwalcock
Member Author

These are also without native BLAS.

Times for Linux server, first 100 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:00:04.897 4897058170
Encoder.forward 0:00:00:04.654 4654979102
Encoder.predict 0 0:00:00:00.086 86504564
Encoder.predict 1 0:00:00:00.120 120512364
Tokenizer 0:00:00:00.022 22148964

Same, but parallel on 32 processors:

Name Time Nanoseconds
Elapsed 0:00:00:03.575 3575645530
Encoder.forward 0:00:01:29.494 89494416255
Encoder.predict 0 0:00:00:01.313 1313994450
Encoder.predict 1 0:00:00:00.569 569052836
Tokenizer 0:00:00:00.308 308239218

Times for Linux server, first 10000 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:07:55.954 475954729687
Encoder.forward 0:00:07:41.352 461352168472
Encoder.predict 0 0:00:00:02.448 2448010003
Encoder.predict 1 0:00:00:10.437 10437701680
Tokenizer 0:00:00:01.414 1414654220

Same, but parallel on 32 processors:

Name Time Nanoseconds
Elapsed 0:00:05:39.629 339629263458
Encoder.forward 0:02:39:22.615 9562615599667
Encoder.predict 0 0:00:00:11.108 11108428440
Encoder.predict 1 0:00:00:34.033 34033494825
Tokenizer 0:00:00:06.076 6076358064

Parallelism isn't helping a great deal.

@kwalcock
Member Author

The updated BLAS, which issues a message that the native version is being used, doesn't make much of a difference...

Times for Linux server, first 10000 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:07:58.278 478278524666
Encoder.forward 0:00:07:45.628 465628128141
Encoder.predict 0 0:00:00:02.168 2168806911
Encoder.predict 1 0:00:00:08.761 8761538409
Tokenizer 0:00:00:01.416 1416116462

Same, but parallel on 32 processors:

Name Time Nanoseconds
Elapsed 0:00:05:39.383 339383446418
Encoder.forward 0:02:39:41.831 9581831531735
Encoder.predict 0 0:00:00:09.896 9896565890
Encoder.predict 1 0:00:00:25.055 25055794401
Tokenizer 0:00:00:06.197 6197267482

@MihaiSurdeanu
Contributor

Good to know!
Just FYI: BLAS is only used on the predict * methods.

@kwalcock
Member Author

These values are for processors: the time to mkDocument from existing tokens and then to annotate.

Times for Linux server, first 100 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:00:07.447 7447058360

Times for Linux server, all 10000 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:08:11.268 491268843911

@kwalcock
Member Author

BLAS was helping then. In order to get the Scala 2.11 version to build, I lowered the Breeze version, so it may not have native library support. I'll check on that and then only lower it when necessary.

@MihaiSurdeanu
Contributor

On CluProcessor: nice to see the same ballpark numbers.

@kwalcock
Member Author

kwalcock commented Oct 2, 2023

@MihaiSurdeanu, this measurement is being taken almost a year later. This is for 100 sentences running on a server with 32 processors for the parallel case. The balaur branch is 10 times slower in the serial case, but parallelizes better. I don't see direct comparisons above, so it's hard to tell. I'm going to go back a few commits and see if something changes.

Serial Parallel Branch
1:20.887 6.426 kwalcock/balaur
1:22.299 6.653 kwalcock/balaur
1:21.289 6.360 kwalcock/balaur
7.258 9.395 master
7.415 9.586 master
7.348 9.494 master

@MihaiSurdeanu
Contributor

This is WAY slower than it used to be...
Clearly something happened.

@kwalcock
Member Author

kwalcock commented Oct 3, 2023

It seems to be taking the most time in LinearLayer.forward:

[profiler screenshots: parallel view and hotspot view]

kwalcock reopened this Oct 3, 2023
@MihaiSurdeanu
Contributor

Interesting... I don't think I changed much there... It would be good to revert to an older version to see when things began to change.

@kwalcock
Member Author

kwalcock commented Nov 1, 2023

Here are more timing measurements. I do not know why it was taking 1:20 min. before and will look into that. I did update some BLAS-related packages in the meantime, and maybe they can all talk to each other now.

In general, Breeze is hurting us fairly significantly in the single-threaded case, but it can make up for it when multi-threaded. Also, NativeBLAS, which is only supported on Linux anyway, is not helping and generally hurts. Unfortunately, there is no standard way to disable it in Breeze; I had to hack the netlib jar file. We are not using much of Breeze's functionality, and I wonder whether all the conversion to DenseMatrix and DenseVector is worth it.
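For reference, the kind of conversion in question looks roughly like this (a minimal sketch; the actual code in the encoder may differ, and the array layout is an assumption):

```scala
import breeze.linalg.DenseMatrix

// Minimal sketch of converting an ONNX-style row-major float array into a
// Breeze DenseMatrix; the real conversion in the encoder may differ.
def toDenseMatrix(rows: Array[Array[Float]]): DenseMatrix[Float] = {
  val numRows = rows.length
  val numCols = if (numRows == 0) 0 else rows.head.length
  // tabulate copies every cell, which is the per-sentence overhead being questioned
  DenseMatrix.tabulate[Float](numRows, numCols)((i, j) => rows(i)(j))
}
```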

OSys Branch NativeBLAS Serial (sec.) Parallel (sec.) Notes
Windows master F 13 13 Intel chip, slow
Windows balaur F 16 5 Favorable because master is slow
Linux master F 7 9 AMD chip
Linux balaur T 21 7 How can this be disabled?
Linux balaur F 19 3 Disabling doesn't help much on serial
Mac/Intel master F 11 11 Intel version on ARM
Mac/Intel balaur F 26 10  
Mac/M1 master F 7 7 M1 version on ARM
Mac/M1 balaur F 18 6  

@MihaiSurdeanu
Contributor

This is very very informative! Thank you!!
I suspect the most common setting will be (Linux, balaur, F). In this case, what pct. of the execution stays in the linear layer? Perhaps we should try to improve that.

@kwalcock
Member Author

kwalcock commented Nov 1, 2023

Unless something is done, they will probably have Linux, balaur, T, because it can't readily be made F. I'll check on the split times.

@kwalcock
Member Author

kwalcock commented Nov 1, 2023

The slow times weren't all imagined. The new balaur version runs very slowly on our servers, particularly when there is native support. It's 8x slower than the current version.

Computer OSys Branch NativeBLAS Serial (sec.) Parallel (sec.) Notes
NUC Windows master F 13 13 Intel chip, slow
NUC Windows balaur F 16 5 Perhaps because master is slow
aorus Linux master F 7 9 AMD chip
aorus Linux balaur T 21 7 How can this be disabled?
aorus Linux balaur F 19 3 Disabling doesn't help much on serial
MacMini Mac/Intel master F 11 11 Intel version on ARM
MacMini Mac/Intel balaur F 26 10  
MacMini Mac/M1 master F 7 7 M1 version on ARM
MacMini Mac/M1 balaur F 18 6  
river Linux master F 23 22  
river Linux balaur T 199 35 Very bad serial time!
river Linux balaur F 74 16  
clara Linux master F 12 19 Too many processors
clara Linux balaur T 95 6 Very bad serial time!
clara Linux balaur F 50 3  

@kwalcock
Member Author

kwalcock commented Nov 2, 2023

This was run on aorus, Linux, balaur, native BLAS, serial. Deps Label takes 88% of the time.

Timer Layer Time
Elapsed   0:00:00:23.311
Encoder.forward   0:00:00:02.516
Encoder.predict 0 NER 0:00:00:00.047
Encoder.predict 1 POS 0:00:00:00.049
Encoder.predict 2 Chunking 0:00:00:00.025
Encoder.predict 3 Deps Head 0:00:00:00.169
Encoder.predict 4 Deps Label 0:00:00:20.400
Tokenizer   0:00:00:00.032

@MihaiSurdeanu
Contributor

Oh, I finally get it! Thanks for this!

The old tests you did were performed before the stable version of Deps Label was deployed. The current one has higher overhead because it's the only one that runs in what I called "dual mode", i.e., it pairs the embeddings of both tokens involved in a syntactic dependency. This involves a vector concatenation operation + a linear layer that operates on the resulting vector, which is twice as large as in the other cases.
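For illustration, the dual-mode shape in Breeze terms (a minimal sketch with placeholder dimensions and label count; the real code lives in LinearLayer.scala):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Illustrative sketch only: 768 is the usual encoder hidden size, numLabels is hypothetical.
val modifier  = DenseVector.zeros[Float](768)
val head      = DenseVector.zeros[Float](768)
val pair      = DenseVector.vertcat(modifier, head)        // 1536-dim concatenation
val numLabels = 50
val weights   = DenseMatrix.zeros[Float](numLabels, 1536)  // twice as wide as the other heads
val logits    = weights * pair                             // linear layer on the doubled vector
```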

Before I start messing with it, can you please check what the cost of the vector concatenation is? That is happening in this method:
https://github.com/clulab/scala-transformers/blob/main/encoder/src/main/scala/org/clulab/scala_transformers/encoder/LinearLayer.scala#L107

Thanks!

@kwalcock
Member Author

kwalcock commented Nov 2, 2023

The concatenation doesn't seem to take all that long:

Timer Layer Time
Elapsed   0:00:00:22.186
Encoder.forward   0:00:00:02.282
Encoder.predict 0 NER 0:00:00:00.050
Encoder.predict 1 POS 0:00:00:00.045
Encoder.predict 2 Chunking 0:00:00:00.023
Encoder.predict 3 Deps Head 0:00:00:00.154
Encoder.predict 4 Deps Label 0:00:00:19.530
LinearLayer.concatenation   0:00:00:01.339
Tokenizer   0:00:00:00.032

@MihaiSurdeanu
Contributor

Hmm, then I am not sure why Deps Label takes so much longer... Perhaps because the input to its linear layer is a vector that's twice as large as for Deps Head? As an aside, it can't be the classifier's output, because Deps Head has more classes than Deps Label (~100 vs. ~50).

Can you please time its forward() pass here: https://github.com/clulab/scala-transformers/blob/main/encoder/src/main/scala/org/clulab/scala_transformers/encoder/LinearLayer.scala#L183
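For reference, a bare-bones way to time a single call (the project may already have its own Timer utility; `concatInput` stands for the concatenated modifier + head input that forward() receives):

```scala
// Bare-bones timing sketch; concatInput stands for the concatenated modifier + head input.
val start = System.nanoTime()
val logits = forward(Array(concatInput))(0)
val elapsedMs = (System.nanoTime() - start) / 1e6
println(f"LinearLayer.forward took $elapsedMs%.3f ms")
```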

@kwalcock
Member Author

kwalcock commented Nov 2, 2023

Isn't this the n^2 code that is looping through all head candidates in LinearLayer.predictDualWithScores?

Since the native code is running slower than the Scala code, it may make sense to look at other Java/Scala implementations. There are some benchmarks here: https://lessthanoptimal.github.io/Java-Matrix-Benchmark/.

@kwalcock
Member Author

kwalcock commented Nov 2, 2023

This is with the Java implementation, so the overall time is lower.

Timer Layer Time
Elapsed   0:00:00:18.241
Encoder.forward   0:00:00:02.462
Encoder.predict 0 NER 0:00:00:00.076
Encoder.predict 1 POS 0:00:00:00.101
Encoder.predict 2 Chunking 0:00:00:00.060
Encoder.predict 3 Deps Head 0:00:00:00.337
Encoder.predict 4 Deps Label 0:00:00:15.100
LinearLayer.concatenation   0:00:00:01.295
LinearLayer.dualForward   0:00:00:13.615
Tokenizer   0:00:00:00.033

@MihaiSurdeanu
Contributor

> Isn't this the n^2 code that is looping through all head candidates in LinearLayer.predictDualWithScores?
>
> Since the native code is running slower than the Scala code, it may make sense to look at other Java/Scala implementations. There are some benchmarks here: https://lessthanoptimal.github.io/Java-Matrix-Benchmark/.

The implementation in Deps Label is linear because, for each word (as the modifier), we use the head predicted by Deps Head. Thanks to this, we explore just one head for each word in the sentence. So, linear.

@MihaiSurdeanu
Contributor

I will do some surgery on the Deps Label model soon.
But, one last thing before this: can you please repeat the last analysis with the roberta model? The deberta transformer is doing something fancier, and I wonder if that component is slow...

@kwalcock
Member Author

kwalcock commented Nov 3, 2023

These times are with roberta, first with JavaBLAS (faster) and then with NativeBLAS (slower). That dual forward is taking the most time, which is to be expected with example values of headCandidatesPerSentence at 72, 47, 18, 47, 24, 34, 13, 48. For JavaBLAS, the ratio of Deps Label to Deps Head is 45, which seems reasonable. For NativeBLAS, it is 122, which might mean that the slow native version is doing more work in Deps Label than Deps Head.

JavaBLAS

Timer Layer Time
Elapsed   0:00:00:18.268
Encoder.forward   0:00:00:01.834
Encoder.predict 0 NER 0:00:00:00.084
Encoder.predict 1 POS 0:00:00:00.103
Encoder.predict 2 Chunking 0:00:00:00.063
Encoder.predict 3 Deps Head 0:00:00:00.346
Encoder.predict 4 Deps Label 0:00:00:15.734
LinearLayer.concatenation   0:00:00:01.290
LinearLayer.dualForward   0:00:00:14.296
Tokenizer   0:00:00:00.030

NativeBLAS

Timer Layer Time
Elapsed   0:00:00:22.536
Encoder.forward   0:00:00:01.831
Encoder.predict 0 NER 0:00:00:00.055
Encoder.predict 1 POS 0:00:00:00.049
Encoder.predict 2 Chunking 0:00:00:00.027
Encoder.predict 3 Deps Head 0:00:00:00.165
Encoder.predict 4 Deps Label 0:00:00:20.305
LinearLayer.concatenation   0:00:00:01.399
LinearLayer.dualForward   0:00:00:18.720
Tokenizer   0:00:00:00.029

@kwalcock
Member Author

kwalcock commented Nov 3, 2023

It looks like ONNX itself can do math. OnnxTensor.createTensor(env, sourceArray) is shown in the get-started documentation, and the API documentation is at OnnxTensor. Would it make sense to try that rather than add Breeze and everything it brings in? encoderSession.run(inputs) seems to return a result from which a bunch of OnnxTensors can be extracted, probably without needing to convert. It would be really great if the math were a component that could be replaced.
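For context, a minimal sketch of that approach with the ONNX Runtime Java API (the model path, input name, and token ids below are placeholders, not the project's actual ones):

```scala
import ai.onnxruntime.{OnnxTensor, OrtEnvironment, OrtSession}

// Placeholder model path, input name, and token ids; the encoder's real ones differ.
val env = OrtEnvironment.getEnvironment()
val session = env.createSession("encoder.onnx", new OrtSession.SessionOptions())
val inputIds: Array[Array[Long]] = Array(Array(101L, 2023L, 102L))
val inputs = java.util.Collections.singletonMap("input_ids", OnnxTensor.createTensor(env, inputIds))
val result = session.run(inputs)
// The output can be read back as nested float arrays without any Breeze conversion.
val hidden = result.get(0).getValue.asInstanceOf[Array[Array[Array[Float]]]]
```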

@MihaiSurdeanu
Contributor

Maybe... ONNX is doing all sorts of other smart things in the background, and I wonder what the overhead of those is...
But maybe it's worth checking? That is, measure:

  • The overhead of converting ONNX tensors to Breeze.
  • The cost of multiplying tensors in ONNX vs. Breeze.

In any case, I suspect Breeze pays a high cost for Deps Label due to matrix operations where one dimension is twice as large as in Deps Head (because everything else seems the same). The good news is that Deps Label performs nearly as well when using just 1 of the embeddings (that of the modifier). I will train a model with this setting and see if it makes a difference.

@kwalcock
Member Author

kwalcock commented Nov 6, 2023

@myedibleenso, in case you are interested in this discussion.

@kwalcock
Member Author

The OnnxTensors in Java do not seem to have any mathematical operations associated with them, and I did not run into any library that offers them. However, in the branch kwalcock/math, I have abstracted all the math that Breeze supplies and reimplemented it once with EJML for comparison.
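For a sense of the shape of such an abstraction (the names here are hypothetical; the actual trait lives in the kwalcock/math branch and may look different):

```scala
import org.ejml.simple.SimpleMatrix

// Hypothetical sketch of a math abstraction with one EJML-backed implementation.
trait MathMatrix {
  def rows: Int
  def cols: Int
  def mul(that: MathMatrix): MathMatrix
}

class EjmlMathMatrix(val inner: SimpleMatrix) extends MathMatrix {
  def rows: Int = inner.getNumRows
  def cols: Int = inner.getNumCols
  def mul(that: MathMatrix): MathMatrix = that match {
    case ejml: EjmlMathMatrix => new EjmlMathMatrix(inner.mult(ejml.inner))
  }
}
```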

The runtimes are favorable on at least Jenny, where today:

Library Native Time (sec)
Breeze T 83
Breeze F 30
EJML F 26

On other machines that don't offer native support, they were about even. The major problem is that the native implementation shipped with Breeze does not run well on lots of computers. There isn't a good way to disable it or substitute a locally optimized version.

I would like to try one other library to be sure. The library may be reused for the clustering in habitus.

@MihaiSurdeanu
Contributor

Nice! Thanks @kwalcock !

@MihaiSurdeanu
Contributor

I started training a model that uses sum instead of concat for Deps Label. If our hypothesis is correct, this should speed it up. I'll send the model in a couple of days.
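Illustratively, the change keeps the Deps Label input at the base width (a sketch with placeholder dimensions only):

```scala
import breeze.linalg.DenseVector

// Sketch only: with sum, the Deps Label input stays at 768 dims, so its weight
// matrix is half as wide as in the concat setting.
val modifier = DenseVector.zeros[Float](768)
val head     = DenseVector.zeros[Float](768)
val summed   = modifier + head  // 768-dim, vs. the 1536-dim vertcat(modifier, head)
```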

@kwalcock
Member Author

Because they are asking about the size of the matrices being multiplied, I note that one sentence with 42 words and maybe 57 subword tokens involves the following matrix products:

• four (57 * 768) x (768 * 768), which are probably for four linear layers, and
• 9128 (1 * 1536) x (1536 * 1536).

I'm not sure where the 9128 comes from. The number of scalar multiplications for each is

• 4 * 57 * 768 * 768 = 134,479,872
• 9128 * 1 * 1536 * 1536 = 21,535,653,888

The second number is 160 times as large as the first. It's not so much that 1536 is twice 768 (that only accounts for a factor of 4), but that 9128 is about 40 times 4 * 57 (= 228).
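A quick check of that arithmetic, using the counts above:

```scala
// Quick check of the counts above.
val linearLayerMuls = 4L * 57 * 768 * 768        // 134,479,872
val dualMuls        = 9128L * 1 * 1536 * 1536    // 21,535,653,888
val ratio           = dualMuls.toDouble / linearLayerMuls  // ≈ 160
// Decomposed: (1536 * 1536) / (768 * 768) = 4, and 9128 / (4 * 57) ≈ 40, so 4 * 40 ≈ 160.
```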

@kwalcock
Member Author

The above is for predictWithScores. With just predict, the multiplications are the same four (57 * 768) x (768 * 768) and then just one (57 * 1536) x (1536 * 1536). The time for predictWithScores is about 22 seconds; for predict, just 2 seconds.

@MihaiSurdeanu
Contributor

I don't understand where 9128 comes from... Where in the code did you measure this?

@kwalcock
Member Author

I had the matrix multiplier print out what it was doing and it printed that line 9128 times in a row for the sentence. I can add a breakpoint and check the stack.

@kwalcock
Member Author

My count was a little off. It was 9120, which is 57 x 160. That's the product of the inputBatch size and headCandidatesPerToken.

@kwalcock
Member Author

The code in question is here:

def predictDualWithScores(inputBatch: Array[DenseMatrix[Float]],
    batchHeads: Option[Array[Array[Array[Int]]]] = None,
    batchMasks: Option[Array[Array[Boolean]]] = None): Array[Array[Array[(String, Float)]]] = {
  assert(batchHeads.isDefined)
  assert(batchMasks.isDefined)
  val indexToLabel = labelsOpt.getOrElse(throw new RuntimeException("ERROR: can't predict without labels!"))
  // dimensions: sent in batch x token in sentence x label per candidate head
  // we process one sentence at a time because the dual setting makes it harder to batch
  val outputBatch = inputBatch.zip(batchHeads.get).map { case (input, headCandidatesPerSentence) =>
    // now process each token separately
    headCandidatesPerSentence.zipWithIndex.map { case (headCandidatesPerToken, modifierAbsolutePosition) =>
      // process each head candidate for this token
      headCandidatesPerToken.map { headRelativePosition =>
        // generate a matrix that is twice as wide to concatenate the embeddings of the mod + head
        val concatInput = concatenateModifierAndHead(input, modifierAbsolutePosition, headRelativePosition)
        // get the logits for the current pair of modifier and head
        val logitsPerSentence = forward(Array(concatInput))(0)
        val labelScores = logitsPerSentence(0, ::)
        val bestIndex = argmax(labelScores.t)
        val bestScore = labelScores(bestIndex)
        val bestLabel = indexToLabel(bestIndex)
        // println(s"Top prediction for mod $modifierAbsolutePosition and relative head $headRelativePosition is $bestLabel with score $bestScore")
        (bestLabel, bestScore)
      } // end head candidates for this token
    } // end this token
  } // end sentence batch
  outputBatch
}
