
Performance needs to be evaluated #16

Open
kwalcock opened this issue Nov 21, 2022 · 36 comments

@kwalcock
Member

These are times without native BLAS.

Times for Windows laptop, first 100 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:00:24.913 24913910400
Encoder.forward 0:00:00:23.917 23917464300
Encoder.predict 0 0:00:00:00.379 379726600
Encoder.predict 1 0:00:00:00.478 478235800
Tokenizer 0:00:00:00.097 97104000

Same, but parallel on 8 processors. Times other than Elapsed were overlapping, so they should be divided by approximately 8:

Name Time Nanoseconds
Elapsed 0:00:00:14.390 14390361900
Encoder.forward 0:00:01:31.702 91702800300
Encoder.predict 0 0:00:00:00.707 707846200
Encoder.predict 1 0:00:00:00.993 993069100
Tokenizer 0:00:00:00.407 407733000
@MihaiSurdeanu
Contributor

Thanks!

Just as a reference, what is the overall time for running these sentences through CluProcessor? This is not exactly an apples-to-apples comparison because CluProcessor is doing more things, but it should be close.

@kwalcock
Member Author

These are also without native BLAS.

Times for Linux server, first 100 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:00:04.897 4897058170
Encoder.forward 0:00:00:04.654 4654979102
Encoder.predict 0 0:00:00:00.086 86504564
Encoder.predict 1 0:00:00:00.120 120512364
Tokenizer 0:00:00:00.022 22148964

Same, but parallel on 32 processors:

Name Time Nanoseconds
Elapsed 0:00:00:03.575 3575645530
Encoder.forward 0:00:01:29.494 89494416255
Encoder.predict 0 0:00:00:01.313 1313994450
Encoder.predict 1 0:00:00:00.569 569052836
Tokenizer 0:00:00:00.308 308239218

Times for Linux server, first 10000 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:07:55.954 475954729687
Encoder.forward 0:00:07:41.352 461352168472
Encoder.predict 0 0:00:00:02.448 2448010003
Encoder.predict 1 0:00:00:10.437 10437701680
Tokenizer 0:00:00:01.414 1414654220

Same, but parallel on 32 processors:

Name Time Nanoseconds
Elapsed 0:00:05:39.629 339629263458
Encoder.forward 0:02:39:22.615 9562615599667
Encoder.predict 0 0:00:00:11.108 11108428440
Encoder.predict 1 0:00:00:34.033 34033494825
Tokenizer 0:00:00:06.076 6076358064

Parallelism isn't helping a great deal.

@kwalcock
Member Author

The updated BLAS, which issues a message that the native version is being used, doesn't make much of a difference...

Times for Linux server, first 10000 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:07:58.278 478278524666
Encoder.forward 0:00:07:45.628 465628128141
Encoder.predict 0 0:00:00:02.168 2168806911
Encoder.predict 1 0:00:00:08.761 8761538409
Tokenizer 0:00:00:01.416 1416116462

Same, but parallel on 32 processors:

Name Time Nanoseconds
Elapsed 0:00:05:39.383 339383446418
Encoder.forward 0:02:39:41.831 9581831531735
Encoder.predict 0 0:00:00:09.896 9896565890
Encoder.predict 1 0:00:00:25.055 25055794401
Tokenizer 0:00:00:06.197 6197267482

@MihaiSurdeanu
Contributor

Good to know!
Just FYI: BLAS is only used on the predict * methods.

@kwalcock
Member Author

These values are for processors: the time to mkDocument from existing tokens and then to annotate.

Times for Linux server, first 100 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:00:07.447 7447058360

Times for Linux server, all 10000 sentences, serial:

Name Time Nanoseconds
Elapsed 0:00:08:11.268 491268843911

@kwalcock
Member Author

BLAS was helping then. In order to get the Scala 2.11 version to build, I lowered the Breeze version, so it may not have native library support. I'll check on that and then only lower it when necessary.

@MihaiSurdeanu
Contributor

On CluProcessor: nice to see the same ballpark numbers.

@kwalcock
Member Author

kwalcock commented Oct 2, 2023

@MihaiSurdeanu, this measurement is being taken almost a year later. This is for 100 sentences running on a server with 32 processors for the parallel case. The balaur branch is 10 times slower in the serial case, but parallelizes better. I don't see direct comparisons above, so it's hard to tell. I'm going to go back a few commits and see if something changes.

Serial Parallel Branch
1:20.887 6.426 kwalcock/balaur
1:22.299 6.653 kwalcock/balaur
1:21.289 6.360 kwalcock/balaur
7.258 9.395 master
7.415 9.586 master
7.348 9.494 master

@MihaiSurdeanu
Contributor

This is WAY slower than it used to be...
Clearly something happened.

@kwalcock
Member Author

kwalcock commented Oct 3, 2023

It seems to be taking the most time in LinearLayer.forward:

[profiler screenshots: parallel view and hotspot view]

kwalcock reopened this Oct 3, 2023
@MihaiSurdeanu
Contributor

Interesting... I don't think I changed much there... It would be good to revert to an older version to see when things began to change.

@kwalcock
Member Author

kwalcock commented Nov 1, 2023

Here are more timing measurements. I do not know why it was taking 1:20 min. before and will look into that. I did update some BLAS-related packages in the meantime, and maybe they can all talk to each other now.

In general, Breeze is hurting us fairly significantly in the single-threaded case, but it can make up for it when multi-threaded. Also, NativeBLAS, which is only supported on Linux anyway, is not helping and generally hurts. Unfortunately, there is no standard way to disable it in Breeze; I had to hack the netlib jar file. We are not using much of Breeze's functionality, and I wonder whether all the conversion to DenseMatrix and DenseVector is worth it.
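For reference, the kind of conversion in question looks roughly like this (a minimal sketch; the actual code in the encoder may differ, and the array layout is an assumption):

```scala
import breeze.linalg.DenseMatrix

// Minimal sketch of converting an ONNX-style row-major float array into a
// Breeze DenseMatrix; the real conversion in the encoder may differ.
def toDenseMatrix(rows: Array[Array[Float]]): DenseMatrix[Float] = {
  val numRows = rows.length
  val numCols = if (numRows == 0) 0 else rows.head.length
  // tabulate copies every cell, which is the per-sentence overhead being questioned
  DenseMatrix.tabulate[Float](numRows, numCols)((i, j) => rows(i)(j))
}
```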

OSys Branch NativeBLAS Serial (sec.) Parallel (sec.) Notes
Windows master F 13 13 Intel chip, slow
Windows balaur F 16 5 Favorable because master is slow
Linux master F 7 9 AMD chip
Linux balaur T 21 7 How can this be disabled?
Linux balaur F 19 3 Disabling doesn't help much on serial
Mac/Intel master F 11 11 Intel version on ARM
Mac/Intel balaur F 26 10  
Mac/M1 master F 7 7 M1 version on ARM
Mac/M1 balaur F 18 6  

@MihaiSurdeanu
Contributor

This is very very informative! Thank you!!
I suspect the most common setting will be (Linux, balaur, F). In this case, what pct. of the execution stays in the linear layer? Perhaps we should try to improve that.

@kwalcock
Member Author

kwalcock commented Nov 1, 2023

Unless something is done, they will probably have Linux, balaur, T, because it can't readily be made F. I'll check on the split times.

@kwalcock
Member Author

kwalcock commented Nov 1, 2023

The slow times weren't all imagined. The new balaur version runs very slowly on our servers, particularly when there is native support. It's 8x slower than the current version.

Computer OSys Branch NativeBLAS Serial (sec.) Parallel (sec.) Notes
NUC Windows master F 13 13 Intel chip, slow
NUC Windows balaur F 16 5 Perhaps because master is slow
aorus Linux master F 7 9 AMD chip
aorus Linux balaur T 21 7 How can this be disabled?
aorus Linux balaur F 19 3 Disabling doesn't help much on serial
MacMini Mac/Intel master F 11 11 Intel version on ARM
MacMini Mac/Intel balaur F 26 10  
MacMini Mac/M1 master F 7 7 M1 version on ARM
MacMini Mac/M1 balaur F 18 6  
river Linux master F 23 22  
river Linux balaur T 199 35 Very bad serial time!
river Linux balaur F 74 16  
clara Linux master F 12 19 Too many processors
clara Linux balaur T 95 6 Very bad serial time!
clara Linux balaur F 50 3  

@kwalcock
Member Author

kwalcock commented Nov 2, 2023

This was run on aorus, Linux, balaur, native BLAS, serial. Deps Label takes 88% of the time.

Timer Layer Time
Elapsed   0:00:00:23.311
Encoder.forward   0:00:00:02.516
Encoder.predict 0 NER 0:00:00:00.047
Encoder.predict 1 POS 0:00:00:00.049
Encoder.predict 2 Chunking 0:00:00:00.025
Encoder.predict 3 Deps Head 0:00:00:00.169
Encoder.predict 4 Deps Label 0:00:00:20.400
Tokenizer   0:00:00:00.032

@MihaiSurdeanu
Contributor

Oh, I finally get it! Thanks for this!

The old tests you did were performed before the stable version of Deps Label was deployed. The current one has higher overhead because it's the only one that runs in what I called "dual mode", i.e., it pairs the embeddings of both tokens involved in a syntactic dependency. This involves a vector concatenation operation + a linear layer that operates on the resulting vector, which is twice as large as in the other cases.
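For illustration, the dual-mode shape in Breeze terms (a minimal sketch with placeholder dimensions and label count; the real code lives in LinearLayer.scala):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Illustrative sketch only: 768 is the usual encoder hidden size, numLabels is hypothetical.
val modifier  = DenseVector.zeros[Float](768)
val head      = DenseVector.zeros[Float](768)
val pair      = DenseVector.vertcat(modifier, head)        // 1536-dim concatenation
val numLabels = 50
val weights   = DenseMatrix.zeros[Float](numLabels, 1536)  // twice as wide as the other heads
val logits    = weights * pair                             // linear layer on the doubled vector
```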

Before I start messing with it, can you please check what the cost of the vector concatenation is? That is happening in this method:
https://github.com/clulab/scala-transformers/blob/main/encoder/src/main/scala/org/clulab/scala_transformers/encoder/LinearLayer.scala#L107

Thanks!

@kwalcock
Member Author

kwalcock commented Nov 2, 2023

The concatenation doesn't seem to take all that long:

Timer Layer Time
Elapsed   0:00:00:22.186
Encoder.forward   0:00:00:02.282
Encoder.predict 0 NER 0:00:00:00.050
Encoder.predict 1 POS 0:00:00:00.045
Encoder.predict 2 Chunking 0:00:00:00.023
Encoder.predict 3 Deps Head 0:00:00:00.154
Encoder.predict 4 Deps Label 0:00:00:19.530
LinearLayer.concatenation   0:00:00:01.339
Tokenizer   0:00:00:00.032

@MihaiSurdeanu
Contributor

Hmm, then I am not sure why Deps Label takes so much longer... Perhaps because the input to its linear layer is a vector that's twice as large as for Deps Head? As an aside, it can't be the classifier's output, because Deps Head has more classes than Deps Label (~100 vs. ~50).

Can you please time its forward() pass here: https://github.com/clulab/scala-transformers/blob/main/encoder/src/main/scala/org/clulab/scala_transformers/encoder/LinearLayer.scala#L183
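For reference, a bare-bones way to time a single call (the project may already have its own Timer utility; `concatInput` stands for the concatenated modifier + head input that forward() receives):

```scala
// Bare-bones timing sketch; concatInput stands for the concatenated modifier + head input.
val start = System.nanoTime()
val logits = forward(Array(concatInput))(0)
val elapsedMs = (System.nanoTime() - start) / 1e6
println(f"LinearLayer.forward took $elapsedMs%.3f ms")
```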

@kwalcock
Member Author

kwalcock commented Nov 2, 2023

Isn't this the n^2 code that is looping through all head candidates in LinearLayer.predictDualWithScores?

Since the native code is running slower than the Scala code, it may make sense to look at other Java/Scala implementations. There are some benchmarks here: https://lessthanoptimal.github.io/Java-Matrix-Benchmark/.

@kwalcock
Member Author

kwalcock commented Nov 2, 2023

This is with the Java implementation, so the overall time is lower.

Timer Layer Time
Elapsed   0:00:00:18.241
Encoder.forward   0:00:00:02.462
Encoder.predict 0 NER 0:00:00:00.076
Encoder.predict 1 POS 0:00:00:00.101
Encoder.predict 2 Chunking 0:00:00:00.060
Encoder.predict 3 Deps Head 0:00:00:00.337
Encoder.predict 4 Deps Label 0:00:00:15.100
LinearLayer.concatenation   0:00:00:01.295
LinearLayer.dualForward   0:00:00:13.615
Tokenizer   0:00:00:00.033

@MihaiSurdeanu
Contributor

> Isn't this the n^2 code that is looping through all head candidates in LinearLayer.predictDualWithScores?
>
> Since the native code is running slower than the Scala code, it may make sense to look at other Java/Scala implementations. There are some benchmarks here: https://lessthanoptimal.github.io/Java-Matrix-Benchmark/.

The implementation in Deps Label is linear because, for each word (as the modifier), we use the head predicted by Deps Head. Thanks to this, we explore just one head for each word in the sentence. So, linear.

@MihaiSurdeanu
Contributor

I will do some surgery on the Deps Label model soon.
But, one last thing before this: can you please repeat the last analysis with the roberta model? The deberta transformer is doing something fancier, and I wonder if that component is slow...

@kwalcock
Member Author

kwalcock commented Nov 3, 2023

These times are with roberta, first with JavaBLAS (faster) and then with NativeBLAS (slower). That dual forward is taking the most time, which is to be expected with example values of headCandidatesPerSentence at 72, 47, 18, 47, 24, 34, 13, 48. For JavaBLAS, the ratio of Deps Label to Deps Head is 45, which seems reasonable. For NativeBLAS, it is 122, which might mean that the slow native version is doing more work in Deps Label than Deps Head.

JavaBLAS

Timer Layer Time
Elapsed   0:00:00:18.268
Encoder.forward   0:00:00:01.834
Encoder.predict 0 NER 0:00:00:00.084
Encoder.predict 1 POS 0:00:00:00.103
Encoder.predict 2 Chunking 0:00:00:00.063
Encoder.predict 3 Deps Head 0:00:00:00.346
Encoder.predict 4 Deps Label 0:00:00:15.734
LinearLayer.concatenation   0:00:00:01.290
LinearLayer.dualForward   0:00:00:14.296
Tokenizer   0:00:00:00.030

NativeBLAS

Timer Layer Time
Elapsed   0:00:00:22.536
Encoder.forward   0:00:00:01.831
Encoder.predict 0 NER 0:00:00:00.055
Encoder.predict 1 POS 0:00:00:00.049
Encoder.predict 2 Chunking 0:00:00:00.027
Encoder.predict 3 Deps Head 0:00:00:00.165
Encoder.predict 4 Deps Label 0:00:00:20.305
LinearLayer.concatenation   0:00:00:01.399
LinearLayer.dualForward   0:00:00:18.720
Tokenizer   0:00:00:00.029

@kwalcock
Member Author

kwalcock commented Nov 3, 2023

It looks like ONNX itself can do math. OnnxTensor.createTensor(env, sourceArray) is shown in the get-started documentation, and the API documentation is at OnnxTensor. Would it make sense to try that rather than add Breeze and everything it brings in? encoderSession.run(inputs) seems to return a result from which a bunch of OnnxTensors can be extracted, probably without needing to convert. It would be really great if the math were a component that could be replaced.
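For context, a minimal sketch of that approach with the ONNX Runtime Java API (the model path, input name, and token ids below are placeholders, not the project's actual ones):

```scala
import ai.onnxruntime.{OnnxTensor, OrtEnvironment, OrtSession}

// Placeholder model path, input name, and token ids; the encoder's real ones differ.
val env = OrtEnvironment.getEnvironment()
val session = env.createSession("encoder.onnx", new OrtSession.SessionOptions())
val inputIds: Array[Array[Long]] = Array(Array(101L, 2023L, 102L))
val inputs = java.util.Collections.singletonMap("input_ids", OnnxTensor.createTensor(env, inputIds))
val result = session.run(inputs)
// The output can be read back as nested float arrays without any Breeze conversion.
val hidden = result.get(0).getValue.asInstanceOf[Array[Array[Array[Float]]]]
```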

@MihaiSurdeanu
Contributor

Maybe... ONNX is doing all sorts of other smart things in the background, and I wonder what the overhead of those is...
But maybe it's worth checking? That is, measure:

  • The overhead of converting ONNX tensors to Breeze.
  • The cost of multiplying tensors in ONNX vs. Breeze.

In any case, I suspect Breeze pays a high cost for Deps Label due to matrix operations where one dimension is twice as large as in Deps Head (because everything else seems the same). The good news is that Deps Label performs nearly as well when using just 1 of the embeddings (that of the modifier). I will train a model with this setting and see if it makes a difference.

@kwalcock
Member Author

kwalcock commented Nov 6, 2023

@myedibleenso, in case you are interested in this discussion.

@kwalcock
Member Author

The OnnxTensors in Java do not seem to have any mathematical operations associated with them, and I did not run into any library that offers them. However, in the branch kwalcock/math, I have abstracted all the math that Breeze supplies and reimplemented it once with EJML for comparison.
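For a sense of the shape of such an abstraction (the names here are hypothetical; the actual trait lives in the kwalcock/math branch and may look different):

```scala
import org.ejml.simple.SimpleMatrix

// Hypothetical sketch of a math abstraction with one EJML-backed implementation.
trait MathMatrix {
  def rows: Int
  def cols: Int
  def mul(that: MathMatrix): MathMatrix
}

class EjmlMathMatrix(val inner: SimpleMatrix) extends MathMatrix {
  def rows: Int = inner.getNumRows
  def cols: Int = inner.getNumCols
  def mul(that: MathMatrix): MathMatrix = that match {
    case ejml: EjmlMathMatrix => new EjmlMathMatrix(inner.mult(ejml.inner))
  }
}
```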

The runtimes are favorable on at least Jenny, where today:

Library Native Time (sec)
Breeze T 83
Breeze F 30
EJML F 26

On other machines that don't offer native support, they were about even. The major problem is that the native implementation shipped with Breeze does not run well on lots of computers. There isn't a good way to disable it or substitute a locally optimized version.

I would like to try one other library to be sure. The library may be reused for the clustering in habitus.

@MihaiSurdeanu
Contributor

Nice! Thanks @kwalcock !

@MihaiSurdeanu
Contributor

I started training a model that uses sum instead of concat for Deps Label. If our hypothesis is correct, this should speed it up. I'll send the model in a couple of days.
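Illustratively, the change keeps the Deps Label input at the base width (a sketch with placeholder dimensions only):

```scala
import breeze.linalg.DenseVector

// Sketch only: with sum, the Deps Label input stays at 768 dims, so its weight
// matrix is half as wide as in the concat setting.
val modifier = DenseVector.zeros[Float](768)
val head     = DenseVector.zeros[Float](768)
val summed   = modifier + head  // 768-dim, vs. the 1536-dim vertcat(modifier, head)
```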

@kwalcock
Member Author

Because they are asking about the size of the matrices being multiplied, I note that one sentence with 42 words and maybe 57 subword tokens involves the following matrix products:

• four (57 * 768) x (768 * 768), which are probably for four linear layers, and
• 9128 (1 * 1536) x (1536 * 1536).

I'm not sure where the 9128 comes from. The number of scalar multiplications for each is

• 4 * 57 * 768 * 768 = 134,479,872
• 9128 * 1 * 1536 * 1536 = 21,535,653,888

The second number is 160 times as large as the first. It's not so much that 1536 is twice 768 (that only accounts for a factor of 4), but that 9128 is about 40 times 4 * 57 (= 228).
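A quick check of that arithmetic, using the counts above:

```scala
// Quick check of the counts above.
val linearLayerMuls = 4L * 57 * 768 * 768        // 134,479,872
val dualMuls        = 9128L * 1 * 1536 * 1536    // 21,535,653,888
val ratio           = dualMuls.toDouble / linearLayerMuls  // ≈ 160
// Decomposed: (1536 * 1536) / (768 * 768) = 4, and 9128 / (4 * 57) ≈ 40, so 4 * 40 ≈ 160.
```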

@kwalcock
Member Author

The above is for predictWithScores. With just predict, the multiplications are the same four (57 * 768) x (768 * 768) and then just one (57 * 1536) x (1536 * 1536). The time for predictWithScores is about 22 seconds; for predict, just 2 seconds.

@MihaiSurdeanu
Contributor

I don't understand where 9128 comes from... Where in the code did you measure this?

@kwalcock
Member Author

I had the matrix multiplier print out what it was doing and it printed that line 9128 times in a row for the sentence. I can add a breakpoint and check the stack.

@kwalcock
Member Author

My count was a little off. It was 9120, which is 57 x 160. That's the product of the inputBatch size and headCandidatesPerToken.

@kwalcock
Member Author

The code in question is here:

def predictDualWithScores(inputBatch: Array[DenseMatrix[Float]],
    batchHeads: Option[Array[Array[Array[Int]]]] = None,
    batchMasks: Option[Array[Array[Boolean]]] = None): Array[Array[Array[(String, Float)]]] = {
  assert(batchHeads.isDefined)
  assert(batchMasks.isDefined)
  val indexToLabel = labelsOpt.getOrElse(throw new RuntimeException("ERROR: can't predict without labels!"))
  // dimensions: sent in batch x token in sentence x label per candidate head
  // we process one sentence at a time because the dual setting makes it harder to batch
  val outputBatch = inputBatch.zip(batchHeads.get).map { case (input, headCandidatesPerSentence) =>
    // now process each token separately
    headCandidatesPerSentence.zipWithIndex.map { case (headCandidatesPerToken, modifierAbsolutePosition) =>
      // process each head candidate for this token
      headCandidatesPerToken.map { headRelativePosition =>
        // generate a matrix that is twice as wide to concatenate the embeddings of the mod + head
        val concatInput = concatenateModifierAndHead(input, modifierAbsolutePosition, headRelativePosition)
        // get the logits for the current pair of modifier and head
        val logitsPerSentence = forward(Array(concatInput))(0)
        val labelScores = logitsPerSentence(0, ::)
        val bestIndex = argmax(labelScores.t)
        val bestScore = labelScores(bestIndex)
        val bestLabel = indexToLabel(bestIndex)
        // println(s"Top prediction for mod $modifierAbsolutePosition and relative head $headRelativePosition is $bestLabel with score $bestScore")
        (bestLabel, bestScore)
      } // end head candidates for this token
    } // end this token
  } // end sentence batch
  outputBatch
}
