Document the thread count options (#126)
* Document the thread count options

* Format fix

* Apply suggestions from code review

Co-authored-by: Jacky <[email protected]>

---------

Co-authored-by: Jacky <[email protected]>
tanmayv25 and kthui authored Apr 24, 2024
1 parent c50d65b commit 5c97507
Showing 2 changed files with 45 additions and 4 deletions.
41 changes: 41 additions & 0 deletions README.md
@@ -176,6 +176,47 @@ key: "ENABLE_CACHE_CLEANING"
}
```

* `INTER_OP_THREAD_COUNT`:

PyTorch allows using multiple CPU threads during TorchScript model inference.
One or more inference threads execute a model's forward pass on the given
inputs. Each inference thread invokes a JIT interpreter that executes the ops
of a model inline, one by one. This parameter sets the size of the inter-op
thread pool from which these inference threads are drawn. The default value is
the number of CPU cores. Please refer to
[this](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html)
document for guidance on setting this parameter properly.

The section of the model config file specifying this parameter will look like:

```
parameters: {
key: "INTER_OP_THREAD_COUNT"
value: {
string_value:"1"
}
}
```
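Conceptually, inter-op parallelism lets independent operations in the forward pass run concurrently on threads drawn from a pool of this size. A minimal stdlib-only Python sketch of the idea (illustrative only; this is not the backend's implementation, and `op_a`/`op_b` are made-up ops):

```python
from concurrent.futures import ThreadPoolExecutor

# Two independent "ops" that could run concurrently in a forward pass.
def op_a(x):
    return [v * 2 for v in x]

def op_b(x):
    return [v + 1 for v in x]

INTER_OP_THREAD_COUNT = 2  # pool size, as the parameter above would set

with ThreadPoolExecutor(max_workers=INTER_OP_THREAD_COUNT) as pool:
    fa = pool.submit(op_a, [1, 2, 3])   # scheduled on one pool thread
    fb = pool.submit(op_b, [1, 2, 3])   # may run concurrently on another
    result = fa.result() + fb.result()  # join before any dependent op

print(result)  # [2, 4, 6, 2, 3, 4]
```

With a pool size of 1, the two ops would run back-to-back on the same thread instead.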

* `INTRA_OP_THREAD_COUNT`:

In addition to inter-op parallelism, PyTorch can also utilize multiple threads
within an op (intra-op parallelism). This can be useful in many cases, including
element-wise ops on large tensors, convolutions, GEMMs, embedding lookups, and
others. The default value is the number of CPU cores. Please refer to
[this](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html)
document for guidance on setting this parameter properly.

The section of the model config file specifying this parameter will look like:

```
parameters: {
key: "INTRA_OP_THREAD_COUNT"
value: {
string_value:"1"
}
}
```
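By contrast, intra-op parallelism splits the work of a single op across several threads. A stdlib-only Python sketch of the idea, chunking one element-wise op across a worker pool (illustrative only; `parallel_double` is a made-up helper, not a PyTorch API):

```python
from concurrent.futures import ThreadPoolExecutor

INTRA_OP_THREAD_COUNT = 4  # threads cooperating on a single op

def parallel_double(data, num_threads):
    """One element-wise op, split into chunks handled by worker threads."""
    chunk = (len(data) + num_threads - 1) // num_threads
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() preserves chunk order, so results stitch back correctly.
        done = pool.map(lambda p: [v * 2 for v in p], pieces)
    return [v for piece in done for v in piece]

out = parallel_double(list(range(8)), INTRA_OP_THREAD_COUNT)
print(out)  # [0, 2, 4, 6, 8, 10, 12, 14]
```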

* Additional Optimizations: Three additional boolean parameters are available to disable
certain Torch optimizations that can sometimes cause latency regressions in models with
complex execution modes and dynamic shapes. If not specified, all are enabled by default.
8 changes: 4 additions & 4 deletions src/libtorch.cc
@@ -476,8 +476,8 @@ ModelState::ParseParameters()
```
   // is made to 'intra_op_thread_count', which by default will take all
   // threads
   int intra_op_thread_count = -1;
-  err = ParseParameter(
-      params, "INTRA_OP_THREAD_COUNT", &intra_op_thread_count);
+  err =
+      ParseParameter(params, "INTRA_OP_THREAD_COUNT", &intra_op_thread_count);
   if (err != nullptr) {
     if (TRITONSERVER_ErrorCode(err) != TRITONSERVER_ERROR_NOT_FOUND) {
       return err;
```
@@ -500,8 +500,8 @@ ModelState::ParseParameters()
```
   // is made to 'inter_op_thread_count', which by default will take all
   // threads
   int inter_op_thread_count = -1;
-  err = ParseParameter(
-      params, "INTER_OP_THREAD_COUNT", &inter_op_thread_count);
+  err =
+      ParseParameter(params, "INTER_OP_THREAD_COUNT", &inter_op_thread_count);
   if (err != nullptr) {
     if (TRITONSERVER_ErrorCode(err) != TRITONSERVER_ERROR_NOT_FOUND) {
       return err;
```
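The defaulting behavior the comments above describe (a `-1` sentinel meaning "take all threads") can be sketched in Python as follows; `effective_thread_count` is a hypothetical stand-in, not the backend's actual `ParseParameter` and error-handling logic:

```python
import os

def effective_thread_count(params, key):
    """Resolve a thread-count parameter from a model-config dict (sketch)."""
    # Only a positive value from the model config overrides the default;
    # the -1 sentinel leaves PyTorch at its default of all CPU threads.
    count = int(params.get(key, -1))
    if count > 0:
        return count
    return os.cpu_count()

n = effective_thread_count(
    {"INTRA_OP_THREAD_COUNT": "1"}, "INTRA_OP_THREAD_COUNT")
print(n)  # 1
```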
