Inference time changes based on how the model is loaded #238
-
Hello, I am really enjoying the coqui TTS models and packages, but I have run into an odd quirk that I do not understand. I am deploying coqui "tts_models/multilingual/multi-dataset/xtts_v2" as an API on GCP Cloud Run with Docker. Since Cloud Run scales to zero, the model has to be reloaded whenever the service scales back up from zero, which causes a cold start. To make these cold starts less painful, I wanted to bake the model into the Docker image at build time, rather than downloading it at the time of first inference. I was able to do this, and it reduced the cold start times from about 200 seconds to around 20 seconds, which is great.

But the odd quirk that arose is that inference is slower when the API uses the model baked into the Docker image than when the model is downloaded from the web through the coqui API. So, in option A:
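In option A, the weights are baked into the image by instantiating the model once at build time, roughly along these lines (a simplified, illustrative sketch rather than the exact lines from my Dockerfile, which are in the repo linked below):

```python
# Build-time download script, run from the Dockerfile (e.g. in a RUN step),
# so the XTTS v2 weights end up under /root/.local/share/tts inside the image.
import os

from TTS.api import TTS

# XTTS v2 asks you to accept its license on first download; this environment
# variable answers the prompt non-interactively (assuming a TTS version that
# supports it).
os.environ["COQUI_TOS_AGREED"] = "1"

# Instantiating the model once triggers the download and caches the weights.
TTS("tts_models/multilingual/multi-dataset/xtts_v2")
```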
Then, in the actual TTS python code I add this:

```python
LOCAL_PATH_DOCKER = "/root/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2"

if os.path.isdir(LOCAL_PATH_DOCKER):
    print("Model file exists")
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
else:
    print("CANNOT FIND MODEL FILE")
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
```

So basically this just checks whether the model files already exist inside the container; if they do, it loads the model from that local path, otherwise it downloads the model and caches it there. Now, for some reason, when the model files are already in the container, inference for the same sentence takes around 8-9 seconds, but when the model is downloaded from the web it takes about 1.5 seconds. Here are some of my logs showing this:
I am not really sure what is going on here, or why the way the model is loaded would make any difference. I verified that when the model is loaded inside the Docker container the audio is not corrupted or anything strange like that. I also verified in the logs that the GPU was still being accessed in both versions, by printing out the active device right before calling the TTS package, and both were using CUDA. At first I thought this was maybe a GCP issue, like my GPU had been downgraded; however, this was debunked after switching back to downloading the model from the web and getting the fast response times back.

I would ideally like to keep loading the model directly from the image built by my Dockerfile, since this effectively reduces the cold start times by 10x, but I don't want to lose the fast inference. Looking for any advice/ideas. Here is the repo where the code can be viewed in more detail: https://github.com/fentresspaul61B/Deploy-Coqui-TTS-GCP Right now the model download is commented out in the Dockerfile. Thank you in advance!
Replies: 1 comment 1 reply
-
You didn't add `.to(device)` when running in Docker, so it's probably slow because it's running on the CPU?
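A minimal sketch of the loading code with the device move applied in both branches (the device selection line is illustrative; adjust it to match the existing code):

```python
import os

import torch
from TTS.api import TTS

# Illustrative device selection; match this to whatever the app already uses.
device = "cuda" if torch.cuda.is_available() else "cpu"

LOCAL_PATH_DOCKER = "/root/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2"

if os.path.isdir(LOCAL_PATH_DOCKER):
    print("Model file exists")        # weights are already baked into the image
else:
    print("CANNOT FIND MODEL FILE")   # weights will be downloaded on first load

# Move the model to the GPU in *both* cases, not only when the weights
# had to be downloaded first.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
```

Printing the active device before the call only shows which device is available, not whether the model was actually moved to it, which is why the CPU fallback was easy to miss in the logs.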