Inference time changes based on how the model is loaded #238
-
Hello, I am really enjoying the coqui TTS models and packages, but I have run into an odd quirk that I do not understand. I am deploying coqui "tts_models/multilingual/multi-dataset/xtts_v2" as an API on GCP Cloud Run with Docker. Since Cloud Run scales to zero, the model has to be reloaded whenever the service scales back up from zero, which causes a cold start. To make these cold starts less painful, I wanted to bake the model into the Docker image at build time, rather than downloading it at the time of first inference. I was able to do this, and it reduced the cold start times from about 200 seconds to around 20 seconds, which is great.

But the odd quirk that arose is that inference is slower when the API uses the model baked into the Docker image than when the model is downloaded from the web through the coqui API. So, in option A:
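In option A, the weights are baked into the image by instantiating the model once at build time, roughly along these lines (a simplified, illustrative sketch rather than the exact lines from my Dockerfile, which are in the repo linked below):

```python
# Build-time download script, run from the Dockerfile (e.g. in a RUN step),
# so the XTTS v2 weights end up under /root/.local/share/tts inside the image.
import os

from TTS.api import TTS

# XTTS v2 asks you to accept its license on first download; this environment
# variable answers the prompt non-interactively (assuming a TTS version that
# supports it).
os.environ["COQUI_TOS_AGREED"] = "1"

# Instantiating the model once triggers the download and caches the weights.
TTS("tts_models/multilingual/multi-dataset/xtts_v2")
```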
Then, in the actual TTS python code I add this:

```python
LOCAL_PATH_DOCKER = "/root/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2"

if os.path.isdir(LOCAL_PATH_DOCKER):
    print("Model file exists")
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
else:
    print("CANNOT FIND MODEL FILE")
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
```

So basically this just checks whether the model files already exist inside the container; if they do, it loads the model from that local path, otherwise it downloads the model and caches it there. Now, for some reason, when the model files are already in the container, inference for the same sentence takes around 8-9 seconds, but when the model is downloaded from the web it takes about 1.5 seconds. Here are some of my logs showing this:
I am not really sure what is going on here, or why the way the model is loaded would make any difference. I verified that when the model is loaded inside the Docker container the audio is not corrupted or anything strange like that. I also verified in the logs that the GPU was still being accessed in both versions, by printing out the active device right before calling the TTS package, and both were using CUDA. At first I thought this was maybe a GCP issue, like my GPU had been downgraded; however, this was debunked after switching back to downloading the model from the web and getting the fast response times back.

I would ideally like to keep loading the model directly from the image built by my Dockerfile, since this effectively reduces the cold start times by 10x, but I don't want to lose the fast inference. Looking for any advice/ideas. Here is the repo where the code can be viewed in more detail: https://github.com/fentresspaul61B/Deploy-Coqui-TTS-GCP Right now the model download is commented out in the Dockerfile. Thank you in advance!
Replies: 1 comment 1 reply
-
You didn't add `.to(device)` when running in Docker, so it's probably slow because it's running on the CPU?
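A minimal sketch of the loading code with the device move applied in both branches (the device selection line is illustrative; adjust it to match the existing code):

```python
import os

import torch
from TTS.api import TTS

# Illustrative device selection; match this to whatever the app already uses.
device = "cuda" if torch.cuda.is_available() else "cpu"

LOCAL_PATH_DOCKER = "/root/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2"

if os.path.isdir(LOCAL_PATH_DOCKER):
    print("Model file exists")        # weights are already baked into the image
else:
    print("CANNOT FIND MODEL FILE")   # weights will be downloaded on first load

# Move the model to the GPU in *both* cases, not only when the weights
# had to be downloaded first.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
```

Printing the active device before the call only shows which device is available, not whether the model was actually moved to it, which is why the CPU fallback was easy to miss in the logs.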