Mac M1 - multiple issues preventing transcription #255

Open

SamuelAierizer opened this issue Dec 7, 2024 · 3 comments · Fixed by huggingface/transformers#35295

SamuelAierizer commented Dec 7, 2024

I keep trying to run insanely-fast-whisper on M1, but I keep hitting an issue that I can't trace to any existing problem, so I'm opening a new issue.

Environment:

  • CPU/GPU: M1 Max
  • OS: macOS 15.1.1
  • installed with pipx without any errors or warnings

Command that I tried to run:
insanely-fast-whisper --batch-size 4 --device-id mps --file-name <filename>

Error:

It starts to run but crashes after a few seconds; this is the trace.

(If it matters, the file is an mp4 that only contains an audio track. For reference, I can run whisper on the CPU, and whisper-mps runs just fine with CPU and GPU in tandem.)

~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py:512:
FutureWarning: The input name `inputs` is deprecated. Please make sure to use `input_features` instead.
  warnings.warn(
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:03You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:05Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:06The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:24
Traceback (most recent call last):
  File "~/.local/bin/insanely-fast-whisper", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/insanely_fast_whisper/cli.py", line 159, in main
    outputs = pipe(
              ^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 283, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1293, in __call__
    return next(
           ^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1208, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 515, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 718, in generate
    segments, segment_offset = self._retrieve_segment(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 1831, in _retrieve_segment
    "start": time_offset[prev_idx] + start_timestamp_pos.to(torch.float64) * time_precision,
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

Has anyone else encountered this issue? Were you able to work around it?

I am using the CLI; running it from a script is not really an option at the moment.
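
For reference, the underlying limitation is easy to reproduce outside the pipeline. A minimal sketch, assuming a PyTorch build with MPS available:

    import torch

    x = torch.zeros(1, device="mps")  # any tensor living on the MPS backend
    x.to(torch.float64)               # raises the same TypeError: MPS doesn't support float64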

@ashbuilds

Facing the same issue:

    "start": time_offset[prev_idx] + start_timestamp_pos.to(torch.float64) * time_precision,
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

@ashbuilds

After some debugging I was able to fix this with the workaround below, applied directly to the installed library:

File: insanely-fast-whisper/lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py

        segments.append(
            {
                "start": time_offset[prev_idx].to(torch.float32) + start_timestamp_pos.to(torch.float32) * time_precision,
                "end": time_offset[prev_idx].to(torch.float32) + end_timestamp_pos.to(torch.float32) * time_precision,
                "tokens": sliced_tokens,
                "result": seek_outputs[idx],
            }
        )
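
If you are not sure which installed copy of generation_whisper.py your command actually imports, a small sketch to print its path (run with the same Python environment the CLI uses):

    import transformers.models.whisper.generation_whisper as gw

    print(gw.__file__)  # full path of the file to patch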

SamuelAierizer (Author) commented Dec 7, 2024

OK, I thought I was the only one.

Manually modifying the library works for me as well.

Although I needed to modify these lines:
Line 1859: last_timestamp_pos = (timestamps[-1] - timestamp_begin).to(torch.float32)
Line 1863: "end": time_offset[prev_idx].to(torch.float32) + last_timestamp_pos * time_precision,

On line 1863, last_timestamp_pos doesn't need to be cast to float32 since it's an int.

It uses ~110% CPU and ~80% GPU.

[Screenshot: running insanely-fast-whisper locally on M1]

For a 120-minute file with --batch-size 4 and --model-name openai/whisper-large-v3, it uses about 2x as much RAM (30 GB+) as whisper-mps, which had similar CPU/GPU usage, and the transcription takes about the same time.
When I tried running it with --batch-size 2, the RAM usage stayed around 10-13 GB but then (after about 6 minutes) rocketed back to 30+ GB with heavy memory compression and swap usage. I suspect some kind of memory leak here, since based on the repo's description this shouldn't happen.

Should we add a check for --device-id mps and choose float32 or float64 based on that? Should I make a PR with the change?
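
A minimal sketch of what such a check might look like (hypothetical helper, not the actual transformers code):

    import torch

    def timestamp_dtype(device: torch.device) -> torch.dtype:
        # MPS has no float64 support, so fall back to float32 there;
        # keep float64 elsewhere to preserve the current behaviour.
        return torch.float32 if device.type == "mps" else torch.float64

    # e.g. in _retrieve_segment:
    # "start": time_offset[prev_idx]
    #          + start_timestamp_pos.to(timestamp_dtype(start_timestamp_pos.device)) * time_precision,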

Hopefully this helps someone else as well, even if a fix never gets merged into the project. At least there is a fairly simple way to patch it locally.
