Mac M1 - multiple issues preventing transcription #255

Open

SamuelAierizer opened this issue Dec 7, 2024 · 3 comments · Fixed by huggingface/transformers#35295

SamuelAierizer commented Dec 7, 2024

I keep trying to run insanely-fast-whisper on M1, but I keep hitting an issue that I can't trace to any existing problem, so I'm opening a new issue.

Environment:

  • CPU/GPU: M1 Max
  • OS: macOS 15.1.1
  • installed with pipx without any errors or warnings

Command that I tried to run:
insanely-fast-whisper --batch-size 4 --device-id mps --file-name <filename>

Error:

It starts to run but crashes after a few seconds; this is the trace.

(If it matters, the file is an mp4 that only contains an audio track. For reference, I can run whisper on the CPU, and whisper-mps runs just fine with CPU and GPU in tandem.)

~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py:512:
FutureWarning: The input name `inputs` is deprecated. Please make sure to use `input_features` instead.
  warnings.warn(
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:03You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:05Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:06The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:24
Traceback (most recent call last):
  File "~/.local/bin/insanely-fast-whisper", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/insanely_fast_whisper/cli.py", line 159, in main
    outputs = pipe(
              ^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 283, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1293, in __call__
    return next(
           ^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/base.py", line 1208, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 515, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 718, in generate
    segments, segment_offset = self._retrieve_segment(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/pipx/venvs/insanely-fast-whisper/lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 1831, in _retrieve_segment
    "start": time_offset[prev_idx] + start_timestamp_pos.to(torch.float64) * time_precision,
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

Has anyone else encountered this issue? Were you able to work around it?

I am using the CLI; running it from a script is not really an option at the moment.
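
For reference, the underlying limitation is easy to reproduce outside the pipeline. A minimal sketch, assuming a PyTorch build with MPS available:

    import torch

    x = torch.zeros(1, device="mps")  # any tensor living on the MPS backend
    x.to(torch.float64)               # raises the same TypeError: MPS doesn't support float64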

@ashbuilds

Facing the same issue:

    "start": time_offset[prev_idx] + start_timestamp_pos.to(torch.float64) * time_precision,
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

@ashbuilds

After some debugging I was able to fix this with the workaround below, applied directly to the installed library:

File: insanely-fast-whisper/lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py

        segments.append(
            {
                "start": time_offset[prev_idx].to(torch.float32) + start_timestamp_pos.to(torch.float32) * time_precision,
                "end": time_offset[prev_idx].to(torch.float32) + end_timestamp_pos.to(torch.float32) * time_precision,
                "tokens": sliced_tokens,
                "result": seek_outputs[idx],
            }
        )
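
If you are not sure which installed copy of generation_whisper.py your command actually imports, a small sketch to print its path (run with the same Python environment the CLI uses):

    import transformers.models.whisper.generation_whisper as gw

    print(gw.__file__)  # full path of the file to patch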

SamuelAierizer (Author) commented Dec 7, 2024

OK, I thought I was the only one.

Manually modifying the library works for me as well.

Although I needed to modify these lines:
Line 1859: last_timestamp_pos = (timestamps[-1] - timestamp_begin).to(torch.float32)
Line 1863: "end": time_offset[prev_idx].to(torch.float32) + last_timestamp_pos * time_precision,

On line 1863, last_timestamp_pos doesn't need to be cast to float32 since it's an int.

It uses ~110% CPU and ~80% GPU.

[Screenshot: running insanely-fast-whisper locally on M1]

For a 120-minute file with --batch-size 4 and --model-name openai/whisper-large-v3, it uses about 2x as much RAM (30 GB+) as whisper-mps, which had similar CPU/GPU usage, and the transcription takes about the same time.
When I tried running it with --batch-size 2, the RAM usage stayed around 10-13 GB but then (after about 6 minutes) rocketed back to 30+ GB with heavy memory compression and swap usage. I suspect some kind of memory leak here, since based on the repo's description this shouldn't happen.

Should we add a check for --device-id mps and choose float32 or float64 based on that? Should I make a PR with the change?
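
A minimal sketch of what such a check might look like (hypothetical helper, not the actual transformers code):

    import torch

    def timestamp_dtype(device: torch.device) -> torch.dtype:
        # MPS has no float64 support, so fall back to float32 there;
        # keep float64 elsewhere to preserve the current behaviour.
        return torch.float32 if device.type == "mps" else torch.float64

    # e.g. in _retrieve_segment:
    # "start": time_offset[prev_idx]
    #          + start_timestamp_pos.to(timestamp_dtype(start_timestamp_pos.device)) * time_precision,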

Hopefully this helps someone else as well, even if a fix never gets merged into the project. At least there is a fairly simple way to patch it locally.
