13 comments

  • d4rkp4ttern 20 hours ago
    I use the open source Handy [1] app with Parakeet V3 for STT when talking to coding agents and I’ve yet to see anything that beats this setup in terms of speed/accuracy. I get near instant transcription, and the slight accuracy drop is immaterial when talking to AIs that can “read between the lines”.

    I tried incorporating this Voxtral C implementation into Handy but got very slow transcriptions on my M1 Max MacBook 64GB.

    [1] https://github.com/cjpais/Handy

    I’ll have to try the other implementations mentioned here.

    • thethimble 13 hours ago
      Handy is great, but I wish the STT were real-time instead of batch.
      • d4rkp4ttern 8 hours ago
        There’s a tradeoff here. If you want streaming output, you lose the opportunity to clean the transcript up in post-processing, such as removing filler words or stutters, or doing any other AI-based cleanup.
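
        For instance, a batch transcript can be piped through a general-purpose LLM CLI for cleanup; a minimal sketch using Simon Willison's llm tool (assuming it's installed and configured, and that the transcript is sitting in a file):

        $ cat transcript.txt | llm -s "Remove filler words and stutters, fix punctuation, and output only the cleaned transcript"

        With streaming output there's no equivalent place to hook in a pass like that.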

        The macOS built-in dictation streams in real time and also does some cleanup, but it does awkward things, like showing the streaming text at the bottom of the screen. I also don’t think it’s as accurate as Parakeet V3, and there’s a 1-2 second startup lag after hitting the dictation shortcut, which kills it for me.

  • mythz 23 hours ago
    Big fan of Salvatore's voxtral.c and flux2.c projects - hope they continue to get optimized, as it'd be great to have lean options without external deps. Unfortunately it's currently too slow for real-world use (AMD 7800X3D/BLAS), which I ran into when adding Voice Input support to llms-py [1].

    In the end, Omarchy's new support for voxtype.io provided the nicest UX, followed by Whisper.cpp; and despite being slower, OpenAI's Whisper is still a solid local transcription option.

    Also very impressed with both the performance and price of Mistral's new Voxtral Transcription API [2] - really fast (near-instant) and really cheap ($0.003/min); IMO the best option in CPU/disk-constrained environments.
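
    For reference, it's a single multipart request; roughly like this, per the docs in [2] (double-check the header and model name there before relying on this):

    $ curl https://api.mistral.ai/v1/audio/transcriptions \
        -H "x-api-key: $MISTRAL_API_KEY" \
        -F file=@audio.wav \
        -F model=voxtral-mini-transcribe-26-02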

    [1] https://llmspy.org/docs/features/voice-input

    [2] https://docs.mistral.ai/models/voxtral-mini-transcribe-26-02

    • antirez 21 hours ago
      Hi! This model is great, but it is too big for local inference. Whisper medium (the "base" IMHO is not usable for most things, and "large" is too large) is a better deal for many environments, even if the transcription quality is noticeably lower (and even if it does not have a real online mode). But... it's time for me to check the new Qwen 0.6 transcription model. If it works as well as their benchmarks claim, that could be the target for very serious optimizations and a no-deps inference chain conceived from the start for CPU execution, not just for MPS, since many times you want to install such transcription systems on servers rented online via Hetzner and similar vendors. So I'm going to handle it next, and if it delivers, it's really time for big optimizations covering specifically the Intel, AMD, and ARM instruction sets, potentially also thinking about 8-bit quants if the quality remains good.
      • dust42 20 hours ago
        Same experience here with Whisper: medium is often not good enough. The large-turbo model, however, is pretty decent, and on Apple silicon it's fast enough for real-time conversations. The prompt parameter can also help with transcription quality, especially when using domain-specific vocabulary. In general, Whisper.cpp is better at transcribing full phrases than at streaming.
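
        For example, with whisper.cpp's CLI the prompt can seed domain vocabulary; a rough sketch (the binary name and model path depend on your build and download):

        $ ./main -m models/ggml-large-v3-turbo.bin -f meeting.wav --prompt "Kubernetes, etcd, kubelet, Istio"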

        And not to forget: for many use cases more than just English is needed. Unfortunately, right now most STT/ASR and TTS models focus on English plus 0-10 other languages, so being able to add more languages or domain-specific vocabulary with reasonable effort would be a huge plus for any STT or TTS.

    • mijoharas 23 hours ago
      One thing I keep looking for is transcribing while I'm talking. I feel like I need that visual feedback. Does voxtype support that?

      (I wasn't able to find anything at a glance.)

      Handy claims to have an overlay, but it seems to not work on my system.

      • mythz 23 hours ago
        Not sure how it works in other OSes, but in Omarchy [1] you hold down `Super + Ctrl + X` to start recording and release it to stop. While it's recording you'll see a red voice-recording icon in the top bar, so it's clear when it's recording.

        Although, as llms-py is a local web app, I had to build my own visual indicator [2], which also displays a red microphone next to the prompt when it's recording. It also supports both tap-on/off and hold-down recording modes. When using voxtype I'm just using the tool for transcription (i.e. not Omarchy's OS-wide dictation feature), like:

        $ voxtype transcribe /path/to/audio.wav

        If you're interested, the Python source code to support multiple voice transcription backends is at [3].

        [1] https://learn.omacom.io/2/the-omarchy-manual/107/ai

        [2] https://llmspy.org/docs/features/voice-input

        [3] https://github.com/ServiceStack/llms/blob/main/llms/extensio...

        • mijoharas 20 hours ago
          Ah, the thing I really want is to see the words I'm speaking being transcribed as I speak (i.e. in real time). For some reason I rarely see that feature.
          • bmn__ 19 hours ago
            • mijoharas 18 hours ago
              hahaha! plus ça change indeed.

              (I keep coming back to this one, so I've got half a dozen messages on HN asking for the exact same thing!)

              It's a shame: Whisper is so prevalent, but it's not great at actual streaming, and yet everyone uses it.

              I'm hoping one of these might become a realtime de facto standard so we can actually get our realtime streaming API (and yep, I'd be perfectly happy with something just writing to stdout, but all the tools always end up batching it because it's simpler!).
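
              The closest I've found is whisper.cpp's stream demo, which does print a rolling transcription to stdout, roughly like this (flags from its README; the step/window sizes are just examples):

              $ ./stream -m models/ggml-base.en.bin --step 500 --length 5000

              But it keeps re-transcribing a sliding window, so earlier words can change under you, which is the batching-in-disguise problem again.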

      • Doman 21 hours ago
        I am using a window manager with Waybar. Voxtype can display a status icon on Waybar [1]; that's enough for me to know what's going on.

        [1] https://github.com/peteonrails/voxtype/blob/main/docs/WAYBAR...

    • grigio 17 hours ago
      +1 for voxtype with the Whisper base model; it is quite fast and accurate.
  • Curiositry 1 day ago
    This was a breeze to install on Linux. However, I haven't managed to get realtime transcription working yet, à la Whisper.cpp stream or Moonshine.

    --from-mic only supports Mac. I'm able to capture audio with ffmpeg, but adapting the ffmpeg example to use mic capture hasn't worked yet:

    ffmpeg -f pulse -channels 1 -i 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin
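
    One variant I still need to try is forcing the output sample rate and channel count, in case the model expects 16 kHz mono rather than PulseAudio's default rate (that's an assumption on my part):

    ffmpeg -f pulse -i 1 -f s16le -ac 1 -ar 16000 - 2>/dev/null | ./voxtral -d voxtral-model --stdin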

    It's possible my system is simply under spec for the default model.

    I'd like to be able to use this with the voxtral-q4.gguf quantized model from here: https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf

    • jwrallie 1 day ago
      I am interested in a way to capture audio not only from the mic but also from one of the monitor ports, so you could pipe the audio you are hearing from the web directly into one of these solutions for real-time transcription. Did anyone manage to do that?

      I can, for example, capture that audio with Audacity or OBS Studio and transcribe it later, so it should be possible to do it in real time too, assuming my machine can keep up.
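
      On PulseAudio/PipeWire every sink exposes a .monitor source, so in principle the mic pipeline from the sibling comment should work once pointed at it; a sketch (the source name differs per system, and the voxtral flags are copied from the example above):

      $ pactl list short sources | grep monitor
      $ ffmpeg -f pulse -i alsa_output.pci-0000_00_1f.3.analog-stereo.monitor -f s16le -ac 1 -ar 16000 - 2>/dev/null | ./voxtral -d voxtral-model --stdin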

    • yjftsjthsd-h 1 day ago
      Does it work if you use ffmpeg to feed it audio from a file? I personally would try file -> ffmpeg -> voxtral, then mic -> ffmpeg -> file, and then try to glue together mic -> ffmpeg -> voxtral.

      (But take this with a grain of salt; I haven't tried it yet.)
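
      Something like this, untested, with flags borrowed from the ffmpeg example above:

      # file -> ffmpeg -> voxtral
      $ ffmpeg -i test.wav -f s16le -ac 1 -ar 16000 - 2>/dev/null | ./voxtral -d voxtral-model --stdin

      # mic -> ffmpeg -> file
      $ ffmpeg -f pulse -i 1 -ac 1 -ar 16000 mic-test.wav

      # mic -> ffmpeg -> voxtral
      $ ffmpeg -f pulse -i 1 -f s16le -ac 1 -ar 16000 - 2>/dev/null | ./voxtral -d voxtral-model --stdin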

      • Curiositry 1 day ago
        Recording audio with FFmpeg and transcribing a file piped from FFmpeg both work.

        Given that it took 19.64 minutes to transcribe the 11-second sample WAV, it’s possible I just didn’t wait long enough :)

        • yjftsjthsd-h 1 day ago
          Ah. In that case... yeah. Is it using the GPU, and does the whole model fit in your (V)RAM?
          • ekianjo 1 day ago
            This is a CPU implementation only.
            • yjftsjthsd-h 17 hours ago
              Oh, that's interesting. The README talks about GPU acceleration on Apple Silicon and I didn't see anything explicit for other platforms, so I assumed it needs a GPU everywhere, but it does BLAS acceleration, which a web search seems to agree is just a CPU-optimized math library. That's great; it should really increase the places where it's useful :)
              • ekianjo 2 hours ago
                It should be possible to develop a cuBLAS backend to accelerate BLAS on Nvidia.
  • written-beyond 1 day ago
    Funny, this and the Rust runtime implementation are neck and neck on the front page right now.

    Cool project!

  • hrpnk 23 hours ago
    There is also an MLX implementation: https://github.com/awni/voxmlx
  • sgt 1 day ago
    I'm very interested in speech to text, particularly for tricky dialects and specialized terminologies, but I'm still confused as to the best place to start in order to train models on a huge database of voice samples I own.

    Any ideas from the HN crowd currently involved in speech-to-text models?

  • ks2048 14 hours ago
    Should this work on a 16GB M3 MacBook Pro? It starts to load, but hangs or is too slow.
  • 9999_points 16 hours ago
    It seems so bizarre that we need a nearly 9 GB model to do something you could do over 20 years ago with ~200 MB.
  • sylware 22 hours ago
    Finally a plain and simple C lib to run open-weights LLMs?
  • alextray812 20 hours ago
    From a cybersecurity perspective, this project is impressive not just for performance, but for transparency.