My messaging app golden age is that of Allo. Remember Allo? It was one of Google’s messaging product attempts, and it had the killer feature that all mobile messaging apps should have:

voice messages with transcription.

See, I was taking the bus a lot back then, a college student living far off-campus, and I liked messaging my mom while walking. Often it was far more convenient to send a voice message than to type with my thumbs. However, if she responded with another voice message, I didn’t always have headphones on that could blast it over the street noise. Google would kindly transcribe the messages so we could each emote and express with tone of voice whether or not the other could listen to it right at that second.

Now I host a Matrix server. Yay, open source, yay, encrypted defaults, yay, decentralization, whatever, but I’m not just an engineer: I’m also a young-ish woman with demands of the tech I use with my friends. I want stickers and I want voice messages, Matrix. They’re just starting to introduce the latter into the major clients, so I thought I’d see if I could duct tape on automatic transcription.

Now, let me get this out of the way: I am sending voice messages to Google, which is obviously Bad in the way that sending big corporations your data is always bad. I recommend not enabling data logging, and putting up a disclaimer on any room with your bot enabled, à la: "note: voice messages to this room will be transcribed by <your bot's ID> by sending the audio to Google's speech-to-text API <with/without> data logging enabled."

I already had a setup of maubot on my server, and I vaguely recall its installation being pretty straightforward, so I won’t walk through that part.

You will need to set up a Google Cloud Platform account, enable the speech API for a project, and get its relevant credentials in a file.[1]

The Google client library finds that file through the GOOGLE_APPLICATION_CREDENTIALS environment variable, so I will provide most of the systemd file I use[2] so you can see where I set it.

[Unit]
Description=Matrix bot
#  probably should also be other stuff here but w/e 

[Service]
ExecReload=/bin/kill -HUP $MAINPID

# I'm not saying I'm lazy about users, but I'm not not saying it

# this will be wherever you installed it obvi
WorkingDirectory=<path to your maubot install>
ExecStart=/bin/bash -c "source ./bin/activate && python3 -m maubot"

# this seemed like the most straightforward way to do things to me. 
Environment=GOOGLE_APPLICATION_CREDENTIALS="<path to your google cloud credentials>.json"


Wow so great. (Is it great? I wouldn’t know, I am not a professional at systemd.)
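
If the transcription calls fail with authentication errors, the first thing worth ruling out is whether the service can actually see that variable. Here is a minimal sketch of such a check. `check_google_credentials` is a hypothetical helper, not part of maubot or the Google library, and it only confirms that the variable is set and that the file parses as JSON, not that Google will accept the key.

```python
import json
import os


def check_google_credentials(env=None):
    """Return the credentials path if the variable is set and the file is valid JSON."""
    # hypothetical helper: pass a dict to test it, or nothing to use the real environment
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    with open(path) as fp:
        json.load(fp)  # raises ValueError if the file is not valid JSON
    return path
```

Calling `check_google_credentials()` from a Python shell started in the same environment as the service should return the path you put in the unit file.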

Okay, let’s turn to the actual plugin you’re gonna need to do. Mostly the maubot instructions are good, so I’m not going to explain that part. Here’s the actual Python I ended up with:

from maubot import Plugin, MessageEvent
from maubot.handlers import event
from mautrix.types import EventType, MessageType
# pip3 install pydub
# make sure ffmpeg is installed 
from pydub import AudioSegment, silence
from os import remove
import os.path
# not sure if this is really right but don't care
from tempfile import gettempdir

# pip3 install google-cloud-speech
from google.cloud import speech

class SpeechToTextBot(Plugin):
    _google_client = speech.SpeechClient()

    # I've taken out the actual content here. I threw in people's names and some other phrases I wanted to cue Google might be used -- but it hasn't seemed to work all that well. 
    _speech_context = speech.SpeechContext(phrases=["hey there"])

    # yield <59s chunks of the audio with silence clipped out
    def get_chunks(self, seg, intervals, max_len=59000):
        time_cursor = 0
        length = 0
        interval_cursor = 0
        advance = True
        chunk = AudioSegment.empty()
        while interval_cursor < len(intervals):

            a, b = intervals[interval_cursor]

            # if we had to split something, start at split
            a = max(a, time_cursor)

            # if we can't add the next whole interval and
            # we have nothing to yield, split the next interval
            # and back up the cursor so we'll process the interval again
            if length == 0 and (b - a) >= max_len:
                b = a + max_len
                interval_cursor -= 1
            if length + (b - a) <= max_len:
                chunk += seg[a:b]
                time_cursor = b
                length += b - a
                interval_cursor += 1
            else:
                # the chunk is full: hand it over and start a new one
                yield chunk
                chunk = AudioSegment.empty()
                length = 0
        if length > 0:
            yield chunk

    # I don't remember whether this method has to be named this
    # (the decorator is what registers it as a handler)
    @event.on(EventType.ROOM_MESSAGE)
    async def handle_tombstone(self, evt: MessageEvent) -> None:
        # there is a way to filter message types in the bot framework,
        # but it also uses a regex that's unnecessary here
        if evt.content.msgtype == MessageType.AUDIO:
            original_bytes = await self.client.download_media(evt.content.url)

            # this bit is hacky/bad because pydub wants to work with files
            original_format_filename = os.path.join(gettempdir(), evt.event_id[1:])
            with open(original_format_filename, "wb") as fp:
                fp.write(original_bytes)
            # google documentation likes .flac, so .flac they get
            tgt_filename = original_format_filename + ".flac"
            seg = AudioSegment.from_file(original_format_filename)

            # since we have to process the audio to convert it, we might
            # as well crop out the silence. this does risk chopping out
            # quiet speech or decreasing accuracy, but it'll lower the
            # amount of audio you have to send over the wire.

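            # if you want to play it safer, lower silence_thresh further
            # (these are pydub's defaults, in ms and dBFS; tune to taste)
            intervals = silence.detect_nonsilent(seg, min_silence_len=1000, silence_thresh=-16)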
            # the easy speech API that doesn't need to upload into their file storage
            # only works with up to 60s chunks, so we want to be able to handle more
            for chunk in self.get_chunks(seg, intervals):
                # this is where ffmpeg does its magic
                # more unnecessary file hideousness
                chunk.export(tgt_filename, format="flac")
                with open(tgt_filename, "rb") as fp:
                    converted_bytes = fp.read()

                goog_audio = speech.RecognitionAudio(content=converted_bytes)
                goog_config = speech.RecognitionConfig(language_code="en-US",
                                                       speech_contexts=[self._speech_context])
                response = self._google_client.recognize(config=goog_config, audio=goog_audio)
                # if there are big pauses, it'll chunk up the response, so join them back together
                msg = " ".join([result.alternatives[0].transcript for result in response.results])
                # you can just send it into the room if you prefer but I like this better
                await evt.reply(msg)
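            # not sure if this is safe before the get_chunks stuff is over, too lazy to find out
            remove(original_format_filename)
            # there's no .flac if the whole message was silence
            if os.path.exists(tgt_filename):
                remove(tgt_filename)
            # error handling what's that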
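
The temp-file juggling above leaves files behind if anything raises partway through. One way around that, sketched here with only the standard library, is to do the work inside a `tempfile.TemporaryDirectory`, which deletes everything in it on exit whether or not an exception occurred. `with_scratch_files` and its `work` callback are hypothetical names for illustration. In the plugin, `work` would be the convert-and-transcribe part.

```python
import os
import tempfile


def with_scratch_files(original_bytes, event_id, work):
    """Write the download into a throwaway directory and call work(src, dst).

    The directory and everything in it is removed when the block exits,
    even if work() raises.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        # event IDs start with "$"; strip it like the plugin does
        src = os.path.join(tmpdir, event_id[1:])
        dst = src + ".flac"
        with open(src, "wb") as fp:
            fp.write(original_bytes)
        return work(src, dst)
```

This is a sketch of the cleanup pattern only. It does not change the fact that pydub and ffmpeg still go through files on disk.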

If you use Google Cloud Platform for anything that isn’t this, it may make more sense for you to not bother chunking things up into <60s bits and instead upload into their file storage nonsense, but I don’t anticipate doing that, so this way it’s all simple from that angle.
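
To make the chunking rule concrete without needing pydub or any audio, here is a small sketch that applies the same idea to plain millisecond intervals: keep adding whole non-silent intervals to a chunk while they fit, start a new chunk when the next one does not fit, and split an interval only when it is longer than a whole chunk on its own. `plan_chunks` is a hypothetical stand-in for the plugin's `get_chunks`. It returns the (start, end) pieces instead of audio.

```python
def plan_chunks(intervals, max_len=59000):
    """Group [start, end] millisecond intervals into chunks of at most max_len ms.

    Returns a list of chunks, each a list of (start, end) pieces.
    """
    chunks = []
    current = []
    length = 0
    for start, end in intervals:
        while start < end:
            if length > 0 and length + (end - start) > max_len:
                # the rest of this interval does not fit: close the chunk
                chunks.append(current)
                current, length = [], 0
                continue
            # the chunk is empty or the piece fits; cap the piece at what fits
            piece_end = min(end, start + max_len - length)
            current.append((start, piece_end))
            length += piece_end - start
            start = piece_end
    if current:
        chunks.append(current)
    return chunks


# a 130 s stretch of speech becomes 59 s + 59 s + 12 s
print(plan_chunks([[0, 130000]]))
# → [[(0, 59000)], [(59000, 118000)], [(118000, 130000)]]
```

Two short intervals that fit together, such as `[[0, 1000], [2000, 3000]]`, come back as a single chunk with both pieces, which is the case where the silence between them gets dropped.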

One final caveat: this only works in non-encrypted rooms. I don’t know how baked into the bot framework that is or whether I just need to twiddle with it some more.
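
One plausible explanation, not verified against maubot itself, comes from how the Matrix spec shapes these events. In an encrypted room the event on the wire has type `m.room.encrypted`, so the `msgtype` is hidden inside the ciphertext. Even after decryption, an encrypted attachment carries its address under `content.file.url` instead of `content.url`. The sketch below works on plain dicts and `plaintext_audio_url` is a hypothetical helper, so treat it as an illustration of the event shapes, not of the bot framework.

```python
def plaintext_audio_url(event):
    """Return the mxc:// URL if this raw event is an unencrypted audio message, else None."""
    if event.get("type") != "m.room.message":
        # e.g. "m.room.encrypted": the msgtype is inside the ciphertext
        return None
    content = event.get("content", {})
    if content.get("msgtype") != "m.audio":
        return None
    # encrypted attachments use content["file"]["url"], so this is None for them
    return content.get("url")
```

A check like the plugin's, which looks at the message type and then downloads `content.url`, only has something to work with in the first of those three cases.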

  1. But Maaaya, you’re not supposed to use long-living credentials. I can hear you now. You know what? When cloud providers really care about your credential use, then they’ll correct the documentation where they tell you to use the long-lived ones, won’t they. 

  2. Ooh, wait, SEO opportunity: this is my systemd file for maubot, my maubot systemd file. There you go.