Support low-latency Opus audio for speech recognition
I am super happy to hear that Opus audio can be used for uploading speech to the speech-to-text API. However, I have a concern: because of the way Ogg pages and framing work, Opus packets are buffered for several seconds before being sent on the stream. This makes OggOpus useless for real-time speech transcription, though it is fine for the REST API.
For real-time transcription over WebSocket, I would appreciate an Opus protocol that works around the Ogg buffering issue, for example by using RTP headers or a custom size-prefix scheme to frame the raw Opus packets. I have spoken with the Opus developers about this, and they have a few recommended practices.
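
To make the size-prefix idea concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration only: the 2-byte big-endian length prefix and the names `frame_packet` and `Deframer` are hypothetical, not something the API supports today and not necessarily what the Opus developers recommend. The point is just that each encoded Opus packet can be sent as soon as the encoder produces it, without waiting for an Ogg page to fill.

```python
import struct
from typing import List

# Hypothetical wire format: 2-byte big-endian length, then one raw Opus packet.
_PREFIX = struct.Struct(">H")


def frame_packet(packet: bytes) -> bytes:
    """Prefix a single encoded Opus packet with its length."""
    if not packet:
        raise ValueError("empty packet")
    if len(packet) > 0xFFFF:
        raise ValueError("packet too large for a 16-bit length prefix")
    return _PREFIX.pack(len(packet)) + packet


class Deframer:
    """Reassembles packets from a byte stream that may be split arbitrarily."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes) -> List[bytes]:
        """Add received bytes and return every packet that is now complete."""
        self._buf += data
        packets = []
        while len(self._buf) >= _PREFIX.size:
            (length,) = _PREFIX.unpack_from(self._buf, 0)
            end = _PREFIX.size + length
            if len(self._buf) < end:
                break  # wait for the rest of this packet
            packets.append(bytes(self._buf[_PREFIX.size:end]))
            del self._buf[:end]
        return packets
```

On the client, each packet coming out of the encoder would be passed through `frame_packet` and written to the socket immediately. The server would call `Deframer.feed` on whatever bytes arrive and hand each complete packet to the decoder. If the WebSocket transport already preserves message boundaries, sending one Opus packet per binary message might make even the prefix unnecessary, but I have not checked how the service handles message boundaries.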