Add support for Speaker Diarization for untrained speakers.
Distinguish between multiple speakers in a conversation without training the system first. IBM Watson currently supports this: https://www.ibm.com/blogs/bluemix/2017/05/whos-speaking-speaker-diarization-watson-speech-text-api/
Given an audio recording of a conversation, the minimum I'm looking for is:
Speaker 1 (0:01-0:03): Hi Ted, how are you today?
Speaker 2 (0:04-0:05): I'm doing well, how about you?
Speaker 1 (0:05-0:10): Good thanks. So the reason I called you today was to discuss your recent sales performance.
Ideally each word would be timestamped so we could highlight the spoken word while displaying the transcription next to the playing audio. It would also be nice if each word had a confidence score (0.0-1.0) associated with it.
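For illustration, a result for one of the turns above might look something like the following. This is a hypothetical shape to show the idea; none of these field names come from an actual API:

```json
{
  "speaker": "Speaker 1",
  "text": "Hi Ted, how are you today?",
  "words": [
    { "word": "Hi",  "start": "0:01.20", "end": "0:01.45", "confidence": 0.98 },
    { "word": "Ted", "start": "0:01.50", "end": "0:01.85", "confidence": 0.95 }
  ]
}
```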
We now have a speaker diarization/separation option available: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/batch-transcription. A sketch of a request body with the option enabled is below.
Please let us know if you have any feedback. Thanks!
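A minimal sketch of a create-transcription request body with both options enabled. The two `properties` entries are the ones discussed in this thread; the surrounding fields and the `recordingsUrl` placeholder are assumptions and should be verified against the linked batch transcription docs:

```json
{
  "name": "Sales call transcription",
  "description": "Diarized transcription with word-level timestamps",
  "locale": "en-US",
  "recordingsUrl": "<SAS URL to the audio recording in blob storage>",
  "properties": {
    "AddWordLevelTimestamps": "True",
    "AddDiarization": "True"
  }
}
```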
Gurucharan Subramani commented
Hi, when I try the speaker diarization feature with the request body below,
"AddWordLevelTimestamps" : "True",
"AddDiarization" : "True"
I get a 400 response with the error message below.
"message":"This locale does not support diarization."
I get the same error on an existing Speech Services instance and on a new one as well. Both were in the West US region (if it matters).
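If the error means diarization is gated per locale, one thing worth trying is pinning the locale explicitly to one the docs list as supported. A sketch of the relevant fields is below; en-US is an assumption here, not something confirmed in this thread:

```json
{
  "locale": "en-US",
  "properties": {
    "AddDiarization": "True"
  }
}
```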
Andrew Khazanovich commented
Any update on support for this? It's looking like a deal-breaker for using Cognitive Services for our transcription needs.