Microsoft

Speaker Recognition

Welcome to the Speaker Recognition Forum

Categories

API – Any ideas or feedback pertaining to features or enhancements to Speaker Recognition API.

Documentation – Any ideas or suggestions for the API Reference or Documentation.

Language Support – Submit a request to have a particular language supported.

Samples & SDK Request – Let us know if you would like to see a Code sample or SDK provided.


  1. Speaker diarization for more than 2 speakers

    Speaker diarization for more than 2 speakers.

    See this one: https://cognitive.uservoice.com/forums/555925-speaker-recognition/suggestions/34823824-add-support-for-speaker-diarization-for-untrained

    I don't feel this should be marked as resolved. I would expect support for at least 10 speakers. Additionally, the current diarization is really poor and switches between speaker 1 and 2 almost randomly. Please make this more intelligent. It's a deal breaker for us and, I'm sure, for many others, especially considering that the Google alternative can handle unlimited speakers and is far more accurate at identifying them.

    https://cloud.google.com/speech-to-text/docs/multiple-voices

    And no... expecting a sample to train it for each voice is not an option. We literally just need it to assign a number…

    1 vote
    0 comments  ·  Speaker Identification
  2. Add support for Speaker Diarization for untrained speakers.

    Distinguish between multiple speakers in a conversation without training the system first. IBM Watson currently supports this: https://www.ibm.com/blogs/bluemix/2017/05/whos-speaking-speaker-diarization-watson-speech-text-api/

    Given an audio recording of a conversation, the minimum I'm looking for is:
    Speaker 1 (0:01-0:03): Hi Ted, how are you today?
    Speaker 2 (0:04-0:05): I'm doing well, how about you?
    Speaker 1 (0:05-0:10): Good thanks. So the reason I called you today was to discuss your recent sales performance.

    Ideally each word would be timestamped so we could highlight the spoken word when displaying the transcription next to the playing audio. Also it would be nice if each word had a…

    12 votes
    8 comments  ·  Speaker Identification
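
The output format requested in this idea can be sketched as a small data structure. This is only an illustration of the desired shape: the `Segment` class, its field names, and the helper functions are hypothetical and are not part of the Speaker Recognition API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized stretch of speech (hypothetical structure, not an API type)."""
    speaker: int   # anonymous speaker number; no enrollment is assumed
    start_s: int   # segment start, in whole seconds
    end_s: int     # segment end, in whole seconds
    text: str      # transcribed words for this segment


def fmt_time(seconds: int) -> str:
    """Render whole seconds as M:SS, matching the timestamps in the request."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


def format_segment(seg: Segment) -> str:
    """Render a segment in the 'Speaker N (start-end): text' layout."""
    return (
        f"Speaker {seg.speaker} "
        f"({fmt_time(seg.start_s)}-{fmt_time(seg.end_s)}): {seg.text}"
    )


segments = [
    Segment(1, 1, 3, "Hi Ted, how are you today?"),
    Segment(2, 4, 5, "I'm doing well, how about you?"),
]
for seg in segments:
    print(format_segment(seg))
```

Per-word timestamps, as the idea also asks for, would need a finer-grained structure than this per-segment sketch.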
  3. Process speaker identification immediately for short audio samples

    First off, this is an awesome API that I would love to use in my app. The big problem I have, though, is that it's not really usable for real-time, low-latency identification from short samples because:
    1. The asynchronous callback method requires me to poll the operation result endpoint constantly, which takes (by my measurement) about 1200 ms in the ideal case, whereas I would really prefer results within 400-500 ms.
    2. Each poll on the operation status costs me QPS, which triggers throttling if I poll too often.

    I would propose the following change to the speaker identification…

    2 votes
    0 comments  ·  Speaker Identification
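
Until a synchronous option exists, one common way to limit the throttling described in this idea is to poll with a growing delay. The sketch below is a generic pattern under stated assumptions: `fetch_status` stands in for whatever call reads the operation status, and the `"status"` key with the values `"succeeded"` and `"failed"` is a hypothetical response shape, not the documented one.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    fetch_status: Callable[[], dict],
    initial_delay_s: float = 0.2,
    max_delay_s: float = 2.0,
    timeout_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll with exponential backoff; return the final result, or None on timeout."""
    delay = initial_delay_s
    waited = 0.0
    while True:
        result = fetch_status()  # one request against the rate limit
        # Hypothetical terminal states; adjust to the real response format.
        if result.get("status") in ("succeeded", "failed"):
            return result
        if waited + delay > timeout_s:
            return None  # give up rather than poll forever
        sleep(delay)
        waited += delay
        # Double the wait each round, capped so latency stays bounded.
        delay = min(delay * 2, max_delay_s)
```

Backoff trades latency for fewer requests, so it eases the throttling problem but does not address the 400-500 ms target in the request.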
  4. How to reduce response time for identification requests?

    When testing in Python, each identification request takes eight to nine seconds to get a response. Is this due to the Internet connection, or does the identification model itself take that long to process the request? Is there any way to get a response faster? Thank you.

    2 votes
    0 comments  ·  Speaker Identification
  5. Handle voice matches that are 100% or very close to 100%, to prevent someone from using prerecorded audio of others

    Microsoft Speaker Identification should handle the case where the voice match is 100% or very close to 100%. This could happen if someone has voice recordings of another person and tries to use them to authenticate or verify.

    2 votes
    0 comments  ·  Speaker Identification
    Planned  ·  Luke Bayler responded

    Hello,

    We have plans for a new verification feature that prompts customers with random verification phrases to be robust against replay attacks.

    Thanks,
    Luke

  6. Real time Speaker Recognition

    Hello, is it possible to use the Speaker Recognition API to perform real-time identification? I have been trying to get some help on this, but without much success.

    2 votes
    1 comment  ·  Speaker Identification
  7. Define accuracy level of each request to accept or reject the outcome

    Rigorous testing shows that the accuracy of the Speaker Identification API is not that good. It gives erroneous output in both positive and negative scenarios. In the positive scenario, when different voice samples from the same user are tested against each other, it often fails to give the expected result. In the negative scenario, when voice samples from different users are tested against each other, it wrongly identifies them as the same user on many occasions. I feel that adding an accuracy level to the output would help the end user, to an extent, to decide whether to accept the outcome…

    3 votes
    Resolved  ·  1 comment  ·  Speaker Identification
  8. Give percentage match instead of categorizing results as High, Medium, or Low

    Instead of categorizing the Speaker Identification API response as High, Medium, or Low, give the percentage match the service has computed. The user should decide what the cutoff for a potential match should be.

    7 votes
    2 comments  ·  Speaker Identification
    Planned  ·  Luke Bayler responded

    Hello,

    This is currently planned for a future release.

    Thanks,
    Luke
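
If numeric scores do become available, the caller-side cutoff this idea describes might look like the sketch below. Everything here is an assumption for illustration: the score range of 0.0 to 1.0, the mapping of profile IDs to scores, and the function name `best_match` are hypothetical and do not reflect the current or planned response format.

```python
from typing import Dict, Optional


def best_match(scores: Dict[str, float], cutoff: float) -> Optional[str]:
    """Return the highest-scoring profile ID if it meets the cutoff, else None.

    `scores` maps a profile ID to a match score in [0.0, 1.0] (assumed range).
    """
    if not scores:
        return None
    profile_id, score = max(scores.items(), key=lambda item: item[1])
    # The application, not the service, chooses what counts as a match.
    return profile_id if score >= cutoff else None


candidates = {"alice": 0.91, "bob": 0.40}
print(best_match(candidates, cutoff=0.80))  # strict enough to accept alice
print(best_match(candidates, cutoff=0.95))  # stricter cutoff rejects everyone
```

A stricter cutoff reduces false accepts at the cost of more false rejects, which is why the idea asks for the decision to be left to the user.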

  9. Recognize multiple speakers in audio file and when they speak

    For example, take a 2-minute audio file: Speaker A for the first 30 seconds, then Speaker B from 0:30 to 1:30, and then Speaker A again from 1:30 to 2:00.

    34 votes
    8 comments  ·  Speaker Identification
  10. Speaker Recognition with shorter phrase?

    I would love to create a plug-in for my home automation, which already uses Kinects, that can utilize the speaker identification from Oxford. The main issue is that most statements are short, e.g. "Computer, turn on family room light." So I never generate a 20-second clip. Recognition with a clip of about 5 seconds would be great, even if recognition is only, say, 80% accurate for this case.

    7 votes
    2 comments  ·  Speaker Identification
    Completed  ·  Raymond responded

    We have released a new feature that allows you to waive the audio limit. Just add the “ShortAudio” parameter to instruct the service to waive the recommended minimum audio limit needed for enrollment. Set its value to “true” to force enrollment using any audio length.

    More details can be found here:
    - https://dev.projectoxford.ai/docs/services/563309b6778daf02acc0a508/operations/5645c3271984551c84ec6797
    - https://dev.projectoxford.ai/docs/services/563309b6778daf02acc0a508/operations/5645c523778daf217c292592
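
As a rough illustration of the response above, the parameter could be appended to a request URL as a query string. This is a sketch under assumptions: the base URL is a placeholder, `with_short_audio` is a hypothetical helper, and the parameter name is copied from the response as written, so its exact spelling, casing, and placement should be checked against the linked reference pages.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_short_audio(url: str, short_audio: bool = True) -> str:
    """Return `url` with the ShortAudio query parameter set (name per the response)."""
    parts = urlsplit(url)
    # Drop any existing ShortAudio value so the parameter appears exactly once.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "ShortAudio"]
    query.append(("ShortAudio", "true" if short_audio else "false"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder endpoint; the real enrollment URL is given in the linked docs.
print(with_short_audio("https://example.invalid/enroll"))
```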
