14 March 2017

An Introduction to Azure Cognitive Services – Speech to Text Sentiment Analysis


Human-machine interaction is nothing new these days. As time goes by, new systems are developed that make machines capable of performing tasks once performed by humans.

These systems still need supervision, since a fully autonomous system does not yet exist: machines cannot think or make decisions the way humans can. However, by subdividing complex models into simpler ones and applying mathematical algorithms, machines can perform tasks a human would do, only much faster and with far less susceptibility to failure.

Microsoft Cognitive Services offers a range of services that enrich solutions with cognitive functionality, such as image recognition, search and speech-to-text conversion, with support for multiple languages.

These services also improve over time: the more often they are used (machine training), the more accurate their output becomes.


In the scope of this demo, we will build a scenario whose main goal is to evaluate speech files and store the results in a database for later use.

Imagine that you have recorded customer service calls in your possession and need to evaluate them.

In order to achieve this goal, we will use the services provided by Microsoft Cognitive Services: Speech to Text and Text Analytics.

These services return a range of values, such as keywords extracted from the recognized text, language identification, sentiment analysis and speech-recognition confidence.


Microsoft provides many demos that can be used to bootstrap new projects. They can be found in the Microsoft GitHub repository at https://github.com/Microsoft.

To start testing these APIs, create an account at https://www.microsoft.com/cognitive-services/en-us/apis and subscribe to a free daily quota. The free quota is usually more than enough for testing purposes.

Let us start by selecting an audio file in our application and sending it to the Microsoft Speech Service.

First, create an HTTP WebRequest to the endpoint https://speech.platform.bing.com/recognize, building a URL with the required query-string parameters.
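A minimal sketch of building that request is shown below. The query-string parameter names follow the 2017-era Bing Speech REST API and may differ in later versions; the `accessToken` variable is assumed to hold a token obtained beforehand from the Cognitive Services token endpoint, and the locale is an illustrative choice.

```csharp
using System;
using System.Net;

// Assumed to be obtained from https://api.cognitive.microsoft.com/sts/v1.0/issueToken
// using the subscription key (not shown here).
string accessToken = "<your-access-token>";

string requestUri = "https://speech.platform.bing.com/recognize"
    + "?version=3.0"
    + "&requestid=" + Guid.NewGuid()      // unique id for this request
    + "&appid=<your-app-id>"              // app id used by the samples
    + "&format=json"                      // ask for a JSON response
    + "&locale=pt-PT"                     // language spoken in the audio file
    + "&device.os=Windows"
    + "&scenarios=ulm"
    + "&instanceid=" + Guid.NewGuid();    // unique id for this device/instance

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUri);
request.Method = "POST";
request.ContentType = "audio/wav; codec=\"audio/pcm\"; samplerate=16000";
request.Headers["Authorization"] = "Bearer " + accessToken;
```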


Before sending the request, the audio file needs to be read into a FileStream and written to the request body.
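This step can be sketched as follows, assuming `request` is the HttpWebRequest created above and `audioFilePath` points to a 16 kHz PCM WAV file (the format the service expects):

```csharp
using System.IO;
using System.Net;

// Copy the audio file into the request body in small chunks.
using (FileStream fs = new FileStream(audioFilePath, FileMode.Open, FileAccess.Read))
using (Stream requestStream = request.GetRequestStream())
{
    byte[] buffer = new byte[1024];
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        requestStream.Write(buffer, 0, bytesRead);
    }
}
```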

If the request is successful, a JSON-formatted string is returned, from which we extract the information of interest.

The lexical field contains the text captured from the audio file. The second field, confidence, holds a value between 0 and 1 corresponding to the degree of confidence in the recognition, where 0 means no confidence in the result and 1 means total certainty.
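Reading those two fields from the response could look like the sketch below. It assumes the 2017-era Bing Speech JSON shape, where the recognition candidates live in a `results` array, and uses Json.NET for parsing:

```csharp
using System.IO;
using System.Net;
using Newtonsoft.Json.Linq;

// Read the raw JSON response from the service.
string json;
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    json = reader.ReadToEnd();
}

// Pick out the recognized text and the recognition confidence
// from the first (best) result.
JObject result = JObject.Parse(json);
string lexical = (string)result["results"][0]["lexical"];
double confidence = (double)result["results"][0]["confidence"];
```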

Once the text is obtained, we move on to the sentiment evaluation service, which receives its input in the JSON format shown in the image below.

The C# class below translates the JSON object shown above.
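As a reference, a pair of classes mirroring the Text Analytics v2.0 request body could look like the following sketch (class and property names are illustrative; the lowercase property names match the JSON fields `documents`, `id`, `language` and `text`):

```csharp
using System.Collections.Generic;

// Request body shape:
// { "documents": [ { "id": "1", "language": "pt", "text": "..." } ] }
public class TextAnalyticsInput
{
    public List<Document> documents { get; set; } = new List<Document>();
}

public class Document
{
    public string id { get; set; }
    public string language { get; set; }
    public string text { get; set; }
}
```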

With the text obtained in the previous response, we make a new request to the Text Analytics service in order to identify its language.
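A minimal sketch of the language-detection call, assuming the Text Analytics v2.0 REST endpoint (the `westus` region in the URL depends on where the resource was created) and a subscription key in `textAnalyticsKey`:

```csharp
using System.Net.Http;
using System.Text;
using Newtonsoft.Json;

var client = new HttpClient();
client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", textAnalyticsKey);

// The detected text from the speech step goes into a single document.
var body = new { documents = new[] { new { id = "1", text = lexical } } };
var content = new StringContent(
    JsonConvert.SerializeObject(body), Encoding.UTF8, "application/json");

HttpResponseMessage response = await client.PostAsync(
    "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/languages",
    content);

// The response lists the detected languages per document, e.g.
// {"documents":[{"id":"1","detectedLanguages":[{"name":"Portuguese",
//   "iso6391Name":"pt","score":1.0}]}],"errors":[]}
string languageJson = await response.Content.ReadAsStringAsync();
```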

Again, with the same input, we make another request to extract the sentiment information. This service returns a value between 0 and 1 indicating how positive the evaluated text is, where 0 is unhappy and 1 is very happy.
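The sentiment call follows the same pattern, this time against the v2.0 `sentiment` endpoint and including the language detected in the previous step (variable names continue from the snippet above):

```csharp
using System.Net.Http;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Same document, now tagged with the detected language ("pt" here).
var sentimentBody = new
{
    documents = new[] { new { id = "1", language = "pt", text = lexical } }
};
var sentimentContent = new StringContent(
    JsonConvert.SerializeObject(sentimentBody), Encoding.UTF8, "application/json");

HttpResponseMessage sentimentResponse = await client.PostAsync(
    "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment",
    sentimentContent);

// Response shape: {"documents":[{"score":0.989,"id":"1"}],"errors":[]}
string sentimentJson = await sentimentResponse.Content.ReadAsStringAsync();
double sentimentScore = (double)JObject.Parse(sentimentJson)["documents"][0]["score"];
```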

With all the values of interest in hand, we insert a new record into the database so that we can use this data in the future.
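The insert can be sketched with plain ADO.NET; the table and column names below are hypothetical, and the values come from the earlier steps:

```csharp
using System.Data.SqlClient;

// Hypothetical table: SpeechAnalysis(Text, Confidence, Language, SentimentScore)
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "INSERT INTO SpeechAnalysis (Text, Confidence, Language, SentimentScore) " +
    "VALUES (@text, @confidence, @language, @score)", conn))
{
    cmd.Parameters.AddWithValue("@text", lexical);
    cmd.Parameters.AddWithValue("@confidence", confidence);
    cmd.Parameters.AddWithValue("@language", "pt");
    cmd.Parameters.AddWithValue("@score", sentimentScore);

    conn.Open();
    cmd.ExecuteNonQuery();
}
```

Parameterized commands are used here rather than string concatenation, which keeps the insert safe against SQL injection regardless of what the recognized text contains.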



As you can see, the Speech to Text and Text Analytics services produced a pretty good recognition, with a confidence of about 89%, and I can confirm the transcription is entirely correct. We can also see that Portuguese was identified as the language and that the person speaking in the audio file is very happy, with a score of about 98.9%.

Contrary to the result obtained above, this evaluation clearly shows a very negative sentiment in the evaluated sentence, with a happiness score of about 0.09%.

Once the results are stored in a database, we are free to work with this information. For example, we can analyze the average sentiment of the evaluated files, or process the detected text itself with our own or third-party algorithms. This information has a wide range of possible applications.





   Bruno Rodrigues

Software Engineer