Language Detection And Automatic Metadata Insertion On Audio Input:Main Page

From ACT4D Project Wiki



What is it

One of the most exciting uses of a computer is interaction between human and machine through speech. In the past, recognition rates were not good enough, so a great deal of research went into speech recognition. That work produced acceptable recognition rates, but the technology did not attract many users. Recently, speech recognition has returned to the spotlight because of mobile handsets. The emergence of smartphones (Android, iPhone, Windows Phone 7, BlackBerry, Symbian, Bada, etc.) gave people a new type of input method: because a mobile handset is too small for comfortable text entry, speech input is in demand. Speech recognition itself covers many categories of task.

Community Radio

There is considerable demand for community radio (CR) in rural areas. CR is a type of radio service that offers a third model of broadcasting alongside commercial and public broadcasting. It is deployed all over the world, and its character differs from country to country. India has had one of the better-structured CR sectors since the 1990s. CR is community-run radio for a specific region, where listeners share their experience of agriculture, fishing, animal farming, and so on. They can also report on government corruption, children's education, women's rights, culture, infrastructure, gender, the environment, and similar issues, or simply tell everyday stories as on any other radio channel. Because the content is entirely local, listeners relate to it more strongly, which makes it more powerful.

Solution for CR : Speech Recognition

However, many people who live in rural areas are illiterate, so they cannot write down content to send in. A new method for collecting stories is therefore in demand, and voice input could be one solution. Moreover, many broadcast stations already conduct real-time interviews by phone. If those calls could be recorded as data and categorized by keyword (government corruption, children's education, women's rights, culture, infrastructure, gender, environment, experience of agriculture, fishing, animal farming, the place of recording such as Gujarat, Delhi, or Bihar, etc.), the result could be a very powerful tool. People would only need to make a single call to share their thoughts.
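The categorization step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category names, the keyword lists, the `tag_recording` function, and the recording identifier are all hypothetical, and the sketch assumes that some keyword spotter has already produced the list of keywords heard in a call.

```python
# Hypothetical mapping from metadata category to the keywords that signal it.
# The categories and words are illustrative, loosely based on the topics above.
CATEGORY_KEYWORDS = {
    "governance": ["corruption"],
    "education": ["education", "school"],
    "agriculture": ["agriculture", "farming", "fishing"],
    "place": ["Gujarat", "Delhi", "Bihar"],
}


def tag_recording(recording_id, spotted_keywords):
    """Build a metadata record for one recorded call.

    recording_id     -- any identifier for the audio file (hypothetical)
    spotted_keywords -- keywords reported by a keyword spotter for that file
    """
    spotted = {word.lower() for word in spotted_keywords}
    categories = sorted(
        category
        for category, words in CATEGORY_KEYWORDS.items()
        if any(word.lower() in spotted for word in words)
    )
    return {
        "recording": recording_id,
        "keywords": sorted(spotted),
        "categories": categories,
    }
```

For example, `tag_recording("call-001", ["Bihar", "corruption"])` would attach the categories `governance` and `place` to that call, so that a station could later retrieve all recordings on a given topic or from a given region.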

Language Independent SR : Keyword Spotting

An English speech recognition engine (SRE) is already available (on Android, Google's SRE shows a very good recognition rate). As you might imagine, however, voice input from rural areas is likely to be in Hindi or a local dialect, so a language-independent SRE is needed. Keyword spotting can be a solution: it searches the audio data for occurrences of a given keyword. Because only the keywords need to be trained, which is far less work than building a continuous speech database such as TIMIT, the approach can be applied to any language. For the purposes of my M.S. thesis work, I chose English and Hindi.


People around the world use roughly 6,500 languages or more. About 2,000 of those languages have fewer than 1,000 speakers, so given the small number of speakers we cannot expect a speech recognition engine to emerge for these minor languages. Only 50 to 100 languages are supported by automatic speech recognition (ASR) so far. Keyword spotting, however, could be used for these minor languages: it matches a given keyword against audio files and can therefore work independently of language. Hindi has a very large number of speakers (1st: Mandarin Chinese, 1,025 million; 2nd: Spanish, 390 million; 3rd: English, 328 million; 4th: Hindi, 240 million), yet it still does not have ASR. This research could therefore be useful for Hindi.

Background Study about Speech Recognition

Many tools are available nowadays. Among them, I used HTK, from CUED (Cambridge University Engineering Department), for MFCC feature extraction, and QuickNet for ANN (artificial neural network) training and for generating Gaussian posteriorgram probabilities through an MLP (multi-layer perceptron). QuickNet comes from the Speech Group at ICSI (International Computer Science Institute), Berkeley. I adopted a modified DTW (dynamic time warping) algorithm for comparing keywords with the query statement.
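To make the comparison step more concrete, here is a minimal sketch of subsequence DTW over posteriorgrams in Python. It is a plain textbook variant, not the modified DTW used in this project, whose details are not given on this page. It assumes each frame is a list of posterior probabilities (as an MLP might output) and uses the negative log of the dot product as the frame distance, which is an assumption on my part. The function names are hypothetical.

```python
import math


def frame_distance(p, q, eps=1e-10):
    """Distance between two posterior vectors: -log of their dot product."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))  # eps avoids log(0) for disjoint frames


def subsequence_dtw(keyword, utterance):
    """Find the best match of `keyword` anywhere inside `utterance`.

    Both arguments are lists of frames (posterior vectors).
    Returns (cost, start, end): the accumulated path cost divided by the
    keyword length, and the first and last utterance frame of the match.
    """
    n, m = len(keyword), len(utterance)
    if n == 0 or m == 0:
        raise ValueError("keyword and utterance must be non-empty")

    # First keyword frame: the match may start at any utterance frame for free.
    prev_cost = [frame_distance(keyword[0], utterance[j]) for j in range(m)]
    prev_start = list(range(m))

    for i in range(1, n):
        cur_cost = [math.inf] * m
        cur_start = [0] * m
        for j in range(m):
            # Predecessors: (i-1, j), (i-1, j-1) and (i, j-1).
            best, start = prev_cost[j], prev_start[j]
            if j > 0:
                if prev_cost[j - 1] < best:
                    best, start = prev_cost[j - 1], prev_start[j - 1]
                if cur_cost[j - 1] < best:
                    best, start = cur_cost[j - 1], cur_start[j - 1]
            cur_cost[j] = frame_distance(keyword[i], utterance[j]) + best
            cur_start[j] = start
        prev_cost, prev_start = cur_cost, cur_start

    # The match may also end at any utterance frame: take the cheapest one.
    end = min(range(m), key=lambda j: prev_cost[j])
    return prev_cost[end] / n, prev_start[end], end
```

A keyword would then be reported as detected when the returned cost falls below some threshold. That threshold would have to be tuned on held-out recordings and is not specified here.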

The Workflow

Rough sketch of workflow

Current Work

User end : mobile handset

Front end : Webserver

Back end : SRE


List of Keywords

Test sentence files for performance measure


Identity is a term used to describe a person's conception and expression of their individuality or group affiliations.

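Bharat ki sunskirti bahut purani hai (Hindi: "India's culture is very old.")

기반시설의 부족으로 빠른 성장이 이루어 지지 않고 있다. (Korean: "Rapid growth is not taking place because of a lack of infrastructure.")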

Weekly Updates

