
To select the appropriate languages and acoustic environments, the global market for consumer devices and the typical application fields are considered. Analysis of the market leads to the selection of the languages. Analysis of the typical applications leads to the specification of the typical acoustic environments in which the voice driven interface must work. For the typical applications, the functionality of the voice driven interfaces, i.e. the functionality of the integrated recognizers, will be specified. Examples of such functionality are the recognition of commands, numbers, names, dates etc. To transfer the recognizers to other languages and specific acoustic environments, spoken language resources (SLR), i.e. language and application specific speech databases, are used. (The SLR are used to train the acoustic models of the recognizers.) Within the consortium a set of speech databases will be produced that allows the voice driven interfaces to be transferred to 18 languages (dialectal zones) and to typical acoustic environments as found at home, in the car and in public places. Details of the SLR specifications are derived from the analysis of the functionality of the voice driven interfaces. To broaden the range of applications, specific adaptation techniques are developed that adapt the speech databases to other acoustic environments. With these techniques, the degradation of recognizer performance in environments not covered by the SLR training data should be reduced as much as possible. Given the known adaptation techniques, innovative ideas are needed to achieve this goal. In a final step, the chosen transfer approach is demonstrated with voice driven interfaces for some languages and prototypical applications.
Future devices will be equipped with a multi-modal man-machine interface that allows a more user-friendly mode of communication with machines. A key element of these interfaces is their capability to recognize speech.
Purely speech driven interfaces have been introduced successfully in the area of centralized telephone applications (interactive voice response systems). The success of these applications is reflected in a steadily growing market, which reached a worldwide volume of 610 million Euro in 1998 and is growing by 25% per year. Due to the rapid progress in semiconductor technology it is now possible to integrate speech driven interfaces into consumer devices, e.g. mobile telephones, TV control sets, PDAs and car navigation kits. The market for speech interfaces in consumer applications is new (in 1998 it had a volume of 27 million Euro) but is growing very rapidly, at a rate of about 100%. For Europe this new market has two major impacts:
To develop the market of speech driven interfaces for consumer devices successfully, two essential technical obstacles have to be removed:
The project SPEECON focuses on solving these two problems using a language transfer technology based on spoken language resources (SLR) and specific know-how in acoustic adaptation techniques. The partners involved in the project are leading players in the market of consumer devices and have profound know-how in speech recognition technology and rich experience in running EU-funded projects of this kind. A further SME-related benefit of SPEECON is that it will make the SLR produced within the project commercially available via the distribution channel ELRA (European Language Resources Association). This allows SMEs to play an active role in the market of speech driven interfaces for consumer applications and to stimulate the market with innovative ideas.
These varying conditions for consumer devices can be described by the more typical means of
but further, more important language- and people-related factors become relevant:
Over the last years, a lot of research and language resource acquisition effort has been invested in speech recognition systems that operate in rather quiet environments with high-quality noise-reducing microphones (e.g. desktop systems), that use very small vocabularies in noisier environments, or that recognize speech over land lines and, more recently, over wireless telephone lines with rapidly improving transmission quality. The market of consumer devices, in contrast, requires speech enabled applications to operate in almost any environmental condition (irrespective of the noise level) under inherent constraints on manufacturing costs (e.g. very cheap microphones, small CPU resources, multi-linguality). Today's speech recognition systems for consumer devices have limited capabilities to adapt dynamically to changing acoustic conditions. The main method of minimizing the degradation of recognition performance, and so of satisfying the users operating the device, is to ensure a close match between the acoustic data used for building the system and the data encountered in actual usage. Typically, acoustic data has to be acquired and a specific speech recognizer trained for each specific environmental constellation. This process is enormously expensive in terms of data collection cost and time to market. Thus the availability of technologies for the rapid generation of properly matching acoustic training data for the target environment is key to the wide distribution of speech enabled consumer devices. The data can be used for the actual building of a speech recognizer but also for the long-term exploration and study of new algorithms for the dynamic adaptation of speech recognizers to changes in the acoustic environment.
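As an illustration of what generating matching training data for a target environment can look like in its simplest form, the sketch below mixes a clean recording with environmental noise at a chosen signal-to-noise ratio (SNR). This is a generic additive-noise simulation and not the adaptation technique developed in SPEECON; the function names (rms, mix_at_snr, measured_snr_db) and the synthetic tone and white noise are assumptions made for the example, and real adaptation would also have to account for room and microphone effects.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean, scaled so that the mixture has the requested SNR in dB.

    The noise is looped if it is shorter than the clean signal.
    """
    if not clean or not noise:
        raise ValueError("clean and noise must be non-empty")
    # Repeat or truncate the noise to the length of the clean signal.
    tiled = [noise[i % len(noise)] for i in range(len(clean))]
    noise_rms = rms(tiled)
    if noise_rms == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    # SNR_dB = 20 * log10(rms_clean / rms_noise), solved for the target noise RMS.
    target_noise_rms = rms(clean) / (10.0 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [c + gain * n for c, n in zip(clean, tiled)]


def measured_snr_db(clean, mixture):
    """SNR of a mixture given the clean reference (the residual is the noise)."""
    residual = [m - c for c, m in zip(clean, mixture)]
    return 20.0 * math.log10(rms(clean) / rms(residual))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins (assumptions): a 440 Hz tone as "speech",
    # uniform white noise as "environmental noise", 16 kHz sampling.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
    noisy = mix_at_snr(clean, noise, snr_db=10.0)
    print(round(measured_snr_db(clean, noisy), 3))
```

In practice the same clean database could be mixed with noise recorded in several target environments and at several SNR values, which is one reason simulated data is attractive compared with a new recording campaign for each environment.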
To approach these diverse conditions systematically, we will focus not only on language transfer technologies based on spoken language resources relevant for consumer devices operating in various acoustic environments (defined by several applications and languages) but also on the transfer of recognition systems built on such data across various applications, languages and acoustic environments. To allow acoustic data production for virtually any target environment, we will develop and apply adaptation technologies which allow the merging of acoustic data and target environmental noise (e.g. background noise of a public place, hands-free). The optimization criterion for these algorithms is to maintain recognizer performance at an accuracy level comparable to that of a system built with acoustic data from the matching environment. Given the known adaptation techniques, innovative ideas are needed to achieve this goal. The quality of the noise adaptation techniques will be evaluated by comparing systems trained on acoustic data collected from the target domain with systems built on simulated (merged) data. The proof of concept of these new approaches will be given by building prototypes that demonstrate transfer across languages and applications. The prototypes will be operated in off-line experiments (on independent test set data) and in online experiments with potential users who are naive with respect to the use of speech processing systems. Thus the main innovation aspects advancing the state of the art can be summarized as
In addition the results of the project allow
One of the major concerns of the European Union's Fifth Framework Programme, and especially of the IST Programme, is the goal of a user-friendly information society. Particularly in Europe, an essential aspect is the issue of linguistic and cultural diversity in global information and communication systems. SPEECON aims to contribute to the framework laid out by the European Commission along the following lines:
Multilinguality and acoustic diversity:
Natural interactivity:
Industrial benefit:
ELRA (European Language Resources Association, not a partner of the consortium) will benefit by disseminating speech databases that will be created within the project, and by doing so will also foster additional R&D opportunities.
One of the primary challenges the Information Society has to face is the diversity inherent in the European Union. One of the EU's key objectives is to keep national characteristics as a source of cultural and intellectual wealth. The SPEECON partners are based in many European countries and, as a multi-national consortium, SPEECON is determined to develop speech-driven applications in one of the businesses that promise future economic growth. Lying on the borderline between traditional electronics and high-end computing, the field of Consumer Electronics (CE) today brings together two research and business traditions, and so does the consortium. Even though the field is currently being tilled by each partner separately, the decision to pool expertise in speech processing appears to be a meaningful step to boost further progress in the field, to create user-friendly, multi-lingual interfaces for CE devices to come, to make such devices easily affordable for the broad public and to ensure a larger European share in the world market.
Due to the rapid progress in semiconductor technology it is now possible to integrate speech driven interfaces into consumer devices. This progress allows European citizens to interact with consumer devices in a more user-friendly way. Developing the databases needed to enable products that provide this natural interaction is one of the SPEECON objectives.
The consortium has the necessary critical mass (leading companies) and commitment (the project objectives clearly match the strategy of the consortium members). The consortium is therefore confident that it is well positioned to achieve the envisaged goals and to create and penetrate the market of voice driven consumer devices to the benefit of European citizens. Developing the SPEECON databases is very expensive, so this activity is not feasible for an SME. With this in mind, SPEECON will make the databases produced within the project commercially available via the distribution channel ELRA (European Language Resources Association). This will allow SMEs to play an active role in the market of speech driven interfaces for consumer applications, to stimulate the market with innovative ideas, and to increase their competitiveness in the global marketplace.
Many predict that voice-activated services for consumer applications, e.g. mobile phones, car navigation systems or VCRs, will become THE user interface of the near future. The bright prospects for an extensive dissemination and exploitation of SPEECON's principal outcomes, i.e. speech driven interfaces (SDIs) for consumer applications, are supported by the following evidence: SDIs will be one of the key future features of consumer electronics, whose annual sales already exceed several million units. They are considered faster and easier to use, also for non-technical users (simply utter a word instead of pressing a long sequence of buttons or turning several knobs), and often they are the only means of controlling a device (e.g. for people with disabilities, or when an appliance has to be operated in absolute darkness or with no hands free). It is widely accepted that SDIs will make everyday life safer in many ways. For example, they allow car drivers to operate their radio without taking their hands off the steering wheel (which is already a crucial point in many countries for legal reasons), or a person simply has to say "emergency call" to the phone and the connection will be set up immediately, without the need to recall the correct emergency number (which usually depends on where the caller is located). Consumer devices are continually becoming smaller. Interaction with buttons, knobs, switches etc. will thus become more and more complicated and cumbersome, simply because there will be no space left on the device for these controls. A very obvious way out will be the use of SDIs. On the other hand, the possible abandonment of such controls will in turn inspire new ideas for innovative applications. The companies involved in SPEECON are focusing on practicable products, and they are working on many different applications into which SDIs are to be introduced.
They have realized that the continuous development and application of their innovative technologies is the best means of succeeding in global competition. The main target of an exhaustive exploitation of the speech databases will surely be robust speaker independent recognition of the utterances needed for controlling the different consumer applications. But there will also be other spin-offs from the collection of speech databases, e.g. products for speaker identification and verification as well as multilingual speech understanding and translation systems. All these aspects lead to the conviction that there will soon be a demand for SDIs in a large and growing variety of applications, which ensures the dissemination and exploitation of SDIs in a natural way. Mass production of SDIs on many small, cheap and highly integrated circuits will bring prices down, which in turn will stimulate the development of even more applications with integrated SDIs. Finally, this spiral will also lead to continual growth in the dissemination of SDIs in all kinds of consumer applications. Another aspect is the improvement of competitiveness in the field of multilingual speech recognition as well as the creation of new market opportunities. One of the basic concepts in SPEECON is the collection of speech data for distinct applications in all main European languages, which already cover a market of about 700 million inhabitants. This will pave the way for multilingual consumer electronics that work in all European languages, and work uniformly in the same way, so that every European can communicate with all the developed consumer electronics in his or her own native language. This constitutes an important milestone towards the growing together of Europe's multilingual and multicultural society.
To achieve the objectives of the project, the work is divided into five Workpackages: Market Analysis (WP1), Specification of Databases (WP2), Creation of Speech Databases (WP3), Assessment & Evaluation (WP4) and Dissemination & Exploitation (WP5). Management of the project is handled as WP0. The project can be divided into three main phases: a preparation phase, a main phase and a dissemination phase. In the preparation phase a thorough market analysis is made, leading to the specification of the databases. This work is handled in WP1 and WP2 and lasts about 6 months. In parallel to these two workpackages, WP4 starts with the definition of validation criteria for the speech databases and an evaluation of the potential of database adaptation methods. Within this preparation phase there is a close link between the database adaptation activity and WP1 and WP2, as the potential of such adaptation methods influences the decisions about which environmental situations should be recorded and where databases can instead be adapted algorithmically to certain environments. In the main phase, which lasts about 14 months, 18 databases are created (WP3) and validated (WP4). In parallel, the tools to adapt the databases are developed and evaluated (WP4). The main activities of the dissemination phase, which lasts until the end of the project, are located in WP5. In this workpackage the feasibility of transferring the speech recognition technology to other languages and environments is shown via 3 demonstrators. Furthermore, the databases are disseminated via ELRA, the adaptation tools are disseminated within the project, and the Dissemination and Use Plan and the Technology Implementation Plan (TIP) are delivered.
Activities within the Workpackages
Workpackage WP1: Market analysis
The recent advances in speech recognition technology now make it possible to build products and applications with more user-friendly man-machine interfaces in many languages.
This considerable improvement is undoubtedly linked to the availability of large multilingual speech databases. The goal of this project is to tackle the consumer applications market, which represents a new domain for speech recognition in which different types of words and dialogues will be used compared to more traditional speech recognition applications (interactive voice response systems, dictation, name dialers etc.). It is well known that robust speech recognition still needs training data that are representative of the operating conditions. Usually this is achieved by using training data that are recorded in context, and most of the currently available databases focus on a limited set of environments with little or no possibility to study the real effects of the different acoustic environments. The goal of this workpackage is to determine the needs of voice-driven interfaces based on application and market analysis. The project targets consumer market applications with voice driven interfaces. They include:
The first two classes of applications are characterized by a high degree of user mobility. These devices will therefore be used in many different acoustic environments, such as at home, in cars, in streets, in airport and train station halls, or in trains. The terminals will also be used in different conditions (close talk / hands free). The typical users will mostly be adults, although mobile phones are already being used by children. For the third class of applications, the typical environments will be living rooms (or offices) and possibly kitchens. These environments are less diverse than those for the first two classes but still encompass situations with babble noise (such as conversational noise), background music or other domestic noises. For some applications, the distance between speaker and microphone is much greater than for the applications of the first two classes, which can cause a low signal-to-noise ratio. For all kinds of applications, it is important that the voice-driven interface remains usable for all potential users irrespective of gender, regional accent and age.
Workpackage WP2: Specification of databases
A prerequisite for a successful acquisition of spoken language resources is a comprehensive specification of the speech data to be collected. The main issue of this workpackage is to specify databases which adequately cover the wide range of potential consumer applications. Given a restricted number of utterances recorded from a restricted number of speakers, speech databases have to be specified from which competitive recognizers can be trained for all the envisaged applications. The specification of the databases is a joint effort of all partners and is based on the analysis of the market and of the target applications described in WP1.
The main issues are the definition of the corpus, the recording platforms for the different environmental conditions, the number of speakers to be recorded, the characterization of the speakers with respect to age, sex and dialect, and the definition of the transcription criteria to be applied. A further important source of know-how will be the deliverables from the various SpeechDat projects. Major changes to the SpeechDat deliverables have to be made concerning the specification of the content of the databases, the design of the platforms and the validation criteria. All the available skill of the consortium is needed to come up with a fruitful solution, which hopefully will lead to an industrial standard for the specification of speech databases for consumer applications. The specification itself is performed in four stages, namely
Workpackage WP3: Creation of databases
The input to language transfer consists of speech databases for different dialectal zones. The speech databases will be created in two stages: first, the platforms will be installed and a 10-speaker database will be created; second, speaker recruitment, recording and annotation will be performed. Each principal contractor (except DCAG) will be responsible for the creation of at least two speech databases. The following dialectal zones are envisaged:
According to the specifications derived in WP2, recording platforms have to be built up, appropriate speakers recruited and recorded in the predefined acoustic environments. Finally the recordings have to be annotated as specified. This database creation process has to be done for each dialectal zone specified in WP2. WP3 is split into 2 tasks, namely
Workpackage WP4: Assessment and evaluation
This workpackage aims to validate the recorded speech databases of SPEECON and to develop and assess transformation techniques designed to adapt clean speech databases to specific noise and "room effect" conditions. The objectives of WP4 are therefore as follows:
Workpackage WP5: Dissemination and exploitation
The objective of this Workpackage is to demonstrate language transfer of speech driven interfaces for prototypical consumer devices using the results of WP1, WP2, WP3 and WP4, and to make the outcome and progress of the project available to the public. The Workpackage is divided into four tasks, namely:
In WP5, language transfers of speech driven interfaces will be performed for a substantial part of the languages that are addressed in the project. Three applications, operating in different environmental conditions, will demonstrate whether the strategies for handling language transfer are feasible.
Additional information can be obtained from the SPEECON consortium by sending an e-mail to info@speecon.com.
© SPEECON 2000-2001. Page last updated on January 26, 2004.