Short and mid-term databases for applications
The objective of this report is twofold:
To analyse the market demands in the range of teleservices and their impact on the design of the speech database.
To specify the impact from current and near-future speech recognition technology on the design of the speech database.
-
a pdf version is available from: D11.pdf
This catalogue describes speech databases in four languages that are to be delivered to ELRA for distribution. For some of them, speech samples are embedded in the document.
This document describes two procedures that can be followed to exchange the six speech databases described in D1.2.1. The central approach via ELRA is recommended, but a draft contract for bilateral exchanges is also supplied.
-
a pdf version is available from: D122.pdf
This deliverable report documents the specification of an agreed and prioritised set of databases needed in the short to mid-term within Europe.
This document provides a specification of the telephone speech databases to be collected by the industrial partners in the SpeechDat project in 8 languages - Danish, English, French, Swiss French, German, Italian, Portuguese and Spanish.
-
a pdf version is available from: D141.pdf
Working standards, distribution and production of SLR
Topics addressed: physical recordings, physical conditions, linguistic contents, database and storage issues, transcription and validation, assessment.
As a statement of general working standards, this report presents the EAGLES
handbook of standards and resources for spoken language systems which is
approaching its draft form for wide dissemination to the European spoken
language R&D community.
The report describes the background to the general EAGLES activities,
explaining the organisational structure and outlining the workplan for the
project. In particular, attention is drawn to the newly created European
Language Resources Association.
Computer-coding the IPA
What follows is a proposed keyboard-compatible coding for the entire set of
IPA symbols. It covers everything on the 1993 IPA Chart, including diacritics
and tone marks, and is put forward as a proposed standard way to transmit
IPA-transcribed material by e-mail and for similar purposes.
These proposals are fully set out with a reasoned explanation in
a 7000-word draft article "Computer-coding the IPA: a proposed
extension of SAMPA".
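
To illustrate the idea of a keyboard-compatible coding, the following minimal sketch maps a handful of well-known SAMPA ASCII symbols to their IPA equivalents. It is not the table from the proposal: the dictionary `SAMPA_TO_IPA` and the function `sampa_to_ipa` are hypothetical names chosen for this example, only a small subset of symbols is covered, and the input is assumed to be space-separated.

```python
# Illustrative subset of SAMPA ASCII codes and their IPA equivalents.
# Assumption: this is NOT the full coding from the proposal, only a few
# widely documented pairs, written as Unicode escapes.
SAMPA_TO_IPA = {
    "S": "\u0283",  # voiceless postalveolar fricative, as in "fish"
    "Z": "\u0292",  # voiced postalveolar fricative, as in "vision"
    "T": "\u03b8",  # voiceless dental fricative, as in "thing"
    "D": "\u00f0",  # voiced dental fricative, as in "this"
    "N": "\u014b",  # velar nasal, as in "thing"
    "I": "\u026a",  # near-close front unrounded vowel, as in "fish"
    "@": "\u0259",  # schwa
}


def sampa_to_ipa(symbols: str) -> str:
    """Convert space-separated SAMPA symbols to an IPA string.

    Symbols missing from the table (for example plain "f") are passed
    through unchanged.
    """
    return "".join(SAMPA_TO_IPA.get(s, s) for s in symbols.split())


if __name__ == "__main__":
    print(sampa_to_ipa("T I N"))  # "thing"
    print(sampa_to_ipa("f I S"))  # "fish"
```

Because every code is plain ASCII, a transcription such as `T I N` can be sent by e-mail and converted back to IPA at the receiving end with a lookup of this kind.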
A review of the current state of the SAM multi-lingual speech input/output
assessment tools and ways of supporting them in the future. The SpeechDat
corpora and the SAM tools are fundamental and essential resources for those
working to keep European speech and language technology advancing ahead of its
competitors. The future of the SAM tools is vital to the successful use of the
SpeechDat corpora and to the promotion of European standards of assessment.
This report discusses the feasibility of automatic annotation and presents the PHONYP and PHONSEG applications as an example of an automatic segmentation and labelling system.
This document presents a list of guidelines for validation procedures to be
carried out in order to ensure a defined quality standard for spoken
language resources to be distributed by ELRA. The methods proposed are
chosen to strike a good balance between achievable quality
standards and the associated costs of the validation procedure.
-
a pdf version is available from: D313.pdf
This report outlines the distribution of Spoken Language Corpora (SLC) on
traditional CD-ROM media and a new approach via network. High capacity CD-ROMs
are being introduced, but this is only a marginal improvement with respect to
the distribution of SLC. Network access, however, offers many opportunities:
customised SLC, on-line access, and a high degree of protection. For
network access to be feasible, though, the bandwidth of existing networks will
have to be increased.
-
a pdf version is available from: D314.pdf
-
a pdf version is available from: D322.pdf