A general overview of evaluation methods, measures and related projects across different language technologies.
Speech Synthesis, also often referred to as Text-To-Speech (TTS) processing, consists in converting written input into spoken output by automatically generating synthetic speech.
TTS systems generally consist of 3 modules:
Text Processing
The first step in a TTS system is text processing. The input text is analyzed and transformed into a linguistic representation containing all the information needed by the subsequent TTS steps. Typical text processing operations are:
Prosody Generation
Prosody is the set of speech features that allows the same phonetic sound to be uttered in very different ways. These features include intonation (tone, pitch contour), speech rate, segment duration, phrase breaks, stress level and voice quality. Prosody plays a fundamental role in conveying meaning, attitude and intention, and in producing natural speech.
The objective of the prosodic TTS module is to generate prosodic features that make the intonation of the final synthesized speech as close as possible to that of a natural human voice. In most TTS applications it is essential to produce expressive speech.
Acoustic Synthesis
The acoustic module physically generates the final speech signal (the synthesized voice) by assembling the appropriate sequence of phonetic units with the desired prosodic features resulting from the previous processing steps.
Approach
A first approach is to evaluate separately the components of these different modules (glass box evaluation):
Another (complementary) approach consists in measuring the overall quality of the synthesized speech (black box evaluation).
TTS evaluation campaigns generally combine both approaches to investigate all objective and subjective aspects of speech synthesis technologies.
The complexity of TTS evaluation comes from the fact that it consists of separate evaluation tasks, each requiring a specific protocol and test collection.
In addition, other specific methods are required to evaluate other TTS-related research tasks (voice conversion, expressive speech synthesis, etc.).
Measures
Objective Evaluation
The evaluation of the text processing components is done through automatic metrics (objective measures) by comparing the outputs with a reference:
Subjective Listening Tests
The global (black-box) evaluation and the evaluation of the other modules (Prosody and Acoustic Synthesis) mainly rely on subjective tests conducted by human judges.
A typical subjective evaluation procedure is as follows:
Subjects are asked to rate the quality of the synthesized sentences they listen to, according to a series of pre-defined criteria (naturalness, intelligibility, pleasantness, etc.). The TTS systems or modules under scrutiny are compared based on these scores.
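By way of illustration only, the sketch below shows how such listener ratings could be aggregated into per-criterion Mean Opinion Scores (MOS). The 1-5 scale, the criteria names and the system labels are assumptions for the example, not part of any particular evaluation protocol.

```python
# Minimal sketch: aggregating subjective listening-test ratings into
# per-criterion Mean Opinion Scores (MOS). The 1-5 scale and the
# criterion/system names below are illustrative assumptions.
from collections import defaultdict
from statistics import mean

# ratings: (system, criterion, score) triples collected from listeners
ratings = [
    ("systemA", "naturalness", 4), ("systemA", "naturalness", 3),
    ("systemA", "intelligibility", 5), ("systemB", "naturalness", 2),
    ("systemB", "intelligibility", 4),
]

scores = defaultdict(list)
for system, criterion, score in ratings:
    scores[(system, criterion)].append(score)

for (system, criterion), values in sorted(scores.items()):
    print(f"{system} {criterion}: MOS = {mean(values):.2f} (n={len(values)})")
```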
Projects
ECESS (European Centre of Excellence in Speech Synthesis)
Festvox is CMU’s TTS project. It organizes the Blizzard Challenge.
MBROLA Project
MUSSLAP (Multimodal Human Speech and Sign Language Processing for Human-Machine Communication)
HUMAINE (Human-Machine Interaction Network on Emotion)
TC-STAR (Technology and Corpora for Speech to Speech Translation) included a text-to-speech task
EvaSy (in French "Evaluation des systèmes de Synthèse de parole": Speech Synthesis System Evaluation): Evaluation of speech synthesis in French.
Events
SSW-7: 7th ISCA Speech Synthesis Workshop
Blizzard Challenge: 2010, 2009, 2008, 2007, 2006, 2005.
ISCA Speech Synthesis Workshops (SSW): SSW-7, SSW-6, SSW-5, SSW-4, SSW-3, SSW-2, SSW-1.
Tools
MBROLA: a toolkit to build TTS systems in many different languages.
Festival: the University of Edinburgh’s Festival Speech Synthesis System is a free-software, multi-lingual speech synthesis workbench.
Festvox tools: Festvox documentation and scripts.
Praat: speech analysis, synthesis, and manipulation package which can perform general numerical and statistical analysis.
LRs
TC-STAR-TTS Evaluation Package: distributed via the ELDA/ELRA catalogue.
Evasy Evaluation Package: distributed via the ELDA/ELRA catalogue.
References
Multimodal technologies refer to all technologies combining features extracted from different modalities (text, audio, image, etc.).
This covers a wide range of component technologies:
Approach
There is no generic evaluation approach for such a wide and heterogeneous range of technologies. In some cases, the evaluation paradigm is basically the same as for the equivalent mono-modal technology (e.g. traditional IR vs. multimodal IR). For very specific applications (e.g. 3D person tracking in a particular environment), ad hoc evaluation methodologies have to be pre-defined before the start of the evaluation campaign.
A good example of this is the multimodal evaluation framework set up for the CHIL project (Computers in the Human Interaction Loop). Different test collections (production of ground truth annotations) and specific evaluation metrics were defined to address a large range of audio-visual technologies:
For a complete overview of these tasks, see the book that was published at the end of the project.
Related Projects
Events
LRs
References
Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. The representation and organisation of the information items should provide the user with easy access to the information in which he is interested.
IR systems allow a user to retrieve the relevant documents which (partially) match his information need (expressed as a query) from a data collection. The system yields a list of documents, ranked according to their estimated relevance to the user’s query. It is the user’s task to look for the information within the relevant documents themselves once they are retrieved.
An IR system is generally optimized to perform in a specific domain: newswire, medical reports, patents, law texts, etc.
In recent years, the impressive growth of available multimedia data (audio, video, photos…) has required the development of new Multimodal IR strategies in order to deal with:
- annotated image collections (images with captions, etc.).
- multimedia documents combining text and pictures.
- speech transcriptions (e.g. transcribed TV programs), etc.
Multimedia and audio-visual data are processed by combining information extracted from different modalities: text, audio transcriptions, images, video key-frames, etc.
Moreover, in a globalized world, IR systems increasingly have to cope with multilingual information sources. In a multilingual context, we speak of Cross-Language Information Retrieval (CLIR) (see Moreau, Nicolas et al., Best Practices in Language Resources for Multilingual Information Access, public report of the TrebleCLEF project (Deliverable 5.2), March 2009). The language of the query (source language) is not necessarily the same as the language(s) used in the documents (target language(s)).
Question Answering (QA) is a particular approach to IR. In a QA system, information needs are expressed as natural language statements or questions. In contrast to classical IR, where complete documents are considered relevant to the information need, QA systems return concise answers. Often, automated reasoning is needed to identify potential correct answers.
The growing demand for better information access by a wide public of users fosters R&D on QA systems. The interest of QA is to provide inexperienced users with flexible access to information, allowing them to write a natural-language question and directly obtain a concise answer.
Other types of applications are often considered to be part of the IR domain: Information Extraction, Document Filtering, etc.
Approach
Most IR evaluation campaigns carried out until now rely on a comparative approach. Unlike objective evaluation (How well does a method work?), comparative evaluation focuses on the comparison of the results obtained with different systems (Which method works best?). To be compared, IR systems must be tested under similar conditions.
An IR comparative evaluation usually relies on a test collection consisting of:
Once a test collection has been created, the general evaluation methodology is done in 3 main steps:
As long as they are tested on the same test collection (same set of documents and queries) the performance of different systems can be compared based on their final performance measures.
The human relevance judgment step represents the most time- and resource-consuming part of an IR evaluation procedure:
Measures
As early as 1966, Cleverdon (see Cleverdon, Cyril; Keen, Michael. Factors Affecting the Performance of Indexing Systems, Vol. 2, ASLIB, Cranfield Research Project. Bedford, UK: C. Cleverdon, 1966, 37-59) listed six measurable features that reflect users’ ability to use an IR system:
In general, the objective evaluation of IR performance relies on the last two effectiveness measures (Precision and Recall), based on the number of relevant documents retrieved.
Considering the ranked list of retrieved documents for a given query, these two values are computed over the first N retrieved documents only (let’s call it the N-list):
Precision and Recall measures are computed for different values of N resulting in a Precision/Recall curve that reflects the IR effectiveness.
Usually, a single-value metric is derived from the Precision/Recall plots and used as a final indicator of retrieval effectiveness. Common metrics are:
These metrics can be computed for a single query, but they are generally averaged over the whole set of test queries.
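As an illustration, here is a minimal sketch of Precision@N, Recall@N and (non-interpolated) Average Precision for a single query, assuming the ranked list and the relevance judgments are given as document identifiers; averaging AP over all test queries yields MAP. The document ids are invented for the example.

```python
# Minimal sketch of Precision@N, Recall@N and Average Precision for one
# query, given a ranked list of retrieved document ids and the set of
# documents judged relevant. Averaging AP over all test queries gives MAP.
def precision_recall_at_n(ranked, relevant, n):
    top_n = ranked[:n]
    hits = sum(1 for doc in top_n if doc in relevant)
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank        # precision at each relevant rank
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d7", "d1", "d9", "d4"]   # system output (best first)
relevant = {"d1", "d3", "d8"}             # human relevance judgments
print(precision_recall_at_n(ranked, relevant, 5))
print(average_precision(ranked, relevant))
```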
Detailed descriptions of classical IR evaluation measures can be found in Baeza-Yates and Ribeiro-Neto, “Modern Information Retrieval”, Addison-Wesley, 1999.
Other specific IR tasks may require other performance measures. For example, the performance of QA systems is measured by the percentage of correct answers obtained on a set of test questions.
Projects
Events
Tools
Language Resources
Test Collections
Domain Specific Corpora
References
Moreau, N. et al., “Best Practices in Language Resources for Multilingual Information Access”, Public report of the TrebleCLEF project (Deliverable 5.2), March 2009.
Cleverdon, C., Keen, M., “Factors Affecting the Performance of Indexing Systems”, Vol 2. ASLIB, Cranfield Research Project. Bedford, UK: C. Cleverdon, 1966, 37-59.
Baeza-Yates R., Ribeiro-Neto B., “Modern Information Retrieval”, Addison-Wesley, 1999.
Machine Translation (MT) technologies convert text from a source language (L1) into a target language (L2). One of the most difficult issues in Machine Translation is the evaluation of a proposed system. The problem is that language has some degree of ambiguity, which makes it hard to run an objective evaluation; in particular, there is not only one good translation for a given source text. Van Slype (1979) distinguished macro evaluation, designed to measure product quality, from micro evaluation, designed to assess the improvability of the system. Macro evaluation, also called total evaluation, enables the comparison of the performance of two translation systems or of two versions of the same system. Micro evaluation, also known as detailed evaluation, seeks to assess the improvability of the translation system.
Approach
The performance of a translation system is usually measured by the quality of its translated texts. Since there is no absolute translation for a given text, the challenge of machine translation evaluation is to provide an objective and economical assessment. Given the difficulty of the task, most translation quality assessments in the history of MT evaluation were based on human judgement. However, automatic procedures allow a quicker, repeatable, objective and cheaper evaluation. Automatic MT evaluation consists in comparing the MT system output to one or more human reference translations. Human scores (manual evaluation) are assigned according to the adequacy, the fluency or the informativeness of the translated text. In automatic evaluation, the fluency and adequacy of MT output can be measured by n-gram analysis.
Measures
Some of the most common automatic evaluation metrics are:
| Metric | Description | Reference |
|---|---|---|
| BLEU | IBM BLEU (BiLingual Evaluation Understudy) is an n-gram co-occurrence scoring procedure. | (Papineni et al., 2001) |
| NIST | A variation of BLEU used in NIST HLT evaluations. | (Doddington, 2002) |
| EvalTrans | Tool for the automatic and manual evaluation of translations. | (Niessen et al., 2000) |
| GTM | General Text Matcher, based on accuracy measures such as precision, recall and F-measure. | (Turian et al., 2003) |
| mWER | Multiple-reference Word Error Rate: for each sentence, the edit distance between the MT output and the most similar of several human reference translations. | (Niessen et al., 2000) |
| mPER | Multiple-reference Position-independent word Error Rate. | (Tillmann et al., 1997) |
| METEOR | Metric for Evaluation of Translation with Explicit ORdering, based on the harmonic mean of unigram precision and recall. | (Banerjee & Lavie, 2005) |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation, based on an n-gram co-occurrence measure. | (Lin, 2004) |
| TER | Translation Error Rate. | (Snover et al., 2006) |
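To make the n-gram idea concrete, the sketch below computes a simplified, sentence-level BLEU-like score (clipped n-gram precision up to 4-grams combined with a brevity penalty) against a single reference. This is an illustration only; the official BLEU and NIST implementations work on whole test sets with multiple references.

```python
# Simplified, illustrative BLEU-like score for a single hypothesis and a
# single reference: clipped n-gram precision (n = 1..4) combined with a
# brevity penalty. Real BLEU is computed corpus-wide over multiple references.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())   # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(log_prec)

print(bleu("the cat sat on the mat", "the cat is on the mat"))
```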
For human evaluation, fluency and adequacy are two commonly used translation quality notions (LDC 2002, White et al. 1994). Fluency refers to the degree to which the system output is well-formed according to the target language’s grammar. Adequacy refers to the degree to which the output communicates the information present in the reference translation. More recently, other measures have been tested, such as the comprehensibility of an MT-translated segment (NIST MT09), or the preference between MT translations from different systems (NIST MT08).
Projects
Events
Tools
Open-source Machine Translation Systems
Automatic Metrics
Language Resources @ ELRA
References
For further information on research, campaigns, conferences, software and data regarding statistical machine translation and its evaluation, please refer to the European Association for Machine Translation.
The Machine Translation Archive also offers a repository and bibliography on machine translation.
Bibliography
Automatic Summarization aims to extract and present the most important content of an information source to the user. Generally two types of summaries are generated: extracts, i.e., summaries that contain text segments copied from the input, and abstracts, i.e., summaries consisting of text segments that are not present in the input.
One issue with summary evaluation is that it involves human judgments of different quality criteria such as coherence, readability and content. There is no single correct summary, and a system may output a good summary that is quite different from a human reference summary (the same problem arises for machine translation, speech synthesis, etc.).
Approach
Traditionally, summarization evaluation compares the tool’s output summaries with sentences previously extracted by human assessors or judges. The basic idea is that automatic evaluation should correlate with human assessment.
Two main methods are used for evaluating text summarization. Intrinsic evaluation compares machine-generated summaries with human-generated summaries; it is considered a system-focused evaluation. Extrinsic evaluation measures the performance of summarization in various tasks; it is considered a task-specific evaluation.
Both methods require significant human resources, using key-sentence (or sentence-fragment) mark-up and human-generated summaries for the source documents. Summarization evaluation measures provide a ranking score which can be used to compare different summaries of a document.
Measures
- Sentence precision/recall based evaluation
- Content similarity measures
- ROUGE (Lin, 2004), cosine similarity, n-gram overlap, LSI (Latent Semantic Indexing), etc. (a ROUGE-N sketch is given after this list)
- Sentence Rank
- Utility measures
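As an illustrative sketch (not the reference implementation), ROUGE-N can be seen as the n-gram recall of a candidate summary against one or more reference summaries, following the definition in Lin (2004):

```python
# Illustrative ROUGE-N (n-gram recall) for a candidate summary against a
# set of reference summaries, following the definition in Lin (2004).
from collections import Counter

def rouge_n(candidate, references, n=2):
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    overlap = total = 0
    for ref in references:
        ref_ngrams = ngrams(ref)
        overlap += sum((cand & ref_ngrams).values())   # clipped matches
        total += sum(ref_ngrams.values())
    return overlap / total if total else 0.0

# Invented example sentences
print(rouge_n("the police killed the gunman",
              ["police killed the gunman"], n=2))
```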
Related Projects
NTCIR (NII Test Collection for IR Systems) includes Text Summarization tasks, e.g. MuST (Multimodal Summarization for Trend Information) at NTCIR-7.
TAC (Text Analysis Conference): Recognizing Textual Entailment (RTE), Summarization, etc.
TIPSTER: See the TIPSTER Text Summarization Evaluation: SUMMAC
TIDES (Translingual Information Detection Extraction and Summarization).
TIDES included several evaluation projects:
CHIL (Computer in the Human Interaction Loop) included a Text Summarization task.
GERAF (Guide pour l’Evaluation des Résumés Automatiques Français): Guide for the Evaluation of Automatic Summarization in French.
Events
ACL-IJCNLP 2009 Workshop: Language Generation and Summarisation
TAC 2009 Workshop (Text Analysis Conference).
Language Generation and Summarisation Workshop at ACL 2009
RANLP 2009
CLIAWS3 (3rd Workshop on Cross Lingual Information Access)
Multi-source, Multilingual Information Extraction and Summarization Workshop at RANLP2007
TSC-3 (Text Summarization Challenge) at NTCIR-4
Text Summarization Branches Out Workshop at ACL 2004
DUC 2003 (HLT-NAACL Text Summarization Workshop)
References
Bibliography
Lin C.-Y. (2004) ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26.
The goal of Speech-to-Speech Translation (SST) is to enable real-time, interpersonal communication via natural spoken language between people who do not share a common language. It aims at translating a speech signal in a source language into a speech signal in a target language.
The evaluation of SST systems can be considered an extended MT evaluation task (namely Spoken Language Translation, SLT), including speech recognition and speech synthesis in the evaluation loop as an end-to-end evaluation.
Approach
The SLT component usually operates on the output produced by the ASR component and provides input to the speech synthesis component. Speech translation evaluation can be single-component or end-to-end. The former uses the output of each component to assess its quality, while the latter uses the final output of the whole system to assess overall quality.
End-to-end evaluations examine a system in its whole configuration and functionality. Single-component evaluations focus on the different speech translation modules: speech recognition, speech synthesis, and machine translation. Each component’s own metrics are then used, although their interpretation may differ.
Measures
According to different evaluation criteria, several measures can be used for end-to-end evaluation. They typically fall into two main categories: the first estimates the audio quality of the output, while the second estimates its meaning preservation. The evaluation of audio quality is rather simple, since it uses metrics very similar to those of speech synthesis evaluation. Meaning preservation is more complex and can be assessed with either subjective or objective measures.
Subjective evaluation uses the assessments of human judges (users and/or experts) to quantify the loss of meaning between the input, in the source language, and the output, in the target language. Several methods can be employed, such as asking questions about the content or asking judges to rewrite what they heard. Generally, the SST system is compared (directly or not) to a reference, typically a human interpreter.
Objective evaluation produces the same kind of measures, but without subjective judgment: one or several experts check the SST output against a reference, so as not to bias the results by human factors (fatigue, noise, etc.).
Related Projects
TC-STAR: Technology and Corpora for Speech to Speech Translation, 6th FP project (2004-2007).
LC-STAR: Lexica and Corpora for Speech-to-Speech Translation Components (2002-2005).
NESPOLE!: NEgotiating through SPOken Language in E-commerce, 5th FP project (2000-2002).
TONGUES: Rapid Development of Speech-to-Speech Translation System (2000-2002).
Verbmobil: German project on Mobile Speech-to-Speech Translation of Spontaneous Dialogs (1996-2000).
Events
First TC-Star Evaluation Workshop on Speech-to-Speech Translation
Second TC-Star Evaluation Workshop on Speech-to-Speech Translation
Third TC-Star Evaluation Workshop on Speech-to-Speech Translation
LRs
TC-STAR 2007 Evaluation Package - End-to-End Spanish-to-English.
TC-STAR 2006 Evaluation Package - End-to-End Spanish-to-English.
References
For further information on research, campaigns, conferences, software and data regarding speech-to-speech translation and its evaluation, please refer to Machine Translation Archive.
Speech Recognition, also known as Automatic Speech Recognition (ASR) or Speech-To-Text (STT), is the process by which a program or system transcribes an acoustic speech signal into text.
Systems generally perform two different types of recognition: single-word and continuous speech recognition. Continuous speech is more difficult to handle because of a variety of effects such as speech rate, coarticulation, etc. Today's state-of-the-art systems are able to transcribe unrestricted continuous speech from broadcast data with acceptable performance.
Approach
Evaluation of ASR systems is mainly performed by computing the Word Error Rate (WER), or the Character Error Rate (CER) for languages such as Chinese or Japanese.
WER is derived from the Levenshtein distance (or edit distance) and measures the distance between the hypothesis transcription produced by the ASR module and the reference transcription.
The WER is computed after the alignment between the hypothesis and the reference transcriptions has been done by dynamic programming (the optimal alignment being the one which minimises the Levenshtein distance). Usually the costs for insertion, deletion and substitution are 3, 3 and 4 respectively.
After alignment between the hypothesis and the reference, WER counts the number of recognition errors.
Three kinds of errors are taken into account when computing the word error rate, i.e. substitution, deletion and insertion errors.
Substitution: a reference word is replaced by another word in the best alignment between the reference and the system hypothesis.
Deletion: a reference word is not present in the system hypothesis in the best alignment.
Insertion: Some extra words are present in the system hypothesis in the best alignment between the reference and the hypothesis.
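Putting the above together, here is a minimal sketch of the WER computation: a dynamic-programming alignment of hypothesis and reference (unit costs are used here for simplicity; scoring tools such as NIST sclite use alignment weights like the 3/3/4 mentioned above), followed by counting substitutions, deletions and insertions. The example sentences are invented.

```python
# Minimal sketch of WER: align the hypothesis to the reference with
# dynamic programming (unit costs here), then count substitutions (S),
# deletions (D) and insertions (I) along the optimal alignment.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edit cost to align ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i
    for j in range(1, len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # backtrack through the alignment to count error types
    i, j, S, D, I = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return (S + D + I) / len(ref), (S, D, I)

print(wer("the cat sat on the mat", "the cat sat the mad"))
```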
Although the word is the basic unit for assessing ASR systems, the same computation can be made at different granularities (phonemes, syllables, etc.).
WER can be greater than 100% if the number of errors exceeds the number of reference words (for instance when the system produces many insertions).
Prior to scoring, both the hypothesis and the reference have to be normalized. Normalization consists in converting the transcription into a more standardized form. This step is language dependent and applies a number of rules transforming each token into its normalized form: for instance, numbers are spelled out, punctuation marks are removed, contractions are expanded, multiple orthographies are converted to a unique form, etc.
Although WER is the main metric for assessing ASR systems, its major drawback is that all word errors are penalized equally, regardless of the importance and meaning of the word: an empty (function) word counts as much as a named entity.
The performance of ASR systems is also measured in terms of speed, by measuring the processing time and computing the real-time factor on a specific hardware configuration.
This is an important factor for some applications that may require a real-time processing speed or some devices that are limited in terms of memory or processor speed.
Measures
For ASR evaluation, the main criterion is recognition accuracy; the most commonly used measure is the word error rate (WER), or the related word accuracy rate, which is also used in machine translation evaluation.
The method used in the current DARPA speech recognition evaluations involves comparing the system transcription of the input speech to the reference (i.e., a transcription by a human expert), using algorithms to score agreement at the word level. Higher-level metrics such as sentence error rate or concept error rate can be applied, depending on the application.
Communication style (e.g., speaker-independent or spontaneous speech), vocabulary size, language model and usage conditions are also important factors that can affect the performance of a speech recognizer for a particular task.
Related Projects
NIST Rich Transcription evaluations:
TC-STAR evaluation campaigns for ASR (evaluation packages are available from ELRA’s catalogue)
The ESTER evaluation campaigns (evaluation packages are available from ELRA’s catalogue)
CENSREC (Corpus and Environment for Noisy Speech RECognition): Japanese noisy speech recognition evaluation framework
CORETEX: Improving Core Speech Recognition Technology
EARS: Effective Affordable Reusable Speech-To-Text, DARPA’s research program
NESPOLE: NEgotiating through SPOken Language in E-commerce.
Events
ICASSP’10: International Conference on Acoustics, Speech, and Signal Processing March 14 – 19, 2010, Dallas, USA
LREC’10: International conference on Language Resources and Evaluation, May 17 – 23, 2010, Malta.
InterSpeech 2010: September 26 - 30, 2010, Makuhari, JAPAN.
Tools
Scoring evaluation tools such as SCLITE are available on NIST’s speech group website: NIST tools
LRs
AURORA Project Database 2.0 - Evaluation Package
AURORA Project database - Subset of SpeechDat-Car - Finnish database - Evaluation Package
AURORA Project database - Subset of SpeechDat-Car - Spanish database - Evaluation Package
AURORA Project database - Subset of SpeechDat-Car - German database - Evaluation Package
AURORA Project database - Subset of SpeechDat-Car - Danish database - Evaluation Package
AURORA Project database - Subset of SpeechDat-Car - Italian database - Evaluation Package
AURORA Project Database - Aurora 4a - Evaluation Package
AURORA Project Database - Aurora 4b - Evaluation Package
TC-STAR 2007 Evaluation Package - ASR English
TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES
TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS
TC-STAR 2007 Evaluation Package - ASR Mandarin Chinese
TC-STAR 2006 Evaluation Package - ASR English
TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS
TC-STAR 2006 Evaluation Package - ASR Mandarin Chinese
TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES
TC-STAR 2005 Evaluation Package - ASR English
TC-STAR 2005 Evaluation Package - ASR Spanish
TC-STAR 2005 Evaluation Package - ASR Mandarin Chinese
Multilingual text alignment consists in identifying correspondences between different text units (words, sentences, paragraphs, etc.) in parallel texts.
Approach
The main approach to alignment evaluation is to compare a system-computed alignment with a manually produced reference alignment, usually called a gold standard. Different tasks have been defined in previous evaluation exercises such as Blinker, ARCADE, and the HLT-NAACL and ACL shared tasks.
Measures
Alignment evaluations were generally performed using traditional IR measures such as precision and recall; a sketch of these measures, together with the Alignment Error Rate of Och and Ney (2000), is given below:
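The sketch below is an illustration rather than any campaign’s official scorer. It assumes word alignments are represented as sets of (source index, target index) pairs and that the gold standard distinguishes “sure” links (S) from the larger set of “possible” links (P), following Och and Ney (2000).

```python
# Illustrative sketch: precision, recall and Alignment Error Rate (AER),
# following Och & Ney (2000). Alignments are sets of (source_index,
# target_index) pairs; S (sure links) is a subset of P (possible links).
def alignment_scores(system, sure, possible):
    precision = len(system & possible) / len(system) if system else 0.0
    recall = len(system & sure) / len(sure) if sure else 0.0
    aer = 1 - (len(system & sure) + len(system & possible)) / (len(system) + len(sure))
    return precision, recall, aer

# Invented toy example
system = {(0, 0), (1, 2), (2, 1)}
sure = {(0, 0), (2, 1)}
possible = sure | {(1, 2), (1, 1)}
print(alignment_scores(system, sure, possible))
```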
Projects
Events
Tools
Aligners
Language Resources @ ELRA
References
Och F. J. and Ney H. (2000) A Comparison of Alignment models for statistical machine translation. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-ACL 2000), p1086-1090, Saarbrücken, Germany.
Parsing is the process of structuring a linear representation in accordance with a given grammar (Grune and Jacobs, 1990).
Approach
The basic idea of parsing evaluation consists in measuring the similarity between the parser-generated tree structure (also called labelled bracketing) and a manually constructed tree structure.
Adequacy evaluation involves determining the fitness of a parsing system for a particular task. Efficiency evaluation consists in comparing the parsing time of a given parser on a common test data set with that of a reference parser.
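As a simple illustration of the labelled-bracketing comparison mentioned above, the sketch below computes PARSEVAL-style precision, recall and F-measure, assuming each tree has already been flattened into (label, start, end) constituent spans; the spans in the example are invented.

```python
# Illustrative PARSEVAL-style scoring: labelled bracketing precision,
# recall and F-measure, assuming each tree is given as a multiset of
# (label, start, end) constituent spans.
from collections import Counter

def parseval(system_spans, gold_spans):
    sys_counts, gold_counts = Counter(system_spans), Counter(gold_spans)
    matched = sum((sys_counts & gold_counts).values())
    precision = matched / len(system_spans) if system_spans else 0.0
    recall = matched / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
system = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
print(parseval(system, gold))
```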
Measures
Carroll et al. (1998) made the distinction between evaluation methods that are useful in guiding the development of a parsing system (intrinsic evaluation) and those that are appropriate for comparing different systems (comparative evaluation). They divided parser evaluation methods into non-corpus-based and corpus-based methods:
Projects
PASSAGE, French evaluation campaign for syntactic parsing (2007-2009).
The Parsing Task of EVALITA 2009
The Parsing Task of EVALITA 2007
EASY, Evaluation campaign for syntactic parsing organized by French Technolangue action EVALDA (2003-2006).
XTAG, wide-coverage grammar development project for English using a lexicalized Tree Adjoining Grammar (TAG) formalism (1998).
SPARKLE, Shallow Parsing and Knowledge extraction for Language Engineering, European project (1997-2000).
GRACE, Grammars and Resources for Analyzers of Corpora and their Evaluation, part of the French CCIIL program (1994-1997).
Events
Workshop on Parsing with Categorial Grammars
11th International Conference on Parsing Technologies (IWPT’09)
TLT 7, The 7th International Workshop on Treebanks and Linguistic Theories (2009).
CoNLL Shared Task 2009: Syntactic and Semantic Dependencies in Multiple Languages (2009).
COLING 2008, workshop on "Cross-Framework and Cross-Domain Parser Evaluation".
LREC 2008, workshop on "Partial Parsing Between Chunking and Deep Parsing".
ACL 2008, workshop on "Parsing German".
IJCAI, workshop on "Shallow Parsing in South Asian Languages".
COLING ACL 2006, tutorial on "Dependency Parsing".
MSPIL-06, First National Symposium on Modeling and Shallow Parsing of Indian Languages.
LREC 2002, workshop on "Beyond PARSEVAL Towards Improved Evaluation Measures for Parsing Systems".
COLING 2000 Workshop on "Efficiency in Large-scale Parsing Systems".
LREC 1998, workshop on "The Evaluation of Parsing Systems".
Tools
Tagging
Lemmatisation
Evaluation
LRs
References
More about evaluation measures.
Bibliography
Information Extraction (IE) is a technology which extracts pieces of information that are salient to the user's needs. The kinds of information that systems extract vary in detail and reliability: named entities, attributes, facts and events.
Approach
Due to the complexity of the IE task and the limited performance of tools, there are few comparative evaluations in IE.
One can consider the Message Understanding Conference (MUC) as the starting point where most of IE evaluation methodology was defined.
The performance of a system is measured by scoring filled templates with the classical information retrieval (IR) evaluation metrics: precision, recall and the F-measure. Another evaluation metric, based on the classification error rate, is also used for IE evaluation. Annotated data are required for training and testing.
Measures
Given a system response and a human-generated answer key, the system's precision is defined as the number of slots it filled correctly, divided by the number of slots it attempted. Recall is defined as the number of slots it filled correctly, divided by the number of possible correct fills taken from the human-generated key. One general issue is how to define filler boundaries, which is related to the question of how to assess an extracted fragment. Freitag (1998) proposed three criteria for matching reference occurrences with extracted ones:
In the Automatic Content Extraction (ACE) and MUC evaluation conferences, the criteria used for assessing each system output item are: correct, partial, incorrect, spurious, missing and non-committal.
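A minimal sketch of the slot-level precision/recall scoring described above, assuming the answer key and the system response are given as dictionaries mapping slot names to sets of fills (exact matching only; the partial/spurious/missing distinctions used in MUC and ACE scoring are not modelled here). The slot names and fills are invented.

```python
# Illustrative template-slot scoring for IE: precision, recall and
# F-measure, assuming the key and the response map slot names to sets
# of fills. Only exact matches are counted in this sketch.
def slot_scores(response, key):
    attempted = sum(len(fills) for fills in response.values())
    possible = sum(len(fills) for fills in key.values())
    correct = sum(len(response.get(slot, set()) & fills)
                  for slot, fills in key.items())
    precision = correct / attempted if attempted else 0.0
    recall = correct / possible if possible else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

key = {"perpetrator": {"ELN"}, "target": {"power station"}, "date": {"12 April"}}
response = {"perpetrator": {"ELN"}, "target": {"the station"}}
print(slot_scores(response, key))
```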
Projects
Events
Tools
BALIE
JULIE Labs NLP Toolsuite
Language Resources
Domain-independent annotated corpora:
Domain-specific annotated data:
References
RISE
Bibliography
Maynard D., Peters W., and Li Y. (2006). Metrics for evaluation of ontology-based information extraction. In WWW 2006 Workshop on "Evaluation of Ontologies for the Web" (EON), Edinburgh, Scotland.
Freitag D. (1998). Information Extraction From HTML: Application of a General Learning Approach. In the Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
Califf M. E. and Mooney R. J. (1998). Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, Stanford, CA, March.