Reuters has announced that, for the first time, it is making large quantities
of archived Reuters news stories available free of charge for use by research
communities around the world. The first Reuters Corpus archive
includes over 800,000 English-language news stories, equivalent to Reuters’
annual global news output.
According to the announcement, the Reuters Corpus offers researchers
a unique body of static information upon which to research, test, and benchmark
emerging technologies such as language processing, speech synthesis, voice
recognition, indexation, search, and information retrieval.
Richard Willis, head of research and standards at the Reuters Chief
Technology Office, said: "Reuters has always been heavily involved in language
and data research. And to strengthen our links with the research community
around the world, we have made available one of the most complete news
archives ever released. The data provided will aid research into many aspects
of language processing and information retrieval."
The archive includes all English-language stories produced by Reuters
globally between August 1996 and August 1997. The news data is available
on two CD-ROMs and is formatted in XML. All the news stories are fully
referenced using a total of 775 different category codes for topic, geography,
and industry sector.
Marc Moens, head of Edinburgh University’s Language Technology Group,
said: "Because of its size and the amount of preparation that has gone
into it, the Reuters collection provides scope for many new types of research
and development work. It allows for the systematic evaluation of progress
and comparison of results between different development groups. I am sure
this corpus will soon be seen as a standard in document-related work."
Yorick Wilks, a professor at Sheffield University, said: "We can already
see the potential benefits of such a corpus for stylistic language analysis.
The topic codes would also give us the opportunity to analyze the geographic
location, industry area, or topic that received news coverage from Reuters.
Areas such as semantic Web applications, categorization research, and machine
learning of topic routings would also benefit. This will be a very useful
resource."
As part of the research agreement covering use of the archive, researchers
will supply Reuters with a copy of any material published using the data.
Working with this feedback from research groups, Reuters hopes to introduce
other corpora, including multilingual versions and volumes covering other
date ranges. Further information on the Corpus is available at http://www.reuters.com/researchandstandards/corpus.
Source: Reuters, London, 011-44-20-7542-6487; http://www.reuters.com.