1 Apr 2005 18:12
Answers to domain corpora request
Carlos Rodriguez <crodriguezp <at> gmail.com>
2005-04-01 16:12:19 GMT
Thanks to everyone who answered my request for open-source domain corpora. Leonel Ruiz and Stella Tagnin pointed me to corpora in Spanish and Brazilian Portuguese. For English, Ylva Berglund mentioned OPUS (an open-source parallel corpus).

On the text-mining front, large collections of full-text biomedical articles are now available, as pointed out by Paul Buitelaar (http://muchmore.dfki.de/resources1.htm) and Kevin Cohen (http://www.biomedcentral.com/info/about/datamining/ [8,000-plus articles in XML]), among other data collections. The Linux Documentation Project also provides a fairly large, typologically homogeneous collection.

Unfortunately, large textual collections from other disciplines are harder to obtain in downloadable form. I am now compiling a 300-article collection from Sociology journals, in case anyone else is interested in cross-genre comparisons and lexical acquisition.

Carlos Rodríguez
National Autonomous University, Mexico