10 Aug 2010 23:21
summary: tokenization & sentence boundary detection
Joerg Tiedemann <jorg.tiedemann <at> lingfil.uu.se>
2010-08-10 21:21:47 GMT
I just realized that this mail (from some time ago) went to the wrong e-mail address. Here it is again (see below).

By the way, are there freely available test sets for evaluating tokenization and sentence boundary detection? I would like to check performance for several languages and various domains.

Thanks again!
Jörg

-------- Original Message --------
Subject: (preliminary) summary: tokenization & sentence boundary detection
Date: Wed, 30 Jun 2010 15:16:54 +0200
From: Jörg Tiedemann <jorg.tiedemann <at> lingfil.uu.se>
To: corpora-owner <at> uib.no

Thanks a lot for all your replies to my query on tokenization/segmentation tools!
Here is a summary of the responses I've got so far (including the original list in no particular order):

GATE (LGPL)
variety of tokenizers and splitters (generic & language specific)
http://gate.ac.uk/

MorphAdorner
http://morphadorner.northwestern.edu/
English only

"tokenize.pl" script from the WCDG parser:
http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage
(even de-hyphenation when used together with the parser's lexicon)

Java-based program, Segment
https://sourceforge.net/projects/segment/
(MIT-type licence)
SRX rules for sentence splitting, includes a library for sentence splitting,
which is used by LanguageTool and the Maligna sentence aligner
C++ library in development (GPL)

Mecab (successor of Chasen)
http://mecab.sourceforge.net/
Japanese

Juman
http://www-lab25.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
includes dependency parsing etc
Japanese

IceNLP is open source
http://icenlp.sourceforge.net
tokenizer/sentence segmentizer module, part of IceNLP - a toolkit for Icelandic

Lingua::PT::PLNbase
Portuguese
heuristics with names and standard abbreviations

http://www.cis.uni-muenchen.de/~wastl/misc/tokenizer.tgz
fast, rule-based, tokenizer + sentence boundary detector
German, Russian, English
Sebastian Nagel <wastl.nagel <at> googlemail.com>

SentTrick (GPLv3)
http://sourceforge.net/projects/sentrick/
sentence boundary detector for German, trainable

fullstop
http://hackage.haskell.org/package/fullstop
English sentence segmenter in Haskell

Grammatical Framework tool
http://hackage.haskell.org/package/toktok

MADA + TOKAN
http://www1.ccls.columbia.edu/~cadim/MADA.html
Arabic

Moses/Europarl tokenizer
http://www.statmt.org/wmt10/scripts.tgz

Europarl sentence splitter as Perl modules:
http://code.google.com/p/corpus-tools/downloads/list
http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm

Other Perl modules:
http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm

Punkt
implemented in NLTK (Apache license)
http://www.nltk.org/
trainable (unsupervised)
existing models for different languages (?)

OpenNLP (GPL)
http://opennlp.sourceforge.net/
trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai
further models to come, wiki at:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page

huntoken (License?)
http://mokk.bme.hu/resources/huntoken
mainly for Hungarian (?)

Jena NLP tools
http://www.julielab.de/Resources/Software/NLP+Tools.html
trainable tokenizer & sentence splitter

FreeLing (GPL)
http://www.lsi.upc.edu/~nlp/freeling
regexp tokenizer (mainly for Catalan & Spanish?)

Alpino for Dutch (tokenization + sentence splitting)
http://www.let.rug.nl/vannoord/alp/Alpino/

Ellogon (LGPL)
http://www.ellogon.org

ChaSen for Japanese (successor: mecab (see above))
http://chasen-legacy.sourceforge.jp/

MXPOST & MXTERMINATOR (research only!)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
trainable sentence splitter

*******/\/\/\/\/\/\/\/\/\/\/\******************************************
Jörg Tiedemann                      jorg.tiedemann <at> lingfil.uu.se
Dep. of Linguistics and Philology   http://stp.lingfil.uu.se/~joerg/
Uppsala University                  tel: +46 (0)18 - 471 1412
Box 635, SE-751 26 Uppsala/SWEDEN   fax: +46 (0)18 - 471 1094
*********************************/\/\/\/\/\/\/\/\/\/\/\****************

_______________________________________________
Corpora mailing list
Corpora <at> uib.no
http://mailman.uib.no/listinfo/corpora