Type

PhD Thesis

Authors

Joachim Wagner

Subjects

Linguistics

Topics

voting classifier, n-gram language models, probabilistic grammar, computational linguistics, precision grammar, artificial intelligence, learner corpus, linguistics, decision tree learning, machine learning, natural language processing, ROC curve, language, grammar checker, error detection, error corpora

Detecting grammatical errors with treebank-induced, probabilistic parsers (2012)

Abstract

Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches, one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
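
The decision rule of the second approach can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say which features the estimator uses, so sentence length as the only predictor, the least-squares fit, the function names fit_length_estimator and is_flagged, the margin value and the toy data are all assumptions made for illustration. Probabilities are handled as log-probabilities.

```python
from statistics import mean


def fit_length_estimator(training):
    """Fit log P(best parse) ~ slope * length + intercept by least squares.

    `training` holds (sentence_length, parse_logprob) pairs taken from
    grammatical sentences. Using length as the only feature is an
    assumption of this sketch, not something stated in the abstract.
    """
    xs = [length for length, _ in training]
    ys = [logprob for _, logprob in training]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in training)
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return lambda length: slope * length + intercept


def is_flagged(estimated_logprob, actual_logprob, margin):
    """Flag a sentence when the estimate exceeds the actual parse
    log-probability by more than `margin` (a hypothetical threshold)."""
    return estimated_logprob - actual_logprob > margin


# Hypothetical toy data: (length in tokens, log-probability of best parse).
training = [(5, -50.0), (10, -100.0), (15, -150.0)]
estimate = fit_length_estimator(training)

# A 10-token sentence parsed close to the estimate is accepted ...
print(is_flagged(estimate(10), -101.0, margin=5.0))
# ... while one parsed far below the estimate is flagged.
print(is_flagged(estimate(10), -130.0, margin=5.0))
```

In this sketch the margin plays the role of the "certain amount" mentioned in the abstract; varying it trades off how many ungrammatical sentences are caught against how many grammatical ones are wrongly flagged.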

Collections

Ireland -> Dublin City University -> DCU Faculties and Centres = DCU Faculties and Schools: Faculty of Engineering and Computing
Ireland -> Dublin City University -> Publication Type = Thesis
Ireland -> Dublin City University -> Thesis Type = Doctoral Thesis
Ireland -> Dublin City University -> DCU Faculties and Centres = DCU Faculties and Schools
Ireland -> Dublin City University -> Subject = Humanities: Language
Ireland -> Dublin City University -> Subject = Humanities: Linguistics
Ireland -> Dublin City University -> Subject = Computer Science: Computational linguistics
Ireland -> Dublin City University -> Subject = Computer Science: Artificial intelligence
Ireland -> Dublin City University -> Subject = Computer Science: Machine learning
Ireland -> Dublin City University -> DCU Faculties and Centres = DCU Faculties and Schools: Faculty of Engineering and Computing: School of Computing
Ireland -> Dublin City University -> Subject = Humanities
Ireland -> Dublin City University -> Subject = Computer Science
Ireland -> Dublin City University -> Status = Unpublished
