The Chinese Word Sense Tagging Consistency Test Method Realization

Posted on: 2011-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: J M Qiao
Full Text: PDF
GTID: 2218330371954049
Subject: Computer application technology
Abstract/Summary:
In recent years, with the development of corpus construction and corpus linguistics, processing real large-scale text has become the main goal of natural language processing (NLP). Corpora are now applied very widely, and the demands on their quality keep rising, so building large-scale, high-quality corpora has become a key task. Corpus quality is judged mainly by the quality of its annotations, and word sense tagging is an important kind of annotation. To improve the quality of sense tagging, this thesis studies techniques and implementation methods for checking the consistency of Chinese word sense tagging, and we build a consistency checking tool for word sense tagging. Because the workload is heavy, we restrict the study to polysemous verbs.

The main work completed in this thesis is as follows.

First, based on extensive study of real large-scale text, this thesis presents a new method for extracting and clustering sentences. In the extraction step we extract short sentences. Clustering the sentences involves three steps: (1) extracting the object of the verb, using a set of extraction rules that we summarize; (2) computing sentence similarity with a new method designed for the characteristics of the sentences in the corpus; (3) clustering the sentences according to their sentence similarity values.

Second, we analyse and convert the linguistic knowledge resources. (1) HowNet: this thesis studies its representation structure and its computability in depth, and converts it to meet the requirements of the system. One conversion moves the word definition file from a text file into a SQL Server 2000 database. The other changes the sememe representation from whitespace indentation to a unique numeric encoding, because the indented form is inconvenient for a computer to process. (2) The People's Daily corpus: this thesis counts, queries and jointly processes the words, spelling, parts of speech, heteronyms, sense distinctions, dates and frequencies in the corpus.

Third, we construct the standard model database. The sentences under check are matched against the standard model database to check the consistency of their word sense tagging. We study its structure and its component elements, and choose each sense of a verb together with the range of objects that sense takes as the component elements. The standard model database is built specifically for consistency checking of sense tagging.

Fourth, we perform the consistency checking. The checking situations fall into three classes: (1) the context of the sentence containing the polysemous word matches the standard model database directly; (2) the context of the sentence containing the polysemous word matches the standard model database through similarity computation; (3) the context is covered by neither (1) nor (2). We handle each matching situation with a different method. During checking, the system automatically collects the related experimental statistics and displays them in the interface. After consistency checking, the standard model database can be repaired and extended according to the results.

The main contributions, and also the main difficulties, are the object-based sentence similarity computation for corpus sentences and the consistency checking itself. The experimental data show that the consistency checking method is effective.
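The three matching situations in the consistency check can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `STANDARD_MODEL` layout, the `CATEGORY` table, the `object_similarity` function, the 0.5 threshold and the English toy data are all assumptions made for illustration, and the category-based similarity is only a stand-in for the object-based similarity computation the abstract mentions without specifying.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical standard model: verb -> sense -> range of objects for that sense.
STANDARD_MODEL: Dict[str, Dict[str, Set[str]]] = {
    "open": {
        "open/physical": {"door", "window"},
        "open/begin": {"meeting", "account"},
    }
}

# Toy stand-in for a semantic resource: words in the same category count as similar.
CATEGORY: Dict[str, str] = {
    "door": "artifact", "window": "artifact", "gate": "artifact",
    "meeting": "event", "conference": "event", "account": "record",
}


def object_similarity(a: str, b: str) -> float:
    """Assumed similarity: 1.0 for identical words, 0.8 for a shared category."""
    if a == b:
        return 1.0
    ca, cb = CATEGORY.get(a), CATEGORY.get(b)
    return 0.8 if ca is not None and ca == cb else 0.0


def expected_sense(verb: str, obj: str,
                   threshold: float = 0.5) -> Tuple[int, Optional[str]]:
    """Return (matching situation 1..3, sense suggested by the model or None)."""
    senses = STANDARD_MODEL.get(verb, {})
    # Situation 1: the object appears directly in some sense's object range.
    for sense, objects in senses.items():
        if obj in objects:
            return 1, sense
    # Situation 2: the object is similar enough to a known object.
    best_score, best_sense = 0.0, None
    for sense, objects in senses.items():
        for known in objects:
            score = object_similarity(obj, known)
            if score > best_score:
                best_score, best_sense = score, sense
    if best_sense is not None and best_score >= threshold:
        return 2, best_sense
    # Situation 3: neither a direct nor a similarity match.
    return 3, None


def check_tag(verb: str, obj: str, tagged_sense: str) -> Tuple[int, str]:
    """Compare a corpus tag with the sense the model suggests."""
    case, sense = expected_sense(verb, obj)
    if sense is None:
        return case, "unresolved"
    return case, "consistent" if sense == tagged_sense else "inconsistent"


print(check_tag("open", "door", "open/physical"))
print(check_tag("open", "gate", "open/begin"))
print(check_tag("open", "mind", "open/physical"))
```

In this sketch, items that end up in situation 3 are reported as unresolved. That is where a human reviewer could decide whether to extend the model, in line with the abstract's note that the standard model database can be repaired and extended after checking.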
Keywords/Search Tags: sense tagging, consistency checking, sentence similarity computation, HowNet, corpus