
A Research On Acquisition And Verification Of Concepts From Large-Scale Chinese Corpora

Posted on: 2007-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Yu
Full Text: PDF
GTID: 2178360185954110
Subject: Computer software and theory
Abstract/Summary:
Building knowledge bases from the massive information on Web pages has become a very urgent task. Because concepts, together with their inter-conceptual and inter-attribute relations, are primary parts of human knowledge, acquiring and verifying concepts is an important step in knowledge acquisition.

The essence of concept acquisition is to acquire the terms that denote concepts. Language processing technologies developed for foreign languages are not suitable for Chinese concept acquisition because of the unique linguistic characteristics of the Chinese language. Furthermore, the concepts that we aim to acquire are not limited to a single special domain. As a result, concept acquisition from Chinese text is very challenging. This thesis puts forward several methods of concept acquisition and verification. In detail, the author has conducted the following research:

(1) Designing methods to extract and verify concepts based on their structure rules. Rules of this kind are usually created manually by linguists after they analyze a great deal of language material. However, that approach is not adequate for constructing structure rules for concept words that are not domain-specific. We have proposed a corpus-based learning approach that acquires those structure rules using morphemic analysis and statistical methods.

(2) Proposing verification methods based on word contribution relations. We have proposed two methods to acquire contribution relations to be used in concept verification. The first is an iterative verification approach based on word contribution relations and concept components. Concept words share many common inner parts, which show clear statistical regularities in a large corpus. We call them concept components and use statistical methods to obtain them. The task is then to verify whether candidate concept words can be constructed from these concept components and nouns in the dictionary in certain orders. Experiments have shown very good performance of this method for verifying concepts. The second method is analogical pattern learning. Some concept words are generated by analogy with existing ones. Consequently, many concept words have a similar structure. We have proposed a machine learning approach that analyzes these concept words to acquire analogical patterns. These patterns are a useful complement for verifying concept words.

(3) Proposing a corpus-based approach to concept verification. We have proposed a method that uses contextual features of candidate strings and common contextual patterns to verify concept words. Because manual acquisition of common contextual patterns is a tedious and time-consuming process, we introduced a method for obtaining the contextual patterns automatically. These contextual patterns were then comprehensively evaluated, and the better ones were selected for concept acquisition and verification. This work reduced the cost of manually constructing contextual patterns. The algorithm is somewhat slow, however. We also introduced concept verification dependence relations to reduce the number of strings that must be verified.

(4) Proposing a unified framework of concept extraction and verification. The framework makes use of rule-based, statistical, syntactic, and contextual information to identify and verify concepts. First, the pattern matching process gets candidate strings from a large corpus using concept acquisition patterns. Then a verification method based on word contribution relations is used to verify them. Candidate strings that were not verified are passed to the partition module, which segments them into sentence blocks. The concept extraction module recognizes concept words based on concept structure rules, which are expressed as regular expressions. Concept evaluation is also performed in this module. Lastly, statistical methods are used to verify those concept strings that cannot be recognized by concept structure rules alone, and to re-evaluate those concept words whose structure might be ambiguous.
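
The component-based verification described in (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name can_construct, the sample sets COMPONENTS and NOUNS, and the simplification that any left-to-right sequence of known units is an acceptable order are all assumptions made for illustration. The thesis obtains concept components statistically from a large corpus and applies ordering constraints that the abstract does not spell out.

```python
from __future__ import annotations

from functools import lru_cache

# Hypothetical sample data; the thesis derives concept components
# statistically from a large corpus, which is not reproduced here.
COMPONENTS = {"学", "主义"}          # assumed concept components
NOUNS = {"计算机", "语言", "资本"}    # assumed dictionary nouns


def can_construct(candidate: str, components: set[str], nouns: set[str]) -> bool:
    """Return True if the candidate splits completely into known units.

    Assumption: any left-to-right sequence of units is accepted. The
    ordering constraints mentioned in the abstract are not modeled.
    """
    units = components | nouns
    max_len = max((len(u) for u in units), default=0)

    @lru_cache(maxsize=None)
    def build(start: int) -> bool:
        # Reaching the end means every character was covered by a unit.
        if start == len(candidate):
            return True
        # Try every unit-sized prefix of the remaining string.
        last = min(len(candidate), start + max_len)
        for end in range(start + 1, last + 1):
            if candidate[start:end] in units and build(end):
                return True
        return False

    # An empty string is never treated as a concept word.
    return bool(candidate) and build(0)
```

With the sample data, can_construct("计算机语言学", COMPONENTS, NOUNS) is True because the string splits into 计算机 + 语言 + 学, while can_construct("计算机的", COMPONENTS, NOUNS) is False because 的 is not a known unit. Memoizing on the start position keeps the check from re-exploring the same suffix of the candidate.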
Keywords/Search Tags: National Knowledge Infrastructure, Knowledge Acquisition, Concept, Concept Word, Concept Acquisition, Concept Verification, Information Extraction, Context