
Acquiring Commonsense Corpora From Large Scale Web Corpora

Posted on: 2009-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhu
GTID: 2178360242990445
Subject: Computer software and theory
Abstract/Summary:
Commonsense knowledge acquisition is a long-standing challenge in AI research. Previous work on commonsense knowledge acquisition has relied mainly on knowledge engineers reflecting on their own knowledge and entering it manually. The massive size, easy accessibility, and overall domain independence of Web corpora make them another possible source for commonsense knowledge acquisition. An important step in acquiring commonsense knowledge from Web corpora is to select, from Web pages, those sentences that are suitable for commonsense knowledge acquisition. We call these sentences a commonsense corpus.

Through a manual experiment on commonsense knowledge acquisition from Web corpora, we discussed the feasibility of differentiating sentences by how suitable they are for acquiring commonsense knowledge. We also found two regularities that characterize a sentence's suitability for commonsense acquisition. We provided rules to select those sentences whose suitability for commonsense acquisition can be analyzed. To exploit the regularities found in the manual experiment, we used the weighted lexical network model and its training algorithm, together with a method for computing, from Web corpora, the cognitive salience of nominal words in a closed lexicon.

The main contributions of this thesis are as follows.

1. A manual experiment on commonsense acquisition from Web corpora. We analyzed the experimental results from three perspectives: an agreement test, a concordance test, and an acquisition similarity test. We also discussed two regularities that characterize the suitability of a sentence for commonsense acquisition, namely the co-occurrence frequency of semantically related words and the cognitive salience of nominal words.

2. A set of rules to select those sentences whose suitability for commonsense acquisition can be analyzed. Some sentences extracted directly from Web pages cannot be analyzed in this way, mostly because of errors in word segmentation or part-of-speech (POS) tagging, the presence of idioms or non-morpheme characters, named entities split by segmentation, or the presence of ancient Chinese. To avoid the influence of these factors, we designed word-level preprocessing strategies, built the lexicon resources these strategies rely on, and finally derived the filtering rules.

3. The weighted lexical network model and a training algorithm to construct it from Web corpora. To support the analysis of a sentence's suitability for commonsense knowledge acquisition, we delimited the words allowed in the weighted lexical network and built the corresponding lexicon resources. To constrain the preceding-succeeding relation between words in the weighted lexical network, we provided a binary part-of-speech matching relation and the corresponding processing strategies. We used the Jaccard coefficient between a pair of words to clean the directly trained weighted lexical network.

4. A method to compute the cognitive salience of nominal words from Web corpora. We discussed the importance of the nominal words appearing in a sentence when evaluating that sentence's suitability for commonsense acquisition. Drawing on work on basic level category theory in cognitive science, we designed an algorithm to construct a nominal relational network from Web corpora. Based on this relational network, we computed cognitive salience scores for nominal words.

5. An analysis of the suitability of a sentence for commonsense acquisition using the weighted lexical network and the cognitive salience scores of nominal words. We provided an algorithm to construct a sentence lexical network for each sentence from the trained weighted lexical network. Based on features extracted from the sentence lexical network and on the cognitive salience scores of nominal words, we analyzed the difficulty of a sentence for commonsense acquisition. We proposed the concept of the minimal semantic component of a sentence and provided a POS-based type system for minimal semantic components. We designed an algorithm to extract minimal semantic components from a sentence according to this type system. We gave a method to estimate the probability of a semantic component from the information in the weighted lexical network. We defined the inward and outward extension of semantic components. Based on minimal semantic components, their inward and outward extensions, and a frequency threshold for semantic components, we gave a method to analyze how rich a sentence is in commonsense knowledge.
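The abstract states that the Jaccard coefficient between a pair of words is used to clean the trained weighted lexical network, but it does not say how the coefficient is computed or which edges are removed. The following Python sketch shows one plausible reading under stated assumptions: each word is represented by the set of sentence indices in which it occurs, and an edge is dropped when the coefficient of its two endpoints falls below a cutoff. The function names, the edge representation, and the threshold value are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_occurrences(sentences):
    """Map each word to the set of sentence indices it occurs in (assumed representation)."""
    occ = defaultdict(set)
    for idx, words in enumerate(sentences):
        for word in words:
            occ[word].add(idx)
    return dict(occ)


def prune_edges(edges, occ, threshold):
    """Keep only edges whose endpoint words have a Jaccard coefficient >= threshold.

    `edges` maps (preceding_word, succeeding_word) to a weight; the threshold
    is a hypothetical cutoff chosen for illustration.
    """
    return {
        (u, v): weight
        for (u, v), weight in edges.items()
        if jaccard(occ.get(u, set()), occ.get(v, set())) >= threshold
    }


# Toy, pre-segmented corpus (illustrative data only).
sentences = [
    ["bird", "fly"],
    ["bird", "fly", "sky"],
    ["bird", "eat"],
    ["fish", "swim"],
]
occ = build_occurrences(sentences)
edges = {("bird", "fly"): 2, ("bird", "sky"): 1, ("fish", "swim"): 1}
pruned = prune_edges(edges, occ, threshold=0.5)
```

In this toy example the edge between "bird" and "sky" is removed because the two words share only one of the three sentences in which either appears, while the other two edges are kept.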
Keywords/Search Tags: commonsense acquisition, commonsense acquisition from text, commonsense acquisition from web corpora, web corpora, commonsense corpora, a sentence's suitability for commonsense acquisition, a sentence's difficulty for commonsense acquisition