
The Research Of Knowledge Acquisition Algorithm And Semantics Computation For Chinese Vocabulary And Its Applications

Posted on: 2013-02-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Liu
GTID: 1118330374476421
Subject: Computer application technology
Abstract/Summary:
The Web has become the most important resource for information dissemination and sharing due to its rapid development. However, with exponential data growth, it is not easy to find useful knowledge on the Web. "Full of data, but lacking in knowledge" has become an urgent problem for many researchers.

Most research on knowledge acquisition relies solely on computer technology, such as extracting knowledge at the level of grammatical logic using rules or sentence patterns. However, the emergence of new concepts creates many new vocabulary items that consist of several words or morphemes. Existing word segmentation systems often split them into single words or morphemes before collecting them. As a result, existing knowledge acquisition methods cannot recognize them correctly, let alone compare them semantically. This brings new problems to knowledge acquisition, and it also forces search engines that use information retrieval as their main technique for processing web pages to rely on "non-semantic" keyword matching, so that the precision of the content found is lower. Applying semantic computation is expected to improve the situation.

This paper focuses on the research and application of vocabulary knowledge acquisition and vocabulary semantic computing. In particular, to solve the problem of out-of-vocabulary recognition, it recognizes compound words on top of a word segmentation system. Moreover, it builds a part-of-speech tagging model for compound words to eliminate lexical ambiguity, which not only solves the problem that existing part-of-speech tagging models cannot be applied directly to compound words, but also corrects the word segmentation results. Based on compound-word recognition, it extracts thematic words from text and builds a vocabulary semantic computing model so that words can be compared with each other. Replacing the traditional keyword matching approach with a semantic computing approach is fundamental for intelligent information retrieval, for building a vocabulary semantic knowledge base, and for knowledge reasoning.

Finally, a platform for vocabulary knowledge acquisition and semantic calculation is implemented. Based on the proposed algorithms, it integrates vocabulary knowledge acquisition, vocabulary semantic calculation, and a vocabulary semantic knowledge base into one system, and validates the significance and effectiveness of the proposed algorithms.

The main contributions of this paper include:

1. A Chinese compound-word recognition algorithm, CWRWCDG, based on part-of-speech detection and a word co-occurrence directed graph, is proposed to solve the out-of-vocabulary recognition problem. The algorithm first extracts word sequences from a text using part-of-speech detection and then generates a word co-occurrence directed graph from these sequences. After that, inspired by the Bellman-Ford algorithm, it finds, for multiple starting points in the graph, the longest paths whose weight satisfies a given condition. The word strings corresponding to these paths are considered compound words. Experimental results show that the proposed algorithm outperforms existing algorithms.

2. The key problem in labeling Chinese compound words is part-of-speech identification. To solve this problem, a part-of-speech tagging algorithm for Chinese compound words based on head-feature percolation theory is proposed. Lieber introduced the theory in 1980, holding that the lexical category of a compound word is determined by its key attributes. This paper applies the theory to part-of-speech tagging of Chinese compound words and provides two tagging methods: explicit and implicit.

3. Existing thematic term extraction algorithms are often based on word frequency, such as the TF-IDF value, and do not work well on texts with a balanced word distribution. To solve this problem, a thematic term extraction algorithm, TTEITS, based on word position weight and incremental term set frequency, is proposed. The algorithm assumes that different positions of a word in a document indicate different degrees of importance. Moreover, when identifying a thematic term, it calculates not only the weight of the single word but also its incremental weight within the term set. As a result, the algorithm can still extract the most suitable thematic terms even when the candidate terms have a relatively small or average frequency of occurrence.

4. Building on thematic term extraction, an automatic summarization algorithm for Chinese texts, CASTTS, based on the thematic term set, is proposed. The algorithm first uses TTEITS to extract thematic terms and then calculates the weights of the sentences that contain thematic terms, obtaining the total weight of each sentence with respect to the thematic term set. Finally, it selects a certain number of sentences with the largest weights to form the summary. Experimental results show that the algorithm generates high-quality summaries that are very close to the reference summaries.

5. A text similarity calculation method, TSCTTS, based on the thematic term set, is proposed. It transforms text similarity calculation into thematic term set similarity calculation using HowNet. The algorithm first extracts thematic terms using TTEITS and then calculates the semantic distance between two words at the primitive (sememe) level of the HowNet structure. After that, it calculates text similarity from the semantic similarity between thematic terms. The algorithm was applied to text classification, and experimental results demonstrate its effectiveness.
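
The general idea behind contributions 3 and 4 (weight terms by where they occur, then pick the sentences that carry the most thematic weight) can be illustrated with a minimal Python sketch. It is not the TTEITS or CASTTS algorithm itself: the abstract gives no formulas, so the position weights in `POSITION_WEIGHTS`, the function names, and the scoring rule are all assumptions made for illustration, and the incremental term set frequency step is omitted because the abstract does not define it. Input is assumed to be already segmented into tokens; English tokens stand in for segmented Chinese words.

```python
from collections import defaultdict

# Hypothetical position weights; the abstract does not give concrete values.
POSITION_WEIGHTS = {"title": 3.0, "first": 2.0, "body": 1.0}


def term_weights(title, sentences):
    """Sum a position weight for every occurrence of every token.

    `title` is a list of tokens; `sentences` is a list of token lists.
    """
    weights = defaultdict(float)
    for token in title:
        weights[token] += POSITION_WEIGHTS["title"]
    for index, sentence in enumerate(sentences):
        position = "first" if index == 0 else "body"
        for token in sentence:
            weights[token] += POSITION_WEIGHTS[position]
    return dict(weights)


def top_terms(weights, k):
    """Return the k heaviest terms, breaking ties alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def summarize(sentences, weights, terms, n):
    """Pick the n sentences with the largest total thematic-term weight.

    Each thematic term counts once per sentence. The chosen sentences are
    returned in their original document order.
    """
    term_set = set(terms)
    scores = [
        (sum(weights.get(t, 0.0) for t in set(sentence) & term_set), index)
        for index, sentence in enumerate(sentences)
    ]
    best = sorted(scores, key=lambda s: (-s[0], s[1]))[:n]
    return [sentences[index] for _, index in sorted(best, key=lambda s: s[1])]
```

Called as `summarize(sentences, weights, top_terms(weights, k), n)`, the sketch returns `n` sentences. Because a term that appears in the title or the first sentence receives extra weight, it can rank as thematic even when its raw frequency is no higher than that of other words, which is the situation the abstract says frequency-only methods handle poorly.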
Keywords/Search Tags: Vocabulary Knowledge, Compound-word, Thematic Term, Semantic Computation, Text Similarity