
Short Text Processing Method Based On Wikipedia

Posted on: 2017-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Luo
Full Text: PDF
GTID: 2308330482980504
Subject: Software engineering

Abstract/Summary:
With the popularization of instant messaging and Internet technology, diversified social network systems have gradually taken shape, and the volume of short text data grows with each passing day. Handling large amounts of short text data is therefore an important problem. Short text is brief and its features are sparse, so ordinary text classification methods are not well suited to it; short text classification has become a focus and a difficulty of current research. Research on short text classification at home and abroad has mainly concentrated on short text processing and on improving classification algorithms. This thesis studies short text word sense disambiguation and short text feature extension within short text processing, and finally applies common algorithms to classify the processed short text. Because Wikipedia offers comprehensive data and rich semantics, it was adopted as an external knowledge base, and a Wikipedia-based short text word sense disambiguation method and a Wikipedia-based short text feature extension method were proposed. They address the problems of polysemy and sparse features in short text, and can effectively improve short text classification performance. The main work is as follows:

1) The proposal of a TF-IDF algorithm based on word frequency statistics

Focused on the low efficiency and poor accuracy of the traditional TF-IDF algorithm in keyword extraction, a TF-IDF algorithm based on word frequency statistics was proposed.
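As background for this point, the following is a minimal sketch of the traditional TF-IDF weighting that the method improves on, together with the frequency-class proportion commonly derived from Zipf's Law (the share of vocabulary occurring exactly r times is roughly 1/(r(r+1)), so low-frequency words dominate). How the thesis combines the two into TFIDFWFS is its own contribution and is not reproduced here; the function names are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Traditional TF-IDF over a list of tokenised documents.
    Returns one {word: score} dict per document."""
    n = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

def zipf_proportion(r):
    """Estimated proportion of vocabulary occurring exactly r times,
    a standard consequence of Zipf's Law: about 1/(r*(r+1)).
    E.g. roughly half of all distinct words appear only once."""
    return 1.0 / (r * (r + 1))
```

Note that a word appearing in every document gets an IDF of log(n/n) = 0, so it contributes nothing as a keyword, while the Zipf proportion shows most candidate keywords sit in the low-frequency classes.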
Firstly, a formula for the number of same-frequency words in a text was derived from Zipf's Law. Secondly, the proportion of words at each frequency was determined according to that formula, most of them being low-frequency words. Finally, the TF-IDF algorithm based on word frequency statistics (TFIDFWFS) was obtained by combining this word frequency statistics law with the traditional TF-IDF algorithm. Simulation experiments were conducted on Chinese and English text data sets. The results show that, in text keyword extraction, TFIDFWFS is superior to the traditional TF-IDF algorithm in precision, recall and F1-measure, and can effectively reduce the runtime of keyword extraction.

2) The proposal of a Wikipedia-oriented TF-IDF algorithm based on word frequency statistics

Focused on the fact that TFIDFWFS does not consider Wikipedia page features, a Wikipedia-oriented TF-IDF algorithm based on word frequency statistics was proposed; it selects the most representative words to represent a Wikipedia entry. Firstly, a text-structure-weighted TF method was proposed using the structure of Wikipedia text; secondly, an anchor-text-weighted TF method was proposed using anchor text information; thirdly, a category-information-weighted TF-IDF method was proposed using category information; finally, the Wikipedia-oriented TF-IDF algorithm based on word frequency statistics (W-TFIDFWFS) was obtained by combining these Wikipedia page features with TFIDFWFS. Simulation experiments were conducted on Chinese and English Wikipedia data sets. The results show that, in Wikipedia page keyword extraction, W-TFIDFWFS is superior to TFIDFWFS in precision, recall and F1-measure.
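The field-weighted TF idea behind W-TFIDFWFS can be sketched as follows. The field names and weight values here are hypothetical placeholders, not the thesis's actual parameters; the point is only that tokens from more prominent parts of a Wikipedia page (title, headings, anchor text, categories) contribute more to a word's term frequency than body text does.

```python
from collections import Counter

# Hypothetical field weights; the thesis derives its own weighting
# from Wikipedia text structure, anchor text and category information.
FIELD_WEIGHTS = {"title": 4.0, "heading": 2.0, "anchor": 2.0,
                 "category": 3.0, "body": 1.0}

def weighted_tf(fields):
    """Weighted term frequency over a Wikipedia page.

    fields: dict mapping a field name ("title", "body", ...) to the
    list of tokens found in that field.  Unknown fields get weight 1.
    Returns a Counter of weighted counts per token."""
    tf = Counter()
    for field, tokens in fields.items():
        w = FIELD_WEIGHTS.get(field, 1.0)
        for tok in tokens:
            tf[tok] += w
    return tf
```

The weighted counts would then replace the raw TF term inside the TFIDFWFS score, so a word appearing in the title outranks one appearing only in the body.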
W-TFIDFWFS can accurately calculate word weights and effectively extract the core information of a Wikipedia page.

3) The proposal of a short text word sense disambiguation method based on Wikipedia

Focused on the problem of polysemy in short text, a short text word sense disambiguation method based on Wikipedia (STWSDMW) was proposed. Firstly, the disambiguation candidate set of a polysemous word is obtained from its Wikipedia disambiguation page; secondly, the set of Wikipedia entry titles of the unambiguous words in the short text is acquired; thirdly, a similarity score is calculated for each disambiguation candidate; finally, the candidate with the highest similarity score is selected as the final disambiguation result. Simulation experiments were conducted on Chinese and English short text data sets. The results show that using STWSDMW to resolve word sense ambiguity in short text can effectively improve the performance of short text classification.

4) The proposal of a short text feature extension method based on Wikipedia

Because the features of short text are sparse, short text classification accuracy is low. Focused on this issue, a short text feature extension method based on Wikipedia (STFEMW) was proposed. Firstly, each word in the short text is disambiguated; secondly, the vector representation of the Wikipedia page of each word is obtained; finally, the top k words with the largest weights in that Wikipedia page are selected and added to the feature word set of the short text. Simulation experiments were conducted on Chinese and English short text data sets. The results show that using STFEMW for feature extension in short text can effectively improve the performance of short text classification.
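The two final steps can be sketched together: pick the candidate sense whose page best matches the short text's unambiguous context, then extend the feature set with the top-k weighted words of the chosen page. A plain set-overlap score stands in for the thesis's similarity measure, and all names and weights below are illustrative toy data, not the actual method's parameters.

```python
def disambiguate(candidates, context):
    """Pick the candidate sense whose representative terms best
    overlap the context (a set of unambiguous context words).

    candidates: dict mapping sense name -> list of its key terms
    (e.g. extracted from that sense's Wikipedia page).
    A simple overlap count stands in for the similarity score."""
    return max(candidates,
               key=lambda sense: len(set(candidates[sense]) & context))

def extend_features(features, page_word_weights, k=3):
    """Append the top-k highest-weight words of the resolved
    Wikipedia page to the short text's feature word set,
    skipping words the short text already contains."""
    top = sorted(page_word_weights, key=page_word_weights.get,
                 reverse=True)[:k]
    return list(features) + [w for w in top if w not in features]
```

For example, for the ambiguous word "apple" in a short text whose unambiguous words suggest food, the fruit sense would win the overlap score, and its page's strongest words would then pad out the sparse feature vector before classification.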
Keywords/Search Tags: word frequency statistics, Wikipedia, short text, word sense disambiguation, feature extension