
Text Content Analysis Based On Word Association Network

Posted on: 2014-01-06, Degree: Master, Type: Thesis
Country: China, Candidate: Y H Lin, Full Text: PDF
GTID: 2248330398471030, Subject: Signal and Information Processing
Abstract/Summary:
With the development of the Internet, the types of information and text have become increasingly rich, ranging from long-form media such as online news and blogs to short texts such as SMS messages and microblog posts. As people expect more efficient, intuitive and intelligent ways of processing information, text content analysis and information retrieval have gradually become hot topics. Data mining and natural language processing, which greatly help to organize and process information effectively, have important theoretical significance and practical value in the era of big data.

In Chinese, the character is treated as the smallest unit of text; correct Chinese word segmentation can assist text modeling and reveal the potential relationships between words and text content. On the one hand, a network of word relationships can help retrieval tasks by reconstructing query words more accurately, inferring the user's intent from the search expression, and providing precise search results. On the other hand, the wording of a text can describe news themes by presenting the relationships between words, making content analysis simple and clear and achieving the goal of knowledge discovery.

In this paper, the subject of study is the word. Based on a deep understanding of text models, we study term relationship analysis and text mining, with the aim of text content analysis. The innovations and contributions of this paper are fourfold:

First, we propose an unsupervised algorithm that finds text keywords by building word relationships. Combining word/term co-occurrence frequency with knowledge-based Chinese word segmentation can better discover keywords and themes at the same time, and is able to generate many new words. It can also help build a dictionary for a Chinese corpus, which lays the research foundation for the subsequent study of term relationships.
Experiments show that the algorithm also achieves better thesaurus results on an English news corpus.

Second, we propose a resistor network model to calculate the relationships between words in semantic space. By modeling the word space network as an electrical network, we effectively simplify the computation over the complex, sparse word association network, focusing on the query expansion problem for short texts. In the TREC Microblog retrieval evaluation, comparative experiments show that the model not only provides more semantically related expansion words but also improves the accuracy of search results.

Third, we propose a word clustering method for the activation force model, based on WAF. Clustering the different connotations of a word according to the affinity force of WAF better expresses the meaning of the word and allows the relationship network of term clusters to be visualized. The visualization has been applied in two systems, one using the BNC news corpus and one using the COSE campus search entity relationship corpus. The results show that the method is feasible and effective.

Fourth, we design and implement a search engine named COSE. Its entity extraction and entity-relationship analysis make use of the word association network described above; with the relationships built, it not only supports structured entity search but also visualizes campus entity relationships. The system has good scalability. This work is not described in a separate chapter but is introduced in Chapters 2, 4 and 5.
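The abstract does not give the formulas of the resistor network model, so the following is only a rough sketch of the general idea, not the method of the thesis. It assumes that each word is a node, that each association weight (for example a co-occurrence score) acts as the conductance of a resistor between two words, and that a lower effective resistance between two words indicates a stronger relationship. The function name `effective_resistance` and the toy weights are hypothetical.

```python
def effective_resistance(edges, source, target):
    """Effective resistance between two nodes of a connected resistor network.

    edges maps an undirected pair (u, v) to a conductance (assumed here to be
    an association weight between two words). This is a generic electrical
    network computation, not the formulation used in the thesis.
    """
    if source == target:
        return 0.0
    nodes = sorted({n for pair in edges for n in pair})
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)

    # Build the weighted graph Laplacian.
    lap = [[0.0] * size for _ in range(size)]
    for (u, v), g in edges.items():
        i, j = index[u], index[v]
        lap[i][i] += g
        lap[j][j] += g
        lap[i][j] -= g
        lap[j][i] -= g

    # Inject one unit of current at the source and ground the target node
    # (its potential is fixed at 0, so its row and column are dropped).
    keep = [i for i in range(size) if i != index[target]]
    aug = [[lap[i][j] for j in keep] + [1.0 if i == index[source] else 0.0]
           for i in keep]

    # Solve the reduced system by Gauss-Jordan elimination with pivoting.
    m = len(aug)
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("network must be connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, m + 1):
                    aug[r][c] -= factor * aug[col][c]

    # With unit current, the source potential equals the effective resistance.
    row = keep.index(index[source])
    return aug[row][m] / aug[row][row]


# Toy word network with made-up association weights.
word_edges = {
    ("microblog", "tweet"): 1.0,
    ("tweet", "post"): 1.0,
    ("microblog", "post"): 1.0,
}
print(effective_resistance(word_edges, "microblog", "post"))
```

In a sketch like this, candidate expansion words for a query term could be ranked by ascending effective resistance, since an indirect path through shared neighbours lowers the resistance even when two words rarely co-occur directly.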
Keywords/Search Tags: Word Association Network, Microblogging, WAF, Query Expansion, Visualization, Content Analysis