
Research On Automatic Annotation For Chinese Text And Its Application

Posted on: 2016-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: J X Nan
Full Text: PDF
GTID: 2298330467491859
Subject: Signal and Information Processing
Abstract/Summary:
With the spread of the Internet, the volume of online information is exploding. In the Web 2.0 era in particular, ordinary users can create content themselves, which accelerates this growth. This mass of information brings people convenience, but it also brings trouble, so filtering useful information out of it has become an urgent problem. Text annotation addresses this problem. This thesis studies text annotation based on complex network theory and the TextRank algorithm, and presents the annotation results visually to help people select information quickly.

The main work of this thesis is as follows:

1. Research on text annotation based on complex networks. Based on complex network theory, this thesis presents an algorithm called EC-DC for annotating text. The algorithm first preprocesses an article to obtain candidate words. The candidate words are then mapped to nodes in a network, and the co-occurrence relationships among them are mapped to edges. The importance of each word is evaluated using eccentricity centrality and degree centrality. Finally, the K most important words are selected to annotate the text.

2. Research on text annotation based on TextRank. The weight of each candidate word is calculated from its word frequency, word location and word span, and the distance between words is calculated from their co-occurrence relationship. A candidate word is treated as an object whose mass equals its weight. By analogy with gravitation, the strength of attraction between two words is calculated from their weights and distance, and this strength replaces the edge relationship in the TextRank algorithm. Compared with the original algorithm, the improved TextRank algorithm makes fuller use of the information in the text: it takes into account not only the co-occurrence relationships between words but also the information carried by the words themselves.

3. Visualization of text annotation. Visualization plays an important role in the age of big data. At the end of this thesis, a visualization system for text annotation is implemented. According to the weights calculated by the two algorithms above, the annotation words are presented in a tag cloud, in which more important words are larger and more eye-catching.

In this thesis, text annotation is realized based on complex networks and on TextRank respectively. Compared with TF-IDF, both algorithms improve precision, recall and F1, which demonstrates their effectiveness. Meanwhile, the visualization system presents the annotation words to users directly, making it convenient to filter information. The system achieves the expected results.
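
The abstract does not give the EC-DC formulas, so the following Python sketch is only an illustration of the pipeline described in item 1: candidate words become nodes, co-occurrence within a window becomes edges, and each word is scored from its degree centrality and eccentricity centrality. The function names, the `window` parameter, the use of 1/eccentricity as eccentricity centrality, and especially the choice to multiply the two centralities are assumptions made here, not details taken from the thesis.

```python
from collections import defaultdict, deque


def build_cooccurrence_graph(words, window=2):
    """Link two distinct words that appear at most `window` positions apart."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        graph[w]  # touch the key so isolated words still become nodes
        for v in words[i + 1 : i + 1 + window]:
            if v != w:
                graph[w].add(v)
                graph[v].add(w)
    return graph


def eccentricity(graph, source):
    """Longest shortest-path distance from `source` to any reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return max(dist.values())


def ec_dc_scores(graph):
    """Score each word from degree centrality and eccentricity centrality.

    Assumption: the two centralities are combined by multiplication;
    the abstract does not say how the thesis combines them.
    """
    n = len(graph)
    scores = {}
    for node in graph:
        degree_c = len(graph[node]) / (n - 1) if n > 1 else 0.0
        ecc = eccentricity(graph, node)
        ecc_c = 1.0 / ecc if ecc > 0 else 0.0  # isolated node scores 0
        scores[node] = degree_c * ecc_c
    return scores


def top_k(scores, k):
    """Return the k highest-scoring words, breaking ties alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical candidate words left after preprocessing an article.
candidates = ["network", "text", "network", "word", "network", "graph"]
graph = build_cooccurrence_graph(candidates, window=1)
print(top_k(ec_dc_scores(graph), 1))
```

In this toy input, "network" is adjacent to every other word, so it has both the highest degree and the smallest eccentricity and is selected first. In a disconnected graph the eccentricity here is computed over reachable nodes only, which is another simplifying assumption.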
Keywords/Search Tags: text annotation, keyword extraction, complex network centrality, TextRank algorithm, information visualization
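
The gravitation-style variant of TextRank described in the abstract can be sketched roughly as follows. The abstract does not state how word weights are derived from frequency, location and span, nor how distance is derived from co-occurrence, so this sketch takes both as given inputs (`mass` and `distance` are hypothetical names). The Newtonian form m_i * m_j / d^2 for the attraction is likewise an assumption suggested by the gravitation analogy. The ranking step uses the standard weighted TextRank update with a damping factor.

```python
def attraction(mass, distance):
    """Edge weights f(i, j) = m_i * m_j / d_ij**2 for each pair with a known distance.

    Assumption: an inverse-square law, by analogy with gravitation.
    Masses and distances are assumed to be strictly positive.
    """
    weights = {}
    for (a, b), d in distance.items():
        f = mass[a] * mass[b] / (d * d)
        weights.setdefault(a, {})[b] = f
        weights.setdefault(b, {})[a] = f
    return weights


def weighted_textrank(weights, damping=0.85, iterations=50):
    """Weighted TextRank on an undirected graph given as {node: {neighbor: weight}}."""
    scores = {node: 1.0 for node in weights}
    # Total edge weight leaving each node, used to normalize its contributions.
    out_sum = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    for _ in range(iterations):
        new_scores = {}
        for node in weights:
            incoming = sum(
                weights[nb][node] / out_sum[nb] * scores[nb]
                for nb in weights[node]
            )
            new_scores[node] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores


# Hypothetical word weights ("masses") and pairwise distances.
mass = {"text": 3.0, "annotation": 2.0, "cloud": 1.0}
distance = {
    ("text", "annotation"): 1.0,
    ("annotation", "cloud"): 2.0,
    ("text", "cloud"): 2.0,
}
scores = weighted_textrank(attraction(mass, distance))
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Replacing the plain co-occurrence edge with an attraction value lets a heavy word pull rank toward itself and its close neighbors, which is one way to read the abstract's claim that the improved algorithm uses the information of the words as well as their co-occurrence.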