The Research On Semantic-Based Web Information Automatic Aggregation System And Key Technology

Posted on: 2015-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: N Gong
GTID: 2298330467463873
Subject: Electronic Science and Technology
Abstract/Summary:
In recent years, with the success of social networks, personal blogs, and Twitter, the Internet has entered the Web 2.0 era, which is characterized by openness, equality, and decentralization. The enormous growth of Web information resources has made information overload an increasingly serious problem. How to dynamically associate and aggregate semi-structured, discrete information so as to provide effective services and promote knowledge sharing has therefore become a major research direction.

Building on a study of text clustering analysis, and drawing on text processing techniques such as Chinese word segmentation together with traditional search engine technology and RSS information aggregation technology, this thesis presents an information processing method for refining information. The method automatically aggregates identical or similar information based on latent semantics, so that new topics can be discovered and existing topics tracked. The main work of this study is as follows:

Firstly, to address the lack of information processing in traditional information aggregation technology, this thesis proposes an automatic web information aggregation system. The system is divided by function into three parts: information acquisition, information preprocessing, and semantic aggregation.

Secondly, this thesis proposes a web content extraction method based on punctuation distribution and HTML tag similarity. Experimental results show that the method extracts web content effectively and accurately from pages on different themes.

Thirdly, this thesis studies topic models of text in depth, especially the LDA model, which can cluster text based on latent semantic information. Because Web information is diverse and its topics change over time, the thesis improves LDA so that the model, which in its original form can handle only offline information, can be applied to an online Web information aggregation system. Experimental analysis shows that the algorithm can group documents with similar subjects based on latent semantics, and can also analyze topic trends from the topic distribution and topic popularity in different time periods.
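The abstract does not define how topic popularity is computed. As a minimal sketch of the idea only, the example below assumes that each document already has a topic distribution (for example, from an LDA model) and a time-slice label, and it takes popularity to be the mean topic proportion within each time slice. The function names `topic_popularity` and `dominant_topic`, the time-slice labels, and the sample numbers are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def topic_popularity(doc_topics):
    """Average the per-document topic distributions within each time slice.

    doc_topics: iterable of (time_slice, [p_topic0, p_topic1, ...]) pairs.
    Returns {time_slice: [mean proportion of each topic in that slice]}.
    Assumption: "popularity" means the mean topic proportion; the thesis
    may define it differently.
    """
    sums = {}
    counts = defaultdict(int)
    for time_slice, dist in doc_topics:
        if time_slice not in sums:
            sums[time_slice] = [0.0] * len(dist)
        for k, p in enumerate(dist):
            sums[time_slice][k] += p
        counts[time_slice] += 1
    return {t: [s / counts[t] for s in sums[t]] for t in sums}


def dominant_topic(dist):
    """Return the index of the topic with the largest proportion."""
    return max(range(len(dist)), key=dist.__getitem__)


if __name__ == "__main__":
    # Hypothetical document-topic distributions for two time slices.
    docs = [
        ("2015-01", [0.75, 0.25]),
        ("2015-01", [0.50, 0.50]),
        ("2015-02", [0.25, 0.75]),
    ]
    for time_slice, popularity in sorted(topic_popularity(docs).items()):
        print(time_slice, popularity, "dominant topic:", dominant_topic(popularity))
```

Comparing the per-slice averages over consecutive time slices gives a simple view of whether a topic is rising or fading.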
Keywords/Search Tags: information aggregation, LDA model, content extraction, latent semantic