
Study on Semantic Information Extraction from Web Pages

Posted on: 2009-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: P Y Yang
GTID: 2178360308979278
Subject: Computer application technology

Abstract/Summary:
As a vast library of information resources, the Internet has become the primary means of accessing information and one of the most important sources of educational resources. However, with the explosive growth of online resources, people find it increasingly difficult to locate the information they are interested in. Search engines have made searching easier, but their weaknesses have become apparent in recent years: they cannot accurately return the results that users are actually looking for. The reason is that existing Web content is designed to be understood by humans and lacks information that computers can understand. The Semantic Web, which is intended to solve this problem, has existed for years. However, during the transition between the two generations of the Web, people cannot immediately abandon the wealth of information in existing Web pages, so making this transition smooth is the key issue. At present, semantic information extraction for the Semantic Web requires a large amount of manual work, while more highly automated extraction techniques perform poorly. Approaches that rely on custom extraction rules lack generality and are difficult to upgrade.

To address these issues, this thesis proposes a model for extracting semantic information from Web pages based on text clustering. The model can automatically mark up the bulk of a Web page and automatically extract its semantic information. Specifically, a preprocessing technique based on the visual features of the Web page removes noise from the page and improves the accuracy and speed of semantic information extraction. Semantic tagging based on text clustering is also proposed.
An improved HAC (hierarchical agglomerative clustering) algorithm that operates on paragraphs is proposed. It clusters the paragraphs of the text in a bottom-up manner, and it extracts and rolls up the candidate keywords of the paragraphs during clustering. The "Semantic Cluster" is defined, and semantic keywords are generated for every level. A method for extracting semantic information from Web pages by clustering Semantic Clusters is designed. The hierarchy of semantic entities is analyzed by using different text clustering thresholds. The model also analyzes the correlation between Semantic Clusters and establishes semantic associations among them. After this, a semantic theme concept called the "Seed Semantic Cluster" is generated for extracting the semantic information of Web pages.

The experiments show that the improved hierarchical algorithm proposed in this thesis increases clustering accuracy in the text clustering stage and reduces the number of keywords. In the semantic information extraction stage, the Semantic Cluster clustering algorithm has certain advantages in time and accuracy over the traditional algorithm.
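
The bottom-up paragraph clustering with keyword roll-up described in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the thesis's improved algorithm, so this is plain agglomerative clustering under stated assumptions: paragraphs are represented as term-count vectors, similarity is cosine similarity, a merged cluster's vector is the sum of its members' vectors, and merging stops when the best similarity falls below a threshold. The names `cluster_paragraphs`, `threshold`, and `top_k` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_paragraphs(paragraphs, threshold=0.3, top_k=3):
    """Merge paragraphs bottom-up; return (member indices, keywords) per cluster.

    Assumption: this is generic HAC, not the thesis's improved variant.
    """
    # Start with one cluster per paragraph.
    clusters = [
        {"members": [i], "vector": Counter(tokenize(p))}
        for i, p in enumerate(paragraphs)
    ]
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i]["vector"], clusters[j]["vector"])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break  # No pair is similar enough to merge.
        i, j = best_pair
        # Roll up: the merged cluster carries the summed term counts.
        merged = {
            "members": clusters[i]["members"] + clusters[j]["members"],
            "vector": clusters[i]["vector"] + clusters[j]["vector"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    # Candidate keywords are the most frequent terms of each cluster.
    return [
        (sorted(c["members"]), [w for w, _ in c["vector"].most_common(top_k)])
        for c in clusters
    ]
```

Running the function with several different `threshold` values yields coarser or finer groupings, which loosely mirrors the abstract's idea of analyzing the hierarchy with different clustering thresholds. A real implementation would also need stop-word removal and a better keyword weighting than raw counts.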
Keywords/Search Tags: Semantic Web, Text Cluster, Semantic Information, Semantic Cluster, Information Extraction