Study On Semantic Information Extraction From Web Page

Posted on:2009-04-15

Degree:Master

Type:Thesis

Country:China

Candidate:P Y Yang

Full Text:PDF

GTID:2178360308979278

Subject:Computer application technology

Abstract/Summary:

As a vast library of information resources, the Internet has become the primary means of accessing information and one of the most important sources of educational resources. However, with the explosive growth of online resources, people find it increasingly difficult to locate the information they are interested in. Search engines have made information searching easier, but their weaknesses have gradually become apparent in recent years: they cannot accurately return the results users are actually looking for. The reason is that existing Web content is designed for human understanding and lacks information that computers can understand. The Semantic Web, which is considered a solution to this problem, emerged years ago. However, during the transition between the two generations of the Web, people cannot immediately abandon the wealth of information in existing Web pages, so making this transition smooth is the key issue. At present, semantic information extraction for the Semantic Web requires a large amount of manual work, while extraction techniques with a higher degree of automation perform poorly. Approaches that rely on custom extraction rules lack generality and are difficult to upgrade.

To address these issues, this thesis proposes a model for extracting semantic information from Web pages based on text clustering. The model can automatically annotate Web pages in bulk and automatically extract their semantic information. Specifically, a preprocessing technique based on the visual features of the Web page removes noise from the page and improves the accuracy and speed of semantic information extraction. A semantic tagging method based on text clustering is proposed. An improved hierarchical agglomerative clustering (HAC) algorithm based on paragraphs is proposed; it clusters the paragraphs of the text in a bottom-up manner, and it extracts and rolls up the candidate keywords of the paragraphs during the clustering process. The "Semantic Cluster" is defined, and the semantic keywords of every level are generated. A method of extracting semantic information from Web pages based on clustering of Semantic Clusters is designed. The hierarchy of semantic entities is analyzed by using different text clustering thresholds. The model also analyzes the correlation between Semantic Clusters and establishes semantic associations among them. After this, a semantic theme concept called the "Seed Semantic Cluster" is generated for extracting the semantic information of Web pages.

The experiments show that the improved hierarchy-based algorithm proposed in this thesis increases clustering accuracy in the text clustering stage and reduces the number of keywords. In the semantic information extraction stage, the Semantic Cluster clustering algorithm has certain advantages in time and accuracy compared with the traditional algorithm.
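The abstract does not give the details of the improved HAC algorithm, so the following is only a minimal sketch of the general idea it describes: plain bottom-up clustering of paragraphs with keywords rolled up as clusters merge. Everything in it is an assumption made for illustration, not the thesis's method: the names `term_counts`, `cosine` and `cluster_paragraphs`, the bag-of-words cosine similarity, the summed term counts used as a cluster representation, and the `threshold` and `top_k` parameters. Stop-word filtering and the visual preprocessing step are omitted.

```python
from collections import Counter
import math
import re


def term_counts(text):
    """Bag of words for one paragraph (toy tokenizer, assumed for illustration)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_paragraphs(paragraphs, threshold, top_k=3):
    """Bottom-up clustering: repeatedly merge the most similar pair of
    clusters until no pair reaches `threshold`. Each cluster keeps the
    summed term counts of its members, so candidate keywords are rolled
    up every time two clusters merge."""
    clusters = [
        {"members": [i], "terms": term_counts(p)}
        for i, p in enumerate(paragraphs)
    ]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i]["terms"], clusters[j]["terms"])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = pair
        merged = {
            "members": clusters[i]["members"] + clusters[j]["members"],
            "terms": clusters[i]["terms"] + clusters[j]["terms"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [
        {
            "members": sorted(c["members"]),
            "keywords": [t for t, _ in c["terms"].most_common(top_k)],
        }
        for c in clusters
    ]


if __name__ == "__main__":
    paragraphs = [
        "semantic web ontology semantic",
        "ontology semantic web markup",
        "football match goal score",
    ]
    for cluster in cluster_paragraphs(paragraphs, threshold=0.3):
        print(cluster["members"], cluster["keywords"])
```

Running the same function with a lower `threshold` merges more paragraphs into fewer, coarser clusters, which is one simple way to read the abstract's remark that different clustering thresholds expose different levels of the hierarchy.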

Keywords/Search Tags:

Semantic Web, Text Cluster, Semantic Information, Semantic Cluster, Information Extraction

Related items

1	Research On Chinese Text Clustering Algorithm Based On Semantic Cluster
2	A Semantic Enhancement Of Text Clustering Algorithm
3	Study On The Approaches Of Semantic Overlapping Communities Detection
4	Research On Algorithms For Machine Learning And Text Mining
5	Automatic Semantic Annotation Method For IoT Sensory Data Research And Implementation
6	Research And Design Of Semantic-Based Web Information Extraction System
7	The Semantic Information Automatic Generation
8	Research On System Of Multi-field Information Extraction Based On Semantic Role And Concept Graphs
9	Research On Semantic Extraction Of Content-based Video Retrieval
10	Research On Information Retrieval Technology Based On Semantic Web