
A Method Of Extracting The Graphic-Text Abstract Of Webpage Based On OWL

Posted on: 2015-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: F J Han
GTID: 2268330428980410
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of web data mining, information can be obtained more efficiently and accurately. Main-text extraction from webpages, a hot topic in web data mining research, can only produce a text abstract, yet about 80 percent of the information humans receive comes through vision and pictures. Web data mining can therefore be expected to extract both a text abstract and a picture abstract in the near future. Moreover, with the growth of mobile applications, obtaining information through mobile media has become an important channel. To obtain information more quickly, accurately, and intuitively than mobile media currently allow, webpages should serve as another effective source. Future extraction research may thus focus on extracting both the text abstract and the picture abstract, so that pictures visually express the key information of the webpage abstract. This new form of webpage abstract is defined as the Webpage Graphic-Text Abstract.
Technologies closely related to the extraction of the webpage Graphic-Text Abstract have already been applied in some mobile applications, such as the mobile clients of Netease News and Daily News, Zaker, and Flipboard. These applications are limited because their news is entered manually. Daily News and Zaker generally show only the news title, without body text or pictures, in the news list, which limits reading quality for news with multiple pictures. As for Flipboard, it shows the first picture by default when a news item has more than one picture. In the Netease News mobile client, the picture shown in the news list sometimes cannot be found in the original news webpage, which is inconvenient for users.
To solve these problems, we propose a method of extracting the Graphic-Text abstract of a webpage based on OWL (EGTAO). In this method, the ontology model of the webpage (OMW) is first built with the Web Ontology Language (OWL). The OMW is then traversed to obtain semantic properties of the text and pictures. Finally, the Graphic-Text abstract of the webpage is extracted more accurately and in a more user-friendly way by applying an algorithm for extracting the text abstract and an algorithm for extracting the picture abstract. The method has the following steps:
1. Build the ontology model of the webpage (OMW) with the Web Ontology Language (OWL). Starting from the webpage structure represented as a traditional DOM tree, we decompose the original webpage into its parts, relate them to one another with ObjectProperties, and then use the ontology-building tool Protege to build the OMW.
2. Extract the picture abstract from the original webpage. Based on the OMW, we traverse the model to obtain the semantic properties of each picture and its align properties in HTML. A normalization step then combines these properties into a single, more suitable property that serves as the basis for extracting the picture abstract.
3. Extract the text abstract from the original webpage. Based on the OMW, and in combination with traditional DOM-tree-based extraction, we traverse the model to obtain the semantic properties of the text and its txt_key, txt_tit, and txt_text properties. A normalization step then combines these properties into a single, more suitable property that serves as the basis for extracting the text abstract.
Experimental results show that, compared with the traditional method, the proposed method extracts a more accurate, representative, and appropriate abstract. The method is expected to play a significant role in the future development of mobile applications and search engines. This research contributes to moving webpage data mining research from the theoretical level toward industrial implementation.
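The abstract does not give the normalization formula used in steps 2 and 3, so the following is only a minimal sketch of one plausible reading: each candidate picture carries several properties, every property is min-max scaled to [0, 1], and a weighted sum yields the single score used to choose the picture abstract. The property names (`keyword_overlap`, `area`, `align_centered`), the `PictureCandidate` type, and the weights are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class PictureCandidate:
    src: str
    keyword_overlap: float  # hypothetical: share of title keywords found near the picture
    area: float             # hypothetical: rendered width * height in pixels
    align_centered: float   # hypothetical: 1.0 if the HTML align property is "center", else 0.0


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pick_picture(candidates, weights=(0.5, 0.3, 0.2)):
    """Return the candidate with the highest weighted normalized score.

    The weights are illustrative assumptions, not values from the thesis.
    """
    if not candidates:
        return None
    # Normalize each property across all candidates so they are comparable.
    columns = [
        min_max([c.keyword_overlap for c in candidates]),
        min_max([c.area for c in candidates]),
        min_max([c.align_centered for c in candidates]),
    ]
    # Combine the normalized properties into one score per candidate.
    scores = [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(candidates))
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

Under these assumptions, a large banner with no keyword overlap loses to a somewhat smaller, centered picture that matches the title keywords. The same normalize-then-combine pattern could in principle be applied to the text properties txt_key, txt_tit, and txt_text, though how the thesis weights them is not stated.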
Keywords/Search Tags:Graphic-Text abstract, Algorithm of Extracting the Graphic Abstract, Algorithm of Extracting the Text Abstract, Ontology Model of Webpage (OMW)