Research And Application Of The Web Information Extraction Based On Multi-feature

Posted on: 2016-10-22  Degree: Master  Type: Thesis
Country: China  Candidate: Y Chen  Full Text: PDF
GTID: 2308330473457809  Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the volume of data on the Internet has grown explosively, and cloud computing and big data analysis based on Internet data are growing rapidly. However, web pages do not contain only important data. They also contain noise such as advertisements and navigation, which seriously reduces the accuracy of data extraction. How to improve extraction accuracy has therefore become a research hot spot.

The key to current web information extraction methods is distinguishing important information from noise in order to improve the accuracy and efficiency of extraction. After extraction, however, the results lack formal organization, so all of the information is mixed together and cannot be told apart or classified. The extracted information is therefore coarse-grained and of limited use in other applications.

This paper introduces and discusses the development and principles of several web information extraction algorithms, focusing mainly on the VIPS algorithm. The main research points of this paper are as follows:

(1) To address the disorganization of web information, we propose a formal, organized information description model. After filtering out noise, we further subdivide the coarse-grained web information. For portal websites on the Internet, we describe the important information as title, visit count, main body, multimedia information, comments, and so on. We also assign a different weight to each element of the formal description. The extraction precision for a single web page can then be estimated from whether each element, with its corresponding weight, is present in the extraction result. As a result, the extracted information becomes formal and strictly organized, which makes it more usable for further data analysis and other applications.

(2) Because current extraction algorithms do not adequately support the formal, organized information description proposed in this paper, we propose a modified extraction technique based on the VIPS algorithm. Our method integrates visual features with the DOM structure, parsing the DOM structure from top to bottom. We use both visual features and the DOM structure as the basis for extraction and combine tag blocks with visual blocks. We then classify blocks according to the organized information description and merge similar blocks according to their block features. In the end, the extracted information is divided into different blocks according to the organized information description. By merging structural and visual features, our method increases extraction precision.

(3) Finally, we compare the effectiveness of the proposed method with previous information extraction algorithms and apply the results to the proposed organized information description model. Simulation experiments indicate that the proposed model is more formal, more useful, and more precise for classification. We also apply it to an existing traditional mobile web system, the mobile campus website of a university in Qingdao, which is visited mainly from Android and iOS smart devices. Regrouping the information improved the user experience, and the experiment produced good results.
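
The weighted precision estimate in point (1) can be illustrated with a minimal sketch. The abstract does not give the actual weights or the exact scoring formula, so the values in `ELEMENT_WEIGHTS`, the element key names, and the function `estimate_precision` are all hypothetical. The sketch assumes the score is the total weight of the description elements found in the extraction result divided by the total weight of all elements.

```python
# Hypothetical integer weights (percent) for each description element.
# The abstract names the elements but does not state their weights.
ELEMENT_WEIGHTS = {
    "title": 25,
    "visit_count": 10,
    "main_body": 40,
    "multimedia": 10,
    "comments": 15,
}


def estimate_precision(extracted, weights=ELEMENT_WEIGHTS):
    """Return the weighted share of description elements present in `extracted`.

    `extracted` maps element names to extracted text. An element counts as
    present only if its text is non-empty after stripping whitespace.
    This is an assumed scoring rule, not the thesis's published formula.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    found = sum(
        weight
        for name, weight in weights.items()
        if (extracted.get(name) or "").strip()
    )
    return found / total


if __name__ == "__main__":
    # A page where only the title and main body were extracted.
    page = {"title": "Campus news", "main_body": "Article text", "comments": ""}
    print(estimate_precision(page))  # 65 / 100 = 0.65
```

Under this assumed rule, a page missing a heavily weighted element such as the main body scores much lower than one missing only the visit count, which matches the idea of weighting elements by importance.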
Keywords/Search Tags: Information Extraction, DOM Analysis, Visual Features, Organized Information Description