
Research And Design Of Link Information Extraction System Based On Path Synopsis

Posted on: 2018-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Wu
Full Text: PDF
GTID: 2348330563452570
Subject: Computer Science and Technology
Abstract/Summary:
In the information age, a large amount of information circulates on the Internet. Web information extraction systems are built to meet the collection needs of those interested in this information; they form a broad class of software applications aimed at extracting data from web sources. A web data extraction system usually interacts with a web source and extracts the data stored in it: for instance, if the source is an HTML web page, the extracted content could consist of elements in the page as well as the full text of the page itself. The extracted data may then be post-processed, converted into the most convenient structured format, and stored for further use.

Extracting link information pages is an important application scenario of web information extraction. Link information refers to a two-tuple of (title, link), and a link information page is a web page used to display a list of similar link information. Link information pages are extremely common on the Internet: websites that publish large amounts of information, such as news sites, online communities, movie information sites, and vertical information publishing sites, use link information pages as indexes to their information detail pages. The demand for collecting information from these sites has always been widespread, and link extraction is the key technique for meeting it. However, the conventional method, which extracts link information with regular expressions, requires the participation of professional staff, and its low efficiency has become a productivity bottleneck.

In this paper, we first formulate an abstraction of the link extraction problem. Then, based on the convention that information with similar meaning is presented with the same structure and style on a website, we propose a general solution that works in the reverse direction, using structure and style to locate the information by means of a tree-like data structure called a path synopsis. We then return to the specific problem: by applying this general solution to link extraction scenarios, we design a new link extraction system. By combining manual extraction and machine extraction, the system achieves a balance between accuracy and automation. For the manual part, we implement a GUI-based extractor that is easy to use and improves the efficiency of manual extraction. For the machine part, link information extraction is transformed into a binary classification problem, which makes automatic extraction possible. In link extraction scenarios, our system has a very low learning cost, high manual extraction efficiency, and reliable automatic extraction performance, which can significantly improve productivity in this area.
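The idea of locating link information by structure and style can be illustrated with a minimal Python sketch. It is not the system described in the abstract: it assumes that a path synopsis can be approximated by grouping every anchor under its root-to-`<a>` tag path (with class names standing in for style), and it flattens the tree into a dictionary keyed by path strings. The names `PathSynopsisBuilder`, `build_path_synopsis`, and `largest_group`, as well as the sample page, are hypothetical, and picking the largest group is only a stand-in heuristic for the binary classifier mentioned above.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathSynopsisBuilder(HTMLParser):
    """Groups (title, link) pairs by the tag path leading to each <a> element."""

    def __init__(self):
        super().__init__()
        self.stack = []                   # (tag, step) pairs from the root down
        self.groups = defaultdict(list)   # path string -> list of (title, link)
        self._href = None                 # href of the anchor being read, if any
        self._path = None                 # path of the anchor being read
        self._text = []                   # text fragments inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        cls = attrs.get("class")
        # A path step is the tag name plus its classes (assumed proxy for style).
        step = tag + "." + ".".join(cls.split()) if cls else tag
        self.stack.append((tag, step))
        if tag == "a" and attrs.get("href") and self._href is None:
            self._href = attrs["href"]
            self._path = "/".join(s for _, s in self.stack)
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            self.groups[self._path].append((title, self._href))
            self._href = None
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def build_path_synopsis(html):
    """Return a dict mapping each anchor path to its (title, link) pairs."""
    builder = PathSynopsisBuilder()
    builder.feed(html)
    builder.close()
    return dict(builder.groups)


def largest_group(synopsis):
    """Stand-in heuristic: treat the most populated path as the link list."""
    return max(synopsis.items(), key=lambda item: len(item[1]))


SAMPLE = """
<html><body>
<div class="nav"><a href="/">Home</a></div>
<ul class="news">
  <li><a href="/n/1">First story</a></li>
  <li><a href="/n/2">Second story</a></li>
  <li><a href="/n/3">Third story</a></li>
</ul>
</body></html>
"""

if __name__ == "__main__":
    path, links = largest_group(build_path_synopsis(SAMPLE))
    print(path)
    for title, link in links:
        print(title, link)
```

In this sketch the navigation link and the news links fall under different paths, so selecting one path selects all structurally similar links at once. A real system would presumably let a user or a trained classifier choose the path rather than rely on group size.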
Keywords/Search Tags: Information Retrieval System, Link Extraction, Path Synopsis