
Research on the Technology of Large-Scale Information Extraction

Posted on: 2014-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
Full Text: PDF
GTID: 2248330398475381
Subject: Computer application technology
Abstract/Summary:
Nowadays, the Internet has become a huge open knowledge base containing vast amounts of information, and people increasingly rely on it to obtain information. However, the formats of that information are highly heterogeneous, and the Internet is also full of spam. How to automatically extract accurate information from the Internet has therefore become an important research problem. This thesis studies the key technologies of Internet-based information extraction, which consist of three parts: the collection and collation of large-scale web pages, webpage content extraction, and text information extraction.

For webpage collection and collation, the main task is to build a large-scale web-page library to serve as the data source for information extraction. The thesis uses a web crawler to collect pages at large scale and filters the collected pages by judging page importance with a link analysis algorithm. After comparing the performance of the HITS and PageRank algorithms, PageRank is chosen as the link analysis algorithm. Because a single machine lacks the processing power to handle web pages at this scale, the thesis implements PageRank on the Hadoop platform.

For web information extraction, after analyzing the deficiencies of current Web information extraction methods and their causes, the thesis proposes a webpage information extraction method based on heterogeneous features. Compared with methods that use only a few features, this method selects more characteristics of webpage content and so adapts more effectively to a wide variety of pages. Experimental results show that the extraction precision of this method is high and satisfies the practical requirements of webpage text extraction.

For text information extraction, the thesis surveys the related literature and the commonly used methods, and summarizes their performance.
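The PageRank-based page filtering described above can be illustrated with a minimal single-machine sketch (the toy graph, damping factor, and iteration count here are illustrative assumptions; the thesis's actual implementation runs at scale on Hadoop):

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping page -> list of outbound pages.
    Returns a dict of page -> rank, summing to 1."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page receives the random-jump share
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                # a page splits its rank evenly among its outlinks
                share = damping * rank[page] / len(outs)
                for out in outs:
                    new[out] += share
            else:
                # dangling page: redistribute its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In a MapReduce setting, the inner loop becomes a map step (each page emits `rank/len(outs)` to its outlinks) and a reduce step (each page sums the shares it received), iterated once per Hadoop job.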
To extract the gist of a text, the thesis combines shallow parsing with centering theory and proposes a shallow syntactic parsing method based on merging words. Compared with other phrase-identification methods, this method adapts better to identifying sentence structure by merging words and simplifying sentence elements. Experiments with the shallow parsing algorithm, based on both rules and statistics, show that the predicate identification precision of the proposed method is high, and further experiments confirm that the method is effective for sentence identification.

Through the work above, the thesis collects large-scale web pages, extracts their content, and finally extracts the gist of the page information. The results achieve the expected goals.
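The merging-words idea can be illustrated with a toy rule-based chunker: adjacent units whose tags match a rule are merged into a single phrase until no rule applies, simplifying the sentence into a few top-level elements. The rules and POS tags below are hypothetical stand-ins; the thesis's actual rule set and statistical model are not given in the abstract.

```python
# Hypothetical merge rules: (left tag, right tag) -> merged phrase tag.
MERGE_RULES = {
    ("DT", "NN"): "NP",  # determiner + noun  -> noun phrase
    ("JJ", "NN"): "NP",  # adjective + noun   -> noun phrase
    ("NP", "NN"): "NP",  # noun phrase + noun -> larger noun phrase
    ("VB", "NP"): "VP",  # verb + noun phrase -> verb phrase
}

def merge_words(tagged):
    """tagged: list of (word, tag) pairs. Repeatedly merge the leftmost
    adjacent pair that matches a rule until no rule applies."""
    chunks = list(tagged)
    changed = True
    while changed:
        changed = False
        for i in range(len(chunks) - 1):
            key = (chunks[i][1], chunks[i + 1][1])
            if key in MERGE_RULES:
                merged = (chunks[i][0] + " " + chunks[i + 1][0],
                          MERGE_RULES[key])
                chunks = chunks[:i] + [merged] + chunks[i + 2:]
                changed = True
                break
    return chunks

print(merge_words([("the", "DT"), ("dog", "NN"),
                   ("ate", "VB"), ("a", "DT"), ("bone", "NN")]))
```

After merging, the sentence collapses to a subject noun phrase and a predicate verb phrase, which makes identifying the predicate and the overall sentence structure straightforward.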
Keywords/Search Tags: large-scale web page, information extraction, shallow syntax parsing, text features