The Personal Information Extraction Based On Webpage Understanding

Posted on: 2013-08-08
Degree: Master
Type: Thesis
Country: China
Candidate: D S Hao
GTID: 2248330371985841
Subject: Software engineering

Abstract/Summary:
As the Internet becomes more widespread, people increasingly turn to it when they want to know something in daily life, especially information about a person they are interested in. Public figures in particular attract this kind of interest. However, with the explosive growth of information on the Internet, it is increasingly difficult for people to find information that meets their needs among the many kinds of content available. How to obtain satisfactory search results and extraction quality has therefore become a hot issue in Internet information extraction research. Some personal information extraction systems already exist, such as Microsoft's "Cube" and Yahoo's "Chinese People Search". However, the accuracy of the information these systems extract is not very high.

To obtain highly accurate personal information, this paper presents a personal information extraction method based on complete understanding of the webpage.

The Internet is flooded with disorganized, worthless, and sometimes even false information, which degrades extraction results. Because personal information on personal homepages and resumes is highly reliable, this paper uses focused crawling and ontology techniques to crawl such pages automatically and takes them as the information source. Restricting extraction to this source should improve extraction accuracy substantially.

The extraction method based on complete webpage understanding consists of three steps. First, a webpage semantic segmentation algorithm splits the webpage into a number of semantic blocks.
Second, the vector space model (VSM) is used to calculate the relevance between each semantic block and the theme of personal information, which identifies the semantic blocks relevant to personal information. Finally, personal information is extracted from these relevant semantic blocks.

The webpage segmentation algorithm used in this paper is the Microsoft webpage segmentation algorithm. To compute the relevance between a semantic block and the theme of personal information, we construct a vector for the semantic block and a vector for personal information; the TF-IDF method is used to weight the semantic features of the block and the keywords of personal information. To process a semantic block, we first use HTML Parser to extract its links and plain text, and then extract personal information from the plain text. During extraction, we first extract the name of the person whom the webpage as a whole describes. We then use a recognition algorithm based on role tagging to extract all names in the webpage. Finally, we use a trigger vocabulary library of personal attribute terms to extract the other attributes.

The paper uses two methods to extract the name of the person the webpage describes: one uses rules, the other uses a trigger vocabulary. The rule method: for a link whose label is itself a personal name, we match the link's complete URL against the URL of the webpage. If they match, the link label is the name of the person the webpage describes; otherwise we fall back to the trigger vocabulary to extract the name.
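The rule method just described can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `LinkCollector`, `normalize`, and `name_by_rule` are hypothetical, the URL normalization (dropping the fragment and a trailing slash) is an assumption, and the sketch does not check that a link label actually is a personal name, which the thesis requires.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects (href, label) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, label) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse internal whitespace in the link label.
            label = " ".join("".join(self._text).split())
            self.links.append((self._href, label))
            self._href = None


def normalize(url):
    """Assumed normalization: drop the fragment and any trailing slash."""
    return urldefrag(url).url.rstrip("/")


def name_by_rule(page_url, html):
    """Return the label of a link that points back to the page itself, else None."""
    collector = LinkCollector()
    collector.feed(html)
    target = normalize(page_url)
    for href, label in collector.links:
        # Resolve relative links against the page URL before comparing.
        if label and normalize(urljoin(page_url, href)) == target:
            return label
    return None  # the caller would fall back to the trigger vocabulary
```

A caller would try `name_by_rule(page_url, html)` first and use the trigger vocabulary only when it returns `None`.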
The trigger vocabulary method: we build a trigger vocabulary for personal names from the salient features that accompany names on personal homepages and resumes, and then use the trigger words to extract names.

Finally, the paper randomly selected 300 webpages from the personal homepages and resumes, extracted the personal information from them, and statistically analyzed the experimental results.
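The relevance step of the method (TF-IDF weights in a vector space model) could be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis's code: the abstract does not name the similarity measure, so cosine similarity is assumed; the tokenizer, the smoothed IDF formula, the 0.1 threshold, and all function names are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens; Chinese text would need a word segmenter.
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption) so terms found in every document keep some weight.
        vectors.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def relevant_blocks(blocks, theme_keywords, threshold=0.1):
    """Return (score, block) pairs whose relevance to the theme meets the threshold."""
    # The theme keyword list is treated as one extra document.
    docs = [tokenize(b) for b in blocks] + [tokenize(" ".join(theme_keywords))]
    *block_vecs, theme_vec = tfidf_vectors(docs)
    scored = [(cosine(vec, theme_vec), block) for vec, block in zip(block_vecs, blocks)]
    return [(s, b) for s, b in scored if s >= threshold]
```

Blocks that share no terms with the theme keywords score 0 and are dropped; in practice the threshold would need tuning on real pages.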
Keywords/Search Tags: webpage understanding, webpage segmentation, personal information extraction, VSM