
Research On Large-Scale Chinese People Information Extraction Based On Web

Posted on: 2014-02-28    Degree: Master    Type: Thesis
Country: China    Candidate: W T Hu    Full Text: PDF
GTID: 2248330398475320    Subject: Computer system architecture
Abstract/Summary:
Currently, people increasingly rely on the Internet to retrieve information, and information about people is an important part of it. The aim of this thesis is to extract as much information about notable people as possible. The result can serve as the knowledge base of a people search engine and as part of the knowledge base of a semantic search engine. There is a vast amount of personal information on the Web, but its formats are varied and complex, and the Internet is full of spam. Automatically and accurately extracting information from the Web therefore faces many difficulties. This thesis proposes a complete pipeline for personal information extraction, consisting of page downloading, webpage content extraction, word segmentation, and structured personal-information extraction.

Firstly, this thesis introduces the data collection process, describing how Web data sources are selected and how pages are downloaded. Downloading pages is more difficult than in the past, because some websites take a variety of measures against crawlers, such as limiting the access frequency of a single IP address. The author implemented the downloading program with three download modes: ordinary download, proxy-based download, and dynamic Web data download.

Then, the main content of each page is extracted. This thesis surveys related research on content extraction and adopts an extraction method based on statistics and the DOM. For each container tag, it computes the content length, the number of links, and the number of end punctuation marks, and uses their ratios to judge whether the tag contains the main content.

The next step is word segmentation. Common segmentation systems perform poorly on named entity recognition, so they are not well suited to knowledge extraction and natural language processing. The segmentation system of Southwest Jiaotong University performs better than the other systems on entity recognition, and an organization name recognition algorithm is implemented in this thesis. The recognition algorithm is based on word frequency statistics, and its training data mainly comes from Baidu encyclopedia entries. During training, organization names are split into words and the frequency of each word is computed. On the basis of these word frequencies, the thesis establishes a mathematical model and implements the organization name recognition algorithm.

Finally, the most critical step is extracting structured personal information. Personal information on the Web is usually semi-structured or unstructured; in this step it is extracted from pages and saved as structured information. The method for extracting semi-structured information is simple and effective: the algorithm matches the text against an attribute dictionary and then extracts the attribute values directly through simple rules. For unstructured information, this thesis proposes a rule-based extraction algorithm. A dictionary of trigger words and a set of rules must be established for the extraction process; the trigger-word dictionary covers the basic personal attributes and their trigger words, and hand-crafted rules are used to extract the attribute values.
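As a rough illustration of the downloading step, the sketch below fetches a page either directly or through a proxy, which is one common way to work around per-IP rate limits; the library (requests), the User-Agent string, and the timeout are assumptions for illustration, not the thesis's actual downloader, and dynamic Web data would additionally require a JavaScript-capable client.

```python
# Illustrative page download with an optional proxy (parameters are example assumptions).
import requests

def download_page(url, proxy=None, timeout=10):
    """Fetch a page, optionally through an HTTP proxy, to work around per-IP rate limits."""
    proxies = {"http": proxy, "https": proxy} if proxy else None
    headers = {"User-Agent": "Mozilla/5.0"}   # plain browser-like UA string
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
    resp.raise_for_status()
    resp.encoding = resp.apparent_encoding    # handle GBK/GB2312 pages correctly
    return resp.text
```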
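To make the statistics-and-DOM content extraction step concrete, the following sketch scores each container tag by text length, link density, and end-punctuation count, in the spirit of the method described above; the use of BeautifulSoup, the candidate tag list, and the thresholds are illustrative assumptions, not the thesis's actual implementation.

```python
# Illustrative sketch of statistics-plus-DOM content extraction (not the thesis code).
from bs4 import BeautifulSoup

END_PUNCT = set("。！？.!?")              # end-of-sentence punctuation marks

def extract_main_content(html,
                         min_length=80,      # minimum text length for a content block
                         max_link_ratio=0.3, # anchor text may cover at most 30% of the block
                         min_punct=2):       # a content block should end several sentences
    soup = BeautifulSoup(html, "html.parser")
    best_text, best_score = "", 0.0
    # Examine every candidate container tag.
    for tag in soup.find_all(["div", "td", "p", "article", "section"]):
        text = tag.get_text(separator=" ", strip=True)
        if len(text) < min_length:
            continue
        link_text_len = sum(len(a.get_text(strip=True)) for a in tag.find_all("a"))
        link_ratio = link_text_len / len(text)          # share of text that is anchor text
        punct_count = sum(text.count(p) for p in END_PUNCT)
        if link_ratio > max_link_ratio or punct_count < min_punct:
            continue
        # Prefer long blocks that contain little link text.
        score = len(text) * (1.0 - link_ratio)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```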
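The organization-name recognition step can be sketched as follows: organization names from training data (such as encyclopedia entries) are split into component words, word frequencies are counted, and a candidate string is scored by how frequently its words occur in known organization names. The scoring formula, the placeholder segmenter, and the example data below are simplified assumptions for illustration; the thesis builds its own mathematical model on the same kind of frequency statistics.

```python
# Simplified illustration of frequency-based organization-name scoring (not the thesis model).
from collections import Counter

def train_word_frequencies(org_names, segment):
    """Count how often each word appears inside known organization names."""
    freq = Counter()
    for name in org_names:
        freq.update(segment(name))          # segment() is the word segmenter
    return freq

def org_name_score(candidate, freq, segment):
    """Average relative frequency of the candidate's words in organization names."""
    words = segment(candidate)
    if not words or not freq:
        return 0.0
    total = sum(freq.values())
    return sum(freq[w] / total for w in words) / len(words)

# Hypothetical usage with a trivial character-based segmenter as a stand-in:
if __name__ == "__main__":
    segment = lambda s: list(s)             # placeholder for a real segmenter
    freq = train_word_frequencies(["西南交通大学", "北京大学"], segment)
    print(org_name_score("交通大学", freq, segment))
```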
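For semi-structured personal information, such as "attribute: value" lines in an infobox-like block, the dictionary-plus-rule idea can be sketched roughly as below; the attribute dictionary and the colon-based rule are illustrative examples, not the exact dictionary or rules used in the thesis.

```python
# Illustrative sketch of semi-structured attribute extraction (example dictionary and rule).
import re

# Example attribute dictionary: surface forms mapped to canonical attribute names.
ATTRIBUTE_DICT = {
    "出生日期": "birth_date", "Date of birth": "birth_date",
    "国籍": "nationality",   "Nationality": "nationality",
    "职业": "occupation",    "Occupation": "occupation",
}

def extract_semi_structured(lines):
    """Match each line against the attribute dictionary and take the value after the separator."""
    person = {}
    for line in lines:
        for surface, attr in ATTRIBUTE_DICT.items():
            # Simple rule: "<attribute> [:：] <value>" on a single line.
            m = re.match(rf"\s*{re.escape(surface)}\s*[:：]\s*(.+)", line)
            if m:
                person[attr] = m.group(1).strip()
                break
    return person

# Hypothetical usage:
print(extract_semi_structured(["出生日期：1965年9月", "Nationality: China"]))
```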
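The rule-based extraction of unstructured information relies on a trigger-word dictionary and hand-written patterns anchored on those triggers; the two example rules below, for birthplace and alma mater in running Chinese text, are assumptions for illustration only and do not reproduce the thesis's rule set.

```python
# Illustrative trigger-word rules for unstructured text (example patterns, not the thesis rules).
import re

# Trigger-word dictionary: basic personal attributes with example trigger-word patterns.
TRIGGER_RULES = {
    "birth_place": re.compile(r"出生于([\u4e00-\u9fa5]{2,10})"),   # "born in <place>"
    "alma_mater":  re.compile(r"毕业于([\u4e00-\u9fa5]{2,15})"),   # "graduated from <school>"
}

def extract_unstructured(sentence):
    """Apply each trigger-word rule to the sentence and collect matched attribute values."""
    result = {}
    for attr, pattern in TRIGGER_RULES.items():
        m = pattern.search(sentence)
        if m:
            result[attr] = m.group(1)
    return result

# Hypothetical usage:
print(extract_unstructured("他1970年出生于四川成都，1992年毕业于西南交通大学。"))
```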
Keywords/Search Tags: Information extraction, structuring, word segmentation, word frequency statistics, content extraction