Research On Character Attributes Extraction Based On Rules From Baidu Encyclopedia

Posted on: 2014-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: H L Li
GTID: 2248330398975971
Subject: Computer software and theory
Abstract/Summary:
As the Internet becomes integrated into people's lives, its scale is expanding rapidly and the amount of information on it is growing geometrically. An important line of research focuses on obtaining useful information from huge amounts of text and converting it into structured data that computers can read and process. Text information extraction is a natural language processing task that automatically extracts specific information (entities, entity relations, etc.) from texts and fills it into database slots for user queries or for further analysis and processing by computer. Character attribute extraction is one task within entity relation extraction. This thesis studies character attribute extraction from Baidu Encyclopedia. The following questions are addressed according to practical needs.

First, the author designs a multi-threaded web spider to download Baidu Encyclopedia pages. After analyzing the features of the pages, the author parses them using regular expressions.

Second, each page contains open categories, also called social tags or folksonomy. Observation and analysis show that there are 112 open categories in the characters domain. Pages containing open categories from the characters domain are treated as character texts, and 218,171 such texts are selected.

Third, character attribute extraction based on trigger words is studied. The trigger word set is built through online collection and linguistic analysis. Experimental results indicate that the approach is feasible.

Fourth, a method for obtaining rules automatically is presented. It uses the part-of-speech tag of each attribute value to locate the value in the encyclopedia free text. Candidate rules are discovered from statistics on the words that appear before and after the tagged value. The extraction rules are then obtained by computation over the candidate rules, and finally the character attributes are extracted by rule matching. Experiments show that this method is feasible and effective.

Finally, a character attribute extraction system is implemented. Its functions include text collection, data preprocessing, and character attribute extraction.
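
The automatic rule acquisition step can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and it rests on several assumptions. The attribute values are supplied as a known set instead of being located by part-of-speech tags. Tokens are split on whitespace in English toy sentences instead of segmented Chinese text. A simple support threshold (the hypothetical `min_support` parameter) stands in for the unspecified computation over candidate rules. The function names `induce_rules` and `apply_rules` are illustrative only.

```python
import re
from collections import Counter


def induce_rules(sentences, known_values, min_support=2):
    """Collect (two words before, one word after) contexts around known values.

    A context becomes a rule when it occurs at least `min_support` times.
    """
    contexts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            # Need two tokens on the left and one on the right of the value.
            if token in known_values and 2 <= i < len(tokens) - 1:
                before = " ".join(tokens[i - 2:i])
                after = tokens[i + 1]
                contexts[(before, after)] += 1
    return [ctx for ctx, count in contexts.items() if count >= min_support]


def apply_rules(rules, sentence):
    """Extract every token that sits between a rule's left and right context."""
    found = []
    for before, after in rules:
        pattern = re.compile(
            r"\b" + re.escape(before) + r"\s+(\S+)\s+" + re.escape(after) + r"\b"
        )
        found.extend(pattern.findall(sentence))
    return found


# Toy training sentences with already-known birth years (assumed input).
training = [
    "Li Bai was born in 701 and died in 762",
    "Du Fu was born in 712 and died in 770",
    "Su Shi was born in 1037 and died in 1101",
]
known_birth_years = {"701", "712", "1037"}

rules = induce_rules(training, known_birth_years)
print(rules)  # [('born in', 'and')]
print(apply_rules(rules, "Bai Juyi was born in 772 and died in 846"))  # ['772']
```

The sketch keeps the two stages separate, so a different scoring scheme for candidate rules could replace the support threshold without changing the matching stage.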
Keywords/Search Tags: information extraction, rule acquisition, free text, character attributes