Research On Character Attributes Extraction Based On Rules From Baidu Encyclopedia

Posted on:2014-02-27

Degree:Master

Type:Thesis

Country:China

Candidate:H L Li

Full Text:PDF

GTID:2248330398975971

Subject:Computer software and theory

Abstract/Summary:

As the Internet becomes integrated into people's lives, its scale is expanding rapidly and the amount of information it holds is growing geometrically. An important line of research therefore focuses on obtaining useful information from huge amounts of text and converting it into structured data that computers can read and process. Text information extraction is a natural language processing task that automatically extracts specific information (entities, entity relations, etc.) from a text or a collection of texts and fills it into database slots for user queries or for further analysis and processing by computer. Character attribute extraction is one task of entity relation extraction. This thesis studies character attribute extraction from Baidu Encyclopedia and addresses the following questions according to actual needs.

First, the author designs a multi-threaded web spider to download Baidu Encyclopedia pages. After analyzing the features of the pages, the author parses them with regular expressions.

Second, each page contains open categories, which are also called social tags or folksonomy. Observation and analysis show that there are 112 open categories in the characters domain. Pages containing open categories from the characters domain are treated as character texts, and 218,171 character texts are selected.

Third, character attribute extraction based on trigger words is studied. The trigger word set is built through online collection and linguistic analysis. The experimental results indicate that the approach is feasible.

Fourth, the thesis presents a method for obtaining extraction rules automatically. It uses the part-of-speech tag of each attribute value to locate the value in the encyclopedia free text. Candidate rules are discovered from statistics on the words that appear before and after the tagged value. The extraction rules are then obtained by mathematical computation on the candidate rules, and finally the character attributes are extracted by matching the text against these rules. Experiments show that this method is feasible and effective.

Finally, a character attribute extraction system is implemented. Its functions include text collection, data preprocessing, and character attribute extraction.
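
The trigger-word approach mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the lexicon entries, the attribute names, the English sample sentence, and the helper names build_rules and extract_attributes are all assumptions made for illustration, and the thesis itself works on Chinese text from Baidu Encyclopedia.

```python
import re

# Hypothetical trigger-word lexicon (attribute name -> trigger phrases).
# The thesis builds its lexicon by online collection and linguistic
# analysis; these English entries are illustrative assumptions only.
TRIGGERS = {
    "birth_date": ["was born on", "born on"],
    "birth_place": ["was born in", "born in"],
    "alma_mater": ["graduated from"],
}


def build_rules(triggers):
    """Compile one regular expression per attribute.

    Each rule matches a trigger phrase followed by the attribute value.
    """
    rules = {}
    for attr, phrases in triggers.items():
        # Try longer phrases first so the most specific trigger wins.
        ordered = sorted(phrases, key=len, reverse=True)
        alternation = "|".join(re.escape(p) for p in ordered)
        # The value runs up to the next comma, period, " and ", or end of text.
        rules[attr] = re.compile(
            rf"(?:{alternation})\s+(?P<value>.+?)(?=,|\.|\s+and\s+|$)"
        )
    return rules


def extract_attributes(text, rules):
    """Return the first value found for each attribute that has a match."""
    result = {}
    for attr, pattern in rules.items():
        match = pattern.search(text)
        if match:
            result[attr] = match.group("value").strip()
    return result


sample = (
    "Li Ming was born in Chengdu and graduated from Sichuan University. "
    "He was born on 12 May 1970."
)
print(extract_attributes(sample, build_rules(TRIGGERS)))
```

In this sketch each rule is simply a trigger phrase followed by a lazily matched value. The automatic method described in the abstract would instead derive such rules from the words observed around known attribute values.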

Keywords/Search Tags:

information extraction, rule acquisition, free text, character attributes

Related items

1	Research On Rule-based Extraction Of Mongolian Character Attributes
2	Text Information Extraction In Colorful Scene Image
3	A Research Of Multi-source Character Attributes Data Fusion
4	Research And System Realization Of Key Technology Of Information Extraction Optimization
5	Research And Implementation On Chinese Web Pages-Oriented Information Extraction Technologies
6	Web Free Text Information Extraction Based On TABLE Layout And Hidden Markov Model
7	The Research Of Web Information Extraction Technique And Application Based On NFA Regular Matching
8	The Research And Implementation Of Web Information Extraction System Based On The Regular Expression
9	Research On Character Recognition Algorithm Based On Regular Extreme Learning Machine
10	The Application And Research Of Regular Expression In Webpage Extraction