Research And Realization Of Web Information Extraction For Specific Field

Posted on: 2017-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
GTID: 2348330518495956
Subject: Information and Communication Engineering
Abstract/Summary:
The Internet has brought users a huge amount of information, usually presented in a formatted form. However, users' demand for customized information increases as electronic devices offer more and more ways to present it, and the format of the results returned by a server need not stick to the original one. To satisfy these requirements and present more precise and personalized information, a robust and flexible information extraction system is essential. This thesis studies extraction algorithms at the page level and the record level of a Web information extraction system. At the page level, a general content extraction algorithm based on the Semantic Web is proposed. It combines structural and semantic features, calculating content relevance from the words-leaves ratio and the semantic weight. The algorithm proves to perform well on both old and modern pages. At the record level, this thesis proposes a named entity recognition method for the tendering field. We use a character-based tagging method to tag the training corpus and thesaurus. Combining web structure features and context features, we recognize named entities in the content with a CRF model. The experimental results show that recognition of Chinese person names performs much better than recognition of organization names. We applied these two algorithms in an information integration system and adopted heuristic rules to improve the low recall for organization names. This thesis also implements the association of properties with entities based on tendering content features and saves each pair of property and entity value to the database, which satisfies the requirements of the application.
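
The character-based tagging step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a BIO tag scheme and entity spans given as (start, end, label) offsets; the abstract specifies neither, so the function name `tag_characters`, the `ORG` label, and the sample string are all hypothetical. Sequences of this kind are what a CRF model is typically trained on.

```python
def tag_characters(text, entities):
    """Assign one BIO tag per character (assumed scheme, not taken from the thesis).

    entities: list of (start, end, label) tuples, with end exclusive.
    Returns a list of (character, tag) pairs.
    """
    tags = ["O"] * len(text)
    for start, end, label in entities:
        tags[start] = "B-" + label          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining characters of the entity
    return list(zip(text, tags))


# Hypothetical tendering snippet: characters 4..9 form an organization name.
example = tag_characters("招标人:北京邮电大学", [(4, 10, "ORG")])
for char, tag in example:
    print(char, tag)
```

Tagging at the character level avoids depending on a Chinese word segmenter, whose errors would otherwise propagate into entity boundaries.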
Keywords/Search Tags: web content extraction, named entity recognition, semantic web, conditional random fields