Research and Application of Semi-Structured Data Extraction

Posted on: 2016-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: C Z Zhang
GTID: 2428330491960037
Subject: Computer Science and Technology
Abstract/Summary:
As more traditional industries move onto the Internet, information technology is penetrating vertical domains ever more deeply. With growing data integration, information diversification, and vertical integration, general-purpose search engines can no longer satisfy users' needs within a specific vertical domain. These users urgently need a search tool with finer-grained data classification, more comprehensive integration, and greater domain expertise. Such a tool obtains raw data from the Internet according to a defined strategy, interprets it and extracts information from it, and then provides search and navigation services to users through indexing and association. Vertical search engines emerged in response to this demand. Their key technologies include crawling, extraction, and indexing. Vertical domains contain large amounts of semi-structured text data, which tends to carry strong domain knowledge, appear in a variety of formats, and have irregular structure. Semi-structured text in TXT format is particularly difficult because, unlike HTML documents, it lacks markup information and fixed delimiters. Extracting the useful information contained in such semi-structured data is therefore a challenging task. This thesis studies techniques for processing semi-structured text. Drawing on the characteristics of semi-structured data in the pigeon vertical domain and on an analysis of the domain knowledge, it establishes a rule base. It then studies conditional random fields (CRFs) and a semi-automated labeling scheme to address the entity recognition problems that arise in designing a mixed extraction process. Finally, it combines rules and statistics to implement the extraction system of a pigeon vertical search engine, which achieved good results.
Keywords/Search Tags: semi-structured text, CRF, mixed extraction, entity recognition
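
The combination of rules and statistics described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. The field names (`ring_number`, `race_date`, `owner`), the regular expressions, and the sample line are all hypothetical, and `toy_tagger` is a hand-written stand-in for a trained CRF, which is not reproduced here. The sketch shows only the two-pass idea: high-precision rules run first, and a statistical sequence labeler handles the text the rules leave uncovered.

```python
import re
from typing import Callable, Dict, List

# Hypothetical rule base: field name -> regex. The thesis's actual rules are not given.
RULE_BASE: Dict[str, "re.Pattern[str]"] = {
    "ring_number": re.compile(r"\b\d{4}-\d{2}-\d{6}\b"),
    "race_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

# A tagger maps a token list to one label per token ("O" means no entity).
Tagger = Callable[[List[str]], List[str]]


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Per-token features of the kind a CRF tagger typically consumes."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def extract(text: str, tagger: Tagger) -> Dict[str, List[str]]:
    """Two-pass extraction: rules first, then a statistical tagger on the rest."""
    result: Dict[str, List[str]] = {}
    remainder = text
    # Pass 1: apply each rule and blank out what it matched.
    for field, pattern in RULE_BASE.items():
        matches = pattern.findall(remainder)
        if matches:
            result[field] = matches
            remainder = pattern.sub(" ", remainder)
    # Pass 2: label the remaining tokens with the supplied tagger.
    tokens = remainder.split()
    for tok, label in zip(tokens, tagger(tokens)):
        if label != "O":
            result.setdefault(label, []).append(tok)
    return result


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stand-in for a trained CRF: tags Title-case tokens after the cue 'Owner:'."""
    labels: List[str] = []
    for i in range(len(tokens)):
        feats = token_features(tokens, i)
        starts = feats["prev"] == "owner:" and feats["is_title"]
        continues = i > 0 and labels[i - 1] == "owner" and feats["is_title"]
        labels.append("owner" if starts or continues else "O")
    return labels


if __name__ == "__main__":
    line = "Ring: 2015-01-123456 Date: 2015-10-03 Owner: Zhang Wei north loft"
    print(extract(line, toy_tagger))
```

In a real system, `toy_tagger` would be replaced by a CRF trained on labeled text, with `token_features` (or a richer variant) supplying its input features. Running the rules first keeps predictable fields out of the statistical model's way.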