
Visual Web Page Information Extraction And Text Feature Word Extraction Technology Research

Posted on: 2014-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Zhang
GTID: 2358330512462787
Subject: Computer system architecture
Abstract/Summary:
With its massive and growing resources and its open nature, the Internet is an important source of information for many industries and information systems. How to obtain valuable information accurately from this huge repository has become one of the key research problems for information and intelligence analysis systems and for decision-making systems. Because the quality of the text crawled from web pages directly affects the accuracy of subsequent information processing and decision-making, extracting information efficiently and accurately, assessing its quality, and classifying the crawled pages can both improve the working efficiency of information processing personnel and enhance the practical value of intelligence analysis and decision-making systems.

This thesis is supported by the infrastructure construction project "Yunnan competitive intelligence public service platform and service architecture construction" of the Yunnan Provincial Department of Science and Technology, and by the Business Intelligence Competitive Intelligence Management System project, which is funded by the technology innovation fund for small and medium-sized enterprises of the Department of Science and Technology of China. The work starts from the practical application requirements of these projects while also pursuing theoretical innovation. It briefly analyzes the current state of research, in China and abroad, on competitive intelligence systems, visual web information extraction systems, and text feature word extraction systems. It then designs and implements a Visual Web Page Information Extraction System, studies text feature word extraction using a TF*IDF measure improved by adding the word's part of speech, and evaluates and verifies the feasibility and accuracy of the algorithms in the Text Feature Word Extraction System.

The visual web information extraction system was designed and implemented from the practical application requirements of the project and with user-friendly operation in mind. Improving on the traditional template-based approach, the thesis proposes a web information extraction scheme that combines extraction rules with extraction templates. The extraction rules and template for each target item are obtained in a visual operating environment, and extraction is organized by site module. To parse the page text of one module of a specific site, the system first selects the extraction to apply to the page using the corresponding page classification rules, then extracts the text of each target item using the corresponding extraction rules, and finally encapsulates the text of each target item into a standard text document. The experimental results indicate that this scheme makes it easy and user-friendly to generate web information extraction rules and templates, and that it achieves good extraction accuracy and recall.

The thesis also studies text feature word extraction. A part-of-speech tagging step is introduced when the IKAnalyzer tokenizer segments the text: the part of speech of each entry is marked using the Sogou entry library and a candidate entry library with the same structure, and the TF and IDF of each entry are counted in the same process. The weight assigned to each part of speech is then adjusted, the weight of each entry is calculated with the formula W=TF*IDF*ATTR, the entry set is sorted in non-descending order, and the entries with the largest weights are selected as the feature entry set for the specific site or industry. In experiments that used the KNN classification algorithm to validate the approach on actual text classification, good classification results and a good average F1 value were obtained.
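The weighting formula W=TF*IDF*ATTR can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `feature_words`, the part-of-speech weights in `POS_WEIGHT`, the smoothed IDF variant, and the toy English tokens are all assumptions made for illustration, and the sketch takes already-tokenized, already-tagged input rather than calling IKAnalyzer or the Sogou entry library.

```python
import math
from collections import Counter

# Hypothetical ATTR values per part of speech; the abstract does not state the real ones.
POS_WEIGHT = {"noun": 1.0, "verb": 0.6, "adjective": 0.4}
DEFAULT_POS_WEIGHT = 0.2  # assumed fallback for untagged or other parts of speech


def feature_words(docs, pos_of, top_k=3):
    """Rank entries by W = TF * IDF * ATTR and return the top_k (entry, weight) pairs.

    docs   -- list of documents, each a list of tokens (already segmented)
    pos_of -- dict mapping a token to its part-of-speech tag
    """
    n_docs = len(docs)
    term_count = Counter()  # occurrences of each entry over the whole corpus
    doc_count = Counter()   # number of documents containing each entry
    total_tokens = 0
    for tokens in docs:
        term_count.update(tokens)
        doc_count.update(set(tokens))
        total_tokens += len(tokens)

    weights = {}
    for term, count in term_count.items():
        tf = count / total_tokens
        # Assumed IDF variant: log(N / df) + 1, so entries in every document keep a nonzero weight.
        idf = math.log(n_docs / doc_count[term]) + 1.0
        attr = POS_WEIGHT.get(pos_of.get(term), DEFAULT_POS_WEIGHT)
        weights[term] = tf * idf * attr

    # Largest weights first; ties are broken alphabetically for a stable result.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    docs = [["market", "grow", "market"], ["market", "price"], ["grow", "fast"]]
    pos_of = {"market": "noun", "grow": "verb", "price": "noun", "fast": "adjective"}
    for term, weight in feature_words(docs, pos_of, top_k=2):
        print(f"{term}\t{weight:.4f}")
```

With these assumed weights, a noun that occurs as often as a verb outranks it, which is the effect the ATTR factor is meant to add to plain TF*IDF. The selected entries could then serve as the feature set for a classifier such as KNN.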
Keywords/Search Tags:visualization, information extraction, feature word extraction, part-of-speech tagging, classification