Research On Key Technology In Preprocessing Oriented Web Text Data Warehouse

Posted on: 2012-10-16  Degree: Master  Type: Thesis
Country: China  Candidate: S Ma  Full Text: PDF
GTID: 2248330395455583  Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology, more and more people obtain the resources they need from the Web. Websites hold innumerable resources, including large quantities of images, text, and more. In constructing a Web text data warehouse, a key preprocessing problem is how to give unstructured Web text a structure, extract useful information to support advanced applications, and load it into the warehouse.

This paper takes a preprocessing system for a Web text data warehouse as an example and focuses on the key technologies of preprocessing. First, it proposes a topic-based approach to building the Web text warehouse and designs a star schema to hold the information extracted from Web text. Second, it details the Document Object Model (DOM) and information extraction technology. HTML parsing converts the unstructured Web text into a DOM, from which the needed information (title, author, and content) is obtained; the DOM also provides a structural model for the subsequent information extraction. Information extraction uses text segmentation, keyword extraction, automatic summarization, and text classification. Because segmentation technology is relatively mature, we adopt the CAS ICTCLAS segmentation system. For keyword extraction, a "co-occurrence" model is presented. An improved method based on statistical automatic summarization is used to obtain a smooth abstract. A two-dimensional SVM-KNN text classification method is presented to obtain better classification results; it addresses the SVM's dependence on the kernel function and exploits the high accuracy of KNN. Based on the above study, the design and implementation of the Web text preprocessing system are described.
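
The DOM step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the author is carried in a `<meta name="author">` tag and that the body text sits in `<p>` elements, and the names `Node`, `TreeBuilder`, and `extract_fields` are hypothetical. It builds a small DOM-like tree with Python's standard `html.parser` module and then reads title, author, and content from that tree.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"meta", "br", "img", "link", "hr", "input"}


class Node:
    """A minimal DOM-like element: a tag, its attributes, and ordered children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # holds Node objects and plain text strings

    def find_all(self, tag):
        """Yield every descendant element with the given tag, in document order."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)

    def get_text(self):
        """Concatenate all text below this node, in document order."""
        parts = [c if isinstance(c, str) else c.get_text() for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Turns the parser's start/end/data events into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def extract_fields(html):
    """Return title, author, and content; the page layout is an assumption."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    root = builder.root
    title = next((n.get_text() for n in root.find_all("title")), "")
    author = next(
        (n.attrs.get("content") or ""
         for n in root.find_all("meta")
         if (n.attrs.get("name") or "").lower() == "author"),
        "",
    )
    content = "\n".join(p.get_text() for p in root.find_all("p"))
    return {"title": title, "author": author, "content": content}


SAMPLE = """<html><head><title>Sample page</title>
<meta name="author" content="A. Writer"></head>
<body><h1>Heading</h1><p>First paragraph.</p><p>Second <b>bold</b> one.</p></body></html>"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Real Web pages are far less regular than `SAMPLE`, so a production system would likely rely on a more tolerant parser and on site-specific rules for locating the author and main content.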
Keywords/Search Tags: Web text data warehouse, preprocessing, information extraction, text classification