Research and Implementation of Key Data Cleaning Technologies for Web Text

Posted on: 2010-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: K Guo
Full Text: PDF
GTID: 2178330332488546
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer technology and the Internet, Web resources have become an important source of information and knowledge. The Web contains a large amount of text, and how to extract text data from Web pages and organize it into a model that carries clear semantic information and supports advanced data applications has become a research focus.

This thesis studies data cleaning and system design for Web text, and discusses the key technologies involved through the development of such a system. A theme-based method for building a text data warehouse is proposed and used to design one. Based on the structure and features of HTML, each HTML document is mapped to a tree structure, and the main content of the Web page is obtained by analyzing that tree. To account for the characteristics of Chinese, a dictionary-based algorithm is adopted for word segmentation. A "co-occurrence" model is presented that extracts keywords from the segmentation results. An automatic summarization algorithm based on statistical methods extracts the summary. An SVM-based multi-category classification method and a Vector Space Model method based on TF-IDF are proposed to classify the test texts and to implement similar-text cleaning.

Building on these results, the design and implementation details of the Web text cleaning system are described.
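As an illustration of the Vector Space Model idea mentioned in the abstract, the following is a minimal sketch of detecting similar texts with TF-IDF weights and cosine similarity. It is not the thesis's implementation: the function names (`tfidf_vectors`, `cosine`, `find_similar_pairs`), the smoothed IDF formula, and the 0.8 default threshold are assumptions made for this example, and the input is assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} TF-IDF vector per tokenized document.

    `docs` is a list of token lists; word segmentation is assumed to have
    been done beforehand (hypothetical preprocessing step).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        vec = {}
        for term, count in tf.items():
            # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
            idf = math.log((1 + n) / (1 + df[term])) + 1.0
            vec[term] = (count / total) * idf
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_pairs(docs, threshold=0.8):
    """Return (i, j, similarity) for every document pair at or above `threshold`."""
    vectors = tfidf_vectors(docs)
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs
```

A cleaning step could then keep one document from each reported pair and discard the other. The pairwise loop is quadratic in the number of documents, so a larger collection would need an index or a blocking step, which this sketch leaves out.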
Keywords/Search Tags: Web text data warehouse, data cleaning, information extraction, text classification, similar text