Research And Implementation Of Data Cleaning Key Technology Oriented Web Text

Posted on:2010-07-11

Degree:Master

Type:Thesis

Country:China

Candidate:K Guo

Full Text:PDF

GTID:2178330332488546

Subject:Computer application technology

Abstract/Summary:

With the rapid development of computer technology and the Internet, Web resources have become an important source of information and knowledge. The Web contains a large amount of text, and how to extract text data from Web pages and organize it into a model that has clear semantics and supports advanced data applications has become a research focus.

This thesis studies data cleaning for Web text and the design of a corresponding system. The key technologies involved are discussed through the development of that system. A theme-based method for building a text data warehouse is proposed and used to design one. Based on the structure and features of HTML, each HTML document is mapped into a tree structure, and the main content of the Web page is obtained by analyzing the HTML. Taking the characteristics of Chinese into account, a dictionary-based algorithm is adopted for word segmentation. A "co-occurrence" model is presented to extract keywords from the segmentation results. A statistics-based automatic abstracting algorithm implements automatic summary extraction. An SVM-based multi-category classification method and a Vector Space Model method based on TF-IDF are proposed to classify the test texts and to implement similar-text cleaning.

Building on these results, the design and implementation details of the Web text cleaning system are described.
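Two of the steps named in the abstract, dictionary-based segmentation and TF-IDF similarity in a Vector Space Model, can be illustrated with a minimal Python sketch. The abstract does not say which dictionary matching strategy or which TF-IDF weighting the thesis uses, so the choices here are assumptions: forward maximum matching for segmentation, a smoothed IDF, and cosine similarity for comparing texts. The function names `segment_fmm`, `tfidf_vectors`, and `cosine` and the tiny dictionary are hypothetical and used only for illustration.

```python
import math
from collections import Counter


def segment_fmm(text, dictionary, max_len=4):
    """Forward maximum matching (assumed strategy, not confirmed by the thesis):
    at each position, take the longest substring found in the dictionary,
    falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document.
    Uses relative term frequency and a smoothed IDF (an assumed weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical dictionary and texts, for illustration only.
words = {"数据", "清洗", "技术", "数据清洗", "文本", "分类"}
texts = ["数据清洗技术", "数据清洗技术", "文本分类"]
docs = [segment_fmm(t, words) for t in texts]
vecs = tfidf_vectors(docs)
same = cosine(vecs[0], vecs[1])       # identical texts
different = cosine(vecs[0], vecs[2])  # texts with no shared terms
```

In a similar-text cleaning step, pairs whose cosine similarity exceeds some threshold could be treated as near duplicates. The threshold value would have to be tuned on real data and is not given in the abstract.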

Keywords/Search Tags:

Web text data warehouse, data cleaning, information extraction, text classification, similar text

Related items

1	Learning-Based Text Extraction In Natural Background
2	Research On Web Similar Duplicate Data Cleaning Based On Hadoop
3	Research On Key Technology In Preprocessing Oriented Web Text Data Warehouse
4	Text Emotional Classification Based On Text Mining
5	Research On Key Approaches Of Similar Detecting Based On Massive Text Data Set
6	Research On The Key Techniques Of Web Information Intelligent Acquisition
7	Research On Mining Geographic Location Attributes Of Characters Based On Social Text Data
8	Text Classification Based On Natural Dimension Of Webpage
9	Research On Video Text Information Extraction Based On Features Integration
10	Analysis Of Text Information Based On Deep Learning