Web Content Extraction Research Based on DOM Structure Tree and Feature Word

Posted on: 2015-03-16, Degree: Master, Type: Thesis
Country: China, Candidate: J Zhou
GTID: 2268330428966202, Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, the amount of information on the Internet has grown explosively, and the scale of text information is expanding exponentially. The Internet now serves people with vast amounts of information. The Web has become the most important source of information on the Internet and an indispensable part of people’s daily life and work. Web pages, however, contain not only body text but also a large amount of interfering information (noise), which reduces the usability of Web information. Obtaining correct and valuable information in a timely manner from this vast sea of text has become a pressing problem. Making full use of data mining and text classification technology can go a long way toward solving it.

Web information extraction and short text classification are important research topics in text information mining. Web information extraction divides a Web page into different regions, from which the body content can be located and extracted accurately by an algorithm. Such a technique should require no training and should offer good flexibility and accuracy. Short text classification is an important step in processing text after Web information extraction: it helps deliver valuable information to users and supports the accuracy and efficiency of subsequent work.

This thesis introduces the research background and significance, the state of research in China and abroad, and the theoretical foundations of Web information extraction and short text classification. Building on previous work, it proposes a new method for Web information extraction and a new method for short text classification.
The work consists of two main parts.

First, it proposes an accurate and efficient way to obtain text from a Web page, such that the text can be split into paragraphs that follow the original. The method first constructs a DOM structure tree from the page layout tags <table> and <div>. It then locates the body region and extracts the body content by using the nesting and hierarchical relationships among the layout tags described by the structure tree. Finally, the text is post-processed according to the attributes of special tags to obtain the correct paragraph segmentation of the body content. Experiments show that the method extracts Web text automatically and accurately, and that it is easy to implement, efficient, and flexible.

Second, it introduces a short text classification method based on keyword relevance. Drawing on existing short text classification algorithms and on a keyword extraction method based on weighted complex networks, the thesis proposes a classification algorithm based on the positive and negative weights of keywords. The sentences of the short text are first segmented into words, stop words are removed, and a corpus is built. Keywords are then obtained with the keyword extraction method. Finally, the relevance of the keywords is used to compute a text relevance degree, by which the short text is classified. Experimental results show that the algorithm achieves high accuracy and is suitable for automatic batch processing of large numbers of Web pages.
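The extraction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state how the body region is chosen or which tags drive paragraph splitting, so the sketch assumes that the body is the <div> or <table> block holding the most text of its own, and that <p> and <br> mark paragraph breaks. The names LayoutNode, LayoutTreeBuilder, and extract_body are hypothetical.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"div", "table"}   # layout tags named in the abstract
BREAK_TAGS = {"p", "br"}         # assumption: tags treated as paragraph breaks
SKIP_TAGS = {"script", "style"}  # assumption: tags whose content is never body text


class LayoutNode:
    """One <div>/<table> block in the layout tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.chunks = [[]]  # paragraphs, each a list of text pieces

    def own_text_length(self):
        # Text directly in this block, excluding nested layout blocks.
        return sum(len(piece) for para in self.chunks for piece in para)

    def paragraphs(self):
        joined = (" ".join(para).strip() for para in self.chunks)
        return [text for text in joined if text]


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree containing only layout tags and the text they own."""

    def __init__(self):
        super().__init__()
        self.root = LayoutNode("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in LAYOUT_TAGS:
            node = LayoutNode(tag, self.current)
            self.current.children.append(node)
            self.current = node
        elif tag in BREAK_TAGS:
            self.current.chunks.append([])  # start a new paragraph

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in LAYOUT_TAGS and self.current.tag == tag:
            self.current = self.current.parent  # close the current block
        elif tag == "p":
            self.current.chunks.append([])

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip_depth:
            self.current.chunks[-1].append(text)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def extract_body(html):
    """Return the paragraphs of the layout block with the most own text."""
    builder = LayoutTreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_nodes(builder.root), key=LayoutNode.own_text_length)
    return best.paragraphs()


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<table><tr><td>
  <div id="content">
    <p>First paragraph of the article body, long enough to dominate.</p>
    <p>Second paragraph, kept as a separate segment.</p>
  </div>
</td></tr></table>
<div id="footer">Copyright notice</div>
</body></html>
"""

if __name__ == "__main__":
    for paragraph in extract_body(SAMPLE):
        print(paragraph)
```

On the sample page, the navigation and footer blocks carry only a few characters of text, so the nested content block is selected and its two paragraphs are returned separately. A real page would likely need a stronger selection rule, for example one that discounts link text, which is beyond what the abstract specifies.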
Keywords/Search Tags: DOM structure tree, semantic markup, segmentation, weighted complex network, relevance of feature word