The Automatic Extraction Technology Research Of Web Page Text Information

Posted on: 2008-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: J J Zhang
GTID: 2178360242960069
Subject: Software engineering
Abstract/Summary:
As information is produced and disseminated ever faster, the amount of information keeps increasing. Using information resources more efficiently is an important issue for the development of society and enterprises. With the emergence of the Internet, how to find what one is interested in quickly and accurately has become a new problem. Although search engines help to some extent, they only return relevant pages, and these contain much repeated information that cannot be distilled, so the information is not applied as effectively as it could be. In light of this situation, this thesis focuses on the automatic extraction and application of Web page text information.

By analyzing the characteristics of Web page text information, we explain how it differs from traditional text, which lays the foundation for its extraction and identifies the objects to extract. Compared with traditional information resources, network information has the following traits: wide distribution, rapid transmission, large volume, fast growth, disorder and instability. Network resources can be divided into various types depending on how the information is classified. With the rapid development of Internet and Web technology, the Web has become a global distributed computing environment whose development trend is the management of dynamic information rather than static HTML pages. Because static Web sites lack interactivity, developing and maintaining them is becoming increasingly difficult. We analyze several common types of WWW dynamic interactive technology.

While the growth in the amount of information gives us a larger scope for information mining, it also creates a lot of garbage data.
Although manual processing of data is more accurate in information identification, analysis, synthesis, reasoning and association, it is slower and less convenient to apply. This thesis studies the preprocessing technologies for Web page text information and points out the deficiencies of current Chinese classification. The main preprocessing technologies are automatic indexing, automatic classification and automatic abstracting; we analyze and evaluate their current state and progress. The thesis then studies the automatic extraction technology itself: it discusses Web pages and their features, studies the flow and implementation method of automatic extraction, and covers data collection and data filtering. We also put forward the concept of the Knowledge Base.

The aim of automatic extraction is the further processing and application of text information. Here we focus on the application of text: the concept of data mining, its analysis methods and application types, which lead to the algorithms for text application. Finally, we propose a specific algorithm for the application and mining of Web page text information.
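As a rough illustration of the extraction and filtering steps described above, the following minimal sketch pulls visible text out of an HTML page and drops repeated fragments. It is not the method of the thesis: the class `TextExtractor` and the function `extract_text` are hypothetical names, and the sketch assumes only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize runs of whitespace and ignore empty fragments.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the page's text fragments in order, without repeats."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Simple filtering step: drop exact duplicates, keep first occurrence.
    return list(dict.fromkeys(parser.chunks))


page = (
    "<html><head><title>News</title><style>p{color:red}</style></head>"
    "<body><p>Hello   world</p><script>var x = 1;</script>"
    "<p>Hello world</p><p>Bye</p></body></html>"
)
print(extract_text(page))
```

A real pipeline would also have to handle character encodings, navigation menus and advertising blocks, which this sketch ignores.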
The main methods of text analysis and application are: correlation analysis, which uses association rules to analyze and process data; sequence model analysis, which focuses on the before-and-after order of data; classification analysis, which analyzes the sample data in a database to build an accurate description, analysis model or set of classification rules for each category, and then uses these rules to classify the records in other databases; and cluster analysis, which analyzes the records in a database and determines the type of each record according to certain classification rules. The main targets of text analysis are large volumes of unstructured text information that need processing. In the last part of this chapter, we discuss the processing steps for Web page text. Based on examples of automatic extraction and mining of Web page text information in the Network System, we put forward a solution for its automatic extraction and application. Drawing on practice, we describe the implementation flow and main methods for extracting Web page text information, and then apply the extracted information to text mining. Automatic extraction and application of Web page text information is a technology with great potential, and it continues to develop along with network and computer technology. The amount of electronic data keeps growing. As automatic extraction technology for the Web continues to develop, it is bound to be applied more widely.
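The correlation analysis mentioned above relies on association rules, which are usually scored by support and confidence. The sketch below shows these two standard measures over sets of extracted keywords; the function names and the sample keyword sets are assumptions made for illustration and do not come from the thesis.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical keyword sets, one per extracted Web page.
pages = [
    {"web", "extraction"},
    {"web", "mining"},
    {"web", "extraction", "mining"},
    {"text"},
]
print(support(pages, {"web", "extraction"}))
print(confidence(pages, {"web"}, {"extraction"}))
```

A rule such as "web -> extraction" would be kept only if both values exceed thresholds chosen by the analyst.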
Keywords/Search Tags: Information