The Research And Implementation Of Torrent Information Aggregation And Extraction Model Based On RSS

Posted on: 2011-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: L N Zhang
GTID: 2178360305971741
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the World Wide Web has become a huge distributed information space that offers users a massive and valuable information resource. However, when search engines are used for information retrieval, they return so many results that users often find it difficult to locate the information that is actually relevant. In addition, Internet-based information searching and access no longer deal only with simple static pages but also with frequently updated dynamic pages, such as blogs and forum websites. How to find information accurately and how to obtain new knowledge and new content in a timely manner have therefore become two major issues to be resolved.
In response to these two problems, this thesis analyzes the advantages of RSS (Really Simple Syndication) aggregation technology in information updating and its disadvantages in information filtering, and combines it with Web information extraction technology, which can accurately locate the information a user requires, to propose an approach that combines information aggregation with extraction. The approach is then applied to BT torrents, and an information aggregation and extraction system is designed and developed. The system lets the computer aggregate and extract BT torrent information automatically and presents the user with a complete view, so it saves a great deal of manpower and time, raises the level of automation, and supports the A380 media player system researched by the Shanxi Easydo company. The content is as follows:
First, the advantages and disadvantages of information aggregation are discussed, and the various technical methods of information extraction are analyzed and compared.
Lucene, the filtering technique, and HtmlParser, the extraction technique, are studied.
Second, the model is designed and divided into four basic modules: an information aggregation module, an information filter module, an information retrieval module, and an information extraction module. The aggregation function is achieved with RSS technology. The filter function, which builds on the aggregated information and uses Lucene to create an index, is achieved by querying the custom lexicon. The retrieval function is achieved by querying keywords. The extraction function uses HtmlParser to parse web pages and is achieved by matching against the parameter database.
Finally, the system is implemented, its availability is tested, and the effectiveness of the results is assessed.
The information aggregation and extraction system for BT torrents adds a filter function to the information aggregation process and achieves online information extraction and structured storage. It can better meet users' needs to find useful information accurately and to obtain new content in a timely manner. Testing and analysis show that the effectiveness of the filter and the recall and precision of the extraction both meet the model requirements, which supports the correctness of this research and lays a solid foundation for developing more specialized and extensive systems in the future.
Keywords/Search Tags: RSS information aggregation, Web information extraction, Lucene, HtmlParser
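The aggregation, filtering, and extraction steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the thesis uses Lucene and HtmlParser (both Java libraries), whereas this sketch swaps in Python's standard library, replaces the Lucene index with plain substring matching against a lexicon, and approximates the parameter database with a single ".torrent" suffix rule. The sample feed, the sample page, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample feed; a real system would fetch feeds over HTTP.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example torrent feed</title>
    <item>
      <title>Ubuntu 10.04 desktop ISO</title>
      <link>http://example.com/t/1</link>
      <description>Linux install image</description>
    </item>
    <item>
      <title>Weekly forum digest</title>
      <link>http://example.com/t/2</link>
      <description>Unrelated discussion</description>
    </item>
  </channel>
</rss>"""

# Hypothetical detail page that one of the feed items might link to.
SAMPLE_PAGE = (
    '<html><body><a href="/dl/1.torrent">Download</a>'
    '<a href="/about">About</a></body></html>'
)


def aggregate(rss_text):
    """Aggregation: collect title, link, description of every RSS 2.0 <item>."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


def filter_items(items, lexicon):
    """Filtering: keep items whose title or description contains a lexicon term.

    Assumption: case-insensitive substring matching stands in for a Lucene index.
    """
    terms = [term.lower() for term in lexicon]
    kept = []
    for item in items:
        text = (item["title"] + " " + item["description"]).lower()
        if any(term in text for term in terms):
            kept.append(item)
    return kept


class TorrentLinkParser(HTMLParser):
    """Extraction: collect href values of <a> tags that end in '.torrent'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.endswith(".torrent"):
                self.links.append(href)


def extract_torrent_links(html_text):
    """Run the parser over one page and return the torrent links it found."""
    parser = TorrentLinkParser()
    parser.feed(html_text)
    return parser.links
```

Calling `filter_items(aggregate(SAMPLE_RSS), ["ubuntu", "linux"])` keeps only the first feed item, and `extract_torrent_links(SAMPLE_PAGE)` returns the single link ending in `.torrent`. Filtering before extraction mirrors the order given in the abstract, so pages are only parsed for items that already matched the lexicon.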