
Research and Implementation of a Web Information Extraction System for Bookmarks

Posted on: 2015-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: D M Yang
Full Text: PDF
GTID: 2268330425487890
Subject: Software engineering
Abstract/Summary:
A social bookmarking system is an effective tool for collecting, managing, and sharing web information, but its social features depend on the number of users and resources. This thesis studies how to apply web information extraction and related natural language processing techniques to a bookmark system in order to solve the system's cold-start problem and thereby improve the user experience.

The thesis first surveys web information extraction algorithms. Building on the open-source project Goose, it improves the scraping of web page data, adds automatic detection of a page's charset, improves page preprocessing, adds support for Chinese web pages, and adds formatting of the extracted page text to improve the reading experience. It then implements a web information extraction module based on ElementTree, which the thesis describes as practical enough for use in a production system. The thesis also presents a tag recommendation algorithm that incorporates web development patterns, and implements a simple web page summary function based on the results of web information extraction and web metadata.

The thesis designs and implements a bookmark system. Its architecture uses Tornado as the web/application server and web development framework, MongoDB as the database server, and AngularJS and jQuery on the client side, with Bootstrap 3 for styling. It implements a client application with a responsive layout and a flat-design grid, and develops a Chrome plug-in. The web information extraction module is integrated into the system so that users can read and edit bookmark content, which improves the user experience. Based on the extracted information, the system uses full-text search to implement its search function, avoiding the limitations of searching only page titles or searching entire web pages.

The system introduced in this thesis differs from currently popular recommended-reading systems: it focuses on managing bookmarks rather than on reading. Combining the bookmark system with a note-taking system would make secondary filtering of information more efficient.
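
The abstract mentions automatic detection of a web page's charset but gives no details. The following is a minimal Python sketch of one common approach, not the thesis's or Goose's actual implementation: check the HTTP Content-Type header, then a <meta> charset declaration, then fall back to trial decoding. The function names, the 4096-byte scan window, and the fallback order (UTF-8, then GB18030 for Chinese pages) are all assumptions made for illustration.

```python
import codecs
import re

# Hypothetical helper names; nothing here is taken from the thesis or from Goose.
HEADER_CHARSET_RE = re.compile(
    r"""charset\s*=\s*["']?\s*([A-Za-z0-9_-]+)""", re.IGNORECASE
)
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_-]+)""", re.IGNORECASE
)

# Assumed fallback order: UTF-8 first, then GB18030 (a superset of GBK/GB2312).
FALLBACK_ENCODINGS = ("utf-8", "gb18030")


def _normalize(label):
    """Return Python's canonical codec name, or None if the label is unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def detect_charset(raw, content_type=""):
    """Guess the encoding of raw page bytes.

    raw          -- the undecoded response body (bytes)
    content_type -- the HTTP Content-Type header value, if available
    """
    # 1. A charset declared in the HTTP header takes priority.
    match = HEADER_CHARSET_RE.search(content_type)
    if match:
        name = _normalize(match.group(1))
        if name:
            return name

    # 2. Otherwise look for a <meta> declaration near the top of the page.
    match = META_CHARSET_RE.search(raw[:4096])
    if match:
        name = _normalize(match.group(1).decode("ascii"))
        if name:
            return name

    # 3. Otherwise try candidate encodings until one decodes cleanly.
    for encoding in FALLBACK_ENCODINGS:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue

    # Last resort: latin-1 can decode any byte sequence.
    return "latin-1"


def decode_page(raw, content_type=""):
    """Decode page bytes to text using the detected charset."""
    return raw.decode(detect_charset(raw, content_type), errors="replace")
```

Calling decode_page(body, headers.get("Content-Type", "")) on a fetched response would then yield text that can be passed to the extraction step. A production system would likely also need statistical detection for pages that declare no charset or declare the wrong one.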
Keywords/Search Tags: bookmark system, web information extraction, tag recommendation, MongoDB