| The Internet has become the primary means by which people retrieve information and express their opinions. With the rapid growth of social and comprehensive information platforms such as Weibo and WeChat, sources of public sentiment have become more sophisticated, more social, more instantaneous, and faster to propagate. Enterprises have a strong need to monitor online news, blogs, tweets, forum comments, and video sites in order to maintain their public image and respond to emergencies promptly. A public sentiment mining system that supports multi-source web page collection, intelligent analysis, and comprehensive monitoring is therefore crucial. Based on the requirements of the background project, this thesis studies the crucial phases of a public sentiment mining system: web content extraction, web page preprocessing, sentiment page detection, new event detection, topic detection and tracking, and sentiment analysis. Key problems are addressed, including the low accuracy of traditional web crawlers when extracting pages from the hidden web, the high false positive rate of the vector space model in new event detection, and the low precision of sentiment analysis, and new algorithms and strategies are proposed. A public sentiment mining system that fully complies with the requirements is implemented and tested. In this thesis, AjaxCrawler is proposed. It is based on dynamic script execution and reconstruction of the DOM context. Instead of building a navigation path that follows hyperlinks, AjaxCrawler builds a DOM state transition map whose states are linked by events on particular DOM nodes. DOM node distance is proposed as the heuristic (h-score) function to speed up the search. By replaying the events on the shortest path, hidden content can be extracted. Case studies on extracting price tables and comments from B2C web sites show that AjaxCrawler achieves much higher precision and better performance than traditional crawlers. 
A new strategy based on pre-classification and named entity recognition is proposed for new event detection. Web content is first pre-classified into ten classes, and only documents from the same class are fully compared. Weighted named entity similarity is proposed for measuring document distance. Case studies show that this method improves both the precision and the recall of new event detection. An improved method based on segmentation and weighted sentiment appraisal is applied to increase the precision of sentiment analysis. In the background project of this thesis, the algorithms and systems implemented here underwent full functional testing and trial use. Case studies show that the system achieves good precision and performance and is fully applicable in real-world practice. |
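
The new event detection strategy summarized above (pre-classification, then comparison by weighted named entity similarity within a class) might be sketched as follows. This is only an illustrative sketch, not the thesis implementation: the entity types, the weights in `ENTITY_WEIGHTS`, the weighted-Jaccard form of the similarity, the threshold, and the names `weighted_entity_similarity` and `NewEventDetector` are all assumptions, and the class label and named entities are assumed to come from an upstream classifier and named entity recognizer.

```python
from collections import defaultdict

# Hypothetical weights per entity type; the abstract does not give actual values.
ENTITY_WEIGHTS = {"person": 1.0, "location": 0.8, "organization": 0.9}


def weighted_entity_similarity(a, b, weights=ENTITY_WEIGHTS):
    """Weighted Jaccard overlap of named entities, computed per entity type.

    `a` and `b` map an entity type to a set of entity strings.
    This particular formula is an assumption made for illustration.
    """
    shared = 0.0
    total = 0.0
    for entity_type, weight in weights.items():
        ents_a = a.get(entity_type, set())
        ents_b = b.get(entity_type, set())
        shared += weight * len(ents_a & ents_b)
        total += weight * len(ents_a | ents_b)
    return shared / total if total else 0.0


class NewEventDetector:
    """Flags a document as a new event when nothing similar was seen in its class."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off, not taken from the thesis
        self.seen = defaultdict(list)  # class label -> entity dicts seen so far

    def is_new_event(self, label, entities):
        # Only documents with the same pre-assigned class label are compared.
        best = max(
            (weighted_entity_similarity(entities, prev) for prev in self.seen[label]),
            default=0.0,
        )
        self.seen[label].append(entities)
        return best < self.threshold
```

Restricting the comparison to one class keeps the number of pairwise similarity computations small and prevents documents from unrelated classes that happen to share entity names from being merged into the same event.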