Research On Several Key Issues In Blog Search

Posted on: 2010-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Du
Full Text: PDF
GTID: 2178360278965526
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Facing the ever-growing complexity of web pages and the increasing need for more intelligent content analysis technology, we investigated two key problems in blog information retrieval systems: web page content extraction and text sentiment analysis. The main innovations of this thesis are stated below.

Firstly, to avoid the space and computational overhead of state-of-the-art DOM-based web page content extraction methods, we proposed a SAX-style algorithm. This fast and robust algorithm uses the page-level template structure and site-level deduplication of noise blocks to extract the content of pages. When tested on the TREC Blog06 dataset, it reduces the dataset to 12.5 percent of its original size and yields a 33 percent improvement in MAP.

Secondly, we investigated the feature selection problem in statistical sentiment analysis. The features tested include term n-grams, part-of-speech tags, negation, and synonym expansion. We found that, in word-level sentiment polarity analysis, all these advanced features need to be accompanied by position information in order to take effect, and that statistical classification models work well enough with only unigram features in sentence-level sentiment polarity analysis. The best feature combination achieved 88.6% precision in word-level polarity classification and 83.9% at the sentence level.

Finally, we introduced a blog opinion retrieval system that we developed with the techniques above. We obtained a large improvement in topic relevance by using the site-level noise removal technique to extract web page content and by combining document-level and paragraph-level relevance scores. In the TREC 2008 Blog Track, our system was chosen as one of the baselines because of its good relevance retrieval performance.
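
The abstract does not spell out the extraction algorithm, so the following is only a minimal sketch of the general idea of event-driven (SAX-style) parsing combined with site-level noise block deduplication, not the thesis's actual method. It assumes Python's standard `html.parser` as a stand-in for a SAX-style parser, exact text matching as the deduplication criterion, and a hypothetical threshold `max_share`; the names `BlockCollector`, `extract_blocks`, and `strip_site_noise` are invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of tags that delimit text blocks (illustrative, not from the thesis).
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "table", "h1", "h2", "h3", "body"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects text blocks in one streaming pass, without building a DOM tree."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered text, normalize whitespace, and store non-empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_blocks(html):
    """Return the list of text blocks found in one page."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def strip_site_noise(pages, max_share=0.5):
    """Drop blocks that occur on more than max_share of a site's pages.

    Assumption: a block repeated verbatim across many pages of the same
    site (navigation bar, footer) is template noise rather than content.
    """
    per_page = [extract_blocks(page) for page in pages]
    doc_freq = Counter()
    for blocks in per_page:
        doc_freq.update(set(blocks))  # count each block once per page
    limit = max_share * len(pages)
    return [[b for b in blocks if doc_freq[b] <= limit] for blocks in per_page]
```

Calling `strip_site_noise` on a list of HTML strings from one site returns, for each page, the blocks that are not shared across most of that site's pages. Because the parser emits events as it reads, memory use is bounded by the current block rather than by a full tree, which is the overhead the SAX-style approach is meant to avoid.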
Keywords/Search Tags: blog retrieval, page segmentation, content extraction, sentiment analysis, feature selection