Research On Data Extraction Of Information Text Of The Web Forums

Posted on: 2012-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z Wang
Full Text: PDF
GTID: 2218330362956274
Subject: Communication and Information System
Abstract/Summary:
With the renewal of web technologies, the internet industry has entered a new round of rapid development, and people can retrieve more useful knowledge and data than before. However, as information accumulates explosively, people need a simple and direct way to survey it and to find answers to their questions. People share information or seek help with problems on various Bulletin Board Systems (BBS) or web forums. It has therefore become a pressing task to extract content from BBS articles effectively, condense the information, and identify the key points.

For the extraction of information text from BBS, the first important aspect is extracting the abstracts of articles. This research mainly extracts abstracts of BBS articles and then improves the practicality of the BBS platform according to its features, so it is not just traditional text extraction. A BBS serves two main functions: first, to deliver information and make comments; second, to seek information and get answers. This research addresses each function in turn, extracting information for the first and extracting effective answers for the second.

A novel composite method based on Maximum Marginal Relevance (MMR) and subtopic clustering, combined with characteristics of context, is proposed in this research. For topics of larger volume, referred to as topics of type I, the steps are as follows. First, the subtopics are analyzed from a series of inter-sentence similarities, in which the points with the lowest similarity scores serve as segment points, so as to select K segments for K-means clustering. Then, the MMR algorithm framework is applied within each segment or cluster. Finally, a ranking strategy is applied to every cluster to decide which significant contexts go into the final output.
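
The type I steps can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract gives no formulas, so the code assumes whitespace tokenization, raw term-frequency vectors with cosine similarity, and the standard greedy MMR criterion. The names `tf_vector`, `cosine`, `segment`, `mmr`, and the weight `lam` are hypothetical, and the K-means step and the final cluster ranking are left out.

```python
import math
from collections import Counter


def tf_vector(sentence):
    # Assumption: whitespace tokenization and raw term counts.
    return Counter(sentence.lower().split())


def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment(sentences, k):
    """Split sentences into k contiguous segments at the k-1 weakest adjacent links."""
    vecs = [tf_vector(s) for s in sentences]
    # Similarity of each sentence to the next one, paired with the cut position.
    gaps = [(cosine(vecs[i], vecs[i + 1]), i + 1) for i in range(len(vecs) - 1)]
    cuts = sorted(pos for _, pos in sorted(gaps)[: k - 1])
    bounds = [0] + cuts + [len(sentences)]
    return [sentences[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]


def mmr(sentences, query, n, lam=0.7):
    """Greedy MMR: pick n sentences, trading relevance to query against redundancy."""
    vecs = [tf_vector(s) for s in sentences]
    q = tf_vector(query)
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < n:
        def score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * cosine(vecs[i], q) - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(selected)]
```

With `lam` close to 1 the selection reduces to plain relevance ranking; lower values penalize sentences that repeat what has already been selected. In a fuller pipeline, `mmr` would be run once per segment or cluster, using the thread title or first post as the query (again an assumption).
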
For topics of type II, the algorithm adopts a language-model-based relevance model, which first obtains word-to-word translation statistics and then builds a relevance model to compute the similarity between two blocks of sentences. The results show that the methods proposed for both tasks outperform their respective baseline systems.
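
The idea behind the type II model can be sketched as follows. The abstract does not specify the estimator or the scoring formula, so this is an assumption-laden illustration rather than the thesis's method: translation probabilities P(w | t) are approximated by normalized co-occurrence counts over paired blocks, and the score is a common translation-based language model with add-one smoothed background probabilities. The names `translation_table`, `log_relevance`, and the weight `lam` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def translation_table(pairs):
    """Estimate P(w | t) from co-occurrence counts in (source, target) block pairs."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        for t in source.split():
            for w in target.split():
                counts[t][w] += 1
    # Normalize each row so the probabilities for a source word sum to 1.
    return {
        t: {w: c / sum(ws.values()) for w, c in ws.items()}
        for t, ws in counts.items()
    }


def log_relevance(target, source, table, collection, lam=0.2):
    """log P(target | source): each target word is 'translated' from source words.

    collection is a Counter of word counts over the whole corpus, used for smoothing.
    """
    src_tf = Counter(source.split())
    src_len = sum(src_tf.values())
    coll_len = sum(collection.values())
    vocab = len(collection)
    score = 0.0
    for w in target.split():
        # Sum over source words t of P(w | t) * P(t | source).
        translated = sum(
            table.get(t, {}).get(w, 0.0) * n / src_len for t, n in src_tf.items()
        )
        # Add-one smoothed background probability keeps the log finite.
        background = (collection[w] + 1) / (coll_len + vocab + 1)
        score += math.log((1 - lam) * translated + lam * background)
    return score
```

A candidate answer block would then be ranked by `log_relevance(answer, question, table, collection)`, so that answers whose words are likely translations of the question's words score higher even when the two blocks share no vocabulary.
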
Keywords/Search Tags: Bulletin Board System, information text, abstract extraction, similarity computation, vector space model