Research On Data Extraction Of Information Text Of The Web Forums

Posted on:2012-08-09

Degree:Master

Type:Thesis

Country:China

Candidate:Z Wang

GTID:2218330362956274

Subject:Communication and Information System

Abstract/Summary:

With the renewal of web technologies, the internet industry has entered a new round of rapid development, and people can retrieve more useful knowledge and data than before. However, as information grows explosively, people need a simple and direct way to review it and to find answers to their questions. People share information and seek help on Bulletin Board Systems (BBS) and web forums, so it has become a pressing task to extract content from BBS articles effectively, condense it, and identify the key points.

For extracting information text from BBS, the first important step is to extract abstracts of the articles. This research extracts abstracts from BBS articles and then improves the practicality of the BBS platform according to its features, so it is more than traditional text extraction. BBS serves two main functions: first, delivering information and commenting on it; second, seeking information and getting answers. This research addresses each function in turn, extracting information for the first and extracting effective answers for the second.

This research proposes a novel composite method based on Maximum Marginal Relevance (MMR), subtopic clustering, and contextual features. For topics with a larger volume of content (type I topics), the method has three steps. First, subtopics are analyzed using a series of inter-sentence similarities, taking the points with the lowest similarity scores as segment points, in order to select K segments for K-means clustering. Second, the MMR algorithm framework is applied within each segment or cluster. Finally, a ranking strategy is applied to every cluster to judge which contexts are significant enough for the final output.

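The MMR step described above can be illustrated with a minimal sketch. It assumes bag-of-words cosine similarity and a single query string as the relevance target; the function names (`bow`, `cosine`, `mmr_select`) and the weight `lam` are hypothetical and not taken from the thesis. The segmentation, K-means clustering, and final ranking steps are left out.

```python
import math
from collections import Counter


def bow(sentence):
    """Lowercased bag-of-words vector, stored as a Counter."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, query, k=2, lam=0.5):
    """Greedy MMR: repeatedly pick the sentence that is most relevant to the
    query while being least similar to the sentences already selected."""
    vecs = [bow(s) for s in sentences]
    q = bow(query)
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any selected sentence.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * cosine(vecs[i], q) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With `lam=1.0` the selection reduces to plain relevance ranking. Lower values penalize near-duplicate sentences more heavily, which matters in forum threads where replies often repeat the same point.
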
For type II topics, the algorithm adopts a language-model-based relevance model, which first obtains word-to-word translation statistics and then constructs a relevance model to compute the similarity between two blocks of sentences. The results show that the methods proposed for both tasks outperform their respective baseline systems.
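
The abstract does not give the form of the relevance model, so the following is only a sketch of one common formulation, a translation-based language model, under stated assumptions. The word-to-word statistics are estimated from raw co-occurrence counts in (question, answer) pairs, a crude stand-in for a trained translation model, and the result is smoothed with a background unigram distribution. All names (`train_translation`, `relevance`, `lam`, `background`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_translation(pairs):
    """Estimate P(w | t) from co-occurrence counts, where w is a question word
    and t is an answer word. Assumption: simple counting, not a trained model."""
    counts = defaultdict(Counter)
    for question, answer in pairs:
        for t in answer:
            for w in question:
                counts[t][w] += 1
    return {
        t: {w: c / sum(ws.values()) for w, c in ws.items()}
        for t, ws in counts.items()
    }


def relevance(question, candidate, trans, background, lam=0.2):
    """Log-likelihood of the question words given the candidate block:
    sum over w of log((1 - lam) * sum_t P(w|t) * P(t|candidate) + lam * P_bg(w))."""
    tf = Counter(candidate)
    n = len(candidate)
    score = 0.0
    for w in question:
        p_trans = 0.0
        if n:
            p_trans = sum(trans.get(t, {}).get(w, 0.0) * c / n for t, c in tf.items())
        # Unseen words fall back to a small floor probability (an assumption).
        score += math.log((1 - lam) * p_trans + lam * background.get(w, 1e-6))
    return score
```

A candidate answer block scores higher when its words tend to "translate" into the question's words, even when the two blocks share no vocabulary. That is the usual motivation for translation statistics in question-answer matching.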

Keywords/Search Tags:

Bulletin Board System, information text, abstract extraction, similarity computation, Vector space model

Related items

1	Research Of Public Opinion Information Mining On Bulletin Board Systems Based On Cluster Analysis
2	Semantic Similarity Calculation Text Field Vector Space Model
3	Study On Similarity-based Text Clustering Algorithm And Its Application
4	Text Similarity Computing Theory And Applied Research
5	Online Learning Technology On Abstract Extraction System In Short Text Stream
6	Applications Of Hierarchical Keyword Extraction And Automated Text Classification In Bulletin Board System
7	Research And Implementation Of Text Similarity Algorithm Based On Semantic Fusion
8	The Research About Text Similarity Measuring Through Hamming-Distance And Semantics
9	Information Filtering Systems Based On Web Text Content And Design
10	Research On Text Similarity Algorithm Based On Vector Space Model