Text Subtopic-field Segmentation And Unsupervised Feature Extraction

Posted on: 2010-04-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X F Wang
GTID: 1118360302965850
Subject: Computational Mathematics
Abstract/Summary:
With the rapid development and spread of the Internet, online information resources keep growing, and society has moved from the information age into an age of abundant digital information. Faced with this deluge of online resources, it has become difficult to find the information one actually needs quickly and efficiently. How to organize, manage, and use such information rationally and effectively has therefore gradually become an important topic in information processing. Traditional information processing methods rely mainly on manual classification and selection: professionals analyze the content of web pages and assign each page to one or more appropriate categories. With the rapid growth in the volume of Web information, this manual approach has clearly become impractical.

Text clustering is a powerful tool for organizing and managing information. It can help bring order to the currently chaotic state of information on the Internet and make it easier for users to locate the information they need accurately. Continued study of text clustering is therefore necessary. Text clustering has become an increasingly important research area, and in combination with search engines and information filtering technologies it is gradually becoming an important means of obtaining web-based information.

Text clustering is a classic problem in natural language processing. To turn text clustering into a general pattern recognition problem, several problems need to be solved. First, a multi-topic text should be divided into a number of single-topic sub-topic fields. Then appropriate feature units can be selected according to the characteristics of natural language, and their weights can be calculated and sorted. Finally, the feature units can be clustered using one of a number of clustering strategies.
To address the problems in current sub-topic field segmentation and feature extraction, the main contributions of this dissertation are as follows:

1. The text representation model is studied. The semantic quantum is defined from the key elements of text characteristics and is divided into obvious quanta and latent quanta according to their contribution to expressing the topic and the concept. An obvious quantum directly indicates the topic of the text, while a latent quantum expresses the details of the text through co-occurrence within an effective area. An improved vector space model is used to strengthen the structural expression of obvious quanta, and an improved word-series model is used to strengthen the structural expression of latent quanta. On this basis, a new text representation model based on the topic and the concept is established.

2. A subtopic-field segmentation technique based on an optimal control model is proposed. The basic assumption is that the best segmentation is the one in which the distances and angles within each subtopic-field are small and the distances and angles between subtopic-fields are large. The objective function of the optimal control model is constructed from the within-subtopic-field distance, the between-subtopic-field distance, the within-subtopic-field angle, and the between-subtopic-field angle. Solving the optimal control model yields the optimal subtopic-field segmentation. The method is a global optimization method that is independent of specific applications, so it can be applied both to specific applications and to Internet information retrieval and processing.

3. A new unsupervised feature extraction model based on the text conceptual model is presented. First, the weight of each obvious quantum is computed from the obvious quantum entanglement intensity. Next, the weight of each latent quantum is computed from the window function of the latent quantum. Finally, the obvious quantum feature sequence and the latent quantum feature sequence are obtained by sorting the respective weights. If the goal is only to categorize the text sets, the categories can be obtained by clustering the obvious quantum features. To reflect the details of the categories, the latent quantum features must then be clustered on the basis of the clustering results for the obvious quanta. In practice, features can be selected according to different needs, which greatly reduces both the computational complexity and the redundancy between features.
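The segmentation criterion described in item 2 (small distances and angles within a subtopic-field, large ones between subtopic-fields) can be illustrated with a small sketch. This is not the dissertation's optimal control model, whose exact formulation is not given in the abstract. It assumes bag-of-words sentence vectors, a fixed number of segments `k` (at least 2), an unweighted sum of the four terms, and an exhaustive search over cut positions. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def to_vector(sentence, vocab):
    """Bag-of-words count vector over a fixed vocabulary (an assumption)."""
    counts = Counter(sentence.lower().split())
    return [float(counts[w]) for w in vocab]


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle(u, v):
    """Angle in radians between two vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return math.pi / 2  # treat an all-zero vector as orthogonal
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))


def score(vectors, cuts):
    """Between-segment spread minus within-segment spread (assumed objective)."""
    bounds = [0, *cuts, len(vectors)]
    segments = [vectors[a:b] for a, b in zip(bounds, bounds[1:])]
    centers = [centroid(s) for s in segments]
    members = [(v, c) for s, c in zip(segments, centers) for v in s]
    within_d = sum(distance(v, c) for v, c in members) / len(members)
    within_a = sum(angle(v, c) for v, c in members) / len(members)
    pairs = list(zip(centers, centers[1:]))  # adjacent segments only
    between_d = sum(distance(a, b) for a, b in pairs) / len(pairs)
    between_a = sum(angle(a, b) for a, b in pairs) / len(pairs)
    return (between_d + between_a) - (within_d + within_a)


def best_segmentation(sentences, k):
    """Return the cut positions that maximize score() for k contiguous segments."""
    if not 2 <= k <= len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vectors = [to_vector(s, vocab) for s in sentences]
    candidates = combinations(range(1, len(vectors)), k - 1)
    return max(candidates, key=lambda cuts: score(vectors, cuts))
```

Calling `best_segmentation(sentences, k)` returns a tuple of cut positions, where a cut at position `i` starts a new subtopic-field at sentence `i`. Exhaustive search is only practical for short texts. It is used here because it finds the global optimum of the assumed objective, in the same spirit as the global method described above.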
Keywords/Search Tags: Text clustering, Chinese text, Sub-topic field, Feature extraction, Weighting