
Research On Text Information Mining Technology Of Outdated Answers For Stack Overflow

Posted on: 2022-08-28
Degree: Master
Type: Thesis
Country: China
Candidate: S J Guo
GTID: 2518306332453434
Subject: Computer technology
Abstract/Summary:
Technical Q&A websites for programmers have become an important knowledge-sharing platform. Stack Overflow (SO) is a popular programmer Q&A website, with tens of millions of users, hundreds of millions of posts, and a large number of valuable software projects. The defining feature of technical knowledge is that it is updated frequently, so the knowledge shared on the website may become outdated over time. If this outdated information is not clearly marked or recorded, it may mislead users seeking help and cause development accidents. The accumulation of outdated content seriously degrades the quality of the website, yet Stack Overflow currently has no effective mechanism for dealing with the problem.
Deep learning algorithms can help solve this problem. As more and more decisions are handed over to deep learning algorithms, model interpretability has become a key factor in whether users trust a model's judgments. The end-to-end, black-box design of deep learning models means that users can neither understand the basis of a decision nor verify its reliability. The attention mechanism improves interpretability: it computes which parts of a sample the model attends to, assigning high weights to the relevant parts and low weights to irrelevant ones. However, although attention-based salient feature extraction performs well in interpretability studies, it can be applied only to some models. In classification models such as fastText and TextCNN, text position information is lost in the model structure, so the attention mechanism cannot be applied, or can be applied only to a limited extent. This paper proposes a new salient feature visualization method for the fastText and TextCNN models. In addition, an LSTM model with an attention mechanism is used to study attention-based visualization. Visualization provides a foundation for further interpretability research.
Based on the text classification task in natural language processing, this paper studies text information mining technology for outdated answers, taking Stack Overflow as the setting for outdated knowledge mining.
(1) Obtaining the data set: This paper downloads the Stack Overflow data set from the Stack Exchange Data Dump website. After extensive observation and analysis, and taking into account the features of outdated content and the characteristics of the data, answers are selected as the objects from which outdated knowledge is extracted.
(2) Data screening rules: The research uses heuristic methods to define rules for obsolete and non-obsolete data, and extracts obsolete data from a data set of tens of millions of items.
(3) Data cleaning: The paper preprocesses the data according to the characteristics of Stack Overflow data. The research extracts outdated data samples (542,511 items in total, with an accuracy of 98% for the outdated data), analyzes the data quantitatively and qualitatively, explores the potential connection between labeling and timeliness, and classifies outdated knowledge.
(4) Model improvement and application: To address the limited accuracy of the outdated data, the experiments use interpretable models to evaluate the effectiveness of the training results. To this end, the paper proposes a new salient feature visualization method for the non-sequence models fastText and TextCNN. In addition, the attention weight parameters are extracted from a bi-LSTM model with an attention mechanism, and a visualization method based on the attention mechanism is studied.
(5) Analysis of experimental results: The experiments produce a set of interpretable models that can identify outdated answers. The method uses the outdated keywords that play a decisive role in model identification, and compares the model's identification results with the visually annotated features to determine the accuracy of the model's judgment and to explain the reasons for its misjudgments. Finally, the paper compares the performance of the three models.
The results show that the rule-based data extraction method can accurately obtain outdated data. The three models can effectively annotate outdated features, and their judgments are consistent with the annotated feature information. Research on text information mining technology for outdated Stack Overflow answers will help improve the quality of Stack Overflow community content and help users identify outdated information. Finally, this paper recommends that Stack Overflow develop this method to encourage the entire online community to maintain answers.
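
The attention-based visualization described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the per-token scores below are hypothetical stand-ins for the values a trained attention layer would produce, and the helper names `attention_weights` and `top_tokens` are invented for this example. The sketch only shows how a softmax turns raw scores into weights, and how the highest-weighted tokens could be picked out as candidate outdated keywords.

```python
import math


def attention_weights(scores):
    """Normalize raw per-token scores into weights that sum to 1 (softmax)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_tokens(tokens, scores, k=2):
    """Return the k (token, weight) pairs with the largest attention weight."""
    weights = attention_weights(scores)
    ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical answer text and hypothetical raw scores (assumed, not from the thesis).
tokens = ["this", "method", "is", "deprecated", "since", "version", "3"]
scores = [0.1, 0.4, 0.0, 2.5, 0.8, 1.2, 0.3]

highlighted = top_tokens(tokens, scores, k=2)
for token, weight in highlighted:
    print(f"{token}: {weight:.3f}")
```

In this toy input the highest weights fall on "deprecated" and "version", the kind of keywords a reader could compare against the model's outdated or non-outdated judgment.
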
Keywords/Search Tags: Stack Overflow, Obsolete knowledge, Attention mechanism, Interpretability, Text classification