StackOverflow Tag Predictor
Stack Overflow - Tag Prediction
Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.
Suggest the tags based on the content that was there in the question posted on Stackoverflow.
Youtube : https://youtu.be/nNDqbUhtIRg
Research paper : https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL
Business Objectives and Constraints
Predict as many tags as possible with high precision and recall.
Incorrect tags could impact customer experience on StackOverflow.
No strict latency constraints.
Type of Machine Learning Problem
It is a multi-label classification problem
Multi-label Classification: Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A question on Stackoverflow might be about any of C, Pointers, FileIO and/or memory-management at the same time or none of these.
Micro-Averaged F1-Score (Mean F Score) :
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
In the multi-class and multi-label case, this is the weighted average of the F1 score of each class.
‘Micro f1 score’:
Calculate metrics globally by counting the total true positives, false negatives and false positives. This is a better metric when we have class imbalance.
‘Macro f1 score’:
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
Hamming loss :
The Hamming loss is the fraction of labels that are incorrectly predicted.