Tell Them Apart: Distilling Technology Differences from Crowd-Scale Comparison Discussion

Similar technology evaluation: see here.

The first and second columns are two technologies, and the third column is the label indicating whether the two technologies are comparable.
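
As a minimal sketch of how such a file could be loaded, assuming a CSV layout and the hypothetical file name `similar_technology_evaluation.csv` (the actual file name and format in this repository may differ):

```python
import pandas as pd

# Hypothetical file name; the three columns follow the description above:
# two technology names plus a label saying whether they are comparable.
df = pd.read_csv("similar_technology_evaluation.csv",
                 names=["technology_1", "technology_2", "comparable"])
print(df.head())
```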

Comparative sentence evaluation: see here.

We first randomly sample 300 sentences (50 for each comparative sentence pattern) extracted by our model and manually check whether each sampled sentence is indeed a comparative sentence.
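
A minimal sketch of this sampling step, assuming the extracted sentences are available as (pattern, sentence) pairs; the function name, sample size default, and seed are illustrative:

```python
import random
from collections import defaultdict

def sample_per_pattern(extracted, per_pattern=50, seed=42):
    """Randomly draw a fixed number of extracted sentences for each
    comparative sentence pattern, for later manual checking."""
    random.seed(seed)
    by_pattern = defaultdict(list)
    for pattern, sentence in extracted:
        by_pattern[pattern].append(sentence)
    return {p: random.sample(s, min(per_pattern, len(s)))
            for p, s in by_pattern.items()}
```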

Cluster evaluation: see here.

As there is no ground truth for clustering comparative sentences, we ask two Master students to manually build a small-scale ground truth. We randomly sample 15 pairs of comparable technologies with different numbers of comparative sentences. For each technology pair, the two students read each comparative sentence and individually create several clusters for these comparative sentences. Note that some comparative sentences are unique, with no similar comparative sentence; we put all such sentences into one cluster. The two students then discuss the clustering results with the Ph.D. student and adjust the clusters accordingly. Finally, they reach an agreement on 12 pairs of comparable technologies.

Given the ground truth clusters, we use the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), homogeneity, completeness, V-measure, and the Fowlkes-Mallows Index to evaluate the clustering results.
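
All six metrics are available in scikit-learn; the sketch below (the function name and toy label lists are our own, not from the paper) shows how predicted cluster assignments can be scored against the ground truth:

```python
from sklearn import metrics

def evaluate_clustering(labels_true, labels_pred):
    """Score predicted cluster assignments against the ground-truth clusters."""
    return {
        "ARI": metrics.adjusted_rand_score(labels_true, labels_pred),
        "NMI": metrics.normalized_mutual_info_score(labels_true, labels_pred),
        "Homogeneity": metrics.homogeneity_score(labels_true, labels_pred),
        "Completeness": metrics.completeness_score(labels_true, labels_pred),
        "V-measure": metrics.v_measure_score(labels_true, labels_pred),
        "Fowlkes-Mallows": metrics.fowlkes_mallows_score(labels_true, labels_pred),
    }

# Toy example: each integer is the cluster id of one comparative sentence.
print(evaluate_clustering([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]))
```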

We use two baselines: traditional TF-IDF with K-means, and a document-to-vector deep learning model (i.e., Doc2vec) with K-means.
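
A rough sketch of these two baselines, assuming scikit-learn and gensim; the hyperparameters (vector size, epochs, number of clusters) are illustrative and not necessarily the settings used in the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def tfidf_kmeans(sentences, k):
    """Baseline 1: TF-IDF vectors clustered with K-means."""
    vectors = TfidfVectorizer().fit_transform(sentences)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)

def doc2vec_kmeans(sentences, k):
    """Baseline 2: Doc2vec sentence embeddings clustered with K-means."""
    tokens = [s.lower().split() for s in sentences]
    tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokens)]
    model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=40)
    vectors = [model.infer_vector(t) for t in tokens]
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)

# Toy usage with a handful of comparative sentences:
sentences = ["postgresql is faster than mysql for complex queries",
             "mysql is easier to set up than postgresql",
             "eclipse uses more memory than intellij",
             "intellij starts slower than eclipse"]
print(tfidf_kmeans(sentences, 2))
print(doc2vec_kmeans(sentences, 2))
```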

Usefulness evaluation: see here.

We use the names of comparable technologies together with several keywords such as compare, vs, and difference to search for questions on Stack Overflow. We then manually check which of them are truly about comparing comparable technologies, and randomly sample five questions that discuss comparable technologies in different categories and have at least five answers.

We then ask two Master students to read each sentence in all answers and cluster the sentences into several clusters that represent developers' opinions on different aspects. To make the data as valid as possible, they again first carry out the clustering individually and then reach an agreement after discussion. For each comparative opinion in the answers, we manually check whether that opinion also appears in the knowledge base of comparative sentences extracted by our method. To keep this study fair, our method does not extract comparative sentences from the answers of the questions used in this experiment.