False Positives. Manual inspection of the posts reveals that most of the false positives are due to: 1) Mis-recognizing an indication as an ADR, i.e. an illness for which the drug has been prescribed is recognized as an adverse drug reaction (Chowdhury et al., 2018). For instance, in the two posts “I started effexor after having pretty severe post-partum depression” and “depression hurts cymbalta can help”, depression is labeled as an ADR even though it is an indication. However, depression commonly occurs as an ADR in other posts, which might be the cause of this error (Chowdhury et al., 2018); 2) Ignoring negation. As an example, the word manic in “The only one that didn’t make me manic, Wellbrutin” and vomiting in “@uclaibd I never had bleeding or vomiting just a lot of fatigue” are detected as ADRs because of the structure of the posts; the model was not able to take the negation (didn’t, never) into account; 3) Mis-labeling ADR-related words as an ADR. For instance, in the post “temperature would start to rise, depression weakens” the word depression was recognized as an ADR; 4) Mistakes in the manual annotation of the test data. For instance, in the tweet “Ive had no appetite since I started on prozac”, the annotators did not annotate no appetite as an ADR. Our model correctly predicted it as an ADR, but because of this annotation error the prediction is counted as a false positive.
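To make the fourth category concrete, the following minimal sketch shows how a correct prediction ends up counted as a false positive when the gold annotation is missing. It assumes exact, case-insensitive matching of ADR spans within a single post; the function name `span_errors` and the matching criterion are illustrative assumptions, not the evaluation script used in our experiments.

```python
def span_errors(gold, predicted):
    """Compare gold and predicted ADR spans for one post.

    Assumption: spans are matched by exact, case-insensitive string
    equality. This is a simplification for illustration only.
    """
    gold_set = {g.lower() for g in gold}
    pred_set = {p.lower() for p in predicted}
    return {
        "true_positives": sorted(gold_set & pred_set),
        "false_positives": sorted(pred_set - gold_set),
        "false_negatives": sorted(gold_set - pred_set),
    }


# Category 4: the annotators left "no appetite" unlabeled, so the
# model's (correct) prediction is scored as a false positive.
print(span_errors(gold=[], predicted=["no appetite"]))

# If the gold annotation had included the span, the same prediction
# would be scored as a true positive.
print(span_errors(gold=["no appetite"], predicted=["no appetite"]))
```

Under this scoring, annotation errors in the test data lower the measured precision without reflecting an actual mistake by the model.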
False Negatives. False negatives are likely to occur in posts that are ambiguous or overly complex. For example, in the post “Im just wondering if its safe to take tramadol 15h after vyanse and if promethazine and melatonin would lower my chances of a seizure” the word seizure was not detected as an ADR. It should be noted that, in this specific case, even the human annotators debated whether seizure is indeed an ADR of tramadol or an indication of Vyvanse. In another example, “Am I the only one that grinds the shit out of their teeth on Vyvanse”, the expression grinds the shit out of their teeth is a long description of the slang ADR teeth grind, expressed in a very unstructured and informal way. Such expressions are hard to handle for phrase detectors like CRF or BLSTM-RNN, as some level of abstraction would be necessary to map them to the underlying ADR.
True Positives. VAE was able to detect terms such as "tiiiiired", "zombieish" and "stomach hurt", which were not detected by the other methods. In general, VAE is good at detecting unigrams and bigrams even with a small amount of data; however, it is still not able to recognise long phrases such as "may not switch your brain", "3 days of hell" and "electronic shocks in your brain", which are not detected by the other comparison methods either.