Our Results

Measurements before Debiasing

Percentage of analogies voted as biased vs. non-biased from the before debiasing group by voter A
Percentage of analogies voted as biased vs. non-biased from the before debiasing group by voter B

Measurements after Debiasing

Percentage of analogies voted as biased vs. non-biased from the after debiasing group by voter A
Percentage of analogies voted as biased vs. non-biased from the after debiasing group by voter B

Analysis of Measurements

We attempted to minimize confirmation bias by randomly shuffling together the analogies generated before and after debiasing, so that we could not tell which group an analogy came from while casting our votes. The shuffled lists of analogies, labeled with our votes, can be viewed in the repository at reference (7).
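The shuffling step can be sketched roughly as follows. This is a minimal illustration and not the script we actually used: the function name `build_blind_ballot`, the fixed seed, and the example analogies are all hypothetical.

```python
import random

def build_blind_ballot(before, after, seed=0):
    """Merge both groups and shuffle them, keeping each analogy's origin in a separate key."""
    labeled = [(a, "before") for a in before] + [(a, "after") for a in after]
    rng = random.Random(seed)  # fixed seed (an assumption) so the order is reproducible
    rng.shuffle(labeled)
    ballot = [analogy for analogy, _ in labeled]  # what the voters see
    key = [group for _, group in labeled]         # kept hidden until voting ends
    return ballot, key

# Hypothetical analogies, for illustration only.
before = ["she:he :: nurse:doctor", "she:he :: grandma:grandpa"]
after = ["she:he :: sister:brother", "she:he :: queen:king"]

ballot, key = build_blind_ballot(before, after, seed=42)
```

Keeping the group labels in a separate list means the ballot can be handed to a voter without revealing which analogies came from the debiased embedding.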

The graphs above show the percentage of analogies from the before- and after-debiasing groups that each of us voted as biased or non-biased. Because this is a subjective, qualitative analysis, our individual judgments may differ. However, comparing the before and after measurements, our votes indicate that the percentage of biased analogies decreased after debiasing. This suggests that the hard-debiasing algorithm reduced gender stereotypes in the Wikipedia word embedding, which is consistent with the original paper's result on the Google News embedding.
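For readers who want to reproduce this kind of tally, the per-voter percentages could be computed along these lines. This is a sketch under assumptions: the `(voter, group, is_biased)` record layout and the function name `biased_percentages` are hypothetical, and the votes below are made up for illustration, not our actual data.

```python
from collections import Counter

def biased_percentages(votes):
    """Map each (voter, group) pair to the percentage of analogies voted as biased."""
    totals = Counter()
    biased = Counter()
    for voter, group, is_biased in votes:
        totals[(voter, group)] += 1
        if is_biased:
            biased[(voter, group)] += 1
    return {pair: 100.0 * biased[pair] / totals[pair] for pair in totals}

# Made-up votes for a single voter, for illustration only.
votes = [
    ("A", "before", True), ("A", "before", True),
    ("A", "before", False), ("A", "before", False),
    ("A", "after", True), ("A", "after", False),
    ("A", "after", False), ("A", "after", False),
]

result = biased_percentages(votes)
```

Comparing `result[("A", "before")]` with `result[("A", "after")]` gives the before/after difference for that voter; the non-biased percentage is simply 100 minus the biased one.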

While casting our votes, we noticed that some of the analogies did not make sense and could not be considered gender-appropriate analogies. In such cases, we still marked these analogies as appropriate (that is, non-biased), since they did not actually contain gender bias. The original experiment noted that the debiasing algorithm preserves gender-appropriate analogies (e.g. an analogy such as grandma-grandpa should not be removed after debiasing). We did not investigate whether all the sensible, appropriate analogies that existed before debiasing also existed after debiasing, as we were only trying to confirm whether the algorithm reduced gender bias.