Insight Data Science

During my time at Insight, I worked on a consulting project with wallet.AI, a San Francisco-based startup that builds intelligent machines to help people make better financial decisions.

When making a substantial financial decision, like buying a house or a car, most people do a fair amount of research into the pros and cons of the available options. However, we all make everyday decisions about our finances that we rarely think about, like getting a coffee every morning or taking a cab to the airport. Spending time or effort thinking about such decisions is often not worth it, so we may end up making sub-optimal choices. Imagine, for example, that a coffee place just a block away from your usual one serves the same coffee for a fraction of the cost. Although minor, such savings can add up over time. This is one way that wallet.AI could use data to help its users make decisions more intelligently.

I was interested in the problem of the perceived value of goods to consumers and what makes two vendors similar to them. In the example above, some people may go to the same coffee place every day because it's cheaper, but others may go there because they like the barista. Even though this is unrelated to the coffee being served, it may still add value for the consumer and guide their decision-making process.

To start, my project focused on developing an algorithm that matches pairs of vendors that are very similar. The hypothesis was that similarity would predict "substitutability" for consumers. If it didn't, we could still gain some insight into what factors make vendors similar but not substitutable, and vice versa.

Algorithm:

The algorithm consists of three main stages. The first is to build the database, augmenting the data with external content. Specifically, I make calls to the Yelp and Google Places APIs to get information about categorization, reviews, menus, and text blurbs. I also scrape content from Wikipedia and the vendor's website, if these exist.

The second stage is to generate useful features from this database, which form the basis of the similarity metric. To generate these features, I use term frequency-inverse document frequency (tf-idf), a measure of how important a word is to a given document (in this case, all the text on a given vendor). The tf-idf value of a word increases proportionally with the number of times it appears in the document and is offset by how frequently the word appears across the corpus. This adjusts for words that appear frequently in general (e.g., "the", "of").
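
To make the weighting concrete, here is a minimal sketch of tf-idf in Python. It is not the code used in the project: the function name tf_idf, the sample vendor text, and the particular variant (relative term frequency multiplied by the natural log of the inverse document frequency, with no smoothing) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical text gathered for three vendors (one "document" each).
vendor_text = {
    "cafe_a": "the coffee and the espresso bar".split(),
    "cafe_b": "the coffee and the pastry shop".split(),
    "garage": "the tire and the brake shop".split(),
}
weights = tf_idf(list(vendor_text.values()))
```

In this toy corpus, "the" appears in every document, so its weight is zero, while "espresso" (found in only one vendor's text) is weighted more heavily than "coffee" (found in two).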

Finally, the similarity between vendors is computed. For all pairs of vendors, the algorithm calculates the cosine similarity, which is the cosine of the angle between two feature vectors in an inner product space. Each vendor's vector has one component per feature extracted in the previous step, weighted by its tf-idf score.
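
As an illustration of this last stage, the sketch below computes the cosine similarity for every pair of vendors and ranks the pairs. The vendor names and weights are made up, and storing each vector as a sparse {word: weight} dictionary is an assumption for the example rather than a description of the project's actual code.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an empty vector
    return dot / (norm_u * norm_v)


# Hypothetical tf-idf vectors for three vendors.
vendors = {
    "cafe_a": {"coffee": 0.5, "espresso": 0.4, "pastry": 0.1},
    "cafe_b": {"coffee": 0.4, "latte": 0.3, "pastry": 0.2},
    "garage": {"tire": 0.6, "brake": 0.5},
}

# Score every unordered pair of vendors, then sort from most to least similar.
scores = {
    (a, b): cosine_similarity(vendors[a], vendors[b])
    for a, b in combinations(vendors, 2)
}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Here the two cafes share weighted terms and end up as the top-ranked pair, while either cafe paired with the garage scores zero because their vectors have no terms in common.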

Validation:

Our initial hypothesis was that the similarity score from the algorithm could be a predictive indicator of substitutability for a user. The goal of the algorithm is therefore really to understand human behavior and how two vendors can provide the same perceived value to a user.

In order to test this, early adopters of wallet.AI were presented with two vendors with high similarity scores. They were asked to rate, on a scale of 0-10, to what extent the two vendors "sold the same items" and "fulfilled the same need" for them.

For the first question, 77% of the responses had a score of 5 or higher. This suggests that the algorithm does quite well at identifying similar vendors that sell the same items. This question acts as a control to make sure that the algorithm is in fact identifying vendors that are highly similar.

For the second question (whether the vendors "fulfilled the same need" for users), 90% of the responses had a score of 5 or higher. This suggests that the algorithm can identify not only similar vendors, but also vendors that fulfill the same need for users.

Intuitively, we had expected the algorithm to perform better on the "same items" question than on the "same need" question. However, the results from this initial validation showed otherwise, giving us insight into how people perceive the value of different vendors. For example, two restaurants that sell different types of food do not necessarily sell the same items, yet they could still provide the same value to a consumer.

Acknowledgments:

I am grateful to Immanuel Buder, Boris Fedorov, and Omar Green of wallet.AI for their advice and support.

Slides:

Please see the attached slides for more details.