Knowledge Graph Machine Learning
Distillation of Transformer models
[Code] Keyphrase extraction from URL content using semantic and syntactic heuristics
Used spaCy (spacy.io) to extract noun chunks
PageRank on a graph with noun chunks as nodes and edge weights computed from:
Semantic similarity between nodes, using 200-D word vectors generated by word2vec
Syntactic heuristics based on phrase frequency, first occurrence, etc.
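As a rough illustration of the steps above — not the original implementation — the sketch below builds a noun-chunk graph with cosine-similarity edge weights and runs a plain power-iteration PageRank. The phrases and 2-D vectors are hypothetical stand-ins for the 200-D word2vec embeddings:

```python
import math

def cosine(u, v):
    # cosine similarity; returns 0.0 for a zero vector
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pagerank(nodes, weight, d=0.85, iters=50):
    # weighted PageRank by power iteration; weight[(u, v)] is the edge u -> v
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    out = {u: sum(weight.get((u, v), 0.0) for v in nodes if v != u) for u in nodes}
    for _ in range(iters):
        score = {v: (1 - d) / n + d * sum(score[u] * weight.get((u, v), 0.0) / out[u]
                                          for u in nodes if u != v and out[u] > 0)
                 for v in nodes}
    return score

# hypothetical noun chunks with toy 2-D vectors standing in for 200-D word2vec
vecs = {"machine learning": [0.9, 0.1],
        "deep learning": [0.85, 0.2],
        "pizza": [0.1, 0.95]}
nodes = list(vecs)
weights = {(u, v): cosine(vecs[u], vecs[v]) for u in nodes for v in nodes if u != v}
scores = pagerank(nodes, weights)
```

The syntactic heuristics (phrase frequency, first occurrence) could be folded into the edge weights or used to rescale the final PageRank scores.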
Real-time search for ad keywords semantically and commercially relevant to the extracted keyphrases
Locality-sensitive hashing of 50 million 200-D ad-keyword vectors using random projections
Used the Non-Metric Space Library (NMSLIB) for further speed optimization
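A minimal sketch of random-projection LSH with signed-hyperplane signatures — an illustrative assumption about the scheme, with toy 8-D vectors in place of the 200-D ad-keyword vectors and NMSLIB itself omitted:

```python
import random

def lsh_signature(vec, planes):
    # one bit per random hyperplane: the sign of the dot product
    bits = 0
    for p in planes:
        dot = sum(a * b for a, b in zip(vec, p))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

random.seed(0)
DIM, NBITS = 8, 16  # toy sizes; the project used 200-D vectors

planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NBITS)]
vectors = {f"kw{i}": [random.gauss(0, 1) for _ in range(DIM)] for i in range(1000)}

# bucket every ad-keyword vector by its bit signature
buckets = {}
for name, vec in vectors.items():
    buckets.setdefault(lsh_signature(vec, planes), []).append(name)

def candidates(query_vec):
    # vectors at a small angle share a signature with high probability
    return buckets.get(lsh_signature(query_vec, planes), [])
```

In practice one would use several such hash tables and union their buckets to trade recall against lookup cost.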
Ranked the nearest ad keywords by Word Mover's Distance between the keyphrases of the ad keyword and the URL
Distance defined as a function of cosine similarity between word2vec vectors
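Full Word Mover's Distance solves a transportation problem; the sketch below uses the common "relaxed" lower bound instead (each word travels to its nearest counterpart), with cosine distance as the per-word cost. The embeddings, phrases, and ad names are hypothetical:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)

def relaxed_wmd(words_a, words_b, emb):
    # relaxed WMD lower bound: each word of A moves to its nearest word of B
    return sum(min(cosine_distance(emb[a], emb[b]) for b in words_b)
               for a in words_a) / len(words_a)

# hypothetical toy embeddings in place of the 200-D word2vec vectors
emb = {"cheap": [1.0, 0.1], "budget": [0.95, 0.15],
       "flights": [0.2, 1.0], "tickets": [0.25, 0.9],
       "gardening": [-0.8, 0.3]}
url_phrase = ["cheap", "flights"]
ads = {"budget tickets": ["budget", "tickets"],
       "gardening tickets": ["gardening", "tickets"]}
ranked = sorted(ads, key=lambda a: relaxed_wmd(url_phrase, ads[a], emb))
```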
Modified the word2vec code to learn one vector representation per sense per word
Clustered the 'context' vectors of each occurrence of a candidate word to identify its senses
Retrained the word2vec network with each sense vector initialized from its cluster's context vectors
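One way the sense-induction step might look, as a hedged sketch: compute a "context" vector for every occurrence of a candidate word, then cluster those contexts (2-means here) so each centroid can seed one sense vector for retraining. The embeddings, sentences, and the ambiguous word "bank" are illustrative assumptions, not the original data:

```python
def context_vector(tokens, i, emb, window=2):
    # average embedding of the words around position i
    ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    vecs = [emb[t] for t in ctx if t in emb]
    dim = len(next(iter(emb.values())))
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def two_means(points, iters=20):
    # 2-means, deterministically seeded with the first point and its farthest point
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    centers = [points[0], max(points, key=lambda p: sqdist(p, points[0]))]
    for _ in range(iters):
        clusters = ([], [])
        for p in points:
            clusters[0 if sqdist(p, centers[0]) <= sqdist(p, centers[1]) else 1].append(p)
        centers = [[sum(p[d] for p in cl) / len(cl) for d in range(len(cl[0]))]
                   if cl else centers[i] for i, cl in enumerate(clusters)]
    return centers

# illustrative embeddings and sentences; "bank" is the ambiguous candidate word
emb = {"river": [0.0, 1.0], "water": [0.1, 0.9],
       "money": [1.0, 0.0], "loan": [0.9, 0.1]}
sents = [["river", "bank", "water"], ["water", "bank", "river"],
         ["money", "bank", "loan"], ["loan", "bank", "money"]]
contexts = [context_vector(s, s.index("bank"), emb) for s in sents]
sense_seeds = two_means(contexts)  # each centroid initializes one sense vector
```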
Char-RNN models using LSTMs for five error types: insertion, deletion, substitution, split, and concatenation
Factors affecting the likelihood of pruning a candidate in beam search:
Probability of the next character given the input sequence, predicted by the char-RNN models
Models trained on both the original and the reversed Wikipedia dataset
String similarity to the original query, measured with Damerau-Levenshtein distance
Works even for bi-gram queries, e.g. starry xights, fundamental xights, fluorescent xights, where context resolves the same misspelling to nights, rights, and lights respectively
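The string-similarity factor uses Damerau-Levenshtein distance; a standard implementation (the optimal-string-alignment variant, where adjacent transpositions cost 1) looks like this:

```python
def damerau_levenshtein(a, b):
    # optimal string alignment: insert, delete, substitute, and
    # transpose adjacent characters each cost 1
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

For example, "starry xights" is one substitution away from "starry nights", so it would be scored as a close candidate.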
Extracted n-grams from all root-to-leaf paths in the DOM tree
Fast clustering using n-gram-based TF-IDF models after reducing trees to document-like structures
PQ-grams, well suited to hierarchical data, used for higher accuracy
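A toy reconstruction of the clustering idea — assuming trees represented as (tag, children) tuples rather than a real DOM parser: extract tag n-grams along every root-to-leaf path, treat each tree as a bag of n-grams, and compare trees by TF-IDF cosine similarity. The example pages are hypothetical:

```python
import math
from collections import Counter

def root_to_leaf_paths(tree, prefix=()):
    # tree: (tag, [children]); yields each root-to-leaf tag sequence
    tag, children = tree
    path = prefix + (tag,)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)

def path_ngrams(tree, n=2):
    # tag n-grams along every root-to-leaf path become the "document" terms
    grams = Counter()
    for path in root_to_leaf_paths(tree):
        for i in range(len(path) - n + 1):
            grams[path[i:i + n]] += 1
    return grams

def tfidf_cosine(g1, g2, corpus):
    # smoothed IDF over the corpus of trees; cosine of the two TF-IDF vectors
    n_docs = len(corpus)
    def idf(t):
        df = sum(1 for g in corpus if t in g)
        return math.log((1 + n_docs) / (1 + df)) + 1
    terms = set(g1) | set(g2)
    v1 = [g1[t] * idf(t) for t in terms]
    v2 = [g2[t] * idf(t) for t in terms]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# two listing-like pages vs. one article-like page (toy trees)
listing1 = ("html", [("body", [("ul", [("li", [("a", [])]),
                                       ("li", [("a", [])])])])])
listing2 = ("html", [("body", [("ul", [("li", [("a", [])]),
                                       ("li", [("a", [])]),
                                       ("li", [("a", [])])])])])
article = ("html", [("body", [("p", []), ("p", []), ("h1", [])])])
grams = [path_ngrams(t) for t in (listing1, listing2, article)]
```

The two listing pages score as far more similar to each other than to the article page, which is what lets a standard document clusterer group structurally similar pages.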