The SA-Tweedie model (Kim and Wang 2025) is a global dense vector representation model trained on a Wikipedia dump (Jan. 2021).
The WordPiece tokenizer was applied to the training data, and weighted token-token co-occurrence counts were computed using the formula in Kim and Wang (2025). These co-occurrence counts were then fitted with the SA-Tweedie model to obtain an embedding for each WordPiece token.
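The exact weighting formula is given in Kim and Wang (2025); as a rough illustration only, a distance-weighted co-occurrence count over a symmetric context window can be accumulated as in the sketch below (the window size and the 1/d weighting here are illustrative assumptions, not the paper's formula):

from collections import defaultdict

def weighted_cooccurrence(token_ids, window=10):
    # Accumulate distance-weighted co-occurrence counts over a
    # symmetric window. The 1/d weighting and window size are
    # illustrative assumptions; the actual weighting formula is
    # defined in Kim and Wang (2025).
    counts = defaultdict(float)
    for i, center in enumerate(token_ids):
        for j in range(max(0, i - window), i):
            context = token_ids[j]
            weight = 1.0 / (i - j)   # closer tokens count more
            counts[(center, context)] += weight
            counts[(context, center)] += weight
    return counts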
To download the SA-Tweedie embeddings for WordPiece tokens, click a dimension from the list below:
After downloading an embedding file, use the following Python function to load the embedding into Python:
import numpy as np
import torch

def get_embedding(filename):
    # Load a whitespace-separated embedding file into a dict
    # mapping each WordPiece token to a torch tensor.
    embedding = {}
    with open(filename, 'r', encoding='UTF-8') as f:
        for line in f:
            # Each line holds a token followed by its vector components.
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embedding[word] = torch.tensor(vector)
    return embedding
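For example, given a downloaded file named sa_tweedie_300d.txt (the filename here is hypothetical; use the name of the file you downloaded), a token's vector can be looked up from the returned dictionary:

embedding = get_embedding('sa_tweedie_300d.txt')  # hypothetical filename
vector = embedding['the']   # tensor for the WordPiece token 'the'
print(vector.shape)         # e.g., torch.Size([300]) for a 300-d embedding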