The SA-Tweedie model (Kim and Wang 2025) is a global dense vector representation model trained on a Wikipedia dump (Jan. 2021).
The WordPiece tokenizer was applied to the training data, and weighted token-token co-occurrence counts were computed using the formula in Kim and Wang (2025). These co-occurrence counts were then fitted with the SA-Tweedie model to obtain an embedding for each WordPiece token.
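The exact weighting formula is given in Kim and Wang (2025); as a rough illustration only, a distance-weighted co-occurrence count over a symmetric context window can be accumulated as in the sketch below (the window size and the 1/d weighting here are illustrative assumptions, not the paper's formula):

from collections import defaultdict

def weighted_cooccurrence(token_ids, window=10):
    # Accumulate distance-weighted co-occurrence counts over a
    # symmetric window. The 1/d weighting and window size are
    # illustrative assumptions; the actual weighting formula is
    # defined in Kim and Wang (2025).
    counts = defaultdict(float)
    for i, center in enumerate(token_ids):
        for j in range(max(0, i - window), i):
            context = token_ids[j]
            weight = 1.0 / (i - j)   # closer tokens count more
            counts[(center, context)] += weight
            counts[(context, center)] += weight
    return counts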
To download the SA-Tweedie embeddings for WordPiece tokens, click a dimension from the list below:
After downloading an embedding file, use the following Python function to load the embedding into Python:
import numpy as np
import torch

def get_embedding(filename):
    # Load a whitespace-separated embedding file into a dict
    # mapping each WordPiece token to a torch tensor.
    embedding = {}
    with open(filename, 'r', encoding='UTF-8') as f:
        for line in f:
            # Each line holds a token followed by its vector components.
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embedding[word] = torch.tensor(vector)
    return embedding
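For example, given a downloaded file named sa_tweedie_300d.txt (the filename here is hypothetical; use the name of the file you downloaded), a token's vector can be looked up from the returned dictionary:

embedding = get_embedding('sa_tweedie_300d.txt')  # hypothetical filename
vector = embedding['the']   # tensor for the WordPiece token 'the'
print(vector.shape)         # e.g., torch.Size([300]) for a 300-d embedding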