Distance geometry for word embeddings and applications
Many methods for the treatment of sequential data rely on the construction of intermediate representations of unitary entities (e.g., words in natural language, or k-mers in bioinformatics). Traditionally, these representations are constructed with optimization formulations arising either from co-occurrence-based models or from encoders combined with attention mechanisms. In this work, we propose a new formulation to embed these entities based on the Distance Geometry Problem (DGP): given a subset of the pairwise distances between objects, find positions for the objects in a Euclidean space. Considering the empirical Pointwise Mutual Information (PMI) as an approximation of the inner products between embeddings, we discuss two algorithms to obtain approximate solutions of the underlying Euclidean DGP on large instances. The main advantage of our approach is its significantly lower computational complexity, which allows us to train representations much faster with negligible quality loss. Furthermore, our experiments demonstrate that these properties are valuable in settings where pre-trained models are not available, for example in low-resource natural language processing or for protein sequence embeddings in bioinformatics.
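To make the PMI-as-inner-product idea concrete, the sketch below builds a (positive) PMI matrix from co-occurrence counts and recovers low-dimensional embeddings by factorizing it as a Gram matrix via a truncated eigendecomposition. This is a minimal illustrative sketch of the underlying principle, assuming a dense symmetric co-occurrence matrix; the helper names (`ppmi_matrix`, `gram_embeddings`) and the toy data are hypothetical, and the classical eigendecomposition stands in for the two large-scale DGP algorithms, which the abstract names but does not detail.

```python
import numpy as np

def ppmi_matrix(cooc, eps=1e-12):
    """Positive PMI from a symmetric co-occurrence count matrix (illustrative)."""
    total = cooc.sum()
    p_xy = cooc / total                       # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)     # marginal probabilities
    pmi = np.log((p_xy + eps) / (p_x @ p_x.T + eps))
    return np.maximum(pmi, 0.0)               # clip negatives (PPMI)

def gram_embeddings(pmi, dim):
    """Rank-`dim` embeddings X such that X @ X.T approximates the PMI matrix."""
    vals, vecs = np.linalg.eigh(pmi)          # symmetric eigendecomposition
    idx = np.argsort(vals)[::-1][:dim]        # indices of the top eigenvalues
    vals = np.maximum(vals[idx], 0.0)         # keep the positive-semidefinite part
    return vecs[:, idx] * np.sqrt(vals)

# Toy usage: 4 "words" with synthetic, symmetrized co-occurrence counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 20, size=(4, 4))
counts = (counts + counts.T) // 2
X = gram_embeddings(ppmi_matrix(counts), dim=2)
print(X)                                       # one 2-d vector per "word"
```

The Gram-matrix factorization shown here costs a full eigendecomposition, which is what DGP-based approximations such as those discussed in the paper are designed to avoid on large instances.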