Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence

İlker Işık, Ramazan Gokberk Cinbis, Ebru Aydin Gol

ICML 2025

For interchangeable tokens, our embedding matrix combines a common learnable part for weight sharing and a randomly generated part for distinguishability.

Abstract

We propose a novel approach for learning interchangeable tokens in language models to obtain an extendable vocabulary that can generalize to new tokens. Our method addresses alpha-equivalence, the principle that renaming bound variables preserves semantics. This property arises in many formal languages such as temporal logics, where all proposition symbols represent the same concept but remain distinct. To handle such tokens, we develop a dual-part embedding approach. The first part is shared across all interchangeable tokens, enforcing that they represent the same core concept. The second part is randomly generated for each token, enabling distinguishability. As a baseline, we consider a simpler approach that uses alpha-renaming for data augmentation. We also present alpha-covariance, a metric for measuring robustness against alpha-conversions. When evaluated in a Transformer encoder-decoder model for solving linear temporal logic formulae and copying with extendable vocabulary, our method demonstrates promising generalization capabilities as well as a favorable inductive bias for alpha-equivalence.

Alpha-Equivalence

In many formal languages, variables can be renamed without changing the meaning. This transformation is called alpha-renaming, and the resulting formulas are alpha-equivalent. Since all variables share the same underlying concept, we aim to train a model that can generalize to larger variable vocabularies.
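As a concrete illustration, the sketch below applies a random renaming of atomic propositions to a tokenized LTL formula; the input and output formulas are alpha-equivalent. The helper (`alpha_rename`) is written for this page rather than taken from the paper's code, and the alpha-renaming baseline uses this kind of transformation for data augmentation.

```python
import random

def alpha_rename(tokens, propositions):
    """Rename atomic propositions in a tokenized formula via a random
    permutation; the result is alpha-equivalent to the input.
    (Function and argument names are illustrative, not the paper's API.)"""
    mapping = dict(zip(propositions, random.sample(propositions, len(propositions))))
    return [mapping.get(t, t) for t in tokens]

# "G ( a -> F b )" may become, e.g., "G ( c -> F a )": a different surface
# form, but the same property up to renaming of the atomic propositions.
formula = ["G", "(", "a", "->", "F", "b", ")"]
print(alpha_rename(formula, ["a", "b", "c", "d"]))
```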

Motivation

In many application areas, such as linear temporal logic, obtaining data with a larger vocabulary is significantly more expensive. It is therefore desirable to train a model with an extendable vocabulary that can generalize to unseen tokens at test time. To the best of our knowledge, this problem has not been investigated before.

Scaling plot: time required to solve a linear temporal logic formula using classical algorithms.

Our Method

Since all interchangeable tokens share the same learnable weights in our method, new tokens can easily be added at test time by generating their random parts.
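A minimal PyTorch sketch of this dual-part embedding is given below. It is illustrative rather than the paper's actual implementation: the class and argument names (InterchangeableEmbedding, d_common, d_random), the convention that interchangeable token ids follow the regular ones, and the way the random parts are sampled are assumptions made for this example.

```python
import torch
import torch.nn as nn

class InterchangeableEmbedding(nn.Module):
    """Dual-part embedding sketch (illustrative, not the paper's exact code).

    Every interchangeable token shares one learnable vector (the common part)
    and gets its own fixed random vector (the random part); the two are
    concatenated to form the token embedding. Regular tokens use an ordinary
    learnable embedding of the full width.
    """

    def __init__(self, num_regular, num_interchangeable, d_common, d_random):
        super().__init__()
        d_model = d_common + d_random
        self.regular = nn.Embedding(num_regular, d_model)
        # One shared, learnable vector for all interchangeable tokens.
        self.common = nn.Parameter(torch.randn(d_common))
        # Fixed random identifiers, one per interchangeable token.
        self.register_buffer(
            "random_part", torch.randn(num_interchangeable, d_random)
        )
        self.num_regular = num_regular

    def extend(self, num_new):
        """Add new interchangeable tokens at test time by sampling new
        random parts; no additional training is needed."""
        extra = torch.randn(num_new, self.random_part.size(1),
                            device=self.random_part.device)
        self.random_part = torch.cat([self.random_part, extra], dim=0)

    def forward(self, token_ids):
        # Assumes interchangeable token ids come after the regular ones.
        is_regular = token_ids < self.num_regular
        out = torch.zeros(*token_ids.shape, self.regular.embedding_dim,
                          device=token_ids.device)
        out[is_regular] = self.regular(token_ids[is_regular])
        inter_ids = token_ids[~is_regular] - self.num_regular
        shared = self.common.expand(inter_ids.size(0), -1)
        out[~is_regular] = torch.cat(
            [shared, self.random_part[inter_ids]], dim=-1)
        return out
```

Here, `extend` appends freshly sampled random parts for new interchangeable tokens, while the shared learnable part and the rest of the model stay unchanged.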

Details

Please see the full paper for more details.

Experiments

We perform experiments on three problems:

  1. Copying with extendable vocabulary
  2. Solving linear temporal logic
  3. Predicting assignments for propositional logic

Our method achieves flawless performance on the first task, extrapolating to a vocabulary size of 160 after being trained on only 20 unique characters. It also combines well with positional encoding methods that enable length generalization (e.g., RoPE). Further details are given in the paper.
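For concreteness, a toy generator for the copy task might look as follows; the vocabulary is a free parameter, so the test-time vocabulary can be made much larger than the training one (the 20-versus-160 setting mentioned above). Token names and sequence lengths here are illustrative and need not match the exact data-generation settings used in the paper.

```python
import random
import string

def make_copy_example(vocab, min_len=5, max_len=20):
    """Toy copy-task instance: the target is an exact copy of the source."""
    seq = random.choices(vocab, k=random.randint(min_len, max_len))
    return seq, list(seq)

train_vocab = list(string.ascii_lowercase[:20])   # 20 unique characters
test_vocab = [f"tok{i}" for i in range(160)]      # extended test-time vocabulary
src, tgt = make_copy_example(test_vocab)
assert src == tgt
```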

Models


Embedding matrices of the methods evaluated in the experiments. Interchangeable tokens are at the bottom of each matrix, separated by a small gap. The full vocabulary baseline requires a different training dataset with an extended vocabulary. Note that the alpha-renaming baseline is also introduced by our work, as a more straightforward approach to enforcing alpha-equivalence.

Experimental Results

Heatmaps visualizing the ratio of correct predictions on a special test set, for the LTL solving (a, top) and propositional logic (b, bottom) tasks. The brightness of the color depends on the sample size, with full brightness representing 100 samples. The dashed white box marks the boundaries of the training dataset. AP stands for "Atomic Proposition", i.e., variable.