Words vs Tokens - AI Embedding Explorer

Words

3D Embedding Space

Tokens

Analyzed Passage

Interpretation & Findings

This interactive 3D visualization explores the spatial relationships between word embeddings (magenta markers) and 5-character sub-word token embeddings (cyan markers) derived from the analyzed text passage. Each word and token is mapped to a vector using the GloVe model, and the vectors are then reduced to three dimensions via PCA. Proximity in this 3D space suggests that the underlying vectors are similar. Dashed purple lines connect pairs of words, and dotted grey lines connect pairs of tokens, whenever the pair falls below a distance threshold.
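
The reduce-then-link steps described above can be sketched roughly as follows. This is a minimal illustration, not the page's actual code: the random matrix stands in for real 50-dimensional vectors, and the function names `pca_project` and `nearby_pairs` and the threshold value of 2.0 are assumptions made for the example.

```python
import numpy as np

def pca_project(vectors, n_components=3):
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def nearby_pairs(points, threshold):
    """Return index pairs (i, j) with i < j whose Euclidean distance is below threshold."""
    pairs = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if np.linalg.norm(points[i] - points[j]) < threshold:
                pairs.append((i, j))
    return pairs

# Placeholder data: 20 random vectors standing in for real 50-d embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 50))

coords = pca_project(embeddings)            # shape (20, 3), ready to plot
links = nearby_pairs(coords, threshold=2.0)  # pairs that would get a connecting line
```

Running the same two functions separately on the word vectors and on the token vectors would give the two sets of links; whether the page fits one PCA for both sets or one per set is not stated.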

Hypothesis: Word embeddings and token embeddings capture language structure differently. Words may represent broader semantic concepts learned from diverse usage. Tokens, as substrings, might reflect relationships influenced more by local character patterns and orthographic similarity.

Observations & Interpretation:

Semantic Clustering of Words (Magenta): Distinct groupings based on meaning are visible. For example, words central to the passage's theme like sunlight, raindrops, rainbow, light, colors, prism, beautiful, and atmosphere occupy a related region. Words related to shape and path (shape, round, arch, path) also show proximity. Function words (the, a, in, is, its, they, these, with) cluster differently based on grammatical role, typical for word embeddings. The dashed purple lines often connect these semantically related neighbors, illustrating learned meaning from broad context.
Token Proximity & Orthography (Cyan): Tokens frequently appear near their "parent" words (e.g., sunli near sunlight). However, the token space is densely packed because many words share 5-character sequences (such as the token light, which appears in both sunlight and white light, or ision in division). This overlap pulls tokens together based on shared character patterns (orthographic similarity), even when the parent words are semantically distant.
Structural Differences (Lines): The different patterns formed by the word links (dashed purple) and token links (dotted grey) highlight the distinct nature of the information captured. The dense web of dotted grey token-token links shows relationships driven heavily by shared characters – many tokens are "close" simply because they look similar. In contrast, the sparser dashed purple word-word links connect words based more on semantic relatedness within the passage or learned global context.
Word vs. Token Meaning Representation: These observations are consistent with the hypothesis. Word positions reflect meanings learned from usage across many contexts. Token positions are influenced more heavily by local character patterns and by the co-occurrence statistics of the fragments themselves. Neither representation is supervised in the machine-learning sense: both are derived from unlabeled text statistics, and they differ in the unit being embedded. The visualization shows that these related but distinct representations coexist in the same embedding space.
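
The character overlap behind the dense token links can be made concrete with a small sketch. It assumes the 5-character tokens are taken with a sliding window over each word, which matches the examples given (sunli and light from sunlight) but is not confirmed by the page; the helper names and the extra example word vision are hypothetical.

```python
from collections import defaultdict

def char_ngrams(word, n=5):
    """All n-character substrings of a word (assumed sliding-window tokenization)."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def shared_tokens(words, n=5):
    """Map each token to its parent words, keeping only tokens with more than one parent."""
    parents = defaultdict(set)
    for word in words:
        for token in char_ngrams(word, n):
            parents[token].add(word.lower())
    return {tok: sorted(ws) for tok, ws in parents.items() if len(ws) > 1}

# "vision" is an illustrative addition, not a word from the analyzed passage.
words = ["sunlight", "light", "division", "vision"]
overlap = shared_tokens(words)

for token, parent_words in overlap.items():
    print(token, "<-", parent_words)
```

Tokens with several parents are the ones most likely to sit between otherwise unrelated words, which is one plausible source of the crowded grey links.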

Dimensionality Reduction Caveat: Remember, this is a 3D projection (via PCA) of a 50-dimensional space. PCA keeps the three directions of greatest variance, so the broad layout survives, but some nuance is lost. Because a projection can only shrink distances, points that appear far apart in 3D were at least that far apart in 50D, whereas points that appear close in 3D may still be well separated along the discarded dimensions.
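
The caveat can be checked numerically. The sketch below uses random placeholder vectors rather than real embeddings, and the variable names are illustrative; it compares every pairwise distance before and after a PCA projection to three dimensions and reports how much variance the three kept components explain.

```python
import numpy as np

# Placeholder data: 30 random vectors standing in for 50-d embeddings.
rng = np.random.default_rng(1)
high = rng.normal(size=(30, 50))

# PCA via SVD: center, then project onto the top three principal directions.
centered = high - high.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
low = centered @ vt[:3].T

def pairwise(x):
    """Matrix of Euclidean distances between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_high = pairwise(high)
d_low = pairwise(low)

# Projection never lengthens a distance (small tolerance for rounding).
never_longer = bool(np.all(d_low <= d_high + 1e-9))

# Fraction of total variance retained by the three kept components.
explained = float((s[:3] ** 2).sum() / (s ** 2).sum())

print("3D distances never exceed 50D distances:", never_longer)
print("variance explained by 3 components:", round(explained, 3))
```

For random data the retained variance is small, so many pairs look far closer in 3D than they are; real embeddings usually have more structure, but the same one-sided guarantee applies.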

This visualization provides a glimpse into spatial representations of meaning. Use your mouse to rotate, zoom, and hover over points to explore specific relationships!