Text-Image Alignment in Diffusion Models: The Role of Attention Sink
This research lies in machine learning and artificial intelligence, at the intersection of computer vision, natural language processing, and the interaction between textual and visual data.
Text-to-image diffusion models can now synthesize highly realistic images, yet their outputs often deviate from the intended prompt: objects disappear, attributes are swapped, and syntactic relations are ignored. This thesis first diagnoses the problem. By visualizing cross-attention, it shows that spatial overlap between the cross-attention maps of different tokens tracks object-loss and attribute-binding errors. An analytical treatment reveals that both pathologies stem from an attention sink: excessive focus on special tokens collapses the key embeddings, making semantically unrelated tokens indistinguishable to the model.
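To make the collapse mechanism concrete, the following minimal sketch (purely illustrative; the tensor shapes, the shared "sink" direction, and the overlap metric are assumptions, not the thesis code) contrasts well-separated token keys with keys collapsed toward a common direction. When keys collapse, every token's spatial attention map becomes nearly identical, which is exactly the overlap pathology described above.

```python
import torch

torch.manual_seed(0)
n_pix, n_tok, d = 64, 8, 32                     # hypothetical patch/token/head sizes
Q = torch.randn(n_pix, d)                       # image-patch queries

def mean_map_overlap(K):
    """Mean pairwise cosine similarity between per-token spatial attention maps."""
    A = torch.softmax(Q @ K.T / d**0.5, dim=-1) # cross-attention, (n_pix, n_tok)
    M = A / A.norm(dim=0, keepdim=True)         # unit-norm spatial map per token
    S = M.T @ M                                 # (n_tok, n_tok) map similarities
    off_diag = ~torch.eye(n_tok, dtype=torch.bool)
    return S[off_diag].mean().item()

K_healthy = torch.randn(n_tok, d)               # well-separated token keys
sink = torch.randn(1, d)                        # a shared "sink" direction
K_collapsed = sink + 0.1 * torch.randn(n_tok, d)  # keys collapsed toward the sink

print("overlap, healthy keys:  ", mean_map_overlap(K_healthy))
print("overlap, collapsed keys:", mean_map_overlap(K_collapsed))  # close to 1.0
```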
Guided by this insight, this thesis introduces T-SAM (Text-Self-Attention-guided Modulation), a training-free, inference-time algorithm. T-SAM computes the self-attention matrix of the text encoder and lightly adjusts a few denoising steps so that the cross-attention similarity matrix matches this richer linguistic structure. The method adds no parameters and is architecture-agnostic.
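The abstract does not spell out the update rule. As one plausible, hypothetical reading (an Attend-and-Excite-style latent optimization), the sketch below nudges the latent with a gradient step so that the pairwise similarity of the per-token cross-attention maps approaches the text encoder's self-attention. The names `tsam_step`, `unet_cross_attn`, and `text_self_attn`, and the step size `lr`, are invented for illustration; the actual T-SAM procedure may differ.

```python
import torch
import torch.nn.functional as F

def tsam_step(latent, t, text_self_attn, unet_cross_attn, lr=0.1):
    """One guidance step at denoising step t: nudge the latent so the
    similarity structure of per-token cross-attention maps matches the
    text encoder's self-attention.

    text_self_attn:  (n_tok, n_tok) prompt-token self-attention (assumed hook).
    unet_cross_attn: callable(latent, t) -> (n_pix, n_tok) cross-attention
                     maps pulled from the denoiser (assumed hook).
    """
    latent = latent.detach().requires_grad_(True)
    A = unet_cross_attn(latent, t)                   # (n_pix, n_tok)
    M = A / (A.norm(dim=0, keepdim=True) + 1e-8)     # unit-norm token maps
    cross_sim = M.T @ M                              # (n_tok, n_tok) map similarity
    loss = F.mse_loss(cross_sim, text_self_attn)     # match linguistic structure
    loss.backward()
    with torch.no_grad():
        latent = latent - lr * latent.grad           # training-free latent update
    return latent.detach()

# Toy usage with random stand-ins for the two hooks:
n_pix, n_tok, d = 64, 8, 32
W = torch.randn(d, n_tok)
cross = lambda z, t: torch.softmax(z @ W / d**0.5, dim=-1)
S_text = torch.softmax(torch.randn(n_tok, n_tok), dim=-1)
z = tsam_step(torch.randn(n_pix, d), t=0, text_self_attn=S_text, unet_cross_attn=cross)
```

Because the update touches only the latent at a few early denoising steps, it is consistent with the abstract's claims of adding no parameters and requiring no retraining.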
Experiments on Stable Diffusion v1.5 and PixArt-α validate the approach. On the 4,000-prompt TIFA benchmark, T-SAM raises question-answering accuracy from 0.79 to 0.83 while slightly improving CLIP image-text similarity. On structured "Attend-and-Excite" prompts it equals or surpasses state-of-the-art parser-based baselines, and a 25-participant user study shows that participants prefer T-SAM images.
In summary, this thesis formally links the attention sink to semantic drift, proposes a principled, lightweight remedy requiring no retraining, and demonstrates consistent fidelity gains across models and prompt styles, paving the way for more reliable and controllable text-to-image generation.
Degree Type
- Master of Science
Department
- Electrical and Computer Engineering
Campus location
- West Lafayette