Purdue University Graduate School

Text-Image Alignment in Diffusion Models: The Role of Attention Sink

thesis
posted on 2025-05-08, 18:44 authored by Erfan Esmaeili Fakhabi

My research is in the field of machine learning and artificial intelligence. It concerns computer vision, natural language processing, and the interaction between text and visual data.

Text-to-image diffusion models can now synthesize highly realistic pictures, yet their outputs often deviate from the intended prompt: objects disappear, attributes are swapped, and syntactic relations are ignored. This thesis first diagnoses the problem. By visualising cross-attention, it shows that spatial overlap between cross-attention maps tracks object-loss and attribute-binding errors. An analytical treatment reveals that both pathologies stem from an attention sink: excessive focus on special tokens collapses key embeddings, making unrelated tokens indistinguishable to the model.
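The collapse described above can be illustrated with a toy calculation. The sketch below is not the thesis's analysis. It assumes a single softmax attention layer, three hand-picked value vectors, and a hypothetical `sink_bias` logit that every token assigns to a special token at index 0. As that bias grows, the outputs for two unrelated tokens become almost parallel.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(logits, values):
    """Row-wise softmax attention: each output is a weighted mean of `values`."""
    dim = len(values[0])
    outputs = []
    for row in logits:
        weights = softmax(row)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


# Toy values (an assumption for illustration): index 0 is the special token,
# indices 1 and 2 are two unrelated content tokens.
VALUES = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]


def token_similarity(sink_bias):
    """Similarity of the two content tokens' outputs for a given sink logit."""
    logits = [
        [sink_bias, 0.0, 0.0],
        [sink_bias, 2.0, 0.0],  # token 1 otherwise prefers itself
        [sink_bias, 0.0, 2.0],  # token 2 otherwise prefers itself
    ]
    outputs = attend(logits, VALUES)
    return cosine(outputs[1], outputs[2])


if __name__ == "__main__":
    for bias in (0.0, 4.0, 8.0):
        print(f"sink_bias={bias}: similarity={token_similarity(bias):.4f}")
```

With no sink bias the two content tokens stay clearly distinct. With a large bias both outputs are dominated by the special token's value, which is the sense in which the tokens become hard to tell apart.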

Guided by this insight, this thesis introduces T-SAM (Text-Self-Attention-guided Modulation), a training-free, inference-time algorithm. T-SAM computes the self-attention matrix of the text encoder and lightly adjusts a few denoising steps so that the cross-attention similarity matrix matches this richer linguistic structure. The method adds no parameters and is architecture-agnostic.
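One way to picture the matching objective is as a discrepancy between two token-by-token matrices. The sketch below is a minimal illustration under stated assumptions, not the thesis's formulation: it assumes cosine similarity between flattened cross-attention maps, a symmetrised text self-attention matrix as the target, and a mean squared error over off-diagonal pairs. The names `cross_attention_similarity` and `alignment_loss` are hypothetical. In an actual pipeline such a quantity would presumably drive a small update during selected denoising steps; here it is only evaluated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def cross_attention_similarity(maps):
    """Token-by-token similarity of flattened cross-attention maps.

    `maps[i]` is token i's attention over spatial positions.
    """
    return [[cosine(mi, mj) for mj in maps] for mi in maps]


def alignment_loss(maps, text_attn):
    """Mean squared gap between map similarity and text self-attention.

    Assumptions: the target is the symmetrised `text_attn`, and only
    off-diagonal token pairs contribute.
    """
    sim = cross_attention_similarity(maps)
    n = len(maps)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            target = 0.5 * (text_attn[i][j] + text_attn[j][i])
            total += (sim[i][j] - target) ** 2
            count += 1
    return total / count if count else 0.0


# Toy inputs: the text encoder treats the two tokens as weakly related.
TEXT_ATTN = [[0.9, 0.1], [0.1, 0.9]]
OVERLAPPING = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]  # same region
SEPARATED = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]    # disjoint regions

if __name__ == "__main__":
    print("overlapping:", alignment_loss(OVERLAPPING, TEXT_ATTN))
    print("separated:  ", alignment_loss(SEPARATED, TEXT_ATTN))
```

In this toy setup the overlapping maps score a much larger loss than the separated ones, which matches the intuition that tokens the text encoder treats as distinct should not attend to the same image region.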


Experiments on Stable Diffusion v1.5 and PixArt-α validate the approach. On the 4,000-prompt TIFA benchmark, T-SAM raises question-answer accuracy from 0.79 to 0.83 while slightly improving CLIP image-text similarity. On structured “Attend-and-Excite” prompts it equals or surpasses state-of-the-art parser-based baselines, and a 25-participant study shows users prefer T-SAM images.


In summary, this thesis formally links attention sink to semantic drift, proposes a principled, lightweight remedy requiring no retraining, and demonstrates consistent fidelity gains across models and prompt styles, paving the way for more reliable and controllable text-to-image generation.

Degree Type

  • Master of Science

Department

  • Electrical and Computer Engineering

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Qiang Qiu

Additional Committee Member 2

Fengqing Maggie Zhu

Additional Committee Member 3

Xiaoqian Wang
