Grounding text-to-image diffusion models for controlled high-quality image generation

dc.contributor.authorSüleyman, Ahmad
dc.date.accessioned2025-11-13T06:31:17Z
dc.date.available2025-11-13T06:31:17Z
dc.date.issued2025
dc.departmentTAÜ, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü
dc.description.abstractLarge-scale text-to-image (T2I) diffusion models have emerged as the new state-of-the-art image generative models, demonstrating outstanding performance in synthesizing diverse, high-quality visuals from natural language text captions. Image generative models have a wide range of applications, including content creation, image editing, and medical imaging. Although simple and powerful, a text prompt alone is insufficient for tailoring the generation process to produce customized outputs. To address this limitation, multiple layout-to-image models have been developed to control the generation process using a broad array of layouts such as segmentation maps, edges, and human keypoints. In this thesis, we propose ObjectDiffusion, a model that builds on top of cutting-edge image generative frameworks to seamlessly extend T2I models with object names and their corresponding bounding boxes. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pre-trained parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP score of 27.4, an AP50 of 46.6, an AP75 of 28.2, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets in the AP50, AR, and FID metrics. ObjectDiffusion demonstrates a distinctive capability to synthesize diverse high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control inputs. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities in closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to integrate multiple grounding entities of different sizes and locations. The results of the ablation studies highlight the efficacy of our proposed weight initialization in harnessing the pre-training knowledge to enhance the performance of the conditional model.
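
As background for the abstract's mention of GLIGEN-style condition processing and injection, the following minimal PyTorch sketch illustrates how object labels and bounding boxes can be turned into grounding tokens and fused into a denoiser's visual tokens through a zero-initialized gated self-attention layer, so the pre-trained model's behavior is preserved at the start of fine-tuning. It follows the published GLIGEN mechanism in spirit only; every module name, dimension, and toy input below is an illustrative assumption, not code from the thesis.

import torch
import torch.nn as nn

def fourier_encode(boxes: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Encode (N, 4) normalized xyxy boxes with sin/cos Fourier features."""
    freqs = 2.0 ** torch.arange(num_freqs, device=boxes.device)   # (F,)
    angles = boxes.unsqueeze(-1) * freqs                          # (N, 4, F)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)       # (N, 4, 2F)
    return feats.flatten(1)                                       # (N, 8F)

class GroundingTokenizer(nn.Module):
    """Fuse a label text embedding with its Fourier-encoded box into one
    grounding token per object (dimensions are assumptions)."""
    def __init__(self, text_dim: int = 768, num_freqs: int = 8, dim: int = 320):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + 8 * num_freqs, dim), nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, label_emb: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([label_emb, fourier_encode(boxes)], dim=-1))

class GatedSelfAttention(nn.Module):
    """Visual tokens attend over [visual; grounding] tokens; the residual is
    scaled by a learnable tanh gate initialized at zero, so the pre-trained
    weights initially dominate and grounding is learned gradually."""
    def __init__(self, dim: int = 320, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gate

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        n_vis = visual.shape[1]
        tokens = self.norm(torch.cat([visual, grounding], dim=1))
        out, _ = self.attn(tokens, tokens, tokens)
        return visual + torch.tanh(self.gate) * out[:, :n_vis]

# Toy usage: two grounded objects and one flattened U-Net feature map.
tokenizer = GroundingTokenizer()
fusion = GatedSelfAttention()
labels = torch.randn(2, 768)                     # stand-ins for label embeddings
boxes = torch.tensor([[0.1, 0.2, 0.5, 0.8],      # normalized xyxy coordinates
                      [0.6, 0.5, 0.9, 0.9]])
grounding = tokenizer(labels, boxes).unsqueeze(0)  # (1, 2, 320)
visual = torch.randn(1, 64 * 64, 320)              # (1, tokens, channels)
visual = fusion(visual, grounding)                 # grounded visual tokens
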
dc.identifier.citationSüleyman, A. (2025). Grounding text-to-image diffusion models for controlled high-quality image generation. Türk-Alman Üniversitesi, Fen Bilimleri Enstitüsü.
dc.identifier.urihttps://hdl.handle.net/20.500.12846/2096
dc.language.isoen
dc.publisherTürk-Alman Üniversitesi, Fen Bilimleri Enstitüsü
dc.relation.publicationcategoryTez
dc.rightsinfo:eu-repo/semantics/openAccess
dc.subjectStable Diffusion
dc.subjectKontrollü görüntü üretken modeli
dc.subjectÜretken modeller
dc.subjectDifüzyon modelleri
dc.subjectGörüntü üretimi
dc.subjectMetinden görüntü üretimi
dc.subjectTemellendirilmiş görüntü üretimi
dc.subjectControlled Image Generative Models
dc.subjectConditional Image Generative Models
dc.subjectGrounded Image Generative Models
dc.subjectText-to-Image Models
dc.subjectImage Generation
dc.subjectGenerative Models
dc.subjectDiffusion Models
dc.titleGrounding text-to-image diffusion models for controlled high-quality image generation
dc.title.alternativeKontrollü yüksek kaliteli görüntü üretimi için metin-görüntü difüzyon modellerinin koşullandırılması
dc.typeMaster Thesis

Files

Original bundle
Name: Ahmad_SÜLEYMAN_216107002.pdf
Size: 36.82 MB
Format: Adobe Portable Document Format