Grounding text-to-image diffusion models for controlled high-quality image generation
| dc.contributor.author | Süleyman, Ahmad |
| dc.date.accessioned | 2025-11-13T06:31:17Z |
| dc.date.available | 2025-11-13T06:31:17Z |
| dc.date.issued | 2025 |
| dc.department | TAÜ, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü |
| dc.description.abstract | Large-scale text-to-image (T2I) diffusion models have emerged as the new state-of-the-art image generative models, demonstrating outstanding performance in synthesizing diverse high-quality visuals from natural language text captions. Image generative models have a wide range of applications, including content creation, image editing, and medical imaging. Although simple and powerful, a text prompt alone is insufficient for tailoring the generation process to produce customized outputs. To address this limitation, multiple layout-to-image models have been developed to control the generation process using a broad array of layouts such as segmentation maps, edges, and human keypoints. In this thesis, we propose ObjectDiffusion, a model that builds on top of cutting-edge image generative frameworks to seamlessly extend T2I models with object names and their corresponding bounding boxes. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pre-trained parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP score of 27.4, an AP50 of 46.6, an AP75 of 28.2, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets in the AP50, AR, and FID metrics. ObjectDiffusion demonstrates a distinctive capability to synthesize diverse high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control inputs. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities in closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to integrate multiple grounding entities of different sizes and at different locations. The results of the ablation studies highlight the efficacy of our proposed weight initialization in harnessing the pre-training knowledge to enhance the performance of the conditional model. |
| dc.identifier.citation | Süleyman, A. (2025). Grounding text-to-image diffusion models for controlled high-quality image generation. Türk-Alman Üniversitesi, Fen Bilimler Enstitüsü. |
| dc.identifier.uri | https://hdl.handle.net/20.500.12846/2096 |
| dc.language.iso | en | |
| dc.publisher | Türk-Alman Üniversitesi, Fen Bilimler Enstitüsü | |
| dc.relation.publicationcategory | Tez | |
| dc.rights | info:eu-repo/semantics/openAccess | |
| dc.subject | Stable Diffusion | |
| dc.subject | Kontrollü görüntü üretken modeli | |
| dc.subject | Üretken modeller | |
| dc.subject | Difüzyon modelleri | |
| dc.subject | Görüntü üretimi | |
| dc.subject | Metinden görüntü üretimi | |
| dc.subject | Temellendirilmiş görüntü üretimi | |
| dc.subject | Controlled Image Generative Models | |
| dc.subject | Conditional Image Generative Models | |
| dc.subject | Grounded Image Generative Models | |
| dc.subject | Text-to-Image Models | |
| dc.subject | Image Generation | |
| dc.subject | Generative Models | |
| dc.subject | Diffusion Models | |
| dc.title | Grounding text-to-image diffusion models for controlled high-quality image generation |
| dc.title.alternative | Kontrollü yüksek kaliteli görüntü üretimi için metin-görüntü difüzyon modellerinin koşullandırılması |
| dc.type | Master Thesis |