Grounding text-to-image diffusion models for controlled high-quality image generation

Date

2025

Publisher

Türk-Alman Üniversitesi, Fen Bilimler Enstitüsü

Access Rights

info:eu-repo/semantics/openAccess

Abstract

Large-scale text-to-image (T2I) diffusion models have emerged as the new state-of-the-art image generative models, demonstrating outstanding performance in synthesizing diverse, high-quality visuals from natural language text captions. Image generative models have a wide range of applications, including content creation, image editing, and medical imaging. Although simple and powerful, a text prompt alone is insufficient for tailoring the generation process to produce customized outputs. To address this limitation, multiple layout-to-image models have been developed to control the generation process by utilizing a broad array of layouts such as segmentation maps, edges, and human keypoints. In this thesis, we propose ObjectDiffusion, a model that builds on top of cutting-edge image generative frameworks to seamlessly extend T2I models with object names along with their corresponding bounding boxes. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pre-trained parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP score of 27.4, an AP50 of 46.6, an AP75 of 28.2, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets in the AP50, AR, and FID metrics. ObjectDiffusion demonstrates a distinctive capability to synthesize diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control inputs. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities in closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to integrate multiple grounding entities of different sizes and locations. The results of the ablation studies highlight the efficacy of our proposed weight initialization in harnessing pre-training knowledge to enhance the performance of the conditional model.
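To make the grounding mechanism described in the abstract concrete, below is a minimal PyTorch sketch of the GLIGEN-style condition processing that ObjectDiffusion adopts: each object name's text embedding is fused with a Fourier embedding of its bounding box into a grounding token, which is then injected into the visual stream through a gated self-attention layer whose gate is initialized to zero, so the pre-trained backbone's behavior is preserved when fine-tuning begins. This is an illustrative reconstruction, not the thesis code; the module names, dimensions, and hyperparameters (`GroundingTokenizer`, `GatedSelfAttention`, 768-d tokens, 8 Fourier frequencies) are assumptions.

```python
# Illustrative sketch (hypothetical names/dims) of GLIGEN-style grounding.
import math
import torch
import torch.nn as nn

def fourier_embed(x: torch.Tensor, n_freqs: int = 8) -> torch.Tensor:
    """Map normalized (x1, y1, x2, y2) boxes to sin/cos Fourier features."""
    freqs = 2.0 ** torch.arange(n_freqs, device=x.device) * math.pi
    ang = x.unsqueeze(-1) * freqs                      # (B, N, 4, n_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

class GroundingTokenizer(nn.Module):
    """Fuse an object-name embedding with its box into one grounding token."""
    def __init__(self, text_dim=768, n_freqs=8, token_dim=768):
        super().__init__()
        box_dim = 4 * 2 * n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + box_dim, 512), nn.SiLU(),
            nn.Linear(512, token_dim),
        )

    def forward(self, name_emb, boxes):
        # name_emb: (B, N, text_dim); boxes: (B, N, 4) in [0, 1]
        return self.mlp(torch.cat([name_emb, fourier_embed(boxes)], dim=-1))

class GatedSelfAttention(nn.Module):
    """Attend jointly over visual and grounding tokens; the zero-initialized
    gate (tanh(0) = 0) leaves the pre-trained pathway untouched at step 0."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual, grounding):
        # visual: (B, HW, dim); grounding: (B, N, dim)
        x = self.norm(torch.cat([visual, grounding], dim=1))
        out, _ = self.attn(x, x, x)
        # Keep only the visual positions; blend them in through the gate.
        return visual + self.gate.tanh() * out[:, : visual.size(1)]
```

The zero-initialized tanh gate is what allows such a conditional branch to be grafted onto pre-trained weights without initially degrading them, which is consistent with the weight-initialization benefit reported in the ablation studies.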

Keywords

Stable Diffusion, Controlled Image Generative Models, Conditional Image Generative Models, Grounded Image Generative Models, Text-to-Image Models, Image Generation, Generative Models, Diffusion Models

Citation

Süleyman, A. (2025). Grounding text-to-image diffusion models for controlled high-quality image generation. Türk-Alman Üniversitesi, Fen Bilimler Enstitüsü.
