Tag

Image Gen

96 articles archived under #image-gen · RSS

arXiv — Machine Learning research 1mo ago

Moment Matching Q-Learning

arXiv:2605.29033v1. Abstract: Score-based and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions, and have been extensively deployed in tasks ranging from image generation to reinforcement learning.…

31
Hugging Face Daily Papers research 1mo ago

GenClaw: Code-Driven Agentic Image Generation

Abstract GenClaw presents a code-driven agentic image generation framework that enables precise visual construction through conceptualization, sketching, and coloring stages, integrating programmatic logic with generative models. AI-generated summary Image generation models have…

8
r/LocalLLaMA community 1mo ago

Qwen/Qwen-Image-Bench · Hugging Face

Model Description Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of text-to-image generated images. Given a text prompt and a generated image, the model evaluates the image on fine-grained quality criteria organized in a 3-level hierarchy…

8
arXiv — NLP / Computation & Language research 1mo ago

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

arXiv:2605.27374v1. Abstract: Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical…

26
arXiv — NLP / Computation & Language research 1mo ago

PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

arXiv:2605.27545v1. Abstract: Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple…

38
Hugging Face Daily Papers research 1mo ago

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

Abstract A 20B-parameter masked region diffusion model enables scalable multi-layer transparent image generation and editing through unified task handling and efficient canvas management. AI-generated summary Layered image generation and editing is a fundamental capability that…

21
arXiv — Machine Learning research 1mo ago

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

arXiv:2605.26491v1. Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary…

10
arXiv — Machine Learning research 1mo ago

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

arXiv:2605.26582v1. Abstract: Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently balance sampling efficiency and sample quality. In this work, we present a systematic study of…

8
arXiv — Machine Learning research 1mo ago

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

arXiv:2605.26632v1. Abstract: Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly…

27
Hugging Face Daily Papers research 1mo ago

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Abstract A novel approach conditions diffusion models on multimodal large language models for subject-driven image generation, combining text and reference image encoding with VAE-based identity conditioning to improve both semantic understanding and identity preservation.…

7
Hugging Face Daily Papers research 1mo ago

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

Abstract Diffusion Transformers achieve strong image generation performance but face high inference costs; this work proposes RT-Lynx, which uses activation sparsification and optimized CUDA kernels to accelerate inference while maintaining generation quality. AI-generated…

27
r/LocalLLaMA community 1mo ago

PrismML just released Binary and Ternary Bonsai Image 4B: 1-bit/ternary text-to-image diffusion transformers that can even run 100% locally in your browser on WebGPU.

The PrismML team really cooked with these models. They're only ~3GB in size (compared to FLUX.2 Klein 4B, which is ~16GB). Apache-2.0! Official collection on HF: https://huggingface.co/collections/prism-ml/bonsai-image Link to demo:…

11
Hugging Face Daily Papers research 1mo ago

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

Abstract Visual Concept Fusion enables dual text and image conditioning in diffusion models through feature alignment and fusion strategies without requiring retraining. AI-generated summary Text-to-image diffusion models like Stable Diffusion generate high-quality images from…

35
Hugging Face Daily Papers research 1mo ago

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Abstract RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences. AI-generated summary Recent advances in few-step diffusion distillation have…

30
Hugging Face Daily Papers research 1mo ago

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Abstract Lens is a compact 3.8B-parameter text-to-image model achieving superior performance with reduced training compute through dense caption datasets, multi-resolution batching, efficient architecture, and optimization techniques. AI-generated summary We introduce Lens, a…

19
Hugging Face Daily Papers research 1mo ago

ETCHR: Editing To Clarify and Harness Reasoning

Abstract A novel image editing approach called ETCHR is introduced that decouples visual reasoning from image generation, improving multimodal language model performance across multiple visual reasoning tasks through a two-stage training process. AI-generated summary Multimodal…

7
Hugging Face Daily Papers research 1mo ago

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Abstract AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation while improving generation quality in downstream tasks.…

36
Hugging Face Daily Papers research 1mo ago

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Abstract Discrete autoregressive text-to-image models suffer from latent covariate shift during policy optimization, which RankE addresses through end-to-end co-evolution of policy and decoder components. AI-generated summary Discrete autoregressive (AR) text-to-image (T2I)…

9
Hugging Face Daily Papers research 1mo ago

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Abstract SEGA improves high-resolution text-to-image generation by adaptively scaling attention across RoPE components based on spatial-frequency structure during denoising steps. AI-generated summary Diffusion transformers (DiTs) have emerged as a dominant architecture for…

33
Hugging Face Daily Papers research 1mo ago

GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

Abstract A self-evolving image generation framework uses tool-orchestrated trajectories and visual experience distillation to improve generative capabilities through iterative learning and reference-based prompting. AI-generated summary Open-ended image generation is no longer a…

19
Smol AI News news-outlet 1mo ago

not much happened today

**RAEv2** advances representation-first tokenization with **>10x faster convergence** and improved generation, tested on **text-to-image** and **world models**. **NVIDIA's Gated DeltaNet-2** innovates linear attention with channel-wise gates, outperforming **KDA** and…

23
Hugging Face Daily Papers research 1mo ago

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

Abstract OcclusionFormer addresses inter-object occlusion challenges in layout-to-image generation by modeling explicit Z-order priority through diffusion transformers and volume rendering techniques. AI-generated summary Recent layout-to-image models have achieved remarkable…

36
Hugging Face Daily Papers research 1mo ago

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Abstract A large-scale UHR image-text dataset and evaluation benchmark are introduced to advance ultra-high-resolution text-to-image generation capabilities. AI-generated summary Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the…

29
r/LocalLLaMA community 1mo ago

bytedance released an open source model that attempts to do just about anything with only 3b parameters

Lance is a lightweight native unified multimodal model that supports image and video understanding, generation, and editing within a single framework. Efficient at 3B scale. With only 3B active parameters, Lance delivers strong performance across image generation, image…

32
arXiv — Machine Learning research 1mo ago

Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

arXiv:2605.16259v1. Abstract: While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted…

32
Hugging Face Daily Papers research 1mo ago

Efficient Image Synthesis with Sphere Latent Encoder

Abstract A decoupled framework for few-step image generation that improves efficiency and performance by separating pixel-space operations from latent denoising training. AI-generated summary Few-step image generation has seen rapid progress, with consistency and meanflow-based…

15
Hugging Face Daily Papers research 1mo ago

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Abstract InsightTok improves discrete visual tokenization for better text and face reconstruction through content-aware perceptual losses, enhancing autoregressive image generation quality. AI-generated summary Text and faces are among the most perceptually salient and…

12
Hugging Face Daily Papers research 1mo ago

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Abstract Geodesic flow matching improves image generation by projecting latents onto fixed radius spheres and using spherical linear interpolation instead of linear paths, preserving semantic content through angular components. AI-generated summary Latent flow matching for image…

26
Hugging Face Daily Papers research 1mo ago

Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

Abstract Realiz3D addresses the domain gap between synthetic renders and real images in 3D-consistent image generation by decoupling visual domain from control signals through residual adapters and layer-specific denoising strategies. AI-generated summary We often aim to…

19
Hugging Face Daily Papers research 1mo ago

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Abstract A closed-loop visual reasoning framework integrates visual-language planning with diffusion generation to improve complex image synthesis while addressing latency and optimization challenges. AI-generated summary Despite rapid advancements, current text-to-image (T2I)…

14
Hugging Face Daily Papers research 1mo ago

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Abstract Synthetic layered image data improves graphic design decomposition by enabling scalable training and better layer distribution control compared to traditional methods. AI-generated summary Recent advances in image generation have made it easy to produce high-quality…

35
r/LocalLLaMA community 1mo ago

Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline

Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality). ~45 minutes end-to-end on a single AMD…

13
Hugging Face Daily Papers research 1mo ago

Asymmetric Flow Models

Abstract Asymmetric Flow Modeling enables efficient high-dimensional flow-based generation by restricting noise prediction to low-rank subspaces while maintaining full-dimensional data prediction, achieving superior performance in pixel-space text-to-image generation through…

12
Hugging Face Daily Papers research 1mo ago

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Abstract INSET is a unified multimodal model that embeds images as native vocabulary within textual instructions, enabling better handling of complex interleaved inputs through transformer-based contextual locality and supporting both image generation and editing tasks.…

34
r/MachineLearning community 1mo ago

Image generation models running locally on limited resources [P]

I have a project that generates high-quality free ebook covers from the books' content. On my machine with 16GB of RAM and no GPU, I have tested the open-source Stable Diffusion models without any success. All return bad quality covers with blurred faces and scenes that do not…

6
arXiv — Machine Learning research 1mo ago

Efficient Adjoint Matching for Fine-tuning Diffusion Models

arXiv:2605.11480v1. Abstract: Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled…

30
OpenAI news 2mo ago

Introducing ChatGPT Images 2.0

ChatGPT Images 2.0 introduces a state-of-the-art image generation model with improved text rendering, multilingual support, and advanced visual reasoning.

9
Smol AI News news-outlet 2mo ago

GPT-Image-2

**OpenAI** launched **GPT-Image-2**, enhancing image generation with improved text rendering, layout fidelity, editing, multilingual support, and "thinking" capabilities. It supports generating slides, infographics, diagrams, UI mockups, and QR codes, and integrates with tools…

36
OpenAI news 2mo ago

Codex for (almost) everything

The updated Codex app for macOS and Windows adds computer use, in-app browsing, image generation, memory, and plugins to accelerate developer workflows.

5
Hugging Face official-blog 3mo ago

PRX Part 3 — Training a Text-to-Image Model in 24h!

Published March 3, 2026. David Bertoin, Roman Frigg, Jon Almazán (Photoroom). Welcome back 👋 In the last two posts (Part 1 and…

23
Smol AI News news-outlet 4mo ago

Nano Banana 2 aka Gemini 3.1 Flash Image Preview: the new SOTA Imagegen model

**Google and DeepMind** launched **Nano Banana 2** (aka **Gemini 3.1 Flash Image Preview**), a leading image generation and editing model integrated across multiple Google products with features like **4K upscaling**, **multi-subject consistency**, and **real-time…

29
Hugging Face official-blog 4mo ago

Training Design for Text-to-Image Models: Lessons from Ablations

Published February 3, 2026. David Bertoin, Roman Frigg, Jon Almazán (Photoroom). Welcome back! This is the second part of our…

13
Hugging Face official-blog 7mo ago

Diffusers welcomes FLUX-2

Welcome FLUX.2 - BFL’s new open image generation model 🤗 Published November 25, 2025. YiYi Xu, Daniel Gu, Sayak Paul, Alvaro Somoza, Dhruv Nair, Aritra Roy Gosthipaty, Linoy Tsaban…

12
Google DeepMind official-blog 7mo ago

Build with Nano Banana Pro, our Gemini 3 Pro Image model

Here’s how developers can use Nano Banana Pro (Gemini 3 Pro Image), a powerful new image generation and editing model with advanced features and creative control. Alisa Fortin, Product…

10
Google DeepMind official-blog 15mo ago

Experiment with Gemini 2.0 Flash native image generation

Native image output is available in Gemini 2.0 Flash for developers to experiment with in Google AI Studio and the Gemini API.

5
Eugene Yan research 43mo ago

Text-to-Image: Diffusion, Text Conditioning, Guidance, Latent Space

The fundamentals of text-to-image generation, relevant papers, and experimenting with DDPM.

35
