Tag

Image Gen

96 articles archived under #image-gen

arXiv — Machine Learning research 1h ago

Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing…

36
TechCrunch — AI news-outlet 9h ago

Gemini’s personalized AI image generation is now free for US users

Google is expanding Gemini’s personalized AI image generation to eligible free users in the U.S., allowing the chatbot to create images based on your interests and data from connected Google apps.

29
r/LocalLLaMA community 2d ago

clark-labs/clark-air-sana-1.6b-1.58bit · Hugging Face

A Sana 1.6B text-to-image transformer compressed to ternary (~1.85 bits/weight): 8.6× smaller than FP16, near-FP16 quality. Footprint (measured), listed as artifact / size / vs FP16 / what it is: FP16 transformer / 3.21 GB / 1× (100%) / reference; Clark Air (packed) / 374 MB / 8.6× (≈12%) / packed ternary (…

36
Hugging Face Daily Papers research 3d ago

Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation

Abstract A unified agentic framework called Qwen-Image-Agent is proposed to address the context gap in text-to-image generation by progressively constructing complete generation context through planning, reasoning, searching, and memory mechanisms. Generated by…

22
arXiv — NLP / Computation & Language research 4d ago

DanceOPD: On-Policy Generative Field Distillation

arXiv:2606.27377v1 Announce Type: cross Abstract: Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For…

18
Hugging Face Daily Papers research 4d ago

DanceOPD: On-Policy Generative Field Distillation

Abstract A novel on-policy generative field distillation framework called DanceOPD is proposed to unify text-to-image generation, local editing, and global editing capabilities in flow-matching models through capability-specific routing and velocity-based training. Generated by…

10
Hugging Face Daily Papers research 5d ago

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Abstract Implicit Visual Chain-of-Thought decomposes visual conditioning into structural and semantic cascades for improved structure-aware image generation with sketch supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Unified multi-modal large language models (MLLMs)…

7
r/LocalLLaMA community 5d ago

SDXL running locally in the browser on WebGPU, open-source

I needed simple local image generation without the usual setup. No virtual environments, no ComfyUI with a complex graph and installation as an exe. So I tried to push the whole thing into the browser and run it on WebGPU. It's a browser extension. You install it, then it loads…

13
Hugging Face Daily Papers research 5d ago

Semantic Browsing: Controllable Diversity for Image Generation

Abstract Text-to-image models are enhanced with controlled diversity through semantic browsing capabilities that enable structured navigation of image variations based on meaningful semantic decisions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern text-to-image models…

4
Hugging Face Daily Papers research 5d ago

FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Abstract FLUX3D addresses limitations in image-to-3D Gaussian Splatting generation by improving representation learning and cross-modal alignment through specialized architectures and attention mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse voxel representation…

34
arXiv — Machine Learning research 6d ago

Information-Theoretic Classifier-Free Guidance with Adaptive Schedule Optimization

arXiv:2606.24025v1 Announce Type: new Abstract: Diffusion models have achieved strong performance in image, text-to-image, and video generation, where conditional generation is often controlled by classifier-free guidance (CFG). CFG improves condition consistency by increasing a…

35
Hugging Face Daily Papers research 6d ago

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Abstract Text-to-image models fail to generate counterfactual scenes because they rely on tightly coupled visual-textual patterns rather than causal reasoning, demonstrating limited understanding beyond pattern matching. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-to-image…

26
Hugging Face Daily Papers research 7d ago

Safe Few-Step Generation via Velocity Editing

Abstract VESFlow is a training-free safety method for flow matching-based text-to-image generation that edits velocity fields to ensure safe output while maintaining prompt integrity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Flow matching has recently emerged as a strong…

16
r/LocalLLaMA community 7d ago

Boogu Base, Turbo, Edit - open-source unified image generation and editing model series

Boogu-Image-0.1 is a competitive Apache-2.0 open-source unified image generation and editing model family, including Base, Turbo, Edit, and other variants that provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing,…

22
Hugging Face Daily Papers research 7d ago

Exploring the Design Space of Reward Backpropagation for Flow Matching

Abstract FlowBP addresses limitations in flow matching model alignment by using a surrogate trajectory framework that reduces memory usage and gradient chaining while maintaining performance across multiple text-to-image models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

23
Hugging Face Daily Papers research 8d ago

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

Abstract A 3D brain MRI generative model uses a masked-autoencoder tokenizer to create clinically informative embeddings that support both medical task performance and controlled image generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Three-dimensional (3D) brain MRI is…

6
r/LocalLLaMA community 8d ago

Local text to image model comparaison: The ultimate test.

I selected 192 prompts to evaluate various text-to-image model capabilities and generated images for all the local models I was able to make work on my GX10 Spark. For instance: Is the model good at text? At faces? At human anatomy? At respecting spatial composition, etc.? You…

4
r/MachineLearning community 9d ago

Studying FLUX in diffusers library was hard, so I built a smaller open-source version [P]

If you've tried to study modern diffusion models by digging through the official diffusers library, you know it can be overwhelming with its complexity and abstractions. I wanted to simplify FLUX diffusion models, so I built minFLUX: a PyTorch implementation focused on its core…

38
Hugging Face Daily Papers research 10d ago

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Abstract Analysis of FID variance across different training and sampling seeds reveals significant reproducibility issues in image generation evaluation, with retraining causing larger fluctuations than resampling, and recommends updated evaluation protocols with error bars and…

21
arXiv — NLP / Computation & Language research 11d ago

NAMESAKES: Probing Identity Memorization in Text-to-Image Models

arXiv:2606.20155v1 Announce Type: cross Abstract: Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires…

10
Latent.Space news-outlet 12d ago

[AINews] Midjourney Medical: scan your organs like you step on a scale

The only bootstrapped frontier lab announces its second product and second

12
Hacker News — AI on Front Page community 12d ago

Midjourney Medical

https://www.midjourney.com/medical Video: https://x.com/midjourney/status/2067422898407837797 Comments URL: https://news.ycombinator.com/item?id=48579650 Points: 228 # Comments: 203

10
Smol AI News news-outlet 12d ago

Midjourney Medical: scan your organs like you step on a scale

**Midjourney** unveiled a new **medical imaging/scanning system** called the **Midjourney Scanner**, described as **radiation-free, magnet-free, fast, and low-cost**, but requiring a **water immersion tank** and having **coarser resolution than CT/MRI**. The announcement…

12
Hugging Face Daily Papers research 13d ago

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Abstract UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and…

19
r/LocalLLaMA community 17d ago

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS)

I wanted AI Dungeon but fully local and actually private, so I built it. The narrator is Gemma 4 (QAT Q4) through Ollama, and when a scene is worth showing it draws the picture too, locally, with FLUX. No API keys, no cloud, nothing leaves your machine. The part that surprised…

26
Hugging Face Daily Papers research 17d ago

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Abstract Structured Defect Grounding (SDG) addresses limitations in text-to-image model diagnosis by modeling defects as structured sets and using vision-language models for detection and reward-based alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite generating…

22
Hugging Face Daily Papers research 17d ago

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

Abstract A 2-step image generation model is developed through distillation from an 8-step teacher using distribution-aligned adversarial learning, step-decoupled parameterization, and end-to-end training with iterative regularization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

33
Hugging Face Daily Papers research 18d ago

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Abstract Evoflux enables compact language models to execute tool workflows more reliably by using evolutionary search to repair failed plans during inference, significantly improving execution feasibility compared to traditional fine-tuning methods. Generated by…

20
arXiv — Machine Learning research 19d ago

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

arXiv:2606.12280v1 Announce Type: new Abstract: Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion…

17
Hugging Face Daily Papers research 19d ago

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Abstract A teacher-student framework decouples complex reasoning from efficient reward deployment in text-to-image training, achieving superior preference accuracy and optimization performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reward models are central to…

22
Hugging Face Daily Papers research 19d ago

i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

Abstract A comprehensive experimental study of text-to-image diffusion models reveals key design choices and training insights leading to the development of i1, a 3B-parameter model that matches leading performance while maintaining full openness. Generated by…

21
Ars Technica — AI news-outlet 19d ago

Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster

Diffusion AI is most common in image generation, but it can make text outputs much faster.

29
Hacker News — AI on Front Page community 19d ago

Mercedes‑Benz starts large‑scale production of electric axial flux motor

Article URL: https://media.mercedes-benz.com/en/article/bebac2af-acdc-465a-9538-adb0bf3d8ccf Comments URL: https://news.ycombinator.com/item?id=48472877 Points: 262 # Comments: 139

21
Hugging Face Daily Papers research 20d ago

Text-to-Image Models Need Less from Text Encoders Than You Think

Abstract Text-to-image models primarily utilize basic text representation aspects like word merging and order rather than complex contextual information encoded in full text embeddings. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-to-image models rely on text prompts as…

36
r/MachineLearning community 21d ago

Open image generation models are closer to closed-source quality than this sub thinks [D]

I run evaluations on generative image models as part of my workflow, mostly comparing coherence, prompt adherence, and compositional accuracy across different architectures. The consensus here seems to be that open models are still a generation behind closed APIs. Based on my…

25
Hacker News — AI on Front Page community 25d ago

Ask HN: What was your "oh shit" moment with GenAI?

Most of us were amused when DALL-E and its peers went mainstream, and we were quick to point out the obvious flaws. Then ChatGPT hit the scene and again, many of us dismissed it as a parlor trick that would never amount to much. Using LLMs for coding initially was only a small…

26
Hugging Face Daily Papers research 25d ago

Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

Abstract Multi-concept customization in text-to-image generation is improved through prompt-aware weighting strategies that reduce interference between learned visual concepts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Low-Rank Adaptation (LoRA) successfully enables…

5
Hugging Face Daily Papers research 25d ago

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Abstract Research reveals significant disparities between text and image generation capabilities in multimodal models, with effective textual knowledge editing not transferring reliably to visual output, necessitating modality-aware editing approaches. Generated by…

9
r/MachineLearning community 25d ago

Research in Image/Video Gen AI models [D]

I've been going down a rabbit hole with image/video generation/editing models for a few months now, started with playing around with Stable Diffusion and ComfyUI, then got genuinely hooked on understanding why things work, not just that they do. I have an Engineering background…

20
The Information — AI news-outlet 26d ago

Cybersecurity’s AI Paradox

It's no secret that criminals are using AI to streamline computer hacks in hopes of emptying out people’s bank accounts (never has it looked more appealing to stash cash under the mattress!). Cybersecurity executives, meanwhile, are rubbing their hands with glee at the influx of…

32
Hugging Face Daily Papers research 26d ago

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

Abstract Decoupled Residual Denoising Diffusion models (DRDD) improve unified image-to-image translation by separating noise diffusion for domain harmonization from residual diffusion for semantic mapping, enhancing data efficiency and performance. Generated by…

32
r/LocalLLaMA community 27d ago

1-bit Bonsai Image 4B and Ternary Bonsai Image 4B Image Generation for Local Devices with just 0.93 GB and 1.21 GB respectively of Diffusion Transformer Footprint. So tiny!

https://prismml.com/news/bonsai-image-4b (submitted by /u/Addyad)

6
Hacker News — AI on Front Page community 27d ago

Adafruit Receives Demand Letter from Fenwick Legal Counsel on Behalf of Flux.ai

Article URL: https://blog.adafruit.com/ Comments URL: https://news.ycombinator.com/item?id=48368121 Points: 255 # Comments: 87

11
arXiv — Machine Learning research 28d ago

CHAM-net: A Contrastive Hierarchical Adaptive Meta-network for Robust Global Methane Flux Prediction

arXiv:2606.00338v1 Announce Type: new Abstract: Methane is a potent greenhouse gas that significantly contributes to global warming. However, accurately estimating global methane emissions and consumption remains challenging due to the complex interactions among environmental…

28
Hugging Face Daily Papers research 28d ago

Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization

Abstract BiDPO enhances text-to-image models for complex compositional prompts through preference-based fine-tuning and region-level guidance. AI-generated summary Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex…

18
Hugging Face Daily Papers research 28d ago

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Abstract GCPO enables per-token credit assignment in reinforcement learning by contrasting model predictions under positive and negative prompts, improving performance in text-to-image generation and chain-of-thought reasoning tasks. AI-generated summary Group-advantage-based…

27
Hugging Face Daily Papers research 29d ago

Representation Forcing for Bottleneck-Free Unified Multimodal Models

Abstract Representation Forcing enables unified multimodal models to perform both perception and generation tasks end-to-end without relying on external latent spaces, matching state-of-the-art performance in image generation while improving understanding capabilities.…

27
Hacker News — AI on Front Page community 29d ago

1-Bit Bonsai Image 4B Image Generation for Local Devices

Article URL: https://prismml.com/news/bonsai-image-4b Comments URL: https://news.ycombinator.com/item?id=48346257 Points: 228 # Comments: 81

36
r/LocalLLaMA community 29d ago

Should I buy this RTX 2060 12GB graphics card at around $260 for AI purpose ?

I’m interested in running Gemma 4 models for text only. They run smoothly even on my laptop, but it gets crazy hot. Initially I wanted to buy an 8 GB card, but I find this price for 12 GB good. (Maybe I can run some image generation models too, but it’s not important.) It has 6 Month…

10
r/LocalLLaMA community 1mo ago

Could someone make some ggufs for Qwen-Image-Bench?

I'd like to try it out for automating image generation quality output; I haven't had great luck with that using 27b base or gemma. If this can reliably detect 6-fingered generations and other undesirable outputs, it would be a great boon. I took a swing at quantizing it myself…

30
