r/LocalLLaMA · 1 min read

Qwen/Qwen-Image-Bench · Hugging Face

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Model Description

Q-Judger is a vision-language model fine-tuned specifically to evaluate the output of text-to-image models automatically. Given a text prompt and the image generated from it, the model scores the image against fine-grained quality criteria organized in a 3-level hierarchy and returns the scores as structured JSON.

  • Base Model: Qwen3.6-27B
  • Task: Image quality evaluation / judging
  • Input: Text prompt + generated image
  • Output: Structured JSON with one score per dimension: 0 = Fail, 1 = Pass, 2 = Excel, or N/A
  • Thinking Mode: Enabled; the model uses chain-of-thought reasoning before producing the final JSON output
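
Because the model reasons before it emits JSON, a caller has to separate the two. The sketch below shows one way to do that in Python. It rests on assumptions the post does not confirm: that the reasoning is wrapped in `<think>...</think>` tags, and that the final JSON is a flat object mapping criterion names to scores. The helper names `extract_scores` and `label` and the sample output are hypothetical.

```python
import json
import re

# Score labels as listed in the Output bullet above.
SCORE_LABELS = {0: "Fail", 1: "Pass", 2: "Excel"}


def extract_scores(raw_output: str) -> dict:
    """Return the last JSON object found in the model output."""
    # Assumption: reasoning is wrapped in <think>...</think>; drop it if present.
    text = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL)
    decoder = json.JSONDecoder()
    result = None
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value and reports where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here, try the next brace
            continue
        if isinstance(obj, dict):
            result = obj
        pos = end
    if result is None:
        raise ValueError("no JSON object found in model output")
    return result


def label(score) -> str:
    """Map a raw score to its label; N/A passes through unchanged."""
    if score in ("N/A", None):
        return "N/A"
    return SCORE_LABELS[int(score)]


# Hypothetical output; the real schema may be nested differently.
raw = (
    "<think>The hand has six fingers.</think>\n"
    '{"Anatomical Fidelity": 0, "Color Harmony": 2, "Text Accuracy": "N/A"}'
)
scores = extract_scores(raw)
readable = {name: label(value) for name, value in scores.items()}
```

Taking the last JSON object rather than the first is a defensive choice in case the model quotes partial JSON before its final answer.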

Evaluation Dimensions

The model evaluates images across 5 top-level dimensions. Each bullet below names a sub-dimension, followed by the individual criteria scored under it:

Quality

  • Realism: Physical Logic, Material Texture
  • Detail: Noise, Edge Clarity, Naturalness
  • Resolution: Resolution

Aesthetics

  • Composition: Composition
  • Color Harmony: Color Harmony
  • Lighting: Lighting & Atmosphere
  • Anatomical Portraiture: Anatomical Fidelity
  • Emotional Expression: Emotional Expression
  • Style Control: Style Control

Alignment

  • Attributes: Quantity, Facial Expression, Material Properties, Color, Shape, Size
  • Actions: Contact Interaction, Non-contact Interaction, Full-body Action
  • Layout: 2D Space, 3D Space
  • Relations: Composition Relationship, Difference/Similarity, Containment
  • Scene: Real-world Scene, Virtual Scene

Real-world Fidelity

  • Fairness: Social Bias, Cultural Fairness
  • Safety & Compliance: Safety & Compliance
  • World Knowledge: Animals, Objects, Information Visualization, Temporal Characteristics, Cultural Elements

Creative Generation

  • Imagination: Imagination
  • Feature Matching: Feature Matching
  • Logical Resolution: Logical Resolution
  • Text Rendering: Text Accuracy, Text Layout, Font, Cross-lingual Generation
  • Design Applications: Graphic Design, Product Design, Spatial Design, Fashion Styling, Game Design, Art Design
  • Visual Storytelling: Cinematic Style, Camera / Lens Style, Storyboard Creation, Shot Sizes, Composition, Angles, Comic Creation
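
The 3-level hierarchy above maps naturally onto a nested dictionary. The Python sketch below holds an excerpt of it and rolls criterion scores up to one number per top-level dimension. The mean-based aggregation and the `summarize` helper are illustrative assumptions, not the benchmark's documented scoring method, and the sample scores are made up.

```python
# Excerpt of the hierarchy listed above: dimension -> sub-dimension -> criteria.
HIERARCHY = {
    "Quality": {
        "Realism": ["Physical Logic", "Material Texture"],
        "Detail": ["Noise", "Edge Clarity", "Naturalness"],
        "Resolution": ["Resolution"],
    },
    "Alignment": {
        "Layout": ["2D Space", "3D Space"],
        "Scene": ["Real-world Scene", "Virtual Scene"],
    },
}


def summarize(scores: dict) -> dict:
    """Mean score per top-level dimension, ignoring N/A and missing criteria.

    `scores` maps (dimension, sub_dimension, criterion) tuples to 0, 1, 2 or "N/A".
    A dimension with no applicable criteria maps to None.
    """
    summary = {}
    for dimension, subs in HIERARCHY.items():
        values = [
            scores[(dimension, sub, crit)]
            for sub, crits in subs.items()
            for crit in crits
            if scores.get((dimension, sub, crit), "N/A") != "N/A"
        ]
        summary[dimension] = sum(values) / len(values) if values else None
    return summary


# Made-up scores for illustration only.
example_scores = {
    ("Quality", "Realism", "Physical Logic"): 1,
    ("Quality", "Realism", "Material Texture"): 2,
    ("Quality", "Detail", "Noise"): 0,
    ("Alignment", "Layout", "2D Space"): "N/A",
}
summary = summarize(example_scores)
```

Keying scores by the full path instead of the criterion name alone avoids collisions: "Composition", for instance, appears under both Aesthetics and Visual Storytelling.
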
submitted by /u/jacek2023