Hacker News — AI on Front Page · June 5, 2026 · 27 min read

Did Claude increase bugs in rsync?

Mirrored from Hacker News — AI on Front Page for archival readability. Support the source by reading on the original site.

203 pts · 205 comments on Hacker News

Like Read original ↗

In order to avoid accuastions of this "just being Claude defending Claude," "AI slop," "probably all hallucinations," etc., I've decided it's probably worth explaining a few key points about how this report was created:

All metrics, methodology, and data sources were exclusively chosen by me, in consultation with my wife, who has a Master's Degree in Statistics from Penn State University.
The methodology is directly based on my wife's input: she was the one that pointed out that trying to just compare bugs per ten lines of code before and after would likely be too effected by noise because of the low number of post-Claude samples, and that, for similar reasons, trying to build some kind of linear regression model to ascertain the relative effects of different variables would probably also not work. She specifically told me that looking at where the post-Claude releases fall into the historical distribution, and how likely from the historical distribution we would be to get releases as "bad" or worse than the post-Claude releases, was probably the best that could be done.
I spent several days on this, two before even creating the GitHub repo and had at least one major total rewrite of the report to use a better methodology (given the feedback from my wife mentioned above). This was a lot of manual, cognitive effort on my end.
The scripts used to fetch the data, collate it into a DuckDB database file, construct the views on that DB, and then do the statistical analysis on that data, were indeed written by GLM 5.1, as was the HTML and much of the original prose for the final report webpage you're looking at right now.
Crucially, however, all numbers, statistics, cards, and graphs in this report are automatically templated in directly by the Python script that ran the statistical analysis, thus avoiding any possibility of hallucinations or inconsistencies in the numbers.
After posting this on Hacker News and recieving almost no substantive input, discussion, or response on the actual content of the article, I decided to rewrite all of the prose in my own voice. If anyone complains about my verbosity or sentence structure — as they usually do, which is the reason I originally let the AI write the prose, among other reasons obsoleted by templating — they can go fuck themselves.
If you want to replicate the data and results here, and inspect exactly how they were calculated, you can find the repository here. I have purposefully made it so that the pipeline can be run end to end completely from scratch, so you can see the entire pipeline end-to end, with no mysterious DB blobs forcing you to trust that I didn't doctor or screw up the data. If you want to be mad about the numbers, look there first.

1 · Background: The rsync Outrage

In late May 2026, rsync blew up. First, an evidence-free Mastodon post was made pointing to a spurious correlation between a regression that particular user experienced upon upgrading to a release, and that release having Claude commits in it. It was viewed an unknown number of times, but even likes and boosts passed the thousands mark handily, and it gained significant traction — as all spurious anti-AI hate does —, seeing 58 replies from 32 unique users. Someone rages about "cognitive surrender" with no evidence; another suggests adding rsync to the famous open-slopware blacklist. From there, it spread to Hacker News, with 81 comments, full of mixed dread, anger, and crowing about how this finally proves once and for all no one can use LLMs safely. Among all that was one particular comment which spurred further the view that the regressions and bugs were caused by Claude.

This On May 30, 2026, this burgeoning outrage emergently coalesced into a single focal point: a GitHub issue titled "Please Do Not Vibe Fuck Up This Software", opened against the rsync repository. It attached a screenshot of the Mastodon post criticizing the project's use of Claude. That's it. No bug report, no technical content, no attempt to actually ascertain if the concern was real or justified; just 350+ comments ranging from thoughtful concern to outright harassment (most of the most egregious, unreasonable, and outright violent comments have since been deleted; few thought to preserve them).

GitHub issue screenshot — The GitHub issue that started it all. The original post was a screenshot of a Mastodon critique, no bug report, no technical content. It has since accumulated **329 comments**.

Hacker News thread screenshot — The thread quickly escalated, from "the software is free, if you don't like it then fork it or fuck off" to: *"just because you are giving free soup to the homeless, does not mean you can piss in it"*.

The thread did not stop at words. It eventually escalated to, at one point, visual depictions of fantasies of violence, when one user posted a now deleted comment including My Little Pony drawings of themselves strangling the "project janitor that pushed vibecoded commits":

Threatening drawing — A user posting drawings depicting violence against the rsync maintainer, one of several threats that escalated the issue from heated debate to harassment.

Completing the internet outrage cycle, this issue in turn spread to Hacker News, generating hundreds more comments. Some attempted to point at the number of regressions after the introduction of Claude — "The Linux Mint Timeshift tool has an issue open documenting a number of regressions that are currently open on the rsync issues page, that were only introduced post-vibecoding" — as evidence that it was worse. Others pointed out that those regressions were not caused by Claude, and in response, the goalposts were moved again. Over and over, the core theme was one central claim, repeated everywhere: Claude-assisted development introduced bugs into a previously stable tool. AI is cognitive surrender, is cocaine, is loss of craft, and the users are right to be angry as a result:

People are very justifiably angry that a very stable, well trusted tool, has started to immediately go downhill… all because the main dev is vibecoding that software.
— fao_ on Hacker News

However, this isn't doesn't have to be a question solved only on the basis of — ironically — vibes. This is something that could be, at least to a degree, empirically tested. Some even pointed that out:

On Lobste.rs, in response to the Medium essay Tridge himself posted in response, finally some users like boramalper begin to actually ask for evidence one way or another:

It'd be interesting if someone actually did a timechart of regressions after each release (if at all possible) to see if the number actually went up recently or not.
— boramalper on Lobsters

User bitshift replied: "I would also love to see such a chart. It wouldn't be completely informative… But at least it would be something objective we could measure."

This analysis is that chart. Or, well, as best as it can be made, given the limitations of the data (see the previous section).

2 · Executive Summary

36 releases with bug data, spanning v2.4.6 to v3.4.3
2 releases have Claude commits: v3.4.2 (9 Claude, 0.00 sev/10c) and v3.4.3 (28 Claude, 3.29 sev/10c)
The Claude releases bracket the IQR in opposite directions: v3.4.2 is below the IQR, v3.4.3 is above it. Neither is an outlier.
Exact permutation test p-value = 46%: pick any 2 releases at random, you'd score as bad or worse 46% of the time. This is the strongest available test and it finds nothing.
Fisher's exact test p-value = 74%: Claude releases are no more likely to fall above the historical median than any other releases (odds ratio 1.06).
The historical mean is 1.8× the Claude mean (2.95 vs 1.65 sev/10c)
v3.4.1 (59 bugs / 9 commits, no Claude) is an outlier but belongs in the baseline — it is a release, and the distribution already captures it

3 · The Metric

The analysis uses a single metric: severity-weighted bugs per 10 commits (sev/10c). Each bug is normalized to a 0–1 severity score (its LLM-assigned severity divided by 100), and those scores are summed per release instead of simply counting bugs. The raw bug count is also shown in the table for reference, but sev/10c drives all statistical tests.

sev/10c = (Σ severity/100 ÷ total_commits) × 10

How commits are assigned to releases

Every commit on the default branch was ordered by committer date to produce a sequential timeline. Each git tag points to a specific commit in this timeline. A release's range is all commits between the previous tag and its own tag. Pre-release tags ("pre", "rc") are skipped as boundaries and absorbed into their final release. Every commit belongs to exactly one release.

How bugs are found and assigned to releases

Bug reports come from three sources:

GitHub issues in the rsync repository (collated via the GitHub REST API),
the rsync Bugzilla instance (collected via the API),
and the rsync mailing list.

GitHub issues and mailing-list bugs are attributed to the most recent release that shipped before the bug was reported. For Bugzilla, each entry has a "Version" field that explicitly states which release the bug was reported against, and bugs are attributed to that release.

Severity scoring

To control for bug severity — so that, as someone on HN said, a typo in a button and a CVE aren't rated equally — every bug report was scored for severity on a 0–100 scale. The scorer is Qwen 3 35B, a small open-weight language model, prompted as a senior reliability engineer assessing real-world impact. Each bug report was given to the model as its title and body text (truncated to 3,000 characters), along with the following rubric:

Score	Category	Description
90–100	Data loss / corruption	Silent data corruption or data loss. The user's files or backups are wrong and they may not notice until it's too late. Security vulnerabilities that allow remote code execution or unauthorized access.
70–89	Crash / hang / broken backups	rsync crashes, hangs, or fails in a way that breaks automated backups or cron jobs. Data is not corrupted but backups are missed. High CPU or memory usage that makes rsync unusable in production. Build or compilation failures — if rsync cannot be built from source, users cannot install it at all. This is a blocking problem, not a minor inconvenience. Score these at least 70. Security vulnerabilities that expose sensitive data.
50–69	Feature regression	Feature regressed — something that used to work no longer works, but there's a workaround. Performance regressions large enough to disrupt production workflows. Incorrect output that is visible (errors, wrong filenames) but doesn't corrupt data.
30–49	Minor regression	Minor feature regression with easy workaround. Error messages are confusing but the operation still succeeds. Intermittent test failures. Portability issues on uncommon platforms.
10–29	Cosmetic / low impact	Cosmetic issues, documentation errors, minor UX annoyances. Test-only issues that don't affect users.
0	Feature request	If the issue is asking for a new feature, a change in default behavior, or a packaging suggestion — no matter how reasonable — it is NOT a bug. Score it 0.
0–9	Not a real bug	Spam, off-topic, duplicate. Issues that are clearly not about rsync or are empty/meaningless.

All three bug sources — GitHub issues, Bugzilla, and the rsync mailing list — were scored. Bugzilla and mailing list reports had only a title (no body), so the model scored those from the title alone. The model was instructed to fall back on the title and lean toward the middle of the range (40–60) when the body didn't provide enough information. The model was also told to output only a severity integer via structured output (JSON schema), so there were no free-text responses to parse. Scoring was done at temperature 0 for determinism — the same input always produces the same score.

Issues scored severity 0 — feature requests, spam, off-topic rants about AI, empty submissions — are excluded from bug counts by default. This matters because some releases attracted a lot of noise on GitHub. v3.4.2 had four issues filed; the model scored all four at severity 0 (a feature-request option, a missing tarball question, and two more feature requests).

Example scores from the database, one per tier:

Score	Release	Title
95	v3.4.1	Destination --chmod and --fake-super: chmod applied to fake-super, backup permissions lost
75	v3.4.3	error in rsync protocol data stream (code 12) at token.c(490) [sender=3.4.3]
55	v3.4.1	--dry-run doesn't work with --mkpath when copying files
35	v3.4.1	manpages not installed in out-of-tree builds
15	v3.4.1	Minor inconsistencies in options in manpage
0	v3.4.3	PAM Support - Open for Discussion

Why the release is the unit of analysis

Why group commits by release, bugs by release, and then ascertain the correlation — or lack thereof — between Claude commits and bugs through the intermediary of releases? This is for two reasons.

First, because the claim that the critics are making is also, itself, made in terms of releases: that having any Claude commits in a release makes the whole release more buggy as a whole in a noticeable way, not just that Claude-authored commits may introduce more bugs; the latter is a different metric, because later Claude- or human-authored commits could correct for those bugs within the same release, and nobody would then notice as part of the release, and overall it wouldn't matter to users; additionally, it's simply important, as stated elsewhere, to meet the claim of the critics where it's at. If this forces them to make their claims more nuanced — or otherwise move the goalposts — then mission accomplished.

Second, it's a problem of attribution: the vast, vast majority of bugs do not state exactly which commit caused them, because doing so would require extensive research and analysis that is often not worth it in favor of simply fixing-forward, and even if that analysis was done — via something like git bisect — it wouldn't necessarily result in anything useful, or anything at all. Many bugs can result from a combination of multiple commits, often separated significantly over time, where it's unclear whether one commit or the other really introduced the bug. Or, one commit can reveal several latent bugs introduced by other commits at once, and so on.

Why bugs and commits?

The critics' claim is simplistic, absolute, and universalistic: the rate of bugs in the Claude-exposed releases went up. Therefore, the simplest honest response is to analyize precisely what is being claimed: bugs, commits, releases, and Claude-exposed commits. If the Claude releases sit in the middle of the historical distribution, the burden shifts to the critics to explain why this particular middle is somehow worse than all the other middles that came before it. Even by weighting by severity, I feel that I am giving extensive generosity to the anti-AI point in all this, but enough of the more intelligent critics brought it up that I found it worth it.

Even if that results in is shifting the conversation toward a more nuanced discussion of the quality and type and user impact of the bugs in the releases, it will already have been a major win for the pro-AI crowd, and a shifting of the goalposts for the anti-AI crowd, and then we can do further analysis based on that. And the ball's in the anti-AI court for that game.

What this approach does not do

I'm aware that this metric does not control for commit complexity or security intensity. It is a blunt instrument. But the critics' accusation is also blunt: "Claude is making things worse." A blunt instrument is what is required in response. Blood begets blood.

4 · Results

Claude Releases

Before we jump into deeper analysis, let's just look at the two Claude releases themselves, to get a sense for them:

v3.4.2

0.00 sev/10c

0 bugs · 50 commits · 9 Claude

0th percentile (rank 0 of 35)

v3.4.3

3.29 sev/10c

17 bugs · 34 commits · 28 Claude

77th percentile (rank 27 of 35)

If that doesn't look like a red flag to you, you'd be right.

Exact Permutation Test

So the question is: are the Claude releases unusually buggy, or could you easily pull a group just as bad out of the historical distribution by dumb luck? The way you answer that question statistically is an exact permutation test, which just enumerates all pairs of two releases and asks: what fraction have a mean bug rate as bad or worse than the one we actually observed? That fraction is the p-value of the hypothesis under test.

46%

exact permutation test p-value (one-sided, H₁: Claude mean > historical)

272 of 595 possible groups of 2 historical releases have mean sev/10c ≥ 1.65. Nearly half. The Claude releases sit right in the middle of the permutation distribution — there is nothing extreme about them.

Test statistic: mean sev/10c per group · Claude group mean: 1.65 · Historical mean: 2.95

What this p-value tells us is that the hypothesis that Claude makes releases worse has, at least so far, about as much predictive power as a coin flip: if you closed your eyes and picked 2 releases at random, you'd do as bad or worse nearly half the time. There's nothing unusual about the Claude group.

Fisher's Exact Test

The permutation test asks: how likely is it that a random group of releases scores as badly as the Claude group? But there's another way to pose the question: are Claude releases more likely than non-Claude releases to fall above the historical median? That's a textbook 2×2 contingency table, and the standard test for it is Fisher's exact test.

	≤ median	> median
Non-Claude	18	17
Claude	1	1

74%

one-sided p-value (H₁: Claude more likely above median)

Fisher's exact test asks: if we split all releases at the historical median (0.74 sev/10c), are these Claude releases significantly buggy than previous releases (more likely to land above the median)? With a p-value of 74%, the answer is a decisive no. The odds ratio is 1.06 — essentially 1:1. Claude releases are no more likely to be above the median than any other releases.

Odds ratio: 1.06 · Median: 0.74 sev/10c

To emphasize, this does not mean that all Claude releases in the future will not be more buggy. We don't have nearly enough data to build a model and extrapolate out like that, and that's not what a Fisher's exact test is for. The point that's being made here is that these specific releases are not at all notable; if no one had known they were AI, no one would have cared or noticed anything out of the ordinary, and there is no evidence with which to conclude that Claude made anything worse yet, unlike the objective, absolutist, universal claims made by critics.

The Distribution

In case you're not convinced, here's a visual aid, showing where these releases fall in the distribution of all prior releases:

middle 50%

v3.4.2

v3.4.3

0.010.1110100

Historical Claude Middle 50% (IQR) Outside IQR

How to read this graph: Each dot is a release. The shaded green band is the interquartile range — the middle 50% of historical releases, from 0.29 to 2.59 sev/10c. The darker regions on either side are the lower and upper quarters.

This is another way of saying the same thing the previous two tests said, but more intuitively: the Claude releases (green dots) bracket the IQR in opposite directions. v3.4.2, with zero real bugs, sits just below the IQR; v3.4.3 sits just above it. They bracket the middle of the distribution in opposite directions. Neither is a negative outlier, and since they're on either side of the IQR, there's no evidence Claude usage bends releases in either direction.

Commit Rate

One possible objection I've seen is that while perhaps the defect rate of Claude-authored commits is not worse than human-authored ones, Claude sped up developmend so much — due to "vibe coding" perhaps — that the total number of bugs in each release got too high for comfort anyway, in which case it doesn't matter that the defect rate per commit isn't so bad, because that's not what downstream users experience. We can check this, though:

p=88%

exact permutation test: do Claude releases have more commits?

Claude releases averaged 42 commits; non-Claude releases averaged 185. If you pick any 2 releases at random, you'd see as many or more commits 88% of the time.

So it seems that Claude releases did not have meaningfully more commits. If anything, they had a lot fewer, but of course, the data is underpowered to say that for sure; just like the rest of this article, I'm only trying to show a startling lack of evidence for the supposed massive harms of using Claude, not trying to prove there must have been none, or that it was even "beneficial" in terms of focused releases.

But commit count is a blunt measure. A commit can be a one-line typo fix or a rewrite of the entire protocol handler. What about lines of code changed? GitHub's compare API gives us additions and deletions between each pair of consecutive tags:

p=5%

exact permutation test: do Claude releases have more lines changed?

Claude releases averaged 3756 lines changed; non-Claude releases averaged 696. If you pick any 2 releases at random, you'd see as many or more lines changed only 5% of the time. Claude releases were larger by this measure.

Lines changed went up by a lot. But did absolute bug numbers?

p=77%

exact permutation test: do Claude releases have more severity-weighted bugs?

Claude releases averaged 5.6 severity-weighted bugs; non-Claude releases averaged 14.9. If you pick any 2 releases at random, you'd see as many or more severity-weighted bugs 77% of the time. More lines changed, but not more bugs.

So: the Claude releases changed way more lines of code than historical ones, but didn't have more bugs. More code, same bugs. That's not what you'd expect if Claude were making things worse.

Regime Check

The obvious counterargument is that maybe earlier rsync releases were less in maintenence-mode, and so had more bugs, but recent rsync releases have been more stable, so comparing the two Claude-exposed releases to the full historical distribution is masking the fact that they're actually outliers for their regime. Luckily, there's a way to test this statistically.

Yes, the historical mean (2.95 sev/10c) was driven by a bimodal distribution: v2.x releases average 1.11 sev/10c; v3.x releases average 4.23. But even within the v3.x regime, the Claude releases sit in the middle of the pack or better:

middle 50%

v3.4.2

v3.4.3inside middle 50% ✓

0.010.1110100

v3.x historical Claude Middle 50% (IQR)

So the regime-shift argument doesn't just fail — it fails backwards. The v3.x era has a much higher mean sev/10c than v2.x. If you restrict the comparison to v3.x only, Claude releases don't stand out at all — and one of them is better than most. The only way to make Claude look like an outlier is to compare it against a quieter era and then blame the shift on Claude, when the data says the shift predates Claude entirely.

We can further test whether there are meaningfully different regimes in the version history, and thus whether using the full historical data is valid, by doing a runs test. If such regimes existed, the runs test would detect non-random clustering:

p=0.060

Wald–Wolfowitz runs test on 35 non-Claude releases

13 runs observed (expected 18.5 under randomness, z=-1.88). At p=0.060 this is marginal — not strong enough to reject randomness at the conventional 0.05 level, but close enough that it warrants the v3.x-only comparison above, which already accounts for the v2.x/v3.x difference. Using the full historical baseline is a conservative choice: it includes a noisier era, which makes it harder, not easier, to show that Claude releases are normal.

Observed runs: 13 · Expected: 18.5 · z=-1.88

The Pre-Claude Outlier

Here's my favorite part, though. Digging into the data, one of the first things that jumped out at me with blinding clarity was that the worst release, by far, in rsync history was entirely prior to the introduction of Claude:

39.39

bugs per 10 commits — v3.4.1, no Claude

The highest bug rate in the entire dataset. 59 bugs in 9 commits, a hotfix release the day after v3.4.0. It exceeds every other release by an order of magnitude.

And yet nobody noticed. There was no AI to blame so there was no GitHub issue with 300 comments, no death threats, no threats to fork or move to openrsync. A maintainer shipped a broken release and fixed it, just like normal. The only thing that made v3.4.3 special was the availability of an enemy everyone had already decided to hate.

All Releases (chronological)

Release	Bugs	Sev	Commits	Claude	Bugs/10c	Sev/10c ↕	Percentile
v2.4.6	2	1.2	13	0	1.54	0.92	54th percentile
v2.5.0	4	1.3	73	0	0.55	0.18	11th percentile
v2.5.1	4	1.8	69	0	0.58	0.26	20th percentile
v2.5.2	6	2.9	117	0	0.51	0.25	14th percentile
v2.5.4	5	3.5	21	0	2.38	1.64	69th percentile
v2.5.5	22	11.6	88	0	2.50	1.32	63rd percentile
v2.5.6	14	6.6	239	0	0.59	0.28	23rd percentile
v2.6.0	8	4.7	267	0	0.30	0.18	9th percentile
v2.6.1	5	3.3	444	0	0.11	0.08	0th percentile
v2.6.2	29	16.8	17	0	17.06	9.88	94th percentile
v2.6.3	49	22.7	381	0	1.29	0.59	34th percentile
v2.6.4	22	9.4	760	0	0.29	0.12	6th percentile
v2.6.5	16	7.1	146	0	1.10	0.49	31st percentile
v2.6.7	15	5.8	649	0	0.23	0.09	3rd percentile
v2.6.8	12	5.4	72	0	1.67	0.74	49th percentile
v2.6.9	53	18.8	261	0	2.03	0.72	43rd percentile
v3.0.0	64	27.9	909	0	0.70	0.31	26th percentile
v3.0.1	6	4.0	102	0	0.59	0.40	29th percentile
v3.0.2	10	4.1	9	0	11.11	4.56	80th percentile
v3.0.3	22	10.8	55	0	4.00	1.96	71st percentile
v3.1.0	170	52.4	571	0	2.98	0.92	51st percentile
v3.1.1	68	32.1	66	0	10.30	4.86	83rd percentile
v3.1.2	55	18.4	57	0	9.65	3.22	74th percentile
v3.1.3	85	30.9	61	0	13.93	5.07	86th percentile
v3.2.0	22	7.8	304	0	0.72	0.25	17th percentile
v3.2.1	7	4.5	63	0	1.11	0.72	46th percentile
v3.2.2	13	8.8	58	0	2.24	1.51	66th percentile
v3.2.3	95	55.3	157	0	6.05	3.52	77th percentile
v3.2.4	20	14.3	213	0	0.94	0.67	40th percentile
v3.2.5	9	5.5	53	0	1.70	1.04	57th percentile
v3.2.6	6	3.2	28	0	2.14	1.13	60th percentile
v3.2.7	88	52.4	60	0	14.67	8.73	91st percentile
v3.3.0	42	25.3	38	0	11.05	6.66	89th percentile
v3.4.0	6	4.0	60	0	1.00	0.66	37th percentile
v3.4.1	59	35.5	9	0	65.56	39.39	97th percentile
v3.4.2	0	0.0	50	9	0.00	0.00	0th percentile
v3.4.3	17	11.2	34	28	5.00	3.29	77th percentile

5 · What the Data Is Consistent And Inconsistent With

✓

"The Claude releases are statistically indistinguishable from historical releases"

One Claude release sits just below the IQR (v3.4.2, with zero real bugs), the other just above it. They bracket the middle of the distribution in opposite directions — neither is an outlier. The exact permutation test yields a p-value of 46% — pick any 2 releases at random and you'd do as bad or worse nearly half the time. There is no signal of abnormality.

✓

"The outrage selected on a single tail event and narrativized it"

A Mastodon user noticed a regression in v3.4.3, saw Claude commits, and concluded causation. But v3.4.3 at 3.29 sev/10c is at the 77th percentile — elevated but not extreme. 8 historical releases scored higher. The correlation is noise.

✗

"Claude clearly made things worse" &emdash; the main claim

The Claude releases bracket the IQR — one below, one above. Neither is an outlier. There is no distributional evidence of harm. The claim rests entirely on a post-hoc correlation observed by a social media user.

✗

"Claude commits (in general) do not and will not make things worse"

This is a common misrepresentation of my claims here. I am not trying to extrapolate out into the future, and say something like "in general, Claude won't make things worse" or "Claude will never make things worse." Instead, the point is this: there is no evidence of it having done so, and the two Claude releases we have currently are thoroughly unremarkable, so the outrage is totally unjustified.

✗

"The regressions speak for themselves"

v3.4.1 — a pre-Claude release — has the highest bug rate in the dataset (39.39 sev/10c). Nobody noticed, because there was no AI to be angry at. The regressions only "speak" when you ignore the historical distribution.

✗

"Just wait, more bugs will surface"

v3.4.3 has been out long enough that its rate (3.29) is already comparable to historical releases. The "wait and see" argument is an appeal to an unknowable future that shifts the burden of proof away from the critics. If more bugs surface, they will enter the distribution like every other release. There is no reason to expect a regime break.

Discussion, and Tridge's Response

So, why do people feel like they've been betrayed, and feel so sure that things have "clearly" gotten worse &emdash; that Claude "broke rsync" &emdash; when there is no evidence for this, and no data except two thoroughly unremarkable releases?

A lot of it is just sheer, blind outrage at the use of LLMs. However, there are some confounders that might have caused people to feel that way:

On the HN thread, user zos_kia pointed at the confound directly:

From a cursory look, it looks like a security fix in response to a CVE surfaced a coding error which has been present in the code since 2007. This is so banal that it's actually hilarious to see people lose their shit over it.
— zos_kia on Hacker News

On Lobsters, user jbert spelled out the causal chain:

The trigger for the increased volume of changes (and hence increased number of regressions) was the influx of (mostly) LLM-enabled security issues. i.e. the causal chain was: LLMs → more known security issues → more changes needed than usual → more regressions than usual.
— jbert on Lobsters

Essentially, this isn't a "Claude" problem, it's a "more security work" problem, something that Tridge himself confirmed in his response, describing how a flood of AI-generated CVE reports forced rapid, extensive changes to rsync's attack surface.

But, as with all things AI, it doesn't matter. In the end, the outrage isn't about whether rsync is worse or better now, it's about people not liking AI, and arguing from a priori definitions, not empirical results, to the desired conclusion: that AI is bad:

Like I said, the author "tried to balance security against feature regression." I don't dispute that he tried. I merely dispute that the chatbots are good at writing code; in fact, they are bad at writing code. If the author had approached these security bugs by hand with a mental model (a Naur theory!) which preserves their desired features and functionality then they would have caused fewer regressions...
— Corbin on Lobste.rs

In response to this sweeping, absolute, causal claim made with no evidence — and in fact, counter to the evidence — based on an old philosophical claim about the epistemology of programming, it is perhaps best to leave the victim of this outrage himself with the final word:

…for the people saying things like "I'm a PhD from xyz uni and I'm telling you LLMs are just stochastic tools that make everything up and the world will fall apart if you use them", I'm here to tell you that you are out of date. The world of software engineering has changed dramatically in the last few months. The world of IT security and maintaining software in the face of the flood of reports has completely and utterly changed just in the last few weeks. Anything you learned about this stuff last year might as well be from another planet… Bottom line is I do know (well, roughly!) how LLMs work, but that doesn't make them not useful. It does mean you have to be cautious, but I am being cautious, or as cautious as I can be given my desire to be sailing and not dealing with a flood of gunk from so-called internet experts.

Discussion (0)

No comments yet. Sign in and be the first to say something.

1 · Background: The rsync Outrage

2 · Executive Summary

3 · The Metric

How commits are assigned to releases

How bugs are found and assigned to releases

Severity scoring

Why the release is the unit of analysis

Why bugs and commits?

What this approach does not do

4 · Results

Claude Releases

v3.4.2

v3.4.3

Exact Permutation Test

Fisher's Exact Test

The Distribution

Commit Rate

Regime Check

The Pre-Claude Outlier

All Releases (chronological)

5 · What the Data Is Consistent And Inconsistent With

Discussion, and Tridge's Response

Discussion (0)

More from Hacker News — AI on Front Page