Opinion · February 6, 2026

The Case Against Python for AI-Assisted Development

Python vs. Reality — Part 2 of 3
28 min read

Why the Language of AI May Be the Worst Choice for AI-Assisted Coding

Applied Minds AI · February 2026 · 18 citations

TL;DR — The 60-second version

Python’s training data has the worst signal-to-noise ratio of any major language. The massive corpus of Jupyter notebooks, tutorials, and beginner code has trained AI to generate experimental-quality output. Anthropic built Claude Code in TypeScript. The feedback loop is making it worse: Nature published experimental evidence that AI models collapse when trained on recursively generated data.

The fix: Choose TypeScript for AI-assisted projects. If you must use Python, enforce type hints and review every line. And we need a new default tier — Foundation Standard — code with brakes and seatbelts, even if it doesn’t have AC yet.

Want just the hot take? Read the 4-minute version.

A note on epistemology

The central claim — that Python’s training corpus has a worse signal-to-noise ratio than other languages — is inferential, not proven. No study directly measures this. This argument assembles circumstantial evidence: defect studies, corporate technology choices, corpus composition analysis, and documented production failures. In an era of provocative titles and engagement-driven discourse, intellectual honesty requires acknowledging when we are building a case rather than citing settled science. This is a hypothesis with substantial supporting evidence, not a proven theorem.

Abstract

Python dominates artificial intelligence and machine learning development. This dominance has created a pervasive assumption: if Python powers AI, it must be the optimal language for AI-assisted software development. This paper argues the opposite. Python’s characteristics that make it suitable for ML research — permissiveness, rapid prototyping, and low ceremony — are precisely what make it unsuitable for AI-generated production code. Furthermore, the massive corpus of experimental, tutorial, and beginner Python code in training data has poisoned AI code generation, creating a self-reinforcing cycle of mediocrity.

🎬
Video: The Python Training Data Problem (Animated Explainer)
Placeholder — add video embed here

The Core Thesis

Python is one of the worst languages for AI-assisted code generation. This is not a claim about Python’s utility as a programming language — it remains excellent for its intended purposes. Rather, this argument concerns the intersection of three factors: training data quality, language characteristics, and the emerging population of developers relying on AI assistance.

The argument proceeds through four interconnected claims:

  1. Python’s training data has the worst signal-to-noise ratio of any major language
  2. The conflation of AI/ML development with production software engineering represents a category error
  3. Python’s permissiveness removes the guardrails inexperienced developers most need
  4. A feedback loop now exists where AI-generated Python enters training data, compounding quality degradation

This paper is for engineers, vibecoders, and AI enthusiasts who want to understand why AI-generated code often disappoints, and what might be done about it. The diagnosis occupies the first half; the prescriptions — data curation, constrained generation, and a proposed “Foundation Standard” tier of code quality — occupy the second. Readers primarily interested in solutions may skip to “The Data Curation Solution.”

The Training Data Problem

Volume Does Not Equal Quality

Python has more code on GitHub than any other language. This apparent advantage conceals a critical problem: the ratio of production-quality code to experimental, tutorial, and beginner code is dramatically lower than in other languages.

IMAGE — Error Propagation Comparison. Diagram: Python’s error propagation — a None return silently cascades through three function calls before causing a runtime error in production, vs. Go’s explicit error return at each step.

IMAGE — AI Code Quality by Language. Benchmark chart: AI-generated code quality by language — defect rates, type error rates, and runtime failure rates across Python, Go, Rust, and TypeScript.
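The failure mode in the error-propagation diagram is easy to reproduce. A minimal sketch with hypothetical functions, relying only on CPython's default behavior of passing None along silently:

```python
def find_user(user_id: int):
    """Hypothetical lookup: returns None for a missing user instead of raising."""
    users = {1: "alice@example.com"}
    return users.get(user_id)  # None slips out silently

def normalize(email):
    # Call 2: forwards None without noticing anything is wrong
    return email.strip() if email else email

def send_welcome(email):
    # Call 3: the failure finally surfaces here, far from its cause
    return f"Sent to {email.lower()}"

try:
    send_welcome(normalize(find_user(42)))  # user 42 does not exist
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'lower'
```

In Go, each call in that chain would return an explicit error value the caller is forced to confront at the call site; here the defect travels two stack frames before anything complains.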

Consider what constitutes the Python corpus. Jupyter notebooks dominate — experimental code optimized for exploration, not production. Stack Overflow snippets abound — solutions that work once, stripped of context. Tutorial code proliferates — written to teach concepts, not to be maintained. University assignments accumulate — first attempts by learners. Quick scripts multiply — hacks that solved immediate problems with no thought to longevity.

When an AI model learns from this corpus, it learns the median Python programmer’s output. That median is dramatically lower than the median for languages with higher barriers to entry.

Empirical Evidence

A 2017 study from UC Davis (Ray et al., Communications of the ACM) analyzed 728 GitHub projects across 17 languages, examining 63 million lines of code and 1.5 million commits to understand the relationship between language choice and defect rates.

Languages with negative defect coefficients — meaning fewer defects than average — included Clojure, Haskell, Ruby, Scala, and TypeScript. Languages with positive coefficients — more defects than average — included C, C++, Objective-C, PHP, and Python. The study further found that functional languages outperformed procedural languages, static typing outperformed dynamic typing, and disallowing implicit type conversion outperformed allowing it. Python falls on the wrong side of each of these divides.

Important caveat

This study has been contested. Berger et al. (TOPLAS 2019) reproduced the analysis and identified methodological concerns. The scientific consensus is that language effects are “significant but modest” — language choice matters less than process factors like team size, commit size, and project complexity. However, even modest effects compound when AI generates millions of lines of code.

The Anthropic Evidence

Anthropic’s Claude Code — their own AI coding assistant — is written in TypeScript. According to Gergely Orosz’s reporting in The Pragmatic Engineer (September 2025), Anthropic chose TypeScript because they “wanted an ‘on distribution’ tech stack for Claude that it was already good at.” The result: ninety percent of Claude Code’s codebase was written by Claude itself.

Multiple factors influenced this choice. Python would be inappropriate for shipping a command-line tool as a consumer product — distribution, packaging, and runtime dependencies make Python poorly suited for end-user applications. But Anthropic explicitly cited typing as a reason: they wanted a stack the model was already proficient with, “which we didn’t need to teach.”

This is not definitive evidence that TypeScript produces better AI-generated code than Python. It is one data point suggesting that a company with deep insight into model capabilities chose a statically-typed language for an AI-assisted project.

They did not choose Python despite Python’s dominance in AI/ML.

The AI/ML Conflation Fallacy

The Category Error

The most common justification for using Python in AI-assisted development is that Python is “the language of AI.” This reasoning contains a category error. The tool for building a thing is not necessarily the tool for using that thing.

Python dominates AI/ML for legitimate reasons: the NumPy/PyTorch/TensorFlow ecosystem, rapid iteration for experiments, effectiveness as a glue language for C/CUDA backends, notebook support for visualization, and tolerance for disposable scripts. None of these reasons apply to production software development.

Divergent Requirements

ML research environments are protected and attended. A researcher watches experiments run. Production software runs unattended at 3am on Saturday. ML experiments are disposable. Production code lives for years. ML infrastructure is overpowered — Python’s performance overhead is rounding error when you have 8x A100 GPUs at $30,000 each. Production software pays for every CPU cycle in cloud bills.

Security Evidence: AI-Generated Code in Production

  • 40% — Copilot code with security vulnerabilities (Pearce et al., 2021; Copilot updated since)
  • 29.5% — Python GitHub snippets with security weaknesses (Fu et al., TOSEM 2024)
  • 10.5% — AI solutions that are both correct and secure (Carnegie Mellon SusVibes, 2025)
  • 40–62% — AI code with security flaws, aggregate (NYU + BaxBench synthesis)

Vibe Coding Disasters

SusVibes Benchmark (Carnegie Mellon, December 2025): Researchers evaluated AI coding agents on 200 real-world feature requests. While 61% of solutions were functionally correct, only 10.5% were secure. “Agents frequently achieve functional correctness yet fail security checks on the same tasks.”

Palo Alto Unit 42 (January 2026): Security researchers documented “real-life catastrophic failures” including a sales lead application breached because the vibe coding agent “neglected to incorporate key security controls.”

Replit Incident: SaaStr’s Jason Lemkin documented an AI agent that “started lying about unit tests, ignored code freezes, and eventually deleted the entire production database.”

Counterpoint acknowledged

Some studies suggest AI doesn’t introduce new risks. Sandoval et al. found that AI-assisted programmers introduced critical security bugs at a rate no more than 10% higher than the control group. But this supports the thesis differently: AI doesn’t create novel vulnerabilities — it replicates and scales existing problems at machine speed. The problem isn’t that AI is worse than humans; it’s that AI scales human mediocrity to unprecedented volume.

The Feedback Loop: Model Collapse

Since 2020, AI code generation has exploded. Billions of lines of AI-generated code have been committed to repositories and now enter training datasets for the next generation of models.

AI models collapse when trained on recursively generated data. Learning from data produced by other models causes model collapse — a degenerative process whereby models forget the true underlying data distribution.

Shumailov et al., Nature, July 2024

Even preserving 10% of the original human-authored data only slows collapse; it does not prevent it. Independent replication by Borji (October 2024) confirmed: “the outcomes reported are a statistical phenomenon and may be unavoidable.”
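The mechanism is easy to caricature numerically. The following toy simulation (an illustrative sketch, not the Nature paper's experiment) fits each "generation" only to samples drawn from the previous generation's fitted distribution, so estimation error compounds instead of averaging out:

```python
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0  # generation 0: the "true" human distribution
history = [sigma]
for generation in range(30):
    # Each new model sees only data generated by its predecessor
    samples = [random.gauss(mu, sigma) for _ in range(25)]
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    history.append(sigma)

# The fitted spread performs a drifting random walk away from the truth;
# the tails of the original distribution are progressively forgotten.
print(f"generation 0 sigma: {history[0]}, generation 30 sigma: {sigma:.3f}")
```

With only 25 samples per generation, the estimated parameters never recover the original distribution; each generation inherits and amplifies its predecessor's sampling error, which is the core of the collapse argument.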

Structural Defenses

Some languages have structural defenses against training data pollution. Rust’s compiler rejects code with memory errors — such code never becomes “working” code in training data. Go’s formatter enforces consistent style. Python lacks these structural defenses. Bad Python code runs. It produces results. It gets committed. It enters training data.
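The asymmetry is concrete. A defect that Go's or Rust's compiler would reject at build time runs happily in Python until the broken path executes (hypothetical example):

```python
def summarize(records):
    if records:
        return f"{len(records)} records"
    # Defect: misspelled variable on the rarely-taken branch. CPython only
    # notices when this line actually executes; a compiler rejects it outright.
    return f"{len(recrods)} records"

print(summarize([1, 2, 3]))  # "3 records" -- runs, gets committed, gets scraped
```

The happy path works, so the file lands in a repository as "working" Python, and from there into a training corpus.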

The Type System Advantage

Type-Constrained Code Generation (2025): Research on TypeScript found that type-constrained decoding “reduces compilation errors by more than half and increases functional correctness relatively by 3.5% to 5.5%.” For repair tasks, it enhanced correct repair “relatively by 37% on average.”

Error distribution (ICSE 2025): Most incorrect code solutions are “compilable and runnable without any compilation errors.” This is worse for Python precisely because errors only surface at runtime — often in production.

Python type hints exist, but…

Type hints are optional, not required. The training corpus contains Python overwhelmingly without consistent type hints. Code with incorrect type hints still runs. If Python’s training corpus were predominantly well-typed code with consistent mypy enforcement, this counterargument would have force. But that corpus does not exist at scale. Type hints remain a what-if: theoretically helpful, practically underutilized, and largely absent from the data AI models learned from.
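A one-liner makes the point: CPython treats annotations as metadata, not contracts, so even a wrong hint executes cleanly. A static checker such as mypy would flag this, but only if someone runs one:

```python
def total_price(quantity: int, unit_price: int) -> str:
    # The return hint is simply wrong: this returns an int, not a str.
    return quantity * unit_price

# Runs without complaint; the bad hint now sits in a repository as "working" code.
result = total_price(3, 7)
print(result, type(result).__name__)  # 21 int
```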

Solution Quality: Beyond Bug-Free Code

The Context Mismatch Problem

AI tools don’t merely generate code — they generate solutions. The training corpus encodes approaches to problems, architectural decisions, and assumptions about deployment context.

In a Jupyter notebook: inline credentials, no validation, synchronous processing, no logging. All appropriate for research. Deploy that to serve 10 million users? Every characteristic becomes a vulnerability or bottleneck.

The code is not buggy. It runs correctly. It produces correct output. Static analysis finds no issues. But the solution is catastrophically wrong for its deployment context.

The model cannot distinguish context from code. It learned that Python solutions typically look a certain way. It reproduces that pattern. The pattern happens to assume a context that doesn’t match production deployment.

The Economic Cost

Python may be more expensive for AI-assisted development than languages with smaller but higher-quality training corpora. More data means more ways to be wrong before being right. When the probability distribution assigns high probability to solutions that will fail in context, the model burns tokens exploring dead-end paths.

No formal study yet measures tokens-per-successful-solution across languages, so this economic argument rests on observation rather than measurement. But the pattern is visible to anyone who has used AI coding tools extensively. Watch an AI assistant struggle: generating, failing, revising, circling back. What might take minutes with a good first attempt takes hours when the model keeps sampling from inappropriate patterns.

🎨
Infographic: The Vibecoder Trap — From “AI suggests Python” to Production Disaster
Placeholder — add infographic here

The Vibecoder Cascade

The vibecoder chooses Python because AI suggests it. They receive experimental-quality code. They lack the experience to recognize its deficiencies. The code runs, so it must be correct. They ship it.

What ships: a Flask endpoint with minimal error handling, SQL injection vulnerabilities, missing authentication, unpinned dependencies. Python’s permissiveness ensures nothing stops this code from reaching production.
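The SQL-injection item above is the canonical case. Here is a hypothetical sketch against an in-memory SQLite database, contrasting the f-string pattern AI tools frequently emit with the parameterized query that should ship:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret'), ('bob', 'hunter2')")

def lookup_unsafe(name: str):
    # The pattern tutorials teach and models reproduce: interpolation into SQL
    return conn.execute(
        f"SELECT secret FROM users WHERE name = '{name}'"
    ).fetchall()

def lookup_safe(name: str):
    # Parameterized query: the input can no longer alter the statement
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"
print(lookup_unsafe(payload))  # every row leaks
print(lookup_safe(payload))    # [] -- treated as a literal name
```

Both versions run, both return results for well-behaved input, and nothing in the language distinguishes them; only the attacker does.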

This is not a problem with a passive solution. Every commit becomes part of the corpus, and an increasing percentage of those commits are AI-generated. AI companies must account for the feedback loop, predict its trajectory, and prevent corpus degradation through selective training data curation. Quality weighting, filtering heuristics, and detection of AI-generated code in training pipelines are not optional refinements. They are existential necessities.

Addressing Criticisms

“Python works fine for me” — Survivorship bias. You unconsciously correct AI output. The argument concerns developers who can’t.

“All languages have bad code” — True, but the ratio matters. Python’s signal-to-noise ratio is worse.

“AI tools are improving” — Capabilities improve, but training data quality may be declining due to the feedback loop.

“Go and Rust aren’t proven yet” — Fair. The argument rests on structural properties, not empirical AI generation studies.

“Popularity will ruin any language” — Perhaps the strongest criticism. But languages differ in structural resistance to degradation. Rust’s compiler rejects bad code before it enters training data. Python’s lesson: permissiveness plus popularity equals corpus pollution.

Limitations and Gaps

No direct training data quality study. No study directly measures “training data signal-to-noise ratio by language.” The connection is plausible but not proven.

UC Davis study contested. Effects are modest, not dramatic. Process factors dominate language factors.

TypeScript prediction speculative. Structural defenses (the compiler) may prevent TypeScript from following Python’s trajectory.

Rust absence. Rust was not in the UC Davis study. Arguments about Rust are theoretical only.

The Data Curation Solution

  • 130x — fewer parameters, same performance (Phi-1 vs. GPT-3.5 on coding) — Microsoft Research, 2023
  • 5–10% — of data needed with cherry selection — NAACL 2024

Microsoft’s Phi models proved that curated “textbook quality” data beats massive undifferentiated corpora. Phi-1 (1.3B parameters) outperformed GPT-3.5 (175B) on coding benchmarks — a specialized model 130 times smaller. The caveat: Phi-1 is specialized for Python coding, not general-purpose tasks. But it demonstrates the leverage of data quality.

But curation doesn’t fully solve it. Most commercial tools don’t employ aggressive curation — it’s expensive. General-purpose models need broad coverage, making aggressive filtering risky. And curation cannot add what doesn’t exist: if production-quality Python with proper error handling is a small fraction of the corpus, no filtering creates more of it.
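What might a filtering heuristic look like in practice? A toy scorer (purely illustrative, not any vendor's actual pipeline) that rewards signals production code tends to carry — type hints, docstrings, error handling:

```python
import ast

def quality_score(source: str) -> float:
    """Toy heuristic: reward type hints, docstrings, and error handling."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0  # unparseable code is filtered outright
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not funcs:
        return 0.0
    hinted = sum(
        1 for f in funcs
        if f.returns is not None or any(a.annotation for a in f.args.args)
    )
    documented = sum(1 for f in funcs if ast.get_docstring(f))
    handles_errors = any(isinstance(n, ast.Try) for n in ast.walk(tree))
    score = (hinted + documented) / (2 * len(funcs))
    return score + (0.2 if handles_errors else 0.0)

notebook_style = "def f(x):\n    return x * 2\n"
typed_style = 'def f(x: int) -> int:\n    """Double x."""\n    return x * 2\n'
print(quality_score(notebook_style) < quality_score(typed_style))  # True
```

A real pipeline would combine dozens of such signals with repository metadata and deduplication. The point is only that these signals are cheap to compute — and that no scorer can surface production-quality Python that the corpus does not contain.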

A Concrete Recommendation: TypeScript as Default

For vibecoders starting new projects, the evidence increasingly points to TypeScript:

Idan Gazit, Head of GitHub Next (Copilot team)

“Statically typed languages give you guardrails. If an AI tool is going to generate code for me, I want a fast way to know whether that code is correct. Explicit types give me that safety net.”

GitHub’s Octoverse 2025 report documents TypeScript overtaking both Python and JavaScript to become the most-used language on GitHub, with 66% year-over-year growth — the most significant language movement in more than a decade.

TypeScript’s training corpus has structural advantages: small, isolated, self-contained solutions with type annotations that provide semantic information about intent. The types are not just constraints — they are documentation that helps the model understand what the code is supposed to do.

The feedback loop is already operating. AI performs better on typed languages. Developers notice. They choose typed languages for new projects. More typed code enters the corpus. AI improves further. As Gazit observes: “The more teams rely on AI assistance, the more language choice becomes an AI-compatibility decision.”

What about the solution for AI tools themselves?

The language recommendation is Part 1. Part 2 is changing how AI tools generate code: constrained generation, Foundation Standard defaults, and inverting the burden of knowledge.

Read Part 3: Foundation Standard & Airbags →

References

  • Ray, B., et al. (2017). A large-scale study of programming languages and code quality in GitHub. Communications of the ACM, 60(10), 91-100.
  • Berger, E., et al. (2019). On the Impact of Programming Languages on Code Quality. ACM TOPLAS.
  • Shumailov, I., et al. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755-759.
  • Fu, Y., et al. (2024). Security Weaknesses of Copilot-Generated Code. ACM TOSEM.
  • Pearce, H., et al. (2022). Asleep at the Keyboard? IEEE S&P.
  • Zhao, S., et al. (2025). Is Vibe Coding Safe? arXiv:2512.03262.
  • Wang, Z., et al. (2025). Code Generation Errors by LLMs. ICSE ’25.
  • Orosz, G. (2025). How Claude Code is built. The Pragmatic Engineer.
  • Palo Alto Networks Unit 42. (2026). Securing Vibe Coding Tools.
  • Gunasekar, S., et al. (2023). Textbooks Are All You Need. arXiv:2306.11644.
  • Microsoft Research. (2024). Phi-3 Technical Report. arXiv:2404.14219.
  • Li, R., et al. (2023). StarCoder. arXiv:2305.06161.
  • Lozhkov, A., et al. (2024). StarCoder 2 and The Stack v2. arXiv:2402.19173.
  • Kim, J., et al. (2024). DataRecipe. ICLR 2024.
  • Li, M., et al. (2024). Cherry Data Selection. NAACL 2024.
  • Training Data Optimization for Code Generation. (2025). ACM TOSEM.
  • GitHub. (2025). Octoverse 2025: AI leads TypeScript to #1. The GitHub Blog.
  • Gazit, I. (2025). TypeScript, Python, and the AI feedback loop. The GitHub Blog.