Last week I got access to the GPT-3 API beta. Like many developers, I’d been watching the demos circulating on Twitter with a mixture of fascination and skepticism — GPT-3 generating working React components from natural language descriptions, writing SQL queries from plain English, and producing surprisingly coherent essays on arbitrary topics. The demos looked almost too good. So I spent the past week testing it systematically, and I want to share what I’ve found beyond the cherry-picked screenshots.
OpenAI is offering API access to several model sizes: davinci (the full 175-billion-parameter model), curie, babbage, and ada (progressively smaller and faster). Most of the impressive demos use davinci. The API itself is straightforward — you send a text prompt and get a completion back. The magic, as it turns out, is entirely in how you construct the prompt.
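For concreteness, here is roughly what a call looks like with the Python client OpenAI ships for the beta. The prompt and parameter values are illustrative; treat this as a minimal sketch rather than a recommended configuration.

```python
import os
import openai

# The beta Python client reads the secret key from a module-level attribute.
openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    engine="davinci",        # or "curie", "babbage", "ada"
    prompt="Write a one-sentence summary of what an HTTP 404 status code means.",
    max_tokens=64,           # upper bound on the length of the completion
    temperature=0.3,         # lower values make the output more deterministic
    stop=["\n\n"],           # stop generating at the first blank line
)

print(response.choices[0].text.strip())
```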
## What Actually Works Well
Code generation from descriptions is genuinely useful, with caveats. I gave GPT-3 prompts like “Write a Python function that takes a list of dictionaries and returns a new list sorted by the ‘date’ key in descending order” and consistently got working code back. For straightforward utility functions — string manipulation, data transformation, simple algorithms — it’s remarkably reliable.
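For reference, the completions I get back for that sorting prompt usually amount to something like the following (lightly cleaned up; the function name varies from run to run):

```python
def sort_by_date_desc(records):
    """Return a new list of dicts sorted by the 'date' key, newest first."""
    return sorted(records, key=lambda record: record["date"], reverse=True)
```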
Where it gets interesting is prompt engineering. If you provide a few examples of input-output pairs (what OpenAI calls “few-shot learning”), the accuracy improves dramatically. Instead of just describing what you want, you show the model two or three examples, and it extrapolates the pattern. For instance, show it three examples of a natural language query and the corresponding SQL, and it can generate SQL for a fourth query with surprising accuracy.
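Here is a sketch of the kind of few-shot prompt I mean. The orders table and its columns are made up for illustration; the point is the structure: a short instruction, a few worked examples, and then the real question left for the model to complete.

```python
FEW_SHOT_SQL_PROMPT = """\
Translate each question into a SQL query against the orders table
(columns: id, customer_name, total, created_at).

Q: How many orders were placed yesterday?
SQL: SELECT COUNT(*) FROM orders WHERE created_at::date = CURRENT_DATE - 1;

Q: What is the average order total?
SQL: SELECT AVG(total) FROM orders;

Q: Who are the ten biggest customers by total spend?
SQL: SELECT customer_name, SUM(total) AS spend FROM orders GROUP BY customer_name ORDER BY spend DESC LIMIT 10;

Q: {question}
SQL:"""
```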
Text transformation is another strong suit. I’ve been using it to condense verbose documentation into concise summaries, convert data between formats (JSON to YAML, XML to JSON), and generate documentation from code comments. These are tasks that are tedious for humans but well within GPT-3’s capabilities.
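As an example of how I frame these transformations, the prompt below gives one worked JSON-to-YAML conversion and leaves the real input for the model to finish. The sample document is invented; the pattern is what matters.

```python
def json_to_yaml_prompt(json_input: str) -> str:
    """Build a one-shot prompt: a worked JSON-to-YAML example, then the real input."""
    example = (
        "Convert the JSON document to YAML.\n"
        "\n"
        'JSON:\n{"name": "widget", "tags": ["a", "b"], "price": 9.5}\n'
        "YAML:\nname: widget\ntags:\n  - a\n  - b\nprice: 9.5\n"
        "\n"
    )
    return example + "JSON:\n" + json_input + "\nYAML:\n"
```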
Natural language interfaces feel more achievable now than they ever have. Building a system where users type “show me all orders from last month over $100 sorted by date” and the system translates that to a database query is no longer a research project — it’s an API call with some prompt engineering.
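End to end, that "API call with some prompt engineering" might look like the sketch below, reusing the few-shot SQL template from above. It assumes the client is configured as in the first snippet, and the generated SQL should of course only ever run against a read-only connection.

```python
import openai  # assumes openai.api_key is already set, as in the first snippet

def question_to_sql(question: str) -> str:
    """Translate a natural-language question into SQL via the few-shot prompt above."""
    response = openai.Completion.create(
        engine="davinci",
        prompt=FEW_SHOT_SQL_PROMPT.format(question=question),
        max_tokens=150,
        temperature=0.0,       # we want the most likely SQL, not creative SQL
        stop=["\n\nQ:"],       # stop before the model invents another example
    )
    return response.choices[0].text.strip()

# e.g. question_to_sql("Show me all orders from last month over $100 sorted by date")
```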
## Where It Falls Down
Factual accuracy is a serious problem. GPT-3 generates plausible-sounding text that is frequently wrong. I asked it to explain specific library APIs, and it invented functions that don’t exist. I asked about historical events and got dates wrong. It confidently states incorrect information with the same tone it uses for correct information. There’s no uncertainty signal.
This isn’t a minor limitation — it fundamentally constrains the use cases. You cannot use GPT-3 as a knowledge base. You cannot trust its output without verification. For code generation, this means the generated code must be tested, not just skimmed. For text generation, every factual claim needs checking.
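In practice I've settled on not even reading generated code until it passes a couple of tests I wrote myself. A minimal sketch of that check, assuming the prompt asked for a function named sort_by_date_desc as in the earlier example:

```python
def passes_smoke_tests(source: str, test_cases) -> bool:
    """Exec generated code in a scratch namespace and compare it against known answers."""
    namespace = {}
    try:
        exec(source, namespace)                  # only ever in a throwaway sandbox
        func = namespace["sort_by_date_desc"]    # the name requested in the prompt
        return all(func(given) == expected for given, expected in test_cases)
    except Exception:
        return False
```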
Consistency over long outputs degrades. For short completions (a paragraph, a function), GPT-3 is coherent. For longer outputs, it starts to contradict itself, repeat phrases, or drift off-topic. The model has no persistent memory — each API call is stateless, and while you can include previous context in the prompt, you’re limited by the token window (currently 2,048 tokens, shared between the prompt and the completion).
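Managing that window becomes the application's job. The sketch below is roughly how I keep prompts under the limit; the four-characters-per-token figure is a rule of thumb, not the real tokenizer, so leave yourself a margin.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token for English text."""
    return len(text) // 4

def trim_to_token_budget(chunks, budget=2048, reserved_for_completion=256):
    """Keep the most recent context chunks that fit under the token budget."""
    kept, used = [], reserved_for_completion
    for chunk in reversed(chunks):    # walk from newest to oldest
        used += estimate_tokens(chunk)
        if used > budget:
            break
        kept.append(chunk)
    return "\n".join(reversed(kept))
```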
Cost is non-trivial. Davinci, the most capable model, costs $0.06 per 1,000 tokens (roughly 750 words). For interactive applications where each user query might consume 500-1000 tokens in prompt and completion, the per-query cost adds up quickly. The smaller models are much cheaper but notably less capable. Finding the right model-cost trade-off for production use will be an engineering challenge.
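The back-of-the-envelope math is worth doing before committing to a feature. At davinci's listed price, a moderately busy internal tool is already real money:

```python
DAVINCI_PRICE_PER_1K_TOKENS = 0.06  # USD, prompt and completion tokens combined

def monthly_cost(tokens_per_query: int, queries_per_day: int) -> float:
    """Rough monthly davinci cost for an interactive feature."""
    per_query = tokens_per_query / 1000 * DAVINCI_PRICE_PER_1K_TOKENS
    return per_query * queries_per_day * 30

print(monthly_cost(800, 5_000))   # 800-token queries, 5,000/day -> $7,200/month
```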
## The Prompt Engineering Discipline
What strikes me most about working with GPT-3 is that effective use requires a new skill that doesn’t map neatly onto existing engineering disciplines. Prompt engineering — crafting the input text to reliably produce the desired output — is part copywriting, part programming, and part empirical science.
A naive prompt like “Write a Python web scraper” produces mediocre results. A well-crafted prompt that specifies the library to use, provides an example of the desired output format, and includes constraints (“handle pagination, use rate limiting, log errors to stderr”) produces dramatically better code. The difference between a good prompt and a bad one can be the difference between a useful tool and a party trick.
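To make that concrete, here is the shape of the difference. The specific site and constraints are invented, but prompts built like the second one consistently produce code I can actually run:

```python
NAIVE_PROMPT = "Write a Python web scraper."

BETTER_PROMPT = """\
Write a Python script using requests and BeautifulSoup that:
- fetches every page of https://example.com/articles?page=N until a page has no articles
- sleeps one second between requests (rate limiting)
- logs errors to stderr and skips the failing page instead of crashing
- prints one JSON object per article with keys: title, url, published_at
Return only the code, with no explanation.
"""
```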
I’ve started maintaining a library of effective prompts — templates for different tasks that I can adapt. This feels like the early days of SQL or regex: a skill that starts as arcane knowledge and gradually becomes a standard part of the developer toolkit.
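My "library" is nothing fancier than a dict of format strings, but it already pays for itself. Something along these lines, with task names of my own invention:

```python
PROMPT_LIBRARY = {
    "summarize": "Summarize the following documentation in three bullet points:\n\n{text}\n\nSummary:",
    "docstring": "Write a concise docstring for this function:\n\n{code}\n\nDocstring:",
    "nl_to_sql": FEW_SHOT_SQL_PROMPT,   # the few-shot template from earlier
}

def render_prompt(task: str, **fields) -> str:
    """Fill a stored prompt template with task-specific fields."""
    return PROMPT_LIBRARY[task].format(**fields)
```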
## Implications for Software Development
I don’t think GPT-3 is going to replace developers. But I do think it’s going to change how we work. Here are the near-term applications I’m most excited about:
Code scaffolding: generating boilerplate code, tests, and documentation from high-level descriptions. Not replacing the thinking, but eliminating the typing (there's a sketch of what I mean after this list).
Internal tools: building natural language interfaces for databases and APIs that non-technical team members can use without learning SQL or API syntax.
Data transformation: converting between formats, generating sample data, and building migration scripts from examples rather than specifications.
Learning aid: explaining unfamiliar code, suggesting improvements, and answering “how do I do X in language Y” questions with working examples.
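For the scaffolding case in particular, here is a rough sketch of what I have in mind. It assumes the client setup from the first snippet, and the generated tests are a starting point to edit, not a safety net:

```python
import inspect
import openai  # assumes openai.api_key is already set, as in the first snippet

def scaffold_tests(func) -> str:
    """Ask the model for pytest boilerplate covering a function's source and docstring."""
    prompt = (
        "Write pytest test functions for the following Python function. "
        "Cover the normal case and one edge case. Return only code.\n\n"
        + inspect.getsource(func)
        + "\n# tests\n"
    )
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=300,
        temperature=0.2,
    )
    return response.choices[0].text
```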
## My Take
GPT-3 is the most impressive language model I’ve worked with, and it’s not close. The jump from GPT-2 to GPT-3 is qualitatively different — it’s not just better at the same tasks, it can do tasks that GPT-2 simply couldn’t.
But the hype is outrunning the reality. The Twitter demos showing GPT-3 building entire applications from a one-sentence description are cherry-picked best cases, and they omit the many failed attempts that preceded the screenshot-worthy success. In practice, GPT-3 is a powerful but unreliable tool that requires careful prompt design, output validation, and realistic expectations.
The API pricing also signals that this is a premium tool, not a utility. For high-value use cases where the cost per query is justified — code generation in an IDE, natural language database queries for business users, content summarization — GPT-3 can deliver real value today. For high-volume, low-margin applications, the economics don’t yet work.
I’m going to keep experimenting. There’s something genuinely exciting about a tool that can understand and generate code and natural language with this level of fluency. But I’m keeping my expectations grounded in what I’ve actually tested, not what the demos promise.
