Meta Releases Llama 2 Open Source: What It Means for Engineering Teams

By Marc Molas·July 31, 2023·9 min read

On July 18, 2023, Meta released Llama 2 -- a family of large language models available for both research and commercial use. The release includes models at 7B, 13B, and 70B parameters, pre-trained and fine-tuned for chat, with a license that permits commercial deployment. This is the first time a model competitive with GPT-3.5 has been available for anyone to download, run, and modify without paying per token.

For engineering teams building AI-powered products, this fundamentally changes the decision landscape. The question I'm hearing from technical founders is no longer "can we access a good LLM?" -- it's "should we run our own?" My answer: probably not yet, but for the first time it's a real question, and here's how I'd think it through.

What Llama 2 Actually Is

Llama 2 is a collection of transformer-based language models trained on 2 trillion tokens of publicly available data. The 7B model can run on a single GPU. The 70B model requires multiple GPUs; Meta's own evaluation puts it close to GPT-3.5 on benchmarks such as MMLU and GSM8K, with a clear gap on coding tasks.

What matters for engineering teams:

Commercial license. Unlike the original Llama, Llama 2 can be used in commercial products. Companies with more than 700 million monthly active users must request a separate license from Meta, and everyone is bound by an acceptable use policy -- for startups, the license is effectively open.
Chat-optimized variants. Meta released both base models and fine-tuned chat models trained with RLHF. No need to fine-tune from scratch for conversational use cases.
Available everywhere. Hugging Face, Microsoft Azure, direct download. The barrier is your hardware, not a waitlist.

Build vs. Buy: The New Calculation

Until now, the AI decision for most startups was simple: use the OpenAI API. GPT-3.5 and GPT-4 are good, the API is easy, and running your own models was impractical without dedicated ML engineers and GPU infrastructure.

Llama 2 adds a third option: run your own model. Here's when each path makes sense.

When the OpenAI API is still right

You're prototyping. Don't build infrastructure to test whether an AI feature adds value. Call the API, validate, iterate.
You need GPT-4 quality. Llama 2 70B competes with GPT-3.5, not GPT-4. If you need GPT-4 reasoning, the API is still your best option.
Your volume is low. A few hundred daily API calls cost almost nothing. The break-even for your own infrastructure starts in the thousands of daily requests.

When running Llama 2 makes sense

Data privacy is non-negotiable. This is the biggest driver. When you call OpenAI's API, your data passes through their servers. For healthcare, legal, financial, or any domain with strict regulations, that's a problem. With Llama 2 running on infrastructure you control, user data never leaves your environment. For European companies navigating GDPR, keeping personal data in-house simplifies compliance for many use cases and is effectively required for some.
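You need fine-tuning control. OpenAI's fine-tuning is limited: at the time of writing it covers only the older GPT-3 base models, not GPT-3.5 Turbo or GPT-4. With Llama 2, you fine-tune on your domain data with full control over the weights and the training process. Medical terminology, legal documents, industry jargon -- an open model gives you far more flexibility.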
Cost at scale. API costs scale linearly. Your own infrastructure has high fixed costs but low marginal costs. A single A100 GPU running Llama 2 7B handles significant throughput at a fixed monthly rate versus per-token pricing.
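
To make the cost-at-scale argument concrete, here is a back-of-the-envelope break-even sketch. Everything numeric in it is an assumption for illustration: the per-token price, the tokens per request, and the GPU hourly rate are placeholders, not quoted prices. It also ignores engineering time and assumes one GPU can actually absorb the traffic, so treat the result as a rough lower bound on the volume you need.

```python
def monthly_api_cost(requests_per_day, tokens_per_request,
                     price_per_1k_tokens, days=30):
    """Per-token pricing: cost grows linearly with traffic."""
    total_tokens = requests_per_day * days * tokens_per_request
    return total_tokens / 1000 * price_per_1k_tokens


def monthly_gpu_cost(hourly_rate, gpus=1, days=30):
    """Fixed cost of keeping GPUs running around the clock."""
    return hourly_rate * 24 * days * gpus


def break_even_requests_per_day(tokens_per_request, price_per_1k_tokens,
                                hourly_rate, gpus=1):
    """Daily request volume at which API and GPU costs are equal."""
    api_cost_per_request = tokens_per_request / 1000 * price_per_1k_tokens
    gpu_cost_per_day = hourly_rate * 24 * gpus
    return gpu_cost_per_day / api_cost_per_request


if __name__ == "__main__":
    # Hypothetical inputs: plug in your own measured numbers.
    volume = break_even_requests_per_day(
        tokens_per_request=1000,    # assumed average prompt + completion
        price_per_1k_tokens=0.002,  # assumed API price in USD
        hourly_rate=3.0,            # assumed cloud GPU price in USD
    )
    print(f"Break-even at roughly {volume:,.0f} requests/day")
```

The answer swings a lot with the inputs: longer prompts or a pricier API tier pull the break-even down, while idle GPU hours push it up.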

The Engineering Reality

Running your own LLM is not trivial. The marketing makes it sound like you download a model and you're in production. The reality:

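Infrastructure. Llama 2 7B needs ~14GB of GPU VRAM for its weights alone in 16-bit precision (7 billion parameters at 2 bytes each), plus headroom for the KV cache and activations. The 70B needs multiple GPUs. Cloud A100 instances run $2-4/hour. Quantized model versions (8-bit or 4-bit weights) reduce memory requirements at a small quality cost.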
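
The memory figures follow from simple arithmetic: parameters times bytes per parameter. The sketch below estimates weight memory only. The precision table is a simplification I'm assuming for illustration; real quantized formats carry some extra metadata, and the KV cache, activations, and framework overhead come on top, so read these as lower bounds.

```python
# Bytes needed to store one parameter at each precision (simplified).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion, precision="fp16"):
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (7, 13, 70):  # Llama 2 model sizes, in billions of parameters
        row = ", ".join(
            f"{precision}: {weight_memory_gb(size, precision):.1f} GB"
            for precision in ("fp16", "int8", "int4")
        )
        print(f"{size}B -> {row}")
```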

Model serving. You need a layer that handles concurrent requests, manages GPU memory, and batches efficiently. Tools like vLLM and Hugging Face's text-generation-inference handle this, but they require engineers who understand the inference stack.
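
To give a feel for what "batches efficiently" involves, here is a toy batching sketch. It is not how vLLM or text-generation-inference are implemented (both use more sophisticated techniques such as continuous batching); the `Request` class, `make_batches` function, and both limits are hypothetical names I'm introducing for illustration. It greedily groups queued requests so that each batch stays under a request count and a total token budget.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int

    @property
    def token_budget(self) -> int:
        """Worst-case tokens this request can occupy on the GPU."""
        return self.prompt_tokens + self.max_new_tokens


def make_batches(pending: List[Request], max_batch_size: int,
                 max_batch_tokens: int) -> List[List[Request]]:
    """Group requests in arrival order without exceeding either limit.

    A request whose own budget exceeds max_batch_tokens still gets a
    batch to itself; a real server would reject or truncate it instead.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    current_tokens = 0
    for req in pending:
        batch_full = len(current) >= max_batch_size
        over_budget = current_tokens + req.token_budget > max_batch_tokens
        if current and (batch_full or over_budget):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req)
        current_tokens += req.token_budget
    if current:
        batches.append(current)
    return batches
```

Even this simple version shows the trade-off a serving layer has to manage: larger batches improve GPU utilization, but every request in a batch competes for the same memory.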

Fine-tuning expertise. Training loops, data preparation, evaluation metrics, hyperparameter tuning -- this isn't a junior task. It requires ML engineering experience.

Monitoring. LLM evaluation is still an unsolved problem. You need evaluation pipelines, user feedback loops, and quality monitoring. Without this, you're flying blind.
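
A first evaluation pipeline can be very small. The sketch below runs the same test cases through several models and reports an average score per model. The model callables are stand-ins: in practice one would wrap an API client and another your self-hosted Llama 2 endpoint. The `exact_match` scorer is a deliberately crude assumption; generative output usually needs a task-specific metric or human review.

```python
from typing import Callable, Dict, List, Tuple


def exact_match(output: str, expected: str) -> float:
    """Crude scorer: 1.0 if the texts match ignoring case and whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def evaluate(models: Dict[str, Callable[[str], str]],
             cases: List[Tuple[str, str]],
             score: Callable[[str, str], float] = exact_match) -> Dict[str, float]:
    """Return the mean score of each model over (prompt, expected) cases."""
    results: Dict[str, float] = {}
    for name, generate in models.items():
        scores = [score(generate(prompt), expected) for prompt, expected in cases]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results


if __name__ == "__main__":
    # Hypothetical stand-ins for real model clients.
    def always_paris(prompt: str) -> str:
        return "Paris"

    answers = {"Capital of France?": "Paris", "Capital of Spain?": "Madrid"}
    cases = list(answers.items())
    models = {"stub_a": always_paris, "stub_b": lambda p: answers[p]}
    print(evaluate(models, cases))
```

Keeping the test cases in version control and re-running them on every model or prompt change is the simplest way to catch quality regressions before users do.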

The team implication: Running your own LLM requires ML engineers or significant upskilling. For a 5-10 person startup, this is a real investment.

The Path I'd Take

Start with the API. Validate AI features before investing in infrastructure.
Evaluate privacy. If data can flow through a third-party API, stay there. If it can't, put Llama 2 on your roadmap.
Sandbox first. Run your use cases through Llama 2 and compare quality against GPT-3.5 before committing to production infrastructure.
Build incrementally. Start with a single GPU and a quantized 7B model for one feature. Expand from there.
Watch the ecosystem. Fine-tuned variants, quantization techniques, and serving tools are appearing weekly. What's hard today will be easier in six months.

The long-term trajectory is clear: LLMs are being commoditized. The competitive advantage won't be model access -- it'll be how you apply it to your domain, your data, and your users. Teams that understand deployment, fine-tuning, and evaluation will build better products than those who treat AI as a black-box API call.

At Conectia, we're seeing increasing demand from startups that need engineers who can bridge the gap between a downloaded model and a production feature. That gap -- not the model -- is where the engineering value now lives.
