Challenges

Meta Launches Llama 3: How Open-Source Is Changing What Your Engineering Team Needs to Know

By Marc Molas·April 22, 2024·10 min read

On April 18, Meta released Llama 3. Two models — 8B and 70B parameters — fully open and commercially licensed at no cost. According to Fortune, this release intensifies competition in a market that until recently was dominated by a handful of companies with closed models.

This isn't just a technical news item. It's a structural shift in how startups can build AI-powered products. I've been an engineer long enough to have seen this movie before — with Linux, and I'll get to that. If you have an engineering team — or you're building one — you need to understand what it means.

The License Matters More Than the Benchmarks

The numbers speak for themselves. Llama 3 70B outperforms Gemini Pro 1.5 and Claude 3 Sonnet on most public benchmarks. It was trained on 15 trillion tokens — seven times more than Llama 2. Model quality is no longer a valid argument for relying exclusively on proprietary APIs.

But the most important thing isn't the benchmarks. It's the license. Any company can download Llama 3, run it on its own infrastructure, and build commercial products on top of it without paying royalties or per-token fees.

A year ago, accessing a competitive language model required budget for OpenAI APIs or an enterprise agreement with Google. Today, the model is sitting on Hugging Face waiting for someone to download it.

Access Is No Longer the Bottleneck

This is where many founders get confused. They see the model is free and assume the cost of building AI features just dropped to zero. It hasn't.

The model is free. Deploying it, optimizing it, maintaining it, and running it in production is not. And that requires a type of engineering most teams don't have.

Think of the Linux analogy. Linux is free. It always has been. But the companies that actually get the most out of Linux are the ones with engineers who know how to configure servers, manage security, automate deployments, and scale infrastructure. Free software doesn't eliminate the need for talent — it transforms it.

The same thing is happening with Llama 3. The new bottleneck isn't the model. It's the engineer who knows how to put it in production.

The Skills Your Team Needs (and Probably Doesn't Have)

If you're considering using Llama 3 — or any open-source model — in your product, this is the skill stack you need to cover:

Model serving: tools like vLLM or Hugging Face's Text Generation Inference (TGI) for serving the model with acceptable latency and enough throughput for production.
Fine-tuning: techniques like LoRA and QLoRA let you adapt the model to your specific use case without needing hundreds of GPUs. But they require experience in data preparation, hyperparameters, and evaluation.
Evaluation pipelines: systematically measuring response quality. "Testing it by hand" isn't enough. You need metrics, evaluation datasets, and reproducible processes.
Inference optimization: quantization, dynamic batching, KV cache management. The difference between a deployment that costs 200 euros per month and one that costs 2,000 comes down to these details.
GPU infrastructure: selecting the right GPU (A100, L40S, T4), configuring the CUDA environment, managing memory, capacity planning. This isn't traditional DevOps.
Production monitoring: detecting model degradation, input data drift, anomalous latencies, silent failures. An AI model in production isn't a standard microservice — it needs specialized observability.

None of these skills are new on their own. But the combination of all of them in a startup team is. Until now, only large companies with dedicated ML teams needed this profile.

The Calculation You Should Be Making: API vs Self-Hosting

Not every use case justifies self-hosting. Here's a framework for deciding:

Third-party APIs (OpenAI, Anthropic, Google) make sense when:

Your volume is low (under 100K calls per month)
You don't need deep model customization
1-3 second latency is acceptable
You don't handle sensitive data that can't leave your infrastructure
You're validating the idea before investing in infrastructure

Self-hosting with Llama 3 makes sense when:

Your volume is high and per-token costs become prohibitive
You need fine-tuning for your specific domain
You have privacy or compliance requirements (GDPR, medical data, financial data)
You need full control over latency and availability
You want to avoid dependency on a provider that can change pricing or terms

The tipping point is usually volume. At 50,000 daily calls with long prompts, the monthly bill from an API can easily exceed 5,000-10,000 euros. A dedicated GPU running an optimized Llama 3 can serve the same volume for a fraction of that cost.

But — and this is key — the savings only materialize if you have the team that knows how to set it up and maintain it. If you rent a GPU and nobody on your team knows how to configure vLLM, you'll spend more, not less.

Why This Matters Especially for European Startups

The AI ecosystem in Europe has a peculiarity: many startups are building on top of APIs from American companies. That works until it doesn't — because prices go up, because GDPR complicates sending data to US servers, or because you need customization that a generic API doesn't offer.

Llama 3 opens a real door for European startups that want to build AI products with technological sovereignty. You can run the model on European servers, with European data, in full compliance with European regulations. No middlemen.

But the door only opens if you have engineers who know how to walk through it.

The Talent Exists — Just Not Where You're Looking

Here's the practical problem: engineers with ML infrastructure experience are scarce and expensive. In Western Europe, a senior ML engineer can cost between 90,000 and 150,000 euros per year. And they're not even easy to find — demand far exceeds supply.

LATAM has a growing pool of engineers with experience in this stack. Many have worked at American companies that already deploy open-source models in production. They have real hands-on experience with the tools, not just theoretical knowledge.

At Conectia, when a startup asks us for engineers for AI projects, we don't look for profiles who took a prompt engineering course. We look for engineers who have deployed models in production, who know the difference between serving an 8B and a 70B model, who understand when to quantize and when not to, who have set up real evaluation pipelines.

Every profile goes through a technical validation with a CTO — not a recruiter scanning buzzwords on a resume.

What You Should Do This Week

If you're building a product that uses — or will use — AI:

Download Llama 3 8B and try it out. You don't need an expensive GPU for the smaller model. Run it locally, understand its capabilities and limitations.
Run the cost calculation. Add up your current (or projected) spend on AI APIs. Compare it to the cost of self-hosting. Include the cost of the team that would maintain it.
Assess your team's skills. Does anyone know how to configure vLLM? Has anyone done fine-tuning? Do they have experience with GPU infrastructure? If the answer is "no" across the board, you need to bring in that profile.
Don't wait. The window of opportunity for open-source models is opening now. Startups that move fast will have a cost and flexibility advantage over those that keep relying exclusively on proprietary APIs.

The model is already free. Cloud infrastructure is accessible. The only thing missing is the team that connects both to your product.

Want to bring on engineers who know how to deploy open-source models in production? Talk to a CTO — we validate real ML infrastructure experience, not buzzwords.