Just like other “as-a-service” offerings, LLMs are following the trend, with multiple providers stepping up to simplify their use and integration. Running large language models on your own can be complex: they typically require GPUs and keeping up with constantly evolving libraries and models. Leveraging a provider that offers out-of-the-box inference can remove much of that incidental complexity, allowing you to focus on what’s essential for your use case.
Pricing comparison (July 2024)
Here’s a bird’s-eye view of some top LLM service providers and their pricing:
Provider | Llama 3 8B (per 1M tokens) | Mixtral 8x7B (per 1M tokens) | Pricing page
--- | --- | --- | ---
Anyscale | $0.15 | $0.15 | anyscale.com
Deep Infra | $0.06 | $0.24 | deepinfra.com
Fireworks | $0.20 | $0.50 | fireworks.ai
Groq | $0.08 | $0.24 | groq.com
Lepton AI | $0.07 | $0.50 | lepton.ai
Mistral AI | - | $0.70 | mistral.ai
Novita AI | $0.06 | - | novita.ai
OpenPipe | $0.45 | $1.40 | openpipe.ai
Together AI | $0.20 | - | together.ai
Note that the table above is not meant to be an exhaustive comparison of such providers; it also omits important information such as context size, inference speed, and model quantization.
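To get a feel for what these per-token prices mean in practice, here’s a minimal back-of-the-envelope sketch; the request volume, token counts, and price are hypothetical numbers chosen purely for illustration (and keep in mind that some providers price prompt and completion tokens differently):

# All numbers below are hypothetical and purely for illustration.
price_per_1m_tokens = 0.08      # e.g. a Llama 3 8B price from the table above
requests_per_day = 10_000       # hypothetical traffic
tokens_per_request = 500        # hypothetical average (prompt + completion)

monthly_tokens = requests_per_day * tokens_per_request * 30
monthly_cost = monthly_tokens / 1_000_000 * price_per_1m_tokens

print(f'~{monthly_tokens:,} tokens/month, roughly ${monthly_cost:.2f}/month')
# ~150,000,000 tokens/month, roughly $12.00/month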
Checking out Groq
Groq was by far the choice that piqued my interest the most: it’s unbelievably cheap and fast.
Groq was founded in 2016 by former Google employees who worked on Google’s Tensor Processing Unit (TPU). Leveraging their expertise in custom-designed chips for deep learning, they are now creating their own hardware, specifically what they call the “Language Processing Unit” (LPU):
LPU Inference Engines are designed to overcome the two bottlenecks for LLMs–the amount of compute and memory bandwidth. An LPU system has as much or more compute as a Graphics Processor (GPU) and reduces the amount of time per word calculated, allowing faster generation of text sequences. With no external memory bandwidth bottlenecks an LPU Inference Engine delivers orders of magnitude better performance than Graphics Processor.
Installing their official package:
pip install groq
Testing it out:
from groq import Groq

prompt = 'Tell me a joke'

# Replace the placeholder with your actual API key (see the note below on where to find it).
groq_client = Groq(
    api_key='GROQ_SECRET_GOES_HERE',
)

chat_completion = groq_client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': prompt,
        }
    ],
    model='llama3-8b-8192',
)

print(f'Prompt time: {chat_completion.usage.prompt_time}')
print(f'Prompt tokens: {chat_completion.usage.prompt_tokens}')
print(f'Completion time: {chat_completion.usage.completion_time}')
print(f'Completion tokens: {chat_completion.usage.completion_tokens}')
print(f'Model: {chat_completion.model}')
print(f'Prompt: {prompt}')
print(f'Response: {chat_completion.choices[0].message.content}')
Output:
Prompt time: 0.003391819
Prompt tokens: 14
Completion time: 0.013561327
Completion tokens: 18
Model: llama3-8b-8192
Prompt: Tell me a joke
Response: Why couldn't the bicycle stand up by itself?
Because it was two-tired!
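Those usage fields also make it easy to eyeball Groq’s generation speed; the values below are simply copied from the output above:

# Values copied from the completion output above.
completion_tokens = 18
completion_time = 0.013561327  # seconds

print(f'~{completion_tokens / completion_time:.0f} tokens/second')
# ~1327 tokens/second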
You can grab your API key here and check your usage here.
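Rather than hardcoding the secret as in the snippet above, you can read it from the environment; as far as I can tell the SDK also falls back to the GROQ_API_KEY environment variable when no api_key is passed, but double-check that against the official docs:

import os

from groq import Groq

# Assumes the key was exported beforehand, e.g. with `export GROQ_API_KEY=...`.
groq_client = Groq(api_key=os.environ['GROQ_API_KEY'])

# Or, if the SDK picks up GROQ_API_KEY on its own (verify in the docs):
# groq_client = Groq()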
Comparing providers
Given how early and disruptive LLMs are (and how high inflation is nowadays), the pricing comparison table in this article will quickly become outdated. LLM Explorer is a good resource for finding more providers, and OpenRouter not only lets you quickly compare prices but also offers an API for routing between models and providers on the fly.
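To illustrate the routing idea, OpenRouter exposes an OpenAI-compatible endpoint, so a sketch along these lines should work with the openai Python package; the base URL and model slug below are my assumptions, so verify them against OpenRouter’s documentation:

from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model identifier; check openrouter.ai
# for the current base URL and the exact model slugs available.
client = OpenAI(
    base_url='https://openrouter.ai/api/v1',
    api_key='OPENROUTER_SECRET_GOES_HERE',
)

chat_completion = client.chat.completions.create(
    model='meta-llama/llama-3-8b-instruct',
    messages=[{'role': 'user', 'content': 'Tell me a joke'}],
)

print(chat_completion.choices[0].message.content)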