Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.rkat.ai/llms.txt

Use this file to discover all available pages before exploring further.

Self-hosted models are configured as first-class Meerkat model IDs. Once an alias such as gemma-4-31b is registered, users can pass it anywhere they would pass a hosted model:
rkat run -m gemma-4-31b "Summarize this repository"
rkat run --resume -m gemma-4-e4b "Keep going, but use the faster local model"
rkat models
rkat doctor
This guide covers the general self-hosting contract and the Gemma 4 worked example. There is no separate Gemma page: Gemma 4 is one family under the same self-hosted model system as Ollama, LM Studio, vLLM, or another private OpenAI-compatible endpoint.

Model

Every self-hosted setup has two pieces:
Config sectionOwns
self_hosted.servers.<server_id>Transport, base URL, API style, and credentials for one serving endpoint
self_hosted.models.<alias>The Meerkat-facing model ID, display metadata, capabilities, and upstream remote_model
The alias is the name users type. The remote_model is the model name returned or expected by the serving stack.
Prefer bearer_token_env over bearer_token so secrets stay out of checked-in configuration.

Server Shape

For Ollama, LM Studio, vLLM, and most private gateways, use the OpenAI-compatible transport:
config.toml
[self_hosted.servers.local]
transport = "openai_compatible"
base_url = "http://127.0.0.1:11434"
api_style = "chat_completions"
bearer_token_env = "LOCAL_LLM_TOKEN"
api_style = "chat_completions" is the conservative default for Gemma 4 and other self-hosted tool-calling models. Use another API style only after the specific serving stack has been validated with Meerkat tools, structured output, and multimodal input.

Alias Shape

config.toml
[self_hosted.models.gemma-4-31b]
server = "local"
remote_model = "gemma4:31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600
supports_thinking and supports_reasoning describe the behavior Meerkat should expose through the configured transport. Gemma 4 is reasoning-capable, but normalized reasoning controls and trace streaming still vary by serving stack, so validate the behavior you plan to rely on.
Meerkat’s current self-hosted path does not expose self-hosted Gemma audio or a self-hosted realtime transport. Treat these aliases as text, image-input, and tool-capable models unless a dedicated self-hosted realtime path is documented.

Gemma 4 Aliases

Recommended aliases:
AliasGood default use
gemma-4-e2bLowest-footprint local experiments
gemma-4-e4bFast local iteration with more headroom
gemma-4-26b-a4bStronger quality on a serious local or remote GPU setup
gemma-4-31bBest quality of the four, usually best on a dedicated server

Ollama

Use Ollama when the model runs on the same machine as Meerkat and you want the lightest local setup.
1

Serve the model

ollama pull gemma4:31b
ollama list
2

Register Ollama

config.toml
[self_hosted.servers.ollama]
transport = "openai_compatible"
base_url = "http://127.0.0.1:11434"
api_style = "chat_completions"
3

Add aliases

config.toml
[self_hosted.models.gemma-4-e2b]
server = "ollama"
remote_model = "gemma4:e2b"
display_name = "Gemma 4 E2B"
family = "gemma-4"
tier = "supported"
context_window = 128000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true

[self_hosted.models.gemma-4-31b]
server = "ollama"
remote_model = "gemma4:31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600

LM Studio

Use LM Studio when you want a desktop-managed OpenAI-compatible server.
1

Start the local server

Load the Gemma 4 model in LM Studio, then start the local server.
2

Register LM Studio

config.toml
[self_hosted.servers.lmstudio]
transport = "openai_compatible"
base_url = "http://127.0.0.1:1234"
api_style = "chat_completions"
3

Alias the served model

Use the model name LM Studio exposes in its /v1/models output.
config.toml
[self_hosted.models.gemma-4-e4b]
server = "lmstudio"
remote_model = "google/gemma-4-e4b"
display_name = "Gemma 4 E4B"
family = "gemma-4"
tier = "supported"
context_window = 128000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true

vLLM

Use vLLM when you want a private server with more deployment control.
1

Launch vLLM

Start vLLM with the Gemma 4 model you want to expose.
2

Register the server

config.toml
[self_hosted.servers.vllm]
transport = "openai_compatible"
base_url = "http://my-gpu-box:8000"
api_style = "chat_completions"
bearer_token_env = "VLLM_API_TOKEN"
3

Add remote aliases

Point aliases at the exact model names your vLLM endpoint exposes.
config.toml
[self_hosted.models.gemma-4-26b-a4b]
server = "vllm"
remote_model = "google/gemma-4-26b-a4b"
display_name = "Gemma 4 26B A4B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600

[self_hosted.models.gemma-4-31b]
server = "vllm"
remote_model = "google/gemma-4-31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600

Validation

Run these after adding or editing self-hosted model config:
rkat models
rkat doctor
rkat run -m gemma-4-31b "Say hello in one sentence."
The expected result:
  • rkat models shows a self_hosted provider group and the aliases you added.
  • rkat doctor reports the server as reachable.
  • rkat run -m ... works without an explicit --provider.
  • The alias points at the exact upstream remote_model exposed by the server.

See Also

Providers

Hosted and self-hosted provider model.

CLI configuration

Config file locations and model settings.

CLI commands

Commands for running, diagnosing, and inspecting models.