Meerkat can now treat configured self-hosted models as first-class model IDs. Once you register an alias such as gemma-4-31b, you can use it anywhere you would use gpt-5.4 or gemini-3.1-pro-preview. This guide uses the four Gemma 4 aliases we recommend for Meerkat:
  • gemma-4-e2b
  • gemma-4-e4b
  • gemma-4-26b-a4b
  • gemma-4-31b
Gemma 4 is a reasoning-capable family with native system role support and native function calling. All four variants fit well into Meerkat’s current self-hosted text, image, and tool-calling path. For Meerkat, the safest OpenAI-compatible serving default today is chat_completions, because that is the path most clearly documented for Gemma 4 tool calling across the serving stacks below.
Upstream Gemma capability tables may advertise audio on some variants, but Meerkat’s current self-hosted path does not surface self-hosted Gemma audio as a realtime/audio capability. Treat the current integration as text + image + tools unless and until a dedicated self-hosted realtime/audio path is documented.
Treat supports_thinking and supports_reasoning as transport-facing settings, not raw model facts. Gemma 4 itself is reasoning-capable, but normalized reasoning controls and trace streaming still vary by serving stack.
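Under the hood, the chat_completions path is just a standard OpenAI-style request body. As a rough sketch (the payload shape is the OpenAI Chat Completions format; the model name and tool definition below are illustrative examples, not values Meerkat requires):

```python
import json

# Illustrative Chat Completions request body for a Gemma 4 alias routed
# through an OpenAI-compatible server. The tool schema is a made-up
# example; Meerkat supplies real tool schemas at run time.
payload = {
    "model": "gemma4:31b",  # the remote_model, not the Meerkat alias
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "List the files in src/."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "list_files",
                "description": "List files in a directory",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }
    ],
}

# Serialize the body as it would be POSTed to /v1/chat/completions.
body = json.dumps(payload)
print(payload["model"])
```

Gemma 4's native system role and native function calling are what make this plain request shape work without prompt-level workarounds.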

What you configure

Every self-hosted setup has two pieces:
  1. A server definition under self_hosted.servers.<server_id>
  2. One or more model aliases under self_hosted.models.<alias>
The alias is the model name users type in Meerkat. The remote_model is the model identifier your serving stack exposes upstream.
Prefer bearer_token_env over bearer_token so secrets stay out of checked-in config files.
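Concretely, the two forms differ only in which key you set; a sketch (the token value and variable name here are placeholders):

```toml
# Avoid: the secret lives in the checked-in file
# bearer_token = "sk-local-example"

# Prefer: Meerkat resolves the token from the environment at run time
bearer_token_env = "LOCAL_LLM_TOKEN"
```

Then set the variable in the environment that runs Meerkat, for example via your shell profile.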

Common Meerkat config

This is the shared shape Meerkat expects regardless of serving stack:
config.toml
[self_hosted.servers.local]
transport = "openai_compatible"
base_url = "http://127.0.0.1:11434"
api_style = "chat_completions"
bearer_token_env = "LOCAL_LLM_TOKEN"

[self_hosted.models.gemma-4-31b]
server = "local"
remote_model = "gemma4:31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600

Quick use

Once the alias is present in your active realm config:
rkat run -m gemma-4-31b "Summarize the architecture of this repository"
rkat run --resume -m gemma-4-e4b "Keep going, but faster"
rkat models
rkat doctor

Ollama

Use Ollama when the model runs on the same machine as Meerkat and you want the lightest setup.
Step 1: Serve Gemma 4

Pull the model you want and make sure Ollama is running.
ollama pull gemma4:31b
ollama list
Step 2: Register the server

Prefer chat_completions mode for Gemma 4 agent flows.
config.toml
[self_hosted.servers.ollama]
transport = "openai_compatible"
base_url = "http://127.0.0.1:11434"
api_style = "chat_completions"
Step 3: Add aliases

Map friendly Meerkat aliases to Ollama model IDs.
config.toml
[self_hosted.models.gemma-4-e2b]
server = "ollama"
remote_model = "gemma4:e2b"
display_name = "Gemma 4 E2B"
family = "gemma-4"
tier = "supported"
context_window = 128000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true

[self_hosted.models.gemma-4-31b]
server = "ollama"
remote_model = "gemma4:31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600

LM Studio

Use LM Studio when you want a desktop app with a local OpenAI-compatible server.
Step 1: Start the local server

Load your Gemma 4 model in LM Studio, then start the local server.
Step 2: Register LM Studio

Prefer chat_completions for Gemma 4 tool use and structured output.
config.toml
[self_hosted.servers.lmstudio]
transport = "openai_compatible"
base_url = "http://127.0.0.1:1234"
api_style = "chat_completions"
Step 3: Alias the served model

Use the model name LM Studio exposes in its /v1/models output.
config.toml
[self_hosted.models.gemma-4-e4b]
server = "lmstudio"
remote_model = "google/gemma-4-e4b"
display_name = "Gemma 4 E4B"
family = "gemma-4"
tier = "supported"
context_window = 128000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
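To find the exact name to put in remote_model, list the server's models. A minimal sketch of reading that listing (the response shape is the standard OpenAI /v1/models format; the JSON body below is a canned sample rather than a live response from http://127.0.0.1:1234/v1/models):

```python
import json

# Canned sample of an OpenAI-compatible /v1/models response, as an
# LM Studio-style server might return it.
sample = '{"object": "list", "data": [{"id": "google/gemma-4-e4b", "object": "model"}]}'

# The "id" fields are the values to use as remote_model.
served_ids = [m["id"] for m in json.loads(sample)["data"]]
print(served_ids)
```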

vLLM

Use vLLM when you want a self-managed server and the most control over deployment.
Step 1: Launch vLLM

Start vLLM with the Gemma 4 model you want to expose.
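As a sketch, one way to do that with vLLM's OpenAI-compatible server (the model name and flags are illustrative; check your vLLM version's documentation for the exact options):

```shell
# Serve the model over an OpenAI-compatible API on port 8000.
# The --api-key value should match the token Meerkat reads from
# VLLM_API_TOKEN; the model name mirrors the alias examples below.
vllm serve google/gemma-4-31b \
  --port 8000 \
  --api-key "$VLLM_API_TOKEN"
```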
Step 2: Register the server

Use chat_completions mode for vLLM.
config.toml
[self_hosted.servers.vllm]
transport = "openai_compatible"
base_url = "http://my-gpu-box:8000"
api_style = "chat_completions"
bearer_token_env = "VLLM_API_TOKEN"
Step 3: Add remote aliases

Point aliases at the exact model name your vLLM endpoint exposes.
config.toml
[self_hosted.models.gemma-4-26b-a4b]
server = "vllm"
remote_model = "google/gemma-4-26b-a4b"
display_name = "Gemma 4 26B A4B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600

[self_hosted.models.gemma-4-31b]
server = "vllm"
remote_model = "google/gemma-4-31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600

Reasoning and transport notes

  • chat_completions is the recommended Gemma 4 default for Ollama, LM Studio, and vLLM.
  • Text, image, and tool-calling requests fit well through OpenAI-compatible APIs today.
  • Reasoning traces and reasoning controls are less uniform. Some servers expose Gemma 4 thinking through Gemma-specific mechanisms rather than OpenAI-native reasoning events, so validate the exact behavior you need before depending on it in production.
  • If you only need chat, coding, tools, and image input, OpenAI-compatible serving is a good fit for Gemma 4.

Choosing a Gemma 4 size

Alias              Good default use
gemma-4-e2b        Lowest-footprint local experiments
gemma-4-e4b        Fast local iteration with a bit more headroom
gemma-4-26b-a4b    Stronger quality on a serious local or remote GPU setup
gemma-4-31b        Best quality of the four, usually best on a dedicated server
Upstream capability tables may differentiate the smaller variants on audio, but Meerkat’s current self-hosted integration treats all four aliases as text/image/tool-capable models rather than a separate self-hosted audio transport surface.

Validation checklist

  • rkat models shows a self_hosted provider group and your aliases
  • rkat doctor reports the server as reachable
  • rkat run -m gemma-4-31b "say hello" works without --provider
  • The alias you configured matches the upstream remote_model shown by /v1/models
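The last check can be automated: every configured remote_model should appear among the ids the server reports. A sketch of that cross-check (both structures below are canned samples standing in for your parsed config and the server's /v1/models response):

```python
# Sample alias-to-remote_model mapping, as parsed from config.toml.
configured = {
    "gemma-4-e4b": "google/gemma-4-e4b",
    "gemma-4-31b": "google/gemma-4-31b",
}

# Sample set of model ids reported by the server's /v1/models endpoint.
served_ids = {"google/gemma-4-e4b", "google/gemma-4-31b"}

# Any alias whose remote_model the server does not serve is a misconfig.
missing = {alias: remote for alias, remote in configured.items()
           if remote not in served_ids}
print(missing)
```

An empty result means every alias maps to a model the server actually serves.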

See also