Meerkat can now treat configured self-hosted models as first-class model IDs. Once you register an alias such as gemma-4-31b, you can use it anywhere you would use gpt-5.4 or gemini-3.1-pro-preview.
This guide uses the four Gemma 4 aliases we recommend for Meerkat:
- gemma-4-e2b
- gemma-4-e4b
- gemma-4-26b-a4b
- gemma-4-31b
Gemma 4 is a reasoning-capable family with native system role support and native function calling. All four variants fit well into Meerkat’s current self-hosted text, image, and tool-calling path. For Meerkat, the safest OpenAI-compatible serving default today is chat_completions, because that is the path most clearly documented for Gemma 4 tool calling across the serving stacks below.
Upstream Gemma capability tables may advertise audio on some variants, but Meerkat’s current self-hosted path does not surface self-hosted Gemma audio as a realtime/audio capability. Treat the current integration as text + image + tools unless and until a dedicated self-hosted realtime/audio path is documented.
Treat supports_thinking and supports_reasoning as transport-facing settings, not raw model facts. Gemma 4 itself is reasoning-capable, but normalized reasoning controls and trace streaming still vary by serving stack.
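If you have not yet verified how a given serving stack surfaces Gemma 4 reasoning, a conservative starting point is to leave the transport-facing flags off on the alias and enable them only after checking behavior end to end. A minimal sketch, assuming the alias and server names from the examples below:

```toml
# Hypothetical alias: reasoning controls start disabled; flip them
# to true once the serving stack's trace behavior is verified.
[self_hosted.models.gemma-4-e4b]
server = "local"
remote_model = "gemma4:e4b"
supports_thinking = false
supports_reasoning = false
```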
Every self-hosted setup has two pieces:
- A server definition under self_hosted.servers.<server_id>
- One or more model aliases under self_hosted.models.<alias>
The alias is the model name users type. The remote_model is whatever your serving stack exposes upstream.
Prefer bearer_token_env over bearer_token so secrets stay out of checked-in config files.
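For example, the token named by bearer_token_env can be supplied through the environment rather than a committed file. The variable name below matches the sample config; the token value is a placeholder:

```shell
# Export the secret in your shell profile or inject it from a
# secret manager; never commit the raw token to the config file.
export LOCAL_LLM_TOKEN="replace-with-your-token"
```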
Common Meerkat config
This is the shared shape Meerkat expects regardless of serving stack:
```toml
[self_hosted.servers.local]
transport = "openai_compatible"
base_url = "http://127.0.0.1:11434"
api_style = "chat_completions"
bearer_token_env = "LOCAL_LLM_TOKEN"

[self_hosted.models.gemma-4-31b]
server = "local"
remote_model = "gemma4:31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600
```
Quick use
Once the alias is present in your active realm config:
```shell
rkat run -m gemma-4-31b "Summarize the architecture of this repository"
rkat run --resume -m gemma-4-e4b "Keep going, but faster"
rkat models
rkat doctor
```
Ollama
Use Ollama when the model runs on the same machine as Meerkat and you want the lightest setup.
Serve Gemma 4
Pull the model you want and make sure Ollama is running.

```shell
ollama pull gemma4:31b
ollama list
```
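Before registering the server, you can sanity-check that Ollama's OpenAI-compatible endpoint is reachable and see the model IDs it exposes (Ollama serves an OpenAI-style API under /v1 on its default port); the "id" values in the response are what remote_model must match:

```shell
# List models through Ollama's OpenAI-compatible API.
# Assumes Ollama is running locally on its default port.
curl -s http://127.0.0.1:11434/v1/models
```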
Register the server
Prefer chat_completions mode for Gemma 4 agent flows.

```toml
[self_hosted.servers.ollama]
transport = "openai_compatible"
base_url = "http://127.0.0.1:11434"
api_style = "chat_completions"
```
Add aliases
Map friendly Meerkat aliases to Ollama model IDs.

```toml
[self_hosted.models.gemma-4-e2b]
server = "ollama"
remote_model = "gemma4:e2b"
display_name = "Gemma 4 E2B"
family = "gemma-4"
tier = "supported"
context_window = 128000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true

[self_hosted.models.gemma-4-31b]
server = "ollama"
remote_model = "gemma4:31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600
```
LM Studio
Use LM Studio when you want a desktop app with a local OpenAI-compatible server.
Start the local server
Load your Gemma 4 model in LM Studio, then start the local server.
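Once the server is running, you can confirm the exact model ID to use as remote_model. LM Studio's local server listens on port 1234 by default and exposes an OpenAI-style /v1/models endpoint:

```shell
# List the model IDs LM Studio's local server exposes.
# Assumes the server is running on its default port.
curl -s http://127.0.0.1:1234/v1/models
```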
Register LM Studio
Prefer chat_completions for Gemma 4 tool use and structured output.

```toml
[self_hosted.servers.lmstudio]
transport = "openai_compatible"
base_url = "http://127.0.0.1:1234"
api_style = "chat_completions"
```
Alias the served model
Use the model name LM Studio exposes in its /v1/models output.

```toml
[self_hosted.models.gemma-4-e4b]
server = "lmstudio"
remote_model = "google/gemma-4-e4b"
display_name = "Gemma 4 E4B"
family = "gemma-4"
tier = "supported"
context_window = 128000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
```
vLLM
Use vLLM when you want a self-managed server and the most control over deployment.
Launch vLLM
Start vLLM with the Gemma 4 model you want to expose.
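A minimal launch sketch, assuming the vllm CLI is installed and the model name matches the alias examples below; the port and token are placeholders you should adapt to your deployment:

```shell
# Serve the model over vLLM's OpenAI-compatible API on port 8000.
# --api-key makes vLLM require the bearer token Meerkat will send.
vllm serve google/gemma-4-31b \
  --port 8000 \
  --api-key "$VLLM_API_TOKEN"
```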
Register the server
Use chat_completions mode for vLLM.

```toml
[self_hosted.servers.vllm]
transport = "openai_compatible"
base_url = "http://my-gpu-box:8000"
api_style = "chat_completions"
bearer_token_env = "VLLM_API_TOKEN"
```
Add remote aliases
Point aliases at the exact model name your vLLM endpoint exposes.

```toml
[self_hosted.models.gemma-4-26b-a4b]
server = "vllm"
remote_model = "google/gemma-4-26b-a4b"
display_name = "Gemma 4 26B A4B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600

[self_hosted.models.gemma-4-31b]
server = "vllm"
remote_model = "google/gemma-4-31b"
display_name = "Gemma 4 31B"
family = "gemma-4"
tier = "supported"
context_window = 256000
max_output_tokens = 8192
vision = true
image_tool_results = true
inline_video = false
supports_temperature = true
supports_thinking = true
supports_reasoning = true
call_timeout_secs = 600
```
Reasoning and transport notes
chat_completions is the recommended Gemma 4 default for Ollama, LM Studio, and vLLM.
- Text, image, and tool-calling requests fit well through OpenAI-compatible APIs today.
- Reasoning traces and reasoning controls are less uniform. Some servers expose Gemma 4 thinking through Gemma-specific mechanisms rather than OpenAI-native reasoning events, so validate the exact behavior you need before depending on it in production.
- If you only need chat, coding, tools, and image input, OpenAI-compatible serving is a good fit for Gemma 4.
Choosing a Gemma 4 size
| Alias | Good default use |
|---|---|
| gemma-4-e2b | Lowest-footprint local experiments |
| gemma-4-e4b | Fast local iteration with a bit more headroom |
| gemma-4-26b-a4b | Stronger quality on a serious local or remote GPU setup |
| gemma-4-31b | Best quality of the four, usually best on a dedicated server |
Upstream capability tables may differentiate the smaller variants on audio, but Meerkat’s current self-hosted integration treats all four aliases as text/image/tool-capable models rather than a separate self-hosted audio transport surface.
Validation checklist
- rkat models shows a self_hosted provider group and your aliases
- rkat doctor reports the server as reachable
- rkat run -m gemma-4-31b "say hello" works without --provider
- The alias you configured matches the upstream remote_model shown by /v1/models
See also