Kimi-K2 API

Use Kimi K2, a state-of-the-art mixture-of-experts (MoE) agentic intelligence model with 1T parameters, a 128K context window, and native tool calling, via the Kimi-K2 API at Together.ai!

Model details

Architecture Overview:

  • 1T-parameter MoE with 32B activated parameters
  • Hybrid MoE sparsity for compute efficiency
  • 128K token context for deep document comprehension
  • Agentic design with native tool usage & CLI integration

Training Methodology:

  • Pre-trained on 15.5T tokens using the MuonClip optimizer for stability
  • Zero-instability training at large scale

Performance Characteristics:

  • SOTA on LiveCodeBench v6, AIME 2025, MMLU-Redux, and SWE-bench (agentic)

Prompting Kimi-K2-Instruct

  • Use natural language instructions or tool commands
  • Temperature ≈ 0.6: Calibrated to Kimi‑K2‑Instruct’s RLHF alignment curve; higher values yield verbosity.
  • Kimi K2 autonomously invokes tools to fulfill tasks: Pass a JSON schema in tools=[…]; set tool_choice="auto". Kimi decides when/what to call.
  • Supports multi-turn dialogues & chained workflows: Because the model is “agentic”, give a high‑level objective (“Analyse this CSV and write a report”), letting it orchestrate sub‑tasks.
  • Chunk very long contexts: the 128K window is large, but response speed drops on inputs over ~100K tokens; supply a short executive brief in the final user message to focus the model.

Kimi K2 QuickStart Guide

Overview

Kimi K2 is a cutting-edge Mixture-of-Experts (MoE) language model developed by Moonshot AI. Boasting a total of 1 trillion parameters (with 32B active during inference), it stands out as one of the most powerful open-source models available for non-reasoning tasks.

Trained on 15.5 trillion tokens and equipped with a 128k token context window, Kimi K2 excels in agentic workflows, programming, reasoning, and tool use. Despite its massive scale, its efficient activation of just 32 billion parameters delivers near-frontier performance with significantly lower computational requirements compared to dense models.

This guide covers key use cases, setup instructions, best practices, and prompting strategies to help you maximize your experience with Kimi K2.


Getting Started with Kimi K2

You can start using Kimi K2 in just 10 lines of code.

Model ID: moonshotai/Kimi-K2-Instruct

Pricing: $1.00 per million input tokens, $3.00 per million output tokens
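Given those rates, a back-of-the-envelope cost estimate is straightforward (a small sketch; the token counts are illustrative):

```python
# Rates from the pricing above: $1.00 / 1M input tokens, $3.00 / 1M output tokens.
INPUT_PER_M = 1.00
OUTPUT_PER_M = 3.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# e.g. a 100K-token document plus a 2K-token summary:
print(f"${estimate_cost(100_000, 2_000):.3f}")  # → $0.106
```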

Example (Python):

from together import Together

client = Together()

# Stream the response token-by-token.
stream = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "Code a hacker news clone"}],
    stream=True,
)

for chunk in stream:
    # Guard against empty/None deltas (e.g., the final chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Best Use Cases

Kimi K2 excels in autonomous and complex problem-solving scenarios, particularly those involving code and tools:

  • Agentic Workflows: Automate multi-step tasks like flight booking, data analysis, and research using APIs and tools.
  • Coding & Debugging: Ideal for software engineering challenges (e.g., SWE-bench), generating code patches, or fixing bugs.
  • Technical Research & Reporting: Summarize long technical documents, generate reports, or analyze trends using its extended context window.
  • STEM Problem-Solving: Handles advanced math (e.g., AIME, MATH), logic puzzles (e.g., ZebraLogic), and scientific reasoning.
  • Tool Integration: Build AI agents that dynamically interact with APIs, such as databases or weather services.

Prompting Tips

  • Use a simple system prompt: "You are Kimi, an AI assistant created by Moonshot AI." This matches the prompt used during instruction tuning.
  • Set temperature ≈ 0.6: optimized for RLHF alignment; higher values increase verbosity.
  • Use native tool calling: pass a JSON schema in tools=[…] and set tool_choice="auto"; Kimi chooses when and what tool to use.
  • Focus on goals, not steps: provide high-level objectives (e.g., "Analyze this CSV and write a report") and let the model plan the execution.
  • Chunk long contexts smartly: although a 128k context is supported, processing over 100k tokens can slow response time. Provide a concise executive summary in your prompt.

Most of these insights are sourced from the Kimi K2 GitHub repository.
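The chunking tip can be sketched as a simple paragraph-aware splitter. The ~4 characters-per-token heuristic is an assumption for illustration, not Kimi K2's actual tokenizer:

```python
def chunk_text(text: str, max_tokens: int = 100_000, chars_per_token: int = 4):
    """Split text into pieces that stay under an approximate token budget.

    Uses a rough ~4 chars/token heuristic (an assumption, not the real
    tokenizer) and splits on paragraph boundaries where possible.
    """
    budget = max_tokens * chars_per_token  # budget in characters
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > budget:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be summarized separately, with the per-chunk summaries combined in a final request, keeping every call well under the slow >100k-token range.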


Known Limitations

While Kimi K2 is highly capable, there are scenarios where it may not be the ideal choice:

  • Latency-sensitive applications: Tasks like real-time voice agents may suffer due to the model’s slower inference speed.
  • Fast summarization: Though capable of processing long documents (up to 128k tokens), performance may lag when immediate output is needed. For async summarization tasks, however, Kimi K2 is a strong contender.