Bosun v1.1: Open-Source Knowledge Graph Reranker (2026)

Bosun v1.1 is an Apache 2.0 open-source judge model from Hanno Labs that evaluates whether edges in agent knowledge graphs are warranted. Bosun launched on June 11, 2026, and v1.1 followed about a week later. Available in 4B and 0.6B (XS) sizes as LoRA fine-tunes of Qwen3-Reranker, it scores 0.91 on PAWS (Bosun-4B), ahead of Gemini 3.1 Flash Lite (0.81), 0.945 on WarrantBench steerability, and places first on FollowIR (+17.9 p-MRR). Both sizes are free to download and run locally via GGUF builds or PEFT adapters.

Provider: Hanno Labs · Family: Bosun

Context window: 8,192 tokens

Input modalities: text · Output: text

About Bosun

Bosun is a programmable judge model released June 11, 2026 by Hanno Labs, a small AI research lab focused on causal intelligence and knowledge systems. It is available in two sizes, 0.6B (Bosun-XS) and 4B (Bosun-4B); both are LoRA fine-tunes of Qwen3-Reranker, trained to evaluate whether connections in an agent's knowledge graph are warranted: supported by evidence, non-redundant, and still factually current. Unlike general-purpose rerankers that score document relevance against a fixed query, Bosun is programmed at inference time by a natural-language instruction defining the specific rule to apply, so it can be reprogrammed per batch without fine-tuning. Version 1.1 expanded the training data and added support for directional and typed-edge judgment, enabling the model to assess asymmetric relationships like supersession, dependency, support, and contradiction.

On the PAWS adversarial paraphrase benchmark, Bosun-4B v1.1 scores 0.91, compared to Gemini 3.1 Flash Lite at 0.81 and a blind baseline near 0.53. On FollowIR, which measures how well models adjust retrieval behavior based on per-query instructions, Bosun-4B achieves +17.9 in preference-MRR (p-MRR), placing first on that leaderboard; most standard retrievers score near or below 0 because they ignore the instruction signal entirely. On WarrantBench steerability -- the ability to flip a judgment when the instruction is negated -- Bosun-4B scores 0.945 and Bosun-XS scores 0.94, well ahead of Gemini 3.1 Flash Lite (0.575). On e-CARE causal direction, Bosun-4B reaches 0.85, just behind Gemini 3.1 Flash Lite (0.86). The one area where Bosun clearly trails frontier LLMs is ANLI adversarial NLI: 0.57 for the 4B and 0.44 for XS, versus Gemini 3.1 Flash Lite's 0.74, a known gap for smaller specialized models.

Both Bosun variants process input via the Qwen3-Reranker template: an instruction block defining the rule, a fixed query string, and a document block containing the two findings being compared.
The model outputs a single calibrated probability P = sigmoid(logit_yes - logit_no) per pair. Effective max sequence length is approximately 8,192 tokens per pair, making Bosun well suited for paragraph-length findings but not multi-page documents. Long-context recall in the traditional sense does not apply since the model produces a scalar output, not a sequence.

Bosun is a text-in, score-out model. It accepts a text instruction and two text findings, and outputs a float in the range 0 to 1. It does not support vision, audio, structured JSON generation, or chat interaction. The instruction-following capability is the key differentiator: the model can be reprogrammed with any natural-language rule per batch, reversing its judgment when the rule is negated (0.97 negation accuracy for XS) and generalizing to rules never seen during training (0.95 novel-rule accuracy). Version 1.1 added directional typed-edge judgment -- supersession, depends-on, supports, and contradicts -- which v1.0 could not distinguish from symmetric co-occurrence.

Both Bosun-XS and Bosun-4B are released under Apache 2.0 and are free to download and self-host with no API fee and no rate limit. The only cost is the infrastructure you run inference on. GGUF quantizations are available at Hanno-Labs/bosun-xs-GGUF and Hanno-Labs/bosun-4b-GGUF for CPU and edge deployment. A Q4_K_M quantization of Bosun-XS requires approximately 0.5 GB RAM; the 4B Q4_K_M needs roughly 2.5 GB. On a modern CPU, Bosun-XS can evaluate thousands of graph edges per minute.

Both models are available on HuggingFace as LoRA adapters (adapter_model.safetensors and adapter_config.json) loadable via PEFT on top of the Qwen3-Reranker base. GGUF quantizations (f16, Q8_0, Q4_K_M) support llama.cpp, Ollama, and other local inference runtimes. The serving.json file in each repository contains the exact prompt template, yes and no token IDs, and max sequence length for fully reproducible inference.
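
The scoring rule stated above, P = sigmoid(logit_yes - logit_no), can be sketched in a few lines. This is a minimal illustration, not Hanno Labs' reference implementation: the function names are hypothetical, and the yes and no token IDs are taken as parameters on the assumption that you read them from serving.json, whose exact layout is not described here.

```python
import math
from typing import Sequence


def bosun_score(logit_yes: float, logit_no: float) -> float:
    """Return P = sigmoid(logit_yes - logit_no)."""
    margin = logit_yes - logit_no
    # Numerically stable sigmoid: never call exp() on a large positive number.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def score_from_logits(next_token_logits: Sequence[float],
                      yes_id: int, no_id: int) -> float:
    """Pick the yes/no logits out of a full next-token logit vector.

    yes_id and no_id are assumed to come from serving.json.
    """
    return bosun_score(next_token_logits[yes_id], next_token_logits[no_id])


if __name__ == "__main__":
    # Toy logit vector; indices 1 and 2 stand in for the yes/no token IDs.
    print(score_from_logits([0.1, 3.0, 1.0], yes_id=1, no_id=2))
```

The sigmoid of the logit difference is the same value as a softmax restricted to the two candidate tokens, which is why only the yes and no logits are needed and the rest of the vocabulary can be ignored.
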
There is no official hosted API, no Bedrock or Vertex deployment, and no official SDK beyond the PEFT-based reference implementation.

As a specialized scoring model, Bosun does not produce natural-language output, so it is not subject to the same jailbreak or prompt injection risks as chat models: it emits only a calibrated float, and its attack surface is narrow. Hanno Labs has not published a system card, red-teaming report, or safety benchmark for Bosun. The Apache 2.0 license places no restrictions on use case, so deployers are entirely responsible for ensuring the judgment rules they supply are appropriate for their application.

Bosun is a strong fit for teams building agent memory systems with knowledge graphs, where it acts as a fast, cheap pruning layer that removes stale, redundant, or unsupported edges. RAG pipelines that need instruction-tuned filtering -- where the acceptance rule changes per query -- benefit from Bosun's runtime reprogrammability over static embedding similarity. Teams that should not use Bosun include those who need a generative chat model, those whose pairs exceed the roughly 8,192-token limit, those needing vision or audio, and those for whom ANLI-class adversarial NLI robustness is critical.

Because Bosun is self-hosted, data governance is entirely under the deployer's control. No data leaves the premises unless the operator explicitly sends it. Hanno Labs has not disclosed SOC 2, ISO 27001, HIPAA, or GDPR certifications for Bosun; for regulated deployments, the user organization's own compliance posture applies.

Training data for v1.1 included DialAM-2024 argument edges, NLI datasets, PAWS paraphrase pairs, e-CARE and COPA causal data, deduplicated hard-negatives, completeness examples, and synthetic directional data.

Bosun launched as v1.0 on June 11, 2026 with WarrantBench steerability of 0.935 for XS.
Version 1.1, released approximately one week later, expanded the training blend and added directional typed-edge judgment, so the model can now reason about asymmetric relationships that v1.0 could not distinguish. GGUF builds were added at the same time for local CPU deployment. The WarrantBench dataset is available at github.com/Hanno-Labs/warrantbench. The step from v1.0 to v1.1 about a week after launch suggests active development continues.

Pricing

Apache 2.0 open-source model, free to download and self-host. No commercial API. Infrastructure cost only: roughly 0.5 GB RAM for Bosun-XS and 2.5 GB RAM for Bosun-4B at Q4_K_M quantization; the larger Q8_0 and f16 builds need more.

Frequently Asked Questions

What is Bosun v1.1 and who built it?

Bosun v1.1 is the current version of Bosun, a programmable judge model from Hanno Labs, an AI research lab focused on causal intelligence and knowledge systems. Bosun launched as v1.0 on June 11, 2026, and v1.1 followed about a week later. It is a LoRA fine-tune of Qwen3-Reranker, available in two sizes: 0.6B (Bosun-XS) and 4B (Bosun-4B). Bosun is designed to evaluate whether connections in an agent's knowledge graph are warranted -- supported by evidence, non-redundant, and still factually current. Instead of using fixed criteria, Bosun is programmed at inference time by a natural-language instruction defining the specific rule to apply, so it can be reprogrammed per batch without fine-tuning. Version 1.1 added directional and typed-edge judgment (supersession, depends-on, supports, contradicts), expanding on the symmetric judgment of v1.0. The model outputs a calibrated probability P = sigmoid(logit_yes - logit_no) in the range 0 to 1. It is released under Apache 2.0 with no hosted API and no commercial license requirement.

How much does Bosun cost?

Bosun is completely free. Both Bosun-XS (0.6B) and Bosun-4B are released under Apache 2.0 and available for free download from HuggingFace with no per-token fee, no rate limit, and no commercial license requirement. The only cost is your own infrastructure. GGUF builds at Q4_K_M quantization require approximately 0.5 GB RAM for Bosun-XS and 2.5 GB RAM for Bosun-4B, both runnable on CPU without a GPU. For a team processing 10,000 knowledge graph edges per run on a modern CPU, the electricity cost is negligible. For GPU-accelerated batch inference on a leased A100, running 100,000 pairs through Bosun-4B costs roughly 0.30 USD in cloud compute time. Compare this to calling a frontier LLM API at 3 to 15 USD per million tokens for equivalent judgment workloads, where costs accumulate quickly at scale. Apache 2.0 allows modification and redistribution without any vendor royalty.
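
The comparison above can be worked through as simple arithmetic. The sketch below uses the figures quoted in this answer (0.30 USD per 100,000 pairs self-hosted, 3 to 15 USD per million tokens for an API); the 400 tokens per pair is an assumption made only for illustration, and real pairs may be shorter or longer.

```python
def api_cost_usd(num_pairs: int, tokens_per_pair: int,
                 usd_per_million_tokens: float) -> float:
    """Estimated API bill for judging num_pairs pairs, input tokens only."""
    return num_pairs * tokens_per_pair / 1_000_000 * usd_per_million_tokens


PAIRS = 100_000
TOKENS_PER_PAIR = 400      # ASSUMPTION: instruction + query + two short findings
SELF_HOSTED_USD = 0.30     # figure quoted above for a leased A100

api_low = api_cost_usd(PAIRS, TOKENS_PER_PAIR, 3.0)
api_high = api_cost_usd(PAIRS, TOKENS_PER_PAIR, 15.0)

if __name__ == "__main__":
    print(f"self-hosted: {SELF_HOSTED_USD:.2f} USD")
    print(f"API range:   {api_low:.2f} to {api_high:.2f} USD")
```

Under that assumption the API bill for the same 100,000 pairs lands between 120 and 600 USD, so the gap is a few hundred-fold or more. The ratio scales linearly with the assumed tokens per pair.
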

What is Bosun's context window and how does it handle long document pairs?

Bosun is built on Qwen3-Reranker, which has an effective maximum sequence length of approximately 8,192 tokens per pair. That budget covers the combined length of the instruction, the fixed query string, and the two findings being compared. Bosun is designed for paragraph-length comparisons, not multi-page document analysis. If the full prompt exceeds the limit, the Qwen3 tokenizer will truncate the input silently, degrading accuracy without raising an error. There is no sliding window or long-context mode available. For knowledge graph curation, where individual facts are typically one to five sentences each, 8,192 tokens is more than adequate. For comparing long documents such as research papers or legal contracts, a frontier LLM with a 200,000-token context window is more appropriate. The model produces only a scalar score as output, so there is no max output token limit to consider.
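
Because truncation is silent, it can help to check the length of each pair before scoring it. The following is a sketch under stated assumptions: `fits_context` and `rough_token_count` are hypothetical helpers, the 64-token allowance for template markup is a guess, and the whitespace counter only stands in for a real tokenizer. In practice you would pass a function built on the model's own tokenizer, for example `lambda s: len(tokenizer.encode(s, add_special_tokens=False))` with a Hugging Face tokenizer.

```python
from typing import Callable

MAX_TOKENS = 8192  # approximate per-pair limit quoted above


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fits_context(instruction: str, query: str,
                 finding_a: str, finding_b: str,
                 count_tokens: Callable[[str], int] = rough_token_count,
                 max_tokens: int = MAX_TOKENS,
                 template_overhead: int = 64) -> bool:
    """True if the whole prompt is expected to fit without truncation.

    template_overhead is an ASSUMED allowance for template markup.
    """
    parts = (instruction, query, finding_a, finding_b)
    total = sum(count_tokens(p) for p in parts) + template_overhead
    return total <= max_tokens


if __name__ == "__main__":
    ok = fits_context("Finding B supersedes Finding A",
                      "These two findings share the specified relationship",
                      "The limit is 100 requests per minute.",
                      "The limit was raised to 500 requests per minute.")
    print("fits" if ok else "too long: shorten or skip this pair")
```

Pairs that fail the check can be shortened, summarized, or routed to a long-context model instead of being scored on truncated input.
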

How does Bosun compare to using GPT or Gemini as a knowledge graph judge?

Bosun-4B outperforms Gemini 3.1 Flash Lite on PAWS paraphrase detection (0.91 vs 0.81) and WarrantBench steerability (0.945 vs 0.575), making it the better choice when judgments must reliably flip when instructions are negated -- the core requirement for graph edge curation. On FollowIR instruction-following retrieval, Bosun-4B scores +17.9 p-MRR (first place) compared to Gemini's capped 12.0. On e-CARE causal direction, Bosun-4B (0.85) nearly matches Gemini (0.86). However, Gemini outperforms Bosun on ANLI adversarial NLI (0.74 vs 0.57 for Bosun-4B), so for ANLI-heavy workloads a frontier LLM API is more accurate. The cost difference is substantial: Bosun runs on CPU with no per-token fee, while Gemini and GPT APIs bill per token. On the benchmarks above, Bosun's instruction-following accuracy exceeds that of Gemini 3.1 Flash Lite at zero ongoing API cost after setup. No GPT benchmark figures are given, so the GPT comparison here is one of cost only.
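
To make the idea of flipping under negation concrete, here is one simple way to measure it on your own data. This is an illustrative metric only, assuming a 0.5 decision threshold; it is not claimed to be how WarrantBench computes its steerability score, and `flip_rate` is a hypothetical name.

```python
from typing import Sequence


def flip_rate(scores_rule: Sequence[float],
              scores_negated: Sequence[float],
              threshold: float = 0.5) -> float:
    """Fraction of pairs whose yes/no decision changes when the rule is negated.

    scores_rule[i] and scores_negated[i] are the model's scores for the same
    pair under the original instruction and under its negation.
    """
    if len(scores_rule) != len(scores_negated):
        raise ValueError("score lists must have the same length")
    if not scores_rule:
        raise ValueError("need at least one pair")
    flips = sum((a >= threshold) != (b >= threshold)
                for a, b in zip(scores_rule, scores_negated))
    return flips / len(scores_rule)


if __name__ == "__main__":
    # Made-up scores for four pairs; the second pair fails to flip.
    print(flip_rate([0.9, 0.8, 0.2, 0.6], [0.1, 0.7, 0.9, 0.3]))  # 0.75
```

A judge that ignores the instruction would score each pair the same under both rules and get a flip rate of 0.
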

Is Bosun open source?

Yes, Bosun is fully open-source under Apache 2.0. Both the 0.6B (Bosun-XS) and 4B (Bosun-4B) model weights are publicly available on HuggingFace at Hanno-Labs/bosun-xs and Hanno-Labs/bosun-4b. The Apache 2.0 license allows commercial use, modification, redistribution, and sublicensing, subject to its attribution and notice requirements. GGUF quantizations (f16, Q8_0, Q4_K_M) are available at Hanno-Labs/bosun-xs-GGUF and Hanno-Labs/bosun-4b-GGUF for use with llama.cpp, Ollama, and other local inference runtimes. The WarrantBench evaluation benchmark and dataset are also open-source at github.com/Hanno-Labs/warrantbench. There is no private enterprise version or closed-weight variant; the publicly released weights are the production weights Hanno Labs uses.

What inputs and outputs does Bosun support?

Bosun accepts text inputs only and produces a single floating-point score as output. The input follows a three-part template: an instruction block defining the rule to apply, a fixed query string ('These two findings share the specified relationship'), and a document block containing FINDING A and FINDING B as text strings. The output is a probability P = sigmoid(logit_yes - logit_no) in the range 0 to 1, where values closer to 1 indicate the pair satisfies the supplied rule more strongly. Bosun does not support vision, audio, video, structured JSON output, natural language generation, function calling, or code execution. It is a pure scoring model. The instruction block accepts any natural-language rule: 'Finding B supersedes Finding A', 'These two facts contradict each other', 'Finding B depends on Finding A being true', and so on. Version 1.1 specifically added training for directional and typed-edge rules, improving accuracy on asymmetric relationships like supersession and dependency.
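
The three-part input described above can be assembled with a small helper. Treat this as a sketch: the `<Instruct>:` / `<Query>:` / `<Document>:` labels follow the layout commonly used with Qwen3-Reranker and are an assumption here, as is the exact formatting of the FINDING A and FINDING B lines. The serving.json file in each repository is the authoritative source for the real template. `build_pair_text` is a hypothetical name and the example findings are invented.

```python
FIXED_QUERY = "These two findings share the specified relationship"


def build_pair_text(instruction: str, finding_a: str, finding_b: str,
                    query: str = FIXED_QUERY) -> str:
    """Assemble instruction, fixed query, and the two findings into one prompt.

    ASSUMED layout; check serving.json for the exact template before use.
    """
    document = (f"FINDING A: {finding_a.strip()}\n"
                f"FINDING B: {finding_b.strip()}")
    return (f"<Instruct>: {instruction.strip()}\n"
            f"<Query>: {query}\n"
            f"<Document>: {document}")


if __name__ == "__main__":
    a = "The API limit is 100 requests per minute."
    b = "As of v2, the API limit is 500 requests per minute."
    # Same pair, two different rules: only the instruction block changes.
    print(build_pair_text("Finding B supersedes Finding A", a, b))
    print()
    print(build_pair_text("These two facts contradict each other", a, b))
```

Only the instruction changes between the two prompts, which is what reprogramming per batch amounts to in practice: the findings and the fixed query stay the same while the rule is swapped.
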

Does Bosun train on user data?

No. Bosun is a self-hosted model with no vendor-operated API. Because all inference runs on your own infrastructure, no data is ever sent to Hanno Labs. There is no telemetry, no usage monitoring, no input logging, and no model training on your data. The Apache 2.0 license gives you full control over the model weights and all outputs. For regulated industries or air-gapped deployments, the GGUF builds work completely offline with no external network calls. Hanno Labs has not published SOC 2 Type II, ISO 27001, HIPAA, or GDPR certifications for Bosun, which is expected for a self-hosted open-source model since those certifications apply to vendor-operated services. Your organization's own compliance posture governs your deployment.

Who is Bosun best for and who should avoid it?

Bosun is best for AI engineers building agent memory systems with knowledge graphs who need a fast, cheap, accurate pruning layer that removes stale or unsupported edges at scale. RAG pipeline engineers who need instruction-following reranking -- where the acceptance rule changes per query -- will find Bosun's +17.9 FollowIR score far more relevant than static embedding similarity. Open-source teams needing a free judge model with full data sovereignty and no API dependency are the core user group. Teams that should not use Bosun include those who need a generative or chat model (Bosun outputs a float, not text), those processing document pairs over 8,000 tokens (inputs will be truncated), teams needing high adversarial NLI accuracy (Gemini 3.1 Flash Lite scores 0.74 vs Bosun-4B's 0.57), and organizations without ML engineers who can manage a self-hosted PEFT or GGUF inference stack. For non-English text, performance is untested and likely degraded.
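
The pruning-layer use case described above might be wired up roughly as follows. This is a sketch, not a documented Bosun API: `Edge`, `prune_edges`, and the `score_pair` callable are hypothetical names, the 0.5 threshold is an arbitrary starting point you would tune, and the demo uses a stub scorer in place of a real model call.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Edge(NamedTuple):
    source: str
    target: str
    finding_a: str
    finding_b: str


# score_pair(rule, finding_a, finding_b) -> probability in [0, 1]
ScorePair = Callable[[str, str, str], float]


def prune_edges(edges: Iterable[Edge], rule: str, score_pair: ScorePair,
                threshold: float = 0.5) -> Tuple[List[Edge], List[Edge]]:
    """Split edges into (kept, dropped) according to the judge's score."""
    kept: List[Edge] = []
    dropped: List[Edge] = []
    for edge in edges:
        p = score_pair(rule, edge.finding_a, edge.finding_b)
        (kept if p >= threshold else dropped).append(edge)
    return kept, dropped


def stub_scorer(rule: str, finding_a: str, finding_b: str) -> float:
    """Placeholder for a real Bosun call; returns fixed made-up scores."""
    return 0.9 if "500" in finding_b else 0.2


if __name__ == "__main__":
    edges = [
        Edge("n1", "n2", "Limit is 100 per minute.", "Limit is now 500 per minute."),
        Edge("n1", "n3", "Limit is 100 per minute.", "The logo is blue."),
    ]
    kept, dropped = prune_edges(edges, "Finding B supersedes Finding A", stub_scorer)
    print(len(kept), "kept,", len(dropped), "dropped")
```

Returning the dropped edges alongside the kept ones makes it easy to log or review what the judge removed before deleting anything from the graph.
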
