Local Models & Inference Benchmarking

Name: LenserFight
Author: LenserFight

LenserFight is a local AI agent laboratory. It lets you run complex evaluations, prompt iterations, and model shootouts fully offline on your own silicon.

Whether you are a GPU hobbyist running LLMs on consumer gaming hardware, an inference engineer optimizing latency on dedicated rigs, or a developer who wants agent execution without API costs, LenserFight's local model support lets you run, register, and benchmark local models with the same tools you use for cloud-hosted ones.

Supported Inference Engines

You can connect LenserFight to any local model provider that exposes either Ollama's API or an OpenAI-compatible chat completions API:

Ollama (Recommended): Extremely simple, zero-config local engine. Ideal for offline haikus, logic tests, and quick prompt prototyping.
vLLM: Highly optimized, high-throughput model server. Perfect for profiling multi-agent parallel workflows and stress-testing VRAM.
llama.cpp: Minimalist, highly portable CPU/GPU inference server. Well suited to low-spec developer machines and to comparing different quantization levels of GGUF models.
Local OpenAI-Compatible API: Any server on localhost that exposes the standard OpenAI chat completions endpoint (/v1/chat/completions).
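Before connecting anything, it can help to confirm which engines are actually listening. The helper below is a hypothetical sketch, not a LenserFight command. It assumes curl is installed and that each engine uses its usual default port (11434 for Ollama, 8000 for vLLM's OpenAI server, 8080 for llama.cpp's llama-server), and it treats any successful HTTP response from a lightweight route as "up".

```bash
# Hypothetical helper: report which local engines answer on their usual default ports.
# The ports below are common defaults; edit the list if you run the engines elsewhere.
probe_local_engines() {
  local name url
  while read -r name url; do
    # -f makes HTTP errors count as failures; --max-time avoids hanging on dead hosts.
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      echo "${name}: up (${url})"
    else
      echo "${name}: down (${url})"
    fi
  done <<'EOF'
ollama http://127.0.0.1:11434/api/version
vllm http://127.0.0.1:8000/v1/models
llama.cpp http://127.0.0.1:8080/v1/models
EOF
}
```

Run probe_local_engines with no arguments; it prints one up/down line per engine.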

Setup & Environment

1. Ollama Configuration

By default, the LenserFight CLI and web app expect Ollama at its standard local address, http://127.0.0.1:11434.

If your Ollama daemon runs on a different port or on another machine on your network, set these environment variables:

```bash
export LENSERFIGHT_OLLAMA_BASE_URL=http://192.168.1.50:11434
export OLLAMA_BASE_URL=http://192.168.1.50:11434
```

LENSERFIGHT_OLLAMA_BASE_URL is used by the CLI and edge execution workers.
OLLAMA_BASE_URL is used by browser builds when executing directly in the web app.
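If the CLI and the web app should always target the same daemon, one option is to derive both variables from a single value. The sketch below assumes a helper variable named LOCAL_OLLAMA_URL, which is purely illustrative and is not read by LenserFight; when it is unset, the sketch falls back to the default address.

```bash
# LOCAL_OLLAMA_URL is a hypothetical helper variable, not a LenserFight setting.
# Fall back to the default Ollama address when it is unset or empty.
LOCAL_OLLAMA_URL="${LOCAL_OLLAMA_URL:-http://127.0.0.1:11434}"

# Point the CLI/edge workers and browser builds at the same daemon.
export LENSERFIGHT_OLLAMA_BASE_URL="$LOCAL_OLLAMA_URL"
export OLLAMA_BASE_URL="$LOCAL_OLLAMA_URL"
```

Setting LOCAL_OLLAMA_URL=http://192.168.1.50:11434 beforehand reproduces the remote example above.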

2. vLLM or llama.cpp (OpenAI Compatible)

To use a high-throughput engine like vLLM, or llama.cpp's server, configure it in LenserFight as an openai-type provider with a custom local base URL.

Make sure the server is running before you point LenserFight at it. For example, to serve Mistral 7B with vLLM on port 8000:

```bash
python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --port 8000
```
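vLLM can take a while to load weights before it accepts requests. A minimal readiness check, assuming the server exposes the OpenAI-style GET /v1/models route (vLLM's OpenAI server does, as does llama.cpp's llama-server in recent versions), polls that route with curl until it answers. The function name and its retry parameters are illustrative, not part of LenserFight.

```bash
# Hypothetical helper: poll an OpenAI-compatible base URL until GET <base>/models succeeds.
# Usage: wait_for_openai_server BASE_URL [ATTEMPTS] [INTERVAL_SECONDS]
wait_for_openai_server() {
  local base_url="$1"
  local attempts="${2:-30}"
  local interval="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # ${base_url%/} drops one trailing slash so ".../v1/" and ".../v1" behave the same.
    if curl -sf --max-time 5 -o /dev/null "${base_url%/}/models"; then
      echo "ready: ${base_url}"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "not ready after ${attempts} attempts: ${base_url}" >&2
  return 1
}

# Example (vLLM started as shown above):
#   wait_for_openai_server http://localhost:8000/v1 && echo "vLLM is up"
# llama.cpp's llama-server listens on port 8080 by default (the model path is a placeholder):
#   llama-server -m ./models/your-model.gguf --port 8080
#   wait_for_openai_server http://localhost:8080/v1
```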

CLI Model Execution

Use the CLI to run a single offline prompt and check execution latency:

```bash
# Execute via Ollama
lf run exec --ollama --model llama3.2 --prompt "Explain the concept of entropy simply."

# Execute via local OpenAI-compatible endpoint (vLLM or llama.cpp)
lf run exec \
  --provider openai \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --config '{"baseUrl":"http://localhost:8000/v1"}' \
  --prompt "Explain quantum entanglement in one paragraph."
```
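To put the two engines side by side, a small wrapper can send one prompt through both commands shown above. compare_local_engines is a hypothetical name; the models and base URL are copied from those examples, so adjust them to whatever you are actually serving.

```bash
# Hypothetical wrapper: run one prompt through the Ollama and vLLM commands shown above.
compare_local_engines() {
  local prompt="$1"

  echo "== Ollama: llama3.2 =="
  lf run exec --ollama --model llama3.2 --prompt "$prompt"

  echo "== vLLM: Mistral-7B-Instruct-v0.3 =="
  lf run exec \
    --provider openai \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --config '{"baseUrl":"http://localhost:8000/v1"}' \
    --prompt "$prompt"
}

# Example:
#   compare_local_engines "Explain the concept of entropy simply."
```

Ollama loads a model into memory on its first request, so when you care about latency, run the wrapper once to warm up and compare the runs that follow.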

Register a Local Lenser (AI Agent)

To use your local model inside structured workflows, battles, or agent teams, register it as a Lenser adapter:

```bash
# Register Ollama Lenser
lf lenser ai connect \
  --name "Local Llama 3.2" \
  --type ollama \
  --config '{"model":"llama3.2","baseUrl":"http://localhost:11434"}'

# Register vLLM Lenser
lf lenser ai connect \
  --name "Local Mistral 7B" \
  --type openai \
  --config '{"model":"mistralai/Mistral-7B-Instruct-v0.3","baseUrl":"http://localhost:8000/v1"}'
```

Once connected, these Lensers can participate in battles, Elo matchmaking, and workflow DAG runs just like cloud-hosted models.
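When you want several Ollama tags in the same battle (for example two quantizations of one model), the connect command can be looped. The function below is a hypothetical helper that reuses only the flags shown above. It builds the --config JSON by string interpolation, which assumes the tags contain no double quotes (Ollama tags normally do not).

```bash
# Hypothetical helper: register one Ollama-backed Lenser per model tag.
# Usage: register_ollama_lensers BASE_URL TAG [TAG...]
register_ollama_lensers() {
  local base_url="$1"
  shift
  local model
  for model in "$@"; do
    lf lenser ai connect \
      --name "Local ${model}" \
      --type ollama \
      --config "{\"model\":\"${model}\",\"baseUrl\":\"${base_url}\"}"
  done
}

# Example (tags are illustrative; use the ones listed by `ollama list`):
#   register_ollama_lensers http://localhost:11434 llama3:8b-instruct-q4_K_M llama3:8b-instruct-q8_0
```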

Web App Execution & CORS

When executing workflows in the React/Vite dashboard, the browser communicates directly with your local inference server.

If your browser cannot connect to your local Ollama or vLLM instances, verify that:

The inference server daemon is active and serving requests.
The host is reachable from your machine (curl -f http://localhost:11434/api/version exits with status 0).
CORS is permitted. If you run Ollama, start the server with the OLLAMA_ORIGINS variable set so that it accepts browser requests:
```bash
OLLAMA_ORIGINS="*" ollama serve
```
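Setting OLLAMA_ORIGINS to "*" lets any web page you open call your local Ollama; listing only the dashboard's origin is narrower. The commented command below assumes the dashboard runs on Vite's default dev origin, http://localhost:5173, which may differ in your setup. The check_cors helper is hypothetical: it sends a request with an Origin header and succeeds only if the response carries an Access-Control-Allow-Origin header.

```bash
# Narrower than "*": allow only the dashboard origin (assumes Vite's default dev port).
#   OLLAMA_ORIGINS="http://localhost:5173" ollama serve

# Hypothetical helper: succeed if URL answers requests from ORIGIN with a CORS allow header.
# Usage: check_cors URL ORIGIN
check_cors() {
  local url="$1" origin="$2"
  # -D - prints response headers to stdout; -o /dev/null discards the body.
  curl -s --max-time 5 -o /dev/null -D - -H "Origin: ${origin}" "$url" \
    | grep -qi '^access-control-allow-origin:'
}

# Example:
#   check_cors http://localhost:11434/api/version http://localhost:5173 && echo "CORS ok"
```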

📊 Benchmarking & Profiling Offline

LenserFight lets you systematically benchmark local open-source models:

Compare Quantizations: Battle Llama-3-8B-Q4_K_M against Llama-3-8B-Q8_0 under identical Lenses and Rubrics to analyze reasoning degradation vs. speed gains.
Track Tokens Per Second: Monitor model generation speeds, time-to-first-token (TTFT), and total execution latencies across different hardware settings.
Evaluate Prompt Sensitivities: Run parallel battles with different system instructions or sampling temperatures to find the best configuration for your agent.
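For a raw engine baseline to set beside LenserFight's own measurements, Ollama's non-streaming /api/generate response reports eval_count (generated tokens) and eval_duration (nanoseconds spent generating them), so decode throughput is eval_count / (eval_duration / 10^9) tokens per second. The helper below is a hypothetical sketch that extracts those two fields with awk, so it needs no extra JSON tooling; it assumes one JSON object per line, which is what Ollama returns with "stream": false.

```bash
# Hypothetical helper: read one Ollama /api/generate JSON response on stdin
# and print decode throughput in tokens per second (exit 1 if eval_duration is missing).
ollama_decode_tps() {
  awk '
    # The leading quote in each pattern keeps "prompt_eval_*" fields from matching.
    match($0, /"eval_count": *[0-9]+/) {
      s = substr($0, RSTART, RLENGTH); sub(/.*: */, "", s); n = s + 0
    }
    match($0, /"eval_duration": *[0-9]+/) {
      s = substr($0, RSTART, RLENGTH); sub(/.*: */, "", s); d = s + 0
    }
    END {
      if (d > 0) printf "%.2f\n", n / (d / 1e9)
      else exit 1
    }
  '
}

# Example:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"llama3.2","prompt":"Explain the concept of entropy simply.","stream":false}' \
#     | ollama_decode_tps
```

The same response also includes load_duration and prompt_eval_duration (both in nanoseconds); their sum is a rough server-side estimate of the delay before the first token.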

🤝 Share Your Benchmarks & Hardware Setups

Results from local hardware benchmarks, custom quantizations, and offline model duels are useful to the wider developer community. We welcome you to share your local experiments:

Document Your Setup: Share your configuration, or record a walkthrough explaining how you integrated Ollama, vLLM, or llama.cpp with LenserFight to run offline evaluations.
Post Benchmark Results: If you compared an open-source model against commercial APIs, post the resulting metrics, latencies, or Elo changes on developer channels or social platforms with the hashtag #LenserFight so the community can discover your findings.
Analyze Agent Hallucinations: If a local model fails a task or loops under high-temperature configurations, share the execution trace in our GitHub Discussions to help others analyze prompt robustness.

You can also open a Pull Request to propose adding your guide or benchmark sheet to the community showcase table in the root README.
