Sometimes you want AI, but you don’t want:
- your data leaving your machine,
- another vendor key to manage,
- usage-based surprise bills.
That’s where local LLMs shine.
Ollama makes this easy: it runs a local server, and you talk to it like any HTTP service, which is perfect for Java.
This post shows the simplest possible integration: Java HttpClient → Ollama → response printed.
Step 1: Install Ollama and pull a model
Install Ollama (their site has the OS-specific installer). Once it’s installed, pull a model.
If you’re unsure, start here:
ollama pull llama3.2:3b
Other good options:
- Smaller/faster: llama3.2:1b
- Coding-focused: qwen2.5-coder:7b
- Small + efficient: phi3:mini
- Vision (optional): llava
You can also see what you have locally:
ollama list
Step 2: Start Ollama
Usually Ollama runs as a background service once installed. If you need to start it manually:
ollama serve
Ollama listens on: http://localhost:11434
That’s it. Now it’s just HTTP.
Step 3: Call the local LLM from Java
We’ll use the endpoint:
POST http://localhost:11434/api/generate
Important detail for beginners: set "stream": false so you get one clean JSON response (instead of token-by-token streaming).
Main.java
import java.net.URI;
import java.net.http.*;
import java.time.Duration;

public class Main {
    public static void main(String[] args) throws Exception {
        String model = "llama3.2:3b";
        String prompt = "Write a friendly explanation of Java virtual threads in 5 lines.";

        String body = """
                {
                  "model": "%s",
                  "prompt": "%s",
                  "stream": false
                }
                """.formatted(model, escapeJson(prompt));

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                // the first request can be slow while Ollama loads the model
                // into memory, so give it more headroom than you'd expect
                .timeout(Duration.ofSeconds(120))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                System.out.println("Request failed: " + response.statusCode());
                System.out.println(response.body());
                return;
            }
            System.out.println(response.body());
        } catch (java.net.ConnectException e) {
            System.out.println("Cannot connect to Ollama at localhost:11434");
            System.out.println("Is Ollama running? Try: ollama serve");
        }
    }

    // minimal escaping so the prompt can safely sit inside a JSON string;
    // without the \n/\r/\t replacements, a multi-line prompt breaks the JSON
    private static String escapeJson(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t");
    }
}
Run it, and you’ll see JSON printed back.
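For reference, the non-streaming reply is a single JSON object shaped roughly like this (trimmed for readability; values are illustrative, and fields like context and the timing/token counts are omitted):

```json
{
  "model": "llama3.2:3b",
  "created_at": "2024-01-01T12:00:00Z",
  "response": "Virtual threads are lightweight threads managed by the JVM...",
  "done": true
}
```

The text you actually care about lives in the response field.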
“Okay but I only want the text”
Totally fair.
Ollama returns JSON that includes a response field (the actual generated text). You can extract it properly with Jackson (recommended) instead of brittle string parsing.
This is the exact moment where Java feels good: you define a tiny record for the response, parse it, and keep your code clean.
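If you’d rather stay dependency-free for this hello-world, here’s a minimal sketch that pulls out just the response string. It assumes Ollama’s compact non-streaming JSON (i.e. "response":" with no space after the colon); the class and method names (JsonExtract, extractResponse) are just for illustration. Once you go beyond this, Jackson is the better path.

```java
// JsonExtract.java — minimal, dependency-free extraction of the "response"
// field from Ollama's non-streaming /api/generate reply. A sketch for the
// hello-world only; use a real JSON library for anything more serious.
public class JsonExtract {

    public static String extractResponse(String json) {
        String key = "\"response\":\"";              // assumes compact JSON
        int start = json.indexOf(key);
        if (start < 0) return null;                  // field not present
        StringBuilder sb = new StringBuilder();
        for (int i = start + key.length(); i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '"') break;                     // unescaped quote ends the value
            if (c == '\\' && i + 1 < json.length()) {
                char next = json.charAt(++i);
                switch (next) {
                    case 'n' -> sb.append('\n');
                    case 't' -> sb.append('\t');
                    case 'r' -> sb.append('\r');
                    case 'u' -> {                    // \uXXXX escape
                        sb.append((char) Integer.parseInt(json.substring(i + 1, i + 5), 16));
                        i += 4;
                    }
                    default -> sb.append(next);      // covers \" \\ \/ etc.
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "{\"model\":\"llama3.2:3b\",\"response\":\"Line one.\\nLine two.\",\"done\":true}";
        System.out.println(extractResponse(sample));
    }
}
```

Swap System.out.println(response.body()) in Main for System.out.println(extractResponse(response.body())) and you get just the text.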
Quick troubleshooting (the stuff that trips people)
If you see “connection refused”:
- Ollama isn’t running. Start it with ollama serve or open the Ollama app.
If you get a model error:
- You didn’t pull the model yet. Run ollama pull llama3.2:3b (or whatever model name you used).
If responses are slow:
- Try a smaller model like llama3.2:1b or phi3:mini.
Why local LLM + Java is a great combo
Local models are not about “replacing the cloud.” They’re about control:
- privacy for internal docs and dev workflows,
- fast iteration while building features,
- no key management,
- predictable cost (because it’s your machine doing the work).
Java fits because it’s already the home for real systems, not just experiments. Once your “hello world” call works, it’s a straight road to turning it into a service, adding caching, metrics, timeouts, and all the stuff production requires.
What I’ll write next
If you want to go one step beyond this:
- streaming responses (token-by-token)
- structured output (JSON-only answers)
- tool calling (Java functions as “tools”)
- evals (so changes don’t silently break behavior)
- Suren