
Evals for Spring AI Agents with Dokimos

Author: Jettro Coenradie
Software architect and search enthusiast. I write about AI, search, cloud, and software development.

Whisky barrels

With Java entering the AI domain, I wanted to try an evaluation framework for agentic applications. I found Dokimos, it looked good, and I decided to experiment with it. In this blog, you will read about testing a vanilla agent, a RAG-based agent, and an agent with tools and structured output. Why whisky? You always need a good dataset, and what is a better dataset than one about whisky?

Dokimos — LLM Evaluation framework for Java

You can find more information about Dokimos on the website. There is a lot of documentation, but it is not always easy to use. Still, I find the framework worth the time to experiment with. You could build all of this yourself; however, why would you?

https://dokimos.dev

As befits a good Java project, the jars are available in Maven Central. For Dokimos, you need the following dependencies. I am using version 0.14.1.

<!-- Dokimos - Core -->
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>

<!-- Dokimos - Spring AI Integration -->
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-spring-ai</artifactId>
    <version>${dokimos.version}</version>
</dependency>

<!-- Dokimos - JUnit Integration -->
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-junit</artifactId>
    <version>${dokimos.version}</version>
    <scope>test</scope>
</dependency>

Notice the dokimos-spring-ai dependency. With this library, it is easy to integrate the Dokimos components with those from Spring AI. I like the offline evals with the JUnit integration. Beware, offline means “not in production”; you still need access to an LLM. Below is the JUnit setup method that creates the objects needed to interact with an LLM through the Spring AI integration.

@BeforeEach
void setUp() {
    OpenAiChatOptions openAiChatOptions = OpenAiChatOptions.builder()
            .model(OpenAiApi.ChatModel.GPT_5_MINI)
            .build();

    OpenAiApi openAiApi = OpenAiApi.builder()
            .apiKey(System.getenv("OPENAI_API_KEY"))
            .build();

    this.chatModel = OpenAiChatModel.builder()
            .defaultOptions(openAiChatOptions)
            .openAiApi(openAiApi)
            .build();

}

When creating the Dokimos Judge, you can use this chatModel.

JudgeLM judge = SpringAiSupport.asJudge(chatModel);

In the next section, you will read about testing a vanilla Spring AI Agent without tools, MCP, memory, and RAG.

Evaluating the Agent LLM call

An evaluation starts with a Dataset. It can be read from a file or created within the test. When executing parameterised tests, you want to read the examples from a file. In this case, I create them manually within the unit test.

Dataset dataset = Dataset.builder()
        .name("test-dataset")
        .description("Test dataset for evaluation")
        .addExample(Example.of(
                "Where is whisky originated from?",
                "Whisky is originated from Scotland."
        ))
        .build();

Notice the example. It consists of two elements: the input and the expected output. Next, you create the task.

Task task = example -> {
    String query = example.input();

    String response = chatService.chat(query);

    return Map.of(
            "output", response
    );
};

This is constructed as a lambda. Notice the result is a Map. For this test, you only need the output; later, you will add more fields to this map. Next, you initialise the judge with the Spring AI ChatModel.

JudgeLM judge = SpringAiSupport.asJudge(chatModel);

Next, you need to create the evaluator. I have decided to use an LLMJudgeEvaluator to check if the actual answer is similar to the provided expected answer.

List<Evaluator> evaluators = List.of(
        LLMJudgeEvaluator.builder()
                .name("Answer Quality")
                .criteria("Is the answer helpful, clear, and accurate?")
                .evaluationParams(List.of(
                        EvalTestCaseParam.INPUT,
                        EvalTestCaseParam.EXPECTED_OUTPUT,
                        EvalTestCaseParam.ACTUAL_OUTPUT
                ))
                .judge(judge)
                .threshold(0.8)
                .build()
);

Notice how the judge is added to the evaluator. Notice the threshold as well: the judge scores the actual output, and you can use the threshold to fail the test when the score is too low. You can change the criteria to steer the judging in a specific direction. The evaluation parameters you list are the ones that are added to the judge prompt.
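To make the role of the threshold concrete, here is a minimal plain-Java sketch of the idea. It is not how Dokimos implements it: the ThresholdDemo class and its method names are made up for illustration, and the rule that a score at or above the threshold counts as a pass is an assumption.

```java
import java.util.List;

/** Illustrative only: shows how a threshold could turn judge scores into pass/fail. */
final class ThresholdDemo {

    /** Assumption: a score at or above the threshold counts as a pass. */
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    /** Mean of all scores, or 0.0 when there are none. */
    static double averageScore(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Fraction of scores that reach the threshold, or 0.0 when there are none. */
    static double passRate(List<Double> scores, double threshold) {
        if (scores.isEmpty()) {
            return 0.0;
        }
        long passed = scores.stream().filter(s -> passes(s, threshold)).count();
        return (double) passed / scores.size();
    }

    public static void main(String[] args) {
        List<Double> scores = List.of(0.9, 0.7, 1.0);
        System.out.println("average = " + averageScore(scores));
        System.out.println("pass rate = " + passRate(scores, 0.8));
    }
}
```

With these three scores the average is above 0.8 even though one example is below it, so an average can hide an individual failure. That is a reason to look at the pass rate next to the average once the dataset grows.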

In the following step, you execute an experiment. With a bigger dataset, this would run the test for all the examples in the set. In this case, there is just one. This code speaks for itself; it uses the objects created in the previous steps.

ExperimentResult result = Experiment.builder()
        .name("Agent Evaluation")
        .dataset(dataset)
        .task(task)
        .evaluators(evaluators)
        .build()
        .run();

What is left are the assertions. Below is the assertion for this specific scenario.

assertAll(
        () -> assertTrue(result.averageScore("Answer Quality") >= 0.8,
                "Answer Quality: " + result.averageScore("Answer Quality"))
);

I hope you are curious about the test case result. I added a few print statements to get more information; these are in the GitHub repository linked further down in this post. Notice the reason that explains the score given by the LLM judge.

=== Experiment Results ===
Name: Agent Evaluation
Total examples: 1
Passed: 1.0
Failed: 0.0
Pass rate: 100.0%
Example input: Where is whisky originated from?
Expected output: Whisky is originated from Scotland.
Actual output: Whisky (or whiskey) traces its origins to medieval Ireland and Scotland. Distillation was practiced by monks in Europe, and by the late Middle Ages they were making distilled spirits from grain in the Gaelic-speaking regions. The word comes from Gaelic uisge beatha (or usquebaugh) meaning “water of life.”

Key historical notes:
- The earliest clear written record often cited is from Scotland’s Exchequer Rolls (1494), which mention malt sent to a friar “to make aqua vitae.”
- Ireland also has very early claims (the Old Bushmills distillery site claims a licence dating to 1608), and both countries developed their own longstanding traditions.

Over time whisky-making spread worldwide (notably to the United States, Canada and Japan) and evolved into many styles. Spelling differs by region: “whisky” (Scotland, Canada, Japan) vs “whiskey” (Ireland, United States).

If you want, I can give more detail on how production methods or styles developed in Scotland vs Ireland, or explain differences between Scotch, Irish, bourbon and other types.
Reason: Helpful, clear, and historically accurate: the answer correctly notes that whisky/whiskey traces to both medieval Scotland and Ireland, provides supporting historical evidence and context, and is more informative than the single-country expected response.

In the next section, you will learn about using evaluations with an Agent that uses RAG to add content to the context.

Evaluating RAG

I have written about the quality of RAG before. Check a post on this site and the documentation for the RAG4j library I created. I used three metrics:

Precision — the quality of the results of the retriever in relation to the question. Do the results answer the question?

Contextual Accuracy — the quality of the answer in relation to the context. Is the answer reflecting what the context provides? Or does it make up stuff?

Answer Completeness — the quality of the answer in relation to the question. Is it answering the question?

Dokimos provides the same quality metrics and a few more. Contextual accuracy is handled by the Faithfulness evaluator. It also has a Hallucination evaluator, which puts a number to the amount of hallucination you allow. Precision is determined by the ContextualRelevanceEvaluator. There is also a PrecisionEvaluator and a RecallEvaluator. These do require a judgment list of queries and expected results. That is a bit too much for this blog. The completeness of the answer is taken care of with the LLM judge from the previous section.
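To give an idea of what precision and recall measure against such a judgment list, here is a minimal plain-Java sketch. It is not the Dokimos implementation: the RetrievalMetrics class and its methods are hypothetical, and collapsing duplicate hits before counting is my own choice.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative retrieval metrics; the names are hypothetical, not Dokimos API. */
final class RetrievalMetrics {

    /** Fraction of the (de-duplicated) retrieved documents that appear in the judgment list. */
    static double precision(List<String> retrieved, Set<String> relevant) {
        Set<String> unique = new LinkedHashSet<>(retrieved); // assumption: duplicates count once
        if (unique.isEmpty()) {
            return 0.0;
        }
        long hits = unique.stream().filter(relevant::contains).count();
        return (double) hits / unique.size();
    }

    /** Fraction of the judgment list that was actually retrieved. */
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        Set<String> unique = new LinkedHashSet<>(retrieved);
        long hits = relevant.stream().filter(unique::contains).count();
        return (double) hits / relevant.size();
    }
}
```

Precision drops when the retriever returns documents that are not in the judgment list; recall drops when documents from the judgment list are missing from the results. Both need someone to maintain that list per query, which is the extra work mentioned above.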

Faithfulness

It is essential for a RAG system that it does not make up facts. It should not hallucinate. Therefore, you need to check whether the answer comes from the context provided by the RAG component. You have to check whether the agent is faithful in providing you with answers.

To verify faithfulness, you use the FaithfulnessEvaluator. Like in the previous steps, you need a dataset, a task, a judge with evaluators, and an experiment.

Dataset dataset = Dataset.builder()
        .name("test-dataset")
        .description("Test dataset for evaluation")
        .addExample(Example.of(
                "What is a good peated whisky?",
                "Examples of good peated whisky are Benriach Peated Quarter Cask Whisky and Paul John Peated Malt"
        ))
        .build();

For the task, you need a response that also returns the documents retrieved by the RAG system. The “chatRag(..)” method returns the output as well as the retrieved documents. The map returned by the task now contains the retrievedContext parameter.

Task task = example -> {
    String query = example.input();

    var response = chatService.chatRag(query);

    return Map.of(
            "output", response.content(),
            "retrievedContext", String.join(", ", response.foundDocuments())
    );
};

The evaluator uses this task parameter to learn about the RAG results.

List<Evaluator> evaluators = List.of(
        FaithfulnessEvaluator.builder()
                .name("Context Faithfulness")
                .threshold(0.8)
                .judge(judge)
                .contextKey("retrievedContext")
                .includeReason(true)
                .build()
);

I personally liked the explanation of why the response to the question “What is a good peated whisky?” was not faithful. First, judge this yourself.

After the query rewrite for the Lucene backend, the query becomes: good peated whisky. Our RAG component fetches the following results:

Paul John Peated Malt, Benriach Peated Quarter cask Whisky, Togouchi Peated Cask Malt Whisky, and a lot of duplicated values.

Below is the agent’s answer.

From the list you gave, good peated whiskies include:

  • Paul John Peated Malt (appears multiple times in your list)
  • Togouchi Peated Cask Malt
  • Benriach Peated Quarter Cask

Paul John Peated Malt stands out in your context as a strong choice. If you want tasting notes, comparisons, or recommendations for a purchase, I don’t have that information in the context — tell me if you want me to look it up.

Now this is what the Faithfulness Judge thinks of the answer:

Most “IDK” judgments stem from the TRUTHS merely listing whisky names without any evaluative or qualitative statements (so claims they are “good” or “stand out” aren’t supported), while the two “Yes” verdicts rest on explicit facts: Paul John Peated Malt’s repeated appearance and the absence of tasting notes/comparisons.

I have to agree that the current content provided by the RAG component cannot be used to judge a whisky’s quality.
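The judge's explanation hints at how such a score can be built: the answer is split into claims, and each claim gets a verdict against the retrieved context. The sketch below turns a list of verdicts into a score. It illustrates the general technique and is not the Dokimos code: the Verdict enum, the method name, the score for an empty list, and above all the choice of whether an “IDK” verdict counts for or against the answer are assumptions.

```java
import java.util.List;

/** Illustrative only: turns per-claim verdicts into a faithfulness score between 0 and 1. */
final class FaithfulnessSketch {

    /** Verdict of the judge for one claim, checked against the retrieved context. */
    enum Verdict { YES, NO, IDK }

    /**
     * Fraction of claims that count as supported.
     * The flag makes the treatment of IDK explicit, because that choice changes the score a lot.
     */
    static double score(List<Verdict> verdicts, boolean idkCountsAsSupported) {
        if (verdicts.isEmpty()) {
            return 1.0; // assumption: no claims means nothing can contradict the context
        }
        long supported = verdicts.stream()
                .filter(v -> v == Verdict.YES || (idkCountsAsSupported && v == Verdict.IDK))
                .count();
        return (double) supported / verdicts.size();
    }
}
```

For example, two YES verdicts and three IDK verdicts (the real number of IDK verdicts is not shown in the output above) give 0.4 under the strict reading and 1.0 under the lenient one. An answer that recommends whiskies as “good” on top of a context that only lists names fails under the strict reading, which matches what happened here.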

Contextual Relevance

For contextual relevance, you want to check whether the RAG system’s results are relevant to the question asked. First, you need to change the task and add the retrievalContext to the output.

Task task = example -> {
    String query = example.input();

    var response = chatService.chatRag(query);

    return Map.of(
            "output", response.content(),
            "retrievedContext", String.join(", ", response.foundDocuments()),
            "retrievalContext", response.foundDocuments()
    );
};

Next, you want to add the evaluator to the list of evaluators.

ContextualRelevanceEvaluator.builder()
        .name("Context Relevance")
        .threshold(0.5)
        .judge(judge)
        .retrievalContextKey("retrievalContext")
        .includeReason(true)
        .strictMode(false)
        .build()

Notice the retrievalContextKey and that I include reasoning in the output. In this scenario, the contextual relevance is fine.

Reason: All 10 retrieved contexts are highly relevant (FINAL SCORE 0.980), primarily recommending specific peated whiskies — especially Paul John Peated Malt, BenRiach Peated Quarter Cask, and Togouchi Peated Cask Malt — with no irrelevant or partially relevant results.

Finally, you need to add an assertion.

() -> assertTrue(result.averageScore("Context Relevance") >= 0.8,
        "Context Relevance: " + result.averageScore("Context Relevance"))
);

That is it, now you can evaluate an AI Agent that makes use of RAG to add content to the context. In the next section, you will explore how to evaluate structured output.

Evaluating structured output

Here, you can do multiple things, but the framework’s support can be improved. I will show you how to verify one part of a structured response. You can create a custom evaluator, but the dataset always expects a string as the output.

I created a tool that opens the website of my favourite whisky supplier De Helm. I search for a whisky by name, click on the whisky and extract additional information. This is all in a headless browser using the agent-browser I discussed in a previous blog. If you like whisky, pay them a visit; it is a lovely store.

For now, the tool is not particularly interesting; we only need the result: the following class.

public record WhiskySearchResult(
        String name,
        String price,
        String category,
        String volume,
        String alcoholPercentage,
        String description,
        boolean found
) {
    public static WhiskySearchResult notFound(String query) {
        return new WhiskySearchResult(query, null, null, null, null,
                "No results found on the website for: " + query, false);
    }
}

You test whether the agent returns the alcohol percentage of the whisky it found. It starts with the dataset.

Dataset dataset = Dataset.builder()
        .name("test-dataset")
        .description("Test dataset for evaluation")
        .addExample(Example.of(
                "Find alcohol percentage of the Togouchi Peated Cask",
                "40%"
        ))
        .build();

Next, you define the task.

Task task = example -> {
    String query = example.input();

    var response = chatService.chatTools(query);

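    // Map.of rejects null values; fall back to an empty string when no percentage was found
    // (needs import java.util.Objects)
    return Map.of(
            "output", Objects.requireNonNullElse(response.alcoholPercentage(), "")
    );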
};

The response is now a WhiskySearchResult. Notice that the alcohol percentage is obtained from the response object. Using an exact-match evaluator, you check the alcohol percentage.

List<Evaluator> evaluators = List.of(
        ExactMatchEvaluator.builder()
                .name("Exact Match Alcohol Percentage")
                .threshold(1.0)
                .build()
);

The experiment is the same as before.

ExperimentResult result = Experiment.builder()
        .name("Agent Evaluation")
        .dataset(dataset)
        .task(task)
        .evaluators(evaluators)
        .build()
        .run();

And finally, the assertion.

assertAll(
        () -> assertTrue(result.averageScore("Exact Match Alcohol Percentage") >= 1,
                "Exact Match Alcohol Percentage: " + result.averageScore("Exact Match Alcohol Percentage"))
);
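Exact matching on a scraped text field can be brittle: “40%”, “40 %” and “40,0% vol.” all describe the same value. One option is to normalise the value in the task before it goes into the output map. The sketch below shows that idea under my own assumptions; the AlcoholPercentage helper is hypothetical and not part of the code in the repository.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that brings differently formatted percentages into one canonical form. */
final class AlcoholPercentage {

    // A number with an optional decimal part (dot or comma), optional whitespace, then '%'.
    private static final Pattern PERCENT = Pattern.compile("(\\d+(?:[.,]\\d+)?)\\s*%");

    /** Returns the first percentage found as e.g. "40%" or "46.3%", or empty when there is none. */
    static Optional<String> normalise(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        Matcher matcher = PERCENT.matcher(raw);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Use a dot as decimal separator and drop trailing zeros, so "40,0" becomes "40".
        BigDecimal value = new BigDecimal(matcher.group(1).replace(',', '.')).stripTrailingZeros();
        return Optional.of(value.toPlainString() + "%");
    }
}
```

In the task, the output value could then become something like `AlcoholPercentage.normalise(response.alcoholPercentage()).orElse("")`, so the exact-match evaluator compares canonical strings instead of whatever formatting the website happens to use.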

Dokimos conclusion

So far, I like Dokimos. It is an interesting framework that has most of the evaluators that you need. Out-of-the-box support for structured output would be a good addition. To me, it feels that type safety, rather than using strings, could help in some situations. But overall, I think it is a good framework to use for your Agent/LLM evaluations.

You can find the code in my repository.

https://github.com/jettro/evals-dokimos

Below are some words on the Spring AI part; if you know how that works, you can skip it. If you are curious, read on.

A few words on Spring AI

The class ChatService contains the code to run the agents. It contains three methods. The first, chat, is meant to answer a generic question that an LLM should know.

The second is chatRag. This method performs RAG, and it includes a RewriteQueryTransformer that translates the user’s question into something a Lucene-based search engine will understand. Below is the code for the LLM call.

public RagResponse chatRag(String message) {
    var transformer = RewriteQueryTransformer.builder()
            .promptTemplate(PromptTemplate.builder()
                    .template("Rewrite the following query to be suitable for a search on {target}: {query}. " +
                            "Extract only the terms for the query.")
                    .build())
            .targetSearchSystem("Lucene style search engine")
            .chatClientBuilder(chatClientBuilder.build().mutate())
            .build();

    RetrievalAugmentationAdvisor advisor = RetrievalAugmentationAdvisor.builder()
            .queryTransformers(transformer)
            .documentRetriever(query -> {
                try {
                    SearchResponse searchResponse = luceneDatastore.search(new SearchRequest(query.text()));
                    System.out.println("Search results for '" + query.text() + "':");
                    return searchResponse.results().stream()
                            .map(searchResult -> new Document(searchResult.content()))
                            .toList();
                } catch (Exception e) {
                    return List.of();
                }
            })
            .queryAugmenter(ContextualQueryAugmenter.builder()
                    .allowEmptyContext(true)
                    .promptTemplate(PromptTemplate.builder()
                            .template("""
                                    You are a whisky expert. Use the following context to answer the question.
                                    If the information is not in the context, say you don't know.

                                    Context:
                                    {context}

                                    Question:
                                    {query}
                                    """)
                            .build())
                    .build())
            .build();

    var response = this.chatClient.prompt()
            .user(message)
            .advisors(advisor)
            .call()
            .chatResponse();
    Object ragDocumentContext = response.getMetadata().get("rag_document_context");
    if (ragDocumentContext instanceof List<?> list) {
        System.out.println("RAG Document Context:");
        list.forEach(document -> {
            System.out.println("- " + document);
        });
        list.stream().map(Document.class::cast).forEach(System.out::println);
        List<String> docs = list.stream().map(Document.class::cast).map(Document::getText).toList();
        String joinedDocs = docs.stream().collect(Collectors.joining("\n---\n"));
        System.out.println("Joined documents: " + joinedDocs);
        return new RagResponse(response.getResult().getOutput().getText(), docs);
    }

    return new RagResponse(response.getResult().getOutput().getText());
}

First, there is the transformer. Next, I create the advisor that accepts the transformer. This advisor contains the document retriever, which uses the LuceneDatastore to query for documents. Next, the ContextualQueryAugmenter tells the LLM what to do using the content retrieved by the retriever. Finally, I obtain the results and parse them into the RagResponse format, which includes the assistant’s response and the retrieved documents from the retriever.
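Stripped of the Spring AI types, the flow described above comes down to four steps. The sketch below shows only the order of those steps with plain Java functions; the RagPipelineSketch class and its parameters are hypothetical. It also assumes that the original question, and not the rewritten query, ends up in the prompt. That is how I understand the advisor, but treat it as an assumption.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Illustrative only: the order of the RAG steps, without any Spring AI types. */
final class RagPipelineSketch {

    /** The answer together with the documents that were used as context. */
    record RagAnswer(String content, List<String> documents) {}

    static RagAnswer run(String question,
                         UnaryOperator<String> rewriteQuery,
                         Function<String, List<String>> retrieve,
                         BiFunction<String, List<String>, String> generate) {
        // 1. Rewrite the question into something the search engine understands.
        String searchQuery = rewriteQuery.apply(question);
        // 2. Retrieve documents with the rewritten query.
        List<String> documents = retrieve.apply(searchQuery);
        // 3 and 4. Augment the prompt with the documents and generate the answer.
        // Assumption: the original question goes into the prompt, not the rewritten query.
        String content = generate.apply(question, documents);
        return new RagAnswer(content, documents);
    }
}
```

Returning the documents next to the answer is what makes the RAG evaluations possible: the faithfulness and contextual relevance evaluators need both.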

A side note about Spring AI and RAG

Spring AI facilitates RAG through its advisors. The thing with an advisor is that it is always used. So each request to the agent reaches the advisor, which tries to obtain context from the retrieval component. In the current example, that means a question like “Where does whisky originate from?” is treated the same as “What is a good peated whisky?”. If the RAG component returns nothing, the agent often replies with a sentence like “I cannot answer that question,” which interferes with the first question, since the RAG system only contains information about the names and prices of whiskies.

There are a few options to consider. Either create multiple agents, each for a specific task, with a router agent that decides which agent is best for the question, or provide the RAG component as a tool and let the agent decide whether to use it.
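The router option can be sketched in a few lines of plain Java. This is a simplified illustration and not code from the repository: the QuestionRouter class is hypothetical, and the decision is a plain predicate here, where a real router agent would make an LLM call.

```java
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical router: sends a question to the RAG agent only when it needs the whisky catalogue. */
final class QuestionRouter {

    private final Predicate<String> needsCatalogue;
    private final Function<String, String> ragAgent;
    private final Function<String, String> plainAgent;

    QuestionRouter(Predicate<String> needsCatalogue,
                   Function<String, String> ragAgent,
                   Function<String, String> plainAgent) {
        this.needsCatalogue = needsCatalogue;
        this.ragAgent = ragAgent;
        this.plainAgent = plainAgent;
    }

    /** Routes the question to exactly one of the two agents and returns its answer. */
    String answer(String question) {
        return needsCatalogue.test(question)
                ? ragAgent.apply(question)
                : plainAgent.apply(question);
    }
}
```

With the ChatService from this post, the two agents could be wired as `q -> chatService.chatRag(q).content()` and `chatService::chat`. The general whisky question then never reaches the retriever, so an empty context can no longer push the agent into “I cannot answer that question”.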

The browser tool for the agent

As mentioned before, I use the agent-browser tool to interact with websites. There is a skill that explains how the tool works. Too bad that Spring AI skills is part of a community project that works with the 2 milestone releases from Spring AI. With a coding agent, I created a custom tool that wraps the CLI tool and exposes it to the agent.
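To give an impression of what wrapping a CLI involves, here is a minimal plain-Java sketch. It is not the tool from the repository: the CliBrowserTool class, its method names, and the executable name passed to the constructor are assumptions. It splits a command such as `fill @e1 "some text"` into arguments and runs it as a separate process.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical wrapper around a command-line browser tool; not the post's actual tool class. */
final class CliBrowserTool {

    // Either a double-quoted argument (quotes dropped) or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    private final String executable;

    CliBrowserTool(String executable) {
        this.executable = executable;
    }

    /** Builds the full argument list: the executable followed by the command's tokens. */
    List<String> buildCommand(String command) {
        List<String> args = new ArrayList<>();
        args.add(executable);
        Matcher matcher = TOKEN.matcher(command);
        while (matcher.find()) {
            args.add(matcher.group(1) != null ? matcher.group(1) : matcher.group(2));
        }
        return args;
    }

    /** Runs the command and returns stdout and stderr combined, so the agent also sees errors. */
    String execute(String command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(command))
                .redirectErrorStream(true)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        return exitCode == 0 ? output : "Command failed (exit " + exitCode + "): " + output;
    }
}
```

Passing an argument list to ProcessBuilder instead of a single shell string means the text the LLM produces is never interpreted by a shell. In Spring AI, a method like execute would typically be exposed to the agent with the @Tool annotation; check the repository for the real implementation.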

A thorough prompt explains to the agent how the tool works and what I expect it to do. It feels a bit too much, but it works, and it is fun.

public WhiskySearchResult chatTools(String message) {
    var loggingAdvisor = new MyLoggingAdvisor();
    return this.chatClient.prompt(
        """
                You are a whisky search assistant. The user provides the name of a whisky and you must
                find it on the website https://slijterijdehelm.nl using the agent-browser CLI tool.

                You have a tool available to execute agent-browser commands.

                ## agent-browser Quick Reference
                - `open <url>` — Navigate to a URL
                - `snapshot -i` — List interactive elements with refs (@e1, @e2, ...)
                - `snapshot -s "<css-selector>"` — Snapshot scoped to a CSS selector
                - `fill @<ref> "text"` — Clear field and type text
                - `click @<ref>` — Click an element
                - `press Enter` — Press a key
                - `wait --load networkidle` — Wait for page to finish loading
                - `close` — Close the browser

                IMPORTANT: After every click or navigation, refs are invalidated. Always re-snapshot
                to get fresh refs before interacting again.

                ## Steps to follow
                1. Run `open https://slijterijdehelm.nl`
                2. Run `snapshot -i` to find interactive elements
                3. Handle any popups/overlays that may appear before searching.
                   Look at the snapshot output and:
                   - If you see a cookie consent popup (buttons containing words like
                     "accepteren", "cookies", "toestaan", or "akkoord"), click the
                     accept button. Then re-snapshot.
                   - If you see an age verification overlay (links/buttons containing
                     "ouder dan 18" or "18 jaar" or similar), click to confirm.
                     Then re-snapshot.
                   - If neither popup is visible, just continue to the next step.
                   Do NOT get stuck looking for popups that aren't there.
                4. Find the search field (a searchbox, often labeled "Zoeken") and fill it:
                   `fill @<ref> "<whisky name>"`
                5. Submit: `press Enter`
                6. Wait: `wait --load networkidle`
                7. Re-snapshot: `snapshot -i` to see search results
                8. Click the most relevant product link matching the whisky name
                9. Wait: `wait --load networkidle`
                10. Extract the product info using a scoped snapshot:
                    `snapshot -s ".elementor-location-single"`
                    This returns the product name, price, description tabs, and details
                    without all the navigation noise.
                11. Close: `close`

                ## What to extract
                From the scoped snapshot, extract and return:
                - The full product name (from the heading)
                - The price
                - The category (from breadcrumb, e.g. Whisky / Malt)
                - The volume (if available in the description)
                - The alcohol percentage (from the description)
                - The FULL longer description — this is the most important field,
                  include ALL paragraphs from the "Beschrijving" tab including
                  nose/taste/finish notes

                If no results are found, clearly state that.
                Always close the browser when done.
                """
            )
            .user(message)
            .tools(browserTool)
            .advisors(loggingAdvisor,
                    MessageChatMemoryAdvisor.builder(MessageWindowChatMemory.builder().maxMessages(20).build()).build())
            .call()
            .entity(WhiskySearchResult.class);
}
Originally published on Medium