Agents and Ensemble Reasoning

The recent release of open source reasoning models such as QwQ (Qwen) and Phi-4 (Microsoft) has opened up new possibilities for Agents to reason via a methodology that can be called “Ensemble Reasoning.”

Ensemble Reasoning brings together an ensemble of different A.I. components that can directly participate in reasoning via a collective process.

The release of OpenAI’s o1 model kicked off this era: o1 was specifically trained to generate reasoning steps.  Prior to this, techniques such as “ReAct” prompted models to generate reasoning steps, but o1 “baked” this reasoning directly into the model.

Now, with the release of open source reasoning models akin to o1, the A.I. developer community can experiment with, develop, and deploy Agents based on Ensemble Reasoning, including components of their own design and choosing, creating a new category of A.I. software components: Ensemble Reasoners.

So what is different?

Today, Agents utilize tools, such as a web search or database query, to help accomplish tasks.  This occurs in a loop: the Agent uses an LLM to decide what to do next, such as selecting a tool; the Agent invokes the tool; and the Agent provides the tool’s output back to the model to decide the next step.  This cycle repeats as long as needed to accomplish the task.  At each step, the model starts reasoning from scratch.
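To make the overhead concrete, here is a minimal sketch of this conventional loop.  The llm callable and its decision dictionary format are illustrative assumptions, not any particular framework’s API:

def agent_loop(task, llm, tools):
    # Conventional Agentic loop: each iteration is a fresh LLM request.
    history = [f"Task: {task}"]
    while True:
        # The model re-reads the entire history and reasons from scratch.
        decision = llm("\n".join(history))  # e.g. {"tool": "search", "input": "..."}
        if "final_answer" in decision:
            return decision["final_answer"]
        # The Agent runs the chosen tool outside the model...
        result = tools[decision["tool"]](decision["input"])
        # ...then feeds the tool output back in for the next LLM request.
        history.append(f"{decision['tool']} -> {result}")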

With Ensemble Reasoning, the model can utilize the ensemble directly during the reasoning process, short-circuiting the need to restart reasoning after each step.

This not only compresses the Agent “loop,” greatly speeding up the process; open source reasoning models also open up the “black box” of the model, enabling the integration of highly optimized ensemble members that improve the reasoning process in a human-understandable way.

To a degree, Ensemble Reasoning is a generalization of the Mixture-of-Experts LLM algorithm (non-reasoning), which splits a large model into smaller “expert” modules so that only a portion of the model is active for a given request.  Reasoning models are specifically trained to produce reasoning steps as tokens (words), which can be sent to an ensemble “expert” to process.  These reasoning tokens can be read directly by humans to understand the reasoning process.  This is in contrast to Mixture-of-Experts, where expert modules are enabled via the internal parameters of the model and cannot be directly understood.

Reasoning steps are dynamically routed to ensemble members for processing, and the results are provided back to the reasoning model, which continues reasoning.

The key difference is that the model keeps inferencing while output from the ensemble members is fed into it to further the reasoning process.  By comparison, the current Agentic loop starts over for each iteration, which introduces a large amount of overhead and latency.
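By contrast, here is a sketch of the ensemble loop.  The streaming handle, its inject() method, and the members’ accepts()/process() methods are assumptions for illustration:

def ensemble_reasoning(stream, members):
    # One continuous generation: the reasoning model streams steps, and
    # ensemble output is injected without restarting inference.
    for step in stream:
        for member in members:
            if member.accepts(step):
                # Splice the member's output into the live token stream
                # so the model keeps reasoning with it in context.
                stream.inject(member.process(step))
    return stream.final_answer()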

So what are these Ensemble Reasoning members?

There is the reasoning model, which acts as the conductor of the ensemble and produces reasoning steps that can be processed by the ensemble members.

An ensemble member takes a reasoning step as input and produces output which is added into the reasoning, allowing the reasoning model to be aware of this information as it continues reasoning.
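As a sketch, this contract can be captured in a minimal interface.  The Protocol below is illustrative, not part of any existing framework:

from typing import Protocol

class EnsembleMember(Protocol):
    def accepts(self, reasoning_step: str) -> bool:
        """Return True if this member should handle the given step."""
        ...

    def process(self, reasoning_step: str) -> str:
        """Process the step; the returned text is spliced into the reasoning."""
        ...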

Examples of types of ensemble members include:

  • A separate LLM Model, trained for a specific task or usage scenario
  • Knowledge Graph Search
  • Web, Document, Database Search
  • Code Executor
  • Math Calculator, Constraint Solver, Formal Planner
  • Logical (Semantic) Reasoner, Rule Engine 
  • Machine Learning Prediction Model, Recommendation Model (Graph Neural Network)

Any “tool” currently used by an Agent could potentially become an ensemble member.  However, only tools that affect reasoning, and that operate efficiently enough to keep pace with reasoning, should be used this way.  Such tools should also not affect the state of the Agent: reasoning is thinking about what to do, not actually doing it (yet).  Reasoning may “change its mind” many times before deciding on an action.

As a counterexample, an API to send an SMS text message should not be an ensemble member, as it changes the Agent’s state (message sent vs. not sent) and cannot be retracted.

It’s helpful to distinguish ensemble member tools (“ensemble tools” or “internal tools”) from tools that the Agent uses (“Agent tools” or “external tools,” as in external to the model).
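One way to sketch this split in code (the field names are illustrative assumptions):

from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    mutates_state: bool                  # e.g. sending an SMS changes state
    fast_enough_for_reasoning: bool      # can it keep pace with token streaming?

    @property
    def is_ensemble_tool(self) -> bool:
        # Internal (ensemble) tools must be side-effect free and fast;
        # everything else remains an external (Agent) tool.
        return not self.mutates_state and self.fast_enough_for_reasoning

web_search = Tool("web_search", mutates_state=False, fast_enough_for_reasoning=True)
send_sms = Tool("send_sms", mutates_state=True, fast_enough_for_reasoning=True)
assert web_search.is_ensemble_tool
assert not send_sms.is_ensemble_tool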

So what are some examples of how Ensemble Reasoning works?

Let’s consider an agent, named Haley, that we are going to ask to do certain tasks.  These are simple illustrative examples; a real Agent can have complex multi-step workflows and act autonomously.  Here’s how Ensemble Reasoning can affect the reasoning process for these tasks.

Planning a trip:

>> Haley, give me directions to get to the MoMA Museum.

The reasoning ensemble can use real-time traffic and transportation information to plan a route directly during the reasoning process, avoiding the Agentic “loop” of making many LLM requests to check each routing option.  A constraint solver or planner could be leveraged for a complex route or one with multiple waypoints.
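As an illustrative sketch of the kind of inline work a planner member might do (brute force for clarity; the travel times are made-up stand-ins for real-time data):

from itertools import permutations

# Hypothetical real-time travel times, in minutes, between stops.
travel_minutes = {
    ("home", "coffee"): 10, ("home", "park"): 12,
    ("coffee", "park"): 8, ("park", "coffee"): 8,
    ("coffee", "MoMA"): 15, ("park", "MoMA"): 5,
}

def best_route(start, end, waypoints):
    # Try every waypoint ordering and keep the cheapest total time.
    def cost(route):
        return sum(travel_minutes[(a, b)] for a, b in zip(route, route[1:]))
    candidates = [(start, *order, end) for order in permutations(waypoints)]
    return min(candidates, key=cost)

print(best_route("home", "MoMA", ["coffee", "park"]))
# -> ('home', 'coffee', 'park', 'MoMA'), 23 minutes total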

Writing fiction:

>> Haley, give me some ideas of what should happen next in this Sci-Fi story I’m writing.

The reasoning ensemble can use the reasoning model to understand the plot and the motivations of the characters in the story, and use an expert LLM trained in creative fiction writing to generate the text of the ideas.

Recommending a Movie to watch:

>> Haley, what movie should I watch with my friends tonight?

The reasoning ensemble can take advantage of real-time movie schedules and a trained recommendation model (based on GNNs) to correlate recommendations with the available options.  Queries to a knowledge graph can provide the friends’ movie interests and locations to find a suitable theater for the group.

Shopping Recommendation:

>> Haley, what should I get my mom for her birthday?

The reasoning ensemble can query the knowledge graph to get information about my mother’s likes, use a recommendation model (based on GNNs) to get product recommendations, and check shipping options to confirm a gift will arrive in time.

Produce a financial report:

>> Haley, generate a financial report as a PDF based on projections for next quarter and email it to the finance team.

The reasoning ensemble can leverage document and database search to collect the requisite information, use prediction models to make financial projections, write and execute code to produce the sums needed for the report, and produce Agent tool calls for generating the report PDF and emailing it, which the Agent can then execute.  This is an example of combining “ensemble tools” with “Agent tools.”

Counting things:

>> How many R’s in Strawberry?

This is a classic problem for LLMs, partly due to how text is encoded into tokens when provided to the model, so LLMs often give the wrong answer of “2” to this seemingly trivial request.

Reasoning models specifically spell this out as “s-t-r-a-w-b-e-r-r-y” during reasoning and then count the letters.  But, even so, mistakes are made.

I personally like the variant of this:

>> How many vowels are in this exact sentence?

When the reasoning model is told that the ensemble includes code execution, and that it should use it for any request that can be solved with code, the QwQ model generates this code:

sentence = "how many vowels are in this exact sentence?"
vowels = "aeiouAEIOU"
count = 0
for char in sentence:
    if char in vowels:
        count += 1
print(count)

This code can be executed by the Code Executor ensemble member, which then gets the correct answer.  LLMs in general are much better at producing code than at doing anything directly quantitative, so ideally all such requests are routed to an ensemble member.
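A minimal sketch of such a Code Executor member follows.  It is illustrative only: a production version would need sandboxing, resource limits, and module allowlists.

import subprocess
import sys

class CodeExecutor:
    # Runs model-generated Python in a separate process and returns stdout.
    # WARNING: illustrative only; real deployments must sandbox this.
    def accepts(self, reasoning_step: str) -> bool:
        return reasoning_step.lstrip().startswith("```python")  # assumed step format

    def process(self, code: str, timeout_seconds: int = 5) -> str:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_seconds,
        )
        return result.stdout.strip() or result.stderr.strip()

executor = CodeExecutor()
code = 'print(sum(c in "aeiouAEIOU" for c in "how many vowels are in this exact sentence?"))'
print(executor.process(code))  # prints 13, the correct count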

See below for a sample reasoning trace of the QwQ model solving a logic puzzle.  This gives a sense of a reasoning model working through different options to find a solution.

What are the real benefits?

Ensemble Reasoning has some immediate benefits and opportunities including:

  • Dramatically speeding up the Agentic “loop” by pushing more processing directly into the model, removing overhead and latency.
  • As the reasoning model generates reasoning steps, it only chooses expensive operations when necessary, decreasing the overall cost.
  • Mix and match LLM models in a reasoning workflow.  Want to combine QwQ with Llama and Mistral? Sure! Want to use OpenAI o1 as a reasoning “tool” within QwQ? Sure!
  • Enterprise guardrails integrated directly into the reasoning process to approve or deny reasoning steps as they occur (see the sketch after this list).
  • Integrating sources of dynamic knowledge like Knowledge Graphs into the reasoning, exploring many more cases than would be possible with the Agentic loop.
  • Integrating prediction and recommendation models, such as Graph Neural Networks (GNNs), into reasoning for applications such as eCommerce personalization. 
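As a sketch of the guardrail idea mentioned above (the keyword policy is a toy stand-in for a real enterprise policy engine):

class GuardrailMember:
    # Inspects every reasoning step as it streams out and vetoes
    # steps that violate policy before any action is taken.
    def __init__(self, blocked_topics):
        self.blocked_topics = [t.lower() for t in blocked_topics]

    def accepts(self, reasoning_step: str) -> bool:
        return True  # inspect every step

    def process(self, reasoning_step: str) -> str:
        step = reasoning_step.lower()
        if any(topic in step for topic in self.blocked_topics):
            # The veto is injected back into the reasoning stream,
            # steering the model away from the blocked direction.
            return "GUARDRAIL: this line of reasoning is not permitted."
        return ""  # approved; reasoning continues untouched

guardrail = GuardrailMember(blocked_topics=["customer pii"])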

More broadly, with reasoning models focusing on reasoning, other aspects of the LLM, like knowledge retrieval, can be “outsourced” to the Ensemble, making for smaller reasoning models that are faster, cheaper to operate, and smarter with the Ensemble.

What needs to be implemented?

A Reasoning Ensemble first and foremost requires a reasoning model.  The Ensemble is directly integrated with the reasoning model, which means either the reasoning model is open source or the developer has full access to the model (by creating it or via a commercial license).  Current open source reasoning models include QwQ and Phi-4, with others on the way.

The reasoning model runs within server software such as vLLM or llama.cpp (both open source).
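One simple pattern that already works with vLLM’s offline API is to stop generation at a step delimiter, run the ensemble, and resume with the member output appended.  A hedged sketch, where the “\n\n” step delimiter and the route_step() router are assumptions rather than a vLLM or Vital.ai API:

from vllm import LLM, SamplingParams

def route_step(step):
    # Hypothetical router: dispatch the step to ensemble members and
    # return their output, or None if no member claims the step.
    return None

llm = LLM(model="Qwen/QwQ-32B-Preview")
params = SamplingParams(max_tokens=256, stop=["\n\n"])

context = "How many vowels are in this exact sentence? Think step by step.\n"
for _ in range(10):  # bound the number of reasoning steps
    step = llm.generate([context], params)[0].outputs[0].text
    context += step + "\n\n"
    member_output = route_step(step)
    if member_output:
        # Resume reasoning with the ensemble output in context.
        context += f"[ensemble] {member_output}\n"

vLLM’s prefix caching keeps the resumed requests cheap, approximating continuous inference; a deeper integration would inject tokens into the running stream itself.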

Vital.ai is developing a Reasoning Ensemble framework to run within vLLM.  This will leverage the Vital AI Agent Ecosystem to provide ensemble tools like KGraphService, which is a Knowledge Graph service leveraging Vector and Graph queries.

What is next?

At Vital.ai we are developing Ensemble Reasoning as a core capability, with implementations using the open source Vital AI Agent Ecosystem.

The current open source reasoning models are impressive but still experimental in nature, and we’re excited to use these in our Agent deployments as they mature.

We’re excited to work with clients interested in making Ensemble Reasoning part of their Agent strategy and implementations.  Please contact us (https://www.vital.ai) to discuss your current Agent initiatives and to learn about Ensemble Reasoning!

Fun thing to try

One thing I haven’t tried yet is directly streaming tokens between two models, with the second model critiquing the first.  One model critiquing another (or even self-critiquing) is a technique that has worked quite well to improve the final output of prior LLM models.  Incorporating this criticism dynamically during reasoning would be a very interesting approach and may yield excellent results.
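A rough sketch of the idea, reusing the hypothetical streaming handle from the ensemble loop sketch earlier (the critic is simply a second model callable; nothing below is an existing API):

def co_reasoning(stream, critic):
    # Speculative sketch: a second model critiques each reasoning step
    # as it streams out of the first, and any critique is folded back
    # into the live stream without restarting inference.
    for step in stream:
        critique = critic(step)  # e.g. an LLM prompted to act as a critic
        if critique:
            stream.inject(f"[critic] {critique}")
    return stream.final_answer()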

Notes

  1. https://huggingface.co/Qwen/QwQ-32B-Preview
  2. https://qwenlm.github.io/blog/qwq-32b-preview/
  3. https://huggingface.co/microsoft/phi-4
  4. https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090
  5. https://venturebeat.com/ai/microsoft-makes-powerful-phi-4-model-fully-open-source-on-hugging-face/
  6. https://github.com/vllm-project/vllm
  7. https://github.com/ggerganov/llama.cpp
  8. https://github.com/vital-ai
  9. https://www.vital.ai/agent-ecosystem.html
  10. ReAct
    https://arxiv.org/abs/2210.03629
  11. https://github.com/langchain-ai/react-agent
  12. Large Language Models Cannot Self-Correct Reasoning Yet
    https://arxiv.org/abs/2310.01798
  13. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
    https://arxiv.org/abs/2305.11738
  14. https://arbisoft.com/blogs/why-ll-ms-can-t-count-the-r-s-in-strawberry-and-what-it-teaches-us
  15. https://en.wikipedia.org/wiki/Graph_neural_network
  16. https://en.wikipedia.org/wiki/Autonomous_agent
  17. https://openai.com/o1/
  18. https://en.wikipedia.org/wiki/Knowledge_graph
  19. https://en.wikipedia.org/wiki/Semantic_reasoner

Sample QwQ Reasoning Trace

“Selena, Jennifer and Miley wear a blue dress, yellow dress, and green dress in an unknown order. It is known that:

1) If Selena wears blue, then Jennifer wears green.
2) If Selena wears yellow, then Miley wears green.
3) If Jennifer does not wear yellow, then Miley wears blue.

What is the color of the dress Selena is wearing?”


Let’s tackle this logic puzzle step by step. We have three girls: Selena, Jennifer, and Miley, each wearing a different colored dress: blue, yellow, and green. The problem gives us three conditions, and we need to figure out what color dress Selena is wearing based on these conditions.

First, let’s list out the conditions again for clarity:

1. If Selena wears blue, then Jennifer wears green.

2. If Selena wears yellow, then Miley wears green.

3. If Jennifer does not wear yellow, then Miley wears blue.

Our goal is to determine the color of Selena’s dress.

Since there are only three colors and three people, each wearing a different color, we can start by assuming each possible color for Selena and see which one fits all the conditions.

**Option 1: Selena wears blue.**

From condition 1: If Selena wears blue, then Jennifer wears green.

So, if Selena is blue, Jennifer is green.

Then, since there are only three colors and Selena has blue and Jennifer has green, Miley must have yellow.

Now, let’s check condition 2: If Selena wears yellow, then Miley wears green.

But in this case, Selena is wearing blue, not yellow, so this condition doesn’t directly apply.

Next, condition 3: If Jennifer does not wear yellow, then Miley wears blue.

In this scenario, Jennifer is wearing green, which is not yellow, so according to this condition, Miley should wear blue.

But wait, earlier we deduced that if Selena is blue and Jennifer is green, then Miley must be yellow.

This creates a contradiction because condition 3 says Miley should wear blue, but our earlier deduction says Miley is yellow.

Therefore, this option is invalid because it leads to a contradiction.

**Option 2: Selena wears yellow.**

From condition 2: If Selena wears yellow, then Miley wears green.

So, if Selena is yellow, Miley is green.

Then, since Selena has yellow and Miley has green, Jennifer must have blue.

Now, let’s check condition 1: If Selena wears blue, then Jennifer wears green.

But in this case, Selena is wearing yellow, so this condition doesn’t directly apply.

Next, condition 3: If Jennifer does not wear yellow, then Miley wears blue.

In this scenario, Jennifer is wearing blue, which is not yellow, so according to this condition, Miley should wear blue.

But earlier, from condition 2, we have Miley wearing green.

This is a contradiction.

Therefore, this option is also invalid.

**Option 3: Selena wears green.**

If Selena wears green, then we don’t have any direct conditions from the given statements.

Let’s see what we can deduce.

Since Selena is green, then Jennifer and Miley must be wearing blue and yellow, but we don’t know which one yet.

Now, look at condition 1: If Selena wears blue, then Jennifer wears green.

But Selena is green, not blue, so this condition doesn’t apply.

Condition 2: If Selena wears yellow, then Miley wears green.

Again, Selena is green, not yellow, so this doesn’t apply.

Condition 3: If Jennifer does not wear yellow, then Miley wears blue.

In this scenario, since Selena is green, Jennifer could be blue or yellow.

If Jennifer is yellow, then condition 3 doesn’t apply because Jennifer is wearing yellow.

If Jennifer is blue, then condition 3 says Miley wears blue, but that’s impossible because Jennifer is already wearing blue, and each person has a different color.

Therefore, if Selena is green and Jennifer is blue, it leads to Miley also wearing blue, which is a contradiction.

So, if Selena is green, Jennifer cannot be blue; she must be yellow.

Therefore, Jennifer is yellow, and Miley is blue.

Let’s check if this fits all conditions.

Condition 1: If Selena wears blue, then Jennifer wears green.

Selena is green, not blue, so doesn’t apply.

Condition 2: If Selena wears yellow, then Miley wears green.

Selena is green, not yellow, so doesn’t apply.

Condition 3: If Jennifer does not wear yellow, then Miley wears blue.

Jennifer is yellow, so this condition doesn’t apply because she is wearing yellow.

Therefore, there are no contradictions in this scenario.

So, Selena is wearing green.

Wait a minute, but the problem asks for the color of Selena’s dress, and according to this, it’s green.

But in the initial options, both Selena wearing blue and yellow led to contradictions, so the only remaining option is green.

However, let’s double-check to make sure there are no hidden contradictions.

If Selena is green, Jennifer is yellow, and Miley is blue, does this satisfy all conditions?

Condition 1: If Selena wears blue, then Jennifer wears green.

Selena is not wearing blue, so no issue.

Condition 2: If Selena wears yellow, then Miley wears green.

Selena is not wearing yellow, so no issue.

Condition 3: If Jennifer does not wear yellow, then Miley wears blue.

Jennifer is wearing yellow, so this condition doesn’t apply.

Everything seems to check out.

Therefore, Selena is wearing the green dress.

**Final Answer**

green