Training Step-Level Reasoning Verifiers
with Formal Verification Tools

Penn State University
[Figure: Overview of FoVer]

Overview of our approach, FoVer. We train PRMs on step-level error labels automatically annotated by formal verification tools on LLM responses to tasks compatible with formal verification. We observe that the resulting LLM-based PRMs improve verification across broad reasoning tasks.

Introduction

Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems.

In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proving, which provide automatic and accurate verification for symbolic tasks.

Our experiments highlight the following findings:

  1. We propose FoVer, an approach to train PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 and Isabelle. FoVer aims to address two core challenges in PRMs: automatic training dataset creation and generalization to diverse reasoning tasks.
  2. We synthesize a training dataset for PRMs consisting of accurate step-level error labels on LLM responses for formal logic and theorem proof tasks, without relying on human annotation.
  3. Although this data synthesis is feasible only for tasks compatible with formal verification, we empirically demonstrate that training on our dataset improves LLM-based PRMs across broad reasoning tasks, as measured by Best-of-K performance on 12 reasoning benchmarks.
  4. We show that PRMs trained with FoVer perform competitively with state-of-the-art PRMs built on the same LLMs across broad reasoning tasks. This is a noteworthy result because these PRMs are trained on labels annotated by humans or stronger models, which FoVer does not employ.
[Figure: Best-of-K performance of FoVer PRMs]

Best-of-K (K=5) performance of PRMs on 12 reasoning tasks. FoVer creates training datasets for PRMs on tasks where we can automatically annotate step-level error labels using formal verification tools. FoVer significantly improves verification performance on broad out-of-distribution reasoning tasks, compared to baseline PRMs based on the original LLMs.

Dataset

[Figure: Creation process of the FoVer dataset]

Creation process of the FoVer dataset:
(1) We first generate LLM reasoning in a format compatible with the formal verification tools Z3 and Isabelle.
(2) We use the formal verification tools to automatically annotate step-level error labels, without involving human annotation.

  • We synthesize a dataset that includes step-level error labels on formal logic and formal theorem proving tasks.
  • The FoVer dataset includes step-level error labels on responses from Llama 3.1 8B and Qwen 2.5 7B.
  • The step-level error labels are automatically annotated using the formal verification tools Z3 and Isabelle; an illustrative record is sketched below.
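
A minimal sketch of what one training instance looks like; the field names are illustrative assumptions, not the released schema:

```python
# Illustrative shape of a FoVer training example (field names are
# assumptions for illustration): a problem, model-generated reasoning
# steps, and one correct/incorrect label per step.
example = {
    "problem": "Premises: p; p -> q. Hypothesis: q.",
    "steps": [
        "Step 1: From p and p -> q, derive q.",
        "Step 2: The hypothesis q is therefore entailed.",
    ],
    "step_labels": [True, True],  # annotated automatically by Z3 / Isabelle
}
```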

Formal Logic

  • Base Task: We use the logical entailment task in first-order logic (FOL) reasoning, in which the goal is to determine whether a hypothesis is logically entailed by a given set of premises.
  • Base Dataset: We use responses from LLMs to the symbolic version of FLDx2, a dataset for multi-step FOL deduction.
  • Automatic Error Annotation: We use Z3 for the step-level error annotation. Each step is checked independently by supplying Z3 with the current premises and the step's conclusion to determine logical validity; a minimal sketch of this check follows the figure below.
[Figure: Step-level error annotation for the formal logic task by Z3]
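
A minimal sketch of this check with the z3-solver Python bindings, assuming each step has already been parsed into Z3 formulas (the parsing itself is not shown): a conclusion is entailed by the premises iff the premises together with the negated conclusion are unsatisfiable.

```python
# Step-level validity check with the z3-solver Python bindings.
# `conclusion` follows from `premises` iff premises AND NOT(conclusion)
# has no satisfying assignment.
from z3 import Bools, Implies, Not, Solver, unsat

def step_is_valid(premises, conclusion):
    """Return True if `conclusion` is logically entailed by `premises`."""
    solver = Solver()
    solver.add(*premises)        # assert the premises available at this step
    solver.add(Not(conclusion))  # ...together with the negated conclusion
    return solver.check() == unsat

# Toy example: from p and p -> q, concluding q is a valid step.
p, q = Bools("p q")
print(step_is_valid([p, Implies(p, q)], q))  # True  (correct step)
print(step_is_valid([p], q))                 # False (incorrect step)
```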

Formal Theorem Proof

  • Base Task: We use formal theorem proving to verify solutions to math word problems, a setting studied in prior work on formally verifying informal solutions from LLMs.
  • Base Dataset: We use informal responses from LLMs to GSM8K, GSM8K-based cases in MetaMathQA, and math word problems in Big-Math.
  • Informal to Formal Conversion: We convert informal problems from math reasoning datasets, as well as informal reasoning generated by LLMs, into formal proofs in Isabelle's format using Llama 3.3 70B with few-shot prompting.
  • Automatic Error Annotation: Isabelle is designed for solution-level verification and stops verification at the first error. To obtain step-level error labels, we implement code for step-level verification, based on an existing library for verifying Isabelle proofs; a conceptual sketch follows the figure below.
[Figure: Step-level error annotation for the formal theorem proof task by Isabelle]
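
A conceptual sketch of the annotation loop under stated assumptions: `run_isabelle_step` is a hypothetical wrapper around an Isabelle proof-checking library (the actual implementation builds on an existing library and differs in detail). Failed steps are patched with Isabelle's `sorry` placeholder so that later steps can still be checked.

```python
# Conceptual sketch of step-level error annotation for an Isabelle proof.
# `run_isabelle_step(context, step)` is a hypothetical helper that reports
# whether `step` verifies given the (possibly patched) preceding steps.

def annotate_steps(proof_steps, run_isabelle_step):
    """Label each proof step True (verified) or False (error)."""
    labels, context = [], []
    for step in proof_steps:
        ok = run_isabelle_step(context, step)
        labels.append(ok)
        # Keep verified steps as-is; replace failed ones with `sorry`
        # (an accepted placeholder) so later steps see a consistent state.
        context.append(step if ok else "sorry")
    return labels
```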

Statistics

[Table: Statistics of the FoVer dataset]

Experiments

We evaluate our PRMs trained on the FoVer dataset on broad reasoning benchmarks. We compare them with (a) baseline PRMs based on the same LLMs without additional training and (b) state-of-the-art PRMs from prior work built on the same LLMs. Specifically, we evaluate PRMs via Best-of-K on 12 reasoning benchmarks and via step-level verification on ProcessBench. The results are summarized as follows:
  • LLM-based PRMs trained with FoVer significantly improve verification capabilities on a broad range of reasoning tasks compared to baseline PRMs based on the same LLMs without additional training, demonstrating effective cross-task generalization of our approach.
  • PRMs trained with FoVer are competitive with or better than state-of-the-art PRMs built on the same LLMs. This result is remarkable because the state-of-the-art PRMs are trained on labels annotated by humans or stronger models, which FoVer does not use.
  • Our manual analysis verifies that FoVer genuinely improves step-level verification and rarely degrades it from baseline PRMs based on the same LLMs without additional training.

Best-of-K

[Figure: Best-of-K performance of FoVer PRMs]
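
For reference, a minimal sketch of Best-of-K reranking with a PRM. Aggregating step scores by their minimum is one common choice and an assumption here, as is the hypothetical `prm_step_scores` function:

```python
# Best-of-K reranking: generate K candidate solutions, score each step with
# the PRM, aggregate step scores into a response score, and keep the best.
# Min-aggregation (a chain is only as strong as its weakest step) is an
# illustrative assumption; other aggregations (mean, product) are possible.

def best_of_k(candidates, prm_step_scores):
    """Pick the candidate whose weakest step the PRM scores highest.

    `candidates` is a list of solutions, each a list of reasoning steps;
    `prm_step_scores` (hypothetical) maps a solution to per-step scores.
    """
    return max(candidates, key=lambda steps: min(prm_step_scores(steps)))
```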

ProcessBench

[Figure: Performance of FoVer PRMs on ProcessBench]
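
ProcessBench asks a verifier to locate the earliest erroneous step, or to report that no step is erroneous. A minimal sketch of turning a PRM's step scores into that prediction; the 0.5 decision threshold is an illustrative assumption:

```python
# Turn per-step PRM scores into a ProcessBench-style prediction: the index
# of the earliest step judged incorrect, or -1 if every step looks correct.
# The 0.5 threshold on step-correctness scores is an assumption.

def first_error_step(step_scores, threshold=0.5):
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1
```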

BibTeX