Introduction
Artificial intelligence agents are becoming increasingly capable of researching companies, analyzing spreadsheets, and answering complex questions. But when you work in finance or commercial real estate, the real question is not which AI sounds the smartest. It is which one produces the most reliable results. That is where benchmark testing becomes useful.

The latest benchmark data from Leni provides a structured way to compare leading AI agents across tasks that matter to analysts, including spreadsheet accuracy, deep research, and real-world reasoning. For professionals evaluating Leni vs Manus, Genspark, and OpenAI Deep Research, these results offer a clearer picture of which platform is best suited to financial work.
In commercial real estate, where underwriting models and investment memos depend on precise numbers, benchmark-tested performance can be more meaningful than marketing claims.
Comparing AI Agents for Financial Research
General-purpose AI agents such as Manus, Genspark, and OpenAI Deep Research are designed to handle a wide range of research tasks. Leni takes a different approach by focusing specifically on finance and commercial real estate workflows. According to Leni’s benchmark page, the platform scored strongly across four independent tests:
- GAIA: 77.0%
- DRACO: 71.6%
- SpreadsheetBench Verified: 91.25%
- BullshitBench v2: 98%
These results are relevant because they measure the same capabilities analysts rely on every day: reasoning, spreadsheet accuracy, research depth, and the ability to avoid hallucinations.
Leni’s Benchmark Results at a Glance
| Benchmark | What It Measures | Leni Result |
|---|---|---|
| GAIA | Real-world reasoning and research | 77.0% |
| DRACO | Deep research quality | 71.6% |
| SpreadsheetBench Verified | Spreadsheet task accuracy | 91.25% (365/400 tasks) |
| BullshitBench v2 | Ability to reject false premises | 98% |
The latest Leni benchmark results suggest that the platform performs well in the areas that matter most for financial analysis.
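The SpreadsheetBench Verified result is quoted both as a percentage (91.25%) and as a task count (365 of 400). The two describe the same figure. A minimal Python sketch of the conversion follows; the helper name `pass_rate` is hypothetical and used only for illustration.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the share of passed tasks as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# Figures quoted in this article for SpreadsheetBench Verified.
spreadsheet_rate = pass_rate(365, 400)
print(f"{spreadsheet_rate:.2f}%")  # prints 91.25%
```

The same conversion is useful whenever a benchmark reports results as task counts and you want to compare them with percentage scores.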
How Leni Compares to Manus, Genspark, and OpenAI Deep Research
When comparing Leni with Manus, Genspark, and OpenAI Deep Research, the biggest difference is specialization.
#1 – Leni
Designed specifically for finance and commercial real estate.
#2 – Manus AI
A general-purpose autonomous agent known for handling a broad range of tasks.
#3 – Genspark
An AI research agent focused on information gathering and synthesis.
#4 – OpenAI Deep Research
An advanced research tool that performs web-based analysis and report generation.
General-purpose tools are powerful, but finance professionals often require more than broad capabilities. They need dependable outputs in spreadsheets, models, and research reports.
This is where Leni stands out. Its benchmark results suggest stronger performance on spreadsheet-heavy and data-intensive workflows than general-purpose AI agents such as Manus, Genspark, and OpenAI Deep Research.
SpreadsheetBench: Testing Spreadsheet Skills for Financial Analysis
Spreadsheet work is at the core of financial analysis. Analysts routinely use spreadsheets to:
- Build cash flow models
- Forecast revenue
- Analyze debt schedules
- Run sensitivity analysis
Among the benchmarks, SpreadsheetBench is especially relevant because it tests the kind of spreadsheet tasks analysts perform in Excel every day. Leni’s result of 365 out of 400 verified tasks (91.25%) indicates strong spreadsheet capability and suggests the platform is well suited to financial modeling and data analysis workflows. General-purpose AI agents such as Manus, Genspark, and OpenAI Deep Research, by contrast, are built for a broader range of tasks and may not be as optimized for spreadsheet-intensive work. Leni also connects its spreadsheet capabilities to practical tasks, such as building underwriting models directly from operating statements.
For finance professionals, this benchmark is crucial because even a minor formula error can change investment decisions. If you are looking for the best AI for finance, spreadsheet accuracy should be one of your top evaluation criteria.
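To make the point about small errors concrete, here is a minimal sketch using direct capitalization, where value equals net operating income divided by the cap rate. All inputs are hypothetical and chosen only for illustration. They are not taken from Leni or from any benchmark, and the function name `direct_cap_value` is an assumption of this example.

```python
def direct_cap_value(noi: float, cap_rate: float) -> float:
    """Value a property by direct capitalization: NOI divided by cap rate."""
    if cap_rate <= 0:
        raise ValueError("cap_rate must be positive")
    return noi / cap_rate


# Hypothetical inputs, chosen only for illustration.
noi = 1_000_000.0          # annual net operating income
intended_cap_rate = 0.060  # the rate the analyst meant to use
mistyped_cap_rate = 0.065  # the rate a formula or cell-reference error produced

intended_value = direct_cap_value(noi, intended_cap_rate)
mistyped_value = direct_cap_value(noi, mistyped_cap_rate)
gap = intended_value - mistyped_value

print(f"Intended value: {intended_value:,.0f}")
print(f"Mistyped value: {mistyped_value:,.0f}")
print(f"Difference:     {gap:,.0f} ({gap / intended_value:.1%})")
```

With these assumed inputs, a half-point slip in the cap rate lowers the implied value by about 7.7 percent, which is the kind of swing that can change an investment decision.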
GAIA: Measuring Real-World Reasoning and Tool Use
GAIA is a benchmark that tests whether AI systems can solve complex real-world tasks using reasoning and external tools. Leni achieved a 77.0% score on GAIA according to its benchmark page. This score is relevant to finance because analysts frequently need to:
- Gather information from multiple sources
- Reconcile conflicting data
- Perform multi-step calculations
- Produce well-supported conclusions
When comparing OpenAI Deep Research vs Leni, the GAIA benchmark helps show which platform performs better on multi-step research and reasoning tasks.
DRACO: Evaluating Deep Research Quality
DRACO measures the quality of long-form research produced by AI systems. Leni scored 71.6% on DRACO. This benchmark reflects how well an AI can synthesize large amounts of information into coherent, useful analysis.
For investment professionals, that translates into better market studies, company research, and investment memos.
What These Benchmark Results Mean for Finance Professionals
The benchmark scores highlight four strengths that matter in day-to-day finance work:
- Accurate Spreadsheet Handling: Strong SpreadsheetBench performance suggests dependable spreadsheet task execution.
- Better Research Quality: High DRACO scores indicate stronger synthesis and analytical reporting.
- Reliable Reasoning: GAIA tests the multi-step reasoning analysts use every day.
- Lower Hallucination Risk: BullshitBench v2 measures whether the AI rejects unsupported assumptions.
These capabilities are especially valuable in commercial real estate, where errors in assumptions or formulas can materially affect valuation and underwriting outcomes.
How Leni Supports Real Financial Workflows
Benchmark comparisons are most meaningful when they reflect what analysts do each day. Leni positions itself as an AI platform that helps professionals move from raw files to decision-ready deliverables.
Users can upload spreadsheets, PDFs, and operational documents and ask Leni to build underwriting models. They can also use Leni to generate market studies and draft investment committee memos. The platform is built to carry out multi-step tasks and produce a finished output without constant back-and-forth prompting.
Leni also highlights verifiable outputs with source links and structured reasoning, which can reduce the time analysts spend checking AI-generated work. For finance professionals comparing Leni vs Manus, Genspark, and OpenAI Deep Research, this focus on completed, audit-friendly deliverables may be as important as benchmark scores themselves.
Which AI Agent Is Best for Finance?
The answer depends on your priorities.
Choose Manus, Genspark, or OpenAI Deep Research if you need a general-purpose AI research assistant.
Choose Leni if you need a specialized platform built for financial analysis and commercial real estate workflows. The benchmark data suggest that Leni handles spreadsheet modeling, data reconciliation, and research workflows more reliably than broader AI agents designed for general use cases.
For teams that value source-linked outputs, industry integrations, and decision-ready deliverables, Leni’s specialized approach is especially compelling.
Conclusion
Benchmark scores are only useful when they reflect real work. For professionals comparing Leni vs Manus, Genspark, and OpenAI Deep Research, Leni’s benchmark performance stands out because the tests cover the core requirements of financial analysis: spreadsheet accuracy, deep research quality, reasoning, and reliability.
The latest Leni benchmark results suggest that specialized AI tools may provide more dependable support than general-purpose agents when the task involves financial modeling, underwriting, or investment research.
If your work depends on accurate numbers and well-supported analysis, benchmark-tested specialization may matter more than broad AI versatility.