3 Comments
Josh:

For me, the most striking comment: "Tidalwave published a benchmark with Columbia University showing its mortgage-trained agent outperforming general-purpose LLMs on underwriting and compliance tasks - 84% overall accuracy vs. 71% for Claude 4.5, with the largest gap on yes-or-no compliance checks."

Tidalwave built its own specially trained agent, and it only outperformed a 2-generation-old Claude model (w/ undetermined thinking intensity) by a bit. How will 4.6 and 4.7 perform? What about 5.0? What about 5.0 w/ grounding skills and context?

The one lesson of the past two years is that verticalized AI gets clobbered on quality by rapid improvements in foundation models, and on cost by rapid improvements in open-weight/open-source models.

This is not the space I'd want to compete.

Anuj Adhiya:

Fair point on the benchmark; a grounded, context-prompted frontier model would have been a more honest comparison.

But the broader conclusion doesn't quite follow for me, because it treats model accuracy and workflow position as the same thing. They're not. The companies that got clobbered by foundation model improvements were mostly selling model accuracy as the product. A surface player whose loan officers run their whole day (pipeline, borrower communication, task management, conditions) through one interface can swap in a better foundation model underneath without the customer noticing. The workflow is the moat, not the model.
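To make that concrete: the argument is basically dependency inversion. Here's a minimal Python sketch, with every class and method name hypothetical (this is not Tidalwave's actual stack): the workflow layer codes against a narrow model interface, so upgrading the underlying model is an internal swap the loan officer never sees.

```python
from abc import ABC, abstractmethod


class UnderwritingModel(ABC):
    """Narrow interface the workflow layer depends on (illustrative only)."""

    @abstractmethod
    def compliance_check(self, loan_file: dict, rule: str) -> bool:
        """Yes-or-no compliance check on a loan file."""


class ClaudeBackedModel(UnderwritingModel):
    def compliance_check(self, loan_file: dict, rule: str) -> bool:
        # A real implementation would call the model provider's API here;
        # stubbed to a conservative default for this sketch.
        return False


class LoanOfficerWorkflow:
    """The surface: pipeline, borrower communication, tasks, conditions."""

    def __init__(self, model: UnderwritingModel):
        self.model = model  # injected, so a model upgrade is a one-line change

    def clear_condition(self, loan_file: dict, rule: str) -> str:
        passed = self.model.compliance_check(loan_file, rule)
        return "condition cleared" if passed else "needs review"


# Swapping in a newer frontier model touches only this line, not the surface:
workflow = LoanOfficerWorkflow(ClaudeBackedModel())
```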

But where I think you're dead on is the distribution threat: if ICE builds something comparable and bundles it into Encompass at renewal, Tidalwave gets squeezed regardless of how good its model is. That's a real risk, but it's a bundling problem, not a benchmark problem.

Josh:

Oh, strongly agree, and you're right: I was responding more to the quote, but you're 💯 on the value of the harness. Either way, we're relearning lessons about pricing, packaging, and business-model moats!