The Document Infrastructure That Makes LLM Analysis Actually Work

March 25, 2026
David Stolz

A-Team published a piece this week worth reading if you're building with LLMs on top of financial documents.

Written following the Eagle Alpha Alternative Data Conference, the article identifies two recurring failure modes: single-document extraction is imperfect but manageable, while cross-company comparison and longitudinal tracking are where models consistently break down. The other finding that stuck with me: in production systems, retrieval architecture matters more than which model you choose. How you index and structure your document corpus drives accuracy more than the LLM itself.

This is exactly the problem our Machine-Readable Filings (MRF) product is designed to solve.

Structure is the starting point. When documents come in as raw HTML or unstructured text, LLMs are essentially doing the “plumbing work” themselves and doing it inconsistently. We pre-structure filings into a standardized, section-by-section format before they ever touch a model. That means the LLM works on clean, organized inputs rather than reverse-engineering structure from a mess of tags and inconsistent formatting. We also standardize headings across companies, so when you want to look at how every company in a sector discusses intellectual property or climate risk, you're comparing the same thing across filers. That's what makes systematic cross-company analysis tractable, and it's the specific failure mode the article flags.
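To make the idea concrete, here's a minimal sketch of why standardized section keys make cross-company comparison tractable. The section names, filing dicts, and `compare_section` helper are all illustrative assumptions, not MRF's actual schema:

```python
# Hypothetical pre-structured filings: each document is already split into
# standardized sections with the same keys across filers. (Illustrative
# schema only, not the actual MRF format.)
filings = {
    "CompanyA": {"risk_factors": "Climate risk may affect...",
                 "intellectual_property": "We hold 120 patents..."},
    "CompanyB": {"risk_factors": "Supply chain disruption...",
                 "intellectual_property": "Our trademark portfolio..."},
}

def compare_section(filings, section):
    """Pull the same standardized section from every filer,
    so downstream prompts compare like with like."""
    return {company: doc.get(section) for company, doc in filings.items()}

ip_sections = compare_section(filings, "intellectual_property")
```

Because every filer shares the same section keys, the comparison step is a lookup rather than a per-document parsing problem.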

Mapping solves the other half of the comparison problem

A big part of the cross-company comparison problem isn't a model issue; it's an entity mapping issue. If your documents aren't linked to consistent identifiers, you can't reliably compare fields across companies or combine filing text with financials, estimates, or events. Our filings are linked directly to Xpressfeed, which means everything sits in the same ecosystem your quant team already works in.
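The mapping point can be sketched in a few lines: once each document carries a consistent entity identifier, joining filing text with another dataset is a plain key lookup. The identifier values and field names below are hypothetical, not the actual Xpressfeed schema:

```python
# Illustrative: documents tagged with a consistent entity identifier
# (hypothetical IDs and fields, not the Xpressfeed schema).
documents = [
    {"entity_id": "X001", "section": "risk_factors", "text": "..."},
    {"entity_id": "X002", "section": "risk_factors", "text": "..."},
]
financials = {"X001": {"revenue": 1200}, "X002": {"revenue": 850}}

# Join filing text with financials on the shared identifier.
joined = [
    {**doc, **financials[doc["entity_id"]]}
    for doc in documents
    if doc["entity_id"] in financials
]
```

Without that shared key, the same join requires fuzzy name matching across sources, which is exactly where cross-company pipelines silently drop or mislink entities.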

The collection problem nobody talks about

There's a third challenge the article doesn't address: collection. For EDGAR filings, collection is straightforward because there's a central repository. For global filings, there isn't one. Documents are filed across exchanges and regulatory bodies in different formats and languages, with no standardization on where to find them.

MRF covers 6+ million documents from 200+ countries in native languages, spanning 90,000+ entities. Getting there required building collection pipelines across dozens of sources, including international exchanges, company websites, and third-party newswires. It's not glamorous work, but it's the prerequisite for everything else.

The broader point

The article's conclusion is right: organizations that skip the plumbing build on unstable foundations. The model is the easy part. Your team should be spending their time on analysis, not on sourcing, cleaning, and structuring documents before the analysis can even start. That's what we handle.

©2022 Context Analytics | All rights reserved