h100envy on x

This paper completely changed how I think about when RAG should go fetch documents:

Draft the next sentence -> Check token confidence -> If low, use it as a query -> Retrieve documents -> Regenerate the sentence

Here is the 5-step blueprint:

Forward-looking: instead of retrieving based on past context, the model first drafts the upcoming sentence and uses that draft to decide what it is missing.

Confidence threshold: the draft is scanned for tokens whose probability falls below a threshold; if none do, the sentence is kept as is, with no extra retrieval.

Query from the future: a low-confidence sentence either has its weak tokens masked out or is rewritten into a question, and the result is sent to the retriever as the query.

Regeneration: the sentence is regenerated conditioned on the fetched documents, then the loop moves on to the next sentence.

Active loop: retrieval fires not on a fixed interval but at the exact moment in generation where the model actually goes shaky.
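
The five steps above can be sketched as one small loop. This is a toy illustration, not the paper's code: `active_retrieval_loop`, `toy_lm`, `toy_retrieve`, and the 0.5 threshold are all hypothetical names and values, the "model" is a scripted stub that returns (token, probability) pairs, and the query is simply the raw draft sentence.

```python
from typing import Callable, List, Tuple

# One drafted sentence: a list of (token, probability) pairs.
Draft = List[Tuple[str, float]]


def active_retrieval_loop(
    draft_sentence: Callable[[str, List[str]], Draft],
    retrieve: Callable[[str], List[str]],
    prompt: str,
    threshold: float = 0.5,   # assumed value, purely illustrative
    max_sentences: int = 10,
) -> Tuple[str, int]:
    """Generate sentence by sentence, retrieving only when a draft is shaky.

    Returns the generated text and the number of retrieval calls made.
    """
    output: List[str] = []
    retrievals = 0
    for _ in range(max_sentences):
        context = prompt + " " + " ".join(output)
        draft = draft_sentence(context, [])            # 1. draft without documents
        if not draft:
            break                                      # model has nothing more to say
        if min(p for _, p in draft) < threshold:       # 2. any low-confidence token?
            query = " ".join(tok for tok, _ in draft)  # 3. the draft becomes the query
            docs = retrieve(query)                     # 4. fetch documents
            retrievals += 1
            draft = draft_sentence(context, docs)      # 5. regenerate with documents
        output.append(" ".join(tok for tok, _ in draft))
    return " ".join(output), retrievals


def toy_lm(context: str, docs: List[str]) -> Draft:
    """Scripted stand-in for a language model (hypothetical)."""
    done = context.count(".")  # number of sentences already written
    if done == 0:
        words = ["The", "Eiffel", "Tower", "is", "in", "Paris."]
        return [(w, 0.9) for w in words]
    if done == 1:
        if docs:  # grounded regeneration: confident everywhere
            return [("It", 0.9), ("opened", 0.9), ("in", 0.9), ("1889.", 0.9)]
        # ungrounded draft: the model is unsure about the year
        return [("It", 0.9), ("opened", 0.9), ("in", 0.9), ("1900.", 0.2)]
    return []


def toy_retrieve(query: str) -> List[str]:
    """Scripted stand-in for a retriever (hypothetical)."""
    return ["The Eiffel Tower opened in 1889."]


if __name__ == "__main__":
    text, calls = active_retrieval_loop(
        toy_lm, toy_retrieve, "Tell me about the Eiffel Tower"
    )
    print(text)
    print("retrieval calls:", calls)
```

The first sentence is confident and passes straight through; only the second one triggers a retrieval call, which is the whole point of the scheme.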

Key insight: to know what to fetch, do not query with the past context; query with the draft of what the model is about to say next.
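
A minimal sketch of the masking variant of "query the draft", assuming the draft arrives as (token, probability) pairs. The helper name `masked_query` and the 0.4 cutoff are hypothetical. The idea is that a shaky token (say, a guessed year) is dropped so that a wrong guess does not steer the retriever.

```python
from typing import List, Tuple


def masked_query(draft: List[Tuple[str, float]], mask_below: float = 0.4) -> str:
    """Build a retrieval query from a draft, dropping low-probability tokens.

    mask_below is an assumed cutoff for illustration only.
    """
    return " ".join(tok for tok, p in draft if p >= mask_below)


if __name__ == "__main__":
    draft = [("It", 0.9), ("opened", 0.9), ("in", 0.9), ("1900.", 0.2)]
    print(masked_query(draft))  # the uncertain year is left out of the query
```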

One generic, training-free loop beats single-time and fixed-interval retrieval across all four long-form generation datasets the paper tests.

Read this, then check the article below.