This paper completely changed how I think about when RAG should go fetch documents:
Draft the next sentence -> Check token confidence -> If low, use it as a query -> Retrieve documents -> Regenerate the sentence
Here is the 5-step blueprint:
Forward-looking: instead of retrieving based on past context, the model first drafts the upcoming sentence and uses that draft to decide what it is missing.
Confidence threshold: each token probability in the draft is checked against a threshold; if every token clears it, the sentence is kept as is, with no extra retrieval.
Query from the future: a low-confidence sentence either has its weak tokens masked out or is rewritten into a question, and the result is sent to the retriever.
Regeneration: the sentence is regenerated with the fetched documents in context, then the loop moves to the next sentence.
Active loop: retrieval fires not on a fixed interval but at the exact moment in generation where the model actually goes shaky.
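The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `draft_sentence`, `retrieve`, the 0.4 threshold, and the toy generator and retriever are all hypothetical stand-ins for a real language model and search backend, and only the masking variant of query construction is shown.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, float]  # (token text, probability the model gave it)


def active_retrieval_loop(
    draft_sentence: Callable[[str, List[str]], List[Token]],  # hypothetical LM call
    retrieve: Callable[[str], List[str]],                     # hypothetical retriever
    prompt: str,
    threshold: float = 0.4,   # assumed value, tune per model
    max_sentences: int = 8,
) -> str:
    """Generate sentence by sentence, retrieving only when a draft looks shaky."""
    output: List[str] = []
    for _ in range(max_sentences):
        context = " ".join([prompt] + output)
        # Step 1: draft the upcoming sentence with no retrieved documents.
        draft = draft_sentence(context, [])
        if not draft:  # an empty draft means the generator is finished
            break
        # Step 2: keep the draft if every token clears the threshold.
        if all(p >= threshold for _, p in draft):
            sentence = draft
        else:
            # Step 3: mask the weak tokens; the confident remainder is the query.
            full_text = " ".join(t for t, _ in draft)
            query = " ".join(t for t, p in draft if p >= threshold) or full_text
            # Step 4: regenerate the sentence with the fetched documents.
            docs = retrieve(query)
            sentence = draft_sentence(context, docs) or draft
        # Step 5: commit the sentence and move on to the next one.
        output.append(" ".join(t for t, _ in sentence))
    return " ".join(output)


# Toy stand-ins so the sketch runs without a model or an index.
queries: List[str] = []


def toy_draft(context: str, docs: List[str]) -> List[Token]:
    if "Paris" not in context:
        return [("It", 0.9), ("stands", 0.8), ("in", 0.9), ("Paris.", 0.9)]
    if "opened" not in context:
        # Without documents the toy model guesses the year with low confidence.
        year, conf = ("1889.", 0.9) if docs else ("1887.", 0.2)
        return [("It", 0.9), ("opened", 0.8), ("in", 0.9), (year, conf)]
    return []


def toy_retrieve(query: str) -> List[str]:
    queries.append(query)
    return ["The Eiffel Tower opened in 1889."]
```

Calling `active_retrieval_loop(toy_draft, toy_retrieve, "Tell me about the Eiffel Tower.")` keeps the first sentence untouched and issues a single query, `"It opened in"`, for the second one. A real system would read token probabilities from the model's logits rather than hard-coding them.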
Key insight: to know what to fetch, do not query with the past context; query with the draft of what the model is about to say next.
The paper reports that one generic, training-free loop matches or beats single-time and fixed-interval retrieval across all four long-form datasets.
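To make the three retrieval schedules concrete, here is a small sketch of when each one would fire. It assumes sentences as the unit of generation and uses made-up confidence values. The function names are hypothetical, and it only illustrates the triggers, not the paper's baselines or results.

```python
from typing import List


def single_time(n_sentences: int) -> List[int]:
    """Retrieve once, before the first sentence."""
    return [0] if n_sentences > 0 else []


def fixed_interval(n_sentences: int, every: int) -> List[int]:
    """Retrieve before every `every`-th sentence, regardless of confidence."""
    return list(range(0, n_sentences, every))


def active(min_probs: List[float], threshold: float) -> List[int]:
    """Retrieve only where the draft's weakest token falls below the threshold."""
    return [i for i, p in enumerate(min_probs) if p < threshold]


# Made-up lowest token probability for each of six drafted sentences.
min_probs = [0.91, 0.85, 0.22, 0.31, 0.88, 0.93]

print(single_time(len(min_probs)))        # [0]
print(fixed_interval(len(min_probs), 2))  # [0, 2, 4]
print(active(min_probs, 0.4))             # [2, 3]
```

In this toy case the fixed-interval schedule retrieves at sentences 0 and 4, where the model is already confident, and skips sentence 3, where it is not. The active schedule fires only at the two shaky sentences.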
Read this, then check the article below.