
TREC 2025 DRAGUN Track Guidelines #

Detection, Retrieval, and Augmented Generation for Understanding News

Below are draft guidelines. Please join our Slack channel for updates and discussions.


Overview #

Welcome to the TREC 2025 DRAGUN Track, a continuation of the TREC 2024 Lateral Reading Track. The goal of this track is to support people’s trustworthiness assessments of online news. There are two tasks: (1) Question Generation and (2) Report Generation. The Question Generation task focuses on detecting the critical questions readers should consider during their trustworthiness assessment. These questions should guide readers’ investigation into aspects such as the bias or motivations of the source and narratives from other sources, much like LLMs generating search queries in a “Deep Research” mode. The Report Generation task involves creating a well-attributed, comprehensive report that provides readers with the background and context they need for a more informed trustworthiness evaluation; the generated report is expected to address the most important questions from Task 1. Both tasks run in parallel, with the same submission due date. This track differs from traditional fact-checking in that it aims to assist readers in making their own trustworthiness assessments from a neutral perspective, helping them form their own judgments rather than dictating conclusions.

Participation and Communication #

Please follow the TREC 2025 registration guidelines from their Call for Participation. After completing the registration process, you should receive an invitation to NIST’s workspace and be granted access to our primary communication channel, #trec-dragun-2025. This channel will serve as the main hub for discussions and important announcements.

Data #

  • Web Collection: This track will use MS MARCO V2.1 (Segmented) as the document collection, the same one used by the RAG track. It contains about 114 million segments from around 11 million web documents and can be downloaded from here; a loading sketch follows this list.

  • News Articles: We will release a set of selected target news articles (or “topics”) published by various media sources.
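
The sketch below illustrates how the segmented collection might be read once downloaded. It is a minimal example that assumes the distribution layout of earlier MS MARCO V2.1 releases (gzipped JSONL shards whose records include fields such as docid and segment); check the actual files for the exact directory layout and field names.

import gzip
import json
from pathlib import Path

# Assumed layout: a directory of gzipped JSONL shards; adjust the path and
# file pattern to match your download of MS MARCO V2.1 (Segmented).
COLLECTION_DIR = Path("msmarco_v2.1_doc_segmented")

def iter_segments(collection_dir: Path):
    """Yield one segment record (a dict) at a time from every shard."""
    for shard in sorted(collection_dir.glob("*.json.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

if __name__ == "__main__":
    for record in iter_segments(COLLECTION_DIR):
        # "docid" and "segment" are assumed field names from earlier releases.
        print(record.get("docid"), (record.get("segment") or "")[:80])
        break  # show only the first segment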

Tasks #

Assume there is a (general public) reader who is looking through online news. This track has the following two parallel tasks with the same submission due date; participants can choose to engage in either one or both. Unlike traditional fact-checking tasks, which often aim to establish an “absolute truth” and present a verdict, DRAGUN focuses on providing readers with multi-source context to foster their own informed judgments.

Task 1: Question Generation #

For each of the topics (i.e., target news articles), participants need to identify critical, investigative questions that a thoughtful reader should ask when evaluating a news article, such as questions that uncover source bias, motivation, or alternative narratives. The generated questions should be ranked from the most important to the least important. Think of this as simulating a “Deep Research” mode, where LLMs (or search agents) proactively generate search queries to guide the investigation of the article’s trustworthiness.

Those questions should meet the following requirements.

  • Avoid compound questions (e.g., “Who is X and when did Y happen?”); each question should focus on a single topic.
  • Each question should work as a self-contained search query, without reference to the article.

Participants should put all the questions for those articles into a single file, using the format below, and submit it to NIST via Evalbase; an example snippet follows the list.

  • It should be a tab-separated file.
  • It should be encoded in UTF-8.
  • Each line consists of the following tab-separated fields in this order: topic_id, run_id, rank, question.
    • topic_id: docid of the target news article.
    • run_id: Run ID that uniquely identifies the team and the method used to produce the run. Each run should have a different ID. IDs for runs submitted by the same team must all share a common prefix to identify the team across runs.
    • rank: Your rank for the question, starting from 1 (the most important).
    • question: Question in plaintext, with no tabs or newlines.
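
For illustration only, here is a minimal sketch that writes a Task 1 run file in this tab-separated format; the run ID, topic ID, and questions are placeholders.

# Placeholder values; real runs use actual topic docids, a team-prefixed
# run ID, and the ranked questions produced by your system.
run_id = "myteam-example-run"
questions_by_topic = {
    "msmarco_v2.1_doc_xx_xxxxx0": [
        "Who owns the outlet that published this article?",
        "What do other news sources report about this event?",
    ],
}

with open("task1_run.tsv", "w", encoding="utf-8") as f:
    for topic_id, questions in questions_by_topic.items():
        for rank, question in enumerate(questions, start=1):
            # One line per question: topic_id, run_id, rank, question,
            # with tabs and newlines stripped from the question text.
            clean = " ".join(question.split())
            f.write(f"{topic_id}\t{run_id}\t{rank}\t{clean}\n")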

Submissions can be either manual (involving human intervention anywhere in the question-generation process, e.g., hiring people to write questions or manually selecting questions from a candidate list produced by algorithms) or automatic (systems that produce questions without human input beyond the construction of the systems themselves). Teams submitting automatic runs should make a good-faith effort not to read or study the articles.

Task 2: Report Generation #

This is the core task of this track: generate a well-attributed report that provides additional background and context (e.g., the bias and motivation of the source, narratives from other sources) to help readers form their own trustworthiness assessments. These reports are expected to thoughtfully address the key questions from Task 1. It is a RAG-style task with a fixed query, “tell me what I should know about this article to better assess its trustworthiness,” but a varying context: the news article. Each sentence may cite at most three references, and each report should be at most 250 words.

As organizers, we will provide a starter kit for participants who are not interested in building the retrieval component of the report-generation pipeline.
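
To make the fixed-query, varying-context framing concrete, below is one possible (and deliberately non-prescriptive) way a system might assemble its generation prompt from the target article and retrieved evidence; the segment fields and the downstream LLM call are assumptions, not part of the track specification.

FIXED_QUERY = ("Tell me what I should know about this article "
               "to better assess its trustworthiness.")

def build_prompt(article_text: str, retrieved_segments: list[dict]) -> str:
    """Combine the fixed query, the target article, and retrieved evidence.

    `retrieved_segments` is assumed to be a list of dicts with "docid" and
    "segment" keys, produced by your own retrieval over MS MARCO V2.1.
    """
    evidence = "\n".join(
        f"[{s['docid']}] {s['segment']}" for s in retrieved_segments
    )
    return (
        f"{FIXED_QUERY}\n\n"
        f"Target article:\n{article_text}\n\n"
        f"Retrieved evidence (cite by docid):\n{evidence}\n\n"
        "Write a report of at most 250 words, citing at most three "
        "segment docids per sentence."
    )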

Similar to Task 1, runs may be either automatic or manual. We use a subset of the required fields from the submission format shared among the RAG-related tracks in TREC 2025. One submission (run) is a UTF-8-encoded JSONL file, with each line being a JSON object for one topic. Participants should follow this format and submit their runs to NIST via Evalbase.

{
	"metadata": {
		"team_id": "organizers",
		"run_id": "organizers-run-example", 
		"topic_id": "msmarco_v2.1_doc_xx_xxxxx0",
		"type": "automatic",
		"use_starter_kit": 0
	},
	"responses": [
		{
			"text": "This is the first sentence.",
			"citations": [
				"msmarco_v2.1_doc_xx_xxxxxx1#x_xxxxxx3",
				"msmarco_v2.1_doc_xx_xxxxxx2#x_xxxxxx4",
			]
		},
		{
			"text": "This is the second sentence.",
			"citations": []
		}
	]
}

Above is an example of one JSONL object (a line from the final JSONL run file) with the following fields; a small validation sketch follows the list:

  • metadata: Metadata for this run. All fields except topic_id should be the same across lines (topics).
    • team_id: Unique team ID.
    • run_id: Run ID that uniquely identifies the team and the method used to produce the run. Each run should have a different ID. IDs for runs submitted by the same team must all share a common prefix to identify the team across runs.
    • topic_id: docid of the target news article (topic).
    • type: Either “automatic” or “manual”.
    • use_starter_kit: 1 or 0 to indicate whether the run is based on the starter kit.
  • responses: Each of its objects is a generated sentence. Each object contains:
    • text: One generated sentence in plaintext. The 250-word limit applies to this field summed across all objects.
    • citations: A list of docids of the segments that this sentence is based on, with at most 3 citations; a sentence can have no citations. Their order does not matter.
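
As a sanity check, the following sketch assembles one run line and verifies the constraints above (at most 3 citations per sentence, 250 words across all text fields); all values shown are placeholders.

import json

def make_run_line(team_id, run_id, topic_id, sentences,
                  run_type="automatic", use_starter_kit=0):
    """Build one JSONL object; `sentences` is a list of (text, citations) pairs."""
    responses = [{"text": text, "citations": list(citations)}
                 for text, citations in sentences]

    # Check the track's constraints before writing the line.
    total_words = sum(len(r["text"].split()) for r in responses)
    assert total_words <= 250, f"report has {total_words} words, limit is 250"
    assert all(len(r["citations"]) <= 3 for r in responses), "at most 3 citations per sentence"

    return json.dumps({
        "metadata": {
            "team_id": team_id,
            "run_id": run_id,
            "topic_id": topic_id,
            "type": run_type,
            "use_starter_kit": use_starter_kit,
        },
        "responses": responses,
    })

# Placeholder example: append one line per topic to the run file.
line = make_run_line(
    "organizers", "organizers-run-example", "msmarco_v2.1_doc_xx_xxxxx0",
    [("This is the first sentence.", ["msmarco_v2.1_doc_xx_xxxxxx1#x_xxxxxx3"]),
     ("This is the second sentence.", [])],
)
with open("task2_run.jsonl", "a", encoding="utf-8") as f:
    f.write(line + "\n")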

Schedule #

  • News Article Release: Early May
  • Task 1 and Task 2 Submission Due: Mid-August
  • Result Release: Late September
  • Notebook Paper Due: Late October
  • TREC 2025 Conference: November 17-21 at NIST in Gaithersburg, MD, USA

Evaluation #

NIST assessors will examine the trustworthiness of the news articles, produce questions (similar to Task 1), and find expected answers by searching online, which will serve as rubrics for evaluating submissions for both tasks. Participant questions for Task 1 will be automatically graded based on the extent to which they align with assessors’ questions. Participant reports for Task 2 will be evaluated manually by NIST assessors, focusing on the number of assessor questions each report answers. Details are to be determined.

Q&A #

  1. Is there a limit on how many runs each group can submit?
    Participating groups will be allowed to submit as many runs as they like, but they need authorization from the track organizers before submitting more than 10 runs per task. Not all runs may be evaluated, so groups need to specify a preference ordering during submission.

Organizers #

Dake Zhang
University of Waterloo
Waterloo, Ontario, Canada
Website   LinkedIn   Twitter
Mark D. Smucker
University of Waterloo
Waterloo, Ontario, Canada
Website   LinkedIn
Charles L. A. Clarke
University of Waterloo
Waterloo, Ontario, Canada
Website   LinkedIn   Twitter