dataset construction issue

Your work is very innovative, but I still have some questions.

1. According to the project, gen_qa uses Wiki data, while cot_construct uses Google Search. Would the results be affected if I changed the data source in cot_construct to Wiki as well? Is Google Search required?

2. The paper gives two different figures for the amount of training data. One is *"During the SFT phase, we utilize the agent-based method to synthesize and filter 58K correct CoT trajectories for training data"*, and the other is *"We propose a hybrid approach for constructing CoT data that combines the two approaches, and construct a 10M CoT dataset (14B tokens) to validate the scalability of MASKSEARCH as a pre-training framework."* Could you explain how these two datasets relate to each other and what filtering rules were applied?

3. In the comparative experiments, were all the external databases for RAG based on Wiki, or were some based on Google Search?

dataset construction issue #2