Your work is very innovative, but I still have some questions.
- According to the project, gen_qa uses Wiki data, and cot_construct uses Google Search. Will there be any impact if I change the data source in cot_construct to Wiki as well? Is it necessary to use Google Search?
- The paper mentions two sizes regarding the amount of training data: one is "During the SFT phase, we utilize the agent-based method to synthesize and filter 58K correct CoT trajectories for training data", and the other is "We propose a hybrid approach for constructing CoT data that combines the two approaches, and construct a 10M CoT dataset (14B tokens) to validate the scalability of MASKSEARCH as a pre-training framework." Could you elaborate on the relationship between these two datasets and the filtering rules?
- In the comparative experiments, were all the external databases for RAG based on Wiki, or were some based on Google Search?