dataset construction issue #2

@Victoriaheiheihei

Your work is very innovative, but I still have some questions.

  1. According to the project, gen_qa uses Wiki data, and cot_construct uses Google Search. Will there be any impact if I change the data source in cot_construct to Wiki as well? Is it necessary to use Google Search?

  2. The paper gives two different training-data sizes. One is "During the SFT phase, we utilize the agent-based method to synthesize and filter 58K correct CoT trajectories for training data", and the other is "We propose a hybrid approach for constructing CoT data that combines the two approaches, and construct a 10M CoT dataset (14B tokens) to validate the scalability of MASKSEARCH as a pre-training framework." Could you elaborate on how these two datasets relate to each other and on the filtering rules that were applied?

  3. In the comparative experiments, were all the external databases for RAG based on Wiki, or were some based on Google Search?
