Skip to content

scout merge — Multi-Corpus Merge

Consolidates 2–20 cleaned source projects into one master project so downstream commands (scout model, scout analyze, scout visualize) accept the merged corpus unchanged.

Spec: specs/003-merge-compare-charts/contracts/cli-merge.md.

Synopsis

scout merge <MASTER_PROJECT> \
  --source PATH[:LABEL] \
  [--source PATH[:LABEL] ...] \
  [--label-column NAME] \
  [--extra-flags NAME:SOURCE_PATTERN[,NAME:SOURCE_PATTERN ...]] \
  [--dedup-strategy first|prefer-source=PATH] \
  [--overwrite] \
  [--quiet]

Options

Token Type Required Default Notes
MASTER_PROJECT str yes Output project name (becomes projects/<MASTER_PROJECT>/). Slug-validated: [a-zA-Z0-9][a-zA-Z0-9_-]{0,63}.
--source PATH[:LABEL] repeated yes Source project path. Optional :LABEL suffix attaches that value to the per-source label column. Order matters for --dedup-strategy first.
--label-column NAME str no Categorical column name applied per-source. When set, every --source MUST carry a :LABEL suffix.
--extra-flags NAME:SOURCE_PATTERN[,...] str no Add Boolean column NAME, true for rows whose source_project contains SOURCE_PATTERN.
--dedup-strategy str no first first (source-order precedence) or prefer-source=PATH (named precedence on id collision).
--overwrite flag no false Replace an existing master project.
--quiet flag no false Suppress tqdm progress bars.

Exit codes

Code Meaning
0 Success — merged project written.
1 Source path not found, or source missing required schema.
2 Source count out of [2, 20] (FR-001).
3 Output project exists without --overwrite.
4 --dedup-strategy prefer-source=PATH references a path absent from --source.
5 Optional column type mismatch across sources.
6 Internal error during read or write.

Output layout

projects/<MASTER_PROJECT>/
├── cleaned/cleaned_data.parquet     # union, deduped, with source_project + label/flag columns
├── preprocess_meta.json              # synthesized for downstream compatibility
├── merge_meta.json                   # provenance for the merge operation
└── objectives.toml                   # stub with [merge] section

source_project column is always added. <label_column> column is added when --label-column is set. One Boolean <flag_name> column per --extra-flags entry.

Examples

# Two-source merge, all rows keep their source name
scout merge master \
  --source projects/r1 \
  --source projects/r2

# Three sources with cohort labels and a pilot flag
scout merge master \
  --source projects/round1:control \
  --source projects/round2:treatment \
  --source projects/round3:treatment \
  --label-column cohort \
  --extra-flags is_pilot:round1

# Dedup with named precedence
scout merge master \
  --source projects/cs_v1 \
  --source projects/cs_v2 \
  --dedup-strategy prefer-source=projects/cs_v2 \
  --overwrite

See also