scout merge — Multi-Corpus Merge
Consolidates 2–20 cleaned source projects into one master project so downstream commands (scout model, scout analyze, scout visualize) accept the merged corpus unchanged.
Spec: specs/003-merge-compare-charts/contracts/cli-merge.md.
Synopsis
scout merge <MASTER_PROJECT> \
--source PATH[:LABEL] \
[--source PATH[:LABEL] ...] \
[--label-column NAME] \
[--extra-flags NAME:SOURCE_PATTERN[,NAME:SOURCE_PATTERN ...]] \
[--dedup-strategy first|prefer-source=PATH] \
[--overwrite] \
[--quiet]
Options
| Token | Type | Required | Default | Notes |
|---|---|---|---|---|
| MASTER_PROJECT | str | yes | — | Output project name (becomes projects/<MASTER_PROJECT>/). Slug-validated: [a-zA-Z0-9][a-zA-Z0-9_-]{0,63}. |
| --source PATH[:LABEL] | repeated | yes | — | Source project path. Optional :LABEL suffix attaches that value to the per-source label column. Order matters for --dedup-strategy first. |
| --label-column NAME | str | no | — | Categorical column name applied per-source. When set, every --source MUST carry a :LABEL suffix. |
| --extra-flags NAME:SOURCE_PATTERN[,...] | str | no | — | Add Boolean column NAME, true for rows whose source_project contains SOURCE_PATTERN. |
| --dedup-strategy | str | no | first | first (source-order precedence) or prefer-source=PATH (named precedence on id collision). |
| --overwrite | flag | no | false | Replace an existing master project. |
| --quiet | flag | no | false | Suppress tqdm progress bars. |
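
The slug rule and the PATH[:LABEL] syntax from the table can be sketched in Python as follows. This is a minimal illustration, not scout's actual parser: the helper names validate_slug, parse_source, and check_sources are hypothetical, and splitting on the last colon is an assumption, since the table does not say how paths that themselves contain a colon are handled.

```python
import re
from typing import NamedTuple, Optional

# Slug pattern copied from the MASTER_PROJECT row of the options table.
SLUG_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_-]{0,63}")


class Source(NamedTuple):
    path: str
    label: Optional[str]


def validate_slug(name: str) -> bool:
    """Return True if name satisfies the documented slug pattern."""
    return SLUG_RE.fullmatch(name) is not None


def parse_source(spec: str) -> Source:
    """Split a PATH[:LABEL] spec. Assumption: the label follows the LAST colon."""
    path, sep, label = spec.rpartition(":")
    if not sep:
        return Source(spec, None)  # no colon, so the spec is a bare path
    return Source(path, label)


def check_sources(specs: list[str], label_column: Optional[str]) -> list[Source]:
    """Apply the documented rules: 2 to 20 sources, labels required with --label-column."""
    sources = [parse_source(s) for s in specs]
    if not 2 <= len(sources) <= 20:
        raise ValueError("source count must be between 2 and 20")
    if label_column and any(not s.label for s in sources):
        raise ValueError("--label-column requires a :LABEL suffix on every --source")
    return sources
```

Under these assumptions, parse_source("projects/round1:control") yields the path projects/round1 with label control, while a bare projects/r1 yields no label.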
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success — merged project written. |
| 1 | Source path not found, or source missing required schema. |
| 2 | Source count out of [2, 20] (FR-001). |
| 3 | Output project exists without --overwrite. |
| 4 | --dedup-strategy prefer-source=PATH references a path absent from --source. |
| 5 | Optional column type mismatch across sources. |
| 6 | Internal error during read or write. |
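
A script that calls scout merge can translate these codes into readable messages. The sketch below assumes scout is on PATH; the helper names explain_exit and run_merge are hypothetical, and the message strings simply paraphrase the table above.

```python
import subprocess
import sys

# Meanings paraphrased from the exit-code table.
EXIT_MEANINGS = {
    0: "success: merged project written",
    1: "source path not found, or source missing required schema",
    2: "source count out of [2, 20]",
    3: "output project exists without --overwrite",
    4: "prefer-source path absent from --source",
    5: "optional column type mismatch across sources",
    6: "internal error during read or write",
}


def explain_exit(code: int) -> str:
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


def run_merge(args: list[str]) -> int:
    """Run `scout merge` with args and report any non-zero exit on stderr."""
    result = subprocess.run(["scout", "merge", *args])
    if result.returncode != 0:
        print(f"scout merge failed: {explain_exit(result.returncode)}", file=sys.stderr)
    return result.returncode
```

For instance, run_merge(["master", "--source", "projects/r1", "--source", "projects/r2"]) would return 3 if projects/master already exists and --overwrite was not passed.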
Output layout
projects/<MASTER_PROJECT>/
├── cleaned/cleaned_data.parquet # union, deduped, with source_project + label/flag columns
├── preprocess_meta.json # synthesized for downstream compatibility
├── merge_meta.json # provenance for the merge operation
└── objectives.toml # stub with [merge] section
A source_project column is always added. A <label_column> column is added when --label-column is set, and one Boolean <flag_name> column is added per --extra-flags entry.
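
The column and dedup semantics can be illustrated on plain Python dicts. This is a sketch under stated assumptions, not scout's implementation: rows are assumed to carry an id key, source_project is assumed to hold the source path as passed on the command line, and the function name merge_rows is hypothetical.

```python
from typing import Optional


def merge_rows(
    sources: list[tuple[str, Optional[str], list[dict]]],
    label_column: Optional[str] = None,
    extra_flags: Optional[dict[str, str]] = None,
    prefer_source: Optional[str] = None,
) -> list[dict]:
    """Union rows from (path, label, rows) triples, deduplicating on 'id'."""
    merged: dict[object, dict] = {}  # id -> winning row, in first-seen order
    for path, label, rows in sources:
        for row in rows:
            out = dict(row, source_project=path)  # always added
            if label_column:
                out[label_column] = label
            for flag, pattern in (extra_flags or {}).items():
                out[flag] = pattern in path  # substring match on source_project
            key = out["id"]
            current = merged.get(key)
            if current is None:
                merged[key] = out  # 'first': earliest source wins by default
            elif path == prefer_source and current["source_project"] != prefer_source:
                merged[key] = out  # 'prefer-source': the named source overrides
    return list(merged.values())
```

With the default strategy, a row whose id appears in both projects/round1 and projects/round2 keeps the round1 version. Passing prefer_source="projects/round2" keeps the round2 version instead.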
Examples
# Two-source merge, all rows keep their source name
scout merge master \
--source projects/r1 \
--source projects/r2
# Three sources with cohort labels and a pilot flag
scout merge master \
--source projects/round1:control \
--source projects/round2:treatment \
--source projects/round3:treatment \
--label-column cohort \
--extra-flags is_pilot:round1
# Dedup with named precedence
scout merge master \
--source projects/cs_v1 \
--source projects/cs_v2 \
--dedup-strategy prefer-source=projects/cs_v2 \
--overwrite
See also
- Comparison report — scout compare.
- User-defined charts — scout visualize --extra-charts.