Changelog
v0.12.0 — Multi-Corpus Merge, Subset Comparison, Configurable Visualization (2026-05-05)
New commands
- scout merge MASTER --source PATH[:LABEL] ... (US1, FR-001 through FR-009): consolidate 2–20 cleaned source projects into one master project that downstream commands accept unchanged. Supports per-source label columns, extra Boolean flag columns, and two dedup strategies (first, prefer-source=PATH). Atomically writes cleaned_data.parquet, a synthesized preprocess_meta.json, merge_meta.json, and an objectives.toml stub. See docs/merge.md.
- scout compare PROJECT --include PATTERN ... (US2, FR-010 through FR-018): partition any single project into lexically-defined subsets and emit compare_summary.json plus compare_report.md covering up to four metrics (sentiment, communities, topics, timeline). Supports regex patterns, exclusion-takes-precedence semantics, three --against forms (rest / filter:COL=VAL / include:PATTERN), pre-filter, group-by breakdown, and an optional --charts Plotly HTML output. See docs/compare.md.
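The two dedup strategies named above can be pictured roughly as follows. This is a minimal sketch, not the merger's actual code: the record shape, the doc_id and source keys, and the function name dedup are all assumptions made for illustration.

```python
from typing import Iterable


def dedup(records: Iterable[dict], strategy: str = "first") -> list[dict]:
    """Keep one record per doc_id (hypothetical key name).

    "first": keep the first occurrence in input order.
    "prefer-source=PATH": keep the occurrence from PATH when one exists,
    otherwise fall back to the first occurrence.
    """
    preferred = None
    if strategy.startswith("prefer-source="):
        preferred = strategy.split("=", 1)[1]
    elif strategy != "first":
        raise ValueError(f"unknown dedup strategy: {strategy!r}")

    kept: dict[str, dict] = {}
    for rec in records:
        key = rec["doc_id"]
        if key not in kept:
            kept[key] = rec
        elif (
            preferred is not None
            and rec["source"] == preferred
            and kept[key]["source"] != preferred
        ):
            # Replace an earlier duplicate with the preferred source's copy.
            kept[key] = rec
    return list(kept.values())
```

In this sketch a replaced record keeps the position of the first occurrence, so output order stays stable under either strategy.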
New scout visualize flag
- --extra-charts FILE.toml (US3, FR-019 through FR-024): append user-defined charts from a fixed library of eleven generic types (group_sentiment_bar, quadrant_scatter, diverging_phrase_bar, etc.) to the dashboard. Two-pass validation collects every error before any chart renders (FR-022). When the flag is omitted, dashboard output is semantically equivalent to v0.11.0 (SC-006). See docs/extra_charts.md.
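The two-pass idea (validate everything, then render) might look like the sketch below. It is illustrative only: the spec fields type and title, the function names, and the three-entry type set are assumptions, and the real loader's schema is documented in docs/extra_charts.md.

```python
# Only the three chart types named in this changelog; the real library has eleven.
ALLOWED_TYPES = {"group_sentiment_bar", "quadrant_scatter", "diverging_phrase_bar"}


def validate_charts(specs: list[dict]) -> list[str]:
    """Pass 1: return every problem found; an empty list means all specs are valid."""
    errors = []
    for i, spec in enumerate(specs):
        if "type" not in spec:
            errors.append(f"chart {i}: missing 'type'")
        elif spec["type"] not in ALLOWED_TYPES:
            errors.append(f"chart {i}: unknown type {spec['type']!r}")
        if not spec.get("title"):  # 'title' is a hypothetical required field
            errors.append(f"chart {i}: missing 'title'")
    return errors


def render_all(specs: list[dict], render):
    """Pass 2: render only when pass 1 found nothing."""
    errors = validate_charts(specs)
    if errors:
        # Report all problems at once instead of stopping at the first.
        raise ValueError("\n".join(errors))
    return [render(spec) for spec in specs]
```

Collecting errors first means a user fixing a TOML file sees every mistake in one run, and no partial dashboard is produced from a half-valid config.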
Architectural commitment (FR-025)
The codebase contains no research vocabulary. Brand-token lists, autonomy classification keywords (cs_baseline, consultative, delegated, autonomous), quadrant coordinates, community→industry mappings, and similar dataset-specific constants live exclusively in user-edited TOML configs (--extra-charts FILE.toml, --filter, --include). The eleven chart types in _charts_user.py are generic and swap research domains by editing the TOML alone.
Module independence (FR-027 / SC-007)
services/merger.py, services/comparator.py, and services/visualizer/_extra_charts_loader.py share no cross-imports. Each user story ships independently as its own release candidate: v0.12.0-rc1 (Merge), v0.12.0-rc2 (Compare), and v0.12.0-rc3 (Configurable Visualization).
Tests added
- tests/contract/test_merge_cli.py, tests/integration/test_merger.py, tests/unit/test_merge_dedup.py, tests/unit/test_merge_meta_schema.py
- tests/contract/test_compare_cli.py, tests/integration/test_comparator.py, tests/integration/test_compare_perf.py (@pytest.mark.slow), tests/unit/test_compare_metrics.py, tests/unit/test_compare_summary_schema.py
- tests/contract/test_extra_charts_cli.py, tests/integration/test_extra_charts.py, tests/integration/test_visualize_backward_compat.py, tests/unit/test_charts_user.py, tests/unit/test_extra_charts_loader.py
v0.8.0 — Performance & Analysis Yield (2026-02-28)
Phase C: Preprocessing Performance
- Single Polars collect: Refactored run_preprocessing() to use one .collect() call (was 5×), reducing memory pressure on large datasets
- NER batch size: GLiNER default batch size increased from 8 → 32; forwarded via ner_batch_size param and objectives.toml [preprocess]
- Sentiment batch size: _run_sentiment_scoring() now accepts configurable sentiment_batch_size and passes it to the HuggingFace pipeline
- GPU auto-detection: _detect_device() helper in preprocessor; GLiNER and sentiment pipeline auto-select CUDA and fall back to CPU on OOM
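A device-selection helper of this kind could be sketched as below. This is not the preprocessor's actual _detect_device(); the function names are hypothetical, and the fallback rule (retry once on CPU when a CUDA run reports out of memory) is one reading of the entry above.

```python
def detect_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def run_with_oom_fallback(fn, device: str):
    """Call fn(device); on a CUDA out-of-memory error, retry once on CPU."""
    try:
        return fn(device)
    except RuntimeError as exc:
        # PyTorch raises a RuntimeError subclass whose message mentions
        # "out of memory" when a CUDA allocation fails.
        if device == "cuda" and "out of memory" in str(exc).lower():
            return fn("cpu")
        raise
```

Any other RuntimeError is re-raised unchanged, so the fallback does not mask unrelated failures.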
Phase D: Topic Modeling Quality
- Outlier threshold: min_cluster_size and min_topic_size defaults reduced 100 → 50 for finer-grained topics
- Outlier reduction: --reduce-outliers option (default: distributions) applies BERTopic.reduce_outliers() post-clustering; topics_info.json written after reduction
- objectives.toml: New reduce_outliers field in [model] section
Phase E: Analysis Yield
- Broader context window: _build_topic_context() now includes up to 50 topics (was 20)
- More findings per agent: Task descriptions request ≥ 5 findings (was 3)
- Fuzzy citation matching: _fuzzy_match() in verifier uses 60% token overlap, tolerating minor LLM paraphrase differences while still flagging hallucinations
- objectives.toml: New max_topics and min_findings fields in [analyze] section
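The 60% token-overlap rule can be sketched as follows. This is an assumption-laden illustration, not the verifier's _fuzzy_match(): the tokenization (lowercased word characters) and the choice to measure overlap against the citation's tokens are guesses.

```python
import re


def fuzzy_match(citation: str, source: str, threshold: float = 0.6) -> bool:
    """True when at least `threshold` of the citation's distinct tokens appear in the source."""
    cite_tokens = set(re.findall(r"\w+", citation.lower()))
    if not cite_tokens:
        # An empty citation cannot be verified against anything.
        return False
    source_tokens = set(re.findall(r"\w+", source.lower()))
    overlap = len(cite_tokens & source_tokens) / len(cite_tokens)
    return overlap >= threshold
```

Under these assumptions a lightly paraphrased quote still passes, while a citation sharing few or no words with the source text is flagged.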
v0.7.0 — Data Quality & Topic Transparency (2026-02-28)
New Features
- Relevance Filtering: scout preprocess --relevance-filter scores documents by keyword/topic similarity
- Data Quality Report: data_quality_report.json with community relevance breakdown
- Topic Keywords Persistence: topics_info.json saves topic labels and keywords from BERTopic
- Topic Description Table: Dashboard chart showing topic keywords at a glance
- Semantic Chart Labels: Topic charts show keyword-based labels instead of "Topic 0", "Topic 1"
- Enriched LLM Context: Interpretation prompts include project keywords, topic keywords, and research purpose
Bug Fixes / Improvements
- Agent task descriptions include topic keywords and project context
- New [preprocess] section in objectives.toml for relevance settings
v0.6.0 — 2026-02-24
Sentiment Analysis (scout preprocess --sentiment)
- Added _run_sentiment_scoring() in preprocessor.py using HuggingFace Transformers (cardiffnlp/twitter-roberta-base-sentiment-latest).
- Adds sentiment_label (positive / neutral / negative) and sentiment_score (0–1) columns to cleaned_data.parquet.
- Batch processing with configurable batch size (default 32); truncation at 512 tokens.
- Graceful fallback: if transformers is not installed, preprocessing continues unchanged.
- New CLI flag: scout preprocess <project> --sentiment
- New pipeline flag: scout run <project> --sentiment
Sentiment & Perception Dashboard Section
- New _charts_sentiment.py module with 6 Plotly charts:
  - Sentiment Distribution — donut chart (positive / neutral / negative)
  - Sentiment by Topic — average sentiment score heatmap per topic
  - Controversy by Topic — standard deviation bar chart (opinion polarisation)
  - Sentiment Over Time — 3-line time series of label proportions
  - Community × Sentiment — heatmap of sentiment distribution per subreddit
  - Perception Map — scatter plot (avg sentiment vs. post volume)
- Section appears in the dashboard only when the sentiment_label column is present.
LLM Selection Bug Fix + Ensemble Mode
- Fixed --llm routing bug: previously llm_config = {"default": model_id} did not match any agent role, rendering --llm ineffective.
- New ensemble option (now the default): each agent uses its own default LLM (DEFAULT_LLM_MAP in personas.py).
- --llm claude routes all five agents to claude-sonnet-4-6.
- --llm gemini routes all five agents to gemini/gemini-3.1-pro-preview.
- Default changed from claude → ensemble in both CLI options and objectives.toml.
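The routing fix amounts to keying the config by agent role rather than by a single "default" entry. The sketch below illustrates that idea only: the role names, the two-entry EXAMPLE_LLM_MAP, and build_llm_config are hypothetical stand-ins, and the real per-role defaults live in DEFAULT_LLM_MAP in personas.py.

```python
# Placeholder roles and defaults for illustration; the project has five agents.
EXAMPLE_LLM_MAP = {
    "role_a": "claude-sonnet-4-6",
    "role_b": "gemini/gemini-3.1-pro-preview",
}

# Model ids taken from the changelog entries above.
PROVIDER_MODELS = {
    "claude": "claude-sonnet-4-6",
    "gemini": "gemini/gemini-3.1-pro-preview",
}


def build_llm_config(choice: str = "ensemble") -> dict[str, str]:
    """Return a role -> model id mapping for the given --llm choice."""
    if choice == "ensemble":
        # Each agent keeps its own default model.
        return dict(EXAMPLE_LLM_MAP)
    model_id = PROVIDER_MODELS[choice]
    # Key by role, not by "default", so every agent's lookup finds an entry.
    return {role: model_id for role in EXAMPLE_LLM_MAP}
```

With a mapping shaped like this, a lookup by role always succeeds, which is what the old {"default": model_id} form failed to guarantee.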
Report Language Selection (--report-language)
- New --report-language / -r option on scout analyze and scout run. Choices: english (default) | korean.
- When korean is selected, a Korean language requirement block is prepended to the agent task description, directing all agents to write findings in Korean.
- Structural keys (FINDING:, DETAIL:, CONFIDENCE:, CATEGORY:, CITATION:) remain in English for reliable parsing.
- Persisted in objectives.toml as report_language = "english" under [analyze].
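Keeping the structural keys in English lets a parser stay language-agnostic about the values. A minimal sketch of such a parser follows; the one-key-per-line layout and the name parse_finding are assumptions, not the project's actual output format.

```python
import re

KEYS = ("FINDING", "DETAIL", "CONFIDENCE", "CATEGORY", "CITATION")
# Matches lines such as "FINDING: some text", whatever language the text is in.
LINE = re.compile(rf"^({'|'.join(KEYS)}):\s*(.*)$")


def parse_finding(block: str) -> dict[str, str]:
    """Collect KEY: value pairs from a block, ignoring lines without a known key."""
    fields: dict[str, str] = {}
    for line in block.splitlines():
        match = LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields
```

Because only the keys are matched, the same parser handles English and Korean findings without change.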
LLM Interpretations + PNG Export + Visualization Report
- scout visualize gains four new options:
  - --interpret / --no-interpret — generate LLM-written section interpretations
  - --report-language / -rl — language for interpretations
  - --llm — LLM for interpretations (claude | gemini, default claude)
  - --export-png — export all charts as high-resolution PNG (requires kaleido)
- _interpretations.py: calls LiteLLM once per dashboard section, saves to interpretations.json. Graceful skip on API error or missing litellm.
- _png_exporter.py: exports charts to visualizations/charts/{section}/{chart}.png at 1200×800 px, scale=3 (~300 DPI). Graceful skip when kaleido is absent.
- visualization_report.md: structured Markdown combining interpretations (blockquotes) and PNG image references — suitable for academic papers and reports.
- Dashboard HTML now renders a highlighted interpretation box (.interpretation-box) above each section's chart grid when --interpret is used.
- kaleido>=0.2.1 added to [viz] optional dependencies.
Dependency and Version
- Version bumped to 0.6.0.
- kaleido>=0.2.1 added to pyproject.toml [project.optional-dependencies.viz].
v0.5.0 — 2026-02-20
- Added scout visualize command generating a 30-chart standalone HTML dashboard.
- New [viz] optional dependency group: plotly>=5.18.0, wordcloud>=1.9.0.
- 7 visualizer modules under services/visualizer/.
v0.4.0
- Reddit direct collector (bypasses Apify for simple URL scraping).
- NER chunking improvements.
- CrewAI LiteLLM routing fix.
v0.3.0
- Per-project objectives.toml configuration with CLI > file > default merge.
- --llm option on scout analyze.
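The CLI > file > default precedence can be sketched as a layered dictionary merge. The function name and the convention that an unset CLI flag arrives as None are assumptions for illustration, not the project's actual implementation.

```python
def resolve_options(cli: dict, file: dict, defaults: dict) -> dict:
    """Merge option sources with CLI > file > default precedence.

    A CLI value of None is treated as "flag not given" and does not override.
    """
    merged = dict(defaults)  # lowest priority
    merged.update(file)      # objectives.toml values override defaults
    merged.update({k: v for k, v in cli.items() if v is not None})  # CLI wins
    return merged
```

For example, with defaults {"llm": "claude"}, a file setting {"llm": "gemini"}, and no --llm flag on the command line, the resolved value is "gemini".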
v0.2.0
- Configurable Apify actor (--actor).
- LLM-based keyword generation from project topic.
v0.1.x
- Initial release: collect → preprocess → model → analyze pipeline.
- GitHub Actions CI + GitHub Pages documentation.