Changelog

v0.12.0 — Multi-Corpus Merge, Subset Comparison, Configurable Visualization (2026-05-05)

New commands

scout merge MASTER --source PATH[:LABEL] ... (US1, FR-001 through FR-009): consolidate 2–20 cleaned source projects into one master project that downstream commands accept unchanged. Supports per-source label columns, extra Boolean flag columns, and two dedup strategies (first, prefer-source=PATH). Atomic write of cleaned_data.parquet, synthesized preprocess_meta.json, merge_meta.json, and objectives.toml stub. See docs/merge.md.
scout compare PROJECT --include PATTERN ... (US2, FR-010 through FR-018): partition any single project into lexically-defined subsets and emit compare_summary.json plus compare_report.md covering up to four metrics (sentiment, communities, topics, timeline). Supports regex patterns, exclusion-takes-precedence semantics, three --against forms (rest / filter:COL=VAL / include:PATTERN), pre-filter, group-by breakdown, and an optional --charts Plotly HTML output. See docs/compare.md.
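
The exclusion-takes-precedence rule used by scout compare can be pictured with a short sketch. This is an illustration only, not the project's implementation: the function name partition_rows, the exclude_patterns parameter, and the case-insensitive matching are assumptions. Only the regex patterns, the precedence rule, and the "rest" comparison group come from the entry above.

```python
import re

def partition_rows(texts, include_patterns, exclude_patterns=()):
    """Split texts into (subset, rest) using regex patterns.

    A text joins the subset when any include pattern matches and no
    exclude pattern matches, so exclusion wins when both match.
    Case-insensitive matching is an assumption of this sketch.
    """
    includes = [re.compile(p, re.IGNORECASE) for p in include_patterns]
    excludes = [re.compile(p, re.IGNORECASE) for p in exclude_patterns]
    subset, rest = [], []
    for text in texts:
        included = any(rx.search(text) for rx in includes)
        excluded = any(rx.search(text) for rx in excludes)
        (subset if included and not excluded else rest).append(text)
    return subset, rest

posts = [
    "The new agent mode is great",
    "Agent mode broke my workflow (sponsored)",
    "Unrelated post about keyboards",
]
subset, rest = partition_rows(posts, [r"\bagent\b"], [r"sponsored"])
```

Here the second post matches both an include and an exclude pattern, so it lands in the rest group rather than the subset.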

New `scout visualize` flag

--extra-charts FILE.toml (US3, FR-019 through FR-024): append user-defined charts from a fixed library of eleven generic types (group_sentiment_bar, quadrant_scatter, diverging_phrase_bar, etc.) to the dashboard. Two-pass validation collects every error before any chart renders (FR-022). When the flag is omitted, dashboard output is semantically equivalent to v0.11.0 (SC-006). See docs/extra_charts.md.
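
One way to read "collects every error before any chart renders" is sketched below. The split into a structural pass and a data pass, the spec keys (type, columns), and the function names are assumptions for illustration. The three chart type names are the ones listed above; the remaining eight are omitted.

```python
# Subset of the chart library named in the changelog entry.
KNOWN_TYPES = {"group_sentiment_bar", "quadrant_scatter", "diverging_phrase_bar"}

def validate_charts(specs, available_columns):
    """Return every problem found; an empty list means the specs are valid."""
    errors = []
    # Pass 1: structural checks on each spec in isolation.
    for i, spec in enumerate(specs):
        if "type" not in spec:
            errors.append(f"chart[{i}]: missing 'type'")
        elif spec["type"] not in KNOWN_TYPES:
            errors.append(f"chart[{i}]: unknown type {spec['type']!r}")
    # Pass 2: checks against the columns the charts would read.
    for i, spec in enumerate(specs):
        for col in spec.get("columns", []):
            if col not in available_columns:
                errors.append(f"chart[{i}]: column {col!r} not in dataset")
    return errors

def render_all(specs, available_columns, render):
    """Render charts only after the whole file has passed validation."""
    errors = validate_charts(specs, available_columns)
    if errors:
        # Report all problems at once instead of stopping at the first.
        raise ValueError("\n".join(errors))
    return [render(spec) for spec in specs]
```

The benefit of this shape is that a user with several mistakes in one TOML file sees them all in a single run, and no partial dashboard is written.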

Architectural commitment (FR-025)

The codebase contains no research vocabulary. Brand-token lists, autonomy classification keywords (cs_baseline, consultative, delegated, autonomous), quadrant coordinates, community→industry mappings, and similar dataset-specific constants live exclusively in user-supplied inputs: the --extra-charts FILE.toml config and the --filter and --include flags. The eleven chart types in _charts_user.py are generic, so switching research domains requires editing only the TOML.

Module independence (FR-027 / SC-007)

services/merger.py, services/comparator.py, and services/visualizer/_extra_charts_loader.py share no cross-imports, so each user story can ship independently as its own release candidate: v0.12.0-rc1 (Merge), v0.12.0-rc2 (Compare), and v0.12.0-rc3 (Configurable Visualization).

Tests added

tests/contract/test_merge_cli.py, tests/integration/test_merger.py, tests/unit/test_merge_dedup.py, tests/unit/test_merge_meta_schema.py
tests/contract/test_compare_cli.py, tests/integration/test_comparator.py, tests/integration/test_compare_perf.py (@pytest.mark.slow), tests/unit/test_compare_metrics.py, tests/unit/test_compare_summary_schema.py
tests/contract/test_extra_charts_cli.py, tests/integration/test_extra_charts.py, tests/integration/test_visualize_backward_compat.py, tests/unit/test_charts_user.py, tests/unit/test_extra_charts_loader.py

v0.8.0 — Performance & Analysis Yield (2026-02-28)

Phase C: Preprocessing Performance

Single Polars collect: Refactored run_preprocessing() to use one .collect() call (was 5×), reducing memory pressure on large datasets
NER batch size: GLiNER default batch size increased from 8 to 32; configurable via the ner_batch_size param and objectives.toml [preprocess]
Sentiment batch size: _run_sentiment_scoring() now accepts a configurable sentiment_batch_size and passes it to the HuggingFace pipeline
GPU auto-detection: _detect_device() helper in preprocessor; GLiNER and the sentiment pipeline auto-select CUDA when available and fall back to CPU on out-of-memory errors
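
The device selection and fallback described above might look roughly like the following. This is a sketch under assumptions, not the body of _detect_device(): the names detect_device and run_with_cpu_fallback are hypothetical, and the single retry on CPU is one plausible reading of the fallback behaviour.

```python
def detect_device():
    """Return "cuda" when PyTorch reports a usable GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU path to select.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

def run_with_cpu_fallback(run, device):
    """Call run(device); retry once on CPU if the GPU runs out of memory."""
    try:
        return run(device)
    except RuntimeError as exc:
        # PyTorch's CUDA OOM error is a RuntimeError whose message
        # contains "out of memory".
        if device != "cpu" and "out of memory" in str(exc).lower():
            return run("cpu")
        raise
```

A caller would pass a function that builds and runs the model on the given device, for example run_with_cpu_fallback(score_batch, detect_device()), where score_batch is whatever inference routine is being protected.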

Phase D: Topic Modeling Quality

Outlier threshold: min_cluster_size and min_topic_size defaults reduced 100 → 50 for finer-grained topics
Outlier reduction: --reduce-outliers option (default: distributions) applies BERTopic.reduce_outliers() post-clustering; topics_info.json written after reduction
objectives.toml: New reduce_outliers field in [model] section
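
For readers unfamiliar with the BERTopic calls involved, the post-clustering step can be sketched as below. The wrapper name apply_outlier_reduction and the "none" sentinel are assumptions. reduce_outliers() and update_topics() are BERTopic methods, and refreshing the topic representations after reassignment is the likely reason topics_info.json is written afterwards.

```python
def apply_outlier_reduction(topic_model, docs, topics, strategy="distributions"):
    """Reassign outlier documents (topic -1), then refresh topic keywords.

    topic_model is expected to be a fitted BERTopic instance; any object
    with the same two methods works for this sketch.
    """
    if not strategy or strategy == "none":
        # Hypothetical opt-out value: leave the clustering untouched.
        return topics
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)
    # Recompute representations so keywords reflect the reassigned documents.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics
```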

Phase E: Analysis Yield

Broader context window: _build_topic_context() now includes up to 50 topics (was 20)
More findings per agent: Task descriptions request ≥ 5 findings (was 3)
Fuzzy citation matching: _fuzzy_match() in verifier uses 60% token overlap, tolerating minor LLM paraphrase differences while still flagging hallucinations
objectives.toml: New max_topics and min_findings fields in [analyze] section
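
A 60% token-overlap check of the kind described for the verifier can be sketched as follows. How _fuzzy_match() tokenizes and which side the ratio is computed against are not stated here, so this sketch assumes lowercase word tokens and measures the share of citation tokens found in a source document.

```python
import re

def token_overlap(citation, source):
    """Fraction of the citation's distinct tokens that appear in the source."""
    cite_tokens = set(re.findall(r"\w+", citation.lower()))
    source_tokens = set(re.findall(r"\w+", source.lower()))
    if not cite_tokens:
        return 0.0
    return len(cite_tokens & source_tokens) / len(cite_tokens)

def fuzzy_match(citation, sources, threshold=0.6):
    """True when any source covers at least `threshold` of the citation."""
    return any(token_overlap(citation, s) >= threshold for s in sources)

sources = ["The battery life on this laptop is honestly terrible after the update"]
paraphrased = fuzzy_match("battery life on this laptop is terrible", sources)
invented = fuzzy_match("the screen cracked within a week", sources)
```

The paraphrased quote passes because all of its tokens occur in the source, while the invented quote shares only one token and is flagged.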

v0.7.0 — Data Quality & Topic Transparency (2026-02-28)

New Features

Relevance Filtering: scout preprocess --relevance-filter scores documents by keyword/topic similarity
Data Quality Report: data_quality_report.json with community relevance breakdown
Topic Keywords Persistence: topics_info.json saves topic labels and keywords from BERTopic
Topic Description Table: Dashboard chart showing topic keywords at a glance
Semantic Chart Labels: Topic charts show keyword-based labels instead of "Topic 0", "Topic 1"
Enriched LLM Context: Interpretation prompts include project keywords, topic keywords, and research purpose

Bug Fixes / Improvements

Agent task descriptions include topic keywords and project context
New [preprocess] section in objectives.toml for relevance settings

v0.6.0 — 2026-02-24

Sentiment Analysis (`scout preprocess --sentiment`)

Added _run_sentiment_scoring() in preprocessor.py using HuggingFace Transformers (cardiffnlp/twitter-roberta-base-sentiment-latest).
Adds sentiment_label (positive / neutral / negative) and sentiment_score (0–1) columns to cleaned_data.parquet.
Batch processing with configurable batch size (default 32); truncation at 512 tokens.
Graceful fallback: if transformers is not installed, preprocessing continues unchanged.
New CLI flag: scout preprocess <project> --sentiment
New pipeline flag: scout run <project> --sentiment
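
The batching, truncation, and graceful-fallback behaviour listed above can be sketched with the HuggingFace pipeline API. This is not the body of _run_sentiment_scoring(): the helper names load_classifier and score_texts are hypothetical, and passing the classifier in as an argument is a choice made here so the scoring step can be exercised without downloading a model.

```python
def load_classifier(model="cardiffnlp/twitter-roberta-base-sentiment-latest"):
    """Return a HuggingFace sentiment pipeline, or None if transformers is missing."""
    try:
        from transformers import pipeline
    except ImportError:
        return None
    return pipeline("sentiment-analysis", model=model)

def score_texts(texts, classifier, batch_size=32):
    """Return (labels, scores) lists aligned with texts.

    Inputs longer than 512 tokens are truncated by the tokenizer.
    """
    results = classifier(
        list(texts),
        batch_size=batch_size,
        truncation=True,
        max_length=512,
    )
    labels = [r["label"] for r in results]
    scores = [r["score"] for r in results]
    return labels, scores
```

A caller would check the loader's result first: when load_classifier() returns None, the sentiment columns are simply not added and preprocessing continues.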

Sentiment & Perception Dashboard Section

New _charts_sentiment.py module with 6 Plotly charts:
Sentiment Distribution — donut chart (positive / neutral / negative)
Sentiment by Topic — average sentiment score heatmap per topic
Controversy by Topic — standard deviation bar chart (opinion polarisation)
Sentiment Over Time — 3-line time series of label proportions
Community × Sentiment — heatmap of sentiment distribution per subreddit
Perception Map — scatter plot (avg sentiment vs. post volume)
Section appears in the dashboard only when sentiment_label column is present.

LLM Selection Bug Fix + Ensemble Mode

Fixed --llm routing bug: previously llm_config = {"default": model_id} did not match any agent role, rendering --llm ineffective.
New ensemble option (now the default): each agent uses its own default LLM (DEFAULT_LLM_MAP in personas.py).
--llm claude routes all five agents to claude-sonnet-4-6.
--llm gemini routes all five agents to gemini/gemini-3.1-pro-preview.
Default changed from claude → ensemble in both CLI options and objectives.toml.
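
The routing fix is easier to see in code. In the sketch below the role names are placeholders (the real ones live in DEFAULT_LLM_MAP in personas.py and cover five agents), and resolve_llm_config is a hypothetical name. The model IDs are the ones given above.

```python
# Placeholder roles and assignments; the actual map has five agent roles.
DEFAULT_LLM_MAP = {
    "role_a": "claude-sonnet-4-6",
    "role_b": "gemini/gemini-3.1-pro-preview",
}

SINGLE_MODEL = {
    "claude": "claude-sonnet-4-6",
    "gemini": "gemini/gemini-3.1-pro-preview",
}

def resolve_llm_config(choice):
    """Map an --llm choice to a {role: model_id} dict."""
    if choice == "ensemble":
        # Each agent keeps its own default model.
        return dict(DEFAULT_LLM_MAP)
    model_id = SINGLE_MODEL[choice]
    # The old code returned {"default": model_id}; no role is named
    # "default", so the flag had no effect. Keying by role fixes that.
    return {role: model_id for role in DEFAULT_LLM_MAP}
```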

Report Language Selection (`--report-language`)

New --report-language / -rl option on scout analyze and scout run. Choices: english (default) | korean.
When korean is selected, a Korean language requirement block is prepended to the agent task description, directing all agents to write findings in Korean.
Structural keys (FINDING:, DETAIL:, CONFIDENCE:, CATEGORY:, CITATION:) remain in English for reliable parsing.
Persisted in objectives.toml as report_language = "english" under [analyze].
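
Keeping the structural keys in English means a parser can stay language-agnostic. The sketch below assumes one KEY: value pair per line and a new record at each FINDING: line; the actual output format and parser may differ, and the sample values are invented for illustration.

```python
import re

KEYS = ("FINDING", "DETAIL", "CONFIDENCE", "CATEGORY", "CITATION")
LINE_RE = re.compile(r"^(%s):\s*(.*)$" % "|".join(KEYS))

def parse_findings(text):
    """Collect findings as dicts keyed by the lowercase structural key."""
    findings, current = [], None
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        key, value = match.group(1), match.group(2)
        if key == "FINDING":
            # Each FINDING line starts a new record.
            current = {}
            findings.append(current)
        if current is not None:
            current[key.lower()] = value
    return findings

# Korean values, English keys ("Many complaints about battery life").
sample = """FINDING: 배터리 수명에 대한 불만이 많다
DETAIL: 여러 게시물에서 반복적으로 언급됨
CONFIDENCE: high
CATEGORY: product
CITATION: post_123"""
findings = parse_findings(sample)
```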

LLM Interpretations + PNG Export + Visualization Report

scout visualize gains four new options:
--interpret / --no-interpret — generate LLM-written section interpretations
--report-language / -rl — language for interpretations
--llm — LLM for interpretations (claude | gemini, default claude)
--export-png — export all charts as high-resolution PNG (requires kaleido)
_interpretations.py: calls LiteLLM once per dashboard section, saves to interpretations.json. Graceful skip on API error or missing litellm.
_png_exporter.py: exports charts to visualizations/charts/{section}/{chart}.png at 1200×800px scale=3 (~300 DPI). Graceful skip when kaleido is absent.
visualization_report.md: structured Markdown combining interpretations (blockquotes) and PNG image references — suitable for academic papers and reports.
Dashboard HTML now renders a highlighted interpretation box (.interpretation-box) above each section's chart grid when --interpret is used.
kaleido>=0.2.1 added to [viz] optional dependencies.
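
The PNG export path and settings above map onto Plotly's Figure.write_image() roughly as follows. This is a sketch, not the contents of _png_exporter.py: the function name export_png is hypothetical, and the exception types caught for a missing kaleido install are an assumption that may vary across Plotly versions.

```python
from pathlib import Path

def export_png(fig, out_root, section, chart):
    """Write one chart as PNG; return its path, or None if export is unavailable.

    fig is expected to be a plotly.graph_objects.Figure.
    """
    path = Path(out_root) / "visualizations" / "charts" / section / f"{chart}.png"
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # 1200x800 at scale=3 yields a 3600x2400 pixel image.
        fig.write_image(str(path), width=1200, height=800, scale=3)
    except (ImportError, ValueError):
        # Assumed failure modes when the kaleido engine is not installed.
        return None
    return path
```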

Dependency and Version

Version bumped to 0.6.0.
kaleido>=0.2.1 added to pyproject.toml [project.optional-dependencies.viz].

v0.5.0 — 2026-02-20

Added scout visualize command generating a 30-chart standalone HTML dashboard.
New [viz] optional dependency group: plotly>=5.18.0, wordcloud>=1.9.0.
7 visualizer modules under services/visualizer/.

v0.4.0

Reddit direct collector (bypasses Apify for simple URL scraping).
NER chunking improvements.
CrewAI LiteLLM routing fix.

v0.3.0

Per-project objectives.toml configuration with CLI > file > default merge.
--llm option on scout analyze.
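
The CLI > file > default precedence can be expressed as a layered dictionary merge. The sketch below is illustrative: resolve_options and the keys in DEFAULTS are hypothetical, and treating None as "not provided" is an assumption about how unset CLI options are represented.

```python
# Hypothetical built-in defaults for illustration.
DEFAULTS = {"llm": "claude", "verbose": False}

def resolve_options(cli_args, file_config, defaults=DEFAULTS):
    """Merge settings so that CLI beats objectives.toml, which beats defaults."""
    resolved = dict(defaults)
    # Later updates win; None values are skipped so they never mask a
    # value from a lower-priority layer.
    resolved.update({k: v for k, v in file_config.items() if v is not None})
    resolved.update({k: v for k, v in cli_args.items() if v is not None})
    return resolved

from_file = resolve_options({"llm": None}, {"llm": "gemini"})
from_cli = resolve_options({"llm": "claude"}, {"llm": "gemini"})
```

In the first call the flag was not passed, so the file value is used; in the second the explicit CLI value overrides the file.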

v0.2.0

Configurable Apify actor (--actor).
LLM-based keyword generation from project topic.

v0.1.x

Initial release: collect → preprocess → model → analyze pipeline.
GitHub Actions CI + GitHub Pages documentation.