Changelog

v0.12.0 — Multi-Corpus Merge, Subset Comparison, Configurable Visualization (2026-05-05)

New commands

  • scout merge MASTER --source PATH[:LABEL] ... (US1, FR-001 through FR-009): consolidate 2–20 cleaned source projects into one master project that downstream commands accept unchanged. Supports per-source label columns, extra Boolean flag columns, and two dedup strategies (first, prefer-source=PATH). Atomic write of cleaned_data.parquet, synthesized preprocess_meta.json, merge_meta.json, and objectives.toml stub. See docs/merge.md.
  • scout compare PROJECT --include PATTERN ... (US2, FR-010 through FR-018): partition any single project into lexically-defined subsets and emit compare_summary.json plus compare_report.md covering up to four metrics (sentiment, communities, topics, timeline). Supports regex patterns, exclusion-takes-precedence semantics, three --against forms (rest / filter:COL=VAL / include:PATTERN), pre-filter, group-by breakdown, and an optional --charts Plotly HTML output. See docs/compare.md.
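
The exclusion-takes-precedence rule for scout compare can be illustrated with a small sketch. This is not the comparator's actual code: the function name partition, the exclude argument, and the case-insensitive matching are assumptions made for the example. A document joins the subset only if it matches at least one include pattern and no exclude pattern; everything else falls into the remainder, which is what an --against rest comparison would use.

```python
import re


def partition(texts, include, exclude=()):
    """Split texts into (subset, rest); an exclude match always wins."""
    # Case-insensitive matching is an assumption of this sketch.
    inc = [re.compile(p, re.IGNORECASE) for p in include]
    exc = [re.compile(p, re.IGNORECASE) for p in exclude]
    subset, rest = [], []
    for text in texts:
        included = any(r.search(text) for r in inc)
        excluded = any(r.search(text) for r in exc)
        # Exclusion takes precedence over inclusion.
        if included and not excluded:
            subset.append(text)
        else:
            rest.append(text)
    return subset, rest


docs = [
    "The robot vacuum works well",
    "Robot vacuum review (sponsored)",
    "My dishwasher broke again",
]
subset, rest = partition(
    docs, include=[r"robot\s+vacuum"], exclude=[r"sponsored"]
)
```

Here the second document matches both patterns and therefore lands in rest, not in subset.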

New scout visualize flag

  • --extra-charts FILE.toml (US3, FR-019 through FR-024): append user-defined charts from a fixed library of eleven generic types (group_sentiment_bar, quadrant_scatter, diverging_phrase_bar, etc.) to the dashboard. Two-pass validation collects every error before any chart renders (FR-022). When the flag is omitted, dashboard output is semantically equivalent to v0.11.0 (SC-006). See docs/extra_charts.md.
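
The idea of collecting every validation error before rendering anything can be sketched as follows. The changelog does not say what the two passes check, so this example assumes pass 1 validates each chart spec on its own and pass 2 checks referenced columns against the data. The spec keys (type, title, columns) and the function names are hypothetical; only the three chart type names come from the release notes.

```python
# Three of the eleven chart types named in the release notes.
KNOWN_TYPES = {"group_sentiment_bar", "quadrant_scatter", "diverging_phrase_bar"}


def validate_charts(specs, available_columns):
    """Return a list of all problems found; an empty list means valid."""
    errors = []
    # Pass 1 (assumed): structural checks on each spec in isolation.
    for i, spec in enumerate(specs):
        if spec.get("type") not in KNOWN_TYPES:
            errors.append(f"chart {i}: unknown type {spec.get('type')!r}")
        if not spec.get("title"):
            errors.append(f"chart {i}: missing title")
    # Pass 2 (assumed): check referenced columns against the dataset.
    for i, spec in enumerate(specs):
        for col in spec.get("columns", []):
            if col not in available_columns:
                errors.append(f"chart {i}: unknown column {col!r}")
    return errors


def render_all(specs, available_columns):
    """Render nothing unless every spec passes validation."""
    errors = validate_charts(specs, available_columns)
    if errors:
        raise ValueError("\n".join(errors))
    return [f"rendered {spec['type']}" for spec in specs]


specs = [
    {"type": "quadrant_scatter", "title": "Q",
     "columns": ["sentiment_score", "missing_col"]},
    {"type": "pie", "title": ""},
]
errors = validate_charts(specs, {"sentiment_score"})
```

Reporting all three problems at once spares the user a fix-and-rerun cycle per error.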

Architectural commitment (FR-025)

The codebase contains no research vocabulary. Brand-token lists, autonomy classification keywords (cs_baseline, consultative, delegated, autonomous), quadrant coordinates, community→industry mappings, and similar dataset-specific constants live exclusively in user-supplied inputs: the TOML file passed to --extra-charts and the values given to --filter and --include. The eleven chart types in _charts_user.py are generic, so switching research domains requires editing only the TOML.

Module independence (FR-027 / SC-007)

services/merger.py, services/comparator.py, and services/visualizer/_extra_charts_loader.py share no cross-imports, so each user story can ship on its own: release candidates v0.12.0-rc1 (Merge), v0.12.0-rc2 (Compare), and v0.12.0-rc3 (Configurable Visualization) are independently shippable.
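
A no-cross-imports guarantee of this kind can be checked mechanically. The sketch below is one possible approach and not the project's actual test: it parses module source with the standard ast module and reports any sibling module that appears in an import statement. The helper names and the inline sample sources are hypothetical, and the sample deliberately contains one violation to show what a failure looks like.

```python
import ast


def imported_modules(source):
    """Return dotted names of everything imported by the given source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            base = node.module or ""  # empty for "from . import x"
            if base:
                names.add(base)
            for alias in node.names:
                names.add(f"{base}.{alias.name}".strip("."))
    return names


def cross_imports(sources):
    """Map each module to the sibling modules it imports, if any."""
    found = {}
    for name, source in sources.items():
        siblings = set(sources) - {name}
        imports = imported_modules(source)
        hits = {s for s in siblings if any(s in m.split(".") for m in imports)}
        if hits:
            found[name] = hits
    return found


# Hypothetical module sources; "comparator" violates the rule on purpose.
sources = {
    "merger": "import json\nfrom pathlib import Path\n",
    "comparator": "import re\nfrom services import merger\n",
    "_extra_charts_loader": "import tomllib\n",
}
violations = cross_imports(sources)
```

In a real test the sources would be read from the three files on disk and the assertion would be that the result is empty.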

Tests added

  • tests/contract/test_merge_cli.py, tests/integration/test_merger.py, tests/unit/test_merge_dedup.py, tests/unit/test_merge_meta_schema.py
  • tests/contract/test_compare_cli.py, tests/integration/test_comparator.py, tests/integration/test_compare_perf.py (@pytest.mark.slow), tests/unit/test_compare_metrics.py, tests/unit/test_compare_summary_schema.py
  • tests/contract/test_extra_charts_cli.py, tests/integration/test_extra_charts.py, tests/integration/test_visualize_backward_compat.py, tests/unit/test_charts_user.py, tests/unit/test_extra_charts_loader.py

v0.8.0 — Performance & Analysis Yield (2026-02-28)

Phase C: Preprocessing Performance

  • Single Polars collect: Refactored run_preprocessing() to use one .collect() call (was 5×), reducing memory pressure on large datasets
  • NER batch size: GLiNER default batch size increased from 8 → 32; forwarded via ner_batch_size param and objectives.toml [preprocess]
  • Sentiment batch size: _run_sentiment_scoring() now accepts a configurable sentiment_batch_size and passes it to the HuggingFace pipeline
  • GPU auto-detection: _detect_device() helper in preprocessor; GLiNER and the sentiment pipeline auto-select CUDA and fall back to CPU on an out-of-memory error
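
The GPU auto-detection item can be read as "use CUDA when available, and retry on CPU if the GPU runs out of memory". The following is a minimal sketch of that pattern, not the preprocessor's actual _detect_device(). The names detect_device, run_with_cpu_fallback, and the infer callback are hypothetical, and detecting the condition by looking for "out of memory" in the error message is an assumption of this example.

```python
def detect_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # No PyTorch installed: nothing to accelerate.
    return "cuda" if torch.cuda.is_available() else "cpu"


def run_with_cpu_fallback(infer, batch, device=None):
    """Call infer(batch, device); retry once on CPU after a GPU OOM error."""
    device = device or detect_device()
    try:
        return infer(batch, device)
    except RuntimeError as exc:
        # PyTorch raises a RuntimeError subclass when CUDA memory runs out.
        if device == "cuda" and "out of memory" in str(exc).lower():
            return infer(batch, "cpu")
        raise


def fake_infer(batch, device):
    """Stand-in model call that always exhausts GPU memory."""
    if device == "cuda":
        raise RuntimeError("CUDA out of memory")
    return device, len(batch)


result = run_with_cpu_fallback(fake_infer, ["a", "b"], device="cuda")
```

Errors other than out-of-memory are re-raised so that genuine bugs are not hidden by the fallback.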

Phase D: Topic Modeling Quality

  • Outlier threshold: min_cluster_size and min_topic_size defaults reduced 100 → 50 for finer-grained topics
  • Outlier reduction: --reduce-outliers option (default: distributions) applies BERTopic.reduce_outliers() post-clustering; topics_info.json is written after reduction
  • objectives.toml: New reduce_outliers field in [model] section

Phase E: Analysis Yield

  • Broader context window: _build_topic_context() now includes up to 50 topics (was 20)
  • More findings per agent: Task descriptions request ≥ 5 findings (was 3)
  • Fuzzy citation matching: _fuzzy_match() in verifier uses 60% token overlap, tolerating minor LLM paraphrase differences while still flagging hallucinations
  • objectives.toml: New max_topics and min_findings fields in [analyze] section
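
Fuzzy citation matching by token overlap can be sketched as below. The verifier's real _fuzzy_match() may define overlap differently; this example assumes it is the fraction of the citation's distinct lowercase word tokens that also occur in the source text, compared against the 60% threshold from the release notes. The function names and sample sentences are hypothetical.

```python
import re


def token_overlap(citation, source):
    """Fraction of the citation's distinct tokens that occur in the source."""
    cite = set(re.findall(r"\w+", citation.lower()))
    src = set(re.findall(r"\w+", source.lower()))
    if not cite:
        return 0.0
    return len(cite & src) / len(cite)


def fuzzy_match(citation, source, threshold=0.6):
    """Accept a citation when enough of its tokens appear in the source."""
    return token_overlap(citation, source) >= threshold


source = "The battery lasts about two days on a single charge"
paraphrase = "battery lasts roughly two days per charge"  # 5 of 7 tokens match
invented = "the screen cracked after one week"  # 1 of 6 tokens match

accepted = fuzzy_match(paraphrase, source)
rejected = fuzzy_match(invented, source)
```

A light paraphrase keeps most of its tokens and passes, while an invented quote shares almost none and is flagged.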

v0.7.0 — Data Quality & Topic Transparency (2026-02-28)

New Features

  • Relevance Filtering: scout preprocess --relevance-filter scores documents by keyword/topic similarity
  • Data Quality Report: data_quality_report.json with community relevance breakdown
  • Topic Keywords Persistence: topics_info.json saves topic labels and keywords from BERTopic
  • Topic Description Table: Dashboard chart showing topic keywords at a glance
  • Semantic Chart Labels: Topic charts show keyword-based labels instead of "Topic 0", "Topic 1"
  • Enriched LLM Context: Interpretation prompts include project keywords, topic keywords, and research purpose

Bug Fixes / Improvements

  • Agent task descriptions include topic keywords and project context
  • New [preprocess] section in objectives.toml for relevance settings

v0.6.0 — 2026-02-24

Sentiment Analysis (scout preprocess --sentiment)

  • Added _run_sentiment_scoring() in preprocessor.py using HuggingFace Transformers (cardiffnlp/twitter-roberta-base-sentiment-latest).
  • Adds sentiment_label (positive / neutral / negative) and sentiment_score (0–1) columns to cleaned_data.parquet.
  • Batch processing with configurable batch size (default 32); truncation at 512 tokens.
  • Graceful fallback: if transformers is not installed, preprocessing continues unchanged.
  • New CLI flag: scout preprocess <project> --sentiment
  • New pipeline flag: scout run <project> --sentiment

Sentiment & Perception Dashboard Section

  • New _charts_sentiment.py module with 6 Plotly charts:
      • Sentiment Distribution: donut chart (positive / neutral / negative)
      • Sentiment by Topic: average sentiment score heatmap per topic
      • Controversy by Topic: standard deviation bar chart (opinion polarisation)
      • Sentiment Over Time: 3-line time series of label proportions
      • Community × Sentiment: heatmap of sentiment distribution per subreddit
      • Perception Map: scatter plot (avg sentiment vs. post volume)
  • Section appears in the dashboard only when the sentiment_label column is present.
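
The "Controversy by Topic" measure can be illustrated with plain Python. This sketch assumes each document contributes one numeric sentiment value to its topic and that controversy is the population standard deviation of those values. The release notes do not say which variant _charts_sentiment.py uses, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def controversy_by_topic(rows):
    """Map each topic to the standard deviation of its sentiment values."""
    scores = defaultdict(list)
    for topic, score in rows:
        scores[topic].append(score)
    # Higher spread means opinions on the topic are more polarised.
    return {topic: pstdev(values) for topic, values in scores.items()}


rows = [
    ("pricing", 0.1),
    ("pricing", 0.9),
    ("battery", 0.5),
    ("battery", 0.5),
]
controversy = controversy_by_topic(rows)
```

Both topics have the same average score of 0.5, but only "pricing" shows disagreement, which is exactly what an average-based chart would hide.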

LLM Selection Bug Fix + Ensemble Mode

  • Fixed --llm routing bug: previously llm_config = {"default": model_id} did not match any agent role, rendering --llm ineffective.
  • New ensemble option (now the default): each agent uses its own default LLM (DEFAULT_LLM_MAP in personas.py).
  • --llm claude routes all five agents to claude-sonnet-4-6.
  • --llm gemini routes all five agents to gemini/gemini-3.1-pro-preview.
  • Default changed from claude to ensemble in both CLI options and objectives.toml.
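
The routing bug and its fix can be sketched as follows. The role names agent_1 through agent_5 and the contents of DEFAULT_LLM_MAP here are placeholders, since the real map lives in personas.py and is not shown in these notes; only the two model identifiers come from the changelog. The point is that a config keyed by "default" never matches a role lookup, whereas expanding the choice into one entry per role does.

```python
# Placeholder roles and ensemble defaults, for illustration only.
DEFAULT_LLM_MAP = {
    "agent_1": "claude-sonnet-4-6",
    "agent_2": "gemini/gemini-3.1-pro-preview",
    "agent_3": "claude-sonnet-4-6",
    "agent_4": "gemini/gemini-3.1-pro-preview",
    "agent_5": "claude-sonnet-4-6",
}

SINGLE_MODEL = {
    "claude": "claude-sonnet-4-6",
    "gemini": "gemini/gemini-3.1-pro-preview",
}


def resolve_llm_config(choice="ensemble"):
    """Expand an --llm choice into an explicit role -> model mapping."""
    if choice == "ensemble":
        return dict(DEFAULT_LLM_MAP)
    if choice not in SINGLE_MODEL:
        raise ValueError(f"unknown --llm choice: {choice!r}")
    return {role: SINGLE_MODEL[choice] for role in DEFAULT_LLM_MAP}


def model_for(role, llm_config):
    """Look up a role's model, falling back to the role's own default."""
    return llm_config.get(role, DEFAULT_LLM_MAP[role])


# Old behaviour: the "default" key matches no role, so --llm is ignored.
buggy = {"default": "gemini/gemini-3.1-pro-preview"}
# Fixed behaviour: every role is routed explicitly.
fixed = resolve_llm_config("gemini")
```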

Report Language Selection (--report-language)

  • New --report-language / -rl option on scout analyze and scout run. Choices: english (default) | korean.
  • When korean is selected, a Korean language requirement block is prepended to the agent task description, directing all agents to write findings in Korean.
  • Structural keys (FINDING:, DETAIL:, CONFIDENCE:, CATEGORY:, CITATION:) remain in English for reliable parsing.
  • Persisted in objectives.toml as report_language = "english" under [analyze].
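
Keeping the structural keys in English means a parser can stay language-agnostic about the values. Below is a minimal sketch of such a parser, assuming one "KEY: value" pair per line and that each FINDING: line starts a new record. The actual output format and parser in scout may differ, and the function name and sample text are hypothetical.

```python
import re

KEYS = ("FINDING", "DETAIL", "CONFIDENCE", "CATEGORY", "CITATION")
LINE = re.compile(rf"^({'|'.join(KEYS)}):\s*(.*)$")


def parse_findings(text):
    """Parse agent output into a list of dicts keyed by lowercase field."""
    findings, current = [], None
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # Skip free text between structured lines.
        key, value = match.group(1).lower(), match.group(2)
        if key == "finding":
            current = {}  # Each FINDING: line opens a new record.
            findings.append(current)
        if current is not None:
            current[key] = value
    return findings


sample = """\
FINDING: 배터리 수명에 대한 불만이 많다
CONFIDENCE: high
CATEGORY: product
FINDING: 가격에 대한 의견이 갈린다
CONFIDENCE: medium
"""
findings = parse_findings(sample)
```

The values are Korean, yet the parser only ever inspects the English keys.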

LLM Interpretations + PNG Export + Visualization Report

  • scout visualize gains four new options:
      • --interpret / --no-interpret: generate LLM-written section interpretations
      • --report-language / -rl: language for interpretations
      • --llm: LLM for interpretations (claude | gemini, default claude)
      • --export-png: export all charts as high-resolution PNG (requires kaleido)
  • _interpretations.py: calls LiteLLM once per dashboard section, saves to interpretations.json. Graceful skip on API error or missing litellm.
  • _png_exporter.py: exports charts to visualizations/charts/{section}/{chart}.png at 1200×800px scale=3 (~300 DPI). Graceful skip when kaleido is absent.
  • visualization_report.md: structured Markdown combining interpretations (blockquotes) and PNG image references — suitable for academic papers and reports.
  • Dashboard HTML now renders a highlighted interpretation box (.interpretation-box) above each section's chart grid when --interpret is used.
  • kaleido>=0.2.1 added to [viz] optional dependencies.
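
The "1200×800px scale=3 (~300 DPI)" figure can be checked with a little arithmetic. The sketch assumes the scale factor multiplies both dimensions, which is how the scale argument of Plotly's write_image behaves; whether _png_exporter.py calls that function is not stated in these notes. The helper names are hypothetical.

```python
def export_pixels(width, height, scale):
    """Pixel dimensions of the exported image after scaling."""
    return width * scale, height * scale


def print_size_inches(pixels, dpi=300):
    """Physical size at which the image prints at the given DPI."""
    return tuple(p / dpi for p in pixels)


pixels = export_pixels(1200, 800, 3)  # (3600, 2400)
inches = print_size_inches(pixels)    # (12.0, 8.0)
```

In other words, each chart is a 3600×2400 pixel image, which prints at 300 DPI when placed at 12 by 8 inches.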

Dependency and Version

  • Version bumped to 0.6.0.
  • kaleido>=0.2.1 added to pyproject.toml [project.optional-dependencies.viz].

v0.5.0 — 2026-02-20

  • Added scout visualize command generating a 30-chart standalone HTML dashboard.
  • New [viz] optional dependency group: plotly>=5.18.0, wordcloud>=1.9.0.
  • 7 visualizer modules under services/visualizer/.

v0.4.0

  • Reddit direct collector (bypasses Apify for simple URL scraping).
  • NER chunking improvements.
  • CrewAI LiteLLM routing fix.

v0.3.0

  • Per-project objectives.toml configuration with CLI > file > default merge.
  • --llm option on scout analyze.
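
The "CLI > file > default" precedence can be sketched as a layered dictionary merge. This is an illustration and not scout's actual loader: the function name, the sample keys, and the convention that None means "not provided" are assumptions made for the example.

```python
# Hypothetical built-in defaults.
DEFAULTS = {"llm": "claude", "actor": "default-actor"}


def merge_config(cli, file, defaults=DEFAULTS):
    """Layer settings so that CLI beats file and file beats defaults."""
    merged = dict(defaults)
    # A value of None means the option was not provided at that level.
    merged.update({k: v for k, v in file.items() if v is not None})
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged


file_settings = {"llm": "gemini", "actor": "custom-actor"}  # objectives.toml
cli_settings = {"llm": "claude", "actor": None}  # only --llm was passed

config = merge_config(cli_settings, file_settings)
```

The --llm value from the command line overrides the file, while the actor falls through to the value in objectives.toml because the flag was not given.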

v0.2.0

  • Configurable Apify actor (--actor).
  • LLM-based keyword generation from project topic.

v0.1.x

  • Initial release: collect → preprocess → model → analyze pipeline.
  • GitHub Actions CI + GitHub Pages documentation.