How we collect, verify, and publish data
Every record on Nonfaction is traceable to a primary source. This document describes exactly how that data moves from raw government filings to the public interface — with no gaps, no black boxes, and no editorial intervention.
Data pipeline overview
┌─────────────────────────────────────────────────────────────────────┐
│                            DATA PIPELINE                            │
├─────────────────┬───────────────────────┬───────────────────────────┤
│  1. INGEST      │  2. VERIFY            │  3. CORRELATE             │
│                 │                       │                           │
│  Gov APIs ──┐   │  ┌─ Automated checks  │  ┌─ Timing windows        │
│  Scrapers ──┼──►│  │  Hash integrity    │  │  Entity dedup          │
│  Submissions┘   │  │  Schema validation │  │  Cross-referencing     │
│                 │  └─ Human review ─────┼─►└─ Score calculation     │
│  Provenance     │    High-impact records│     Flagging              │
│  hash attached  │    Legal review       │     Public output         │
└─────────────────┴───────────────────────┴───────────────────────────┘
Stage 1 — Ingest
How data enters the pipeline
Data enters through three complementary mechanisms, each with different trust requirements and update frequencies. Every record is assigned a provenance hash at ingest time.
Automated Scrapers
Runs every 6 hours
Scheduled cron-based scrapers in Rust and Python pull structured data from government portals, parsing HTML, XML, and JSON into canonical records with provenance hashes.
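As a rough illustration of attaching a provenance hash at ingest, here is a minimal Python sketch. The `CanonicalRecord` fields, the `ingest` helper, the example URL, and the choice of SHA-256 over the raw source bytes are all assumptions made for this example; the text above does not specify the actual record layout or hash function.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CanonicalRecord:
    # Hypothetical canonical shape; the real schema is not given in the text.
    source_url: str        # where the raw document was fetched from
    fetched_at: str        # ISO 8601 UTC timestamp of the fetch
    payload: dict          # fields parsed from the HTML, XML, or JSON source
    provenance_hash: str   # SHA-256 hex digest of the raw source bytes


def ingest(source_url: str, raw_bytes: bytes, payload: dict,
           fetched_at: Optional[datetime] = None) -> CanonicalRecord:
    """Build a canonical record and attach its provenance hash."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    digest = hashlib.sha256(raw_bytes).hexdigest()
    return CanonicalRecord(source_url, fetched_at.isoformat(), payload, digest)


# Made-up filing used only to exercise the sketch.
raw = json.dumps({"committee": "C001", "amount": 2500}).encode("utf-8")
record = ingest("https://example.gov/filings/1", raw, json.loads(raw))
```

Hashing the raw bytes rather than the parsed payload means the hash still identifies the original document even if the parser changes later.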
Official APIs
6 live API integrations
Where government agencies provide machine-readable APIs (FEC, Congress.gov, USASpending), we use authenticated API clients with rate limiting, retry logic, and change detection.
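The retry and change-detection behavior described above might look something like the following Python sketch. It is not the actual client: `fetch_with_retry`, `has_changed`, the backoff parameters, and the sample payload are hypothetical, and authentication and rate limiting are left out to keep the example short. The fetch function is injected so the sketch runs without network access.

```python
import hashlib
import time
from typing import Callable, List, Optional, Tuple


def fetch_with_retry(fetch: Callable[[], bytes], attempts: int = 3,
                     base_delay: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call fetch(), retrying with exponential backoff on I/O errors."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def has_changed(body: bytes, last_hash: Optional[str]) -> Tuple[bool, str]:
    """Compare the response digest with the digest from the previous cycle."""
    digest = hashlib.sha256(body).hexdigest()
    return digest != last_hash, digest


# Simulated endpoint that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return b'{"bill": "HR-1", "status": "passed"}'


delays: List[float] = []
body = fetch_with_retry(flaky_fetch, sleep=delays.append)  # record delays instead of sleeping
changed, digest = has_changed(body, last_hash=None)
changed_again, _ = has_changed(body, last_hash=digest)
```

Storing only the last digest per endpoint is enough to skip downstream processing when a fetch returns identical content.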
Crowdsourced Submissions
Human review required
Community submissions are accepted only with a verifiable primary source link. No anonymous evidence. Every submission undergoes human review before publication.
Source trust hierarchy
Update cadence: Daily | Avg. latency: < 6 hours | Trust score: 99%
Direct machine-readable feeds from official government systems. These sources are considered authoritative with minimal transformation required. Hash integrity is verified on every fetch cycle.
Update cadence: Weekly | Avg. latency: < 3 days | Trust score: 93%
Curated public databases maintained by state agencies and established civic organizations. Records undergo normalization and deduplication before integration.
Update cadence: Monthly | Avg. latency: < 2 weeks | Trust score: 85%
Long-tail accountability data covering the sub-federal layer where much governance actually happens. Higher verification overhead — every record requires a direct source URL or filing reference.
Stage 2 — Verify
Two-stage verification pipeline
All ingested records pass through automated checks first. High-impact records additionally receive human review before publication.
Automated checks
Runs on every ingested record
- Source URL reachability and hash consistency
- Required field completeness validation
- Entity name normalization and deduplication
- Date/timestamp format and range validation
- Cross-reference against existing records for conflicts
- Schema compliance against canonical record types
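A few of the automated checks listed above (required-field completeness, date format and range validation, and name normalization for deduplication) can be sketched in Python as follows. The field names, the 1900 lower bound, and the normalization rule are assumptions chosen for illustration, not the project's actual schema or rules.

```python
from datetime import date
from typing import List

REQUIRED_FIELDS = ("source_url", "entity_name", "event_date")  # assumed schema
EARLIEST_DATE = date(1900, 1, 1)                               # assumed lower bound


def normalize_name(name: str) -> str:
    """Collapse whitespace and case so trivially different spellings match."""
    return " ".join(name.split()).casefold()


def check_record(record: dict, today: date) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")
    raw_date = record.get("event_date")
    if raw_date:
        try:
            parsed = date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            problems.append("event_date is not in YYYY-MM-DD format")
        else:
            if not EARLIEST_DATE <= parsed <= today:
                problems.append("event_date out of range")
    return problems


def deduplicate(records: List[dict]) -> List[dict]:
    """Keep the first record for each (normalized name, date) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (normalize_name(rec["entity_name"]), rec["event_date"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Returning a list of problems rather than raising on the first failure lets a reviewer see every issue with a record at once.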
Human review
Applied to high-impact records
- Manual source verification against original document
- Contextual accuracy review against public record
- Entity disambiguation for common name collisions
- Legal review for records involving active litigation
- Sensitivity review for records involving minors
- Community correction processing and adjudication
Stage 3 — Correlate
Temporal proximity rules
Timing analysis surfaces meaningful temporal proximity between events — not causation, not conclusions. The rules below are deterministic, documented, and version-controlled.
Donation → Vote
Window: < 90 days
Campaign contributions received within 90 days of a directly related legislative vote are flagged for proximity analysis. The timing window is documented in academic literature on campaign finance influence.
Lobbying → Vote
Window: < 180 days
Registered lobbying activity on a specific bill or policy area, within 180 days of a relevant vote, is surfaced as a timing correlation. The expanded window reflects the lobbying disclosure lag.
Indictment → Pardon
Window: Always flagged
Any presidential or gubernatorial pardon or commutation granted to an individual with an active federal or state indictment is flagged regardless of timing.
Regulatory Action → Donation
Window: < 60 days
Donations received within 60 days following a favorable regulatory decision affecting the donor's industry are flagged for reverse-proximity analysis.
Important: Timing analysis surfaces correlation only. Nonfaction makes no claim of causation. Scores are descriptive analytical tools, not allegations.
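The four timing rules can be expressed as a small deterministic table. The Python sketch below is one possible encoding, under two assumptions not stated in the text: the first event in each rule precedes the second, and the window bound is strict, following the "< N days" labels. The `Rule` type, the event-type names, and `is_flagged` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Rule:
    first: str                # event type that comes first
    second: str               # event type that follows
    max_days: Optional[int]   # None means the pair is always flagged


RULES = (
    Rule("donation", "vote", 90),
    Rule("lobbying", "vote", 180),
    Rule("indictment", "pardon", None),
    Rule("regulatory_action", "donation", 60),
)


def is_flagged(rule: Rule, first_date: date, second_date: date) -> bool:
    """Flag the pair when the second event falls inside the rule's window."""
    if rule.max_days is None:
        return True  # flagged regardless of timing
    gap = (second_date - first_date).days
    # Assumption: ordered events and a strict upper bound ("< N days").
    return 0 <= gap < rule.max_days


donation_vote = RULES[0]
flag_inside = is_flagged(donation_vote, date(2024, 1, 1), date(2024, 3, 30))   # 89-day gap
flag_outside = is_flagged(donation_vote, date(2024, 1, 1), date(2024, 3, 31))  # 90-day gap
```

The function returns only a boolean proximity flag; it carries no score and makes no statement about why the two events are close together.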
No-editorial policy
What we never do
No narrative framing
Records are presented as structured data, not as stories with protagonists and antagonists.
No partisan signals
No language, imagery, or ordering that implies political endorsement or opposition.
No anonymous sources
Every surfaced record traces to a named, verifiable public document. Unnamed allegations are not published.
No hidden algorithms
Every scoring function, ranking rule, and timing window is documented in public code under GPL v3.
Archive & integrity
Tamper-proof by design
Content-Addressable Archive
Every source document is stored by its cryptographic hash. The content cannot be altered without changing the identifier, so any silent tampering is detectable by recomputing the hash.
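A content-addressable store can be sketched in a few lines of Python. This is an in-memory illustration only: the `ContentStore` class and the use of SHA-256 are assumptions, and the real archive's storage backend is not described in the text.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Stores each document under the SHA-256 hex digest of its bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self._blobs[key] = content
        return key

    def get(self, key: str) -> bytes:
        content = self._blobs[key]
        # Recompute the digest on every read so altered content is caught.
        if hashlib.sha256(content).hexdigest() != key:
            raise ValueError("stored content does not match its identifier")
        return content


store = ContentStore()
doc_id = store.put(b"original filing text")
retrieved = store.get(doc_id)
```

Because the identifier is derived from the content, storing the same document twice yields the same identifier, which also gives deduplication for free.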
Merkle DAG Audit Trail
The sequence of all data mutations is linked in a Merkle Directed Acyclic Graph. Any alteration to historical records invalidates all subsequent nodes.
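The following Python sketch shows the general idea of a hash-linked audit trail: each node's identifier covers its payload and its parents' identifiers, so a change to an earlier node invalidates everything built on top of it. The `AuditTrail` class, the JSON serialization, and the payload strings are assumptions for illustration, not the project's actual data model.

```python
import hashlib
import json
from typing import Dict, List, Sequence


def node_hash(payload: str, parents: Sequence[str]) -> str:
    """Hash a node's payload together with its parents' identifiers."""
    body = json.dumps({"payload": payload, "parents": sorted(parents)}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class AuditTrail:
    def __init__(self) -> None:
        # Insertion order guarantees parents appear before their children.
        self.nodes: Dict[str, dict] = {}

    def append(self, payload: str, parents: Sequence[str] = ()) -> str:
        for parent in parents:
            if parent not in self.nodes:
                raise KeyError(f"unknown parent: {parent}")
        node_id = node_hash(payload, parents)
        self.nodes[node_id] = {"payload": payload, "parents": sorted(parents)}
        return node_id

    def invalid_nodes(self) -> List[str]:
        """Return nodes whose content changed, plus all of their descendants."""
        bad: List[str] = []
        bad_set = set()
        for node_id, node in self.nodes.items():
            intact = node_hash(node["payload"], node["parents"]) == node_id
            if not intact or any(p in bad_set for p in node["parents"]):
                bad.append(node_id)
                bad_set.add(node_id)
        return bad


trail = AuditTrail()
a = trail.append("create record 17")
b = trail.append("correct donor name on record 17", [a])
c = trail.append("merge duplicate into record 17", [a, b])
d = trail.append("create record 18")  # unrelated root
```

Tampering with node `a` in this sketch would cause `invalid_nodes()` to report `a`, `b`, and `c`, while the unrelated root `d` stays valid.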
Compile-Time Source Enforcement
The type system enforces source chain completeness at build time. Records missing verified provenance cannot compile into a publishable state.
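Python has no compile step that rejects ill-typed programs, so the sketch below is only an analogy to the build-time guarantee described above. It uses separate types for draft and publishable records, so that a static checker such as mypy would reject passing a draft to `publish`, with a runtime check at the single promotion point. All names here (`DraftRecord`, `PublishableRecord`, `promote`, `publish`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DraftRecord:
    title: str
    source_url: Optional[str] = None        # may still be missing
    provenance_hash: Optional[str] = None   # may still be missing


@dataclass(frozen=True)
class PublishableRecord:
    title: str
    source_url: str        # non-optional: provenance is part of the type
    provenance_hash: str


def promote(draft: DraftRecord) -> PublishableRecord:
    """The only intended path from a draft to a publishable record."""
    if not draft.source_url or not draft.provenance_hash:
        raise ValueError("record is missing verified provenance")
    return PublishableRecord(draft.title, draft.source_url, draft.provenance_hash)


def publish(record: PublishableRecord) -> str:
    # A static type checker would flag publish(DraftRecord(...)) as an error.
    return f"{record.title} [{record.source_url}]"


draft = DraftRecord("Filing 42", "https://example.gov/filings/42", "ab" * 32)
line = publish(promote(draft))
```

In a language with a stricter type system, the same split between draft and publishable types can be enforced by the compiler itself rather than by an external checker.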