Methodology

How we collect, verify, and publish data

Every record on Nonfaction is traceable to a primary source. This document describes exactly how that data moves from raw government filings to the public interface — with no gaps, no black boxes, and no editorial intervention.

Data pipeline overview


  ┌─────────────────────────────────────────────────────────────────────┐
  │                        DATA PIPELINE                                │
  ├─────────────────┬───────────────────────┬───────────────────────────┤
  │   1. INGEST     │    2. VERIFY          │    3. CORRELATE           │
  │                 │                       │                           │
  │  Gov APIs ──┐   │  ┌─ Automated checks  │  ┌─ Timing windows        │
  │  Scrapers ──┼──►│  │  Hash integrity    │  │  Entity dedup          │
  │  Submissions┘   │  │  Schema validation │  │  Cross-referencing     │
  │                 │  └─ Human review ─────┼─►└─ Score calculation     │
  │  Provenance     │    High-impact records│    Flagging               │
  │  hash attached  │    Legal review       │    Public output          │
  └─────────────────┴───────────────────────┴───────────────────────────┘

Stage 1 — Ingest

How data enters the pipeline

Data enters through three complementary mechanisms, each with different trust requirements and update frequencies. Every record is assigned a provenance hash at ingest time.
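As an illustration only, a provenance hash could bind the raw payload to where and when it was fetched. This is a minimal sketch: the `IngestedRecord` layout, the field names, and the choice of SHA-256 are assumptions, since the text above does not specify the hash function or the record schema.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestedRecord:
    # Hypothetical record layout; the real canonical schema is not given here.
    source_url: str
    fetched_at: str       # ISO 8601 timestamp supplied by the caller
    payload: bytes        # raw bytes exactly as fetched
    provenance_hash: str


def provenance_hash(source_url: str, fetched_at: str, payload: bytes) -> str:
    """Hash the payload together with its source URL and fetch time."""
    header = json.dumps(
        {"source_url": source_url, "fetched_at": fetched_at},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    digest = hashlib.sha256()
    digest.update(header)
    digest.update(b"\n")
    digest.update(payload)
    return digest.hexdigest()


def ingest(source_url: str, fetched_at: str, payload: bytes) -> IngestedRecord:
    """Attach the provenance hash at ingest time."""
    return IngestedRecord(
        source_url, fetched_at, payload,
        provenance_hash(source_url, fetched_at, payload),
    )
```

Identical inputs always yield the same hash, and changing a single byte of the payload yields a different one, which is the property later integrity checks rely on.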

Automated Scrapers

Runs every 6 hours

Cron-scheduled scrapers written in Rust and Python pull structured data from government portals and parse HTML, XML, and JSON into canonical records, each carrying a provenance hash.

Official APIs

6 live API integrations

Where government agencies provide machine-readable APIs (FEC, Congress.gov, USASpending), we use authenticated API clients with rate limiting, retry logic, and change detection.
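The retry and change-detection behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the actual client: `fetch_with_retry` and `has_changed` are hypothetical names, the fetch function is injected so no real endpoint is assumed, and rate limiting and authentication are left out.

```python
import hashlib
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], bytes],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")


def has_changed(payload: bytes, previous_hash: Optional[str]) -> tuple[bool, str]:
    """Return (changed, new_hash) by comparing SHA-256 digests of the payload."""
    new_hash = hashlib.sha256(payload).hexdigest()
    return new_hash != previous_hash, new_hash
```

Comparing digests rather than full payloads means only the previous hash needs to be kept per endpoint to decide whether a fetch produced new data.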

Crowdsourced Submissions

Human review required

Community submissions are accepted only with a verifiable primary source link. No anonymous evidence. Every submission undergoes human review before publication.

Source trust hierarchy

Tier 1 (highest trust): Government APIs

Update cadence: Daily
Avg. latency: < 6 hours
Trust score: 99%

Direct machine-readable feeds from official government systems. These sources are considered authoritative with minimal transformation required. Hash integrity is verified on every fetch cycle.

FEC electronic filings API
Congress.gov vote records
PACER federal court filings
USASpending.gov contracts database
SEC EDGAR disclosures
OpenSecrets lobbying registrations

Tier 2 (high trust): Public Databases

Update cadence: Weekly
Avg. latency: < 3 days
Trust score: 93%

Curated public databases maintained by state agencies and established civic organizations. Records undergo normalization and deduplication before integration.

State ethics commission filings
Lobbyist registration databases
State campaign finance records
State-level court record systems (PACER equivalents)
Public procurement portals
Inspector General reports

Tier 3 (verified): State & Local Sources

Update cadence: Monthly
Avg. latency: < 2 weeks
Trust score: 85%

Long-tail accountability data covering the sub-federal layer where much governance actually happens. Higher verification overhead — every record requires a direct source URL or filing reference.

Municipal council voting records
County-level property and tax data
Local lobbying registrations
School board and special district filings
Verified crowdsourced submissions
Journalist-archived public documents

Stage 2 — Verify

Two-stage verification pipeline

All ingested records pass through automated checks first. High-impact records additionally receive human review before publication.

Automated checks

Runs on every ingested record

Source URL reachability and hash consistency
Required field completeness validation
Entity name normalization and deduplication
Date/timestamp format and range validation
Cross-reference against existing records for conflicts
Schema compliance against canonical record types
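A few of the automated checks listed above can be sketched as plain functions. The required field names, the ISO date format, and the normalization rule are assumptions made for illustration; the actual canonical record types are not described here.

```python
import re
from datetime import date

# Hypothetical required fields; the real schema is not specified in this document.
REQUIRED_FIELDS = ("source_url", "entity_name", "event_date")


def normalize_entity_name(name: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace to build a dedup key."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())


def check_record(record: dict, today: date) -> list[str]:
    """Return a list of problems found; an empty list means the record passed."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    raw_date = record.get("event_date")
    if raw_date:
        try:
            parsed = date.fromisoformat(raw_date)
        except ValueError:
            problems.append("event_date is not a valid ISO 8601 date")
        else:
            if parsed > today:
                problems.append("event_date is in the future")
    return problems
```

Returning a list of problems instead of raising on the first failure lets a single pass report every issue with a record, which is convenient when a reviewer has to triage rejected records.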

Human review

Applied to high-impact records

Manual source verification against original document
Contextual accuracy review against public record
Entity disambiguation for common name collisions
Legal review for records involving active litigation
Sensitivity review for records involving minors
Community correction processing and adjudication

Stage 3 — Correlate

Temporal proximity rules

Timing analysis surfaces meaningful temporal proximity between events — not causation, not conclusions. The rules below are deterministic, documented, and version-controlled.

Donation → Vote

Window: < 90 days

Campaign contributions received within 90 days of a directly related legislative vote are flagged for proximity analysis. The timing window is documented in academic literature on campaign finance influence.

Lobbying → Vote

Window: < 180 days

Registered lobbying activity on a specific bill or policy area, within 180 days of a relevant vote, is surfaced as a timing correlation. The expanded window reflects the lobbying disclosure lag.

Indictment → Pardon

Window: Always flagged

Any presidential or gubernatorial pardon or commutation granted to an individual with an active federal or state indictment is flagged regardless of timing.

Regulatory Action → Donation

Window: < 60 days

Donations received within 60 days following a favorable regulatory decision affecting the donor's industry are flagged for reverse-proximity analysis.

Important: Timing analysis surfaces correlation only. Nonfaction makes no claim of causation. Scores are descriptive analytical tools, not allegations.
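The four rules above reduce to a small lookup table and one comparison. The sketch below takes the window lengths directly from this section; the rule keys, the function name, and the reading that the first event must precede the second by strictly fewer days than the window are assumptions.

```python
from datetime import date
from typing import Optional

# Window lengths in days, from the rules above; None means "always flagged".
WINDOWS: dict[str, Optional[int]] = {
    "donation_vote": 90,
    "lobbying_vote": 180,
    "indictment_pardon": None,
    "regulatory_action_donation": 60,
}


def is_flagged(rule: str, first_event: date, second_event: date) -> bool:
    """Flag when second_event follows first_event within the rule's window."""
    window = WINDOWS[rule]
    if window is None:
        return True  # flagged regardless of timing
    gap_days = (second_event - first_event).days
    return 0 <= gap_days < window
```

Under these assumptions, a donation on 2024-01-01 followed by a related vote on 2024-03-30 (an 89-day gap) is flagged, while the same donation followed by a vote on 2024-03-31 (90 days) is not. The function reports proximity only and says nothing about why the two events are close in time.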

No-editorial policy

What we never do

No narrative framing

Records are presented as structured data, not as stories with protagonists and antagonists.

No partisan signals

No language, imagery, or ordering that implies political endorsement or opposition.

No anonymous sources

Every surfaced record traces to a named, verifiable public document. Unnamed allegations are not published.

No hidden algorithms

Every scoring function, ranking rule, and timing window is documented in public source code released under GPL v3.

Archive & integrity

Tamper-proof by design

Content-Addressable Archive

Every source document is stored under its cryptographic hash. The content cannot be altered without changing the identifier, so any silent modification is detectable.
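A content-addressable store can be sketched in a few lines. This in-memory version is an illustration only: the `ContentStore` class and the use of SHA-256 are assumptions, and a real archive would persist blobs to durable storage.

```python
import hashlib


class ContentStore:
    """In-memory content-addressable store keyed by SHA-256 digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store content and return its identifier (the hex digest)."""
        key = hashlib.sha256(content).hexdigest()
        self._blobs[key] = content
        return key

    def get(self, key: str) -> bytes:
        """Return content, re-verifying that it still matches its identifier."""
        content = self._blobs[key]
        if hashlib.sha256(content).hexdigest() != key:
            raise ValueError(f"stored content does not match identifier {key}")
        return content
```

Because the identifier is derived from the bytes themselves, storing the same document twice yields the same key, and any modification to stored bytes is caught on the next read.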

Merkle DAG Audit Trail

The sequence of all data mutations is linked in a Merkle Directed Acyclic Graph: each node's hash covers its own content and the hashes of its parents. Any alteration to a historical record therefore invalidates every node that descends from it.
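The invalidation property can be demonstrated with a simplified, linear case of a Merkle DAG in which every node has at most one parent. The node layout and function names below are hypothetical, and SHA-256 is an assumption.

```python
import hashlib
import json


def node_hash(payload: str, parents: list[str]) -> str:
    """Hash a mutation together with the hashes of its parent nodes."""
    body = json.dumps(
        {"payload": payload, "parents": sorted(parents)}, sort_keys=True
    ).encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def append_mutation(log: list[dict], payload: str) -> None:
    """Append a mutation whose single parent is the current head of the log."""
    parents = [log[-1]["hash"]] if log else []
    log.append({"payload": payload, "parents": parents,
                "hash": node_hash(payload, parents)})


def first_invalid(log: list[dict]) -> int:
    """Return the index of the first node that fails verification, or -1."""
    for i, node in enumerate(log):
        expected_parents = [log[i - 1]["hash"]] if i > 0 else []
        if node["parents"] != expected_parents:
            return i  # link to the previous node is broken
        if node["hash"] != node_hash(node["payload"], node["parents"]):
            return i  # node content no longer matches its own hash
    return -1
```

Editing an old payload breaks that node's own hash. Recomputing the hash to hide the edit breaks the parent link of the next node instead, so a forger would have to rewrite every later node, and the final head hash would still differ from any published copy.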

Compile-Time Source Enforcement

The type system enforces source chain completeness at build time. Records missing verified provenance cannot compile into a publishable state.

Frequently asked questions

What happens if an original source URL becomes unreachable?

Every source document is archived by its cryptographic hash at ingest time. If the original URL becomes unreachable, the archived copy remains accessible and the provenance chain is unbroken. We surface a "source archived" indicator on affected records.