indexer
the indexer is a Node.js process that runs as a GitHub Action. it finds declared-md files across all public GitHub repositories, validates them, and writes the results to JSON.
how it works
on each run:
- search -- queries the GitHub Code Search API for the filenames whoami.md, whois.md, and whatis.md
- filter -- keeps only files at canonical locations (see below)
- owner check -- for whoami.md, confirms the repo owner is a GitHub user. for whois.md, confirms it is an organization.
- validate -- runs each file through the JSON schema validator
- deduplicate -- when the same subject has files at multiple locations, keeps only the highest-priority one
- write -- writes valid profiles to data/<kind>.json and invalid files to invalid/<kind>/
- commit -- commits any changes to the index repo (skips if no changes)
this is a full re-index. there is no incremental cache.
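the owner check step can be sketched as a small predicate. this is a minimal illustration, not the indexer's actual code: the name ownerMatchesKind is hypothetical, and the behaviour for whatis.md is an assumption, since no owner rule is stated for it. the "User" and "Organization" strings are the account types GitHub's REST API reports for a repository owner.

```typescript
// Sketch only: ownerMatchesKind is a hypothetical name, not the indexer's API.
type Kind = "whoami" | "whois" | "whatis";

// GitHub reports a repo owner's account type as a string such as
// "User" or "Organization".
function ownerMatchesKind(kind: Kind, ownerType: string): boolean {
  switch (kind) {
    case "whoami":
      // whoami.md must live in a repo owned by a user account.
      return ownerType === "User";
    case "whois":
      // whois.md must live in a repo owned by an organization.
      return ownerType === "Organization";
    case "whatis":
      // Assumption: no owner rule is stated for whatis.md, so accept any owner.
      return true;
  }
}
```

a file that fails this check would be dropped before validation, so it never reaches data/ or invalid/.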
canonical location filter
the indexer silently skips files that are not at one of the three canonical locations:
| priority | location |
|---|---|
| 1 | root of the canonical repo (see per-standard rules) |
| 2 | root of <owner>/declared repo |
| 3 | .github/<filename> in any public repo |
files found via search that are not at these locations are discarded.
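as a rough sketch, the filter can be written as a function that maps a search hit to its priority, or to null when the hit is skipped. the names SearchHit, locationPriority, and isCanonicalRepo are hypothetical. the per-standard rules for the canonical repo are not reproduced here, so they are passed in as a callback. the sketch also assumes exact, case-sensitive path matching.

```typescript
// Sketch only: all names here are illustrative, not the indexer's API.
interface SearchHit {
  owner: string;
  repo: string;
  path: string; // path inside the repo, e.g. ".github/whoami.md"
}

// Returns 1, 2, or 3 for a canonical location, or null when the hit is skipped.
// isCanonicalRepo stands in for the per-standard rules (an assumption).
function locationPriority(
  hit: SearchHit,
  filename: string,
  isCanonicalRepo: (owner: string, repo: string) => boolean,
): 1 | 2 | 3 | null {
  if (hit.path === filename) {
    // File sits at the repo root.
    if (isCanonicalRepo(hit.owner, hit.repo)) return 1;
    if (hit.repo === "declared") return 2;
    return null; // root of some other repo is not canonical
  }
  if (hit.path === `.github/${filename}`) return 3;
  return null; // any other directory is not canonical
}
```

returning the priority instead of a boolean lets the later deduplication step reuse the result without re-inspecting the path.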
deduplication
when the same subject has valid files at multiple locations, the indexer keeps only the highest-priority one, following the location table above. what counts as the same subject depends on the kind:
- for whoami.md: deduplicated by GitHub username
- for whois.md: deduplicated by GitHub organization login
- for whatis.md: deduplicated by handle field
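one way to picture the deduplication rules is a single pass that keeps the best candidate per subject. this is a sketch under stated assumptions: Candidate and deduplicate are hypothetical names, priority 1 is assumed to outrank 2 and 3, and subjects are compared case-insensitively because GitHub logins are case-insensitive.

```typescript
// Sketch only: names and the Candidate shape are illustrative.
interface Candidate {
  kind: "whoami" | "whois" | "whatis";
  subject: string;  // username, organization login, or handle, depending on kind
  priority: number; // 1, 2, or 3 from the location table
  source: string;   // e.g. "owner/repo/path"
}

function deduplicate(candidates: Candidate[]): Candidate[] {
  const best = new Map<string, Candidate>();
  for (const c of candidates) {
    // Key on kind plus subject so a user and a project with the same
    // name never collide.
    const key = `${c.kind}:${c.subject.toLowerCase()}`;
    const current = best.get(key);
    // Assumption: a lower priority number wins.
    if (current === undefined || c.priority < current.priority) {
      best.set(key, c);
    }
  }
  return Array.from(best.values());
}
```

when two candidates share the same priority, this sketch keeps whichever was seen first. the source does not say how the indexer breaks such ties.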
rate limits
the GitHub Search API caps results at 1,000 per query (10 pages of 100). when the total result count exceeds 1,000, the indexer logs a warning. future versions may use multiple queries to work around this.
the indexer sleeps 1 second between pages to avoid secondary rate limits.
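the paging behaviour described above can be sketched as follows. planPages and fetchAllPages are hypothetical names, and fetchPage is a caller-supplied stand-in for the real search request, so nothing here should be read as the indexer's actual implementation.

```typescript
// Sketch only: names are illustrative; fetchPage is supplied by the caller.
const PER_PAGE = 100;
const MAX_RESULTS = 1000; // GitHub Search API cap per query

interface PagePlan {
  pages: number;      // pages to request, at most 10
  truncated: boolean; // true when results beyond the cap are unreachable
}

function planPages(totalCount: number): PagePlan {
  const reachable = Math.min(totalCount, MAX_RESULTS);
  return {
    pages: Math.ceil(reachable / PER_PAGE),
    truncated: totalCount > MAX_RESULTS,
  };
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function fetchAllPages<T>(
  totalCount: number,
  fetchPage: (page: number) => Promise<T[]>,
  delayMs = 1000,
): Promise<T[]> {
  const { pages, truncated } = planPages(totalCount);
  if (truncated) {
    console.warn(`search returned ${totalCount} results; only the first ${MAX_RESULTS} are reachable`);
  }
  const items: T[] = [];
  for (let page = 1; page <= pages; page++) {
    // Pause between pages (not before the first) to avoid secondary rate limits.
    if (page > 1) await sleep(delayMs);
    items.push(...(await fetchPage(page)));
  }
  return items;
}
```

keeping the page arithmetic in a pure function makes the 1,000-result cap easy to check without touching the network.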
invalid file storage
files that are discovered but fail validation are written to invalid/<kind>/. each file is the original content with an HTML comment at the top:
<!--
Source: owner/repo/path
Discovered: 2026-04-27T00:00:00Z
Validation errors:
- handle: must be 2-39 characters, lowercase letters, numbers, and hyphens only
- links.github: must be a valid GitHub URL
-->
[original file content here]
these files are committed to the index repo. the history of validation failures is visible over time.
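a minimal sketch of how such a file could be assembled, assuming the layout shown above. renderInvalidFile is a hypothetical name, and the error strings are taken as already formatted by the validator.

```typescript
// Sketch only: renderInvalidFile is an illustrative name.
function renderInvalidFile(
  source: string,   // "owner/repo/path"
  discovered: Date,
  errors: string[], // one message per validation failure
  content: string,  // original file content, unmodified
): string {
  // toISOString() includes milliseconds; drop them to match the sample above.
  const timestamp = discovered.toISOString().replace(/\.\d{3}Z$/, "Z");
  const header = [
    "<!--",
    `Source: ${source}`,
    `Discovered: ${timestamp}`,
    "Validation errors:",
    ...errors.map((e) => `- ${e}`),
    "-->",
  ];
  return header.join("\n") + "\n" + content;
}
```

an error message that itself contains `-->` would close the comment early, so a real implementation would likely need to escape or reject that sequence.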
running locally
requires a GitHub personal access token (no extra scopes needed):
cd index/
npm install
npm run build
# dry run
GITHUB_TOKEN=<token> node dist/crawl.js --dry-run --limit 10
# index one kind only
GITHUB_TOKEN=<token> node dist/crawl.js --kind whoami --limit 5
# full crawl
GITHUB_TOKEN=<token> node dist/crawl.js
flags:
| flag | description |
|---|---|
| --dry-run | run without writing output files |
| --limit N | stop after N results per kind (0 = no limit) |
| --kind | crawl only whoami, whois, or whatis |
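for illustration, the three flags could be parsed along these lines. this is not the code in dist/crawl.js: parseFlags and CrawlOptions are hypothetical names, and the defaults (no dry run, limit 0, all kinds) are assumptions based on the table above.

```typescript
// Sketch only: parseFlags and CrawlOptions are illustrative names.
type CrawlKind = "whoami" | "whois" | "whatis";

interface CrawlOptions {
  dryRun: boolean;
  limit: number;          // 0 means no limit
  kind: CrawlKind | null; // null means crawl every kind (assumed default)
}

function parseFlags(argv: string[]): CrawlOptions {
  const options: CrawlOptions = { dryRun: false, limit: 0, kind: null };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--dry-run") {
      options.dryRun = true;
    } else if (arg === "--limit") {
      // A missing value becomes NaN and is rejected below.
      const value = Number(argv[++i]);
      if (!Number.isInteger(value) || value < 0) {
        throw new Error("--limit expects a non-negative integer");
      }
      options.limit = value;
    } else if (arg === "--kind") {
      const value = argv[++i];
      if (value !== "whoami" && value !== "whois" && value !== "whatis") {
        throw new Error("--kind expects whoami, whois, or whatis");
      }
      options.kind = value;
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return options;
}
```

in a Node.js entry point this would typically be called as parseFlags(process.argv.slice(2)).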