indexer

the indexer is a Node.js process that runs as a GitHub Action. it finds declared-md files across all public GitHub repositories, validates them, and writes the results to JSON.


how it works

on each run:

  1. search -- queries the GitHub Code Search API for the filenames whoami.md, whois.md, and whatis.md
  2. filter -- keeps only files at canonical locations (see below)
  3. owner check -- for whoami.md, confirms the repo owner is a GitHub user. for whois.md, confirms the repo owner is a GitHub organization.
  4. validate -- runs each file through the JSON schema validator
  5. deduplicate -- when the same subject has files at multiple locations, keeps only the highest-priority one
  6. write -- writes valid profiles to data/<kind>.json and invalid files to invalid/<kind>/
  7. commit -- commits any changes to the index repo (skips if no changes)

this is a full re-index. there is no incremental cache.
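
the owner check in step 3 can be sketched as a small predicate. this is a minimal illustration, not the indexer's actual code: the function name ownerCheckPasses is hypothetical, and it assumes the owner type arrives as the "User" / "Organization" string that the GitHub REST API reports in an account's type field. the steps above state no owner rule for whatis.md, so the sketch lets it through.

```typescript
type Kind = "whoami" | "whois" | "whatis";

// ownerType is assumed to be the `type` value GitHub reports for an account,
// e.g. "User" or "Organization". How the indexer obtains it is not shown here.
function ownerCheckPasses(kind: Kind, ownerType: string): boolean {
  if (kind === "whoami") return ownerType === "User";
  if (kind === "whois") return ownerType === "Organization";
  // Assumption: no owner rule is documented for whatis.md, so nothing is rejected.
  return true;
}
```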


canonical location filter

the indexer silently skips files that are not at one of the three canonical locations:

priority  location
1         root of the canonical repo (see per-standard rules)
2         root of <owner>/declared repo
3         .github/<filename> in any public repo

files found via search that are not at these locations are discarded.
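
a minimal sketch of how a search hit could be mapped to a priority. the names Hit and locationPriority are hypothetical, and canonicalRepo is passed in as a parameter because the per-standard rules that define it are not given in this section. exact, case-sensitive string comparison is also an assumption.

```typescript
interface Hit {
  owner: string; // repo owner login
  repo: string;  // repo name, without the owner prefix
  path: string;  // file path inside the repo, e.g. ".github/whoami.md"
}

// Returns 1, 2, or 3 for a canonical location (lower wins), or null when the
// hit is not at a canonical location and should be discarded.
function locationPriority(hit: Hit, filename: string, canonicalRepo: string): 1 | 2 | 3 | null {
  const atRoot = hit.path === filename;
  if (atRoot && hit.repo === canonicalRepo) return 1; // root of the canonical repo
  if (atRoot && hit.repo === "declared") return 2;    // root of <owner>/declared
  if (hit.path === `.github/${filename}`) return 3;   // .github/<filename> in any repo
  return null;
}
```

a hit at docs/whoami.md, for example, maps to null and is dropped before validation.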


deduplication

when the same subject has valid files at multiple locations, the indexer keeps only the file at the highest-priority location. what counts as the same subject depends on the kind:

  • for whoami.md: deduplicated by GitHub username
  • for whois.md: deduplicated by GitHub organization login
  • for whatis.md: deduplicated by the handle field
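
the rule above can be sketched as a single pass over the valid candidates, keeping the lowest priority number per key. the Candidate shape and the dedupe function are hypothetical. key stands for the username, organization login, or handle, depending on the kind.

```typescript
interface Candidate {
  key: string;      // username, organization login, or handle
  priority: number; // 1, 2, or 3 from the canonical location table; lower wins
  source: string;   // where the file was found, e.g. "owner/repo/path"
}

// Keeps exactly one candidate per key: the one at the highest-priority
// (lowest-numbered) location. Ties keep the first candidate seen.
function dedupe(candidates: Candidate[]): Candidate[] {
  const best = new Map<string, Candidate>();
  for (const c of candidates) {
    const current = best.get(c.key);
    if (current === undefined || c.priority < current.priority) {
      best.set(c.key, c);
    }
  }
  return [...best.values()];
}
```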

rate limits

the GitHub Search API caps results at 1,000 per query (10 pages of 100). when the total result count exceeds 1,000, the indexer logs a warning. future versions may use multiple queries to work around this.

the indexer sleeps 1 second between pages to avoid secondary rate limits.
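
a sketch of the paging loop described above, under stated assumptions: fetchPage is a hypothetical callback that wraps the actual search request, and collectAll, SearchPage, and the warning text are illustrative names, not the indexer's own. the cap (10 pages of 100) and the 1 second pause come from the text.

```typescript
interface SearchPage {
  totalCount: number; // total matches reported by the search
  items: string[];    // results on this page
}

type FetchPage = (page: number, perPage: number) => Promise<SearchPage>;

const PER_PAGE = 100;
const MAX_RESULTS = 1000; // search results are capped at 1,000 per query

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function collectAll(
  fetchPage: FetchPage,
  delayMs = 1000,
  warn: (msg: string) => void = console.warn,
): Promise<string[]> {
  const items: string[] = [];
  const maxPages = MAX_RESULTS / PER_PAGE;
  for (let page = 1; page <= maxPages; page++) {
    const res = await fetchPage(page, PER_PAGE);
    if (page === 1 && res.totalCount > MAX_RESULTS) {
      warn(`search matched ${res.totalCount} results; only the first ${MAX_RESULTS} are reachable`);
    }
    items.push(...res.items);
    if (res.items.length < PER_PAGE) break; // short page: nothing left to fetch
    if (page < maxPages) await sleep(delayMs); // pause between pages
  }
  return items;
}
```

passing the fetcher in as a parameter keeps the loop testable without network access.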


invalid file storage

files that are discovered but fail validation are written to invalid/<kind>/. each stored file is the original content preceded by an HTML comment that records the source, the discovery time, and the validation errors:

<!--
Source: owner/repo/path
Discovered: 2026-04-27T00:00:00Z
Validation errors:
- handle: must be 2-39 characters, lowercase letters, numbers, and hyphens only
- links.github: must be a valid GitHub URL
-->

[original file content here]

these files are committed to the index repo, so the history of validation failures stays visible over time.
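
a minimal sketch of how such a header could be assembled. the function name withFailureHeader and its parameters are hypothetical. the field labels and layout follow the example above.

```typescript
// Prepends the HTML comment header to the original file content.
function withFailureHeader(
  source: string,     // "owner/repo/path" of the discovered file
  discovered: Date,   // time of discovery
  errors: string[],   // one message per validation failure
  original: string,   // untouched file content
): string {
  // toISOString() includes milliseconds; drop them to match the example format.
  const timestamp = discovered.toISOString().replace(/\.\d{3}Z$/, "Z");
  const header = [
    "<!--",
    `Source: ${source}`,
    `Discovered: ${timestamp}`,
    "Validation errors:",
    ...errors.map((e) => `- ${e}`),
    "-->",
  ].join("\n");
  return `${header}\n\n${original}`;
}
```

one caveat of this sketch: an error message containing --> would end the comment early, so a real implementation would likely need to escape it.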


running locally

requires a GitHub personal access token (no extra scopes needed):

cd index/
npm install
npm run build

# dry run
GITHUB_TOKEN=<token> node dist/crawl.js --dry-run --limit 10

# index one kind only
GITHUB_TOKEN=<token> node dist/crawl.js --kind whoami --limit 5

# full crawl
GITHUB_TOKEN=<token> node dist/crawl.js

flags:

flag        description
--dry-run   run without writing output files
--limit N   stop after N results per kind (0 = no limit)
--kind      crawl only whoami, whois, or whatis
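
for illustration, the three flags could be parsed along these lines. this is a hand-rolled sketch, not the parser in crawl.js: parseFlags and CrawlOptions are hypothetical names. it assumes that omitting --kind means all kinds are crawled and that omitting --limit means no limit, and rejecting unknown flags is a further assumption.

```typescript
type Kind = "whoami" | "whois" | "whatis";

interface CrawlOptions {
  dryRun: boolean;
  limit: number;     // 0 = no limit
  kind: Kind | null; // null = crawl every kind (assumed default)
}

const KINDS: readonly Kind[] = ["whoami", "whois", "whatis"];

// argv is the argument list without the node binary and script path,
// e.g. process.argv.slice(2).
function parseFlags(argv: string[]): CrawlOptions {
  const opts: CrawlOptions = { dryRun: false, limit: 0, kind: null };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--dry-run") {
      opts.dryRun = true;
    } else if (arg === "--limit") {
      const n = Number(argv[++i]);
      if (!Number.isInteger(n) || n < 0) {
        throw new Error("--limit expects a non-negative integer");
      }
      opts.limit = n;
    } else if (arg === "--kind") {
      const k = argv[++i] as Kind;
      if (!KINDS.includes(k)) {
        throw new Error("--kind expects whoami, whois, or whatis");
      }
      opts.kind = k;
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return opts;
}
```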