The Long Tail of External Asset Discovery: Enriching NodeZero EAD with Tenant-Specific Knowledge

External Asset Discovery is genuinely hard. The public attack surface of any organization is a moving target: domains registered last quarter, third-party SaaS endpoints provisioned this morning, certificates rotated yesterday. Any tool that promises to find all of it on its own is either oversimplifying or running on a budget no one can afford.

NodeZero's External Asset Discovery (EAD) does very good work on the part of the surface that is reachable through automation. On a recent engagement we kicked off an EAD against two seed domains and twelve external subdomains came back, all correctly resolved, with their third-party hosting providers tagged (Microsoft 365 autodiscover, Azure App Service, AWS, Okta). That covered most of what we needed for the engagement, fully automated, in roughly fifteen minutes.

What every external attack surface also has, though, is a long tail, and the long tail rarely shows up in any general-purpose enumeration tool. That is the part this post is about: the tenant-specific subdomains that only the organization itself reliably knows about, and how to feed them into a NodeZero Asset Group via the Horizon3.ai GraphQL API before the EAD runs.

What automated discovery reaches, and what it cannot

Modern external asset discovery, NodeZero's included, draws on three broad classes of signal:

  1. Certificate transparency logs: anything that has ever served a public TLS certificate.
  2. Passive DNS aggregation: names that appeared in resolver telemetry from upstream data providers.
  3. Third-party hosting fingerprints: Microsoft autodiscover, Cloudflare, Azure, AWS, Okta, and similar services emit predictable subdomain patterns.

Together these three cover an enormous slice of the modern public attack surface: names that have appeared in CT logs or resolver telemetry, and names hosted on services whose tenant naming follows a known shape. NodeZero's EAD surfaces these with very little input from you.

What none of these signal classes cover well is the long tail of names an organization invented internally and never advertised publicly. Common examples from external engagements:

  • Subdomains for vendor products with the vendor name baked in: voice-pbx, siem-splunk, wms-manhattan, eam-maximo. These names are tenant choices and they are not in any general wordlist.
  • Subdomains using internal abbreviations: <org-prefix>-<vendor>, where <org-prefix> is a three-letter shorthand only the IT team uses. Nothing general can predict that prefix.
  • Operational subdomains that intentionally are not on CT logs: an SFTP endpoint for a single B2B partner, an HVAC vendor's remote-access portal, a regional VDI broker. They resolve, they accept connections, some do not even speak HTTPS, so CT log enumeration never sees them.
  • Numeric or environmental suffixes: -2, -prod, -dr. These are normally generated by mutation tools, but mutation requires a high-quality base list that already contains the unsuffixed form.
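The dependency in the last bullet can be made concrete with a short Python sketch. It is illustrative only: the function name, the suffix list, and the acme.example zone are our assumptions, not part of NodeZero or any particular mutation tool. It shows that suffixed candidates can only be produced for base labels someone already supplied.

```python
def mutate(base_labels, suffixes=("2", "prod", "dr"), zone="acme.example"):
    """Expand known base labels into suffixed candidate hostnames.

    Hypothetical helper: the suffix set and zone are illustrative defaults.
    Candidates are guesses and still need to be resolved before use.
    """
    names = set()
    for label in base_labels:
        names.add(f"{label}.{zone}")                 # the unsuffixed form
        for suffix in suffixes:
            names.add(f"{label}-{suffix}.{zone}")    # e.g. hvac-2, hvac-prod
    return sorted(names)


if __name__ == "__main__":
    for candidate in mutate(["hvac", "eam"]):
        print(candidate)
```

If hvac is not in the base list, hvac-2 is never generated, which is why the insider-supplied base names matter more than the mutation step itself.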

These names are not failures of any one tool. They are simply outside the reachable set for any external enumerator that does not have insider knowledge of the organization's naming conventions. The right answer is to put that insider knowledge into the discovery process, not to wait for it to arrive uninvited.

A real engagement (sanitized)

On a recent external pentest, the customer's IT team gave us a partial inventory of their public-facing endpoints. Some were already documented. Others surfaced only when we asked specific scoping questions: Do you have a remote VPN concentrator? Where does your HVAC vendor connect? What's your SFTP endpoint for partner integrations?

The list they shared back looked like this (sanitized):

sftp.acme.example
vdi.acme.example
vpn-gp.acme.example
eam.acme.example
hvac-2.acme.example
voice-pbx.acme.example

The first three are common service terms, one with a tenant-specific twist: vpn-gp carries a GlobalProtect vendor hint rather than the bare vpn that a general wordlist would carry. The last three are pure long-tail: eam for enterprise asset management (likely a Maximo or equivalent deployment), hvac-2 for a numbered HVAC vendor portal (which implies there is also an hvac-1 somewhere), and voice-pbx for a tenant naming convention around their PBX vendor.

Six names. All resolvable. All in scope for the engagement. None would have made it into our test plan if we had relied entirely on automation.

Where the names belong: the Asset Group's scope

NodeZero models external scope as an AssetGroup. Each Asset Group has an associated op_template whose osint_domains array seeds discovery, and the Horizon3.ai GraphQL API exposes a clean mutation to extend that scope:

mutation AddDomains($asset_group_uuid: String!, $domains: [StringNotEmpty]!) {
  add_domains_to_asset_group(
    asset_group_uuid: $asset_group_uuid
    domains: $domains
  ) {
    asset_group {
      uuid
      name
      external_domain_xops_count
    }
  }
}

Two design choices work in your favor here. First, the mutation is append-and-deduplicate, not overwrite: it is safe to call repeatedly, and re-running the same enrichment script does not create duplicates. Second, EAD honors osint_domains as authoritative seeds for the op-series, so any names you add are treated with the same weight as the ones NodeZero discovered for itself.
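Even with server-side deduplication, we find it useful to tidy a customer-supplied list before sending it. The following is a minimal Python sketch of that client-side pass. The function name is ours, and the normalization rules (trim whitespace, drop a trailing dot, lowercase) are our assumptions about what counts as "the same name", not documented API behavior.

```python
def normalize_domains(raw_names):
    """Return cleaned, de-duplicated hostnames in first-seen order.

    Hypothetical pre-processing step; the API is described as deduplicating
    on its own, so this only keeps the request payload tidy.
    """
    seen = set()
    result = []
    for name in raw_names:
        cleaned = name.strip().rstrip(".").lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


if __name__ == "__main__":
    pasted = [" SFTP.acme.example ", "sftp.acme.example.", "", "vdi.acme.example"]
    print(normalize_domains(pasted))
```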

This is the right shape for enrichment. You do not need to fork a configuration, you do not need to disable any of NodeZero's automated discovery, and you do not lose the third-party hosting fingerprints EAD would otherwise have produced. You are augmenting, not replacing.

The full enrichment workflow

End to end, the workflow that has been working well on external engagements:

1. Authenticate

export H3_AUTH_URL="https://api.gateway.horizon3ai.com/v1/auth"
export H3_API_URL="https://api.gateway.horizon3ai.com/v1/graphql"

JWT=$(curl -sS -X POST "$H3_AUTH_URL" \
  -H "Content-Type: application/json" \
  -d "{\"key\":\"$H3_API_KEY\"}" \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["token"])')

JWTs are valid for one hour. Re-authenticate when they expire.
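For scripted enrichment, the same authentication step can be sketched in Python using only the standard library. The function names are ours. The endpoint, the "key" request field, and the "token" response field are taken from the curl command above. Building the body with json.dumps also avoids shell-quoting problems if the key contains unusual characters.

```python
import json
import os
import urllib.request

H3_AUTH_URL = "https://api.gateway.horizon3ai.com/v1/auth"


def build_auth_request(api_key, auth_url=H3_AUTH_URL):
    """Build (but do not send) the POST request that exchanges an API key for a JWT."""
    body = json.dumps({"key": api_key}).encode("utf-8")
    return urllib.request.Request(
        auth_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_jwt(api_key, auth_url=H3_AUTH_URL, timeout=30):
    """Send the auth request and return the token string from the response."""
    request = build_auth_request(api_key, auth_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["token"]


if __name__ == "__main__":
    # Reads the same H3_API_KEY environment variable as the shell example.
    jwt = fetch_jwt(os.environ["H3_API_KEY"])
    print("token received, length:", len(jwt))
```

Because tokens expire after an hour, a long-running script should call fetch_jwt again rather than cache the token indefinitely.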

2. Add the long-tail names to the Asset Group

Send the add_domains_to_asset_group mutation shown above with the following variables:

{
  "asset_group_uuid": "your-asset-group-uuid",
  "domains": [
    "sftp.acme.example",
    "vdi.acme.example",
    "vpn-gp.acme.example",
    "eam.acme.example",
    "hvac-2.acme.example",
    "voice-pbx.acme.example"
  ]
}
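A hedged Python sketch of how that request might be assembled follows. The query text and variable names come from the mutation shown earlier, and the request body uses the usual GraphQL-over-HTTP shape of "query" plus "variables". The function names are ours, and the Authorization: Bearer header is an assumption about how the JWT is presented, so check it against the API documentation before relying on it.

```python
import json
import urllib.request

H3_API_URL = "https://api.gateway.horizon3ai.com/v1/graphql"

ADD_DOMAINS = """
mutation AddDomains($asset_group_uuid: String!, $domains: [StringNotEmpty]!) {
  add_domains_to_asset_group(
    asset_group_uuid: $asset_group_uuid
    domains: $domains
  ) {
    asset_group { uuid name external_domain_xops_count }
  }
}
"""


def build_graphql_request(jwt, query, variables, api_url=H3_API_URL):
    """Build (but do not send) a GraphQL POST request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the JWT is sent as a bearer token.
        "Authorization": f"Bearer {jwt}",
    }
    return urllib.request.Request(api_url, data=body, headers=headers, method="POST")


def add_domains_request(jwt, asset_group_uuid, domains):
    """Build the add_domains_to_asset_group request for a list of hostnames."""
    variables = {"asset_group_uuid": asset_group_uuid, "domains": list(domains)}
    return build_graphql_request(jwt, ADD_DOMAINS, variables)


if __name__ == "__main__":
    request = add_domains_request(
        "example-jwt", "your-asset-group-uuid", ["sftp.acme.example"]
    )
    print(request.data.decode("utf-8"))
    # To send it: urllib.request.urlopen(request, timeout=30)
```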

3. Launch EAD bound to the Asset Group

Send a create_op mutation with ScheduleOpFormInput:

mutation RunEAD($schedule_op_form: ScheduleOpFormInput) {
  create_op(schedule_op_form: $schedule_op_form) {
    op {
      op_id
      op_type
      op_state
      portal_url
    }
  }
}

Variables:

{
  "schedule_op_form": {
    "op_type": "ExternalAssetDiscovery",
    "asset_group_uuid": "your-asset-group-uuid"
  }
}

One small but important detail: pass asset_group_uuid inside schedule_op_form. That binding attaches the new op to the Asset Group's op-series so the discovered assets show up under the Asset Group in the portal. Calling create_op with only op_template_uuid produces a standalone op that is not bound back, so always include the asset_group_uuid in schedule_op_form when running enrichment EADs.
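Since forgetting the binding is an easy mistake in a script, one option is to build the variables through a small helper that refuses to proceed without an Asset Group UUID. This is a sketch under our own naming. Only the op_type value and the schedule_op_form and asset_group_uuid field names come from the example above.

```python
def build_ead_variables(asset_group_uuid):
    """Return create_op variables for an EAD bound to an Asset Group.

    Hypothetical guard: raises instead of silently producing a standalone op.
    """
    if not asset_group_uuid or not asset_group_uuid.strip():
        raise ValueError("asset_group_uuid is required to bind the op to its Asset Group")
    return {
        "schedule_op_form": {
            "op_type": "ExternalAssetDiscovery",
            "asset_group_uuid": asset_group_uuid.strip(),
        }
    }


if __name__ == "__main__":
    print(build_ead_variables("your-asset-group-uuid"))
```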

4. Watch and verify

EAD typically completes in fifteen to twenty minutes. Once op_state reaches done, the Asset Group's external_domain_xops_count reflects the combined discovered set (NodeZero's automated finds plus your enrichment), and external_domain_xops_page gives per-domain detail with resolved IPs and pentestability flags.

Completion time varies with the size and scope of the environment.
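Waiting for completion can be scripted as a simple polling loop. The sketch below deliberately leaves the state lookup as a callable you supply, because the exact query for reading op_state is not shown in this post. fetch_state, the interval, and the timeout are all our assumptions. The only state name taken from the text is done.

```python
import time


def wait_for_op(fetch_state, op_id, interval_s=60, timeout_s=3600,
                terminal_states=("done",), sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state(op_id) until it returns a terminal state or time runs out.

    fetch_state is a caller-supplied function (hypothetical here) that returns
    the op's current op_state as a string.
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch_state(op_id)
        if state in terminal_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"op {op_id} still in state {state!r} after {timeout_s}s")
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for a real API lookup; "running" is an illustrative state name.
    fake_states = iter(["running", "running", "done"])
    final = wait_for_op(lambda _op_id: next(fake_states), "op-123", interval_s=0)
    print("final state:", final)
```

If the API reports other terminal states, such as a failure state, add them to terminal_states so the loop does not wait for the full timeout.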

Where this enrichment fits in an engagement

The pattern slots in cleanly at two points in our workflow:

  • Before the engagement EAD, when scoping conversations with the customer's IT team turn up names we know we want covered. Inject them, run EAD, and save fifteen minutes of post-run audit. NodeZero now has the customer's full attack surface: the parts it would have found anyway, plus the long tail nobody could have inferred without an inside view.
  • After the first EAD, as a "did we miss anything?" diff. Pull the discovered set, compare against the customer's own asset inventory or against passive sources, inject any deltas, and re-run. The op-series tracks this naturally and the add_domains_to_asset_group mutation deduplicates anything already known.
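The "did we miss anything?" diff in the second bullet amounts to a set difference. A minimal Python sketch, assuming you have already exported the discovered hostnames and the customer's inventory as plain lists of strings (the function names and sample data are ours):

```python
def _norm(name):
    """Normalize a hostname for comparison: trim, drop trailing dot, lowercase."""
    return name.strip().rstrip(".").lower()


def missing_from_discovery(inventory, discovered):
    """Return inventory hostnames that the EAD results do not contain."""
    found = {_norm(name) for name in discovered}
    wanted = {_norm(name) for name in inventory if _norm(name)}
    return sorted(wanted - found)


if __name__ == "__main__":
    discovered = ["autodiscover.acme.example", "vpn-gp.acme.example"]
    inventory = ["VPN-GP.acme.example", "sftp.acme.example", "hvac-2.acme.example"]
    # The result is the list of deltas to inject before re-running EAD.
    print(missing_from_discovery(inventory, discovered))
```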

Either way, the API does the right thing without forcing us to reason about scope semantics.

Why we like this design

A lot of pentesting tools treat scope as a configuration file you have to settle at deploy time. NodeZero's API approach makes the scope a live, mutable property of the engagement, addressable programmatically, with automatic deduplication and op-series tracking. That is the part of the design that lets us layer organizational knowledge on top of automated discovery without fighting either side.

If you run external pentest engagements and you have ever wished you could feed the long tail of customer-specific subdomain knowledge into an automated discovery tool without manually editing config and losing the automation, this is the workflow. The Asset Group is the right abstraction, the mutation is the right primitive, and the op-series gives you a clean record of what was discovered when. Add what you know, let EAD find what you do not, and your discovered surface is the union of both.
