Vendor Coordination Playbook: Escalations and Evidence
Ground station operations depend on vendors: antenna and RF hardware providers, networking carriers, software suppliers, and facility partners. When something breaks during a pass window, the difference between a fast fix and a long outage often comes down to how you escalate and what evidence you provide. This playbook describes a practical, repeatable approach to working with vendors under real operational pressure.
Table of contents
- Why Vendor Coordination Needs a Playbook
- Define Ownership and Interfaces Before an Incident
- What to Collect First: The Minimum Evidence Set
- How to Triage Before Escalating
- How to Open a High-Quality Vendor Ticket
- Escalation Levels and When to Use Them
- Evidence by System Type: RF, Control, Network, and Software
- Working Sessions and War Rooms: Running the Incident
- Communication Discipline: Status Updates and Decision Logs
- Handover and Follow-Up After the Fix
- Vendor Performance Review and Prevention Work
- Glossary: Vendor Coordination Terms
Why Vendor Coordination Needs a Playbook
Vendor escalations often fail for predictable reasons: unclear ownership, missing logs, slow responses, and too much time spent arguing about where the fault is. A playbook helps avoid those traps. It gives operators a standard way to:
- Decide when to escalate and to whom.
- Send a clear problem statement backed by evidence.
- Keep the incident moving with timeboxed actions and decision points.
- Capture what happened so you can prevent repeats.
The goal is not perfection. The goal is speed, clarity, and a record that supports both technical resolution and operational accountability.
Define Ownership and Interfaces Before an Incident
The best time to define vendor boundaries is before anything breaks. When a pass fails, you want to know which vendor owns which layer and who has the authority to approve changes.
A practical vendor coordination map usually includes:
- System ownership: who owns the antenna control unit, the drive cabinet, the modem, the RF chain, the backhaul router, and site power.
- Support channels: standard support, emergency support, and after-hours contacts.
- Change permissions: which actions you are allowed to take without vendor approval.
- Access rules: how vendors connect, what they can access, and how access is logged.
- Evidence expectations: what logs and measurements are usually required for escalation.
If you can write down “what vendors need to see” for common failures, you will save time during every major incident.
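As a rough sketch, the coordination map can be kept as structured data that both operators and simple tooling can read. The example below is illustrative only: the vendor names, contacts, and permission lists are hypothetical placeholders, not a required schema.

```python
# Minimal sketch of a vendor coordination map as structured data.
# Vendor names, contacts, and systems below are hypothetical placeholders.

COORDINATION_MAP = {
    "antenna_control_unit": {
        "owner": "ExampleAntennaCo",
        "support": {
            "standard": "support@example-antenna.invalid",
            "emergency": "+1-555-0100",
            "after_hours": "+1-555-0101",
        },
        "allowed_without_approval": ["restart tracking service", "collect logs"],
        "requires_vendor_approval": ["firmware update", "servo parameter change"],
        "evidence_expected": ["ACU logs", "commanded vs actual position", "alarm list"],
    },
    "backhaul_router": {
        "owner": "ExampleCarrier",
        "support": {
            "standard": "noc@example-carrier.invalid",
            "emergency": "+1-555-0200",
        },
        "allowed_without_approval": ["read interface counters", "run traceroute"],
        "requires_vendor_approval": ["routing changes"],
        "evidence_expected": ["interface error counters", "packet loss samples"],
    },
}


def who_owns(system: str) -> str:
    """Return the owning vendor for a system, or 'unassigned' if it is not mapped."""
    return COORDINATION_MAP.get(system, {}).get("owner", "unassigned")


if __name__ == "__main__":
    print(who_owns("antenna_control_unit"))  # -> ExampleAntennaCo
```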
What to Collect First: The Minimum Evidence Set
During an incident, the instinct is often to start changing things. That can make the situation worse and destroy evidence. A better approach is to collect a small, consistent evidence set before taking major action.
The minimum evidence set should be quick to gather and useful across vendors:
- Time window: the exact start and end time of the failure, including time zone.
- Impact statement: what failed (acquisition, lock stability, decoding, delivery) and what was affected (one mission or multiple).
- Pass context: satellite ID, expected AOS/LOS, maximum elevation, and planned frequencies/modcod if applicable.
- System state: key alarms, service status, and recent changes.
- Logs: the most relevant logs for the time window, not a full-day dump.
- Measurements: a few trusted values (signal level, lock status, BER, SNR, or link margin indicators), depending on the system.
When your evidence set is consistent, vendors can respond faster, and your team can compare incidents over time.
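One way to keep the evidence set consistent is to give it a fixed shape. The sketch below is a minimal illustration in Python; the field names and the example values are assumptions, not a standard format.

```python
# Minimal sketch of the evidence set as a dataclass.
# Field names and example values are illustrative, not a required schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EvidenceSet:
    start: datetime                     # failure start, with time zone
    end: datetime                       # failure end, with time zone
    impact: str                         # what failed and what was affected
    pass_context: dict = field(default_factory=dict)   # satellite ID, AOS/LOS, max elevation
    system_state: list = field(default_factory=list)   # key alarms, service status, recent changes
    log_excerpts: list = field(default_factory=list)   # logs for the window only, not a full-day dump
    measurements: dict = field(default_factory=dict)   # e.g. signal level, lock status, BER, SNR

    def is_complete(self) -> bool:
        """True when the fields a vendor almost always asks for are present."""
        return bool(self.impact and self.log_excerpts and self.measurements)


if __name__ == "__main__":
    ev = EvidenceSet(
        start=datetime(2024, 5, 1, 10, 12, tzinfo=timezone.utc),
        end=datetime(2024, 5, 1, 10, 25, tzinfo=timezone.utc),
        impact="Loss of lock during one X-band pass; single mission affected",
        measurements={"lock": "intermittent", "rx_level_dBm": -62.4},
        log_excerpts=["10:12:03 modem: carrier lock lost"],
    )
    print(ev.is_complete())  # -> True
```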
How to Triage Before Escalating
Vendors respond best when you show that you have done basic triage and can point to a likely layer. The goal is not to prove the vendor is at fault. The goal is to narrow the problem so the vendor can start in the right place.
Quick triage steps that usually pay off
- Confirm the scope: one antenna or all antennas, one mission or all missions, one site or multiple sites.
- Check recent changes: deployments, firmware updates, configuration edits, maintenance activity.
- Check basic health: power, network connectivity, disk space, time reference status.
- Compare to baseline: normal signal level and normal lock behavior for the same satellite or band.
- Reproduce safely: if possible, rerun a non-destructive step to confirm symptoms.
Triage should be timeboxed. If you do not have a clear direction within a short window, escalate with the evidence you have and continue triage in parallel.
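A minimal sketch of that timeboxing idea is shown below. The check functions are placeholders for your own quick checks, and the 15-minute timebox is an example value, not a recommendation.

```python
# Sketch of timeboxed triage: run quick checks, stop at the deadline,
# and escalate with whatever findings exist. Check functions are placeholders.
import time


def check_scope():          return "single antenna affected"
def check_recent_changes(): return "no deployments in the last 24 h"
def check_basic_health():   return "power and network OK, time reference locked"


def triage(timebox_seconds: float = 900) -> dict:
    """Run triage checks until done or until the timebox expires."""
    deadline = time.monotonic() + timebox_seconds
    findings = {}
    for name, check in [
        ("scope", check_scope),
        ("recent_changes", check_recent_changes),
        ("basic_health", check_basic_health),
    ]:
        if time.monotonic() > deadline:
            findings["timebox_expired"] = True
            break
        findings[name] = check()
    return findings


if __name__ == "__main__":
    # Escalate with these findings either way; continue triage in parallel.
    print(triage(timebox_seconds=900))
```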
How to Open a High-Quality Vendor Ticket
A strong ticket is structured, brief, and evidence-backed. It should read like a story: what should have happened, what happened instead, and what you observed.
Ticket structure that works across vendors
- Title: a clear failure plus scope (for example, “Loss of lock during Ku-band passes on Antenna 2”).
- Summary: one paragraph describing the failure and impact.
- Timeline: key timestamps (start, first symptom, mitigation attempts, current state).
- Expected vs actual: what the station normally does and what changed.
- Evidence: short log excerpts, alarm lists, and key measurements for the time window.
- Actions taken: what you already tried and what the outcome was.
- Requested next step: what you need from the vendor (diagnosis, remote session, RMA, configuration review).
The last line should always state your operational urgency in plain terms: how many passes are at risk and when the next critical window occurs.
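The structure above can be turned into a simple formatter so every ticket comes out in the same shape. The sketch below is illustrative; the example values (antenna names, times, pass counts) are made up for demonstration.

```python
# Sketch of a ticket formatter following the structure above.
# All example values are invented for illustration.

def format_ticket(title, summary, timeline, expected, actual,
                  evidence, actions_taken, requested_next_step, urgency):
    lines = [
        f"Title: {title}",
        f"Summary: {summary}",
        "Timeline:",
        *[f"  - {ts}: {event}" for ts, event in timeline],
        f"Expected: {expected}",
        f"Actual: {actual}",
        "Evidence:",
        *[f"  - {item}" for item in evidence],
        "Actions taken:",
        *[f"  - {item}" for item in actions_taken],
        f"Requested next step: {requested_next_step}",
        f"Operational urgency: {urgency}",   # always the last line
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_ticket(
        title="Loss of lock during Ku-band passes on Antenna 2",
        summary="Intermittent loss of lock on all Ku-band passes since 08:00 UTC.",
        timeline=[("08:02 UTC", "first loss of lock"),
                  ("08:40 UTC", "modem restart, no change")],
        expected="Stable lock through the pass at nominal receive level",
        actual="Lock drops every 2-3 minutes; receive level unchanged",
        evidence=["modem log excerpt 08:00-09:00 UTC", "receive level trend"],
        actions_taken=["restarted modem service", "verified LNB power"],
        requested_next_step="remote diagnostic session",
        urgency="3 passes at risk today; next critical window 14:05 UTC",
    ))
```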
Escalation Levels and When to Use Them
Not every issue should trigger an emergency escalation. A simple escalation ladder helps teams act consistently, even across shifts.
Typical escalation ladder
- Level 1 (Standard support): non-urgent issue, workaround exists, no immediate pass impact.
- Level 2 (Priority support): recurring failure, service degraded, some pass risk.
- Level 3 (Emergency escalation): active outage, missed contacts, major customer impact, or safety risk.
- Executive escalation: severe, prolonged outage or repeated vendor non-response that threatens commitments.
Escalation should be tied to measurable impact and time. For example, if the next critical pass is within a short window and there is no workaround, escalate immediately rather than waiting for standard response times.
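A small helper can make the ladder consistent across shifts. In the sketch below, the four-hour threshold and the exact conditions are assumptions chosen to illustrate the idea, not policy; executive escalation is a human decision and is left outside the function.

```python
# Sketch of an escalation-level picker mirroring the ladder above.
# The four-hour threshold is an assumption, not policy.
from datetime import timedelta


def escalation_level(active_outage: bool,
                     workaround_exists: bool,
                     pass_at_risk: bool,
                     time_to_next_critical_pass: timedelta) -> str:
    imminent = time_to_next_critical_pass < timedelta(hours=4)
    if active_outage or (pass_at_risk and not workaround_exists and imminent):
        return "Level 3 (Emergency escalation)"
    if pass_at_risk or not workaround_exists:
        return "Level 2 (Priority support)"
    return "Level 1 (Standard support)"


if __name__ == "__main__":
    # No outage yet, but no workaround and a critical pass in two hours:
    print(escalation_level(
        active_outage=False,
        workaround_exists=False,
        pass_at_risk=True,
        time_to_next_critical_pass=timedelta(hours=2),
    ))  # -> Level 3 (Emergency escalation)
```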
Evidence by System Type: RF, Control, Network, and Software
Different vendors expect different evidence. The most helpful approach is to provide a small set of high-signal data that points to a layer, then offer deeper logs on request.
RF and signal chain issues
RF vendors usually want proof of what the signal looked like and what equipment reported. Focus on stability, levels, and any sudden shifts.
- Key items: receive level trend, noise floor changes, lock status, BER/FER if available, and any converter or amplifier alarms.
- Useful context: weather conditions that could affect the link and whether other carriers in the band looked normal.
Antenna control and pointing issues
Control vendors typically need evidence about tracking behavior and servo performance rather than RF measurements alone.
- Key items: commanded vs actual position, encoder status, tracking mode, limit events, and wind or servo alarms.
- Useful context: whether the issue affects a specific axis or appears only at certain elevations.
Network and backhaul issues
Carriers and network vendors respond better when you show packet-level symptoms and timestamps that align with observed delivery failures.
- Key items: link state changes, packet loss, latency spikes, routing changes, and interface error counters.
- Useful context: whether local traffic is impacted or only external connectivity.
Software and automation issues
Software vendors usually need application logs, version information, and a clear reproduction path when possible.
- Key items: service logs for the incident window, recent deployments, configuration diffs, and error messages with timestamps.
- Useful context: whether the failure is deterministic (always happens) or intermittent (happens sometimes).
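These per-vendor lists can double as checklists. The sketch below keys them by system type so an operator can see at a glance what is still missing; the item labels are descriptive, not field names from any particular product.

```python
# Sketch of an evidence checklist keyed by system type, following the
# items listed above. Labels are descriptive, not product field names.

EVIDENCE_BY_SYSTEM = {
    "rf": [
        "receive level trend", "noise floor changes", "lock status",
        "BER/FER if available", "converter or amplifier alarms",
        "weather conditions", "status of other carriers in the band",
    ],
    "antenna_control": [
        "commanded vs actual position", "encoder status", "tracking mode",
        "limit events", "wind or servo alarms", "affected axis or elevation range",
    ],
    "network": [
        "link state changes", "packet loss", "latency spikes",
        "routing changes", "interface error counters",
        "local vs external traffic impact",
    ],
    "software": [
        "service logs for the incident window", "recent deployments",
        "configuration diffs", "error messages with timestamps",
        "deterministic vs intermittent behavior",
    ],
}


def missing_evidence(system: str, collected: set) -> list:
    """Return checklist items not yet collected for the given system type."""
    return [item for item in EVIDENCE_BY_SYSTEM.get(system, []) if item not in collected]


if __name__ == "__main__":
    print(missing_evidence("rf", {"receive level trend", "lock status"}))
```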
Working Sessions and War Rooms: Running the Incident
For high-impact outages, you often need a focused working session with multiple vendors and internal teams. These sessions should have structure, or they become long conversations without progress.
How to structure a working session
- Assign a coordinator: one person owns the timeline, next steps, and updates.
- State the objective: restore operations, find a workaround, or isolate the fault domain.
- Timebox experiments: agree on what will be tried and how long you will spend before switching tactics.
- Control changes: track every change made during the session so you can roll back safely.
- Keep a decision log: record key conclusions and why they were made.
The fastest sessions are the ones where evidence is ready and where the team can answer basic questions immediately: what changed, what is the scope, and what is the next critical pass window.
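A lightweight log object (or a shared document with the same columns) is enough to keep change control and the decision log in one place. The sketch below is illustrative, with made-up example entries.

```python
# Sketch of a war-room change and decision log kept during the session.
# The structure is illustrative; a shared document works just as well.
from datetime import datetime, timezone


class IncidentLog:
    def __init__(self):
        self.changes = []    # every change applied, so it can be rolled back
        self.decisions = []  # key conclusions and why they were made

    def record_change(self, who: str, what: str, rollback: str):
        self.changes.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "who": who, "what": what, "rollback": rollback,
        })

    def record_decision(self, decision: str, reason: str):
        self.decisions.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "decision": decision, "reason": reason,
        })

    def rollback_plan(self) -> list:
        """Changes listed newest first, i.e. the order in which to undo them."""
        return [c["rollback"] for c in reversed(self.changes)]


if __name__ == "__main__":
    log = IncidentLog()
    log.record_change("operator A", "disabled auto-track on Antenna 2",
                      rollback="re-enable auto-track on Antenna 2")
    log.record_decision("pursue RF chain, not network",
                        reason="packet loss normal; receive level dropped at failure time")
    print(log.rollback_plan())
```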
Communication Discipline: Status Updates and Decision Logs
Communication is part of the fix. Without disciplined updates, teams repeat work, lose context across shifts, or misunderstand what vendors already tried.
What to include in status updates
- Current state: what works, what does not, and what is degraded.
- Impact: passes affected, data delayed, or services unavailable.
- Next actions: what is being tested next and who owns it.
- Time horizon: the next checkpoint time and the next critical pass.
A short, repeatable format makes updates easier for operators to write and easier for vendors to follow. Over time, it also builds an incident history that helps with trend detection and prevention.
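One possible shape for that repeatable format is sketched below; the field names mirror the list above, and the example content is invented.

```python
# Sketch of a repeatable status-update template.
# Field names follow the list above; example content is illustrative.

STATUS_TEMPLATE = """\
[{time}] Incident status
Current state:   {current_state}
Impact:          {impact}
Next actions:    {next_actions}
Next checkpoint: {next_checkpoint}   Next critical pass: {next_pass}
"""

if __name__ == "__main__":
    print(STATUS_TEMPLATE.format(
        time="2024-05-01 11:00 UTC",
        current_state="Antenna 2 degraded; Antennas 1 and 3 nominal",
        impact="2 passes lost, 1 delivery delayed",
        next_actions="vendor remote session on modem (owner: operator B)",
        next_checkpoint="12:00 UTC",
        next_pass="14:05 UTC",
    ))
```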
Handover and Follow-Up After the Fix
Once service is restored, it is tempting to close the incident and move on. But the most important prevention work happens right after recovery, while details are fresh.
Practical follow-up checklist
- Verify stability: confirm multiple successful passes or a defined validation window.
- Capture the final state: record configurations and versions after the fix.
- Close the evidence loop: attach final logs and the root cause statement, if known.
- Document the workaround: if a temporary fix was used, define how long it is acceptable.
- Update runbooks: improve steps and add missing evidence expectations.
- Track preventive actions: parts replacement, firmware updates, monitoring improvements, or training needs.
A clean handover also matters across shifts. If the next operator cannot tell what changed and why, the same issue can be reintroduced accidentally.
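Verifying stability can be as simple as requiring a run of consecutive successful passes before closing. The sketch below assumes a threshold of three passes, which is an example value rather than a standard.

```python
# Sketch of a post-fix stability check: require a number of consecutive
# successful passes before declaring the incident closed.
# The threshold of three is an assumption, not a standard.

def stable_after_fix(pass_results: list, required_successes: int = 3) -> bool:
    """pass_results is ordered oldest to newest; True means the pass succeeded."""
    if len(pass_results) < required_successes:
        return False
    return all(pass_results[-required_successes:])


if __name__ == "__main__":
    print(stable_after_fix([False, True, True, True]))   # True: last three passed
    print(stable_after_fix([True, True, False, True]))   # False: keep the validation window open
```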
Vendor Performance Review and Prevention Work
Over time, repeated incidents often reveal patterns: a component that fails in cold weather, a carrier path that flaps monthly, or a software build that breaks under high load. A vendor coordination playbook should include a periodic review step so you do not treat recurring incidents as “bad luck.”
Useful review topics:
- Response times: how long until first response and how long until effective action.
- Evidence quality: whether your evidence set was sufficient and what was missing.
- Root cause clarity: whether the vendor provided a clear technical explanation.
- Prevention actions: monitoring changes, spares strategy, environmental protections, or design improvements.
- Escalation effectiveness: whether escalation paths worked as expected.
The best vendor relationships are predictable and structured. Clear evidence and clear escalation expectations improve outcomes on both sides.
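Response-time review is easier when the numbers come straight from incident records. The sketch below computes median time to first response and to effective action; the record format and the sample timestamps are illustrative assumptions.

```python
# Sketch of response-time metrics for a periodic vendor review.
# The record format and sample timestamps are illustrative.
from datetime import datetime
from statistics import median


def response_metrics(incidents: list) -> dict:
    """Each incident needs 'opened', 'first_response', and 'effective_action' datetimes."""
    to_first = [(i["first_response"] - i["opened"]).total_seconds() / 3600 for i in incidents]
    to_action = [(i["effective_action"] - i["opened"]).total_seconds() / 3600 for i in incidents]
    return {
        "median_hours_to_first_response": median(to_first),
        "median_hours_to_effective_action": median(to_action),
        "incident_count": len(incidents),
    }


if __name__ == "__main__":
    sample = [
        {"opened": datetime(2024, 4, 2, 9, 0),
         "first_response": datetime(2024, 4, 2, 10, 30),
         "effective_action": datetime(2024, 4, 2, 15, 0)},
        {"opened": datetime(2024, 4, 20, 22, 0),
         "first_response": datetime(2024, 4, 21, 1, 0),
         "effective_action": datetime(2024, 4, 21, 9, 0)},
    ]
    print(response_metrics(sample))
```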
Glossary: Vendor Coordination Terms
Escalation
A request for faster or higher-priority vendor response due to operational urgency or high impact.
Evidence set
A standard collection of logs, measurements, timestamps, and context used to support diagnosis during an incident.
Scope
The boundaries of an incident, such as which missions, antennas, sites, or services are affected.
Timeline
A time-ordered record of symptoms, actions taken, and outcomes during an incident.
Decision log
A short record of key decisions made during an incident and the reasons for them.
Workaround
A temporary operational method used to restore partial or full service while root cause is still being addressed.
War room
A focused working session where internal teams and vendors coordinate actions to restore service and isolate the fault domain.
Root cause
The underlying technical reason an incident occurred, distinct from symptoms and immediate triggers.