Create an MCP Incident Response Plan

An MCP incident response plan explains what to do when a connected server, tool, source, destination, or workflow becomes unsafe or unavailable.

A good plan helps people respond consistently when:

  • an MCP server is unreachable;
  • authentication fails;
  • expected tools disappear;
  • a write action affects the wrong destination;
  • repeated tool calls appear;
  • sensitive information is exposed;
  • a workflow returns unreliable results;
  • scheduled runs fail repeatedly;
  • permissions become broader than expected; or
  • a server can no longer be trusted.

The plan should be written before an incident happens.

What the plan should achieve

The plan should help your team:

  • recognise an incident;
  • understand the likely impact;
  • stop risky activity;
  • pause dependent schedules;
  • preserve useful evidence;
  • assign clear ownership;
  • communicate with affected users;
  • restore one layer at a time;
  • verify recovery;
  • resume automation gradually; and
  • learn from the event afterward.

It should be clear enough to use during a stressful situation.

Define what counts as an MCP incident

An incident is more than a normal error.

Treat a problem as an incident when it may affect:

  • data confidentiality;
  • data accuracy;
  • system availability;
  • external records;
  • customer or employee information;
  • scheduled automation;
  • write destinations;
  • important business processes;
  • service trust; or
  • several dependent workflows.

A single harmless no-match result is not normally an incident.

Repeated failures, unsafe writes, or unknown data movement may be.

Define the incident scope

Your plan should cover:

  • local MCP servers;
  • remote MCP servers;
  • built-in tools;
  • external tools;
  • Workbench use;
  • Studio workflows;
  • RunFlows execution;
  • scheduled workflows;
  • connected accounts;
  • local files and databases;
  • external sources;
  • external destinations; and
  • mixed local and cloud workflows.

Record which systems are outside the plan and who owns them.

Identify the complete dependency path

Document the normal path for each important server.

For example:

Feluda
→ AI Model
→ MCP Server
→ Connected Source
→ Tool Result
→ Workflow Decision
→ External Destination

An incident may affect only one part of this path.

The response team should know how to isolate each layer.

Assign incident roles

Define who is responsible for each part of the response.

Useful roles include:

| Role | Responsibility |
| --- | --- |
| Incident owner | Coordinates the full response and keeps decisions clear |
| Feluda owner | Reviews MCP Servers, Workbench, Studio, RunFlows, and schedules |
| Server owner | Checks the MCP server, endpoint, tools, and updates |
| Source owner | Confirms whether the connected data source is available and current |
| Destination owner | Verifies external writes and prevents duplicates |
| Security or privacy reviewer | Reviews sensitive-data exposure and access changes |
| Communications owner | Updates affected users and stakeholders |
| Recovery reviewer | Confirms tests before automation resumes |

One person may hold more than one role in a small team.

Record contact details

Keep current contact details for:

  • Feluda administrator;
  • MCP server owner;
  • connected-service owner;
  • local IT support;
  • network or VPN support;
  • security or privacy contact;
  • workflow owner;
  • schedule owner;
  • provider support; and
  • final approval authority.

Store the contact list where it remains available during an outage.

Do not include private credentials.

Define severity levels

Use simple severity levels that match the real impact.

For example:

| Level | Description | Example |
| --- | --- | --- |
| Level 1: Low | Limited issue with no important data or write risk | One non-critical read tool fails |
| Level 2: Moderate | Several users or workflows are affected | Scheduled reports fail repeatedly |
| Level 3: High | Important writes, sensitive data, or production processes may be affected | Wrong records are updated |
| Level 4: Critical | Broad exposure, destructive actions, or major operational impact | Credentials are exposed or delete actions run unexpectedly |

Keep the levels easy to apply.
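
As a rough illustration, the example levels above could be encoded as a small triage helper. This is a sketch only: `IncidentSignals` and `classify_severity` are hypothetical names invented for this example, not part of Feluda or any MCP library, and the thresholds are assumptions you would replace with your own.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical flags a responder fills in during triage."""
    credentials_exposed: bool = False
    unexpected_delete: bool = False
    wrong_records_updated: bool = False
    sensitive_data_at_risk: bool = False
    affected_workflows: int = 1


def classify_severity(s: IncidentSignals) -> int:
    """Return 1 (low) to 4 (critical), checking the worst case first."""
    if s.credentials_exposed or s.unexpected_delete:
        return 4
    if s.wrong_records_updated or s.sensitive_data_at_risk:
        return 3
    if s.affected_workflows > 1:
        return 2
    return 1
```

Checking the most severe conditions first means a single high-impact signal is never hidden by a lower-level match.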

Define escalation triggers

Escalate immediately when:

  • credentials appear in tool input or output;
  • a delete tool runs unexpectedly;
  • a write reaches the wrong account or environment;
  • sensitive information is sent to an unapproved service;
  • repeated writes create duplicates;
  • an unknown server or tool appears;
  • a production account has excessive access;
  • the server owner cannot be identified;
  • important data may have been altered; or
  • the environment cannot be trusted.

Do not wait for several failures when the first one is high-impact.

Define warning conditions

A warning may include:

  • one failed known-good test;
  • slower-than-normal runtime;
  • one missing non-critical field;
  • one schedule conflict;
  • one authentication expiry warning;
  • one unavailable low-risk source; or
  • one unexplained tool rename.

Warnings should still receive an owner and follow-up date.

Define outage conditions

Treat the service as unavailable when:

  • Feluda cannot reach the server;
  • authentication fails;
  • required tools disappear;
  • every known-good test fails;
  • the connected source cannot be reached;
  • required write destinations cannot be reached;
  • repeated timeouts occur;
  • scheduled workflows fail repeatedly; or
  • results cannot be trusted.

Pause dependent automation when the impact is unclear.

Define unsafe-write conditions

Stop write activity when:

  • the destination is uncertain;
  • the account or environment is unclear;
  • a timeout occurs after a write call;
  • a partial write may have completed;
  • duplicate actions appear;
  • the wrong fields are changing;
  • approval cannot be verified;
  • a workflow repeats the same call; or
  • the tool result cannot be matched to the external destination.

The plan should make this stop decision immediate.
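
One way to make the stop decision mechanical is a guard that lists every matching stop condition before a write proceeds. The sketch below covers only some of the conditions above, and `WriteContext` and `stop_reasons` are hypothetical names used for illustration rather than an existing API.

```python
from dataclasses import dataclass


@dataclass
class WriteContext:
    """Hypothetical snapshot of what is known before (or just after) a write."""
    destination_confirmed: bool
    environment_confirmed: bool
    approval_verified: bool
    timed_out_after_write: bool = False
    identical_calls_seen: int = 1


def stop_reasons(ctx: WriteContext) -> list[str]:
    """Return every matching stop condition; an empty list means none matched."""
    reasons = []
    if not ctx.destination_confirmed:
        reasons.append("destination is uncertain")
    if not ctx.environment_confirmed:
        reasons.append("account or environment is unclear")
    if not ctx.approval_verified:
        reasons.append("approval cannot be verified")
    if ctx.timed_out_after_write:
        reasons.append("timeout occurred after a write call")
    if ctx.identical_calls_seen > 1:
        reasons.append("workflow repeated the same call")
    return reasons
```

Any non-empty result should stop write activity. The reasons can go straight into the incident record.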

Prepare an incident checklist

Keep a short first-response checklist.

For example:

1. Confirm the affected server, tool, workflow, and schedule.
2. Stop risky write actions.
3. Pause dependent schedules.
4. Review active RunFlows executions.
5. Check external destinations for completed writes.
6. Preserve safe evidence.
7. Assign an incident owner.
8. Classify severity.
9. Notify required contacts.
10. Begin controlled diagnosis.

This checklist should be easy to find.

Prepare known-good tests

Keep one stable read-only test for each important MCP server.

The test should use:

  • non-sensitive input;
  • a stable source;
  • a predictable result;
  • known required fields;
  • an expected runtime; and
  • no write action.

For example:

Use only the enabled Internal Knowledge Search tool.

Search for "MCP incident test".

Return:
1. result title;
2. source identifier;
3. returned summary;
4. last updated date; and
5. any warning.

Do not create or change anything.

Record the expected baseline

For each known-good test, record:

  • server;
  • tool;
  • test input;
  • expected source;
  • expected result;
  • expected fields;
  • expected warning state;
  • normal runtime;
  • last successful date; and
  • reviewer.

This gives the recovery team something reliable to compare against.
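
A baseline is easier to compare against when it is stored in a structured form. The following is a minimal sketch, assuming you keep baselines as JSON files; the `KnownGoodBaseline` class and its field names are illustrative and simply mirror the list above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class KnownGoodBaseline:
    """One record per known-good test (field names are assumptions)."""
    server: str
    tool: str
    test_input: str
    expected_source: str
    expected_result: str
    expected_fields: list[str]
    expected_warning_state: str
    normal_runtime_seconds: float
    last_successful_date: str  # ISO 8601 date, for example "2024-05-01"
    reviewer: str


def to_json(baseline: KnownGoodBaseline) -> str:
    """Serialise the baseline so it can be stored with the plan."""
    return json.dumps(asdict(baseline), indent=2, sort_keys=True)
```

Store the file where it stays available during an outage, and keep credentials out of it.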

Prepare safe write tests

Important write tools may need a separate recovery test.

Use:

  • a test account;
  • a temporary item;
  • a reversible action;
  • a known destination;
  • explicit approval;
  • duplicate prevention; and
  • external verification.

Do not use production writes as the first recovery test.

Prepare a server inventory

Keep a current record for every important MCP server.

Include:

  • server name;
  • local or remote location;
  • endpoint;
  • owner;
  • service operator;
  • authentication method;
  • connected account;
  • tool list;
  • read and write tools;
  • sources;
  • destinations;
  • permissions;
  • dependent workflows;
  • dependent schedules;
  • known-good test; and
  • escalation contact.

Do not record raw secrets.
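
If the inventory is kept as structured data, a short completeness check can flag entries that have gone stale. This sketch assumes each server is stored as a dictionary; the snake_case keys are invented for this example to mirror the list above.

```python
# Keys mirror the inventory list above; the exact names are an assumption.
REQUIRED_FIELDS = [
    "server_name", "location", "endpoint", "owner", "service_operator",
    "authentication_method", "connected_account", "tool_list",
    "read_and_write_tools", "sources", "destinations", "permissions",
    "dependent_workflows", "dependent_schedules", "known_good_test",
    "escalation_contact",
]


def missing_fields(entry: dict) -> list[str]:
    """List inventory fields that are absent or left blank."""
    return [f for f in REQUIRED_FIELDS if entry.get(f) in (None, "")]
```

An empty list (for example, no dependent schedules) counts as filled in. Only missing or blank values are reported.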

Prepare a workflow dependency map

For every important workflow, record:

  • workflow name;
  • owner;
  • model;
  • MCP server;
  • MCP tools;
  • source;
  • destination;
  • read or write action;
  • permissions;
  • schedule;
  • known error path;
  • recovery priority; and
  • replacement or fallback process.

A dependency map makes outage scope easier to understand.
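
For instance, with the map stored as data, the scope of a server outage becomes a simple lookup. The sketch below is illustrative only: the `Workflow` class, its fields, and `outage_scope` are hypothetical and not tied to how Feluda stores workflows.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """A reduced dependency-map entry (hypothetical fields)."""
    name: str
    mcp_server: str
    writes: bool
    schedules: list[str] = field(default_factory=list)
    recovery_priority: int = 3  # 1 = restore first


def outage_scope(workflows: list[Workflow], server: str) -> dict:
    """Summarise what depends on one MCP server, highest priority first."""
    hit = sorted(
        (w for w in workflows if w.mcp_server == server),
        key=lambda w: w.recovery_priority,
    )
    return {
        "workflows": [w.name for w in hit],
        "write_workflows": [w.name for w in hit if w.writes],
        "schedules_to_pause": sorted({s for w in hit for s in w.schedules}),
    }
```

The `schedules_to_pause` list gives responders a starting point for containment, and `write_workflows` shows where destination checks are needed.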

Prepare a schedule inventory

Record:

  • schedule name;
  • workflow;
  • frequency;
  • next-run time;
  • source;
  • destination;
  • normal runtime;
  • conflict risk;
  • write behaviour;
  • owner;
  • reviewer; and
  • pause procedure.

Schedule Manager should be part of the incident response process.

Prepare access and permission records

Document the intended:

  • account scope;
  • read access;
  • write access;
  • URL rules;
  • IP rules;
  • path rules;
  • port rules;
  • source scope;
  • destination scope; and
  • approval requirements.

This helps identify unexpected permission changes during an incident.
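
A documented permission record can be compared with the permissions actually in effect. The sketch below assumes both are expressed as rule types mapped to sets of allowed values; that representation and the `permission_drift` function are assumptions made for this example.

```python
def permission_drift(
    intended: dict[str, set[str]],
    actual: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
    """Report rules that differ from the documented record.

    "unexpected" holds access that exists but was never documented;
    "missing" holds documented access that is no longer present.
    """
    drift = {}
    for rule in sorted(set(intended) | set(actual)):
        want = intended.get(rule, set())
        have = actual.get(rule, set())
        if want != have:
            drift[rule] = {"unexpected": have - want, "missing": want - have}
    return drift
```

During an incident, any `unexpected` entry deserves attention first, because it indicates access broader than approved.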

Prepare backup and recovery information

Keep current records for:

  • Feluda user-data backups;
  • Studio workflow backups;
  • Journal backups;
  • local source backups;
  • database backups;
  • MCP server configuration;
  • model and runner details;
  • restoration instructions;
  • recovery test results; and
  • backup owner.

Do not store raw credentials in normal backup documents.

Define detection methods

Incidents may be detected through:

  • MCP Servers connection errors;
  • Workbench Activity;
  • RunFlows errors;
  • Emit output;
  • Schedule Manager failures;
  • conflict warnings;
  • external-service activity;
  • user reports;
  • known-good tests;
  • authentication expiry notices;
  • local service monitoring; or
  • provider-status updates.

Use more than one signal for important services.

Use Workbench Activity as evidence

Workbench Activity can show:

  • which tool was called;
  • what input was sent;
  • what result returned;
  • whether a warning appeared;
  • whether an error occurred; and
  • whether the model repeated the call.

The response plan should require Activity review for interactive incidents.

Use RunFlows as evidence

RunFlows can show:

  • starting input;
  • tool calls;
  • tool input;
  • raw output;
  • intermediate values;
  • warnings;
  • errors;
  • branch decisions; and
  • final output.

The response plan should identify who reviews these details.

Use Emit blocks for visibility

Emit blocks can reveal:

  • raw tool results;
  • extracted fields;
  • approval proposals;
  • write parameters;
  • error messages;
  • branch values; and
  • intermediate decisions.

Add them to important workflows before an incident occurs.

Define evidence-handling rules

Record only what is needed.

Useful evidence may include:

  • date and time;
  • server name;
  • tool name;
  • workflow name;
  • schedule name;
  • safe sample input;
  • visible error;
  • warning;
  • runtime;
  • source;
  • destination;
  • recent change; and
  • action taken.

Do not include credentials or unrelated sensitive data.
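
Evidence copied from tool output can be passed through a redaction step before it is saved. This is a minimal sketch using Python's standard `re` module. The two patterns are illustrative assumptions that catch only common `key=value` and bearer-token shapes, so manual review is still needed.

```python
import re

# Illustrative patterns only; real credentials take many more forms.
_KEY_VALUE = re.compile(
    r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)(\S+)"
)
_BEARER = re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._~+/=-]+")


def redact(text: str) -> str:
    """Replace credential-like values with a placeholder before storage."""
    text = _KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    return _BEARER.sub(r"\1 [REDACTED]", text)
```

The key name and separator are kept so the record still shows that a credential appeared, without preserving its value.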

Protect incident records

Incident records may contain:

  • internal system names;
  • user information;
  • file paths;
  • source details;
  • tool output;
  • external destinations;
  • account names; or
  • security findings.

Limit access and follow your normal retention process.

Define immediate containment actions

The plan should allow responders to:

  • pause schedules;
  • stop active write workflows;
  • disable a write tool;
  • disable an MCP connection;
  • remove a broad permission;
  • stop a local server;
  • disconnect a local network service;
  • revoke a credential;
  • disable an account; or
  • block an external destination.

High-impact containment may require approval from the appropriate owner.

Pause schedules first when risk is unclear

Scheduled workflows may continue running without a person watching them.

Pause schedules when:

  • the server is unavailable;
  • authentication fails;
  • required tools disappear;
  • results become unreliable;
  • write destinations are uncertain;
  • repeated calls appear;
  • permission failures repeat; or
  • the server is no longer trusted.

Record which schedules were paused.

Check active RunFlows executions

During containment, review:

  • active runs;
  • long-running calls;
  • failed runs;
  • queued runs;
  • repeated calls;
  • partial writes; and
  • overlapping schedules.

Stop or isolate risky executions.

Check external destinations

For write incidents, inspect:

  • records;
  • files;
  • tasks;
  • messages;
  • Journal entries;
  • database changes;
  • status changes; and
  • timestamps.

Do this before retrying or rolling back.

Define communication rules

The plan should state:

  • who communicates;
  • who receives updates;
  • what information is shared;
  • when updates are sent;
  • how uncertainty is described;
  • when users should stop using a tool;
  • when users may resume; and
  • who approves the final recovery notice.

Keep user-facing communication clear and non-technical.

Prepare an initial incident message

A simple message may say:

The connected MCP service is currently unavailable.

Affected workflows have been paused to prevent incomplete or repeated
actions. No new result should be treated as current until recovery is
confirmed.

The service owner is reviewing the issue.

Do not promise a recovery time unless it is confirmed.

Prepare a write-risk message

For uncertain writes:

The connected service did not confirm the write action.

Do not repeat the action yet. The destination is being reviewed to
determine whether the first request completed.

This helps prevent duplicates.

Prepare a recovery message

A recovery message may say:

The MCP service has passed connection, read, workflow, and scheduled tests.

Dependent workflows are being resumed gradually. Please report any
missing results, repeated actions, or unexpected destination changes.

Send it only after verification.

Define diagnosis order

Use a consistent diagnosis sequence:

  1. Feluda application.
  2. Selected AI model.
  3. MCP connection.
  4. Endpoint.
  5. Authentication.
  6. Permissions.
  7. Local network or internet.
  8. MCP server process.
  9. Connected source.
  10. External destination.
  11. Workflow mapping.
  12. Schedule conditions.

Find the first failing layer.
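
The sequence above can be sketched as an ordered list of checks that stops at the first failure. The check functions are placeholders you would supply yourself. Nothing here calls a real Feluda or MCP API, and `first_failing_layer` is a hypothetical helper.

```python
from typing import Callable, Optional

Check = Callable[[], bool]


def first_failing_layer(checks: list[tuple[str, Check]]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            return name
    return None  # every layer passed


# Placeholder checks; replace each lambda with a real probe.
checks = [
    ("Feluda application", lambda: True),
    ("Selected AI model", lambda: True),
    ("MCP connection", lambda: True),
    ("Authentication", lambda: False),
    ("Permissions", lambda: False),
]
print(first_failing_layer(checks))
```

Stopping at the first failure keeps later layers from being blamed for a problem that started earlier in the path.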

Diagnose local MCP incidents

For a local server, check:

  • computer power;
  • sleep state;
  • process status;
  • application status;
  • endpoint;
  • port;
  • firewall;
  • local files;
  • local database;
  • storage;
  • memory;
  • service startup; and
  • recent operating-system changes.

Record the normal startup process in advance.

Diagnose remote MCP incidents

For a remote server, check:

  • network access;
  • DNS;
  • certificate;
  • VPN;
  • proxy;
  • provider status;
  • endpoint;
  • authentication;
  • account status;
  • remote source;
  • remote destination; and
  • provider changes.

Know how to contact the remote owner.

Diagnose model-related incidents

A model failure can look like an MCP incident.

Check whether:

  • the provider is available;
  • the model is available;
  • the model supports tool use;
  • the correct model is selected;
  • the prompt is clear;
  • only intended tools are enabled;
  • context is not too long; and
  • the model is repeating calls.

Test the model without tools.

Diagnose source-related incidents

The MCP server may work while its source fails.

Check:

  • file availability;
  • folder path;
  • local database;
  • cloud storage;
  • search index;
  • business application;
  • account access;
  • data freshness; and
  • record existence.

Test the source separately when possible.

Diagnose destination-related incidents

A write tool may fail because the destination changed.

Check:

  • account;
  • workspace;
  • project;
  • record;
  • folder;
  • recipient;
  • environment;
  • write permission;
  • overwrite behaviour; and
  • service availability.

Confirm the destination before repeating the action.

Diagnose permission incidents

Review:

  • account scope;
  • read access;
  • write access;
  • URL allowlists;
  • IP allowlists;
  • path allowlists;
  • port allowlists;
  • blocked destinations; and
  • temporary permission changes.

Do not broaden access without understanding the cause.

Define recovery steps

Recovery should restore one layer at a time.

For each change:

  1. record the current state;
  2. make one approved correction;
  3. run the known-good read test;
  4. review Activity;
  5. compare with the baseline;
  6. test the affected workflow;
  7. verify the destination when relevant; and
  8. decide whether to continue.

Avoid making several untracked changes at once, because a combined change hides which correction actually worked.

Define local recovery steps

A local recovery sequence may be:

  1. wake or restart the computer;
  2. start the model runner;
  3. confirm the model;
  4. start the MCP server;
  5. confirm the endpoint and port;
  6. confirm local files or database;
  7. restore narrow permissions;
  8. reconnect in MCP Servers;
  9. run the known-good test; and
  10. test affected workflows.

Define remote recovery steps

A remote recovery sequence may be:

  1. confirm network or VPN;
  2. review provider status;
  3. confirm endpoint and certificate;
  4. renew authentication;
  5. confirm account access;
  6. confirm remote source;
  7. confirm destination;
  8. contact the provider when needed;
  9. run the known-good test; and
  10. test affected workflows.

Define credential-response steps

When a credential may be exposed:

  • stop affected access;
  • pause schedules;
  • revoke or rotate the credential;
  • inspect account activity;
  • replace the protected value;
  • confirm the new scope;
  • test one read-only action;
  • test required writes;
  • verify external destinations; and
  • record the response.

Never copy the exposed value into the incident record.

Define data-exposure steps

When sensitive information may have reached an unapproved destination:

  • stop the workflow;
  • disable the affected tool;
  • preserve safe evidence;
  • identify what was sent;
  • identify the receiving service;
  • identify affected users or records;
  • contact the privacy or security owner;
  • revoke access when needed;
  • follow organisational notification rules; and
  • prevent repeated transmission.

Do not minimise the incident before the facts are known.

Define wrong-write steps

When a tool writes to the wrong destination:

  • stop the workflow;
  • pause related schedules;
  • identify every affected item;
  • preserve timestamps and identifiers;
  • confirm whether reversal is possible;
  • obtain approval before correction;
  • avoid duplicate corrective actions;
  • test the corrected path safely; and
  • review why the destination was wrong.

Define duplicate-write steps

When duplicates appear:

  • stop repeated calls;
  • pause the schedule;
  • inspect Activity and RunFlows;
  • compare timestamps;
  • identify the first successful write;
  • identify every duplicate;
  • confirm the approved cleanup process;
  • add duplicate prevention; and
  • test timeout handling.
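
Identifying the first successful write and its duplicates can be sketched as a grouping step over exported write records. The record shape is an assumption: each record is a dictionary with a `key` describing what makes two writes "the same" and an ISO 8601 `at` timestamp without a timezone suffix.

```python
from collections import defaultdict
from datetime import datetime


def split_duplicates(writes: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (first write per key, later duplicates) ordered by timestamp."""
    groups = defaultdict(list)
    for w in writes:
        groups[w["key"]].append(w)

    firsts, duplicates = [], []
    for items in groups.values():
        ordered = sorted(items, key=lambda w: datetime.fromisoformat(w["at"]))
        firsts.append(ordered[0])
        duplicates.extend(ordered[1:])
    return firsts, duplicates
```

The duplicates list is only a proposal for cleanup. Confirm the approved cleanup process before removing anything.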

Define recovery tests

Recovery should include:

  • MCP Servers connection check;
  • expected tool-list check;
  • known-good read-only Workbench test;
  • Workbench Activity review;
  • no-result test;
  • error-path test;
  • Studio workflow test;
  • RunFlows test;
  • safe write test when required;
  • destination verification; and
  • one-time scheduled test.

A successful connection status alone does not prove recovery.

Verify the expected tool list

After recovery, confirm that:

  • required tools appear;
  • no required tool is missing;
  • no unexpected tool appears;
  • tool names are correct;
  • descriptions remain accurate;
  • read and write behaviour is unchanged; and
  • input and output remain compatible.

A server update may have occurred during the incident.

Verify raw tool results

Compare the recovered result with the baseline.

Check:

  • source;
  • identifier;
  • required fields;
  • timestamps;
  • result count;
  • warnings;
  • runtime; and
  • final answer.

A server that connects but returns the wrong data is not recovered.
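
A comparison like this can be partly automated. The sketch below assumes both the recovered result and the baseline are plain dictionaries with the keys shown. The key names, the `compare_to_baseline` function, and the default tolerance of twice the normal runtime are illustrative assumptions.

```python
def compare_to_baseline(
    result: dict, baseline: dict, runtime_tolerance: float = 2.0
) -> list[str]:
    """Return a list of differences; an empty list means the result matches."""
    problems = []
    if result.get("source") != baseline["source"]:
        problems.append("source differs")
    present = result.get("fields", {})
    missing = [f for f in baseline["required_fields"] if f not in present]
    if missing:
        problems.append("missing fields: " + ", ".join(missing))
    if result.get("result_count") != baseline["result_count"]:
        problems.append("result count differs")
    if result.get("warnings", []) != baseline.get("warnings", []):
        problems.append("warning state differs")
    limit = baseline["runtime_seconds"] * runtime_tolerance
    if result.get("runtime_seconds", 0.0) > limit:
        problems.append("runtime is slower than normal")
    return problems
```

The final answer still needs a human read, because a structurally valid result can be wrong in substance.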

Verify write tools separately

Use a test account or safe destination.

Confirm:

  • exact destination;
  • exact fields;
  • approval;
  • write confirmation;
  • external result;
  • duplicate prevention;
  • timeout handling; and
  • reversibility.

Do not use an important production write as the first test.

Verify one-time scheduled execution

Before recurring schedules resume:

  • use a safe one-time run;
  • confirm the server is available at run time;
  • review Schedule Manager;
  • review RunFlows;
  • inspect Activity;
  • verify output;
  • verify the destination; and
  • check for duplicates.

Define recovery approval

State who can declare the service recovered.

Recovery approval should require evidence that:

  • the cause is understood or controlled;
  • known-good tests pass;
  • affected workflows pass;
  • required write tests pass;
  • destinations are correct;
  • schedules are safe to resume;
  • users have been informed; and
  • temporary access has been removed.

Resume automation gradually

Resume in stages.

A useful order is:

  1. low-risk read-only workflows;
  2. important read-only workflows;
  3. low-risk local writes;
  4. approved production writes;
  5. high-volume workflows; and
  6. overlapping schedules.

Monitor each stage before moving to the next.
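
The staged order can be sketched as a loop that resumes one stage, checks it, and pauses again at the first sign of trouble. The `resume`, `healthy`, and `pause` callbacks are placeholders for whatever manual or automated steps your team uses. They are assumptions for this example, not existing functions.

```python
from typing import Callable, Optional

STAGES = [
    "low-risk read-only workflows",
    "important read-only workflows",
    "low-risk local writes",
    "approved production writes",
    "high-volume workflows",
    "overlapping schedules",
]


def resume_in_stages(
    stages: list[str],
    resume: Callable[[str], None],
    healthy: Callable[[str], bool],
    pause: Callable[[str], None],
) -> tuple[list[str], Optional[str]]:
    """Resume stage by stage; re-pause and stop at the first unhealthy stage.

    Returns (stages left running, stage that failed or None).
    """
    running = []
    for stage in stages:
        resume(stage)
        if not healthy(stage):
            pause(stage)
            return running, stage
        running.append(stage)
    return running, None
```

In practice the `healthy` check is a monitoring period with human review, not an instant test.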

Monitor the first resumed runs

Review:

  • tool calls;
  • tool input;
  • raw output;
  • runtime;
  • warnings;
  • errors;
  • branch decisions;
  • external destinations;
  • duplicate actions; and
  • user reports.

Pause again if unexpected behaviour returns.

Define rollback conditions

Roll back or disable the recovery change when:

  • the server remains unstable;
  • required tools are missing;
  • result quality is unacceptable;
  • write destinations remain uncertain;
  • permissions are too broad;
  • repeated calls continue;
  • scheduled runs fail;
  • privacy requirements are not met; or
  • the environment cannot be verified.

The plan should identify who approves rollback.

Prepare a fallback policy

A fallback may use:

  • another approved MCP server;
  • a manual process;
  • a local copy;
  • another approved source;
  • another provider; or
  • postponement.

The fallback policy should state:

  • when fallback is allowed;
  • who approves it;
  • what data may be sent;
  • how users are informed;
  • how the result is labelled;
  • whether write actions are allowed; and
  • how the normal service is restored later.

Avoid silent fallback

Do not allow a workflow to switch silently:

  • from local to remote;
  • from one server to another;
  • from test to production;
  • from read-only to write-capable;
  • from one source to another; or
  • from one destination to another.

Make the change visible and approved.

Plan for unavailable backups

The plan should address what happens when:

  • the latest backup is missing;
  • the backup is unreadable;
  • restoration fails;
  • model files are unavailable;
  • source data is incomplete;
  • protected credentials cannot be recovered; or
  • a replacement device is not ready.

Define the minimum safe service that can be restored.

Plan for device failure

For local environments, record:

  • Feluda user-data backup;
  • model-runner installer;
  • model name and version;
  • MCP server installer;
  • source backup;
  • database backup;
  • permission record;
  • credential recovery method;
  • replacement-device requirements; and
  • restoration tests.

Plan for provider failure

For remote environments, record:

  • provider support contact;
  • service-status location;
  • alternative endpoint when approved;
  • authentication renewal process;
  • replacement-server option;
  • data export process;
  • migration plan; and
  • retirement plan.

Conduct practice exercises

Test the plan with controlled scenarios.

Examples include:

  • local MCP process stopped;
  • remote server unavailable;
  • expired credential;
  • blocked path;
  • blocked port;
  • missing tool;
  • no-result response;
  • changed result structure;
  • timed-out write;
  • duplicate write; or
  • unavailable source.

Use safe test data and destinations.

Review exercise results

After a practice exercise, ask:

  • Was the incident detected?
  • Was an owner assigned quickly?
  • Were schedules paused?
  • Were risky writes stopped?
  • Was evidence preserved?
  • Were contacts current?
  • Did the known-good test help?
  • Did communication work?
  • Did recovery tests cover every layer?
  • Was automation resumed safely?

Update the plan wherever the answer is no.

Define the post-incident review

Every important incident should end with a review.

Cover:

  • what happened;
  • how it was detected;
  • what was affected;
  • what data or actions were at risk;
  • what containment worked;
  • what recovery worked;
  • what communication was sent;
  • what was delayed;
  • whether duplicates or partial writes occurred;
  • what should change; and
  • who owns each follow-up action.

Focus on practical improvement.

Review missed work

After recovery, identify:

  • missed scheduled runs;
  • delayed reports;
  • unprocessed records;
  • unsent messages;
  • incomplete files;
  • partial writes;
  • duplicate actions;
  • stale outputs; and
  • user tasks that need to be repeated.

Re-run work only after duplicate risk is checked.
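
Missed scheduled runs can be estimated from the schedule inventory and the pause window. The sketch below assumes a fixed-interval schedule and uses only Python's standard `datetime` module. `missed_runs` is a hypothetical helper, and calendar-based schedules would need a different calculation.

```python
from datetime import datetime, timedelta


def missed_runs(
    next_run: datetime,
    every: timedelta,
    paused_at: datetime,
    resumed_at: datetime,
) -> list[datetime]:
    """Return run times that fell inside the pause window [paused_at, resumed_at)."""
    if every <= timedelta(0):
        raise ValueError("interval must be positive")
    runs = []
    t = next_run
    while t < resumed_at:
        if t >= paused_at:
            runs.append(t)
        t += every
    return runs
```

The list shows what may need to be re-run. It does not show whether a run completed just before the pause, so check destinations first.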

Update workflows after incidents

Improve:

  • no-result paths;
  • error paths;
  • timeout handling;
  • duplicate prevention;
  • approval steps;
  • source validation;
  • destination checks;
  • Emit visibility;
  • user-facing error messages; and
  • fallback controls.

Update monitoring after incidents

Consider improving:

  • known-good test frequency;
  • authentication-expiry tracking;
  • service-startup checks;
  • schedule-history review;
  • conflict-warning review;
  • external-destination review;
  • server-owner contacts;
  • provider-status checks; and
  • audit frequency.

Update access after incidents

Remove temporary:

  • administrator access;
  • broad file access;
  • broad URL access;
  • broad IP access;
  • broad port access;
  • test credentials;
  • fallback permissions;
  • duplicate connections; and
  • emergency accounts.

Return the environment to approved boundaries.

Set plan review dates

Review the incident response plan:

  • at a regular interval;
  • after every important incident;
  • after server replacement;
  • after server retirement;
  • after new write tools are added;
  • after credential changes;
  • after ownership changes;
  • after major workflow changes;
  • after backup changes; and
  • after local-to-remote or remote-to-local migration.

Keep the plan current.

Final incident response plan checklist

Confirm that the plan includes:

  • incident definition;
  • scope;
  • dependency paths;
  • owners and contacts;
  • severity levels;
  • escalation triggers;
  • warning and outage conditions;
  • unsafe-write conditions;
  • first-response checklist;
  • server inventory;
  • workflow and schedule inventories;
  • permission records;
  • known-good read tests;
  • safe write tests;
  • evidence-handling rules;
  • containment actions;
  • communication templates;
  • diagnosis order;
  • local and remote recovery steps;
  • credential and data-exposure steps;
  • wrong-write and duplicate-write steps;
  • recovery tests;
  • recovery approval;
  • gradual resumption;
  • rollback conditions;
  • fallback policy;
  • practice exercises;
  • post-incident review; and
  • next plan-review date.

An MCP incident response plan is ready only when the people responsible can use it to stop risk, restore service, and verify recovery without guessing.

Frequently Asked Questions

What should an MCP incident response plan include?
Include owners, contacts, severity levels, pause conditions, known-good tests, containment actions, communication, diagnosis steps, recovery tests, rollback rules, and post-incident review.

When should MCP schedules be paused during an incident?
Pause them when the server is unavailable, authentication fails, results become unreliable, write destinations are uncertain, repeated calls appear, or the server can no longer be trusted.

How should I verify recovery after an MCP incident?
Check MCP Servers, confirm the expected tools, run a known-good Workbench test, review Activity, test Studio and RunFlows, verify required writes, and complete a one-time scheduled test.

Why should the plan include practice exercises?
Controlled exercises reveal missing contacts, unclear ownership, weak error paths, unsafe retries, incomplete recovery tests, and other problems before a real incident occurs.