Respond to an MCP Server Outage
An MCP server outage can interrupt Workbench tasks, Studio workflows, RunFlows executions, and scheduled automation.
The safest response is to:
- confirm what is affected;
- stop risky write actions;
- pause dependent schedules;
- preserve useful error information;
- identify the failing layer;
- restore one component at a time;
- test with a known-good request; and
- resume automation gradually.
Do not repeatedly retry write actions before checking whether an earlier call already completed.
What counts as an outage
Treat the server as unavailable when Feluda cannot reliably use the tools it needs from that server.
Common signs include:
- the server shows as unavailable;
- expected tools disappear;
- authentication fails;
- tool calls time out;
- every known-good test fails;
- the connected source cannot be reached;
- write destinations cannot be reached;
- repeated schedules fail; or
- results are too incomplete or inconsistent to trust.
A server may still appear configured while its tools are not usable.
Confirm the scope first
Before changing settings, determine whether the problem affects:
- one tool;
- one workflow;
- one account;
- one source;
- one destination;
- one computer;
- one network;
- one MCP server;
- several servers; or
- all Feluda tool use.
A narrow problem should not trigger a full environment change.
Separate the affected layers
A typical tool path is:
Feluda
→ AI Model
→ MCP Server
→ Connected Source or Service
→ Tool Result
→ Workflow Output
The outage may be in:
- Feluda;
- the selected AI model;
- the MCP connection;
- authentication;
- the local network;
- the internet connection;
- VPN or proxy access;
- the MCP server process;
- the connected source;
- the write destination; or
- a later workflow step.
Find the first failing layer.
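As a minimal sketch of that idea, the layers can be checked in path order and the first failure reported. The layer names come from the tool path above; the check functions are hypothetical placeholders that you would supply yourself, not Feluda APIs.

```python
from typing import Callable, Optional

# Layer names follow the tool path described in the text.
LAYER_ORDER = [
    "Feluda",
    "AI Model",
    "MCP Server",
    "Connected Source or Service",
    "Tool Result",
    "Workflow Output",
]

def first_failing_layer(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Walk the path in order and return the first layer whose check fails."""
    for layer in LAYER_ORDER:
        check = checks.get(layer)
        if check is None:
            continue  # no check supplied for this layer, so skip it
        try:
            healthy = check()
        except Exception:
            healthy = False  # a check that raises counts as a failure
        if not healthy:
            return layer
    return None  # every supplied check passed

# Stubbed example (assumption): everything up to the MCP server works,
# but the connected source does not respond.
example_checks = {
    "Feluda": lambda: True,
    "AI Model": lambda: True,
    "MCP Server": lambda: True,
    "Connected Source or Service": lambda: False,
}
print(first_failing_layer(example_checks))  # prints "Connected Source or Service"
```

Stopping at the first failure keeps attention on the earliest broken layer, since later layers usually fail as a consequence.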
Check whether the issue is truly the MCP server
The AI model may fail even when the MCP server works.
The connected source may fail even when Feluda reaches the MCP server.
A workflow may fail after the tool returns a correct result.
Compare:
- the MCP Servers connection state;
- Workbench Activity;
- RunFlows output;
- the raw tool result;
- the connected source;
- the destination; and
- the final workflow step.
Do not label the event as a server outage until the evidence supports it.
Stop high-risk actions
Stop or pause tasks that can:
- create records;
- update records;
- send messages;
- save files;
- overwrite files;
- move items;
- change statuses; or
- delete information.
A delayed response can make a completed write look like a failed write.
Check the destination before retrying.
Pause dependent schedules
Open Schedule Manager.
Pause schedules that depend on the affected server.
Record:
- schedule name;
- workflow;
- next-run time;
- source;
- destination;
- write action;
- recent failure; and
- responsible reviewer.
Leave schedules paused until both a manual test and a one-time scheduled test succeed.
Check active runs
Review:
- active RunFlows executions;
- recent failed runs;
- Workbench Activity;
- pending external actions;
- delayed write confirmations; and
- overlapping schedules.
Stop active work when continuing could create duplicates or incorrect writes.
Preserve useful evidence
Before changing the connection, record:
- date and time;
- server name;
- tool name;
- workflow name;
- schedule name;
- safe sample input;
- visible error;
- warning;
- runtime;
- connection state;
- recent changes; and
- affected source or destination.
Do not include credentials.
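One way to keep credentials out of an incident note is to build the record through a small helper that masks sensitive-looking fields. This is only a sketch: the field names and the list of sensitive key patterns are assumptions, and it masks by key name only, so it will not catch a secret pasted into a free-text field.

```python
import re
from datetime import datetime, timezone

# Key names treated as sensitive are an assumption; extend the pattern to suit.
SENSITIVE_KEY = re.compile(
    r"(token|secret|password|api[_-]?key|authorization)", re.IGNORECASE
)

def build_evidence(fields: dict) -> dict:
    """Copy incident fields, masking the value of any sensitive-looking key."""
    record = {"recorded_at": datetime.now(timezone.utc).isoformat()}
    for key, value in fields.items():
        record[key] = "[REDACTED]" if SENSITIVE_KEY.search(key) else value
    return record

# Hypothetical sample values for illustration only.
evidence = build_evidence({
    "server_name": "internal-knowledge",
    "tool_name": "Internal Knowledge Search",
    "visible_error": "Tool call timed out",
    "connection_state": "unavailable",
    "api_key": "not-a-real-key",
})
print(evidence["api_key"])  # prints "[REDACTED]"
```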
Review Workbench Activity
Open the Activity drawer after a failed tool request.
Check:
- which tool was called;
- which server provided it;
- what input was sent;
- whether a result returned;
- whether an error appeared;
- whether the call repeated;
- whether the model continued without data; and
- whether a write may have completed.
The Activity drawer helps distinguish a model problem from a tool problem.
Review RunFlows output
For a failed workflow, review:
- starting input;
- tool calls;
- tool input;
- raw tool output;
- intermediate values;
- warnings;
- errors;
- selected branch; and
- final output.
The first visible failure is usually more useful than the final error message.
Use Emit blocks when needed
In Studio, an Emit block can expose an intermediate result.
For example:
Input
→ MCP Tool
→ Emit Raw Tool Result
→ Prepare Summary
→ Output
This helps confirm whether the MCP tool failed or a later step failed.
Check MCP Servers
Open MCP Servers from the Feluda sidebar.
Review:
- server name;
- endpoint;
- connection state;
- authentication state;
- discovered tools;
- warnings; and
- errors.
Do not change the endpoint until you have confirmed the official current value.
Check recent changes
Ask whether the outage began after:
- a server update;
- an endpoint change;
- credential rotation;
- an account change;
- a network change;
- VPN changes;
- firewall changes;
- operating-system updates;
- Feluda updates;
- a model-runner update;
- a tool rename;
- a workflow edit;
- a source move; or
- a destination change.
Recent changes often reveal the likely cause.
Check the endpoint
Confirm:
- protocol;
- host;
- port;
- path;
- spelling;
- local or remote location;
- official server guidance; and
- whether redirects or certificates changed.
Do not guess an endpoint.
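When comparing a configured endpoint with the officially documented one, it can help to split both into their parts so that a changed protocol or port stands out. The sketch below uses Python's standard `urllib.parse`; the two URLs are invented examples, not real Feluda or server values.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def describe_endpoint(url: str) -> dict:
    """Split an endpoint URL into protocol, host, port, and path."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path or "/",
    }

def endpoint_differences(configured: str, official: str) -> dict:
    """Return only the parts that differ, as (configured, official) pairs."""
    a, b = describe_endpoint(configured), describe_endpoint(official)
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}

# Hypothetical values: the official guidance moved to HTTPS on a new port.
diff = endpoint_differences(
    "http://mcp.example.internal:8080/mcp",
    "https://mcp.example.internal:8443/mcp",
)
print(diff)
```

An empty result means the configured value matches the official one part by part. It does not show that the endpoint is reachable.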
Check authentication
Confirm that:
- the credential belongs to the correct server;
- the account remains active;
- the credential has not expired;
- the required scope remains available;
- the authentication method has not changed;
- the connected account still has access; and
- the value remains stored in protected settings.
Never paste credentials into a prompt or incident note.
Check permissions
A reachable server can still fail because access is blocked.
Review:
- read permission;
- write permission;
- account scope;
- workspace scope;
- project scope;
- record scope;
- URL rules;
- IP rules;
- path rules;
- port rules; and
- destination access.
Apply only the narrowest approved change.
Check local server health
For a local MCP server, confirm that:
- the computer is on;
- the process is running;
- the required application is open;
- the port is available;
- the local firewall allows access;
- the source path is mounted;
- the local database is running;
- enough memory is available; and
- the service started after restart.
Test the local server separately when possible.
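For a quick separate check, a plain TCP connection attempt shows whether anything is listening on the expected port. This is a sketch using the standard `socket` module; the host and port you pass in are your own values. A successful connection only shows that a process accepts connections there, not that it is the MCP server or that its tools work.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("127.0.0.1", 8080)` would check a server assumed to listen locally on port 8080. Follow a positive result with the known-good read test.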
Check remote server health
For a remote MCP server, review:
- internet or network access;
- provider service status;
- DNS;
- certificates;
- VPN;
- proxy;
- remote endpoint;
- account status;
- authentication; and
- source availability.
Contact the server owner when the problem is outside Feluda.
Check the connected source
The MCP server may be available while its source is not.
Check whether the tool depends on:
- files;
- a local database;
- cloud storage;
- a hosted application;
- an internal service;
- a search index;
- a message platform; or
- another provider.
Test the source directly when possible.
Check the destination
For write tools, confirm that the destination is still available.
Review:
- account;
- workspace;
- project;
- folder;
- record;
- message destination;
- local path;
- external service; and
- write permission.
Do not repeat a timed-out write until the destination has been checked.
Check the AI model
A model problem can look like a tool outage.
Confirm that:
- the provider is available;
- the selected model is available;
- the model supports tool use;
- the model receives the tool description;
- only the intended tools are enabled;
- the prompt is clear; and
- the model is not repeating failed calls.
Test the model without tools to confirm that it responds on its own.
Use a known-good read test
Keep one stable read-only test for the server.
For example:
Use only the enabled Internal Knowledge Search tool.
Search for "MCP outage test".
Return:
1. result title;
2. source identifier;
3. returned summary;
4. last updated date; and
5. any warning.
Do not create or change anything.
Use the same test during diagnosis and recovery.
Interpret the test result
If the read test fails, review:
- server connection;
- authentication;
- tool availability;
- source access;
- permissions;
- network;
- runtime; and
- returned error.
If it succeeds, the outage may be limited to another tool, workflow, source, or destination.
Distinguish no result from outage
No result:
The tool completed but found nothing.
Outage:
The tool could not complete the request.
Do not treat an empty search result as proof that the server is unavailable.
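The distinction can be made explicit in a workflow's handling logic. The sketch below assumes a simplified outcome record with hypothetical fields (`completed`, `items`, `error`); real tool results will have a different shape.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolOutcome:
    completed: bool                      # did the call return a response at all?
    items: list = field(default_factory=list)
    error: Optional[str] = None          # error text, if the tool reported one

def classify(outcome: ToolOutcome) -> str:
    """Separate 'could not complete' from 'completed but found nothing'."""
    if not outcome.completed or outcome.error:
        return "outage"
    if not outcome.items:
        return "no_result"
    return "ok"

print(classify(ToolOutcome(completed=True)))                       # prints "no_result"
print(classify(ToolOutcome(completed=False, error="timed out")))   # prints "outage"
```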
Check repeated calls
Repeated calls may appear when:
- the model does not recognise the error;
- the tool times out;
- the result is empty;
- the workflow loops;
- several similar tools are enabled; or
- a fallback repeats the same failing request.
Stop repeated write-capable calls immediately.
Check timeouts carefully
A timeout may happen before or after the external service acts.
Before retrying:
- review Activity or RunFlows;
- inspect the destination;
- compare timestamps;
- confirm whether the action completed; and
- retry only if the first action did not complete.
This prevents duplicate records, files, tasks, messages, or notes.
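The check-before-retry rule can be sketched as a small guard. It assumes each write carries a stable identifier that can be looked up at the destination; `destination_has` and `write` are hypothetical callables you would provide, not Feluda functions.

```python
from typing import Callable

def safe_retry_write(
    request_id: str,
    destination_has: Callable[[str], bool],
    write: Callable[[str], None],
) -> str:
    """Retry a timed-out write only if the destination does not already show it."""
    if destination_has(request_id):
        return "already_completed"  # the first call landed; do not repeat it
    write(request_id)
    return "retried"

# In-memory stand-in for a destination where the first write actually landed.
destination = {"req-001"}
outcome = safe_retry_write("req-001", destination.__contains__, destination.add)
print(outcome)  # prints "already_completed"
```

Where the destination cannot be queried by identifier, comparing timestamps and reviewing Activity or RunFlows by hand remains the safer route.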
Check partial failures
A tool may return some data before failing.
Confirm:
- which fields returned;
- whether the data is complete;
- whether a write partly completed;
- whether the destination changed;
- whether retrying would repeat successful steps; and
- whether human review is required.
Do not treat partial success as full success.
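A simple completeness check can flag a partial result before it is used. In this sketch the required field names are assumptions modelled on the known-good read test (title, source identifier, summary, last updated date); adjust them to the tool's real output.

```python
# Required field names are hypothetical; match them to the real tool output.
REQUIRED_FIELDS = ("title", "source_id", "summary", "last_updated")

def missing_fields(result: dict, required=REQUIRED_FIELDS) -> list[str]:
    """List required fields that are absent, None, or empty strings."""
    return [name for name in required if result.get(name) in (None, "")]

partial = {"title": "MCP outage test", "source_id": "kb-42", "summary": ""}
print(missing_fields(partial))  # prints ['summary', 'last_updated']
```

A non-empty list is a signal to stop and review, not to retry automatically.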
Decide whether to use a fallback
A fallback may involve:
- another MCP server;
- a manual process;
- another approved source;
- a local copy;
- another provider; or
- postponing the task.
Use a fallback only when:
- it is approved;
- the data path is understood;
- the source is appropriate;
- the destination is correct;
- users are informed; and
- the result is clearly labelled.
Do not switch silently to a different server.
Avoid unreviewed local-to-remote fallback
A local workflow should not silently send information to a remote service during an outage.
Confirm:
- what information would leave the device;
- which provider would receive it;
- whether the fallback is approved;
- whether personal or confidential data is involved; and
- whether explicit confirmation is required.
Return a clear outage message when remote fallback is not approved.
Communicate the impact
Tell affected users:
- which server is unavailable;
- which tools are affected;
- which workflows are affected;
- which schedules are paused;
- whether write actions are stopped;
- whether a fallback exists;
- what results may be delayed; and
- where to report unexpected behaviour.
Avoid promising a recovery time unless it is confirmed by the responsible service owner.
Use clear user-facing messages
A workflow may return:
The connected service is currently unavailable.
No result was produced.
This workflow has stopped to avoid using incomplete information.
For a write workflow:
The connected service could not confirm the write action.
Review the destination before trying again.
Do not expose internal secrets or unnecessary technical details.
Restore one layer at a time
Change only one of these before retesting:
- server process;
- endpoint;
- authentication;
- account;
- network;
- VPN;
- permission;
- source;
- destination;
- model;
- tool configuration; or
- workflow mapping.
Use the same known-good test after each change.
Recover a local server
A practical local recovery sequence is:
- confirm the computer is awake;
- confirm the MCP server process;
- confirm the endpoint and port;
- confirm the local firewall;
- confirm required applications;
- confirm source files or database;
- confirm available memory;
- restart only the required service;
- reopen MCP Servers; and
- run the known-good read test.
Recover a remote server
A practical remote recovery sequence is:
- confirm network access;
- confirm VPN or proxy;
- confirm provider status;
- confirm endpoint and certificate;
- confirm authentication;
- confirm account access;
- confirm the remote source;
- contact the service owner when needed;
- reopen MCP Servers; and
- run the known-good read test.
Verify the tool list
After recovery, confirm that:
- expected tools reappear;
- no required tool is missing;
- no unexpected tool appears;
- tool names remain correct;
- descriptions remain correct;
- read and write behaviour is unchanged; and
- input and output remain compatible.
A server update during the outage may have changed the tool list.
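If you keep a record of the tool names you expect from a server, the first three checks reduce to a set comparison. The tool names below are invented for illustration. The sketch compares names only, so descriptions, read and write behaviour, and input and output compatibility still need manual review.

```python
def compare_tool_lists(expected: set[str], discovered: set[str]) -> dict[str, list[str]]:
    """Report tools that disappeared and tools that newly appeared."""
    return {
        "missing": sorted(expected - discovered),
        "unexpected": sorted(discovered - expected),
    }

# Hypothetical tool names before and after a server update.
report = compare_tool_lists(
    expected={"search_knowledge", "get_document"},
    discovered={"search_knowledge", "fetch_document", "delete_document"},
)
print(report)
# prints {'missing': ['get_document'], 'unexpected': ['delete_document', 'fetch_document']}
```

A missing tool paired with a similarly named unexpected one often points to a rename, which affected workflows will need to reflect.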
Verify raw results
Compare the recovered tool result with the known-good baseline.
Check:
- source;
- record identifier;
- fields;
- timestamps;
- warnings;
- result count;
- runtime; and
- final answer.
A connection that returns the wrong data is not fully recovered.
Verify permissions
Confirm that:
- approved access works;
- unrelated sources remain blocked;
- read-only accounts remain read-only;
- URL rules remain narrow;
- IP rules remain narrow;
- paths remain narrow;
- ports remain narrow; and
- write destinations remain limited.
Do not leave temporary broad access in place.
Verify Workbench
Run the known-good test in a new conversation.
Review Activity.
Confirm:
- the expected tool is called;
- input is correct;
- output is complete;
- warnings are understood;
- no repeated call occurs; and
- the model interprets the result correctly.
Verify Studio workflows
Open affected workflows.
Review:
- selected tool;
- model;
- prompt;
- input mapping;
- output mapping;
- permissions;
- no-result path;
- error path;
- Emit blocks;
- write approval; and
- destination.
An outage or update may expose an outdated dependency.
Verify RunFlows
Test each important flow with safe sample data.
Review:
- starting input;
- raw tool output;
- intermediate values;
- branch decision;
- warnings;
- errors;
- final output; and
- external destination.
Do not resume schedules based only on a Workbench test.
Verify write tools separately
Use:
- a test account;
- a safe destination;
- a reversible action;
- explicit approval;
- destination review; and
- duplicate checking.
A successful read test does not prove that writes work.
Verify a one-time schedule
Before resuming recurring schedules:
- create or use a one-time scheduled test;
- use a safe flow;
- confirm the server is available at run time;
- review Schedule Manager;
- review RunFlows;
- inspect the result;
- verify the destination; and
- check for duplicates.
This confirms that the server is available when a schedule actually runs, not only during manual tests.
Resume automation gradually
Resume one important schedule at a time.
Monitor the first runs.
Check:
- tool calls;
- input;
- output;
- runtime;
- warnings;
- errors;
- branch decisions;
- write destinations; and
- duplicate actions.
Pause again if unexpected behaviour appears.
Keep non-critical schedules paused when needed
It may be safer to restore critical read workflows first.
For example, resume in this order:
- low-risk read-only flows;
- important read flows;
- low-risk write flows;
- approved production write flows; and
- high-volume or overlapping schedules.
Use the order that fits your environment.
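If paused schedules are tagged with a risk tier, the resume order can be derived rather than decided ad hoc. This sketch assumes each schedule has a hypothetical `tier` label matching the list above; the schedule names are invented.

```python
# Tier labels mirror the list in the text; reorder them to fit your environment.
RISK_ORDER = [
    "low-risk read",
    "important read",
    "low-risk write",
    "production write",
    "high-volume",
]

def resume_order(schedules: list[dict]) -> list[str]:
    """Return schedule names sorted from lowest to highest risk tier."""
    ranked = sorted(schedules, key=lambda s: RISK_ORDER.index(s["tier"]))
    return [s["name"] for s in ranked]

paused = [
    {"name": "Nightly export", "tier": "production write"},
    {"name": "Morning digest", "tier": "low-risk read"},
    {"name": "Ticket tagger", "tier": "low-risk write"},
]
print(resume_order(paused))
# prints ['Morning digest', 'Ticket tagger', 'Nightly export']
```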
Escalate the outage
Contact the server owner or provider when:
- the endpoint is correct but unreachable;
- provider status shows a failure;
- authentication fails after confirmed renewal;
- required tools are missing;
- result structure changed;
- repeated timeouts continue;
- the connected source remains unavailable;
- writes affect the wrong destination; or
- recovery cannot be verified.
Include safe diagnostic details, not credentials.
What to include in an escalation
Provide:
- date and time;
- server name;
- affected tool;
- endpoint type;
- local or remote status;
- visible error;
- known-good test;
- whether the source is reachable;
- whether authentication was checked;
- whether schedules are paused; and
- whether write actions are affected.
Avoid sending private data unless the support process explicitly requires and protects it.
Record the incident
For important outages, record:
- start time;
- detection method;
- affected server;
- affected tools;
- affected workflows;
- affected schedules;
- affected sources;
- affected destinations;
- write risk;
- visible errors;
- actions taken;
- owner contacted;
- recovery time;
- tests completed; and
- final outcome.
Do not include raw credentials.
Review after recovery
A post-incident review should ask:
- What failed first?
- How was the outage detected?
- Which workflows were affected?
- Were schedules paused quickly enough?
- Did any write action duplicate or partially complete?
- Were users informed?
- Was the fallback appropriate?
- Did recovery steps work?
- Were known-good tests available?
- Could the issue have been detected earlier?
- What should change before the next outage?
Focus on practical improvements.
Review missed or delayed work
After recovery, identify:
- missed scheduled runs;
- delayed reports;
- unprocessed records;
- unsent messages;
- incomplete writes;
- duplicate actions;
- stale outputs; and
- user tasks that need to be repeated.
Re-run work only after confirming it will not duplicate earlier actions.
Check for duplicates
Before repeating missed write workflows, inspect:
- destination records;
- files;
- tasks;
- messages;
- Journal entries;
- timestamps;
- identifiers; and
- schedule history.
A failed confirmation may hide a completed action.
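Where destination records can be exported with an identifier, duplicates can be found by counting repeated identifiers. The `request_id` field is an assumption; use whichever identifier the destination actually stores.

```python
from collections import Counter

def find_duplicates(records: list[dict], key: str = "request_id") -> list[str]:
    """Return identifiers that appear more than once in the destination records."""
    counts = Counter(r[key] for r in records if r.get(key) is not None)
    return sorted(identifier for identifier, n in counts.items() if n > 1)

# Hypothetical export: req-002 was written twice after a timed-out confirmation.
exported = [
    {"request_id": "req-001", "created": "09:00"},
    {"request_id": "req-002", "created": "09:05"},
    {"request_id": "req-002", "created": "09:06"},
]
print(find_duplicates(exported))  # prints ['req-002']
```

Records without an identifier are skipped here, so they still need a manual comparison by timestamp and content.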
Improve monitoring
After an outage, consider improving:
- known-good read tests;
- ownership records;
- authentication expiry tracking;
- schedule review;
- conflict warnings;
- error paths;
- Activity review;
- RunFlows review;
- fallback rules;
- service startup;
- backup routines; and
- escalation contacts.
Use the outage to improve readiness.
Improve workflow error handling
Add or update paths for:
- unavailable server;
- authentication failure;
- permission denial;
- no result;
- partial result;
- timeout;
- write uncertainty; and
- manual review.
A workflow should stop safely when the tool cannot be trusted.
Improve local recovery
For local environments, consider:
- automatic service startup;
- clearer port documentation;
- restart tests;
- power and sleep changes;
- hardware monitoring;
- local database checks;
- source path checks; and
- offline tests.
Improve remote recovery
For remote environments, consider:
- provider status monitoring;
- backup endpoints;
- credential renewal planning;
- VPN checks;
- proxy documentation;
- approved fallback services;
- network escalation; and
- provider contacts.
Define future outage thresholds
Decide when to:
- issue a warning;
- pause schedules;
- stop write workflows;
- use a fallback;
- contact the server owner;
- declare recovery;
- resume automation; and
- perform a post-incident review.
Clear thresholds reduce uncertainty during the next event.
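Thresholds are easier to apply under pressure when they are written down as explicit values. The sketch below is one possible encoding; the failure counts and the rule that writes stop at the first warning are illustrative assumptions, not recommended settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    # Consecutive failed known-good tests before each action (example values).
    warn_after: int = 1
    pause_after: int = 3
    escalate_after: int = 5

def actions_for(
    consecutive_failures: int,
    write_capable: bool,
    t: Thresholds = Thresholds(),
) -> list[str]:
    """Map a failure count to the agreed actions, in the order they apply."""
    actions = []
    if consecutive_failures >= t.warn_after:
        actions.append("issue warning")
        if write_capable:
            actions.append("stop write workflows")  # writes stop early by design
    if consecutive_failures >= t.pause_after:
        actions.append("pause schedules")
    if consecutive_failures >= t.escalate_after:
        actions.append("contact server owner")
    return actions

print(actions_for(3, write_capable=True))
# prints ['issue warning', 'stop write workflows', 'pause schedules']
```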
A practical outage routine
Use this process:
- Confirm the symptom.
- Identify the affected server, tool, workflow, and schedule.
- Stop risky writes.
- Pause dependent schedules.
- Check active runs and destinations.
- Preserve safe evidence.
- Review MCP Servers.
- Review Workbench Activity and RunFlows.
- Check recent changes.
- Check endpoint, authentication, permissions, and network.
- Check the connected source and destination.
- Run the known-good read test.
- Restore one layer at a time.
- Compare the recovered result with the baseline.
- Test Workbench, Studio, and RunFlows.
- Test write tools separately.
- Use a one-time scheduled test.
- Resume schedules gradually.
- Review missed work and duplicates.
- Complete the incident review.
A safe outage response protects data and external systems while restoring the MCP service in a controlled way.