Frequently Asked Questions

What should I do first when an MCP server appears unavailable?

Confirm the scope, stop risky write actions, pause dependent schedules, preserve the visible error, and review MCP Servers, Workbench Activity, and RunFlows before changing settings.

Should I retry a timed-out MCP write action immediately?

No. Inspect the external destination and review Activity or RunFlows first, because the action may have completed before the confirmation timed out.

How do I know the MCP server has fully recovered?

Confirm the expected tools reappear, run a known-good read test, compare the raw result with the baseline, test affected workflows in RunFlows, verify write destinations, and complete a one-time scheduled test.

When should I resume scheduled workflows after an outage?

Resume them gradually only after manual tests, RunFlows tests, required write tests, and a one-time scheduled test succeed without unexpected warnings, errors, or duplicates.

Respond to an MCP Server Outage | Feluda.ai Documentation

An MCP server outage can interrupt Workbench tasks, Studio workflows, RunFlows executions, and scheduled automation.

The safest response is to:

confirm what is affected;
stop risky write actions;
pause dependent schedules;
preserve useful error information;
identify the failing layer;
restore one component at a time;
test with a known-good request; and
resume automation gradually.

Do not repeatedly retry write actions before checking whether an earlier call already completed.

What counts as an outage

Treat the server as unavailable when Feluda cannot reliably use its required tools.

Common signs include:

the server shows as unavailable;
expected tools disappear;
authentication fails;
tool calls time out;
every known-good test fails;
the connected source cannot be reached;
write destinations cannot be reached;
repeated schedules fail; or
results are too incomplete or inconsistent to trust.

A server may still appear configured while its tools are not usable.

Confirm the scope first

Before changing settings, determine whether the problem affects:

one tool;
one workflow;
one account;
one source;
one destination;
one computer;
one network;
one MCP server;
several servers; or
all Feluda tool use.

A narrow problem should not trigger a full environment change.

Separate the affected layers

A typical tool path is:

Feluda
→ AI Model
→ MCP Server
→ Connected Source or Service
→ Tool Result
→ Workflow Output

The outage may be in:

Feluda;
the selected AI model;
the MCP connection;
authentication;
the local network;
the internet connection;
VPN or proxy access;
the MCP server process;
the connected source;
the write destination; or
a later workflow step.

Find the first failing layer.
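The search for the first failing layer can be sketched in Python. This is a minimal illustration, not part of Feluda: the layer names come from the tool path above, and the check functions are hypothetical placeholders for whatever probe you use at each layer.

```python
from typing import Callable, Optional

# Layer names follow the tool path described above. The probes passed in by
# the caller are hypothetical; replace them with real checks.
LAYER_ORDER = [
    "Feluda",
    "AI Model",
    "MCP Server",
    "Connected Source or Service",
    "Tool Result",
    "Workflow Output",
]


def first_failing_layer(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the first layer, in path order, whose check fails, or None."""
    for layer in LAYER_ORDER:
        check = checks.get(layer)
        if check is None:
            continue  # no probe defined for this layer, so skip it
        try:
            healthy = check()
        except Exception:
            healthy = False  # a probe that raises counts as a failure
        if not healthy:
            return layer
    return None


if __name__ == "__main__":
    example = {
        "Feluda": lambda: True,
        "AI Model": lambda: True,
        "MCP Server": lambda: False,
        "Workflow Output": lambda: False,
    }
    print(first_failing_layer(example))
```

Walking the layers in path order matters: a failure in a later layer is often only a consequence of an earlier one.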

Check whether the issue is truly the MCP server

The AI model may fail even when the MCP server works.

The connected source may fail even when Feluda reaches the MCP server.

A workflow may fail after the tool returns a correct result.

Compare:

the MCP Servers connection state;
Workbench Activity;
RunFlows output;
the raw tool result;
the connected source;
the destination; and
the final workflow step.

Do not label the event as a server outage until the evidence supports it.

Stop high-risk actions

Stop or pause tasks that can:

create records;
update records;
send messages;
save files;
overwrite files;
move items;
change statuses; or
delete information.

A delayed response can make a completed write look like a failed write.

Check the destination before retrying.

Pause dependent schedules

Open Schedule Manager.

Pause schedules that depend on the affected server.

Record:

schedule name;
workflow;
next-run time;
source;
destination;
write action;
recent failure; and
responsible reviewer.

Leave schedules paused until manual and one-time scheduled tests succeed.

Check active runs

Review:

active RunFlows executions;
recent failed runs;
Workbench Activity;
pending external actions;
delayed write confirmations; and
overlapping schedules.

Stop active work when continuing could create duplicates or incorrect writes.

Preserve useful evidence

Before changing the connection, record:

date and time;
server name;
tool name;
workflow name;
schedule name;
safe sample input;
visible error;
warning;
runtime;
connection state;
recent changes; and
affected source or destination.

Do not include credentials.
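If you keep evidence in a script or a structured note, a small record type can strip obvious secrets before the text is stored. The sketch below is an assumption about how such a record might look; the field names are a subset of the list above, and the redaction pattern is illustrative and not exhaustive, so still review the text by hand.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical pattern: a secret-looking key followed by ":" or "=" and a value.
# It is a safety net only and will not catch every credential format.
_SECRET = re.compile(
    r"(?i)(token|secret|password|api[_-]?key|authorization)\b"
    r"(\s*[:=]\s*)(?:bearer\s+)?\S+"
)


def redact(text: str) -> str:
    """Replace values that follow secret-looking keys with a marker."""
    return _SECRET.sub(r"\1\2[REDACTED]", text)


@dataclass
class OutageEvidence:
    server_name: str
    tool_name: str
    workflow_name: str
    visible_error: str
    connection_state: str
    recorded_at: str = ""

    def __post_init__(self) -> None:
        # Redact before the value is ever stored on the record.
        self.visible_error = redact(self.visible_error)
        if not self.recorded_at:
            self.recorded_at = datetime.now(timezone.utc).isoformat()


if __name__ == "__main__":
    note = OutageEvidence(
        server_name="Internal Knowledge Search",
        tool_name="search",
        workflow_name="Daily summary",
        visible_error="401 returned for api_key=abc123",
        connection_state="unavailable",
    )
    print(note)
```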

Review Workbench Activity

Open the Activity drawer after a failed tool request.

Check:

which tool was called;
which server provided it;
what input was sent;
whether a result returned;
whether an error appeared;
whether the call repeated;
whether the model continued without data; and
whether a write may have completed.

The Activity drawer helps distinguish a model problem from a tool problem.

Review RunFlows output

For a failed workflow, review:

starting input;
tool calls;
tool input;
raw tool output;
intermediate values;
warnings;
errors;
selected branch; and
final output.

The first visible failure is usually more useful than the final error message.

Use Emit blocks when needed

In Studio, an Emit block can expose an intermediate result.

For example:

Input
→ MCP Tool
→ Emit Raw Tool Result
→ Prepare Summary
→ Output

This helps confirm whether the MCP tool failed or a later step failed.

Check MCP Servers

Open MCP Servers from the Feluda sidebar.

Review:

server name;
endpoint;
connection state;
authentication state;
discovered tools;
warnings; and
errors.

Do not change the endpoint until you have confirmed the official current value.

Check recent changes

Ask whether the outage began after:

a server update;
an endpoint change;
credential rotation;
an account change;
a network change;
VPN changes;
firewall changes;
operating-system updates;
Feluda updates;
a model-runner update;
a tool rename;
a workflow edit;
a source move; or
a destination change.

Recent changes often reveal the likely cause.

Check the endpoint

Confirm:

protocol;
host;
port;
path;
spelling;
local or remote location;
official server guidance; and
whether redirects or certificates changed.

Do not guess an endpoint.
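One way to compare a configured endpoint with the official value is to split both into parts and list what differs. The following sketch uses Python's standard `urllib.parse`; the example URLs are invented placeholders, not real Feluda or provider endpoints.

```python
from urllib.parse import urlsplit


def describe_endpoint(url: str) -> dict:
    """Split an endpoint URL into the parts listed above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "is_local": parts.hostname in {"localhost", "127.0.0.1", "::1"},
    }


def endpoint_differences(configured: str, official: str) -> list[str]:
    """Return the names of the parts that differ between two endpoints."""
    a = describe_endpoint(configured)
    b = describe_endpoint(official)
    return [name for name in a if a[name] != b[name]]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    configured = "http://mcp.example.com:8443/mcp"
    official = "https://mcp.example.com:8443/mcp"
    print(endpoint_differences(configured, official))
```

A difference in a single part, such as the protocol, is easy to miss when reading two long URLs side by side.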

Check authentication

Confirm that:

the credential belongs to the correct server;
the account remains active;
the credential has not expired;
the required scope remains available;
the authentication method has not changed;
the connected account still has access; and
the value remains stored in protected settings.

Never paste credentials into a prompt or incident note.

Check permissions

A reachable server can still fail because access is blocked.

Review:

read permission;
write permission;
account scope;
workspace scope;
project scope;
record scope;
URL rules;
IP rules;
path rules;
port rules; and
destination access.

Apply only the narrowest approved change.

Check local server health

For a local MCP server, confirm that:

the computer is on;
the process is running;
the required application is open;
the port is available;
the local firewall allows access;
the source path is mounted;
the local database is running;
enough memory is available; and
the service started after restart.

Test the local server separately when possible.
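A separate local test can be as simple as checking whether anything accepts connections on the expected port. This sketch uses only the Python standard library; the host and port are assumptions that you would replace with your own values. A successful connection shows that some process is listening, not that it is the MCP server or that its tools work.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical local endpoint; use the port from your own configuration.
    print(port_is_listening("127.0.0.1", 8080))
```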

Check remote server health

For a remote MCP server, review:

internet or network access;
provider service status;
DNS;
certificates;
VPN;
proxy;
remote endpoint;
account status;
authentication; and
source availability.

Contact the server owner when the problem is outside Feluda.

Check the connected source

The MCP server may be available while its source is not.

Check whether the tool depends on:

files;
a local database;
cloud storage;
a hosted application;
an internal service;
a search index;
a message platform; or
another provider.

Test the source directly when possible.

Check the destination

For write tools, confirm that the destination is still available.

Review:

account;
workspace;
project;
folder;
record;
message destination;
local path;
external service; and
write permission.

Do not repeat a timed-out write until the destination has been checked.

Check the AI model

A model problem can look like a tool outage.

Confirm that:

the provider is available;
the selected model is available;
the model supports tool use;
the model receives the tool description;
only the intended tools are enabled;
the prompt is clear; and
the model is not repeating failed calls.

Test the model without tools to confirm that it responds on its own.

Use a known-good read test

Keep one stable read-only test for the server.

For example:

Use only the enabled Internal Knowledge Search tool.

Search for "MCP outage test".

Return:
1. result title;
2. source identifier;
3. returned summary;
4. last updated date; and
5. any warning.

Do not create or change anything.

Use the same test during diagnosis and recovery.

Interpret the test result

If the read test fails, review:

server connection;
authentication;
tool availability;
source access;
permissions;
network;
runtime; and
returned error.

If it succeeds, the outage may be limited to another tool, workflow, source, or destination.

Distinguish no result from outage

No result:

The tool completed but found nothing.

Outage:

The tool could not complete the request.

Do not treat an empty search result as proof that the server is unavailable.
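The distinction can be made explicit in any script that processes tool results. The sketch below assumes a hypothetical result shape: a flag that says whether the call completed, and a list of returned items.

```python
from enum import Enum
from typing import Any, Optional


class CallOutcome(Enum):
    RESULT = "result"
    NO_RESULT = "no result"
    OUTAGE = "outage"


def classify_call(completed: bool, items: Optional[list[Any]]) -> CallOutcome:
    """Separate 'completed but empty' from 'could not complete'."""
    if not completed:
        return CallOutcome.OUTAGE
    if not items:
        return CallOutcome.NO_RESULT
    return CallOutcome.RESULT


if __name__ == "__main__":
    print(classify_call(True, []))     # completed, found nothing
    print(classify_call(False, None))  # could not complete
```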

Check repeated calls

Repeated calls may appear when:

the model does not recognise the error;
the tool times out;
the result is empty;
the workflow loops;
several similar tools are enabled; or
a fallback repeats the same failing request.

Stop repeated write-capable calls immediately.
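If you copy call details out of Activity or RunFlows for review, repeated calls can be counted with a short script. The record shape here (a `tool` name and an `input` mapping) is an assumption for illustration, not a Feluda export format.

```python
import json
from collections import Counter


def repeated_calls(calls: list[dict], threshold: int = 3) -> dict[tuple[str, str], int]:
    """Return (tool, input) pairs that appear at least `threshold` times."""
    counts = Counter(
        # Serialise the input with sorted keys so equal inputs compare equal.
        (call["tool"], json.dumps(call["input"], sort_keys=True))
        for call in calls
    )
    return {key: n for key, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    log = [
        {"tool": "search", "input": {"q": "MCP outage test"}},
        {"tool": "search", "input": {"q": "MCP outage test"}},
        {"tool": "search", "input": {"q": "MCP outage test"}},
        {"tool": "search", "input": {"q": "something else"}},
    ]
    print(repeated_calls(log))
```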

Check timeouts carefully

A timeout may happen before or after the external service acts.

Before retrying:

review Activity or RunFlows;
inspect the destination;
compare timestamps;
confirm whether the action completed; and
retry only if the first action did not complete.

This prevents duplicate records, files, tasks, messages, or notes.
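The check-before-retry rule can be expressed as a small guard. In this sketch both callables are hypothetical: `write` stands for the action that timed out, and `already_completed` stands for whatever inspection of the destination tells you the action landed.

```python
from typing import Callable


def safe_retry_write(
    write: Callable[[], None],
    already_completed: Callable[[], bool],
) -> str:
    """Retry a timed-out write only if the destination shows it did not land."""
    if already_completed():
        return "skipped: first attempt completed"
    write()
    if already_completed():
        return "retried: write confirmed"
    return "retried: still unconfirmed, needs review"


if __name__ == "__main__":
    destination: list[str] = []

    def write_note() -> None:
        destination.append("note")

    def note_exists() -> bool:
        return "note" in destination

    print(safe_retry_write(write_note, note_exists))
    print(safe_retry_write(write_note, note_exists))
    print(destination)
```

The second call is skipped because the destination already holds the note, which is the behaviour that avoids duplicates.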

Check partial failures

A tool may return some data before failing.

Confirm:

which fields returned;
whether the data is complete;
whether a write partly completed;
whether the destination changed;
whether retrying would repeat successful steps; and
whether human review is required.

Do not treat partial success as full success.

Decide whether to use a fallback

A fallback may involve:

another MCP server;
a manual process;
another approved source;
a local copy;
another provider; or
postponing the task.

Use a fallback only when:

it is approved;
the data path is understood;
the source is appropriate;
the destination is correct;
users are informed; and
the result is clearly labelled.

Do not switch silently to a different server.

Avoid unreviewed local-to-remote fallback

A local workflow should not silently send information to a remote service during an outage.

Confirm:

what information would leave the device;
which provider would receive it;
whether the fallback is approved;
whether personal or confidential data is involved; and
whether explicit confirmation is required.

Return a clear outage message when remote fallback is not approved.

Communicate the impact

Tell affected users:

which server is unavailable;
which tools are affected;
which workflows are affected;
which schedules are paused;
whether write actions are stopped;
whether a fallback exists;
what results may be delayed; and
where to report unexpected behaviour.

Avoid promising a recovery time unless it is confirmed by the responsible service owner.

Use clear user-facing messages

A workflow may return:

The connected service is currently unavailable.
No result was produced.
This workflow has stopped to avoid using incomplete information.

For a write workflow:

The connected service could not confirm the write action.
Review the destination before trying again.

Do not expose internal secrets or unnecessary technical details.

Restore one layer at a time

Change only one of these before retesting:

server process;
endpoint;
authentication;
account;
network;
VPN;
permission;
source;
destination;
model;
tool configuration; or
workflow mapping.

Use the same known-good test after each change.
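The one-change-then-retest loop might be sketched as follows. The change functions and the test are placeholders supplied by the caller; nothing here calls Feluda itself.

```python
from typing import Callable, Optional


def restore_one_at_a_time(
    changes: list[tuple[str, Callable[[], None]]],
    known_good_test: Callable[[], bool],
) -> Optional[str]:
    """Apply changes in order, retesting after each one.

    Returns the name of the change after which the test first passed,
    or None if the test never passed. Later changes are not applied
    once the test succeeds.
    """
    for name, apply_change in changes:
        apply_change()
        if known_good_test():
            return name
    return None


if __name__ == "__main__":
    state = {"authenticated": False}
    plan = [
        ("restart server process", lambda: None),
        ("renew authentication", lambda: state.update(authenticated=True)),
        ("change endpoint", lambda: None),
    ]
    print(restore_one_at_a_time(plan, lambda: state["authenticated"]))
```

Stopping at the first passing test keeps the fix attributable to a single change and leaves the remaining settings untouched.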

Recover a local server

A practical local recovery sequence is:

confirm the computer is awake;
confirm the MCP server process;
confirm the endpoint and port;
confirm the local firewall;
confirm required applications;
confirm source files or database;
confirm available memory;
restart only the required service;
reopen MCP Servers; and
run the known-good read test.

Recover a remote server

A practical remote recovery sequence is:

confirm network access;
confirm VPN or proxy;
confirm provider status;
confirm endpoint and certificate;
confirm authentication;
confirm account access;
confirm the remote source;
contact the service owner when needed;
reopen MCP Servers; and
run the known-good read test.

Verify the tool list

After recovery, confirm that:

expected tools reappear;
no required tool is missing;
no unexpected tool appears;
tool names remain correct;
descriptions remain correct;
read and write behaviour is unchanged; and
input and output remain compatible.

A server update during the outage may have changed the tool list.
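If you keep a list of expected tool names, comparing it with the discovered list is a simple set difference. The tool names below are invented examples.

```python
def compare_tool_lists(expected: set[str], discovered: set[str]) -> dict[str, list[str]]:
    """Report tools that are missing and tools that were not expected."""
    return {
        "missing": sorted(expected - discovered),
        "unexpected": sorted(discovered - expected),
    }


if __name__ == "__main__":
    # Hypothetical tool names for illustration.
    expected = {"search", "get_document", "create_note"}
    discovered = {"search", "get_document", "delete_note"}
    print(compare_tool_lists(expected, discovered))
```

A name comparison does not cover descriptions, read or write behaviour, or input and output compatibility, so those still need a manual review.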

Verify raw results

Compare the recovered tool result with the known-good baseline.

Check:

source;
record identifier;
fields;
timestamps;
warnings;
result count;
runtime; and
final answer.

A connection that returns the wrong data is not fully recovered.
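A field-by-field comparison against a saved baseline can be sketched like this. The field names are assumptions; runtime is ignored by default in this sketch because it normally varies between runs and is better judged separately.

```python
from typing import Any

_MISSING = "<missing>"


def compare_with_baseline(
    baseline: dict[str, Any],
    recovered: dict[str, Any],
    ignore: frozenset = frozenset({"runtime"}),
) -> dict[str, tuple[Any, Any]]:
    """Return {field: (baseline value, recovered value)} for fields that differ."""
    differences: dict[str, tuple[Any, Any]] = {}
    for field in sorted(set(baseline) | set(recovered)):
        if field in ignore:
            continue
        before = baseline.get(field, _MISSING)
        after = recovered.get(field, _MISSING)
        if before != after:
            differences[field] = (before, after)
    return differences


if __name__ == "__main__":
    # Hypothetical baseline and recovered results.
    baseline = {"source": "kb", "result_count": 3, "runtime": 1.2}
    recovered = {"source": "kb", "result_count": 0, "runtime": 4.0, "warning": "stale index"}
    print(compare_with_baseline(baseline, recovered))
```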

Verify permissions

Confirm that:

approved access works;
unrelated sources remain blocked;
read-only accounts remain read-only;
URL rules remain narrow;
IP rules remain narrow;
paths remain narrow;
ports remain narrow; and
write destinations remain limited.

Do not leave temporary broad access in place.

Verify Workbench

Run the known-good test in a new conversation.

Review Activity.

Confirm:

the expected tool is called;
input is correct;
output is complete;
warnings are understood;
no repeated call occurs; and
the model interprets the result correctly.

Verify Studio workflows

Open affected workflows.

Review:

selected tool;
model;
prompt;
input mapping;
output mapping;
permissions;
no-result path;
error path;
Emit blocks;
write approval; and
destination.

An outage or update may expose an outdated dependency.

Verify RunFlows

Test each important flow with safe sample data.

Review:

starting input;
raw tool output;
intermediate values;
branch decision;
warnings;
errors;
final output; and
external destination.

Do not resume schedules based only on a Workbench test.

Verify write tools separately

Use:

a test account;
a safe destination;
a reversible action;
explicit approval;
destination review; and
duplicate checking.

A successful read test does not prove that writes work.

Verify a one-time schedule

Before resuming recurring schedules:

create or use a one-time scheduled test;
use a safe flow;
confirm the server is available at run time;
review Schedule Manager;
review RunFlows;
inspect the result;
verify the destination; and
check for duplicates.

This confirms that the server is available when a schedule actually runs, not only during manual tests.

Resume automation gradually

Resume one important schedule at a time.

Monitor the first runs.

Check:

tool calls;
input;
output;
runtime;
warnings;
errors;
branch decisions;
write destinations; and
duplicate actions.

Pause again if unexpected behaviour appears.

Keep non-critical schedules paused when needed

It may be safer to restore critical read workflows first.

A typical order is to resume:

low-risk read-only flows;
important read flows;
low-risk write flows;
approved production write flows; and
high-volume or overlapping schedules.

Use the order that fits your environment.
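If you track schedules in a list, the order above can be applied by sorting on a risk tier. The tier labels mirror the list above; the schedule names are invented, and the tier assigned to each schedule is a judgment you make yourself.

```python
# Tiers in the order listed above, lowest risk first.
RESUME_ORDER = [
    "low-risk read-only",
    "important read",
    "low-risk write",
    "approved production write",
    "high-volume or overlapping",
]


def resume_plan(schedules: list[tuple[str, str]]) -> list[str]:
    """Sort (schedule name, tier) pairs into a resume order.

    The sort is stable, so schedules in the same tier keep their input order.
    """
    rank = {tier: position for position, tier in enumerate(RESUME_ORDER)}
    ordered = sorted(schedules, key=lambda item: rank[item[1]])
    return [name for name, _tier in ordered]


if __name__ == "__main__":
    # Hypothetical schedules for illustration.
    print(resume_plan([
        ("nightly export", "approved production write"),
        ("status digest", "low-risk read-only"),
        ("bulk sync", "high-volume or overlapping"),
    ]))
```

Reorder `RESUME_ORDER` if a different sequence fits your environment better.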

Escalate the outage

Contact the server owner or provider when:

the endpoint is correct but unreachable;
provider status shows a failure;
authentication fails after confirmed renewal;
required tools are missing;
result structure changed;
repeated timeouts continue;
the connected source remains unavailable;
writes affect the wrong destination; or
recovery cannot be verified.

Include safe diagnostic details, not credentials.

What to include in an escalation

Provide:

date and time;
server name;
affected tool;
endpoint type;
local or remote status;
visible error;
known-good test;
whether the source is reachable;
whether authentication was checked;
whether schedules are paused; and
whether write actions are affected.

Avoid sending private data unless the support process explicitly requires and protects it.

Record the incident

For important outages, record:

start time;
detection method;
affected server;
affected tools;
affected workflows;
affected schedules;
affected sources;
affected destinations;
write risk;
visible errors;
actions taken;
owner contacted;
recovery time;
tests completed; and
final outcome.

Do not include raw credentials.

Review after recovery

A post-incident review should ask:

What failed first?
How was the outage detected?
Which workflows were affected?
Were schedules paused quickly enough?
Did any write action duplicate or partially complete?
Were users informed?
Was the fallback appropriate?
Did recovery steps work?
Were known-good tests available?
Could the issue have been detected earlier?
What should change before the next outage?

Focus on practical improvements.

Review missed or delayed work

After recovery, identify:

missed scheduled runs;
delayed reports;
unprocessed records;
unsent messages;
incomplete writes;
duplicate actions;
stale outputs; and
user tasks that need to be repeated.

Re-run work only after confirming it will not duplicate earlier actions.

Check for duplicates

Before repeating missed write workflows, inspect:

destination records;
files;
tasks;
messages;
Journal entries;
timestamps;
identifiers; and
schedule history.

A failed confirmation may hide a completed action.
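When destination records can be exported, grouping them by identifier is a quick way to spot duplicates. The record shape and the `identifier` key are assumptions for illustration; use whichever field uniquely identifies an action in your destination.

```python
from collections import defaultdict


def find_duplicates(records: list[dict], key: str = "identifier") -> dict[str, list[dict]]:
    """Group records by `key` and return only groups with more than one record."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return {value: group for value, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    # Hypothetical destination records.
    records = [
        {"identifier": "task-1", "created": "09:00"},
        {"identifier": "task-2", "created": "09:01"},
        {"identifier": "task-1", "created": "09:03"},
    ]
    print(find_duplicates(records))
```

Comparing the timestamps within each group against the outage window helps show whether a retry created the extra record.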

Improve monitoring

After an outage, consider improving:

known-good read tests;
ownership records;
authentication expiry tracking;
schedule review;
conflict warnings;
error paths;
Activity review;
RunFlows review;
fallback rules;
service startup;
backup routines; and
escalation contacts.

Use the outage to improve readiness.

Improve workflow error handling

Add or update paths for:

unavailable server;
authentication failure;
permission denial;
no result;
partial result;
timeout;
write uncertainty; and
manual review.

A workflow should stop safely when the tool cannot be trusted.

Improve local recovery

For local environments, consider:

automatic service startup;
clearer port documentation;
restart tests;
power and sleep changes;
hardware monitoring;
local database checks;
source path checks; and
offline tests.

Improve remote recovery

For remote environments, consider:

provider status monitoring;
backup endpoints;
credential renewal planning;
VPN checks;
proxy documentation;
approved fallback services;
network escalation; and
provider contacts.

Define future outage thresholds

Decide when to:

issue a warning;
pause schedules;
stop write workflows;
use a fallback;
contact the server owner;
declare recovery;
resume automation; and
perform a post-incident review.

Clear thresholds reduce uncertainty during the next event.

A practical outage routine

Use this process:

1. Confirm the symptom.
2. Identify the affected server, tool, workflow, and schedule.
3. Stop risky writes.
4. Pause dependent schedules.
5. Check active runs and destinations.
6. Preserve safe evidence.
7. Review MCP Servers.
8. Review Workbench Activity and RunFlows.
9. Check recent changes.
10. Check endpoint, authentication, permissions, and network.
11. Check the connected source and destination.
12. Run the known-good read test.
13. Restore one layer at a time.
14. Compare the recovered result with the baseline.
15. Test Workbench, Studio, and RunFlows.
16. Test write tools separately.
17. Use a one-time scheduled test.
18. Resume schedules gradually.
19. Review missed work and duplicates.
20. Complete the incident review.

A safe outage response protects data and external systems while restoring the MCP service in a controlled way.
