How to Improve an Unreliable AI Workflow
An unreliable AI workflow may work on one example and fail on the next.
It may return different formats, omit required information, choose the wrong route, misuse a tool, time out, or produce an answer that sounds convincing but is not supported by the source.
The fastest way to improve it is not to rewrite everything at once.
Use a structured troubleshooting process:
- define the failure clearly;
- save a representative failing example;
- locate the first step that becomes incorrect;
- identify the type of failure;
- change one variable;
- rerun a fixed test set;
- compare the result with the previous version; and
- add monitoring so the problem remains visible.
Reliability is a property of the complete workflow.
A stronger model cannot repair an unreadable source, a broken tool, an invalid route, or a missing approval step.
Define what unreliable means
Begin by describing the problem in observable terms.
Avoid:
The workflow is bad.
Use:
The workflow omits at least one required action item in four of ten test meetings.
Other useful failure descriptions include:
- the output format is invalid;
- the category changes between repeated runs;
- a required field is missing;
- a total does not match the source;
- the wrong branch is selected;
- the tool receives an incorrect parameter;
- the workflow continues after an error;
- the model invents information;
- the result takes too long; or
- the cost exceeds the expected limit.
A specific failure can be measured.
A vague complaint cannot.
Reproduce the problem
Save the exact input that produced the failure.
Also record:
- workflow version;
- model and provider;
- instruction;
- source files;
- tools;
- expected result;
- actual result;
- time and cost where available; and
- the route taken.
Run the same example again.
If the failure is intermittent, repeat it several times.
Generative output can vary, so one successful retry does not prove the problem is fixed.
A reproducible case gives you a stable starting point for improvement.
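As a minimal sketch of the record described above, a saved case can be a small data structure plus a helper that reruns the same input several times. The names `FailureCase` and `failure_rate` are illustrative, and `run` stands for whatever callable executes your workflow; none of them come from a specific product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FailureCase:
    """One saved failing example. Field names are illustrative."""
    workflow_version: str
    model: str
    instruction: str
    input_text: str
    expected: str
    actual: str
    route_taken: str = ""


def failure_rate(run: Callable[[str], str], case: FailureCase, attempts: int = 10) -> float:
    """Rerun the same input and return the share of runs that miss the expected result."""
    failures = sum(1 for _ in range(attempts) if run(case.input_text) != case.expected)
    return failures / attempts
```

A rate measured over several attempts is more informative for an intermittent failure than a single pass or fail.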
Find the first incorrect step
Inspect the workflow from the beginning.
Ask:
- Did the correct input arrive?
- Was the source prepared correctly?
- Did the model receive the expected information?
- Did the AI step return the required fields?
- Did validation pass incorrectly?
- Did a condition choose the wrong route?
- Did the tool receive the correct parameters?
- Did the final output preserve earlier information?
Correct the earliest failing step.
A wrong final report may originate in document extraction, not in the final writing prompt.
Fixing the final step can hide the symptom while leaving the real problem in place.
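The idea of locating the earliest failing step can be sketched as a loop that runs each step and checks its output before moving on. This assumes each step can be expressed as a function with a matching check; the step names and the `first_incorrect_step` helper are hypothetical.

```python
from typing import Callable, Optional

# (step name, function that runs the step, check on the step's output)
Step = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]


def first_incorrect_step(steps: list[Step], data: dict) -> Optional[str]:
    """Run the steps in order and return the name of the first one whose check fails."""
    for name, run, check in steps:
        data = run(data)
        if not check(data):
            return name
    return None  # every step passed its check


steps: list[Step] = [
    ("extract", lambda d: {**d, "text": d["raw"].strip()}, lambda d: bool(d["text"])),
    ("classify", lambda d: {**d, "label": "Billing"}, lambda d: d["label"] in {"Billing", "Other"}),
]
```

With these example steps, an input whose raw text is blank is reported at `extract`, before the classification step runs on empty text.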
Separate quality failures from operational failures
Different problems require different remedies.
Quality failures include:
- wrong classifications;
- incomplete summaries;
- unsupported claims;
- poor extraction;
- unsuitable tone; and
- inconsistent structure.
Operational failures include:
- provider timeouts;
- unavailable local models;
- tool errors;
- permission failures;
- invalid credentials;
- file-read errors;
- broken routes; and
- duplicate write actions.
A better prompt may improve a summary.
It will not repair an expired credential or unavailable model service.
Diagnose the failure category before changing the model.
Check the input first
Poor input often creates poor output.
Review whether the source is:
- complete;
- readable;
- relevant;
- correctly ordered;
- within the supported size;
- labelled clearly;
- free from duplicate sections;
- using the expected language; and
- available to the selected model.
For files, confirm that text extraction preserved:
- headings;
- tables;
- names;
- dates;
- amounts;
- page order; and
- special characters.
Do not ask the model to compensate for a source that the workflow failed to read correctly.
Add input validation before the AI step.
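A minimal sketch of such a check, assuming the source has already been extracted to plain text. The size limit and the choice of problems to detect are assumptions; adapt them to what your workflow actually supports.

```python
def validate_input(text: str, max_chars: int = 20_000) -> list[str]:
    """Return a list of problems found in the extracted source; an empty list means it passed."""
    problems = []
    if not text.strip():
        problems.append("empty source")
    if len(text) > max_chars:
        problems.append("source exceeds supported size")
    if "\ufffd" in text:
        # U+FFFD is the Unicode replacement character, a common sign of a decoding error.
        problems.append("replacement characters suggest broken text extraction")
    return problems
```

If the list is not empty, the workflow can request corrected input or route to review instead of calling the model.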
Reduce unnecessary context
More context can make a result worse when it includes irrelevant, duplicated, outdated, or conflicting information.
Give each step only what it needs.
Remove:
- repeated email history;
- unrelated attachments;
- obsolete instructions;
- duplicate documents;
- unneeded personal details;
- irrelevant tool results; and
- context intended for another step.
Preserve information that changes meaning.
Context reduction should improve focus, not remove necessary evidence.
When several source versions are required, label them clearly.
Narrow the AI task
Broad prompts create more possible failure modes.
Instead of:
Read this request, analyse it, decide what to do, update the record, and write a response.
divide the workflow:
Request
→ Classify
→ Extract Details
→ Validate
→ Draft Response
→ Human Review
Each step should have one clear responsibility.
This makes it easier to:
- test;
- compare models;
- validate output;
- replace one weak component;
- reuse successful steps; and
- understand where an error occurred.
Do not split a simple task unnecessarily.
Use the smallest number of steps that remains clear and controllable.
Improve the instruction
A reliable instruction should define:
- the task;
- the allowed source;
- required fields;
- output format;
- missing-value behaviour;
- prohibited assumptions; and
- review conditions.
For example:
Read the customer message.
Return:
1. one Topic from Billing, Delivery, Technical issue, or Other;
2. a one-sentence summary;
3. any order number stated;
4. missing information; and
5. whether human review is required.
Use only the message.
Write "Not provided" when a detail is absent.
Do not infer payment status, delivery date, or customer identity.
Remove conflicting or decorative instructions.
Clear and focused prompts are easier to evaluate than long prompts containing several priorities.
Add examples carefully
Examples can clarify labels, formats, and difficult distinctions.
Use examples when the model repeatedly confuses:
- two categories;
- proposals and decisions;
- invoice date and due date;
- source facts and suggestions; or
- missing values and inferred values.
Include both positive and negative examples.
Do not include so many examples that the actual source becomes difficult to find.
Test whether the examples improve new cases rather than only the examples themselves.
Avoid placing confidential production data inside reusable prompts.
Strengthen the output structure
Free-form responses are difficult for later steps to use.
Define fields such as:
Category:
Summary:
Required details:
Missing information:
Source evidence:
Review required:
Specify allowed values.
Define how dates, numbers, lists, and empty fields should appear.
A stable structure makes validation and routing easier.
It does not guarantee correctness.
A well-formatted false value is still false.
Validate the content against the source where the field matters.
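As a sketch, a later step can parse the labelled fields shown above and report which ones are absent. It assumes the model returns one `Name: value` pair per line; the helper names are illustrative.

```python
REQUIRED_FIELDS = [
    "Category",
    "Summary",
    "Required details",
    "Missing information",
    "Source evidence",
    "Review required",
]


def parse_fields(output: str) -> dict[str, str]:
    """Collect 'Name: value' lines whose name is one of the expected fields."""
    fields = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in REQUIRED_FIELDS:
            fields[name.strip()] = value.strip()
    return fields


def missing_fields(fields: dict[str, str]) -> list[str]:
    """Return the expected fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

This only confirms that the structure is present. Whether each value is true still has to be checked against the source.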
Add deterministic validation
Use fixed logic for exact checks.
Validate:
- required fields;
- allowed labels;
- date formats;
- numeric formats;
- thresholds;
- identifiers;
- output length;
- totals;
- duplicate records; and
- relationships between fields.
For example:
If Category is not approved → Review
If Invoice total is not numeric → Review
If Owner is Not provided → Clarification
If Tool status is Failed → Stop
Do not ask another AI model to perform a check that an ordinary expression or fixed rule can perform reliably.
Validation should stop or reroute bad output.
It should not silently convert uncertainty into a normal value.
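The four example rules above can be sketched as plain code. The record keys, the allowed category list, and the decision to check the tool status first are assumptions made for illustration.

```python
import re

ALLOWED_CATEGORIES = {"Billing", "Delivery", "Technical issue", "Other"}
NUMERIC = re.compile(r"-?\d+(\.\d+)?")  # plain decimal numbers such as 120 or 120.50


def route(record: dict) -> str:
    """Apply fixed checks in order and return the name of the next path."""
    if record.get("tool_status") == "Failed":
        return "Stop"
    if record.get("category") not in ALLOWED_CATEGORIES:
        return "Review"
    if not NUMERIC.fullmatch(str(record.get("invoice_total", "")).strip()):
        return "Review"
    if record.get("owner", "Not provided") == "Not provided":
        return "Clarification"
    return "Continue"
```

Every failed check leads to a named path, so an uncertain value is never passed on as if it were normal.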
Improve grounding and retrieval
When the workflow answers from documents or knowledge sources, verify the evidence supplied to the model.
Check whether retrieval returned:
- the correct source;
- the current version;
- enough surrounding context;
- the relevant section;
- a duplicate;
- an incomplete passage; or
- content with similar wording but a different meaning.
Test retrieval separately from generation.
Require the model to use only the supplied evidence and to return Not provided when the answer is absent.
Preserve source titles, identifiers, sections, or page references.
A better generation prompt cannot compensate for consistently wrong retrieved evidence.
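One simple grounding check, sketched under the assumption that the model is asked to return a verbatim quote as its evidence: confirm that the quote really appears in the retrieved passages. Exact matching is a deliberately strict choice and will reject paraphrased evidence.

```python
def normalise(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not cause false mismatches."""
    return " ".join(text.split()).casefold()


def evidence_is_supported(evidence: str, passages: list[str]) -> bool:
    """Return True only when the quoted evidence occurs in at least one supplied passage."""
    quote = normalise(evidence)
    return bool(quote) and any(quote in normalise(passage) for passage in passages)
```

A quote that fails this check is a candidate for review. Because the check runs on the retrieved passages alone, it can be tested separately from generation.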
Compare models fairly
A different model may improve the task, but compare it under the same conditions.
Use the same:
- test inputs;
- instruction;
- source material;
- tools;
- output structure;
- context; and
- evaluation criteria.
Compare:
- accuracy;
- completeness;
- format compliance;
- missing-information handling;
- latency;
- cost;
- tool use;
- supported input types; and
- local hardware requirements.
A larger model is not always the best choice for a focused classification or extraction task.
Choose the model that meets the task requirements consistently.
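A fair comparison can be sketched as one function that feeds the same cases to every candidate and scores them with the same rule. Each model is represented here by a plain callable; how you call a real provider is outside this sketch, and exact-match scoring is an assumption that suits classification better than free text.

```python
from typing import Callable


def compare_models(
    models: dict[str, Callable[[str], str]],
    cases: list[tuple[str, str]],
) -> dict[str, float]:
    """Return each model's exact-match accuracy on the same (input, expected) cases."""
    return {
        name: sum(run(text) == expected for text, expected in cases) / len(cases)
        for name, run in models.items()
    }


cases = [("I was charged twice", "Billing"), ("My parcel is late", "Delivery")]
stub_models = {
    "model_a": lambda text: "Billing",
    "model_b": lambda text: "Billing" if "charged" in text else "Delivery",
}
```

The stub models stand in for real calls so that the harness itself can be verified before any provider is involved.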
Inspect tool calls
Tool failures can look like model failures.
Confirm:
- whether the tool was called;
- which parameters it received;
- what credentials and permissions it used;
- what result it returned;
- whether the result was complete;
- whether the model interpreted it correctly; and
- whether the final action occurred at the intended destination.
A model may claim that it saved or sent something when the action failed or never occurred.
Review the activity record and check the destination.
Use safe test accounts and destinations for write actions.
Limit tool autonomy
If the workflow is unpredictable, reduce what the model is allowed to choose.
You can:
- remove unnecessary tools;
- separate read and write tools;
- predefine important parameters;
- restrict destinations;
- require approval before write actions;
- cap the number of tool calls;
- stop repeated retries; and
- replace an agent with a fixed workflow where the path is known.
More autonomy creates more possible behaviour.
Use only the flexibility the task requires.
A model that can draft a message does not need permission to send it automatically.
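Several of these limits can be combined in a small gate that sits between the model and its tools. This is a sketch under assumptions: the tool names are hypothetical, and the approval flag would come from whatever review step your workflow uses.

```python
class ToolGate:
    """Decides whether a requested tool call may run."""

    def __init__(self, read_tools: set[str], write_tools: set[str], max_calls: int = 5):
        self.read_tools = read_tools
        self.write_tools = write_tools
        self.max_calls = max_calls
        self.calls = 0  # number of calls allowed so far

    def authorise(self, tool: str, approved: bool = False) -> str:
        if tool not in self.read_tools | self.write_tools:
            return "deny"  # tool is not on the allowed list
        if self.calls >= self.max_calls:
            return "stop"  # cap reached; no further calls
        if tool in self.write_tools and not approved:
            return "needs_approval"  # write actions wait for a human decision
        self.calls += 1
        return "allow"
```

Separating read and write tools in the gate means a drafting step can look things up freely while sending or saving still waits for approval.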
Improve routing logic
A correct AI output can still reach the wrong path.
Review every condition.
Check:
- exact label spelling;
- case and whitespace;
- missing-value handling;
- default routes;
- overlapping conditions;
- branch order;
- invalid labels; and
- whether every route reaches an output.
Include Other, Unclear, review, and error paths where appropriate.
Test one clear example for every branch.
Also test inputs that fit several routes or none.
Avoid treating the default branch as a normal success path.
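Case and whitespace problems can be handled by normalising the label before the comparison, and invalid labels can be sent to an explicit error path. The route names below are hypothetical.

```python
from typing import Optional

ROUTES = {
    "billing": "billing_queue",
    "delivery": "delivery_queue",
    "technical issue": "support_queue",
    "other": "review",
    "unclear": "review",
}


def choose_route(label: Optional[str]) -> str:
    """Normalise the label, then look it up; anything unrecognised goes to the error path."""
    if label is None:
        return "error"
    key = " ".join(label.split()).casefold()  # trim, collapse whitespace, ignore case
    return ROUTES.get(key, "error")
```

Here the default is an error path rather than a success path, so a misspelt or unexpected label becomes visible instead of being processed as normal.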
Add fallback paths
A reliable workflow explains what happens when the preferred path cannot continue.
Possible fallbacks include:
- request corrected input;
- retry a temporary provider failure;
- use an approved alternative model;
- return a partial result with a warning;
- stop before a write action;
- route to human review; or
- save the case for later investigation.
Fallbacks should be explicit.
Avoid indefinite retries.
Be careful when retrying after a write action, because the first attempt may have completed even when the response was lost.
Check for duplicates before repeating the action.
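A bounded retry with a duplicate check can be sketched as follows. It assumes the destination can be asked whether an action with a given key has already completed; `already_done`, `action`, and `TemporaryError` are placeholders for your own tool calls and error type.

```python
import time
from typing import Callable


class TemporaryError(Exception):
    """Stands for a failure that may succeed on retry, such as a timeout."""


def write_once(
    key: str,
    action: Callable[[str], None],
    already_done: Callable[[str], bool],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> str:
    """Try a write action a limited number of times without repeating a completed write."""
    for _ in range(max_attempts):
        if already_done(key):
            return "already completed"
        try:
            action(key)
            return "completed"
        except TemporaryError:
            time.sleep(delay_seconds)
    # The last attempt may have completed even though its response was lost.
    return "already completed" if already_done(key) else "gave up: route to review"
```

The check before each attempt covers the case where the first write succeeded but its response never arrived.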
Add human review at the weak point
Human review should be placed where uncertainty or impact is highest.
It may be needed when:
- a required field is missing;
- sources conflict;
- the model returns Unclear;
- a tool proposes an external action;
- the result affects customers or money;
- the workflow handles legal, medical, employment, safety, security, or access-related information; or
- the failure rate remains above the accepted threshold.
Show the reviewer:
- original input;
- intermediate output;
- validation failures;
- source evidence;
- tool activity;
- proposed action; and
- reason for escalation.
Human review should control what the workflow is allowed to do next: the reviewer's decision should be able to approve, change, or block the proposed action.
Change one variable at a time
When troubleshooting, avoid changing the prompt, model, tools, and workflow structure simultaneously.
You will not know which change helped or created a new problem.
Use a sequence such as:
- save the current version;
- select one hypothesis;
- change one component;
- run the complete test set;
- compare the metrics;
- keep or revert the change; and
- document the result.
Some changes interact, but isolated experiments create clearer evidence.
After isolated improvements, run an end-to-end test to confirm that the components still work together.
Use a fixed evaluation set
Keep a representative set of known examples.
Include:
- normal inputs;
- missing information;
- conflicting information;
- long inputs;
- unusual wording;
- invalid files;
- every route;
- tool failures;
- provider failures;
- malicious instructions;
- cases requiring abstention; and
- high-impact edge cases.
Define expected results or a clear scoring rubric before testing.
Run the same set after each material change.
This prevents an improvement on one example from hiding a regression on another.
Add important real-world failures to the set.
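Comparing two runs of the same set can be as simple as the sketch below, which assumes each run is stored as a mapping from case name to pass or fail. The case names are made up for illustration.

```python
def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed before the change and do not pass now."""
    return sorted(name for name, passed in previous.items() if passed and not current.get(name, False))


def improvements(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that pass now and did not pass before the change."""
    return sorted(name for name, passed in current.items() if passed and not previous.get(name, False))


before = {"normal": True, "missing_owner": True, "long_input": False}
after = {"normal": True, "missing_owner": False, "long_input": True}
```

In this example the change fixes `long_input` but breaks `missing_owner`, which a check on the single improved case would have missed.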
Measure the right metrics
Reliability needs more than one number.
Measure:
- task completion;
- field accuracy;
- classification accuracy;
- groundedness;
- format compliance;
- missing-information handling;
- human correction rate;
- review time;
- workflow failure rate;
- tool failure rate;
- latency;
- cost per approved result; and
- high-impact error rate.
Separate output-quality metrics from operational metrics.
A workflow can produce excellent summaries but fail frequently because its provider or tools are unavailable.
It can also run successfully while producing poor content.
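The separation can be sketched by recording two facts per run, whether it completed and whether a completed result was judged correct, and reporting them as different numbers. The record shape and metric names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    completed: bool  # operational: did the workflow finish without an error?
    correct: bool    # quality: was the result judged correct? Ignored if not completed.


def summarise(runs: list[RunRecord]) -> dict[str, Optional[float]]:
    """Report the operational failure rate separately from accuracy on completed runs."""
    completed = [run for run in runs if run.completed]
    return {
        "workflow_failure_rate": 1 - len(completed) / len(runs),
        "accuracy_when_completed": (
            sum(run.correct for run in completed) / len(completed) if completed else None
        ),
    }
```

Keeping the two numbers apart shows whether the next fix belongs in the provider and tool layer or in the prompt and validation layer.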
Add monitoring and observability
A workflow that works during testing can degrade later.
Monitor:
- model errors;
- provider availability;
- invalid output;
- reviewer corrections;
- unusual routes;
- tool failures;
- retries;
- latency;
- cost;
- source changes; and
- model or provider changes.
Keep enough activity information to locate the first incorrect step.
Avoid logging unnecessary sensitive content.
Define who reviews failures, how quickly they respond, and when the workflow should be paused.
Monitoring should lead to action, not merely accumulate records.
Improve an AI workflow in Feluda
Feluda supports a practical troubleshooting path across Workbench, Studio, and RunFlows.
Begin in Workbench.
Reproduce the failing AI task with the same source and instruction.
Compare models only after confirming that the source and prompt are correct.
In Studio, inspect the workflow one block at a time.
Use focused blocks:
- LLM for summarisation, comparison, analysis, or drafting;
- LLM Label for classification;
- LLM Extract for named fields;
- Expression for deterministic checks, calculations, and routing;
- Emit for useful intermediate output; and
- Output for success, review, fallback, and error results.
Rename blocks according to their responsibility.
This makes the first failing step easier to identify.
Use RunFlows for regression testing
Save the current workflow before making a change.
Use RunFlows to test the same representative inputs after each revision.
Compare:
- outputs;
- routes;
- errors;
- intermediate results;
- model behaviour;
- tool activity; and
- final destinations.
For local models, test what happens when the model service is unavailable.
For cloud providers, test invalid credentials, unavailable models, and temporary errors.
For tools, confirm both the activity record and the destination.
Schedule a workflow only after its failing cases, fallback paths, and monitoring responsibilities are understood.
Common troubleshooting mistakes
Avoid:
- changing everything at once;
- blaming the model before checking the input;
- improving only one ideal example;
- adding more context without checking relevance;
- making the prompt longer without making it clearer;
- using another model as the only validator;
- ignoring routing and tool errors;
- retrying write actions without duplicate protection;
- removing human review before reliability is demonstrated;
- measuring only successful executions;
- failing to preserve previous versions; and
- deploying changes without rerunning the test set.
Troubleshooting should reduce uncertainty about the workflow.
Every change should answer a specific hypothesis.
Improve the weakest step first
Do not rebuild the complete workflow because one result is poor.
Reproduce the failure.
Locate the first incorrect step.
Fix the input, instruction, structure, validation, route, model, tool, or fallback that caused it.
Rerun the same evaluation set.
Keep changes that improve the complete workflow without creating new regressions.
A reliable AI workflow is usually not the result of one perfect prompt.
It is the result of clear responsibilities, controlled inputs, deterministic checks, visible failures, measured revisions, and ongoing monitoring.