Test and Improve a Workflow

A workflow is not ready after one successful run.

It should be tested with different examples so you can understand how it behaves with normal, incomplete, unexpected, and difficult input.

Testing helps you find:

  • unclear instructions;
  • missing connections;
  • incorrect model choices;
  • weak classifications;
  • missing fields;
  • tool errors;
  • broken branches; and
  • outputs that are difficult to review.

Improve one problem at a time, then test the workflow again.

Begin with a clear expected result

Before running the workflow, decide what a good result should look like.

Ask:

  • What information should enter the flow?
  • What should each important step produce?
  • What should the final output contain?
  • What should happen when information is missing?
  • What should happen when a step fails?
  • Which results require human review?

When the expected result is unclear, it is difficult to decide whether the workflow passed the test.

Use a small first example

Start with a short, realistic example that you understand well.

For a meeting-summary workflow, you could use:

The team approved the new homepage design.
Sam will prepare the final files by Friday.
The launch date has not been decided.

You already know what the result should contain:

  • the approved design decision;
  • Sam's action;
  • the Friday deadline; and
  • the missing launch date.

This makes incorrect output easier to notice.

Test the main path first

Begin with the most common type of input.

Run the workflow and check:

  • whether the Input block received the complete information;
  • whether each step completed;
  • whether the correct model was used;
  • whether the expected path was followed;
  • whether tools returned useful results; and
  • whether the final Output block returned the correct result.

Do not add more test cases until the main path works.

Review each step, not only the final answer

A final output may be wrong because an earlier step produced incorrect information.

Trace the workflow from the beginning.

For each block, ask:

  1. What information did the block receive?
  2. What task was it expected to perform?
  3. What result did it produce?
  4. Was the result passed to the correct next block?
  5. Did the next block receive enough information?

Find the first step where the actual result differs from the expected result.

Correct that step before changing later blocks.
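The tracing questions above can also be sketched outside the product. The Python snippet below is only an illustration: it assumes you have noted the expected and actual result of each block by hand, and the function name `first_divergence` and the recorded values are hypothetical, not part of Studio.

```python
from typing import Optional


def first_divergence(
    order: list[str], expected: dict[str, str], actual: dict[str, str]
) -> Optional[str]:
    """Return the first block, in flow order, whose actual result differs."""
    for block in order:
        if actual.get(block) != expected.get(block):
            return block
    return None  # every recorded block matched


# Hypothetical notes taken while tracing a meeting-summary flow.
order = ["Input", "Extract Details", "Prepare Report", "Output"]
expected = {
    "Input": "3 sentences",
    "Extract Details": "owner=Sam; deadline=Friday",
    "Prepare Report": "report lists Sam and Friday",
    "Output": "report lists Sam and Friday",
}
actual = {
    "Input": "3 sentences",
    "Extract Details": "owner=Sam; deadline=Not provided",
    "Prepare Report": "report lists Sam, no deadline",
    "Output": "report lists Sam, no deadline",
}

print(first_divergence(order, expected, actual))
```

Here three blocks differ from the expected result, but only the earliest one is reported, because the later differences may simply follow from it.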

Use intermediate results

Intermediate results help you understand how information changes through the flow.

When available, review:

  • extracted fields;
  • labels;
  • transformed text;
  • tool responses;
  • decision outcomes; and
  • emitted progress results.

An Emit block can be useful while testing a longer workflow.

For example:

Input
→ Extract Details
→ Emit Extracted Details
→ Prepare Report
→ Output

The emitted result lets you review the extracted details before the report is created.

Remove testing output once it no longer helps the user.

Create a test set

A test set is a collection of examples that represent the different kinds of input the workflow may receive.

Include:

  • normal input;
  • short input;
  • long input;
  • missing information;
  • unexpected formatting;
  • unclear information;
  • conflicting information;
  • unrelated input; and
  • an empty or nearly empty input.

Reuse the same test set after important changes.

This helps you see whether an improvement for one example caused a problem elsewhere.

Test normal input

Normal input represents the task the workflow is designed to handle.

Use several realistic examples rather than one perfect example.

For a customer-message workflow, test:

  • a clear question;
  • a clear complaint;
  • a simple request; and
  • a message containing several details.

Confirm that the workflow produces the expected output consistently.

Test short input

Very short input may not contain enough information.

For example:

Please help.

Decide what the workflow should do.

It may:

  • return a request for more information;
  • classify the input as unclear;
  • send it to human review; or
  • stop with a clear warning.

The workflow should not invent a complete situation from a short message.

Test missing information

Use an example where a required field is absent.

For example:

Mia will prepare the report.
No deadline was provided.

The workflow should make the missing deadline visible.

Add an instruction such as:

If a required detail is missing, return "Not provided."
Do not guess.

Test again after updating the relevant block.
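If you check extracted results outside the workflow, the same rule can be expressed as a small check. This is a minimal sketch in Python, assuming the extracted fields are available as a dictionary; `mark_missing` and the field names are hypothetical.

```python
NOT_PROVIDED = "Not provided"


def mark_missing(fields: dict, required: list[str]) -> dict:
    """Return a copy where absent or blank required values become 'Not provided'."""
    result = dict(fields)
    for name in required:
        value = result.get(name)
        if value is None or not str(value).strip():
            result[name] = NOT_PROVIDED  # make the gap visible, do not guess
    return result


# Hypothetical extraction result for the example above: no deadline was given.
extracted = {"owner": "Mia", "action": "prepare the report", "deadline": ""}
checked = mark_missing(extracted, ["owner", "action", "deadline"])
print(checked["deadline"])
```

The original dictionary is left unchanged, so you can still compare it with the checked version.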

Test unexpected formatting

Real input may not follow one neat format.

Test:

  • bullet points;
  • paragraphs;
  • copied email text;
  • extra spaces;
  • headings;
  • mixed punctuation; and
  • information in a different order.

The workflow should still find the required information when the meaning is clear.

If formatting changes cause repeated failures, improve the preparation or extraction step.

Test long input

Long input can reveal different problems.

The model may:

  • miss details;
  • focus too much on the beginning;
  • ignore the requested format;
  • produce an overly long answer; or
  • receive more input than it can handle.

When this happens:

  • remove irrelevant material;
  • divide the source into smaller sections;
  • add an earlier summarisation step;
  • use a model suited to longer input; or
  • process the document in stages.

More input does not always produce a better result.

Test unrelated input

Give the workflow information outside its intended purpose.

For example, provide a recipe to a flow designed for customer-support messages.

Decide whether the workflow should:

  • reject the input;
  • label it as unrelated;
  • return a clear explanation; or
  • route it to human review.

This prevents the flow from producing a confident but meaningless result.

Test conflicting information

Source material may contain different values for the same detail.

For example:

The draft is due on Thursday.
A later note says the draft is due on Friday.

The workflow should not silently choose one value unless the process defines how to decide.

Ask the model to:

  • identify the conflict;
  • list both values;
  • explain which statement appears later; or
  • send the result for review.

Make uncertainty visible.
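One way to picture this behaviour is a helper that reports every value found for a detail instead of choosing one. The Python sketch below is an assumption about how such a check could look; `summarise_detail` and its result keys are hypothetical.

```python
def summarise_detail(name: str, values: list[str]) -> dict:
    """Report the values found for one detail, in source order, without picking one."""
    distinct = list(dict.fromkeys(values))  # drop repeats, keep source order
    if not distinct:
        return {"detail": name, "status": "missing", "values": []}
    if len(distinct) == 1:
        return {"detail": name, "status": "ok", "values": distinct}
    return {
        "detail": name,
        "status": "conflict",
        "values": distinct,
        "latest": values[-1],  # the statement that appears later, not a decision
        "needs_review": True,
    }


result = summarise_detail("draft deadline", ["Thursday", "Friday"])
print(result["status"], result["values"])
```

The `latest` entry only records which statement came last. Whether the later value wins is a process decision, so the result is still flagged for review.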

Test every label

When the workflow uses LLM Label, prepare examples for every label.

For each label, confirm that:

  • a clear example follows the correct path;
  • similar labels are not confused;
  • unclear input follows the intended fallback path; and
  • the final output matches the selected route.

Include examples that could fit more than one label.

Improve label names and descriptions when the model cannot distinguish them consistently.

Test every decision path

A workflow with branches is not fully tested until every branch has been run.

Create at least one example for:

  • each normal branch;
  • the fallback branch;
  • the human-review path;
  • each error path; and
  • any retry path.

Confirm that every route reaches an intentional endpoint.

A branch should not stop without returning a result or clear error.
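A simple way to track branch coverage is to compare the routes your test examples actually followed with the full list of routes. The Python sketch below assumes you record the observed route for each test by hand; the route names and `untested_routes` are hypothetical.

```python
def untested_routes(all_routes: set[str], observed: list[str]) -> list[str]:
    """Return the routes that no test example has followed yet."""
    return sorted(all_routes - set(observed))


# Hypothetical routes of a customer-message flow.
all_routes = {"question", "complaint", "request", "fallback", "human review", "tool error"}

# Routes followed by the examples run so far.
observed = ["question", "complaint", "question", "fallback"]

print(untested_routes(all_routes, observed))
```

An empty list means every route has been run at least once. It does not show that each route produced the right result.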

Test extraction fields

When using LLM Extract, verify every field against the original source.

Check:

  • names;
  • dates;
  • amounts;
  • reference numbers;
  • organisations;
  • locations; and
  • required actions.

Test sources where:

  • every field is present;
  • one field is missing;
  • several values appear;
  • the wording changes; and
  • information conflicts.

Structured output can look correct even when one value is wrong.
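A field-by-field comparison makes a single wrong value easier to spot. The following Python sketch assumes you have written the correct values from the original source yourself; `mismatched_fields` and the field names are hypothetical and do not describe how LLM Extract stores its results.

```python
def mismatched_fields(
    verified: dict[str, str], extracted: dict[str, str]
) -> dict[str, tuple]:
    """Return {field: (verified value, extracted value)} for every field that differs."""
    differences = {}
    for field, want in verified.items():
        got = extracted.get(field)  # None when the field was not extracted at all
        if got != want:
            differences[field] = (want, got)
    return differences


# Values checked by hand against the source text.
verified = {"name": "Sam", "action": "prepare the final files", "deadline": "Friday"}

# Hypothetical extraction result that looks well formed but has one wrong value.
extracted_fields = {"name": "Sam", "action": "prepare the final files", "deadline": "Thursday"}

print(mismatched_fields(verified, extracted_fields))
```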

Test Expression rules

Expression blocks should be tested around their boundaries.

If a rule checks whether an amount is above 100, test:

  • 99;
  • 100;
  • 101;
  • a missing amount; and
  • an invalid value.

If a rule checks a category, test the exact expected value and similar values.

Fixed rules should produce predictable results for every defined case.
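The boundary cases above can be written down as a small table of inputs and expected outcomes. The Python sketch below uses a hypothetical stand-in for the rule, `amount_rule`, to show the idea. It is not the Expression block syntax, and the outcome names are assumptions.

```python
import math
from typing import Any


def amount_rule(amount: Any) -> str:
    """Hypothetical stand-in for a rule that checks whether an amount is above 100."""
    if amount is None:
        return "missing"
    try:
        value = float(amount)
    except (TypeError, ValueError):
        return "invalid"
    if math.isnan(value):
        return "invalid"
    return "above" if value > 100 else "not above"


# One case on each side of the boundary, the boundary itself, and the two bad inputs.
cases = {99: "not above", 100: "not above", 101: "above", None: "missing", "abc": "invalid"}

for value, want in cases.items():
    got = amount_rule(value)
    print(value, got, "pass" if got == want else "fail")
```

The value 100 is the case most likely to reveal a mistake, because "above 100" and "100 or more" give different results only at the boundary.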

Test tools safely

Use non-sensitive sample information when testing a tool.

For read actions, check:

  • whether the correct source was used;
  • whether the returned information is complete;
  • whether the result is current; and
  • whether the next block uses it correctly.

For write actions, check:

  • the destination;
  • the content;
  • the title or name;
  • whether the action can be reversed; and
  • whether the result was actually created.

Review the Activity log and confirm the final result at its destination.

Avoid duplicate write actions

When a tool appears to time out or fail, check whether the action completed before running the workflow again.

Repeating the flow could create:

  • duplicate Journal entries;
  • duplicate files;
  • repeated messages;
  • repeated updates; or
  • another unintended action.

Confirm the destination first.
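The idea of confirming the destination before repeating a write can be sketched as follows. This Python example uses an in-memory dictionary as a stand-in for a real destination; `write_once`, the key, and the content are hypothetical, and a real tool may offer a different way to check for an existing item.

```python
def write_once(destination: dict[str, str], key: str, content: str) -> str:
    """Write content under key only if the destination does not already hold it."""
    if key in destination:
        return "already exists"  # an earlier attempt completed, so do not repeat it
    destination[key] = content
    return "created"


journal: dict[str, str] = {}

# First attempt: the write completes, even if the caller later sees a timeout.
first = write_once(journal, "homepage-meeting-summary", "Design approved.")

# Retry after the apparent failure: the destination is checked before writing.
second = write_once(journal, "homepage-meeting-summary", "Design approved.")

print(first, second, len(journal))
```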

Test provider failures

A cloud provider or local model may become unavailable.

Test how the workflow behaves when:

  • the provider cannot be reached;
  • the selected model is unavailable;
  • the local model application is closed;
  • a request times out; or
  • the model returns an error.

The workflow should show a clear failure or follow a defined error path.

It should not return an incomplete result as if the process succeeded.
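The difference between a visible failure and a silent one can be illustrated with a small wrapper. The Python sketch below is not how the product calls a provider; `run_model_step` and the two provider functions are hypothetical stand-ins used to simulate an unreachable provider.

```python
from typing import Callable


def run_model_step(call_model: Callable[[str], str], text: str) -> dict:
    """Run one model step and report failures explicitly, never as a partial result."""
    try:
        output = call_model(text)
    except (ConnectionError, TimeoutError) as error:
        return {"status": "failed", "error": f"{type(error).__name__}: {error}", "output": None}
    if not output:
        return {"status": "failed", "error": "model returned no output", "output": None}
    return {"status": "ok", "error": None, "output": output}


def unreachable_provider(text: str) -> str:
    """Simulates a provider that cannot be reached."""
    raise ConnectionError("provider cannot be reached")


def working_provider(text: str) -> str:
    """Simulates a provider that responds normally."""
    return "Summary: " + text


failed = run_model_step(unreachable_provider, "Please help.")
succeeded = run_model_step(working_provider, "The team approved the design.")
print(failed["status"], failed["error"])
print(succeeded["status"])
```

Because the failed result carries a status and no output, a later step cannot mistake it for a finished answer.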

Test tool failures

A tool may fail because:

  • a connection is unavailable;
  • a required setting is missing;
  • an access key is invalid;
  • the source cannot be reached;
  • the request is not allowed; or
  • required information is missing.

Connect tool errors to a clear review or error output when possible.

Test the error path so you know what the user will see.

Test with cloud and local models separately

When the workflow may use different providers, test each intended model.

Models can differ in:

  • instruction following;
  • output structure;
  • speed;
  • extraction accuracy;
  • tool support; and
  • handling of long input.

Do not assume that changing the model will leave the workflow result unchanged.

Compare expected and actual results

Keep a simple record for important test cases.

Test             | Expected                     | Actual        | Result
Normal input     | Summary with all key details | Review output | Pass or fail
Missing deadline | "Not provided"               | Review output | Pass or fail
Unrelated input  | Review warning               | Review output | Pass or fail
Tool unavailable | Clear error                  | Review output | Pass or fail

This makes repeated testing more consistent.

It also helps when several people review the same workflow.
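If you prefer to keep the record in a file or script, the same columns can be held in a small structure. This is a minimal Python sketch; the class `CaseRecord` is hypothetical, the actual values are made up for illustration, and the exact-match comparison is a simplifying assumption, because many outputs need a person to judge them.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    """One row of the test record: what was expected and what was observed."""
    name: str
    expected: str
    actual: str = ""

    @property
    def result(self) -> str:
        # Exact match is a simplification; free-text output usually needs human review.
        return "Pass" if self.actual == self.expected else "Fail"


# Illustrative rows; the actual values are invented for this sketch.
records = [
    CaseRecord("Missing deadline", "Not provided", "Not provided"),
    CaseRecord("Tool unavailable", "Clear error", "Empty summary"),
]

for record in records:
    print(f"{record.name} | {record.expected} | {record.actual} | {record.result}")
```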

Change one thing at a time

When a test fails, avoid changing several blocks at once.

Change one item, such as:

  • the instruction;
  • the model;
  • the label description;
  • the extraction field;
  • the Expression rule;
  • the connection; or
  • the tool setting.

Run the same test again.

This helps you understand which change solved the problem.

Improve an AI instruction

When an AI step produces a weak result, check whether the instruction includes:

  • one clear task;
  • the information to use;
  • the required details;
  • the output format;
  • a length or tone limit when needed; and
  • a rule for missing information.

For example:

Read the input and return:

1. a summary of no more than 80 words;
2. a list of decisions;
3. a table with Owner, Action, and Deadline; and
4. unanswered questions.

Use only the input.
If information is missing, write "Not provided."
Do not guess.

Test the instruction in Workbench when you want to compare versions quickly.

Improve labels

When classification is inconsistent:

  • make labels more distinct;
  • add a short description for each label;
  • remove overlapping categories;
  • add an Other or Human Review label;
  • provide clearer source input; and
  • test examples near the boundary between labels.

A small number of clear labels usually works better than many similar ones.

Improve extraction

When extraction is unreliable:

  • use clear field names;
  • explain what each field means;
  • define how to handle missing values;
  • divide large source material;
  • remove irrelevant content; and
  • compare another model.

Review whether the extracted format is easy for the next block to use.

Improve the workflow layout

A confusing canvas makes testing harder.

Arrange the flow so that:

  • the main path follows one direction;
  • block names describe their purpose;
  • decision routes are separated;
  • error paths are visible;
  • connection lines cross as little as possible; and
  • every path reaches an endpoint.

A reviewer should be able to trace the process without opening every block.

Improve the output

A technically successful flow may still return an unhelpful result.

Check whether the final output is:

  • clear;
  • complete;
  • easy to scan;
  • suitable for the audience;
  • consistent across runs; and
  • easy to verify.

Add a final formatting step only when it makes the result easier to use.

Use Workbench for focused testing

Workbench is useful when you need to test:

  • an instruction;
  • a model;
  • a classification idea;
  • an extraction format; or
  • a tool.

Once the individual task works, return to Studio and test it inside the complete workflow.

A task can work in Workbench but still fail when it receives different input from an earlier block.

Re-test after every important change

Re-run the relevant test set after changing:

  • a block instruction;
  • a model;
  • a label;
  • an extraction field;
  • a connection;
  • a tool;
  • an Expression rule;
  • an error path; or
  • the final output.

A change that improves one example can break another.

Test the saved flow in RunFlows

After Studio testing is complete:

  1. save the workflow;
  2. open RunFlows;
  3. select the saved flow;
  4. provide new test input;
  5. run the workflow; and
  6. review the output and any visible activity.

This confirms that the saved version works outside the Studio editing session.

Use realistic input before regular use

After sample testing, use realistic but non-sensitive input.

Confirm that:

  • the source format matches what users will provide;
  • the flow handles normal variation;
  • the result remains reviewable;
  • tool destinations are correct; and
  • errors remain visible.

Do not move directly from one simple example to unattended use.

Keep a human review step

Human review is especially important when the workflow affects:

  • customers;
  • employees;
  • money;
  • contracts;
  • legal rights;
  • health;
  • safety;
  • security; or
  • access to important services.

A workflow can prepare a summary, recommendation, or draft.

A person should approve important decisions and final actions.

Know when the workflow is ready

A workflow is ready for regular use when:

  • the main path works with several examples;
  • every branch has been tested;
  • missing information is handled;
  • unexpected input produces a clear result;
  • tool actions are confirmed;
  • provider and tool errors are visible;
  • the final output is consistent;
  • the saved flow works in RunFlows; and
  • human review is included where needed.

Continue monitoring the workflow after release.

Real input may reveal cases that were not included in the original test set.

Maintain a test set

Keep a small collection of examples for future changes.

Include:

  • a normal example;
  • a missing-information example;
  • an unexpected-input example;
  • an example for every branch;
  • a tool-error example; and
  • a provider-error example.

Run this set after changing the workflow.

This helps protect earlier behaviour while you improve the process.
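Comparing the results from before and after a change shows whether earlier behaviour was lost. The Python sketch below assumes you record a pass or fail value for each example in the set; `regressions` and the example names are hypothetical.

```python
def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """List the examples that passed before a change and do not pass after it."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)  # a missing result counts as not passing
    )


# Hypothetical results from two runs of the same test set.
before = {"normal": True, "missing deadline": False, "unrelated input": True}
after = {"normal": True, "missing deadline": True, "unrelated input": False}

print(regressions(before, after))
```

In this example the change fixed the missing-deadline case but broke the unrelated-input case, which is exactly what re-running the full set is meant to catch.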

A practical testing routine

Use this routine for each important workflow:

  1. Define the expected result.
  2. Test the main path.
  3. Review every step.
  4. Test missing information.
  5. Test unexpected input.
  6. Test every branch.
  7. Test tool and provider errors.
  8. Improve one issue.
  9. Re-run the same tests.
  10. Save and test the flow in RunFlows.
  11. Keep human review where the result matters.
  12. Preserve the test set for future updates.

Testing is not a separate final task.

It is part of building a workflow that users can understand and trust.

Frequently Asked Questions

How many examples should I use when testing a workflow?
Use several examples that cover normal input, missing information, unusual input, every branch, and important error paths. One successful run is not enough.

What should I fix first when the final output is wrong?
Find the earliest block that produced an unexpected result. Correcting the first failing step is usually more effective than changing the final block.

Why should I change only one thing at a time?
Changing one instruction, model, rule, or connection at a time helps you identify which change improved or damaged the result.

Should I keep test examples after the workflow is ready?
Yes. Reuse the test set after future changes so you can confirm that earlier behaviour still works.