Agentic Workflows course

notes on: Agentic workflows course by Andrew Ng

https://learn.deeplearning.ai/courses/agentic-ai/lesson/pu5xbv/welcome!

Module1

  • the whole thing is about decomposing a workflow into a process with separate stages. A stage can either ask an LLM to do something or call an external tool (DB access, google query, etc.)

    (example: writing an essay: write an outline, research the topic, write the essay)

    Then look at the result; you may need to change a step (like turning 'write the essay' into a loop of writing, revising, checking + research, redrafting)

  • Eval pattern usage / evaluation of results: if a workflow step can be measured objectively, then do it. Every kind of measurement is a good thing in this context. Example: you don't want to mention your competitors, so count whether a competitor name occurs in the output

  • Reflection: example workflow: a step writes function code? Reflection can mean:

    • run the code and see if it compiles - feed the errors back into the LLM for fixing.

    • make up a prompt: 'Here is the code {code}, provide constructive criticism and suggest improvements' - and feed the result back to the LLM (as a kind of self-review step)

    • have two prompts with different roles: one for the 'author', the second for the 'critic'; have them feed their output to each other, so that they improve the result (see the sketch after this list).

  • (external) tool use: (like MCP extension for accessing a calendar, google, db, git - whatever)

  • planning - make the LLM determine the workflow!

  • multiagentic workflows - example: agents for Software Developer, Tester, Architect - all collaborating together (q: isn't that a bit like the reflection pattern? DeepSeek says that they are related, however Reflection is a 'single agent with multiple hats' whereas multi-agent is 'different agents'. An Agent is defined by having its own 'identity' and its own context window, whereas this is not the case with the reflection pattern)

    More difficult to control, but can give good results
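
A minimal sketch of the two-role reflection loop mentioned in the reflection bullet above. Here llm() is a hypothetical placeholder for whatever chat-completion API the labs use, and the prompts are illustrative, not the course's exact wording.

```python
def llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to an LLM and return the reply text."""
    raise NotImplementedError("wire this up to your LLM provider")

AUTHOR_PROMPT = "Write a Python function that {task}. Return only code.\n{feedback}"
CRITIC_PROMPT = (
    "Here is the code:\n{code}\n"
    "Provide constructive criticism and suggest concrete improvements."
)

def reflect(task: str, rounds: int = 2) -> str:
    feedback = ""
    code = ""
    for _ in range(rounds):
        # 'author' role: produce (or revise) a draft
        code = llm(AUTHOR_PROMPT.format(task=task, feedback=feedback))
        # 'critic' role: review the draft, feed the critique back to the author
        critique = llm(CRITIC_PROMPT.format(code=code))
        feedback = "Address this review feedback:\n" + critique
    return code
```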

Module2

  • Reflection and Reflection with feedback are approaches that can improve the quality of the output significantly. Consider adding such stages if you get stuck with a task / the quality of the output does not get better.

  • Reflection is much better when you can process external input (like error messages or test run output). Also use reasoning models for the review stage (these are good for critical thinking / critical review)

  • Reflection may be:

    • run the generated output through an LLM + an additional prompt.
    • do something practical with the output (like running a program) + feed the results back into the reflection stage together with the draft program + prompt the reflection stage to improve the program according to the results of running the program
  • terminology 'zero shot prompting' - just ask the LLM to do the task, that's it. 'one shot prompting' - include one example as part of the prompt. That's also known as 'direct generation' - as there is no review stage.

    • Reflection is shown to improve accuracy by several times!
    • Examples: a task of HTML generation will require testing (reflection) to double-check the syntax (are all tags closed?). A task like 'how to brew a cup of tea' will require a review stage to catch missed steps (probably not a practical test of making tea :-). For the task of 'suggesting domain names', the review stage will need to check if a suggested domain name means something negative in English or other languages.
  • tip for writing the prompt of a reflection stage:

    • state that the purpose of the prompt is to review a draft
    • state the criteria for improvement (like with the domain name generation example: check there is no negative connotation with domain name)
    • tip: read other people's prompt in order to learn how to write them!
  • Eval pattern:

    • suggests testing performance with and without a reflection stage. Example: generate a SQL query based on a text prompt; the test is a set of questions with the correct answers. This also helps to check whether the reflection prompt is good or not. Very, very important! (prompt fudging is a frequent activity with this kind of system)
    • what about systems where there is no objective answer to a question (like: which code is better)? Suggests using an LLM as a judge - to determine which output is better.
    • Problem: some LLMs have a position bias (they always prefer the first choice, given a set of choices), so this kind of eval can be garbage, for multiple reasons.
    • To fix this: produce a prompt that contains a list of objective criteria for judging the output (for the example of generating code to draw a chart: does the chart contain clearly labeled axes? suitable numerical ranges? etc.) (? but how does the model recognize this in a picture?). Also the prompt says to assign a point for any fulfilled criterion and add up the numbers, to produce a final score (? q: a model can err in adding up the numbers as well ?). See the sketch after this list.
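
A sketch of an LLM-as-a-judge eval with an explicit criteria rubric, plus a simple mitigation for position bias: judge both orderings and only accept the verdict if they agree. llm() is again a hypothetical helper, and the criteria are just examples for the chart-generation case.

```python
import json

def llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to an LLM and return the reply text."""
    raise NotImplementedError

JUDGE_PROMPT = """You are judging two chart-generation outputs, A and B.
Score each against these criteria (1 point per criterion met):
- axes are clearly labeled
- the numerical ranges suit the data
- a legend is present when several series are plotted
Return JSON: {{"score_a": <int>, "score_b": <int>}}

Output A:
{a}

Output B:
{b}"""

def judge_once(a: str, b: str) -> int:
    """Return +1 if A wins, -1 if B wins, 0 on a tie."""
    scores = json.loads(llm(JUDGE_PROMPT.format(a=a, b=b)))
    return (scores["score_a"] > scores["score_b"]) - (scores["score_a"] < scores["score_b"])

def judge(a: str, b: str) -> str:
    first = judge_once(a, b)
    second = -judge_once(b, a)   # swapped order, to detect position bias
    if first == second and first != 0:
        return "A" if first > 0 else "B"
    return "tie / unreliable (possible position bias)"
```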

Reflection with external feedback

  • example: for a python code generation task: add a stage that runs the code / tests it. Feed the errors back into the reflection stage, where the prompt says to 'fix the errors'. (q: does this require a test generation stage as well? What about the security of the 'run the code' stage? Use a sandbox environment?) See the sketch after this list.
  • research agent (classical example): use google web search and feed the result to next reflection stage.
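
A sketch of reflection with external feedback for the code-generation example: run the draft, capture errors, and ask the model to fix them. Running generated code should really happen in a sandbox (container/VM); the subprocess call below is only for illustration, and llm() is the same hypothetical helper as above.

```python
import subprocess
import sys
import tempfile

def llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to an LLM and return the reply text."""
    raise NotImplementedError

def run_draft(code: str) -> str:
    """Execute the draft and return stderr (an empty string means it ran cleanly)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return result.stderr

def reflect_with_feedback(task: str, max_rounds: int = 3) -> str:
    code = llm(f"Write a Python program that {task}. Return only code.")
    for _ in range(max_rounds):
        errors = run_draft(code)
        if not errors:
            break  # the program ran without errors, stop reflecting
        code = llm(f"This program:\n{code}\nfailed with:\n{errors}\nFix the errors. Return only code.")
    return code
```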

Module3

External tool use - let the LLM call functions that talk to external tools (MCP), like getting the current time. Here the LLM decides which tool to use (a big difference to previous patterns like 'reflection with feedback'!)

How would you implement this before the advent of MCP (before LLMs were trained to call MCP extensions)?

  • add a prompt that says: "you have access to the current time. To obtain the current time write FUNCTION get_current_time() in your output"

  • if the output contains "FUNCTION get_current_time()", then obtain the time, add the following to the context window: "the current time is 12:23:32", and call the LLM again, so that it can use this data (sketched in code after this list).

  • Andrew Ng wrote a library called aisuite and uses it in the labs - the idea is similar to langchain (it acts as a higher level wrapper for multiple LLMs)

    • for class aisuite.Client, when you call the create() function on the object, a list of python functions can be passed as arguments (the functions serve as external tools) (q: there are no separate docstrings for the arguments. Need to look at the source and find out how they make up a description of the arguments for the wrapped LLM client)
  • MCP protocol by Anthropic (the guys that do Claude): a standard for writing external tools and for how the LLM should access these tools.

    • The docstring of the function argument is extracted and serves as instructions for the LLM on when to call the function.
    • the create function has a max_turns argument - a limit on how many times to repeat the LLM / tool / LLM cycle.
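
A sketch of the pre-MCP approach from the bullets above: the system prompt announces a FUNCTION marker, the driver code detects it in the model output, runs the real function, appends the result to the context, and calls the model again. chat() is a hypothetical wrapper around a chat-completions API; only the FUNCTION-marker convention comes from the notes.

```python
from datetime import datetime

def chat(messages: list[dict]) -> str:
    """Hypothetical helper: send a message list to a chat-completions API, return the reply."""
    raise NotImplementedError

SYSTEM = ("You have access to the current time. To obtain the current time, "
          "write FUNCTION get_current_time() in your output.")

def get_current_time() -> str:
    return datetime.now().strftime("%H:%M:%S")

def answer(question: str) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    reply = chat(messages)
    if "FUNCTION get_current_time()" in reply:
        # tool call detected: run the function and hand the result back to the model
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": f"the current time is {get_current_time()}"})
        reply = chat(messages)
    return reply
```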

Module4

Evaluation (Evals) and error analysis (says this is the main module of the course). The evaluation stage is also called: end-to-end evals.

(this gives you some metrics; as you try to fix this or that, an objective score is needed to check if things got better or not!)

  • eval example for invoice analysis system:
    • add a stage that extracts the due date of an invoice and converts it to yyyy/mm/dd format; for each training example, compare the result to the stored value (per training example) - increment hit / miss counters (see the sketch after this list)
  • example of marketing assistant agent: if you have a criterion that the result must fit into 10 words, add an eval stage that measures that.
  • example of research agent: check for example queries if all high profile results have been returned by the analysis (for given test examples)
  • if there is no clear way to parse the input: add an LLM-powered eval stage whose prompt is LLM-as-a-judge; tell it to return a json object with a score number (q: and then eval that eval stage?)
  • options
    • evaluator as code (that parses previous LLM output)
    • evaluator as LLM-as-a-judge
    • recorded answers for test examples
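
A sketch of the code-based eval for the invoice example: normalize the extracted due date to yyyy/mm/dd and compare it to the stored label, keeping hit/miss counters. extract_due_date is the (hypothetical) workflow stage under test and is passed in as a function.

```python
from datetime import datetime

def normalize(date_str: str) -> str:
    """Best-effort normalization of a few common date formats to yyyy/mm/dd."""
    for fmt in ("%Y/%m/%d", "%Y-%m-%d", "%d.%m.%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(date_str.strip(), fmt).strftime("%Y/%m/%d")
        except ValueError:
            continue
    return date_str.strip()

def run_eval(extract_due_date, examples):
    """examples: list of (invoice_text, expected_due_date) pairs with stored labels."""
    hits = misses = 0
    for invoice_text, expected in examples:
        got = normalize(extract_due_date(invoice_text))
        if got == normalize(expected):
            hits += 1
        else:
            misses += 1
    return hits, misses
```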

Important to reexamine the evals periodically; if things don't get better, ask if the eval is doing a good job.

Error analysis - gets very important if the quality of the system is not getting better; says the ability to run a disciplined error analysis process is the best indicator of the quality of a team!

  • Problem: workflows can have lots of steps/stages - each of these can fail. Says to examine the trace of each stage in order to find the problematic stage.
  • terminology: the trace is the collection of logs of all workflow stages, produced for one input query; a span is the output of one workflow stage during that run. Says to focus on the traces of the example cases where the system is doing poorly.
  • says to focus the analysis on a table/spreadsheet like the one below,
    • this helps to get an overview & focus attention on the stages that cause problems (the table gives an intuition on which kinds of stages and problems occur frequently; these are the components that should be improved in order to give the greatest effect!)
    • also ask if problems in an earlier stage lead to a poor result of a later stage / if there is a correlation.
| Example test case | problem with workflow stage A? | problem with workflow stage B? | problem with workflow stage C? | problem with workflow stage D? |
| --- | --- | --- | --- | --- |
| specific prompt of test case that failed 1 | | | | |
| specific prompt of test case that failed 2 | | | | |
| specific prompt of test case that failed 3 | | | | |

Component level evals

When to do a specialized eval for the output of a single stage? Sometimes the result of a stage has a lot of overall weight: replacing one stage will lead to many ripple changes (like going for a different search engine in a research assistant)

  • if that is the case, do an eval for the search engine.
  • suggests to test the web search component stage in isolation: have a set of example queries, and record a set of 'gold standard' results that are expected to occur in the search results. Use that as a 'unit test' for the web search component (see the sketch below).
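
A sketch of such a 'unit test' for the web search component: for each example query, check whether the recorded gold-standard results show up in the returned URLs. web_search is the (hypothetical) component under test, and the gold data here is made up for illustration.

```python
# query -> substrings/domains expected to appear among the top results (illustrative data)
GOLD = {
    "agentic workflows Andrew Ng": ["deeplearning.ai"],
    "MCP protocol specification": ["modelcontextprotocol.io"],
}

def eval_search(web_search, top_k: int = 10) -> float:
    """web_search(query) is assumed to return a list of result URLs."""
    hits = total = 0
    for query, expected in GOLD.items():
        results = web_search(query)[:top_k]
        for gold in expected:
            total += 1
            if any(gold in url for url in results):
                hits += 1
    return hits / total   # fraction of gold-standard results recovered
```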

Addressing discovered problems with processing stages

  • non-LLM stages can have parameters - like: a web search engine can be tuned to return N results, or results from a given range of dates; RAG has parameters like similarity threshold, chunk size, etc. (or you can swap these components - says he does that a lot)
  • you can fiddle with the prompt (more explicit instructions or add examples to the prompt) of LLM stages
  • or try a different type of LLM model (some are better at inference, others have other strengths) - this gives you an intuition on what model is good for what task.
    • says if you have a prompt with instructions for a classification action, a redaction action + output formatting instructions: big models like GPT-5 are better at following instructions than smaller models
    • says he often plays with new models, in order to get an intuition on what they are good at (following instructions? classification? coding?) (says his library aisuite helps with swapping different LLM models)
    • also says he spends a lot of time reading other people's prompts (prompt magic). Prefers to study open source projects of people he respects in order to find best practices.
  • if a prompt gets too complicated: can try to split it into multiple steps (each step having a smaller prompt as a result).
  • or fine tune a model (that is quite complex, do that only if other options are exhausted and when the solution is mature already)

When things start to work: will have to optimize for speed / cost also!

  • first measure the duration of all steps; given this data, ask:
    • which stages can be optimized / are worth optimizing? (the longer ones have the most potential for improvement!)
    • can you run some tools in parallel? (like fetching web pages) (see the sketch after this list)
    • For a specific stage that takes long: can you use a faster service provider?
    • Can you still use a slightly simpler/faster/cheaper model?
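
A sketch of the two mechanical parts of this: timing each stage to see where the latency goes, and fetching independent web pages in parallel with a thread pool (standard library only; the stage names and URLs would be your own).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def timed(name, fn, *args):
    """Run one workflow stage and print how long it took."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
    return result

def fetch(url: str) -> bytes:
    with urlopen(url, timeout=10) as resp:
        return resp.read()

def fetch_all(urls):
    # I/O-bound work: threads overlap the network waits instead of fetching sequentially
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch, urls))
```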

Module5

Patterns for highly autonomous agents

  • Planning pattern: first stage LLM gets a list of tools and is told to plan a set of actions while using the tools. (q: and they trust that to work?)
  • Also: instruct the prompt to build a json object: 'create a step-by-step plan in JSON format. Each step should have the following items: step number, description, tool and args'. Works better than planning with plain text (probably because it is clearer when to invoke a tool). See the sketch after this list.
  • planning with code: tell the planning stage to write a python function that solves the problem. Says it is a generalized approach and saves you from adding tools for yet another situation. (need to run it in a sandbox, mentioned this approach before)
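
A sketch of the JSON planning pattern: ask the model for a step-by-step plan as JSON, then dispatch each step to the named tool. llm(), web_search() and get_current_time() are hypothetical placeholders; the step fields follow the wording of the notes.

```python
import json
from datetime import datetime

def llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to an LLM and return the reply text."""
    raise NotImplementedError

def web_search(query: str) -> list:
    """Placeholder tool."""
    raise NotImplementedError

def get_current_time() -> str:
    """Placeholder tool."""
    return datetime.now().isoformat()

PLAN_PROMPT = (
    "Create a step-by-step plan in JSON format for this task: {task}\n"
    "Each step must have: step_number, description, tool, args (a JSON object).\n"
    "Available tools: web_search(query), get_current_time().\n"
    "Return only the JSON list."
)

TOOLS = {"web_search": web_search, "get_current_time": get_current_time}

def plan_and_execute(task: str):
    plan = json.loads(llm(PLAN_PROMPT.format(task=task)))
    results = []
    for step in plan:
        tool = TOOLS[step["tool"]]                      # dispatch to the named tool
        results.append((step["description"], tool(**step.get("args", {}))))
    return results
```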

multi agent collaboration

Why should one have multiple agents instead of prompting the same LLM over and over again? Says each agent is like a different member of a team (example: researcher agent, statistician agent, writer / editor agents (?)). Therefore: an agent is an LLM that is prompted with one of these roles: "you are a researcher, your tools are ..., your task is ..." (?)

The first stage is still a planning stage. It is supposed to ask each agent to do something (similar to external tools). The planning agent's prompt is to "ask the researcher to research current trends in sunglasses", etc. This planning stage is similar to the role of a 'manager agent'.

Now the big deal is: what are the communication patterns between the agents?

  • often a linear plan is used (researcher agent -> graphic designer -> writer)
  • manager produces a plan, communicates with each of the agents - a hierarchic communication pattern (manager -> researcher, graphic designer, writer); see the sketch after this list
  • sometimes they have more than one level of hierarchy (but that's complex, they don't do that often)
  • all to all communication (each agent can communicate with all other agents) (? how do they debug that ?)
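
A sketch of the hierarchic pattern: each 'agent' is just an LLM called with its own role prompt and its own message history (its own context window), and a manager function decides who to ask next. chat() is the usual hypothetical chat-API wrapper, and the roles follow the sunglasses example above.

```python
def chat(messages: list[dict]) -> str:
    """Hypothetical helper: send a message list to a chat-completions API, return the reply."""
    raise NotImplementedError

class Agent:
    def __init__(self, role: str):
        # each agent keeps its own identity (system prompt) and its own context window
        self.history = [{"role": "system", "content": role}]

    def ask(self, request: str) -> str:
        self.history.append({"role": "user", "content": request})
        reply = chat(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

researcher = Agent("You are a researcher. Your tools are web search. Report facts only.")
writer = Agent("You are a writer/editor. Turn research notes into marketing copy.")

def manager(task: str) -> str:
    # manager -> researcher -> writer (hierarchic / linear communication pattern)
    notes = researcher.ask(f"Research current trends relevant to: {task}")
    return writer.ask(f"Write copy for '{task}' based on these notes:\n{notes}")
```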