How Midscene.js works

Browser agents such as Midscene.js and Stagehand have been getting a lot of attention lately, so I tried them out and dug into how they work.

What is a browser agent?

A browser agent lets the user talk to the browser in natural language; the browser then automatically performs operations, scrapes data, or runs assertions. Like traditional UI automation and RPA, it aims to eliminate repetitive manual work.

How it differs from traditional UI automation and RPA

Traditional UI automation locates elements with hard-coded selectors, so if the page changes frequently, maintenance becomes very expensive. A browser agent, by contrast, can adapt to content that changes often.

Implementation principle

Suppose I want to carry out the instruction "type '耳机' (headphones) into the search box and press Enter" by hand. There are three main steps:

  1. Locate the search box
  2. Type "耳机"
  3. Press Enter

A browser agent needs to perform the same three steps:

  1. Find the search box DOM element from what the page shows, and focus it
  2. Type "耳机" into it
  3. Press the Enter key

How Midscene.js performs these three operations

Getting the search box DOM element from a screenshot and focusing it

Midscene.js sends the screenshotBase64 and a prompt to a vision-language model, asking it to return a structure like the following:

```json
{
  "what_the_user_wants_to_do_next_by_instruction": "在搜索框输入 '耳机' ,敲回车",
  "log": "Now I want to use action 'Input' to enter '耳机' in the search bar first.",
  "more_actions_needed_by_instruction": true, // whether more actions are still needed
  "action": {
    "type": "Input",
    "locate": {
      "bbox": [269, 57, 824, 87],
      "prompt": "The search input field"
    },
    "param": {
      "value": "耳机"
    }
  }
}
```
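The request itself is an ordinary multimodal chat completion call. Below is a minimal sketch of building such a payload; the message layout follows the OpenAI-compatible convention that Qwen-VL endpoints accept, and the function name is my own, not Midscene.js internals:

```javascript
// Build an OpenAI-compatible multimodal chat payload: the system prompt
// carries the planning rules, and the user message carries the instruction
// plus the screenshot as a base64 data URI.
function buildPlanRequest(systemPrompt, instruction, screenshotBase64) {
  return {
    model: 'qwen-vl-max-latest', // any vision-language model works here
    messages: [
      { role: 'system', content: systemPrompt },
      {
        role: 'user',
        content: [
          { type: 'text', text: instruction },
          {
            type: 'image_url',
            image_url: { url: `data:image/png;base64,${screenshotBase64}` },
          },
        ],
      },
    ],
    response_format: { type: 'json_object' }, // ask for strict JSON back
  };
}

const req = buildPlanRequest(
  'You are a planner...',
  '在搜索框输入 "耳机" ,敲回车',
  'iVBORw0KGgo'
);
console.log(req.messages[1].content[1].image_url.url.slice(0, 22)); // "data:image/png;base64,"
```

The screenshot is the only grounding the model gets, which is why the bbox it returns still has to be reconciled with the real DOM in the next step.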

It then walks the DOM tree depth-first, matching each node's bounding rect against the returned bbox; the node whose rect is closest is taken as the target element. Here that is an input element, whose existing content is then cleared.
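This matching step can be sketched as a depth-first search that scores every node's rectangle against the model's bbox. I use intersection-over-union as the score here; the actual heuristic inside Midscene.js may differ:

```javascript
// Rects are [xmin, ymin, xmax, ymax], matching the model's bbox format.
function iou(a, b) {
  const ix = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const iy = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = ix * iy;
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  return inter === 0 ? 0 : inter / (areaA + areaB - inter);
}

// Depth-first traversal: keep whichever node overlaps the bbox best.
function findBestMatch(node, bbox, best = { node: null, score: 0 }) {
  const score = iou(node.rect, bbox);
  if (score > best.score) {
    best.node = node;
    best.score = score;
  }
  for (const child of node.children ?? []) findBestMatch(child, bbox, best);
  return best;
}

// Toy DOM tree with the geometry from the example above:
const tree = {
  tag: 'body', rect: [0, 0, 1280, 720],
  children: [
    { tag: 'header', rect: [0, 0, 1280, 100], children: [
      { tag: 'input', rect: [270, 58, 820, 88] },
    ]},
  ],
};
const best = findBestMatch(tree, [269, 57, 824, 87]);
console.log(best.node.tag); // "input"
```

Because the score is a ratio rather than an exact-equality check, the match survives the few pixels of error a vision model typically makes around element edges.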

The more_actions_needed_by_instruction flag from the previous step determines whether the AI needs to be called again.

Since it is true here, the previous step's log is sent to the model together with the prompt, and the model returns:

```json
{
  "what_the_user_wants_to_do_next_by_instruction": "After typing \"耳机\" into the search box, press Enter to run the search.",
  "log": "Now I will use the 'KeyboardPress' action to simulate pressing Enter and run the search.",
  "more_actions_needed_by_instruction": false,
  "action": {
    "type": "KeyboardPress",
    "param": {
      "value": "Enter"
    }
  }
}
```

Finally it executes the KeyboardPress action.
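Turning a returned action object into a real browser operation is then a plain dispatch on `action.type`. A minimal sketch, assuming a `page` driver object with Playwright-style primitives (`clickAt`, `type`, `pressKey` are hypothetical names, not the actual Midscene.js executor API):

```javascript
// Convert the model's bbox into a single click point.
const bboxCenter = ([xmin, ymin, xmax, ymax]) => ({
  x: (xmin + xmax) / 2,
  y: (ymin + ymax) / 2,
});

// Dispatch a model-returned action onto the page driver.
async function executeAction(page, action) {
  switch (action.type) {
    case 'Tap': {
      const { x, y } = bboxCenter(action.locate.bbox);
      return page.clickAt(x, y);
    }
    case 'Input': {
      const { x, y } = bboxCenter(action.locate.bbox);
      await page.clickAt(x, y); // focus the field first
      return page.type(action.param.value);
    }
    case 'KeyboardPress':
      return page.pressKey(action.param.value);
    default:
      throw new Error(`Unsupported action type: ${action.type}`);
  }
}

// e.g. the second model response above would be executed as:
// await executeAction(page, { type: 'KeyboardPress', param: { value: 'Enter' } });
```

Note that KeyboardPress carries no `locate` at all: it acts on whatever currently has focus, which is why the Input step before it matters.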

Pseudocode of the implementation

```javascript
const logList = [];

// `prompt` is the user's natural-language instruction.
function generateTaskByUserPromptWithPageScreen(prompt, logList) {
  return {
    type: 'Planning',
    subType: 'Plan',
    locate: null,
    param: { prompt, log: logList },
    execute: async () => {
      const screenshot = await takeScreenshot();
      // Previous logs are joined and folded into the next model call.
      return callAiGetPlan(prompt, { log: logList.join('-'), screenshot });
    },
  };
}

async function run(prompt) {
  let planningTask = generateTaskByUserPromptWithPageScreen(prompt, logList);
  while (planningTask) {
    const planResult = await planningTask.execute();
    const { log, more_actions_needed_by_instruction, action } = planResult;
    const newTask = convertPlanToExecutable(action); // 'Tap' | 'Hover' | 'Input' | ...
    await newTask.execute();
    logList.push(log);
    planningTask = more_actions_needed_by_instruction
      ? generateTaskByUserPromptWithPageScreen(prompt, logList)
      : null;
  }
}
```

What are the drawbacks of Midscene.js?

  1. Cost
    Doubao and Qwen-VL both charge ¥3 per million input tokens and ¥9 per million output tokens. Each call above used about 2,400 input tokens and 100 output tokens, and the question needed two AI calls, so ¥3 buys roughly 150 questions. I find that acceptable.
  2. Accuracy
    Midscene.js locates elements from a visually derived bounding box, which can be imprecise.
  3. Speed
    The question above took 15-20 seconds for two local calls to the "qwen-vl-max-latest" model. That is somewhat slow, but these are usually asynchronous tasks, so it is acceptable.
  4. Some operations cannot be performed
    Drag-and-drop, double-click, file upload, and so on are not supported yet.
  5. If the target element is not in the first screen, it cannot be located.
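The cost estimate in point 1 can be sanity-checked with a bit of arithmetic, assuming the token counts from the run above (~2,400 input and ~100 output tokens per call, two calls per question):

```javascript
// Prices: ¥3 per million input tokens, ¥9 per million output tokens.
const INPUT_PRICE = 3 / 1e6;  // yuan per input token
const OUTPUT_PRICE = 9 / 1e6; // yuan per output token
const CALLS_PER_QUESTION = 2;

const costPerQuestion =
  CALLS_PER_QUESTION * (2400 * INPUT_PRICE + 100 * OUTPUT_PRICE);
const questionsPerThreeYuan = Math.floor(3 / costPerQuestion);

console.log(costPerQuestion.toFixed(4)); // ~0.0162 yuan per question
console.log(questionsPerThreeYuan);      // ~185, the same ballpark as the ~150 above
```

Input tokens dominate the bill because each call resends the full system prompt and screenshot, so longer multi-step instructions scale cost roughly linearly with the number of planning rounds.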

Summary

In a follow-up post I will compare Midscene.js with Playwright MCP.

Appendix: the planning prompt

System Prompt

```text
Target: User will give you a screenshot, an instruction and some previous logs indicating what have been done. Please tell what the next one action is (or null if no action should be done) to do the tasks the instruction requires.

Restriction:
- Don't give extra actions or plans beyond the instruction. ONLY plan for what the instruction requires. For example, don't try to submit the form if the instruction is only to fill something.
- Always give ONLY ONE action in `log` field (or null if no action should be done), instead of multiple actions. Supported actions are Tap, Hover, Input, KeyboardPress, Scroll.
- Don't repeat actions in the previous logs.
- Bbox is the bounding box of the element to be located. It's an array of 4 numbers, representing 2d bounding box as [xmin, ymin, xmax, ymax].

Supporting actions:
- Tap: { type: "Tap", locate: {bbox: [number, number, number, number], prompt: string } }
- RightClick: { type: "RightClick", locate: {bbox: [number, number, number, number], prompt: string } }
- Hover: { type: "Hover", locate: {bbox: [number, number, number, number], prompt: string } }
- Input: { type: "Input", locate: {bbox: [number, number, number, number], prompt: string }, param: { value: string } } // Replace the input field with a new value. `value` is the final that should be filled in the input box. No matter what modifications are required, just provide the final value to replace the existing input value. Giving a blank string means clear the input field.
- KeyboardPress: { type: "KeyboardPress", param: { value: string } }
- Scroll: { type: "Scroll", locate: {bbox: [number, number, number, number], prompt: string } | null, param: { direction: 'down'(default) | 'up' | 'right' | 'left', scrollType: 'once' (default) | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft', distance: null | number }} // locate is the element to scroll. If it's a page scroll, put `null` in the `locate` field.

Field description:
* The `prompt` field inside the `locate` field is a short description that could be used to locate the element.

Return in JSON format:
{
  "what_the_user_wants_to_do_next_by_instruction": string, // What the user wants to do according to the instruction and previous logs.
  "log": string, // Log what the next one action (ONLY ONE!) you can do according to the screenshot and the instruction. The typical log looks like "Now i want to use action '{{ action-type }}' to do .. first". If no action should be done, log the reason. ". Use the same language as the user's instruction.
  "error"?: string, // Error messages about unexpected situations, if any. Only think it is an error when the situation is not expected according to the instruction. Use the same language as the user's instruction.
  "more_actions_needed_by_instruction": boolean, // Consider if there is still more action(s) to do after the action in "Log" is done, according to the instruction. If so, set this field to true. Otherwise, set it to false.
  "action": {
    // one of the supporting actions
  } | null,
  "sleep"?: number, // The sleep time after the action, in milliseconds.
}

For example, when the instruction is "click 'Confirm' button, and click 'Yes' in popup" and the log is "I will use action Tap to click 'Confirm' button", by viewing the screenshot and previous logs, you should consider: We have already clicked the 'Confirm' button, so next we should find and click 'Yes' in popup.

this and output the JSON:

{
  "what_the_user_wants_to_do_next_by_instruction": "We have already clicked the 'Confirm' button, so next we should find and click 'Yes' in popup",
  "log": "I will use action Tap to click 'Yes' in popup",
  "more_actions_needed_by_instruction": false,
  "action": {
    "type": "Tap",
    "locate": {
      "bbox": [100, 100, 200, 200],
      "prompt": "The 'Yes' button in popup"
    }
  }
}
```

User prompt

```text
Here is the user's instruction:<instruction>  <high_priority_knowledge>    undefined  </high_priority_knowledge>  在搜索框输入 "耳机" ,敲回车</instruction>
```