SAST Series: Scanning for Sink Points with tree-sitter, and How It Differs from a Regex Scanner

Ewoji

2026-01-22

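Preface

When auditing code in bulk, the first step is to locate dangerous functions and controllable variables at scale. Here I use the PHP framework XunRuiCMS (迅睿) as the test target and Go's tree-sitter PHP bindings as the tool, which is also a chance to try out Go, the language famously built for concurrency.

The PHP code-audit tool I have used all along is the Seay source code audit system (Seay源代码审计系统).

It has not been updated in a long time, though. I like its one-click audit feature, but it is not particularly fast and it often reports matches inside comments, so it is most likely a scanner built on regular-expression matching. Let's see how fast Go's tree-sitter library is in comparison.

In the end, I did not expect it to be this much faster.

Installing the libraries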

go get github.com/smacker/go-tree-sitter
go get github.com/smacker/go-tree-sitter/php

Printing the root node

package main

import (
	"context"
	"fmt"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/php"
)

func main() {
	sourceCode := []byte(`<?php
    function hack($payload) {
        eval($payload);
    }
    $a = 1;
?>`)

	parser := sitter.NewParser()
	parser.SetLanguage(php.GetLanguage())

	tree, _ := parser.ParseCtx(context.Background(), nil, sourceCode)

	root := tree.RootNode()

	fmt.Println(root.String())
}

This parses the given source code into an AST and prints its root node.

Output

(program (php_tag) (function_definition name: (name) parameters: (formal_parameters (simple_parameter name: (variable_name (name)))) body: (compound_statement (expression_statement (function_call_expression function: (name) arguments: (arguments (argument (variable_name (name)))))))) (expression_statement (assignment_expression left: (variable_name (name)) right: (integer))) (text_interpolation))

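Visualized tree

Querying nodes

How do you query the tree and extract what you are looking for? tree-sitter's query syntax works somewhat like the CSS selectors used in front-end work. Take, for example, a node like this one from the tree above: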

[assignment_expression]      <-- parent node
  left: [variable_name]      <-- child in field "left", of type variable_name
  right: [integer]           <-- child in field "right", of type integer

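To extract the variable name, the query should be written like this:

queryString := `(assignment_expression left: (variable_name) @var_name)`

The pattern has the form (parent_type field_name: (child_type) @capture_name).

The capture name marks the node you want to capture and lets you give each kind of capture your own label.

Here is an example: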

package main

import (
	"context"
	"fmt"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/php"
)

func main() {
	sourceCode := []byte(`<?php
    function login($user) {
        echo "logging in";
    }
    
    function dangerous_exec($cmd) {
        system($cmd);
    }
?>`)

	parser := sitter.NewParser()
	parser.SetLanguage(php.GetLanguage())
	tree, _ := parser.ParseCtx(context.Background(), nil, sourceCode)
	root := tree.RootNode()

	// The query
	queryString := `(function_definition name: (name) @my_func_name)`

	// Create the query object
	q, _ := sitter.NewQuery([]byte(queryString), php.GetLanguage())

	// Create a cursor for iterating over the matches
	qc := sitter.NewQueryCursor()

	// Run the query
	qc.Exec(q, root)

	fmt.Println("Function definitions found:")

	for {
		match, ok := qc.NextMatch()
		if !ok {
			break
		}

		for _, capture := range match.Captures {
			if q.CaptureNameForId(capture.Index) == "my_func_name" {
				funcName := capture.Node.Content(sourceCode)
				fmt.Printf("- function name: %s (line %d)\n", funcName, capture.Node.StartPoint().Row+1)
			}
		}
	}
}

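PHP's syntax is complex enough that the tree-sitter grammar defines more than a hundred node types. Nobody can memorize them all, but we can walk the tree ourselves to see what the node type we need is called.

Here is a traversal script I had an AI write: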

package main

import (
	"context"
	"fmt"
	"strings"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/php"
)

// printNode recursively prints the structure of the tree.
func printNode(node *sitter.Node, source []byte, level int) {
	// Indentation
	indent := strings.Repeat("  ", level)

	// node.Type() is the name you are looking for
	nodeType := node.Type()

	// Get the node's source text
	content := node.Content(source)
	// Truncate long content to keep the output readable
	if len(content) > 20 {
		content = content[:20] + "..."
	}
	content = strings.ReplaceAll(content, "\n", "↵")

	// Print: [node type] : content, indented by depth
	fmt.Printf("%s[%s] : %s\n", indent, nodeType, content)

	// Recurse into the children
	for i := 0; i < int(node.ChildCount()); i++ {
		printNode(node.Child(i), source, level+1)
	}
}

func main() {
	// Put any PHP code you want to analyze here
	code := `<?php
    $str = "Hello " . $world;
    if ($a > 1) { echo $a; }
    ?>`

	ctx := context.Background()
	lang := php.GetLanguage()
	node, _ := sitter.ParseCtx(ctx, []byte(code), lang)

	fmt.Println("=== Node structure analysis ===")
	printNode(node, []byte(code), 0)
}

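With the output of this script we can find the node types we need.

Some commonly used node types:

Definitions

program: the root node.
function_definition: a function definition.
class_declaration: a class definition.
method_declaration: a class method definition.
namespace_definition: a namespace.

Variables and data

variable_name: a variable (e.g. $a)
integer: an integer (e.g. 123)
float: a floating-point number
string: a plain string (e.g. 'hello')
encapsed_string: a double-quoted string, which may contain variables (e.g. "hello $name")
boolean: a boolean value.
null: the null value.

Expressions and operators

assignment_expression: an assignment (e.g. $a = 1).
binary_expression: a binary operation (e.g. $a + $b, $a . $b).
- Note: PHP's string concatenation operator . is also a binary_expression.
function_call_expression: a function call (e.g. eval($a))
member_call_expression: a method call (e.g. $obj->save())
array_creation_expression: array creation (e.g. ['a' => 1])

Statements

return_statement: a return statement.
if_statement: an if statement.
expression_statement: an expression statement (an expression terminated by ;)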

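Extracting information

We can now extract a specific node, but we still need more information about it: its name, its position, its call relationships, and so on.

First, the text content, i.e. the mapping from a node to its source text:

// Syntax: node.Content(original source bytes)
text := capture.Node.Content(sourceCode)

fmt.Println(text)
// Example output: "$user", "123", "eval($cmd)"

sourceCode has to be passed in because a node only records where it sits in the source; Content uses that range to slice the text out of the original bytes.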

Second, the node's coordinates:

// 1. Get the start point (StartPoint) and end point (EndPoint)
start := capture.Node.StartPoint()
end := capture.Node.EndPoint()

// 2. Get the row and column
// Note: tree-sitter rows and columns are 0-based, so add 1 before showing them to a human
lineNum := start.Row + 1
colNum := start.Column + 1

fmt.Printf("starts at line %d, column %d; ends at line %d\n", lineNum, colNum, end.Row+1)

Third, the node's type:

// Get the type name
nodeType := capture.Node.Type()

if nodeType == "string" {
    fmt.Println("This is a string")
} else if nodeType == "integer" {
    fmt.Println("This is an integer")
}

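Fourth, the node's position in the tree. This is one of the big advantages an AST has over regex matching: it exposes the relationships between nodes.

For example, going down to the child nodes:

// 1. Get the number of children
count := node.ChildCount()

// 2. Get the N-th child (indices start at 0)
// ChildCount returns a uint32 while Child takes an int, hence the conversion
firstKid := node.Child(0)
lastKid := node.Child(int(count) - 1)

// 3. (Advanced) Get a child by field name (recommended)
// For example, on a binary_expression you can ask for "left" or "right" directly
leftNode := node.ChildByFieldName("left")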

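Going up to the parent is useful when, say, you have found an eval call and want to know which function it sits in:

// Get the parent node
dad := node.Parent()

// dad, or one of its own parents, may be the function_definition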
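
In the tree printed earlier, the direct parent of a call is an expression_statement, so reaching the enclosing function usually takes several Parent() steps. The loop can be sketched as below. To keep the sketch runnable without cgo, it uses a hypothetical node struct in place of *sitter.Node; with go-tree-sitter the same loop would call n.Parent() and n.Type(). The helper name enclosing is an assumption, not part of the project code.

```go
package main

import "fmt"

// node is a hypothetical stand-in for *sitter.Node that keeps only
// what the upward walk needs: a type name and a parent pointer.
type node struct {
	typ    string
	parent *node
}

// enclosing walks up the parent chain and returns the nearest ancestor
// whose type is one of wanted, or nil when the root is passed first.
func enclosing(n *node, wanted ...string) *node {
	for p := n.parent; p != nil; p = p.parent {
		for _, w := range wanted {
			if p.typ == w {
				return p
			}
		}
	}
	return nil
}

func main() {
	// Mirrors the shape of: function hack($payload) { eval($payload); }
	program := &node{typ: "program"}
	fn := &node{typ: "function_definition", parent: program}
	body := &node{typ: "compound_statement", parent: fn}
	stmt := &node{typ: "expression_statement", parent: body}
	call := &node{typ: "function_call_expression", parent: stmt}

	if f := enclosing(call, "function_definition", "method_declaration"); f != nil {
		fmt.Println("enclosed by:", f.typ)
	}
}
```

Passing several wanted types lets one call cover both plain functions and class methods.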

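Going sideways to siblings helps when a vulnerability only exists if two statements appear together; you can use it to check whether that relationship holds:

// Next sibling (the node after this one)
next := node.NextSibling()

// Previous sibling (the node before this one)
prev := node.PrevSibling()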

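In practice

Finding dangerous function nodes

First we decide which nodes to look for.

The nodes we need are:

Plain function call (function_call_expression)
- Example: system('ls')
- We extract the content of the function field -> "system"
Static method call (scoped_call_expression)
- Example: Class::method()
- We extract the content of the name field -> "method"
Object method call (member_call_expression)
- Example: $obj->eval()
- We extract the content of the name field -> "eval"

The corresponding code is: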

func (a *Analyzer) checkNode(n *sitter.Node, source []byte, filePath string) *Finding {
	nodeType := n.Type()
	var funcName string

	// Determine function name based on node type
	switch nodeType {
	case "function_call_expression":
		// Standard function call: name(...)
		funcNode := n.ChildByFieldName("function")
		if funcNode != nil {
			funcName = funcNode.Content(source)
		}
	case "scoped_call_expression":
		// Static call: Class::method(...)
		nameNode := n.ChildByFieldName("name")
		if nameNode != nil {
			funcName = nameNode.Content(source)
		}
	case "member_call_expression":
		// Method call: $obj->method(...)
		nameNode := n.ChildByFieldName("name")
		if nameNode != nil {
			funcName = nameNode.Content(source)
		}

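Note, however, that PHP has two families of constructs that look like functions, namely file inclusion (include, require and their _once variants) and output (echo, print). The PHP grammar parses them as language constructs rather than as any of the three call types above.

We can run an experiment to see this.

Take the traversal script from above and feed it code that uses both families: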

package main

import (
	"context"
	"fmt"
	"strings"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/php"
)

// printNode recursively prints the structure of the tree.
func printNode(node *sitter.Node, source []byte, level int) {
	// Indentation
	indent := strings.Repeat("  ", level)

	// node.Type() is the name you are looking for
	nodeType := node.Type()

	// Get the node's source text
	content := node.Content(source)
	// Truncate long content to keep the output readable
	if len(content) > 20 {
		content = content[:20] + "..."
	}
	content = strings.ReplaceAll(content, "\n", "↵")

	// Print: [node type] : content, indented by depth
	fmt.Printf("%s[%s] : %s\n", indent, nodeType, content)

	// Recurse into the children
	for i := 0; i < int(node.ChildCount()); i++ {
		printNode(node.Child(i), source, level+1)
	}
}

func main() {
	// Put any PHP code you want to analyze here
	code := `<?php
    include('code');
    include_once('code');
    require('code');
    require_once('code');
    print "text";
    echo "text";
    ?>`

	ctx := context.Background()
	lang := php.GetLanguage()
	node, _ := sitter.ParseCtx(ctx, []byte(code), lang)

	fmt.Println("=== Node structure analysis ===")
	printNode(node, []byte(code), 0)
}

Now look at the output:

[program] : <?php↵    include('c...
  [php_tag] : <?php
  [expression_statement] : include('code');
    [include_expression] : include('code')
      [include] : include
      [parenthesized_expression] : ('code')
        [(] : (
        [string] : 'code'
          ['] : '
          [string_content] : code
          ['] : '
        [)] : )
    [;] : ;
  [expression_statement] : include_once('code')...
    [include_once_expression] : include_once('code')
      [include_once] : include_once
      [parenthesized_expression] : ('code')
        [(] : (
        [string] : 'code'
          ['] : '
          [string_content] : code
          ['] : '
        [)] : )
    [;] : ;
  [expression_statement] : require('code');
    [require_expression] : require('code')
      [require] : require
      [parenthesized_expression] : ('code')
        [(] : (
        [string] : 'code'
          ['] : '
          [string_content] : code
          ['] : '
        [)] : )
    [;] : ;
  [expression_statement] : require_once('code')...
    [require_once_expression] : require_once('code')
      [require_once] : require_once
      [parenthesized_expression] : ('code')
        [(] : (
        [string] : 'code'
          ['] : '
          [string_content] : code
          ['] : '
        [)] : )
    [;] : ;
  [expression_statement] : print "text";
    [print_intrinsic] : print "text"
      [print] : print
      [encapsed_string] : "text"
        ["] : "
        [string_content] : text
        ["] : "
    [;] : ;
  [echo_statement] : echo "text";
    [echo] : echo
    [encapsed_string] : "text"
      ["] : "
      [string_content] : text
      ["] : "
    [;] : ;
  [text_interpolation] : ?>
    [?>] : ?>

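The output clearly shows that the include family and the output family are not function-call nodes, so the nodes we need to look for are as follows (shown as a fragment of the project code):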

func (a *Analyzer) checkNode(n *sitter.Node, source []byte, filePath string) *Finding {
	nodeType := n.Type()
	var funcName string

	// Determine function name based on node type
	switch nodeType {
	case "function_call_expression":
		// Standard function call: name(...)
		funcNode := n.ChildByFieldName("function")
		if funcNode != nil {
			funcName = funcNode.Content(source)
		}
	case "scoped_call_expression":
		// Static call: Class::method(...)
		nameNode := n.ChildByFieldName("name")
		if nameNode != nil {
			funcName = nameNode.Content(source)
		}
	case "member_call_expression":
		// Method call: $obj->method(...)
		nameNode := n.ChildByFieldName("name")
		if nameNode != nil {
			funcName = nameNode.Content(source)
		}
	case "include_expression":
		funcName = "include"
	case "include_once_expression":
		funcName = "include_once"
	case "require_expression":
		funcName = "require"
	case "require_once_expression":
		funcName = "require_once"
	case "print_intrinsic":
		funcName = "print"
	case "echo_statement":
		funcName = "echo"
	}
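
PHP function names are case-insensitive, and a call can be written with a leading backslash (\system(...)), so the extracted funcName probably needs normalizing before it is compared with a list of dangerous names. A minimal sketch, assuming a hypothetical helper normalizeFuncName that is not part of the project code shown here:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeFuncName is a hypothetical helper. It lowercases the name and
// drops one leading backslash, so `\SYSTEM` and `system` compare equal.
// A namespaced name such as `App\system` keeps its prefix on purpose:
// it refers to a different function than the global system().
func normalizeFuncName(raw string) string {
	name := strings.TrimSpace(raw)
	name = strings.TrimPrefix(name, `\`)
	return strings.ToLower(name)
}

func main() {
	for _, raw := range []string{"system", "SYSTEM", `\Eval`, `App\system`} {
		fmt.Printf("%q -> %q\n", raw, normalizeFuncName(raw))
	}
}
```

Only the single leading backslash is removed, so a function that merely shares its short name with a dangerous built-in is not reported by mistake.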

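Refining the checks

Finding a dangerous function is not enough: the value passed into it must also be controllable. How do we get the AST to recognize cases like the following?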

eval($cmd) // controllable variable
echo "$cmd" // controllable variable inside double quotes

As before, we reuse the traversal method from above to see how the AST parses these cases:

package main

import (
	"context"
	"fmt"
	"strings"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/php"
)

// printNode recursively prints the structure of the tree.
func printNode(node *sitter.Node, source []byte, level int) {
	// Indentation
	indent := strings.Repeat("  ", level)

	// node.Type() is the name you are looking for
	nodeType := node.Type()

	// Get the node's source text
	content := node.Content(source)
	// Truncate long content to keep the output readable
	if len(content) > 20 {
		content = content[:20] + "..."
	}
	content = strings.ReplaceAll(content, "\n", "↵")

	// Print: [node type] : content, indented by depth
	fmt.Printf("%s[%s] : %s\n", indent, nodeType, content)

	// Recurse into the children
	for i := 0; i < int(node.ChildCount()); i++ {
		printNode(node.Child(i), source, level+1)
	}
}

func main() {
	// Put any PHP code you want to analyze here
	// Note: using + "`" + to insert backticks into the Go string
	code := "<?php\n" +
		"system('ls');           // Safe: string\n" +
		"system($cmd);           // Dangerous: variable\n" +
		"system(\"ls \" . $arg);   // Dangerous: binary_expression (concat)\n" +
		"system(\"ls $arg\");      // Dangerous: encapsed_string with variable\n" +
		"\n" +
		"// Backticks\n" +
		"`ls`;                   // Safe: shell_command_expression (constant)\n" +
		"`ls $arg`;              // Dangerous: shell_command_expression (variable)\n" +
		"\n" +
		"include('config.php');  // Safe\n" +
		"include($file);         // Dangerous\n" +
		"?>"

	ctx := context.Background()
	lang := php.GetLanguage()
	node, _ := sitter.ParseCtx(ctx, []byte(code), lang)

	fmt.Println("=== Node structure analysis ===")
	printNode(node, []byte(code), 0)
}

Here is the output:

[program] : <?php↵system('ls'); ...
  [php_tag] : <?php
  [expression_statement] : system('ls');
    [function_call_expression] : system('ls')
      [name] : system
      [arguments] : ('ls')
        [(] : (
        [argument] : 'ls'
          [string] : 'ls'
            ['] : '
            [string_content] : ls
            ['] : '
        [)] : )
    [;] : ;
  [comment] : // Safe: string
  [expression_statement] : system($cmd);
    [function_call_expression] : system($cmd)
      [name] : system
      [arguments] : ($cmd)
        [(] : (
        [argument] : $cmd
          [variable_name] : $cmd
            [$] : $
            [name] : cmd
        [)] : )
    [;] : ;
  [comment] : // Dangerous: variab...
  [expression_statement] : system("ls " . $arg)...
    [function_call_expression] : system("ls " . $arg)
      [name] : system
      [arguments] : ("ls " . $arg)
        [(] : (
        [argument] : "ls " . $arg
          [binary_expression] : "ls " . $arg
            [encapsed_string] : "ls "
              ["] : "
              [string_content] : ls
              ["] : "
            [.] : .
            [variable_name] : $arg
              [$] : $
              [name] : arg
        [)] : )
    [;] : ;
  [comment] : // Dangerous: binary...
  [expression_statement] : system("ls $arg");
    [function_call_expression] : system("ls $arg")
      [name] : system
      [arguments] : ("ls $arg")
        [(] : (
        [argument] : "ls $arg"
          [encapsed_string] : "ls $arg"
            ["] : "
            [string_content] : ls
            [variable_name] : $arg
              [$] : $
              [name] : arg
            ["] : "
        [)] : )
    [;] : ;
  [comment] : // Dangerous: encaps...
  [comment] : // Backticks
  [expression_statement] : `ls`;
    [shell_command_expression] : `ls`
      [`] : `
      [string_content] : ls
      [`] : `
    [;] : ;
  [comment] : // Safe: shell_comma...
  [expression_statement] : `ls $arg`;
    [shell_command_expression] : `ls $arg`
      [`] : `
      [string_content] : ls
      [variable_name] : $arg
        [$] : $
        [name] : arg
      [`] : `
    [;] : ;
  [comment] : // Dangerous: shell_...
  [expression_statement] : include('config.php'...
    [include_expression] : include('config.php'...
      [include] : include
      [parenthesized_expression] : ('config.php')
        [(] : (
        [string] : 'config.php'
          ['] : '
          [string_content] : config.php
          ['] : '
        [)] : )
    [;] : ;
  [comment] : // Safe
  [expression_statement] : include($file);
    [include_expression] : include($file)
      [include] : include
      [parenthesized_expression] : ($file)
        [(] : (
        [variable_name] : $file
          [$] : $
          [name] : file
        [)] : )
    [;] : ;
  [comment] : // Dangerous
  [text_interpolation] : ?>
    [?>] : ?>

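At this point I noticed that a string wrapped in backticks is recognized by the parser as a [shell_command_expression], so shell_command_expression has to be added to the list of dangerous nodes we look for. Next: how do we check for a controllable variable?

From the output we can see that if a child node of the call, that is, one of its arguments, has the type variable_name, the argument is dynamic and therefore potentially controllable. We also need one more layer of checking for cases like this:

system(getenv($xx));

When the argument of one function is the return value of another function, we have to search recursively to find the controllable variable.

So the logic is implemented as follows: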

...
checkScopeNode = n.ChildByFieldName("arguments")
...
		if !a.isControllable(checkScopeNode) {
			return nil
		}
...


func (a *Analyzer) isControllable(n *sitter.Node) bool {
	if n == nil {
		return false
	}

	// Check current node type for dynamic indicators
	switch n.Type() {
	case "variable_name", "variable":
		return true
	case "function_call_expression", "scoped_call_expression", "member_call_expression":
		return true
	}

	// Recursively check children
	count := n.ChildCount()
	for i := 0; i < int(count); i++ {
		if a.isControllable(n.Child(int(i))) {
			return true
		}
	}

	return false
}
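
The version above treats any nested call as controllable without looking inside it. A stricter variant would descend into the nested call's arguments and only answer true when a variable actually appears, which is what the system(getenv($xx)) case calls for. The sketch below is one possible reading, not the project's implementation. It uses a hypothetical node struct instead of *sitter.Node so that it runs without cgo, and the names hasVariable and call are assumptions.

```go
package main

import "fmt"

// node is a hypothetical stand-in for *sitter.Node: a type plus children.
type node struct {
	typ      string
	children []*node
}

// hasVariable reports whether n or any descendant is a variable node.
// A nested call is not flagged by itself; its arguments are searched.
func hasVariable(n *node) bool {
	if n == nil {
		return false
	}
	if n.typ == "variable_name" {
		return true
	}
	for _, c := range n.children {
		if hasVariable(c) {
			return true
		}
	}
	return false
}

// call builds the subtree of a one-argument call such as getenv($xx),
// following the node layout shown in the traversal output.
func call(arg *node) *node {
	return &node{typ: "function_call_expression", children: []*node{
		{typ: "name"},
		{typ: "arguments", children: []*node{
			{typ: "argument", children: []*node{arg}},
		}},
	}}
}

func main() {
	dynamic := call(&node{typ: "variable_name"}) // getenv($xx)
	constant := call(&node{typ: "string"})       // getenv('PATH')
	fmt.Println(hasVariable(dynamic), hasVariable(constant)) // true false
}
```

The trade-off is fewer false positives at the cost of missed cases: a call whose return value is attacker-influenced but takes only constant arguments would no longer be reported, so the conservative behaviour above may still be the better default.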

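Full traversal or targeted queries?

As described above, we find dangerous nodes by walking every node and checking each one for a match.

When we first learned this library, though, we located nodes with queries.

I asked an AI, and its answer was that the lower-level query mechanism should be somewhat faster, but that the gain is small.

So I stuck with full traversal, because code written this way is more flexible to maintain later.

Rule configuration

Seay's rule set is very rich. Besides the all-important RCE rules, it also covers lower-impact issues such as XSS, IP spoofing, CSRF, and variable overwrite, so it is best to build the project in a pluggable way, with the rules kept in configuration.

Below are the vulnerability rules I could think of for now; they can be extended over time.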

rules:
  - name: "Command execution"
    description: "A command execution function receives a variable; possible command execution vulnerability"
    severity: "high"
    functions:
      - "system"
      - "exec"
      - "shell_exec"
      - "passthru"
      - "popen"
      - "proc_open"
      - "pcntl_exec"

  - name: "Code execution"
    description: "A code execution function receives a variable; possible code execution vulnerability"
    severity: "critical"
    functions:
      - "eval"
      - "assert"
      - "create_function"
      - "call_user_func"
      - "call_user_func_array"
      - "array_map"

  - name: "File inclusion"
    description: "A file inclusion statement receives a variable; possible file inclusion vulnerability"
    severity: "medium"
    functions:
      - "include"
      - "require"
      - "include_once"
      - "require_once"

  - name: "Arbitrary file read"
    description: "A file read function receives a variable; possible arbitrary file read vulnerability"
    severity: "medium"
    functions:
      - "file_get_contents"
      - "readfile"
      - "fopen"
      - "fread"
      - "show_source"
      - "highlight_file"

  - name: "Arbitrary file operation"
    description: "A file operation function receives a variable; possible arbitrary file read/delete/modify/write vulnerability"
    severity: "medium"
    functions:
      - "file_put_contents"
      - "unlink"
      - "copy"
      - "fwrite"
      - "move_uploaded_file"
      - "fputs"

  - name: "XSS"
    description: "Output such as echo contains a controllable variable; possible XSS vulnerability"
    severity: "medium"
    functions:
      - "echo"
      - "print"
      - "printf"
      - "print_r"
      - "var_dump"
      - "exit"
      - "die"

  - name: "Variable overwrite"
    description: "Functions such as parse_str receive a variable; possible variable overwrite vulnerability"
    severity: "medium"
    functions:
      - "parse_str"
      - "extract"
      - "mb_parse_str"
      - "import_request_variables"

  - name: "Sensitive information disclosure"
    description: "Call to phpinfo(); possible sensitive information disclosure"
    severity: "low"
    ignore_taint: true
    functions:
      - "phpinfo"
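
Once the rules are loaded, the scanner needs a quick way to go from a function name to its rule. One way to sketch that is a map built once at startup. The Rule struct, buildIndex and shouldReport below are hypothetical names, the YAML loading step is left out, and the reading of ignore_taint as "report the call even when no dynamic argument was found" is an assumption based on the phpinfo entry.

```go
package main

import "fmt"

// Rule mirrors one entry of the rule file (hypothetical struct).
type Rule struct {
	Name        string
	Severity    string
	IgnoreTaint bool // assumed meaning: report even without a dynamic argument
	Functions   []string
}

// buildIndex maps each function name to the rule that lists it, so a
// call can be resolved with a single map lookup.
func buildIndex(rules []Rule) map[string]*Rule {
	idx := make(map[string]*Rule)
	for i := range rules {
		for _, fn := range rules[i].Functions {
			idx[fn] = &rules[i]
		}
	}
	return idx
}

// shouldReport looks up funcName; dynamicArg stands for the result of
// the controllability check on the call's arguments.
func shouldReport(idx map[string]*Rule, funcName string, dynamicArg bool) (*Rule, bool) {
	r, ok := idx[funcName]
	if !ok {
		return nil, false
	}
	return r, r.IgnoreTaint || dynamicArg
}

func main() {
	idx := buildIndex([]Rule{
		{Name: "Command execution", Severity: "high", Functions: []string{"system", "exec"}},
		{Name: "Sensitive information disclosure", Severity: "low", IgnoreTaint: true, Functions: []string{"phpinfo"}},
	})
	cases := []struct {
		fn      string
		dynamic bool
	}{{"system", true}, {"system", false}, {"phpinfo", false}, {"strlen", true}}
	for _, c := range cases {
		_, report := shouldReport(idx, c.fn, c.dynamic)
		fmt.Printf("%s dynamic=%v report=%v\n", c.fn, c.dynamic, report)
	}
}
```

Keeping the lookup separate from the tree walk means new rules only touch the configuration file, which fits the pluggable approach described above.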

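Comparison test

Scanning XunRuiCMS (迅睿CMS) with Seay produced 349 results in 2.33 minutes.

Scanning it with a Go worker pool plus the AST reported 600 results in under a second (the output appeared as soon as I ran it). That is roughly 140 seconds against less than one, a speedup of more than 100x, which felt like quite an achievement.

I also built an output.html modeled on Seay's.

Later I will integrate this feature into my PHP SAST project.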
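
The worker pool itself is not shown in this post, so here is a minimal sketch of the pattern using only the standard library. scanAll and the scan callback are hypothetical names. In the real scanner the callback would parse one file and walk its tree, and each worker would presumably create its own parser, since a parser holds mutable state.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// scanAll runs scan over every path with a fixed number of workers and
// returns all findings. scan stands in for the per-file work.
func scanAll(paths []string, workers int, scan func(path string) []string) []string {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(chan []string)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- scan(p)
			}
		}()
	}

	// Feed the queue, then close it so the workers leave their loops.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()
	// Close results once every worker is done, ending the loop below.
	go func() {
		wg.Wait()
		close(results)
	}()

	var findings []string
	for r := range results {
		findings = append(findings, r...)
	}
	sort.Strings(findings) // worker completion order is nondeterministic
	return findings
}

func main() {
	// Hypothetical stub: pretend every file contains one finding.
	stub := func(path string) []string { return []string{path + ": system($cmd)"} }
	for _, f := range scanAll([]string{"a.php", "b.php", "c.php"}, 2, stub) {
		fmt.Println(f)
	}
}
```

Sorting the findings at the end keeps the report stable from run to run even though the workers finish in arbitrary order.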