Xiangxiang's Personal Site

Machine Learning & Security Engineer
While there is life, the tinkering never stops; leaving behind a small record of being alive.

10 November 2023

Static Code Analysis

by xiangxiang

       Data Flow
Source ----------> Sink

0x00 refs

0x01 static analysis fundamentals

1.1 The problem: finding vulnerabilities

The main cause of many vulnerabilities is untrusted, user-controlled input being used in sensitive or dangerous functions of the program.

1.2 data flow, sources, and sinks

To represent these in static analysis, we use the terms data flow, sources, and sinks.

 +---------+               +-------+
 |         |      data     |       |
 | sources |  -----------> | sinks |
 |         |      flow     |       |
 +---------+               +-------+
untrusted input         dangerous functions
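
As a concrete picture of the diagram above, here is a minimal sketch of a source reaching a sink. SQL injection is my choice of illustration; the function names `find_user_unsafe` and `find_user_safe` and the `users` table are hypothetical, and only Python's standard `sqlite3` module is used.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # `name` is the source: untrusted, user-controlled input.
    query = "SELECT id FROM users WHERE name = '" + name + "'"
    # conn.execute is the sink: it runs whatever SQL string reaches it.
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Same source and sink, but the input is bound as a parameter,
    # so it can never change the structure of the query.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()


# Hypothetical in-memory database, used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

payload = "x' OR '1'='1"
leaked = find_user_unsafe(conn, payload)   # every row matches
blocked = find_user_safe(conn, payload)    # no user has this literal name
```

A static analyzer looks for exactly this shape: a path along which the value of `name` arrives, unmodified or merely concatenated, at the string argument of `execute`.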
          

1.3 Finding sources and sinks

Given a tool that supports it, we can detect sources and sinks automatically without too many false positives.

1.4 Syntactic pattern matching, abstract syntax tree, and control flow graph

After the code is scanned for tokens, it can be built into a more abstract representation that makes the code easier to query. One common approach is to parse the code into a parse tree and then build an abstract syntax tree (AST).
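
To make querying an AST concrete, here is a small sketch using Python's built-in `ast` module. The `SINKS` set, the `SAMPLE` snippet, and the helper names are my own assumptions for illustration; a real tool would ship a much richer model of dangerous functions.

```python
import ast

# Hypothetical code under analysis.
SAMPLE = '''
import os
def run(cmd):
    os.system(cmd)
    print("done")
'''

# Assumed list of dangerous functions (sinks) for this sketch.
SINKS = {"os.system", "eval"}


def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, or None for anything else."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def find_sink_calls(source):
    """Walk the AST and collect (sink name, line number) for each sink call."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SINKS:
                hits.append((name, node.lineno))
    return hits


hits = find_sink_calls(SAMPLE)
```

This is purely syntactic matching: it finds the call to `os.system` but says nothing about whether the argument is actually attacker-controlled, which is why the later representations are needed.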

To make our analysis even more accurate, we can use another representation of source code called the control flow graph (CFG). A control flow graph describes the flow of control, that is, the order in which the AST nodes are evaluated in all possible runs of a program, where each node corresponds to a primitive statement in the program. These primitive statements include assignments and conditions. Edges going out from a node denote the possible successors of that statement in the same run of the program. Thanks to the control flow graph, we can track how control flows throughout the program and perform further analysis.
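
A control flow graph can be sketched as a plain adjacency map. The graph below is hand-built for a hypothetical five-statement snippet (shown in the comment), not generated by any tool, and `all_paths` is an illustrative helper that lists every possible run from entry to exit.

```python
# Hand-built CFG for this hypothetical snippet:
#   1: x = read_input()
#   2: if x > 0:
#   3:     y = x
#      else:
#   4:     y = 0
#   5: use(y)
# Each key is a primitive statement; each value lists its possible successors.
CFG = {
    1: [2],
    2: [3, 4],   # the condition has two outgoing edges
    3: [5],
    4: [5],
    5: [],
}


def all_paths(cfg, node, exit_node, prefix=()):
    """Enumerate acyclic paths from `node` to `exit_node`."""
    prefix = prefix + (node,)
    if node == exit_node:
        return [prefix]
    paths = []
    for succ in cfg[node]:
        if succ not in prefix:  # skip back edges so loops terminate
            paths.extend(all_paths(cfg, succ, exit_node, prefix))
    return paths


paths = all_paths(CFG, 1, 5)
```

The two paths correspond to the two branches of the `if`; only the one through statement 3 carries the input value into `use(y)`.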

1.5 Data flow analysis and taint tracking

0x02 CodeQL Data flow graph

2.1 vs. AST

Unlike the abstract syntax tree, the data flow graph does not reflect the syntactic structure of the program, but models the way data flows through the program at runtime.

2.2 edges

In the expression x || y there are data flow nodes corresponding to the sub-expressions x and y, as well as a data flow node corresponding to the entire expression x || y. There is an edge from the node corresponding to x to the node corresponding to x || y, representing the fact that data may flow from x to x || y (since the expression x || y may evaluate to x). Similarly, there is an edge from the node corresponding to y to the node corresponding to x || y.
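
The nodes and edges for x || y can be written down as a tiny graph. This is a toy model in Python, not CodeQL's own representation; `EDGES` and `may_flow` are hypothetical names.

```python
# One data flow node per sub-expression, plus one for the whole expression.
EDGES = {
    "x": ["x || y"],      # x || y may evaluate to x
    "y": ["x || y"],      # x || y may evaluate to y
    "x || y": [],
}


def may_flow(edges, src, dst):
    """Return True if data may flow from `src` to `dst` along the edges."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))
    return False


x_reaches_expr = may_flow(EDGES, "x", "x || y")
x_reaches_y = may_flow(EDGES, "x", "y")
```

Note that the edges say "may flow": there is no edge between x and y themselves, because neither operand's value ever becomes the other's.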

2.3 challenges

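Computing an accurate and complete data flow graph presents several challenges: data flow cannot be computed through standard library functions whose source code is unavailable; some behavior is not determined until run time, so the library must take extra steps to find potential call targets; aliasing between variables means a single write can change the value seen through multiple pointers; and the data flow graph can be very large and slow to compute.

To overcome these potential problems, two kinds of data flow are modeled in the libraries: local data flow, which concerns data flowing within a single function, and global data flow, which tracks data across the entire program.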

2.4 CodeQL normal data flow vs taint tracking

For example, if you are tracking an insecure object x (which might be some untrusted or potentially malicious data), a step in the program may ‘change’ its value. So, in a simple process such as y = x + 1, a normal data flow analysis will highlight the use of x, but not y. However, since y is derived from x, it is influenced by the untrusted or ‘tainted’ information, and therefore it is also tainted. Analyzing the flow of the taint from x to y is known as taint tracking.
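
The difference can be sketched with a toy propagation over assignments. This is not how the CodeQL libraries are implemented; the `STEPS` list, the "copy"/"derive" labels, and `reachable` are assumptions made up for this example.

```python
# Each step is (target, source, kind).
# "copy" preserves the value (a = x); "derive" computes a new one (y = x + 1).
STEPS = [
    ("a", "x", "copy"),     # a = x
    ("y", "x", "derive"),   # y = x + 1
    ("z", "y", "copy"),     # z = y
]


def reachable(steps, source, include_taint_steps):
    """Propagate from `source` until no new variable is reached."""
    reached = {source}
    changed = True
    while changed:
        changed = False
        for target, src, kind in steps:
            if src in reached and target not in reached:
                # Normal data flow follows only value-preserving steps;
                # taint tracking also follows steps that derive new values.
                if kind == "copy" or include_taint_steps:
                    reached.add(target)
                    changed = True
    return reached


data_flow = reachable(STEPS, "x", include_taint_steps=False)
tainted = reachable(STEPS, "x", include_taint_steps=True)
```

With normal data flow only `a` is reached from `x`; with taint tracking `y` and everything downstream of it (`z`) are reached as well.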

0x03 MISC

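3.1 How to compare tools

3.2 Which vulnerabilities can be found by static code scanning

3.3 Which vulnerabilities cannot be found by static code scanning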

tags: sast security codeql