In RQ2, we observed that SA tools, including Slither, generally suffer from numerous false positives. To analyze the reasons, we delved into the implementation of their detection rules and found that the pattern matching used by these tools can be imprecise in real-world scenarios, where edge cases might not align with a tool's predefined rules. Taking Slither as an example, we discuss two such aspects below.
As shown in Listing 2, accurate detection requires understanding the contextual flow of the code and the specific execution paths it can take.
Slither flagged a "divide-before-multiply" vulnerability (a false positive) because of the conditional nature of the arithmetic operations within the function updateRewardRate. Specifically, the division and multiplication operations are enclosed in an if-else block (L18-23). This design ensures that only one of these operations executes during any single invocation of updateRewardRate. As a result, the concern about precision loss due to division preceding multiplication in a direct sequence, as flagged by Slither, does not apply in this scenario.
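The control-flow point can be illustrated with a hypothetical Python analogue of updateRewardRate (parameter names and arithmetic are invented for illustration, not taken from Listing 2): the division and the multiplication sit in mutually exclusive branches, so no single invocation performs a divide followed by a multiply.

```python
# Hypothetical analogue of updateRewardRate (Listing 2); names and
# arithmetic are invented for illustration.
def update_reward_rate(base: int, factor: int, scale: int, boost: bool) -> int:
    if boost:
        return base * factor   # multiplication branch only
    else:
        return base // scale   # division branch only (integer division)
```

Any single call executes exactly one of the two operations, so the divide-then-multiply precision-loss sequence never occurs at runtime.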
By reviewing the exact rule implementation of Slither, we identified key areas that contribute to its limitations in accurately detecting this vulnerability type:
a) Slither first performs operation detection (L21-50): the functions is_division and is_multiplication identify division and multiplication operations within the code. These functions check whether an operation is a binary division/multiplication or a library call that performs one of these operations.
b) It then employs data-flow tracking (L65-117): the _explore function iterates over the nodes (basic blocks) in the function's control-flow graph. For each node, it examines IR statements (for ir in node.irs) and tracks divisions and multiplications. When a division is detected, it maps the result to the node where it occurred (divisions[ir.lvalue] = [node]).
c) Interestingly, Slither also conducts branch analysis (L96-106): here it tracks the nodes related to a multiplication operation. However, the focus is on whether the multiplication's operands involve a division result, not on the conditional structure surrounding these operations. Slither does check for assert-like conditions at L53-61 (is_assert(node)), but this does not extend to a nuanced understanding of the logical separation introduced by if-else branches.
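The limitation described in steps a)-c) can be sketched as a toy detector (a simplified illustration, not Slither's actual implementation) that records division results and flags any multiplication consuming one, while ignoring the branch each operation belongs to:

```python
# Toy divide-before-multiply detector (simplified illustration, not
# Slither's code). Each IR statement is a tuple
# (result, op, operands, branch); the branch label is carried along but
# never consulted, mirroring the limitation discussed above.
def find_divide_before_multiply(statements):
    """Flag multiplications whose operands carry a division result."""
    division_results = set()
    findings = []
    for result, op, operands, branch in statements:
        if op == "div":
            division_results.add(result)
        elif op == "mul" and any(v in division_results for v in operands):
            # No check that the div and the mul lie on the same feasible
            # path, so mutually exclusive if/else branches still match.
            findings.append(result)
    return findings

# Division and multiplication in mutually exclusive branches (as in
# updateRewardRate) are still flagged, because the branch label is ignored:
toy_ir = [
    ("rate", "div", ("pool", "scale"), "if-branch"),
    ("out", "mul", ("rate", "bonus"), "else-branch"),
]
```

Here `find_divide_before_multiply(toy_ir)` reports `out` even though the two operations never execute in the same invocation, which is exactly the false-positive shape observed in Listing 2.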
In summary, the primary reason for the false positive in our case stems from Slither's relatively simplistic and incomplete rule implementation. While it effectively tracks and maps operations, it lacks a nuanced understanding of the logical separation introduced by conditional branching. For example, Slither's method does not fully consider the operational context provided by if-else branches, which led to our example's two operations in separate conditional branches being incorrectly flagged as a precision-loss issue.
This analysis reveals that while Slither is proficient in operation detection and data flow tracking, its approach to branch analysis could benefit from more sophistication to understand the complex logical structures in real-world smart contracts. This would enhance its accuracy in identifying genuine vulnerabilities and reduce instances of false positives.
A similar case was observed for the "Weak PRNG" detector. The detector (shown in Listing 4), designed to flag uses of the modulus operator in conjunction with variables such as block.timestamp, now, or blockhash, erroneously identified a vulnerability in a simple modulus function such as z = x + y; z = z % 2;. This false positive stems from Slither's pattern-based detection methodology, which, while effective for identifying standard vulnerability patterns, lacks the contextual understanding needed to distinguish the modulus operator's use in random number generation from its benign use in straightforward arithmetic.
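A purely pattern-based check of this kind can be sketched as follows (an illustration of the methodology, not Slither's code). One plausible way the false positive arises is that the check looks for a block-derived value anywhere in the function, rather than tracing whether that value actually reaches the modulus:

```python
# Sketch of a pattern-based weak-PRNG check (illustrative, not Slither's
# implementation): flag every modulus in a function that mentions a
# block-derived value anywhere, with no data-flow between the two.
BLOCK_SOURCES = ("block.timestamp", "now", "blockhash")

def weak_prng_candidates(function_lines):
    """Flag each line using % in a function that mentions a block value."""
    mentions_block = any(
        src in line for line in function_lines for src in BLOCK_SOURCES
    )
    return [line for line in function_lines if "%" in line and mentions_block]
```

A benign `z = z % 2;` is then flagged merely because a block-derived value appears elsewhere in the same function, while the same line in isolation would not be.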
Detecting reentrancy issues is particularly challenging for existing SAST tools due to unknown contexts such as modifiers or external calls [1]. The difficulty lies in identifying trusted behavior in Solidity, especially for external calls to contracts without source code. While filtering for single-address reentrancies could be a solution, it still requires annotations for establishing trust, underscoring the need for human judgment in certain cases.
Among these SA tools, Slither implements seven detectors for reentrancy. Our manual review of their rule implementations shows that they are primarily heuristic-based: Slither identifies patterns traditionally associated with reentrancy risks, such as external calls that can lead to callbacks followed by state changes. The table below presents the patterns Slither defines to detect reentrancy.
This approach, while generally effective compared with other tools' implementations (such as SmartCheck's simple pattern matching using regular expressions), may not discern the subtleties and safe practices in more complex contract interactions.
Take ReentrancyEth as an example: its implementation tracks state across the control flow (L137-179), recording calls that send ETH, calls that can trigger callbacks, and variables that are read or written across the contract's CFG. While this is comprehensive, it may not fully capture the nuances of certain secure patterns, such as safe external interactions with trusted contracts, or internal functions that follow external calls without introducing vulnerabilities.
However, it might not fully account for the context of external calls, such as whether the call is made to a well-audited, standard contract (like an ERC20 token) or whether additional security checks (like ownership verification) are in place. The tool focuses on the pattern of the calls and state changes rather than the broader trust environment in which these calls occur.
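The call-then-write heuristic described above can be sketched as follows (an illustrative simplification, not Slither's code): any state write that follows an ETH-sending external call is reported, with no notion of whether the callee is trusted or whether access-control modifiers guard the function.

```python
# Sketch of the call-then-write reentrancy heuristic (illustrative, not
# Slither's implementation). Statements are an ordered list of
# ("call", target) or ("write", variable) events along one path.
def reentrancy_eth_candidates(statements):
    """Flag every state write that follows an external call."""
    seen_external_call = False
    findings = []
    for kind, arg in statements:
        if kind == "call":
            seen_external_call = True   # callee trust is never examined
        elif kind == "write" and seen_external_call:
            findings.append(arg)        # write after call => flagged
    return findings
```

Under this sketch, a sequence such as `[("call", "token.transfer"), ("write", "balances")]` is flagged regardless of whether the call targets a well-audited ERC20 token or the function is guarded by onlyOwner, which is the gap discussed above.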
Real-world contracts often implement complex logic and sophisticated security practices that may go beyond the standard patterns recognized by Slither.
Take Listing 3 as an example: the function secureTokenTransfer performs a state change after an external call to token.transfer(recipient, amount) (L22-25). This external call is followed by afterTransferAction(amount) (L24), an internal function call. Although this pattern, where state changes occur after an external call, was flagged by Slither as a potential reentrancy risk, it is secure in this context: the external call targets a trusted, standard ERC20 token, and the function is protected by an onlyOwner modifier, ensuring controlled access. These additional layers of security and trust are not apparent in the CFG analysis that Slither performs.
We also analyzed why SE tools like Mythril generally suffer from numerous false negatives. Delving into the implementation of their detection rules as well as the code cases within our benchmarks, we found three main root causes: (1) complexity and redundancy in contract code (especially in the SolidiFi benchmark); (2) technological limitations (e.g., complicated vulnerable scenarios); and (3) analysis failures (fatal errors within SMT solvers and timeouts).
The contracts in the SolidiFi benchmark exhibit a high level of complexity and redundancy, with multiple similar functions and mappings. This complexity can obscure the flow of data and control, making it challenging for SE tools, especially Mythril, to accurately track state and identify vulnerabilities. For example, consider the functions bug_intou20 and bug_intou32 in buggy_15.sol [2], which are scattered throughout the contract with similar patterns. The repetitive and convoluted nature of these functions can lead analysis tools to miss the intentional overflow bugs, and Mythril has inherent limitations in parsing and understanding complex code structures [3].
For instance, the buggy_15.sol contract contains multiple instances of functions like bug_intou7, bug_intou27, etc., each introducing potential underflow or overflow issues. The cumulative effect of these numerous instances can overwhelm the analysis tool, leaving some vulnerabilities undetected.
SAST tools, including SE tools, often struggle with complex vulnerabilities that are deeply intertwined with the contract's logic and specific scenarios. For example, CVE-2018-10468 [4] and CVE-2018-10706 [5] represent overflow vulnerabilities that are not straightforward and involve intricate logical conditions. In CVE-2018-10468, the overflow vulnerability (Line 14) is tied to the contract's logic of token transfer and balance update (Lines 10-12). The vulnerability arises from a logical error in the overflow check (Line 14), which is not easily detectable by tools that primarily look for standard overflow patterns. Hence, detecting them requires contextual analysis. However, SAST tools struggle to fully understand the context in which certain operations occur, especially when vulnerabilities are conditional or depend on specific contract states.
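To illustrate why such logic errors evade pattern matching, consider a simplified, hypothetical analogue (not the actual CVE code): the overflow guard is present, so a detector searching for a *missing* check matches nothing, yet the comparison itself is wrong and always passes.

```python
# Hypothetical analogue of a logically flawed overflow guard (not the
# actual CVE-2018-10468 code). A 256-bit addition wraps around; only
# reasoning about the predicate itself, not its mere presence, reveals
# the bug.
U256 = 2 ** 256

def wrapped_add(a: int, b: int) -> int:
    return (a + b) % U256          # models uint256 wrap-around

def correct_guard(a: int, b: int) -> bool:
    return wrapped_add(a, b) >= a  # fails exactly when the sum wraps

def flawed_guard(a: int, b: int) -> bool:
    return wrapped_add(a, b) >= 0  # logical error: trivially true
```

With a = U256 - 1 and b = 2, the sum wraps to 1; correct_guard rejects it while flawed_guard accepts it, even though both look like "an overflow check" to a purely syntactic detector.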
As shown in our manuscript, a significant portion of contracts may experience analysis failure due to various reasons, including fatal errors or timeouts. Mythril, despite having a relatively high success rate (590, 74.87%) compared to other SE tools, still encounters these issues. We attribute this to two reasons: i) the Z3 SMT solver used in Mythril can struggle with complex satisfiability problems, leading to timeouts or errors; this is particularly evident in contracts with numerous potential vulnerabilities or complex execution paths; ii) SE tools' use of a DFS algorithm can lead to a focus on fewer vulnerabilities, as it delves deep into specific execution paths and can miss vulnerabilities that are not on the prioritized paths. Unlike SA tools, which can analyze a broader range of vulnerabilities more superficially, SE tools' need to check the satisfiability of each potential vulnerability can be more time-consuming and resource-intensive, especially for the complicated code structures of real-world contracts such as the CVE and BNB contracts within our benchmark.
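The DFS point can be illustrated with a toy exploration budget (the tree shape and bug placement are invented for illustration): depth-first search exhausts its budget on one deep path and never reaches a shallow vulnerability on a sibling branch.

```python
# Toy model of depth-first path exploration under a step budget
# (illustrative; tree shape and bug placement are invented). DFS spends
# its budget on one deep branch, mirroring how prioritized paths can
# starve sibling paths that contain the actual vulnerability.
def dfs_findings(node, budget):
    """node: (label, children). Returns labels of 'BUG' nodes reached."""
    found = []

    def visit(n, steps):
        if steps <= 0:
            return 0              # budget exhausted: subtree unexplored
        label, children = n
        steps -= 1
        if label == "BUG":
            found.append(label)
        for child in children:
            steps = visit(child, steps)
        return steps

    visit(node, budget)
    return found

# One deep safe chain explored first, one shallow bug as a sibling:
tree = ("root", [("a", [("b", [("c", [("d", [])])])]), ("BUG", [])])
```

Here `dfs_findings(tree, 5)` returns `[]` because the five steps are spent on the safe chain, while `dfs_findings(tree, 6)` finally reaches the sibling and returns `["BUG"]`.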
[1] Zibin Zheng, Neng Zhang, Jianzhong Su, Zhijie Zhong, Mingxi Ye, and Jiachi Chen, "Turn the Rudder: A Beacon of Reentrancy Detection for Smart Contracts on Ethereum." in Proceedings of the 45th International Conference on Software Engineering, ser. ICSE '23. IEEE Press, 2023.
[2] https://github.com/DependableSystemsLab/SolidiFI-benchmark/blob/master/buggy_contracts/Overflow-Underflow/buggy_15.sol.
[3] Fix an edge case (#1820) · Consensys/mythril. https://github.com/Consensys/mythril/commit/c7255768b7083e57b914d0b74a4a9f955b67577a.
[4] https://github.com/renardbebe/Smart-Contract-Benchmark-Suites/blob/master/dataset/CV/2018-10468.sol.
[5] https://github.com/renardbebe/Smart-Contract-Benchmark-Suites/blob/master/dataset/CV/2018-10706.sol.
For the experimental setup, we set a timeout limit of one hour for each tool. Taking inspiration from Ren et al. [12], whose work underscores the importance of the maximum recursion depth for the effectiveness and efficiency of symbolic execution tools, we uniformly set this value to 45 across the tools. Manticore was an exception, as it does not support a user-defined maximum recursion depth.
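As a sketch, such a harness might wrap each tool invocation as below (an assumption about the setup, not our exact scripts). The Mythril flag names --execution-timeout and --max-depth are taken from its CLI as we understand it and should be checked against the installed version; other tools use their own equivalents, and Manticore omits the depth cap.

```python
import subprocess

# Assumed harness sketch: one-hour budget per tool, depth capped at 45
# for Mythril. Flag names are Mythril's CLI options as we understand
# them; verify against the installed version before use.
def mythril_command(contract_path, timeout_s=3600, max_depth=45):
    return ["myth", "analyze", contract_path,
            "--execution-timeout", str(timeout_s),
            "--max-depth", str(max_depth)]

def run_with_limit(cmd, timeout_s=3600):
    # Enforce the wall-clock limit at the harness level too, so a hung
    # solver cannot exceed the budget even if the tool's own timer fails.
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=timeout_s + 60)
```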
Detailed result data can be accessed from the following file:
[12] M. Ren, Z. Yin, F. Ma, Z. Xu, Y. Jiang, C. Sun, H. Li, and Y. Cai, “Empirical evaluation of smart contract testing: What is the best choice?” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA2021. New York, NY, USA: Association for Computing Machinery, 2021, p. 566–579. [Online]. Available: https://doi.org/10.1145/3460319.3464837.