Automated software engineering has seen flourishing applications of Large Language Models (LLMs). Significant research effort has focused on evaluating code LLMs across various dimensions, resulting in numerous benchmarks and frameworks. Beyond the heavily prioritized ability of code generation, code comprehension is attracting increasing attention. Nevertheless, existing work on evaluating the code comprehension of LLMs suffers from various limitations. Evaluation frameworks such as CRUXEval and REval typically rely on code reasoning tasks over specific input cases, which limits the scope of execution traces examined. This practice covers code semantics insufficiently and cannot thoroughly evaluate LLMs' overall program understanding.
To address this issue, we introduce a novel black-box evaluation framework that assesses LLMs' code comprehension through program specifications. Inspired by the idea that specifications articulate program behaviors across all possible execution traces, the framework employs formal program specifications to represent program semantics and thereby enables comprehensive evaluation. Specifically, we design four specification-related tasks, progressing from basic to advanced levels, to assess LLM capabilities. Counterfactual analysis is further incorporated to investigate LLM performance variance under semantics-preserving perturbations. Systematic experiments on six state-of-the-art LLMs reveal unsatisfactory performance on the specification-related tasks, exposing the limitations of existing LLMs in articulating program semantics as formal specifications. The counterfactual analysis further underscores LLMs' sensitivity to semantics-preserving perturbations.
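To make these notions concrete, the minimal sketch below illustrates what a program specification and a semantics-preserving perturbation might look like. The Python encoding, the assertion-style postcondition check, and the toy absolute-value function are illustrative assumptions of this sketch, not the specification language or perturbation set used by the framework.

```python
# Illustrative sketch: a Hoare-style postcondition for an absolute-value
# function, plus a semantics-preserving perturbation (identifier renaming
# and branch inversion). All names here are hypothetical examples.

def abs_value(x: int) -> int:
    """Original program: return the absolute value of x."""
    return x if x >= 0 else -x

def abs_value_perturbed(n: int) -> int:
    """Perturbed variant with identical semantics: renamed identifier,
    inverted branch condition."""
    return -n if n < 0 else n

def satisfies_spec(f) -> bool:
    """Check the postcondition  result >= 0  and  result in {x, -x}
    over a small input sample (a stand-in for formal verification)."""
    return all(f(x) >= 0 and f(x) in (x, -x) for x in range(-5, 6))

if __name__ == "__main__":
    # Both variants satisfy the same specification, so a model that truly
    # understands the program should describe both with equivalent
    # specifications despite the surface-level perturbation.
    assert satisfies_spec(abs_value) and satisfies_spec(abs_value_perturbed)
    print("Both variants satisfy the specification.")
```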