Speeding up Data Manipulation Tasks with Alternative Implementations

Yida Tao, Shan Tang, Yepang Liu, Zhiwu Xu and Shengchao Qin

As data volume and complexity grow at an unprecedented rate, the performance of data manipulation programs is becoming a major concern for developers. In this work, we study how alternative API choices can improve data manipulation performance while preserving task-specific functional equivalence. We propose a lightweight approach that leverages the comparative structures in Q&A sites to extract alternative implementations. A key appeal of this approach is that it exploits crowd knowledge and the very nature of comparison to reveal programming alternatives, which is fundamentally different from conventional approaches that rely on program analysis and testing.
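To make the idea of mining comparative structures concrete, here is a deliberately simplified sketch. It assumes that code mentions are wrapped in backticks and that the comparison follows the shape "A is faster than B". The regular expression, the function name `extract_pairs`, and the example sentence are all hypothetical stand-ins for illustration; the released artifacts use Semgrex patterns, which this sketch does not reproduce.

```python
import re

# Hypothetical pattern (an assumption, not the patterns released with the study):
# "`A` is/was/runs [adverb] faster|slower than `B`", with backtick-quoted code.
COMPARATIVE = re.compile(
    r"`(?P<left>[^`]+)`\s+(?:is|was|runs)\s+(?:\w+\s+)?"
    r"(?P<adj>faster|slower)\s+than\s+`(?P<right>[^`]+)`",
    re.IGNORECASE,
)


def extract_pairs(text):
    """Return (faster, slower) code-mention pairs found in comparative sentences."""
    pairs = []
    for match in COMPARATIVE.finditer(text):
        left, right = match.group("left"), match.group("right")
        if match.group("adj").lower() == "faster":
            pairs.append((left, right))
        else:
            # "A is slower than B" means B is the faster alternative.
            pairs.append((right, left))
    return pairs


# A made-up sentence in the style of a Q&A answer.
post = "On my machine `np.sum(a)` is much faster than `sum(a)`."
print(extract_pairs(post))
```

A surface pattern like this misses most real phrasings (negation, pronouns, comparisons that span sentences), which is one reason a dependency-based matcher is a more natural fit for the task.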

On a large dataset of Stack Overflow posts, our approach extracts 5,080 pairs of alternative implementations that invoke different data manipulation APIs to solve the same tasks, with an accuracy of 86%. For over 20% of these pairs, the faster implementation provides a more than 10x speedup over its slower alternative. We identify 68 recurring alternative API pairs from the extraction results and manually characterize them to understand the types of APIs that can be used alternatively. To put these findings into practice, we implement a tool, AlterAPI, to automatically optimize real-world data manipulation programs. In the 1,267 optimization attempts on the Kaggle dataset, 76% achieved desirable performance improvements with up to orders-of-magnitude speedup.
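As an illustration of what a pair of alternative implementations looks like, the sketch below uses two Python standard-library ways of taking the k smallest items. The pair is chosen only because it needs no third-party packages; it is not claimed to be one of the 5,080 extracted pairs, and the helper names `agree_on_random_inputs` and `speedup` are hypothetical. The sketch checks agreement on random inputs and measures a speedup ratio, roughly in the spirit of validating a pair by random testing.

```python
import heapq
import random
import timeit


def k_smallest_sort(xs, k):
    """Alternative 1: sort everything, then slice."""
    return sorted(xs)[:k]


def k_smallest_heap(xs, k):
    """Alternative 2: heap-based selection via a different API."""
    return heapq.nsmallest(k, xs)


def agree_on_random_inputs(f, g, trials=200, seed=0):
    """Random testing: return True if f and g agree on every generated input."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
        k = rng.randint(0, 60)
        # Pass copies so neither implementation can affect the other's input.
        if f(list(xs), k) != g(list(xs), k):
            return False
    return True


def speedup(slow, fast, xs, k, repeat=5):
    """Ratio of best-of-`repeat` wall-clock times, slow / fast."""
    t_slow = min(timeit.repeat(lambda: slow(xs, k), number=1, repeat=repeat))
    t_fast = min(timeit.repeat(lambda: fast(xs, k), number=1, repeat=repeat))
    return t_slow / max(t_fast, 1e-12)  # guard against a zero timer reading


if __name__ == "__main__":
    print("equivalent on random inputs:",
          agree_on_random_inputs(k_smallest_sort, k_smallest_heap))
    data = [random.random() for _ in range(200_000)]
    print("speedup (sort / heap):",
          speedup(k_smallest_sort, k_smallest_heap, data, 10))
```

Passing random tests only suggests equivalence for the tested input distribution, and which alternative is faster depends on the input size and on `k`, so the measured ratio should be read as specific to the workload.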

We hope that our study offers a new perspective on API recommendation and automatic performance optimization. To facilitate future research, we release all our scripts and experiment results, as well as the AlterAPI tool.

Dataset (zip)

  • The 5,080 validated alternative implementations

  • The 68 recurring alternative API pairs and their characteristics

  • The 1,267 optimization results on the Kaggle data

  • Semgrex patterns


Scripts (zip)

  • Extracting from comparative sentences

  • Extracting from consecutive profiling statements

  • Random testing for validation


AlterAPI tool (zip) (github)

  • A tool that automatically optimizes data manipulation programs by suggesting faster alternative implementations.