We scraped the Google Play Store from December 2020 to February 2021 to collect app metadata (app ID, app rating, number of installs, and privacy policy link) across diverse app categories, gathering metadata for 500,000 apps. We then scraped each app's privacy policy by following the privacy policy links collected in the first phase. After discarding non-English and unreachable privacy policies, we were left with 213,000 app privacy policies. Next, we downloaded the Android application packages (APKs) for these 213,000 apps and obtained 164,156 usable APKs. We decompiled the collected APKs to extract the declared permissions from each manifest file (i.e., AndroidManifest.xml), and mapped each dangerous permission API to one of the ten dangerous permission groups presented in Table 6. The top three dangerous permission groups were PersistentID (97%), STORAGE (62%), and LOCATION (53%); the least accessed groups (each accessed by fewer than 5% of apps) were CALENDAR, SENSOR, and SMS. The median number of dangerous permissions accessed per app was three. Fig. 2 shows the distribution of dangerous permissions in our dataset.
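The permission-extraction and grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used to build the dataset: it assumes the manifest has already been decoded to plain XML by the decompilation step, the function names are hypothetical, and `PERMISSION_GROUPS` is a small illustrative subset rather than the full mapping in Table 6.

```python
import xml.etree.ElementTree as ET

# Namespace of android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Illustrative subset only (assumption): the real mapping is the
# 30-permission, 10-group table referenced as Table 6.
PERMISSION_GROUPS = {
    "android.permission.ACCESS_FINE_LOCATION": "LOCATION",
    "android.permission.ACCESS_COARSE_LOCATION": "LOCATION",
    "android.permission.READ_EXTERNAL_STORAGE": "STORAGE",
    "android.permission.WRITE_EXTERNAL_STORAGE": "STORAGE",
    "android.permission.READ_CALENDAR": "CALENDAR",
    "android.permission.READ_SMS": "SMS",
}


def declared_permissions(manifest_xml):
    """Return the set of permission names declared via <uses-permission>."""
    root = ET.fromstring(manifest_xml)
    name_attr = "{%s}name" % ANDROID_NS
    return {
        el.get(name_attr)
        for el in root.iter("uses-permission")
        if el.get(name_attr)
    }


def dangerous_groups(permissions):
    """Map declared permissions to dangerous permission groups.

    Permissions missing from PERMISSION_GROUPS (e.g. INTERNET) are ignored.
    """
    return {PERMISSION_GROUPS[p] for p in permissions if p in PERMISSION_GROUPS}


# Hypothetical decoded manifest for a made-up app.
SAMPLE_MANIFEST = """\
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.ACCESS_FINE_LOCATION" />
    <uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />
</manifest>
"""

perms = declared_permissions(SAMPLE_MANIFEST)
print(sorted(dangerous_groups(perms)))  # ['LOCATION', 'STORAGE']
```

Running the same two functions over every decoded manifest and counting how many apps fall into each group would give per-group percentages of the kind reported above.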
🛎️ Don't forget to cite our work if you use the dataset:
@ARTICLE{9861610,
author={Rahman, Muhammad Sajidur and Naghavi, Pirouz and Kojusner, Blas and Afroz, Sadia and Williams, Byron and Rampazzi, Sara and Bindschaedler, Vincent},
journal={IEEE Access},
title={PermPress: Machine Learning-Based Pipeline to Evaluate Permissions in App Privacy Policies},
year={2022},
volume={10},
pages={89248-89269},
doi={10.1109/ACCESS.2022.3199882}
}
The 164K Android apps along with their metadata (privacy policy, declared permissions) are available for download. To our knowledge, this is the largest app policy corpus available for research.
Fig. 2. Distribution of dangerous permissions in our dataset of 164,156 apps.
Table 6. Thirty dangerous permission APIs categorized into 10 permission groups. All permissions listed here, except those marked with an asterisk, have the protection level "dangerous" in the official Android API documentation. Previous work has found that ACCESS_WIFI_STATE and ACCESS_NETWORK_STATE can be used to access PII data.