Google Scours the Internet for Dirty Android Apps
18.11.2018 securityweek

Google is analyzing all the apps that it can find across the Internet in an effort to keep Android users protected from Potentially Harmful Applications (PHAs).

One week after launching the Android Ecosystem Security Transparency Report, Google decided to explain how it leverages machine learning techniques for detecting PHAs.

Google Play Protect (GPP), the set of security services that helps keep devices with Google Play clean, analyzes more than half a million apps each day and looks for those apps everywhere it can, the Internet search giant said.

Thanks to machine learning, Google says it is able to detect PHAs faster and scale better. The scanning system uses multiple data sources and machine learning models to analyze apps and evaluate the user experience.

Google Play Protect inspects the APK of every application it can find to extract PHA signals such as SMS fraud, phishing, privilege escalation, and the like. Both the resources inside the APK file and the app's behavior are analyzed to produce information about the app's characteristics.
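Google does not publish the code behind this analysis, but the idea of pulling static signals out of an APK can be illustrated with a short sketch. The Python snippet below relies only on the fact that an APK is a ZIP archive; the file name and the particular signals it counts are illustrative assumptions, not Google's actual feature set.

```python
# Illustrative sketch only, not Google's pipeline. It shows one kind of static
# signal extraction: enumerating an APK's contents (an APK is a ZIP archive)
# and counting resources often associated with PHA behavior. A real analyzer
# would also decode the binary AndroidManifest.xml to read requested permissions.
import zipfile
from collections import Counter

def static_signals(apk_path: str) -> dict:
    """Collect simple structural signals from an APK file."""
    with zipfile.ZipFile(apk_path) as apk:
        names = apk.namelist()

    # Count entries by file extension (dex, apk, so, ...).
    suffix_counts = Counter(name.rsplit(".", 1)[-1].lower()
                            for name in names if "." in name)
    return {
        "entry_count": len(names),
        "dex_count": suffix_counts.get("dex", 0),       # bytecode files
        "embedded_apks": suffix_counts.get("apk", 0),   # nested APKs
        "native_libs": sum(1 for n in names if n.startswith("lib/")),
        "has_assets": any(n.startswith("assets/") for n in names),
    }

# Example (hypothetical file name): signals = static_signals("sample.apk")
```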

Additionally, Google attempts to understand how users perceive apps by collecting feedback (such as the number of installs, ratings, and comments) from Google Play, as well as information about the developer (such as the certificates they use and their history of published apps).

“In general, our data sources yield raw signals, which then need to be transformed into machine learning features for use by our algorithms. Some signals, such as the permissions that an app requests, have a clear semantic meaning and can be directly used. In other cases, we need to engineer our data to make new, more powerful features,” Google notes.
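As a concrete illustration of the quote above, a signal with clear semantic meaning, such as the list of permissions an app requests, can be turned into a fixed-length feature vector in a few lines of code. The permission vocabulary below is a hypothetical example, not Google's.

```python
# Minimal sketch (not Google's code) of turning a raw signal -- the permissions
# an app requests -- into a binary feature vector a model can consume.
PERMISSION_VOCAB = [
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.SYSTEM_ALERT_WINDOW",
    "android.permission.RECEIVE_BOOT_COMPLETED",
]

def permissions_to_features(requested: list[str]) -> list[int]:
    """One-hot encode requested permissions against a fixed vocabulary."""
    requested_set = set(requested)
    return [1 if perm in requested_set else 0 for perm in PERMISSION_VOCAB]

# e.g. permissions_to_features(["android.permission.SEND_SMS"]) -> [1, 0, 0, 0]
```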

The company calculates a rating for each developer based on the ratings of that developer's apps, and uses it to validate future apps. The tech giant also uses embeddings to create compact representations of sparse data, and feature selection to streamline the data and make it more useful to its models.
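A hedged sketch of two of those ideas might look like the following: a developer-level rating aggregated from per-app ratings (the install-weighted scoring rule is an assumption), and a standard feature-selection step using scikit-learn. None of this is Google's implementation.

```python
# Illustrative sketch only. (1) Aggregate per-app ratings into one developer
# feature. (2) Keep only the k features most associated with the PHA label.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

def developer_rating(app_ratings: list[float], app_installs: list[int]) -> float:
    """Weight each app's rating by its install count so popular apps dominate."""
    ratings = np.asarray(app_ratings, dtype=float)
    installs = np.asarray(app_installs, dtype=float)
    return float(np.average(ratings, weights=installs))

def select_features(X: np.ndarray, y: np.ndarray, k: int = 100):
    """Feature selection with a chi-squared score (assumes non-negative,
    e.g. binary, features); returns the reduced matrix and the fitted selector."""
    selector = SelectKBest(score_func=chi2, k=min(k, X.shape[1]))
    return selector.fit_transform(X, y), selector
```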

“By combining our different datasets and investing in feature engineering and feature selection, we improve the quality of the data that can be fed to various types of machine learning models,” the company notes.

Google uses models to identify PHAs in specific categories, such as SMS fraud or phishing. While these are broad categories, narrower models also exist, targeting groups of apps that are part of the same PHA campaign and share source code and behavior.

Each approach comes with its own perks and caveats. A single model that tackles a broad category is simpler to build and maintain but loses precision because it has to generalize, while many narrowly scoped PHA models improve precision but require additional engineering effort and each covers a smaller slice of the problem.
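One plausible way to organize such a setup, purely as an illustration, is to train a separate binary classifier per PHA category. The category names, classifier choice, and function names below are assumptions for the sketch, not a description of Google's system.

```python
# Illustrative sketch: one binary classifier per PHA category, plus a helper
# that scores a single app against every category model.
import numpy as np
from sklearn.linear_model import LogisticRegression

CATEGORIES = ["sms_fraud", "phishing", "privilege_escalation"]

def train_category_models(X: np.ndarray, labels: dict[str, np.ndarray]):
    """Train one binary model per category; labels[cat] is a 0/1 vector."""
    models = {}
    for category in CATEGORIES:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, labels[category])
        models[category] = clf
    return models

def score_app(models, x: np.ndarray) -> dict[str, float]:
    """Probability that a single app belongs to each PHA category."""
    return {cat: float(m.predict_proba(x.reshape(1, -1))[0, 1])
            for cat, m in models.items()}
```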

In its machine learning approach, Google uses both supervised and unsupervised techniques. These range from logistic regression, which has a simple structure and can be trained quickly, to deep neural networks, which can capture complicated interactions between features and extract hidden patterns.
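The contrast between the two supervised techniques can be sketched with off-the-shelf tools. The snippet below uses scikit-learn with illustrative hyperparameters and is not a description of Google's production models; the unsupervised side (for example, clustering similar apps) is not shown.

```python
# Hedged sketch: train a logistic regression (simple, fast) and a multi-layer
# neural network (captures feature interactions) on the same labeled data,
# then compare their precision on a held-out split.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

def compare_models(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    deep = MLPClassifier(hidden_layer_sizes=(256, 128),
                         max_iter=200).fit(X_train, y_train)

    return {
        "logistic_regression_precision": precision_score(y_test, linear.predict(X_test)),
        "deep_network_precision": precision_score(y_test, deep.predict(X_test)),
    }
```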

“PHAs are constantly evolving, so our models need constant updating and monitoring. In production, models are fed with data from recent apps, which help them stay relevant. However, new abuse techniques and behaviors need to be continuously detected and fed into our machine learning models to be able to catch new PHAs and stay on top of recent trends,” Google notes.
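The retraining idea in the quote can be sketched as a sliding window over recently analyzed apps, so the model keeps tracking new abuse techniques. The window length, schedule, and data layout below are assumptions, not Google's.

```python
# Illustrative sketch: rebuild the model on only the freshest labeled samples.
from datetime import datetime, timedelta

def recent_training_window(samples, days: int = 90):
    """Keep only samples newer than the cutoff; each sample carries a timestamp."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    return [s for s in samples if s["analyzed_at"] >= cutoff]

def retrain(model_factory, samples, days: int = 90):
    """Rebuild the model from scratch on the recent window of labeled apps."""
    window = recent_training_window(samples, days)
    X = [s["features"] for s in window]
    y = [s["label"] for s in window]
    return model_factory().fit(X, y)

# Usage (hypothetical): retrain(lambda: LogisticRegression(max_iter=1000), samples)
```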

Google says the machine learning models successfully detected 60.3% of the PHAs identified by Google Play Protect, which covers over 2 billion Android devices, and adds that it will continue investing in the technology.