Machine Learning-Driven Ad Blocking: From Data Collection to Deployment

In this hands-on tutorial, we’ll introduce our work on machine learning-based ad blocking: the research we have performed and our results. Together we’ll set up a website crawling service that produces datasets for self-supervised training, transform the crawled data into graphs, train a graph-based machine learning model, and deploy it in a browser extension that runs inference in the browser and blocks ads. The goal of this session is to give you an overview of how a machine learning project of similar scope can be tackled, and how to develop its basic components, from data gathering to a simple extension that uses the model you trained.

Background

Current state of adblocking

At the core of most adblockers is a set of manually curated filterlists (the best known being EasyList). The bigger these filterlists get, the harder they are to maintain, since it becomes challenging to tell which filters are outdated and which are still needed.

Moreover, these filterlists scale poorly: most of the time, every new ad requires a new filter. This increases the file size and makes the adblocker less performant. Down the road, this means a poor experience for the end user, and we don't want that.

Why ML?

OK, so we've talked about how adblockers currently work and what the issues are. Now you might be wondering: "Why use machine learning? Why not something else?" Fundamentally, adblocking is a pattern-recognition problem, and most of the models used in modern ML, such as neural networks, are optimised for just that.

This is why we, the Automated Adblocking Team from eyeo GmbH (the company behind Adblock Plus), are exploring ways in which different kinds of models might be applied here. These include, but are not limited to, graph-based models, OCR, image recognition and NLP.

Tutorial contents

In this tutorial, we will walk through one of our team's latest exploratory projects, which we like to call "Project Moonshot". In a few words, we'll take a webpage's DOM tree, find a graph-based model that can identify which sections should be blocked, and finally put it into the adblocking extension. In more technical terms, we'll go through:

Prerequisites

Data Collection
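
To make the "DOM tree as a graph" idea concrete before we dive in, here is a minimal sketch in Python. It is not the pipeline used in Project Moonshot: it assumes static HTML parsed with the standard library's `html.parser` rather than a DOM captured from a real browser, and the names `DomGraphBuilder` and `dom_to_graph` are hypothetical, chosen only for this illustration. Each element becomes a node, and each parent-child relationship becomes an edge.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never kept "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Hypothetical helper: collects elements as nodes, nesting as edges."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # one dict per element: {"tag": ..., "attrs": {...}}
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the currently open elements

    def _add_node(self, tag, attrs):
        index = len(self.nodes)
        self.nodes.append({"tag": tag, "attrs": dict(attrs)})
        if self._stack:
            # The innermost open element is the parent of the new node.
            self.edges.append((self._stack[-1], index))
        return index

    def handle_starttag(self, tag, attrs):
        index = self._add_node(tag, attrs)
        if tag not in VOID_TAGS:
            self._stack.append(index)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <img ... />: add the node, keep it closed.
        self._add_node(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[pos]]["tag"] == tag:
                del self._stack[pos:]
                break


def dom_to_graph(html):
    """Return (nodes, edges) for an HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


if __name__ == "__main__":
    page = ('<html><body><div id="content"><p>Hello</p></div>'
            '<div class="ad-banner"><img src="ad.png"></div></body></html>')
    nodes, edges = dom_to_graph(page)
    for parent, child in edges:
        print(nodes[parent]["tag"], "->", nodes[child]["tag"], nodes[child]["attrs"])
```

A graph-based model would then work on node features (here just the tag name and attributes) together with the edge list, for example to label the `ad-banner` subtree as blockable. Which features to use and how to obtain the labels are design decisions this sketch leaves open.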

Who are we?

Dragan Cvetinović

ML Engineer, eyeo GmbH

Johny Jose

ML Engineer, eyeo GmbH

Levan Tsinadze

ML Engineer, eyeo GmbH

Mario Koenig

Product Owner, eyeo GmbH

Rose Howell

ML Engineer, eyeo GmbH

Tudor Avram

ML Engineer, eyeo GmbH