At the core of most adblockers lies a set of manually curated filterlists (the best known of which is Easylist). The bigger these filterlists get, the harder they are to maintain, because it becomes difficult to tell which filters are outdated and which ones are still needed.
Moreover, these filterlists scale poorly: most of the time, every new ad requires a new filter. This leads to a larger file size and a less performant adblocker. Down the road, this means a poor experience for the end user, and we don't want that.
OK, so we've talked about how adblockers currently work and what the issues with this approach are. Now you might be wondering, "Why use Machine Learning? Why not something else?" Fundamentally, adblocking is a pattern-recognition problem, and most of the models used in modern ML, such as neural networks, are optimised for just that.
This is why we, the Automated Adblocking Team from eyeo GmbH (the company behind AdblockPlus), are exploring ways in which different kinds of models might be applied here. These models include, but are not limited to, graph-based models, OCR, image recognition and NLP.
In this tutorial, we will walk through one of our team's latest exploratory projects, which we like to call "Project Moonshot". In a few words, we'll attempt to take a webpage's DOM tree, find a graph-based model that can identify which sections should be blocked and, finally, put that model into the adblocking extension. In more technical terms, we'll go through:
Data collection - JavaScript
We'll extend AdblockPlus and make it work as an automatic labelled data collector and then we'll look into how we can fully automate this step by setting up a crawler.
Data processing/cleanup - Python
After we collect the data, we'll look at what it looks like and what we can do to get it ready for model training.
Model training - Python
With the data prepared, we'll try to train a model and see how well it can learn to classify parts of the DOM tree.
Deployment - JavaScript
In this last section of the tutorial, we'll first try to reproduce the data preprocessing pipeline in JavaScript and then deploy the model in the extension in order to test it in a real-life scenario.
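To make the idea of treating a DOM tree as a graph more concrete, here is a minimal sketch using only Python's standard library. It is not our actual pipeline: the names `DomGraphBuilder` and `dom_to_graph` are hypothetical, and the representation (one node per element, one edge per parent-child link) is just one simple assumption about what a graph-based model could take as input.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Hypothetical helper: DOM elements become nodes, nesting becomes edges."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # one dict per element: tag name and attributes
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        index = len(self.nodes)
        self.nodes.append({"tag": tag, "attrs": dict(attrs)})
        if self._stack:
            # Link the new element to the innermost open element.
            self.edges.append((self._stack[-1], index))
        if tag not in VOID_TAGS:
            self._stack.append(index)

    def handle_endtag(self, tag):
        # Close the most recent open element with this tag name, along with
        # anything left unclosed inside it (real-world markup is often messy).
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[pos]]["tag"] == tag:
                del self._stack[pos:]
                break


def dom_to_graph(html):
    """Return (nodes, edges) for an HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


if __name__ == "__main__":
    page = ('<div id="main"><p>Hello</p><img src="a.png">'
            '<div class="ad-banner"><a href="#">Buy</a></div></div>')
    nodes, edges = dom_to_graph(page)
    for parent, child in edges:
        print(nodes[parent]["tag"], "->", nodes[child]["tag"])
```

In a labelled dataset, each node would additionally carry a "blocked or not" label, and the tag names and attributes would be encoded as numeric features before training.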
Node.js (version >= 12.17.0) and nvm (Node Version Manager)
Git
ABP Extension
Docker and Docker Compose
Dragan Cvetinović
ML Engineer, eyeo GmbH
Johny Jose
ML Engineer, eyeo GmbH
Levan Tsinadze
ML Engineer, eyeo GmbH
Mario Koenig
Product Owner, eyeo GmbH
Rose Howell
ML Engineer, eyeo GmbH
Tudor Avram
ML Engineer, eyeo GmbH