Garf's blog: New SafeBrowsing backend

Today, the SafeBrowsing rewrite that Dave Camp and I have been working on for several months finally landed in the Mozilla Nightlies. It should be part of the Firefox 13 release, having narrowly missed Firefox 12. It reduces the disk footprint of our anti-phishing and malware protection from about 40-50MB to 5-6MB, changes all related I/O from pure random access to a single serial read, and refactors a single 4000+ line C++ file into a set of modules. An earlier part of this work landed in Firefox 9 and reduced the memory footprint from as much as 40-100MB to 1.2MB, as well as removing the need to do some random I/O on every page load.
Aside from the performance gains, the reduced footprint is an essential step toward extending our SafeBrowsing protection to mobile devices, which is why we undertook this in the first place.

It was an interesting assignment and, being my first real project for Mozilla, a bit more involved than we thought at first. I blogged in July last year about our plans for this feature. Some of the optimizations we had in mind didn't work out, while others did end up being implemented.

Eliminating the host keys

One of the things touched upon in the previous blog post was that we used to store two 32-bit hash prefixes for every blocked entry: one encoding only the host and the other encoding the full URL. Strictly speaking, we only need the full URL part. The old SQLite-based code used the host key to SELECT and load all keys for a certain domain at once, but our new in-memory, prefix-trie based approach has no such need. However, as Justin Lebar already touched upon in the previous blog post, this does significantly increase the likelihood of a false positive. We now expect roughly one false positive for every 7000 URLs visited. This will not cause us to block any legitimate sites, as any positive hit in the local database is checked against the full, 256-bit hash at a remote server (hosted by Google, who provides the SafeBrowsing data).
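
To make the two-stage check concrete, here is a minimal sketch in Python rather than the actual C++ code. It assumes SHA-256 as the 256-bit hash, skips URL canonicalization entirely, and uses a sorted list instead of the real in-memory structure. The names `LocalPrefixSet`, `is_blocked` and `confirm_remote` are made up for illustration.

```python
import hashlib
from bisect import bisect_left


def full_hash(url: str) -> bytes:
    """256-bit hash of the URL (canonicalization is omitted in this sketch)."""
    return hashlib.sha256(url.encode("utf-8")).digest()


def prefix32(url: str) -> int:
    """First 32 bits of the full hash, as kept in the local database."""
    return int.from_bytes(full_hash(url)[:4], "big")


class LocalPrefixSet:
    """Sorted list of 32-bit prefixes; a stand-in for the real structure."""

    def __init__(self, prefixes):
        self._sorted = sorted(set(prefixes))

    def might_contain(self, prefix: int) -> bool:
        i = bisect_left(self._sorted, prefix)
        return i < len(self._sorted) and self._sorted[i] == prefix


def is_blocked(url: str, local: LocalPrefixSet, confirm_remote) -> bool:
    """Only a local prefix hit triggers the (hypothetical) remote check."""
    if not local.might_contain(prefix32(url)):
        return False  # the common case: no network traffic at all
    # A prefix hit may be a false positive, so the full hash decides.
    return confirm_remote(full_hash(url))
```

A false positive in this sketch costs one extra call to `confirm_remote` and nothing else: the page is only blocked when the full hash matches as well.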

This does mean we will increase the traffic to this remote server by a large factor. Scary as it may sound, some back-of-the-envelope estimates show it's not really that bad. Say there are about 420M Firefox users, each browsing for 8h/day, so roughly a third of them (140M) are online at any moment. They load on average 1 URL per second, which means about 140M URL loads per second. At one false positive per 7000 URLs, that causes about 20000 hash requests per second to Google. Google confirmed they can handle this with ease.
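
For anyone who wants to play with the numbers, the estimate can be written out as a few lines of Python. The inputs are the rough guesses from the paragraph above, not measurements, and the function name is made up for this sketch.

```python
def hash_requests_per_second(users: int,
                             hours_browsing_per_day: float,
                             urls_per_second_per_user: float,
                             urls_per_false_positive: int) -> float:
    """Back-of-the-envelope load on the remote hash server."""
    # Fraction of the day a user is browsing -> users online at any moment.
    concurrent_users = users * hours_browsing_per_day / 24
    url_loads_per_second = concurrent_users * urls_per_second_per_user
    # Only local false positives reach the server.
    return url_loads_per_second / urls_per_false_positive


estimate = hash_requests_per_second(
    users=420_000_000,
    hours_browsing_per_day=8,
    urls_per_second_per_user=1,
    urls_per_false_positive=7000,
)
print(estimate)  # 20000.0
```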

Now, there is still a problem when doing this: any collision will appear for all users on exactly the same URLs. This means that if you're unlucky enough to be a site owner with a URL that happens to collide, every visitor to your site will have a slightly slower browsing experience. Even worse, should you get linked in a popular forum, or be in the news, there will be a storm of false positives hitting the server all at once. We thought this to be problematic enough that we implemented a workaround: every user generates a unique randomization key and re-randomizes all their hashes with it. Collisions will happen on a different URL for every user, and consequently will also be much better spread through time.
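
One way such a per-user randomization could look is sketched below. This is an illustration under assumptions, not the code that landed: it assumes the client still receives both the host key and the URL prefix and folds them into a single 32-bit value with a keyed hash. BLAKE2b is used only because Python ships it with a `key` parameter, and `new_user_key` and `randomized_prefix` are hypothetical names.

```python
import hashlib
import os


def new_user_key() -> bytes:
    """Random key generated once per profile (the size is an assumption)."""
    return os.urandom(16)


def randomized_prefix(host_key: int, url_prefix: int, user_key: bytes) -> int:
    """Fold two 32-bit values into one 32-bit value, keyed per user."""
    data = host_key.to_bytes(4, "big") + url_prefix.to_bytes(4, "big")
    digest = hashlib.blake2b(data, digest_size=4, key=user_key).digest()
    return int.from_bytes(digest, "big")
```

Both the stored entries and the URL being visited go through the same function with the same key, so real matches still match. Which unrelated URLs happen to land on a stored 32-bit value now depends on the key, so it differs from one user to the next.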

Eliminating chunk numbers

After some discussion, it turned out that eliminating the chunk numbers isn't as easy as we had hoped. First of all, the observation in the previous blog post that chunk expires only seem to happen when the chunks are in fact old doesn't hold up after observing the protocol for a longer time. It also happens very regularly that a chunk is added, deleted, and added back again, particularly in the malware list. In those cases, it is important to know which add chunk a delete refers to, so that it won't delete the later add. It would still be possible to deal with that if the server recalculated the prefixes to send for every client, but this is less efficient on the server side than the bigger, more static downloads that the server can point to now, and which are easily mirrored on Google's network.
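
The add, delete, add-again case is easy to show with a toy model. In this sketch an add entry is an `(add_chunk, prefix)` pair and a sub entry is a `(sub_chunk, add_chunk, prefix)` triple. The tuple layout and function names are assumptions made for the example, not the on-disk format.

```python
def surviving_prefixes(adds, subs):
    """Apply subs using the add chunk number they refer to."""
    cancelled = {(add_chunk, prefix) for (_sub_chunk, add_chunk, prefix) in subs}
    return {prefix for (add_chunk, prefix) in adds
            if (add_chunk, prefix) not in cancelled}


def surviving_prefixes_without_chunks(adds, subs):
    """What happens if we only remember the prefix itself."""
    cancelled = {prefix for (_sub_chunk, _add_chunk, prefix) in subs}
    return {prefix for (_add_chunk, prefix) in adds if prefix not in cancelled}


# The same prefix is added in chunk 1, deleted again, then re-added in chunk 3.
adds = [(1, 0xDEADBEEF), (3, 0xDEADBEEF)]
subs = [(2, 1, 0xDEADBEEF)]  # sub chunk 2 cancels the add from chunk 1 only

print(surviving_prefixes(adds, subs))                 # the later add survives
print(surviving_prefixes_without_chunks(adds, subs))  # the later add is lost
```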

Sub prefixes compress harder

In line with the previous paragraph, it happens that we receive sub prefixes for add prefixes we never received. These must be kept around until we receive an expire for them, as we can't know whether the add they belong to is outdated or just not downloaded yet. Note also that we usually receive updates backwards, i.e. the most recent data is sent to the clients first, as it's believed to be the most relevant. Because sub prefixes contain both an add and a sub chunk number, they are also bigger than add prefixes. This causes the eventual size of the database to be a bit more than the minimum guessed in the previous blog post, which more or less ignored sub prefixes entirely. If you peek in your profile, you can see that goog-malware-shavar.sbstore tends to be the biggest file: this is exactly due to the many sub prefixes in the malware list.
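
A small sketch of that bookkeeping follows, again in Python and with made-up names (`PrefixStore`, `expire_sub_chunk`). It assumes that a sub and the add it cancels can both be dropped once they have met, and that an unmatched sub only goes away when its sub chunk is expired.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixStore:
    adds: set = field(default_factory=set)  # (add_chunk, prefix)
    subs: set = field(default_factory=set)  # (sub_chunk, add_chunk, prefix)

    def add(self, add_chunk: int, prefix: int) -> None:
        self.adds.add((add_chunk, prefix))
        self._knock_out()

    def sub(self, sub_chunk: int, add_chunk: int, prefix: int) -> None:
        self.subs.add((sub_chunk, add_chunk, prefix))
        self._knock_out()

    def expire_sub_chunk(self, sub_chunk: int) -> None:
        """The server told us this sub chunk is no longer needed."""
        self.subs = {s for s in self.subs if s[0] != sub_chunk}

    def _knock_out(self) -> None:
        # A sub that has found its add cancels it; both can then be dropped.
        matched = {s for s in self.subs if (s[1], s[2]) in self.adds}
        self.adds -= {(s[1], s[2]) for s in matched}
        self.subs -= matched


store = PrefixStore()
store.sub(sub_chunk=5, add_chunk=2, prefix=0xCAFEF00D)  # add 2 not seen yet
pending = len(store.subs)                               # the sub has to wait
store.add(add_chunk=2, prefix=0xCAFEF00D)               # older data arrives later
```

Every pending sub carries three fields where an add carries two, which is why a list with many unmatched subs ends up larger than the number of blocked entries alone would suggest.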

Detection performance

It should be noted that these improvements are purely focused on the footprint of the feature. They will improve the resource usage of the browser, but they do not change the detection performance in any way.

NSS Labs Report

In what is somewhat of a funny coincidence, on the same day I am writing this blog post, NSS Labs published a report, "Did Google pull a fast one on Firefox and Safari users?". The main points of this report shouldn't be much news, as I pointed out the discrepancy between Chrome on the one hand and Firefox and Safari on the other over half a year ago in my previous blog post, as well as the reason for it ("Performance differences between Firefox, Chrome and Safari").

I have two remarks on the report. First, as I've already pointed out in the past, false positive control is an important part of effective malware detection. Internet Explorer flags many malware sites, but it also flags legitimate sites, undermining its true effectiveness.

Secondly, the problem isn't so much that the "new" SafeBrowsing protocol is proprietary or undocumented: it's implemented in Chrome, and Chromium is open source, so at the very worst we can go study that code to see how to implement it. The problem is that permission is required to use Google's SafeBrowsing servers, and ~~Firefox does NOT have permission to use the download protection list~~. Edit: Please see the statement from Ian Fette below.

2 comments:

Anonymous, 07 February, 2012 02:37
We have offered the new Safe Browsing features to Mozilla in the past, so to say that we are holding back this functionality is inaccurate. From our conversations, our understanding is that Mozilla is still waiting for more data from Google about the effectiveness of our new technology, and is considering those benefits against the limited circumstances in which their users would send URLs to Google for scanning (this only happens if a page looks sufficiently suspicious or an executable download is not whitelisted). This new protection, which is designed to detect new phishing pages as well as malicious downloads, was highlighted recently on our Chromium Blog in more detail: http://blog.chromium.org/2012/01/all-about-safe-browsing.html. We believe this is a reasonable solution for Chrome users, and Microsoft takes a similar approach in Internet Explorer that involves sending URLs to Microsoft. The offer remains for Mozilla to have access to our new APIs for Firefox should they decide that it's in the best interests of their users.
Unknown, 07 February, 2012 12:10
This is brilliant news. Dropping the size to 10% is a magnificent achievement. Sounds like refactoring the code into various modules should make this aspect of the codebase much more manageable in the future.

Superb!


Monday, February 6, 2012
