The email regex rule documented in the previous blog post (called “dcFilterAlexAntispam” in this graph) complements the other anti-spam measures and has helped catch 100% (!) of all unwanted comments so far. This obviously raises additional questions, as it hints that a single operator or piece of software is behind all this spam.
If a single piece of software is in use, can we easily identify its source? Let’s break down the collected IP addresses by their /8 network and see if any obvious source appears:
A few networks stand out, but we are far from the Pareto rule, where we could attempt to eliminate 80% of the spam by blacklisting 20% of the offending IP addresses.
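The /8 breakdown and the Pareto check can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `count_by_slash8` and `share_of_top` are made up for illustration, and the input is assumed to be a plain list of IPv4 address strings.

```python
from collections import Counter
import ipaddress


def count_by_slash8(addresses):
    """Count how many addresses fall into each /8 network."""
    counts = Counter()
    for addr in addresses:
        ip = ipaddress.IPv4Address(addr.strip())
        # Mask away the lower 24 bits so only the first octet remains.
        network = ipaddress.IPv4Network((int(ip) & 0xFF000000, 8))
        counts[str(network)] += 1
    return counts


def share_of_top(counts, fraction=0.2):
    """Share of all hits covered by the top `fraction` of networks."""
    ranked = [n for _, n in counts.most_common()]
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


if __name__ == "__main__":
    # Made-up sample using documentation address ranges.
    sample = ["192.0.2.1", "192.0.2.77", "198.51.100.5", "203.0.113.9"]
    counts = count_by_slash8(sample)
    for network, hits in counts.most_common():
        print(network, hits)
    print("share covered by top 20% of networks:", share_of_top(counts))
```

If `share_of_top` came out close to 0.8, blacklisting a handful of networks would be worth considering; a much lower value matches the observation above.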
But maybe we can correlate IP addresses to countries or providers? Let’s use the IP to AS service of Team Cymru for this. But first, let’s aggregate the IP addresses to avoid sending duplicates or multiple IP addresses within the same /24 range. Of the slightly fewer than 20’000 entries, 6844 unique /24 ranges were identified. Let’s save this list in a text document, insert the keywords “begin” and “verbose” at the start of the document and “end” at the end of it, then invoke the whois-based conversion service:
$ netcat whois.cymru.com 43 < asm_requests.txt | sort -n > asm_responses.txt
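The request file fed to the command above can be generated rather than edited by hand. The following is a minimal sketch: `build_cymru_request` is a hypothetical helper, and sending the network address of each /24 as its representative is an assumption, since any address inside the range would do.

```python
import ipaddress


def build_cymru_request(addresses):
    """Return bulk whois request text with one address per unique /24."""
    seen = set()
    for addr in addresses:
        # strict=False accepts a host address and yields its enclosing /24.
        net = ipaddress.ip_network(f"{addr.strip()}/24", strict=False)
        seen.add(net.network_address)
    # Wrap the list in the "begin"/"verbose" and "end" keywords.
    lines = ["begin", "verbose"]
    lines += [str(ip) for ip in sorted(seen)]
    lines.append("end")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample; two addresses share the same /24 and collapse into one.
    sample = ["192.0.2.10", "192.0.2.200", "198.51.100.7"]
    print(build_cymru_request(sample), end="")
```

Writing the returned string to asm_requests.txt gives the file that the netcat call expects.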
When importing the result into your preferred spreadsheet, don’t forget to trim the whitespace padding the fields. The outcome is pretty interesting, as some countries emerge from the statistics:
Drilling down to the top 10 countries, we get the following representation:
While some countries are consistently spamming a lot over the year, others have peaks, e.g. Germany in the third trimester of 2013 or Sweden in the first trimester of 2013.
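For those who prefer a script to a spreadsheet, the per-country tally can be sketched as below. It assumes the verbose response is pipe-separated with seven columns and the country code in the fourth one; check this against your own asm_responses.txt before relying on it. The helper names and the sample rows are made up.

```python
from collections import Counter


def parse_cymru_rows(text):
    """Yield whitespace-trimmed field lists from pipe-separated data rows."""
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|", 6)]
        # Skip the banner, a header row, and rows without a numeric AS number.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        yield fields


def count_by_country(text, cc_index=3):
    """Count rows per country code (assumed to sit in column `cc_index`)."""
    return Counter(row[cc_index] for row in parse_cymru_rows(text))


if __name__ == "__main__":
    # Made-up rows using documentation AS numbers and address ranges.
    sample = (
        "Bulk mode; whois.cymru.com\n"
        "64496   | 192.0.2.0     | 192.0.2.0/24     | US | arin    | 2000-01-01 | EXAMPLE-ONE\n"
        "64497   | 198.51.100.0  | 198.51.100.0/24  | DE | ripencc | 2001-01-01 | EXAMPLE-TWO\n"
        "64496   | 203.0.113.0   | 203.0.113.0/24   | US | arin    | 2000-01-01 | EXAMPLE-ONE\n"
    )
    for country, hits in count_by_country(sample).most_common(10):
        print(country, hits)
```

Stripping each field does the same job as the trimming step in the spreadsheet import; without it, "US" and "US " would be counted as different countries.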
There are of course some limitations to this evaluation. First of all, the link between IP address and AS ownership was established today, while some of the IP addresses were recorded over a year ago. Such an address might have been owned by a less reputable party back then than it is now, which would induce a bias. Furthermore, some AS might be registered where their headquarters are, despite operating all over the world. This could help explain the predominance of e.g. the US in this statistic.