As we approach Christmas, I wanted to share something that could help researchers across the internet.
As a byproduct of another research project, we ended up collating a list of more than 15,000,000 web domains. Not just any web domains, mind you. We didn't scan the internet for random domains here; we pulled this list together from many different sources that all have one thing in common: each comes from a project that analyses the data held within its giant DNS systems and records which domains are requested most often. That's right, this is a list of the most popular sites online.
First, a little background on why these companies and projects compile this data. The more popular a site is, the more requests their systems have to deal with. Think of the difference in traffic between Amazon and a random Myspace page about cats. Knowing which websites are the most popular makes it easier to decide which sites to cache, and perhaps even replicate, using content delivery networks (CDNs). For anyone then using those sites, they load faster.
We have not yet found a way to rank the sites in order of popularity, but what we could do is something no one else seems to have done. Once we had collated, deduplicated and sorted all of the sources, we ended up with just over 15 million unique domain names. But we didn't stop there. Some of the sources (for example, the Alexa Top Million) are defunct, and the internet changes constantly, so we wanted to determine which sites out there are alive and which aren't.
We scanned all 15 million domains for live web servers. That trimmed the list down to just over 10 million domains that are live and online right now. It's the most up-to-date gigantic list of popular web servers in the world. And we are making it available for free.
One caveat: the list ONLY covers web servers running on popular ports such as 80 and 443. Some domains from the original list may only have had, say, FTP running and no web server at all. So just bear in mind we are only dealing with web servers that responded.
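If you're curious what that kind of liveness check looks like, here is a minimal sketch in Python, assuming a plain text file of domains, one per line. The filenames and timeout are illustrative, and a real run over 15 million domains needs heavy concurrency rather than this simple loop:

```python
import socket

INPUT = "domains.txt"  # hypothetical input: one domain per line
OUTPUT = "live.txt"    # hypothetical output: domains with a web server

def has_web_server(domain: str, ports=(80, 443), timeout=3) -> bool:
    """Return True if the domain accepts a TCP connection on a web port."""
    for port in ports:
        try:
            with socket.create_connection((domain, port), timeout=timeout):
                return True
        except OSError:
            continue  # connection refused, timeout, DNS failure, etc.
    return False

with open(INPUT) as f, open(OUTPUT, "w") as out:
    for line in f:
        domain = line.strip()
        if domain and has_web_server(domain):
            out.write(domain + "\n")
```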
What can you do with 10 million live websites representing the most popular slice of the internet?
Now, some of you might already be bursting with ideas on what to do with such a list. Maybe you thought of grabbing all their headers, or even screenshots (this gets big very quickly). Maybe some of you thought about checking the SSL certificates associated with them.
As I mentioned, this was a byproduct of another project Cygenta is working on internally. I obviously have my own plans and research ideas for this epic list. I also reached out to some like-minded friends to ask them a simple question:
"What research would you do with a list of 10 million live websites?"
Here is how a few of them responded.
First up, I went to the person who is always top of my list when it comes to weird and wonderful hacking projects. That's right, it's the one and only @deathspirate. He suggested using the list to search for domains that do not have an SPF record set. This would highlight sites that might be at greater risk of spoofed email, alongside working out whether they are cloud based and which provider they use.
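As a rough illustration of that idea, here is a sketch using the dnspython library to spot domains without an SPF record (the domain is just an example):

```python
import dns.exception
import dns.resolver

def has_spf(domain: str) -> bool:
    """Return True if the domain publishes an SPF record in its TXT records."""
    try:
        answers = dns.resolver.resolve(domain, "TXT")
    except dns.exception.DNSException:
        # NXDOMAIN, timeouts, no answer, etc. all count as "no SPF found"
        return False
    return any(b"v=spf1" in b"".join(r.strings) for r in answers)

print(has_spf("example.com"))
```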
We performed a little test using MassDNS, a fantastically fast DNS resolver suggested to me by @tomnomnom, and wow, did we get some amazing results quickly. We might release those results at another point in time.
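For anyone who wants to try something similar, one way to drive MassDNS for a sweep like this is to shell out to it from Python. This is only a sketch, and the filenames are illustrative:

```python
import subprocess

# Assumes massdns is installed, resolvers.txt lists trusted open
# resolvers, and domains.txt holds one domain per line.
subprocess.run(
    [
        "massdns",
        "-r", "resolvers.txt",    # resolvers to spread queries across
        "-t", "TXT",              # TXT records are where SPF lives
        "-o", "S",                # simple, grep-friendly output format
        "-w", "txt_records.txt",  # write results to a file
        "domains.txt",
    ],
    check=True,
)
```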
I was also interested in the technologies being used by these sites:
I love databases of stuff, so obviously we keep all our results in a nice SQL database; above you can see a selection of technologies in use across the internet. Handy for vulnerability research, or even for figuring out which technologies to learn to get a job.
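To give a flavour of the kind of query this enables, here is a sketch against a hypothetical `sites` table with one row per (domain, technology) pair; the schema and filename are illustrative, not our actual database:

```python
import sqlite3

# Hypothetical database of scan results for illustration.
conn = sqlite3.connect("results.db")
top_tech = conn.execute(
    """
    SELECT technology, COUNT(*) AS site_count
    FROM sites
    GROUP BY technology
    ORDER BY site_count DESC
    LIMIT 20
    """
).fetchall()
for tech, count in top_tech:
    print(f"{tech}: {count}")
```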
The OG, Daniel Cuthbert, suggested this dataset can also be used to review which versions of Java are being used and highlight sites that may be vulnerable due to legacy technologies.
Next, we have a selection of TLS certificates pulled from those 10 million sites:
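If you want to harvest certificates yourself, Python's standard library is enough for a quick pass; a minimal sketch (the hostname is just an example):

```python
import socket
import ssl

def get_cert(domain: str, timeout=5) -> dict:
    """Fetch and return the parsed TLS certificate for a domain."""
    ctx = ssl.create_default_context()
    with socket.create_connection((domain, 443), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=domain) as tls:
            return tls.getpeercert()

cert = get_cert("example.com")
print(cert["subject"], cert["notAfter"])  # issued-to details and expiry
```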
Having a solid list of live hosts on the internet allows researchers to find and harvest data far more easily and quickly than scanning across unknown ranges or lists. The ideas and work I have already seen coming from this list are incredible.
Several other great hacking minds, namely @TheXssRat, Stu Kennedy aka @noobiedog, and Ben Bidmead aka @pry0cc, all suggested similar approaches, such as using Nuclei to look at systemic causes of vulnerabilities.
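A run like that could also be kicked off from Python. This sketch assumes Nuclei is installed and that a hypothetical `urls.txt` holds targets, one per line; the template path will depend on your templates release:

```python
import subprocess

# Illustrative nuclei invocation; template folder names vary
# between nuclei-templates releases.
subprocess.run(
    [
        "nuclei",
        "-l", "urls.txt",            # file of target URLs
        "-t", "misconfiguration/",   # template directory to run
        "-o", "nuclei_results.txt",  # write findings to a file
    ],
    check=True,
)
```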
@deathspirate and Daniel Cuthbert were both interested in looking at security headers. The objective here could be to see to what extent the top 10 million websites have security headers in place, a foundational element of website security. You could look for headers such as Content-Security-Policy, X-XSS-Protection and Strict-Transport-Security.
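Checking those headers is only a few lines of Python with the requests library (the domain is just an example):

```python
import requests

SECURITY_HEADERS = [
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-XSS-Protection",
]

def check_headers(domain: str) -> dict:
    """Report which common security headers a site sets on its front page."""
    resp = requests.get(f"https://{domain}", timeout=5)
    return {h: resp.headers.get(h) for h in SECURITY_HEADERS}

print(check_headers("example.com"))
```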
Stu also suggested comparing the sites in this list to those that have had known data breaches, which I think would be an eye-opening project.
Of course, if you are wildly ambitious you could just go and download everything.
Speaking of which, I couldn't round this post off without mentioning that @UK_Daniel_Card and @lkarlslund have been working on a similar project, downloading more than 10 million websites using their own lists. I hope we can all collaborate soon, bringing our projects together and giving the internet a valuable new resource for research.
The future is exciting and the scale of data we now have at our hands is amazing.
So here it is... the list, to help inspire you to do great things and to help the security researchers of the world with the data and projects you will generate using it. Let me know what you do with it!
If you liked this, you might want to subscribe to our mailing list to make sure you don't miss future projects!