Today I've finally decided to publish spelunkTor! It's been 'finished' for a few weeks now, but I hadn't gone through the effort of cleaning it up, adding all the comments, and doing some final performance tweaks. Although I'm not sure how useful it is on its own, hopefully it'll at least help fill out the strange space of onion spiders that exist out there.
I've been doing research on Tor for many years now. For whatever reason it's always fascinated me, both from a technical and a security standpoint. About a year ago I came up with an idea for de-anonymizing Tor's Onion Services. For this method to succeed I needed .onion links, and lots of them! Sadly I was unable to find a solid collection of working links at the scale I was looking for (tens of thousands). So of course I decided to spider the 'dark web' myself!
Although there are many existing spiders out there that seem to be built for this, they all tend to be traditional spiders with traditional goals. As an example, this spider saves tons of information about each site, including its title, description, emails, custom domains, etc. It also has a GUI and saves everything to a full MySQL database. That would be great if I were using this data to build a search engine, but all I needed was active .onion links. To make things worse, .onion sites are often hand-crafted and don't follow proper HTML tagging. This means that tools relying on parsers like Scrapy/BeautifulSoup will miss those links entirely! Again, for a general search engine this isn't a huge deal, but I wanted a bunch of working links sooner rather than later.
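To make that concrete, here's a tiny illustration (the addresses below are made-up placeholders, not real onions): one link sits in a proper anchor tag and another is just pasted into the page body as plain text. A tag-based extractor only sees the first; a dumb regex over the raw text sees both.

```python
import re

# Hypothetical hand-crafted page: the first address sits in a proper <a> tag,
# the second is pasted into the body as bare text (both are fake placeholders).
page = """
<html><body>
<a href="http://exampleexampleexampleexampleexampleexampleexampleexample.onion/">some site</a>
Mirror: secondsecondsecondsecondsecondsecondsecondsecondsecond22.onion
</body></html>
"""

# A tag-based extractor (soup.find_all('a'), an xpath over href, etc.) only
# returns the first link. A regex over the raw text catches both: 56 base32
# characters ([a-z2-7]) followed by ".onion".
ONION_RE = re.compile(r"\b[a-z2-7]{56}\.onion\b")

print(ONION_RE.findall(page))  # prints both addresses
```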
Although this project was mostly reinventing the wheel, I wanted to keep things simple and use tooling that already solved most of the complicated bits for me. I think I ended up with a reasonably simple solution: a Python 3 program that creates and manages a SQLite database and a links file, and spawns child processes to do the actual spidering. Each child grabs the oldest link from the database and spawns a torified wget spider that does all the complicated crawling. wget saves all the data into a single temp file, which is then passed through a regex parser that looks for 56-character strings ending in .onion. These strings are saved to the database, along with an updated access time for the link. Every two minutes the parent Python process takes all links that were reported accessible and dumps them to a text file. That's it!
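For the curious, here's roughly what that loop could look like. This is a simplified, single-process sketch rather than spelunkTor's actual source: the table layout, file names, and wget flags are just illustrative, and the real tool forks multiple workers instead of crawling one link at a time.

```python
import re
import sqlite3
import subprocess
import tempfile
import time

ONION_RE = re.compile(r"\b[a-z2-7]{56}\.onion\b")
DB_PATH = "links.db"            # illustrative file names
DUMP_PATH = "active_links.txt"

def init_db():
    db = sqlite3.connect(DB_PATH)
    db.execute("""CREATE TABLE IF NOT EXISTS links (
                      onion TEXT PRIMARY KEY,
                      last_access REAL)""")
    db.commit()
    return db

def crawl_one(db):
    """Grab the oldest link, crawl it through Tor, and record what we find."""
    row = db.execute(
        "SELECT onion FROM links ORDER BY last_access ASC LIMIT 1").fetchone()
    if row is None:
        return
    onion = row[0]
    with tempfile.NamedTemporaryFile(suffix=".html") as tmp:
        # torsocks routes wget through the local Tor SOCKS proxy; the exact
        # wget options here are a guess at a shallow recursive fetch that
        # concatenates everything into one temp file.
        result = subprocess.run(
            ["torsocks", "wget", "--recursive", "--level=1", "--quiet",
             "--timeout=60", "--tries=1",
             "--output-document", tmp.name, f"http://{onion}/"])
        data = open(tmp.name, "rb").read().decode("utf-8", errors="ignore")

    # Treat a zero exit status as "reachable"; wget can return non-zero when
    # only some sub-pages fail, so the real tool may be more forgiving.
    if result.returncode == 0:
        db.execute("UPDATE links SET last_access = ? WHERE onion = ?",
                   (time.time(), onion))
    # Any 56-character base32 string ending in .onion counts as a new lead.
    for found in set(ONION_RE.findall(data)):
        db.execute("INSERT OR IGNORE INTO links VALUES (?, 0)", (found,))
    db.commit()

def dump_active(db):
    """Write out every link that has been reached at least once."""
    rows = db.execute("SELECT onion FROM links WHERE last_access > 0").fetchall()
    with open(DUMP_PATH, "w") as fh:
        fh.writelines(f"{onion}\n" for (onion,) in rows)

if __name__ == "__main__":
    db = init_db()
    # Hypothetical seed address to get the crawl started.
    db.execute("INSERT OR IGNORE INTO links VALUES (?, 0)",
               ("example" * 8 + ".onion",))
    last_dump = 0.0
    while True:
        crawl_one(db)
        if time.time() - last_dump > 120:   # dump roughly every two minutes
            dump_active(db)
            last_dump = time.time()
```

The nice part of this shape is that wget does all the hard crawling (redirects, retries, recursion) outside of Python entirely; the parent only has to regex the dump it leaves behind and keep the bookkeeping in SQLite.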
I certainly learned a fair bit about the existing tooling around this, and I'm sure spelunkTor could be improved (concurrent access to the Tor process and to the database are currently the big limiTors). For now, I have a good-sized list of links (22,499) and I can focus on the original goal of this project again! Although the landscape is overrun with Tor spiders, hopefully mine is as useful to someone else as it was to me.
Happy Hacking!
- Chris