Hinky's Proxy Project Notes
Random musings about the development & maintenance of The Proxy List
All content is now available at Proxy Obsession!
Google was supposed to shut down this site last month. WTF is the problem?
SHIT OR GET OFF THE POT, GOOGLE!
Since I started tracking URLs, the Google Hack has been much more productive, mostly because it's faster now.
In all I have collected over 5,000 URLs since I started. Of these, about 200 are top-level links. I have just begun to pull these out at regular intervals to re-scan. So far that has been very productive.
You may or may not have noticed Google Page Creator (this site) is going down soon. Google has graciously decided to force me into a Google Sites redirect. Hopefully Google will notice I already have a Google Site (http://www.mrhinkydink.net) and they will put it all there. Are you listening, Google?
However, I have found over the years that when you expect someone to do the "smart thing" you are always sadly disappointed. With that in mind, I'm keeping a backup. But without the WYSIWYG editor it's doubtful I'll be able to keep the project journal updated.
I'm keeping my options open. I'm not entirely impressed with Google Sites. In fact, it sucks. As an option I have reserved a spot at WordPress but I've done nothing with it so far.
But be advised the next time you come here it may be someplace else.
I have been threatening for months to resurrect the SOCKS code and I finally did it today.
I had to find it first. The last time I was hacking around with it, I was doing all the coding on an old Ubuntu 6.06 VM, which I apparently lost track of.
I finally found it, moved the code over to the new Xubu 9.04 VM, installed Anjuta, and recompiled it. It still works. And it's still ugly.
Right now it's looking at all the port 1080 proxies gathered in the past two weeks, just over 4,000 addresses. And, as usual, I'm getting about a 0.001% success rate.
None of them are very usable. In fact, the second one I tested gave me an X-Forwarded-For: header.
WTF? SOCKS proxies shouldn't do that. Period. Is this some new trick? That sucks.
The others have just been slow as fuck.
Anyway, I'm looking into it again.
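That X-Forwarded-For surprise is easy to screen for, at least. Here's a minimal sketch of the header check; the judge output is stubbed in as a string (a real run would fetch a proxy-judge page through the SOCKS proxy first), and the header names are just the usual suspects a judge echoes back:

```shell
# Sketch: flag proxies that leak your address. A proxy judge echoes back
# the request headers it saw; if X-Forwarded-For shows up, the proxy is
# telling the far end who you are. The judge output is faked here.
judge_output='REMOTE_ADDR = 5.6.7.8
HTTP_X_FORWARDED_FOR = 1.2.3.4'

verdict=$(echo "$judge_output" | grep -Eqi 'X[-_]FORWARDED[-_]FOR' && echo leaky || echo clean)
echo "verdict: $verdict"
```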
Another milestone. Well, whoop dee fucking doo!
It's going to be a long march to the second million, but we're well under way, boys and girls!
Since most of the proxy harvest happens over Google these days, I added a URL tracking table to the proxy database. A lot of URLs get hit over and over again. I don't really care about proxy lists from 2003 (those were the days), but there are hundreds of those out there. 99.9% of the address/ports are already in the database, so scouring those old lists is a simple waste of resources.
This table puts an end to that. It has just three columns: the URL, the SHA-1 hash of the URL, and a count of how many times the URL has been seen.
This helps a lot for harvesting proxy forums with long histories (clean that old shit up, guys!). It doesn't help so much when the current list is at the top level of the site (such as, for example, "http://www.niceproxy.com" - which is a parked domain so don't bother with it).
As those pile up in the table I can chop them out and do some dedicated runs.
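The mechanics are simple enough that a flat file can stand in for the real table in a sketch; the hash-then-count layout and all the file names here are illustrative, not the project's actual schema:

```shell
# Sketch of the URL-tracking idea with a flat file standing in for the
# real database table: one row per URL, keyed on its SHA-1 hash, with a
# hit counter. File name and layout are made up for illustration.
SEEN=/tmp/seen_urls.txt
: > "$SEEN"

track_url() {
  url="$1"
  hash=$(printf '%s' "$url" | sha1sum | awk '{print $1}')
  if grep -q "^$hash " "$SEEN"; then
    # seen before: bump the counter and skip the re-scrape
    awk -v h="$hash" '{ if ($1 == h) $3++; print }' "$SEEN" > "$SEEN.tmp" && mv "$SEEN.tmp" "$SEEN"
    return 1
  fi
  echo "$hash $url 1" >> "$SEEN"
}

track_url "http://example.com/old-proxy-list.txt" && echo "new URL, worth scraping"
track_url "http://example.com/old-proxy-list.txt" || echo "already seen, skipping"
```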
The first time through on this I finally discovered why the ".net" TLD (top level domain) was always the most fruitful. I found a cgi proxy page that is updated hourly! No graphics, no crap, just an ASCII list of addresses and ports!
Nice. That site is now in the daily 4AM schedule.
Remember Bahrain? Thousands and thousands of open proxies?
Those were the days, boys and girls! Alas, those days are long gone.
It turns out they're seriously pwn3d.
See the update on my blog for more.
That took a little more time than I thought it would but it was worth the wait.
Catching up on the hourly runs right now. The next page update will be at noon EST.
I came down with a bad case of "Upgrade Fever" and did a couple of other VMs concurrently. I have an ESXi server at work that I tunnel into over ssh. That had an Ubuntu 8.10 VM on it, so I upgraded that.
I also have VMware Player on my laptop, with an 8.04 Ubuntu VM. That one glitched on the 8.10 upgrade. It came up without a keyboard or mouse. Not very useful that way. But I ssh'd into it and twiddled the xorg.conf file. For some reason the upgrade commented out the keyboard & mouse section!
I was lucky I was able to ssh into it because the old 8.04 image always booted with networking screwed up, but apparently 8.10 took care of that problem. That upgrade has a way to go yet and I might just wipe and copy over the 9.04 test VM I installed yesterday.
Anyway, Hinky's back in business.
Turns out gocr has been abandoned. I need that one for the project. I had to keep a lot of extra crud on the system in order to keep gocr around.
Why is that? They should let you pick & choose what you want to keep around, not just make it an all or nothing situation.
8.10 is better than 8.04 was, but the snapshot has been made and now we crank it up to 9.04!
After writing that diatribe yesterday I installed a 9.04 VM to test it out.
Very nice. But VNC4 is still broken after... what... three years?
It's OK, though. I've pretty much learned to live without it.
It took two attempts. I keep trying to make a VM with JFS (IBM's Journaling File System) but I've never been able to make that happen on a virtual machine. VMware just doesn't like JFS for some reason. Installation went to 53% and the VM choked. Just DIED.
It was ugly. I had to reboot the host in order to kill the VM.
Second time through I bit the bullet and used ext3 instead.
It ran slicker'n shit.
I was so impressed I upgraded my Mythbuntu box to 9.04 and I'm halfway there on upgrading the proxy project platform's VM.
Using snapshots judiciously, of course.
I was contemplating using the first 9.04 VM as the platform but I really didn't want to go through that migration shit all over again. Hence, the upgrade. I'm doing it in two steps, 8.04 to 8.10, then 8.10 to 9.04. It worked fine on the MythTV box doing it that way. The biggest issue is all the Ubuntu servers have been getting hammered ever since 9.04 came out last week.
We'll be down three or four hours today while the upgrade churns.
Don't get me wrong. The Xubuntu 8.04 VM platform I built for the proxy project is running just fine. The scripts, the database, all run smooth as silk.
But as a desktop platform, it's a complete wash.
Desktop performance is simply nonexistent. This has been a real disappointment, but, like I said, everything else runs fine if you don't mind doing all your dirty work in a Bash shell (which I don't).
I've done some research in the Ubuntu forums and I know I'm not alone in this. The other poor schmucks who are facing this issue are getting the same old tired advice from the Linux "experts" that they've been using for the last fifteen years: "Well, you must have done something wrong."
That doesn't cut it anymore. Guys, it's possible your beloved OS could have flaws (personally, I think it's a Gnome problem).
Anyway, as I type this I have installed Ubuntu 9.04 in a VM on the same machine (with the crappy Xubu 8.04 VM running simultaneously) and it is a much better "user experience". If Xubu 9.04 comes out in a reasonable amount of time I may run the upgrade. I took the wife's (Pinky Dink) 6.06 system up to 8.04 with no problems at all. I will do a snapshot first!
Bye, bye, Macau! It was nice knowin' ya.
Upon further hacking around with the Macau proxies I noticed they were extremely short-lived and very picky about what pages they'd serve up, so I did a special recheck on all of them.
They were all dead. Every last one.
I did this after putting the new server VM up. It is working well, although I missed the first run because my ftp settings were screwed up.
I have put up a short blog post about the transition here.
The old VM has been shut down.
May it rest in peace.
As it happens, all those Macau proxies work (that is, all that I have checked so far), but the trick is they send you to another IP (22.214.171.124).
I think at this point I'd attribute it to a clueless (or devious) ISP. Since they're all transparent they don't do much to hide your identity and given that they all go to the same IP, that address is likely to get blocked sooner or later by proxy-hostile sites. As always, use with caution.
Work continues on the VM move. The database has been moved over and I'm working on the scripts. I ran into a side issue of the GeoIP scripts (hacks I threw together before I took time to learn the API - which is actually quite simple). I need to clean that up, but at the moment it seems more trouble than it's worth. I want to get this thing in production before the database gets too stale.
Yes, I'm still alive. And I'm still working on this mess.
I took a little nappy on the couch tonight. Woke up in the wee hours of the morning, and checked the list.
Two pages of proxies from 126.96.36.199/19 came from nowhere (also known as Macau).
Do they work for you? They sure don't work for me. All those addresses seem to have been NULL routed since they were discovered. That is, packets go out but they don't come back. I've tried tracerouting the IPs but I get stuck in a router loop after ten hops, when the packets hit ctm.net (CTM Internet Services, according to the whois record), the people who own the IPs.
This is very reminiscent of last year's Bahrain Incident.
There's definitely some sort of problem going on with CTM Internet Services, but whether they've been hacked or they're new at the ISP business is anyone's guess right now.
However, I've seen this coming. Proxies from Macau ("MO") started showing up a couple of weeks ago. They screwed up the list because I didn't have a flag for "MO". As soon as I fixed that, more and more (MO and MO?) started to show up, culminating in today's flood and NULL route.
I'm thinking Conficker, since the time frame is right, but it could be a coincidence.
In other news, I'm working on moving the project to another (virtual) server. I finally hit a wall with Xubuntu 7.04 (Feisty Fawn) and got stuck in the Land of Non-Support. Right now everything but the database has been moved over. This weekend looks good for a migration.
Wish me luck.
This morning I thought I'd fire up the Google Hack and search for some new proxies.
It had occurred to me that I had never done a query on "inurl:proxy.txt", so I gave it a shot.
Every single result came back with a "This site may harm your computer" warning!
That wouldn't be unusual for proxy sites and since this was the first time I had ever used this query, I thought I had stumbled across a gold mine.
It turns out, Google was simply fucked and this was happening to everyone, everywhere. Even Page Creator, where I jot these notes down, was huffed.
It lasted for about 15 minutes, max.
In that time, I figured out how to filter out the malware site warnings. I also found a proxy.txt file with thousands of proxies in it.
That was a bit of a long haul. We first hit a million proxies last August, five months and a couple of weeks after the project started. Now, five months later we added another half a million.
Obviously the rate of discovery has dropped by half.
This is not a big surprise considering that ~800,000 proxies were found in a single file back in July.
In other news, some clown with a residential DSL account in Sweden recently whined to my ISP that I was using his "Web server" as a proxy.
The extent of this "use" was checking the address with one of the public proxy judges I use. This box had been an open proxy since last June.
If anyone needed to get in hot water with their ISP, it was that guy, not me. The guy's running an open proxy, for crying out loud! I ran a Google check on the IP and found it listed at antichat.ru (it's been off my list since the 25th - the asshole probably finally figured out how to use a firewall).
I would think he'd be more surprised to find out someone was using his open proxy as a Web server.
I blackholed his IP so it will never get hit by a resurrection run again, but if you're interested, here it is:
The proxy was on port 80. The "Web site" is a joke. Check this link and you'll probably find your own IP address on the "offenders" list.
Just checking in. Things have been running swimmingly. No issues to speak of.
I have been running the proxy recheck script whenever the total number goes over 900. This generally reduces the size of the list by two thirds. The survivors are usually solid performers.
New proxies are popping up every day. I haven't had to bother with the resurrection script in over a month. It's always been boom or bust so I think I'll save the resurrection script for the inevitable rainy days ahead.
I have been impressed lately with the quality of Korean proxies. They are very fast these days. Five years ago, the APAC (Asia/Pacific) countries had crappy bandwidth. Now, they're among the best.
But as usual they'll hang around for a few days and then die off, just like all the rest.
Right now I'm using a Vietnamese proxy. Excellent performer and it's an October 2008 vintage, so it's been up for quite some time. There are a few gems to be found in the last pages of the list and this is definitely one of them.
Watch out for Bulgarian (BG) proxies. They are speed freaks as well. Unfortunately "cybercrime" is hot in Bulgaria so as usual, exercise caution!
Early this morning all of the updates just STOPPED.
The page was getting refreshed on schedule but nothing new was getting added. When I got home after slaving away all day in the salt mines, the VM that runs the project was choking on its own puke. It was starved for memory and unresponsive. I had to hit the Virtual Big Red Button to get it back. I rebooted it and took a nap.
When I woke up it was doing it again but it wasn't dead yet.
I killed a lot of garbage database processes. More popped up so I killed those. Then more, until it went back to normal. Then I ran the process manually to see WTF was going on.
As it happens, my ISP has decided to be "helpful". They have re-hacked their DNS servers to return their own search page whenever a DNS lookup SERVFAILs. This makes my scripts go nuts.
I have one or two Web sites that disappeared after the shutdown of ESTdomains in November. I keep them in the mix because I'm hoping against hope they'll be back again some day. When the script runs across them it expects to time out, not get a "helpful" search page. Since it doesn't time out, it chews on the nonsense from the search page.
The database never got updated (nothing there anyway), process upon process went into forever-loops, and eventually killed the whole system.
Anyway, that's all fixed now. The 10PM run should have a lot of new proxies, and I'm seriously considering running my own damned DNS server.
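The lookup hijacking itself is trivial to test for. A hedged sketch: resolve a name that cannot legitimately exist (the `.invalid` TLD is reserved for exactly this) and see whether the resolver answers anyway:

```shell
# Sketch: detect a "helpful" ISP resolver. A lookup in the reserved
# .invalid TLD can never legitimately succeed, so any answer means the
# resolver is substituting its own page for SERVFAIL/NXDOMAIN.
bogus="no-such-host-$$.example.invalid"
if getent hosts "$bogus" > /dev/null 2>&1; then
  status=hijacked
else
  status=honest
fi
echo "resolver is $status"
```

On a sane resolver this prints "resolver is honest"; behind a redirecting ISP resolver it won't.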
The last few times the system has hung, I noticed a trend. Each time, without fail, there was a pop-up balloon noting that the wireless network had reconnected (the system is on the wired network, but uses a "secure" ad hoc 802.11b "point-to-point" network to route the wired network to a wireless camera).
This wireless NIC had a Marvell-based chip. I have several of these. I hate them all because they are proprietary and don't work worth a damn with Linux. Apparently this is yet another reason to despise them.
I pulled it and replaced it with a RaLink RT61 based card. If you want to run Linux wirelessly, RaLink is the only way to fly. It's been fully supported in the Linux kernel for a few years now and the drivers are in active development. You never need to mess with that god-awful ndiswrapper abortion (don't get me wrong, ndiswrapper is a very slick hack... it just shouldn't exist). Unfortunately, RaLink cards are hard to find. I've been burned twice by "errors in photography" where the box or the online illustration clearly shows a RaLink chip on the card, but when you open the box the damned thing has a Marvell chip.
It's been running all week without a hitch. I'll give it another week and if all goes well I'll start un-doing my previous attempts at "fixing" the problem, especially that extra gigabyte of RAM I removed a few months back.
In other news, CoDeeN servers continue to disappear. There are now only 34 active servers left in the database.
When I moved the CoDeeN proxies to a standalone text file, there were about 300 total.
Today, there are 50! FIFTY!
I thought perhaps it was something I did, so I ran a resurrection on them all. They're pretty easy to identify in the database even when they're down because most of the DNS names have either "planet" or "lab" (or both) in them. Sure enough, they're showing up as CLOSED, meaning the address is definitely there but nothing's listening.
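The name heuristic is about a one-liner. A sketch, with made-up hostnames (except the pattern itself):

```shell
# Sketch: spot likely CoDeeN/PlanetLab nodes by DNS name; most contain
# "planet" or "lab" (or both). Hostnames below are examples.
cat > /tmp/hosts.txt <<'EOF'
planetlab2.olsztyn.rd.tp.pl
cache.someisp.example
lab3.cs.example.edu
dsl-pool-44.example.net
EOF
codeen=$(grep -Ei 'planet|lab' /tmp/hosts.txt)
echo "$codeen"
```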
It could be they're cracking down on abusers (recall my problem with my ISP and the Polish CoDeeN operator from a few weeks back).
Whatever the reason, they're going fast.
Back before the Google Hack became my main modus operandi, I raided the more popular proxy lists. I still raid the best ones every night at 4AM. And they still have mostly crap, but I pick up 10-20 new proxies from them every night.
Today I woke up and the damned system had hung at 2:48AM (this is still driving me nuts). So, I did a manual 4AM run. In the process I discovered one of my scripts wasn't working anymore.
There is one obfuscation technique I've never been able to hack around, and here, for the first time, it is revealed:
Luckily, only one of the Proxy List Boys uses it, and his list is useless. Utter CRAP. But it's impossible to scrape with a shell script.
At least for me.
If it ever catches on with the listers (and it won't), it'll put me out of business.
TCP port 9090, signature port for the tinyproxy.exe, has risen to the number five slot for verified proxies (number ten if you look at all ~1.4M in the database). It will take over port 3128 for the number four spot if Facebook users keep getting pwned at the present rate.
Personally, I don't use them. The reason for that is they're all in US, GB, and CA domains, which I normally avoid (US because I live there, the others because of treaties, LEA cooperation, etc). Almost without exception they're botnet nodes and I'd rather not piss those people off either.
If you're braver than I, give them a shot because they're mostly DSL and cable accounts that are almost guaranteed to be fast. Get 'em while you can because by next Patch Tuesday they'll be in Microsoft's "malicious software" gunsights, if they're not already.
It turns out that was a form letter from the ISP. They didn't "perform a scan". They had a complaint.
They included five lines from a log. The time zone was CET (Central European Time). Each line was a GET request to one of my proxy judges.
This fellow is obviously running a proxy. If I knew which one I'd stop checking it to get him off my case. However, I can't trace it since I don't keep a history of re-checks.
The five log entries are sequential, so I have to hope I have a backup close to the most recent entry (I probably do) if I want to get him off my back.
I suppose the best way to do that would be to complain to his ISP or host provider that he's running an open proxy.
Until I can ferret him out I'm stopping all rescans. The list may get a little stale.
UPDATE 12:30PM EST
I pulled the backup from the 22nd and queried for the right proxy judge at the right time and found nothing. The closest I can get is a request six minutes earlier than the other log showed, but it's the right country in the right timezone and the right proxy judge.
And I'll be god damned if it isn't a FUCKING CoDeeN server! That is hilarious. Here it is:
188.8.131.52:3127 a.k.a. planetlab2.olsztyn.rd.tp.pl
FUCK THEM! Run a public proxy network and bitch and moan when people use it? Get serious!
I knew this day would come. Nine months and 1.3 million proxies later, my ISP has finally noticed that Something Is Going On.
I used to worry about this more, but after I hit the one million mark I didn't think it was such a big deal. After all, more than 99.9% of the connections my system makes during the discovery and retesting phases time out. No data gets transferred at all and in the rare case a proxy is alive a grand total of one lousy proxy judge page is downloaded.
Since they're not all that bright, they are accusing me of having a virus. This, as a result of a "network scan", whatever that is supposed to mean to them.
To me, it means a search for open ports, usually done with nmap or some similar tool. I do have open ports. I couldn't host three UT99 servers without open ports. I have a smattering of minimal Web sites on port 80, mostly DNS placeholders with very little content. I run SSH and OpenVPN servers. So, yeah, I have open ports.
Open ports are not indicative of "having a virus". But again, their definition of "network scan" may mean something completely different from the normal definition.
I suppose if there had been an abuse complaint, they would have said as much.
Since this email reads suspiciously like a form letter, it could be anything.
Anyway, I wrote them back and responded to all their suggestions (install a firewall, run antivirus, disable "Sharing for Microsoft Networks", blah, blah, blah) and asked them if they had any further questions.
No response yet. Stay tuned.
I split off the USA-based CoDeeN servers into a separate file.
I'll admit I did this mostly for my own benefit.
I found it odd that the number of US servers was less than half the total count (42.5% at the present time). For some reason I expected a bigger chunk.
I suppose the next step would be to split off a file with non-USA servers. It would only take a couple of minutes but I'm feeling lazy today.
Changes applied. Page rewritten. CoDeeN purged.
"Undefined" is gone, due to the new junk filter. This does not mean the junk is gone for good. There is still one particularly nasty piece of junk to catch: "proxies" that mimic proxy judges. You will know them when you see them. It's very difficult to tell whether a "proxy" has returned your judge page or its own judge page, which is the only thing it serves.
This is very popular in Japan, for some reason. China seems to be jumping on the bandwagon as well. I think there is a simple way out - request two pages instead of one: the judge and (say) Google's home page. The downside is that will double the amount of time required for testing and verification.
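The two-request check could look something like this; fetching is stubbed out with strings (a real run would wget both pages through the proxy), and the judge-page heuristic is just a guess at what the real filter would key on:

```shell
# Sketch of the two-page junk test: ask the proxy for the judge page AND
# an unrelated page. A fake "proxy" that only serves its own judge page
# returns judge-looking content for both. Fetching is stubbed here.
looks_like_judge() {
  # crude heuristic: judge pages echo back CGI request variables
  echo "$1" | grep -Eqi 'REMOTE_ADDR|HTTP_VIA|X_FORWARDED_FOR'
}

check_proxy() {
  judge_body="$1"   # response to the proxy-judge request
  other_body="$2"   # response to the unrelated request (e.g. Google)
  if looks_like_judge "$judge_body" && looks_like_judge "$other_body"; then
    echo "JUNK"
  else
    echo "OK"
  fi
}

real=$(check_proxy "REMOTE_ADDR = 1.2.3.4" "<html>Google</html>")
fake=$(check_proxy "REMOTE_ADDR = 1.2.3.4" "REMOTE_ADDR = 1.2.3.4")
echo "real proxy: $real, mimic: $fake"
```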
Be that as it may, Mr. Hinky Dink still has the highest percentage of active proxies of any list anywhere, junk or no junk!
360+ CoDeeNs have been reclaimed and the file is on the server. The page doesn't reflect this at the moment and the servers are still in the Main List. I plan to take them out of the list and keep them stashed away in the text file (updated and tested, of course). The CoDeeN file will be updated every other hour, just like the Main List. It's randomized each time, so don't depend on a hash to detect changes. It's a very static list, but some servers may drop in/out over time.
Speed, country of origin, and all the ancillary data is not in the text file. That is not the point anyway.
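For the curious, the file really is just address:port pairs, reshuffled on each regeneration; something like the following sketch (addresses and file names are placeholders):

```shell
# Sketch: regenerate the CoDeeN text file as bare address:port lines,
# shuffled each time so consumers can't diff it by hash. Addresses and
# file names are placeholders.
cat > /tmp/codeen_raw.txt <<'EOF'
128.112.139.71:3128
128.112.139.97:3127
141.213.4.201:3128
EOF
shuf /tmp/codeen_raw.txt > /tmp/codeen.txt
wc -l < /tmp/codeen.txt
```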
Remember, the main idea is using it with the SwitchProxy tool for Firefox, but if you have other uses (like starting a proxy list with servers that actually work), then go for it. Don't do something silly like uploading the list to a proxy forum because they don't generally like CoDeeN proxies (in fact they despise them) and the 312x ports are a dead giveaway.
I had finished rewriting the code and was starting to get the CoDeeNs back when apparently the power blinked at home. Since I have my cable modem, switch, both UT servers, and the domain controller on uninterruptible power supplies, the connection stayed up.
Of course, none of the boxes involved in this project were protected. Maybe Santa will send me another UPS for Christmas.
There won't be another run today until 6PM.
At least it waited until I finished coding.
No big surprise there.
The junk filter worked flawlessly. However, I never intended it to take out the CoDeeN proxies. Some would say that's no great loss because they are, in fact, junk. But I've grown somewhat fond of them, so they will be back, but not in the main list.
I have been using the SwitchProxy Tool for Firefox for quite some time. It's very handy for testing proxies, although it does some silly things now and then (for instance, when you select "None" it clears whatever settings you originally had in the browser), but one of its main features is it lets you use a text-based list of addresses and ports that it will cycle through either sequentially or randomly.
This is not very useful for testing, but if you have a big list of known good proxies it works very well. The problem is getting that big list in the first place. The CoDeeN list works great for this since there are so many of them and they're all - with some exceptions - "fast enough".
So, I'm going to split off the CoDeeNs and make them available on the left side menu as a text link. You can then add this link to SwitchProxy and browse through multiple CoDeeN servers.
From the SwitchProxy toolbar, select Add->Anonymous->Next and you'll see the interface. Just plop in the link, decide how often you want to switch, and you're ready to rock'n'roll. I haven't decided on a name for the link yet, but it will probably be:
Original, no? Don't get excited because it's not there yet. I have to resurrect them from the database first (since they got junked by the junk filter) and hack the code around.
The proxy count is going down drastically, but when the dust clears the list will be much more dependable.
I've been fighting junk for months but an elegant solution finally presented itself to me.
Earlier this week, everything went dark. Even the Japanese list I've been hitting since the beginning of this project back in March, which was good for at the very least a half dozen new proxies a day, was blank. BLANK! NOTHING!
And the Russians went on holiday. At least they were kind enough to say as much on their blog (what would we do without Google Translate?).
Even the 4AM run, when I hit the listers I despise so much, was weak (weaker than usual, that is).
But slowly everything came back to normal. The Japanese got their game on and the Rooskies came home tan and refreshed. The proxies started coming back in, only a trickle at first but back to Full Tilt Boogie by Friday.
Work has been a bitch, so I've had to let the Proxy Business slide a little myself. We are in the throes of a Web Migration. After spending about a quarter million a year on Web Hosting for the past five years, the Boys in Mahogany Row decided it was time to cut their losses and bring the servers home.
This is turning into a huge fiasco, although the technical side has gone surprisingly well (so far). It seems we spent all that money on a slew of Web sites that aren't getting any traffic at all. It is glaringly obvious that the Webbies have been lying about how well the sites were doing (as they must - it's part of their "Performance Measures" to make certain traffic increases). Rolling heads may be seen in the near future, but most of them have been re-orged into positions that will probably be eliminated in the near future anyway.
I get to monitor the IDS on these things, so I have a pretty good view of the traffic they pull. From a security perspective, it's a good thing no one uses our servers. They're just not worthwhile targets. Nobody cares enough to hack them, although the way they're configured they could be pwned at the drop of a hat.
Sometimes it keeps me awake at night.
It's the day after Patch Tuesday, and I swear I shut Automatic Downloads off, but the server went down and hasn't come back up yet.
I'm at work now and slightly blind, but I can tell it was a controlled shutdown because the PuTTY shell I had open declared it so before it died.
I saw a single report about continuous rebooting after yesterday's patches and I'm hoping that's not what happened.
I knew I was going to regret complaining that this project was getting boring.
I've been in class all week. A worthless CompTIA Security+ class our CSO forced us all to take. 100% Windows-centric.
I learned nothing new and reinforced my belief that security "professionals" are know-nothing blowhards and that those who can't, teach (and we all know those who can't teach, manage). The only thing I got out of the class was three licensed Windows 2003 Server VMs (I copied them over the Net during class and converted them from VirtualPC to VMWare in the evening). Not sure what I'll do with them, but I have them nonetheless.
Although I had all my remote tools, I only ran a few purge/rescan cycles and the system took care of itself for the duration. It is so dependable it's getting boring. I need a new project (and yes I haven't forgotten the SOCKS issue), something to make this new and exciting again. I'm seriously thinking of moving it all over to the AMD64x2 system, which is faster, quieter, and sucks much less power than this aging P4 monster. Unfortunately, the AMD box is my MythTV project, which is almost ready to go into production mode.
Meanwhile, I'm still eating my own dog food. I found a nice little TurkTeleKom transparent proxy that's been alive for a few weeks now. Turkey has never let me down. Their proxies are always fast (enough) and they tolerate you for a long time. You definitely need a Turkish translator to decode the proxy error messages. Here's one that's a real head-scratcher...
The strangest sites are banned for no apparent reason. For instance, I often like to badger - via proxy of course - a harmless geek (and former co-worker) who runs an "I'm so cool" .Net development blog - the guy is a complete nobody but the Turks have banned his hosting provider. Other sites that are normally banned in, say, Saudi Arabia, are fine with the Turks. It makes no sense.
If you were paying attention on the 3rd & 4th you may have noticed a slew of transparent German proxies popped up. They were all out of Frankfurt am Main and most had ".11" in the last octet. Some had proxies on multiple ports on the same address. What was that all about? They came from this German ISP and disappeared as quickly as they showed up.
I love a good mystery!
CS-1 rose from the dead on Wednesday.
I didn't notice until just moments ago. For the past week I've been putting most of my efforts in refining the Google Hack using CoDeeN proxies.
Turns out, they're wise to Google harvesting.
You can only get ~500-1000 search results from any one CoDeeN server. That was hardly enough for my traditional method, which basically just searched for port numbers.
CoDeeN's restrictions taught me to maximize my results by subtracting certain search terms, like "-guestbook" and "-mp3" and even "-SOCKS". You can get completely different results with the same ports and different "minus" terms.
I don't know why that never occurred to me before, but it has been an excellent learning opportunity.
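The minus-term trick is easy to automate; this sketch just builds the query strings (the base query and the exclusion list are examples, and the actual fetch is left as a comment):

```shell
# Sketch: spin one base port query into several distinct Google queries
# by subtracting different terms. Base query and terms are examples.
base='":8080" ":3128"'
queries=$(for minus in guestbook mp3 SOCKS; do
  echo "$base -$minus"
done)
echo "$queries"
# a real run would URL-encode each query and fetch the results page
```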
While I was learning all these wonderful things, Google lifted my ban, so I applied all this new found logic to the original hack.
The result? Thousands of new (DEAD) proxies and a smattering of active ones.
So the list goes on. I have backed off on the purge to keep the numbers up, but there is still a high percentage of good proxies in there.
As if things weren't bad enough with all my big sources going dark, Google has finally got my number on the Google Hack.
For three months now I've been doing Google searches like...
:80 :8080 :3128
... getting a thousand pages, and hitting them all.
Now, and I kid you not, boys and girls, I can't even do a search on anything without getting the "We're Sorry" page.
Clear the cookies and... same thing.
They've definitely got my number!
And I've only had this IP address for less than a week (my old one, which I had for months, was knocked out by Hurricane Ike last week).
I could change the IP any time, but it's a hassle. Lots of DNS changes have to be made every time the IP changes and I'm not a fast flux site by any means - I'm one of the GOOD GUYS! So that's out, but I still have sleeves, the requisite tricks, and 350+ CoDeeN proxies in the database. Plus we all know Google is not the only search engine on the Internets. Hear that Schmidt?
The Dink is down, but not out.
Three or four days ago I noticed "Curious Site" dried up. I didn't do much about it because, well, I bought a new laptop and I've been fucking around with it.
As it turns out, on September 8th, someone spilled the beans in a thread at anitchat.ru, a Russian message board with a relatively worthless proxy forum. Now there are no more proxies to be had. The link at CS-1 is still there, but there's nothing in it.
That kills all three of my megasources. Soon the list is going to degenerate to a few hundred proxies (mostly CoDeeN). So it looks like I'm back to Google Hacking and List Raiding. Since the Google Hack was the source of these sites, I'm going to refine my method. I'm already getting some "interesting" hits. Check out the domain name on this Russian site (click for a larger view):
Obviously a "fast flux" site.
Sunday the 14th, remnants of hurricane Ike rolled through my state and knocked out power to over 3,000,000 people, including yours truly.
This is why the page has been static since then. When the power comes back up the updates will begin again. There is no word when that will be. The local power company says it may take up to seven days.
I have finally started clearing the junk out. For example, since the beginning there have been about 20-30 Japanese entries in the list that were garbage. They're finally gone.
I also learned a lesson about wget that didn't directly affect the list. Under certain circumstances, if you get, say, a "403 Access Denied" response, wget will not store the page you would normally see in your browser. This only affected the "Timeout" servers, but there is more junk to be found if there is a 302 or 304 redirect.
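One workaround is to have wget log the response headers with --server-response and let the junk check look at the status line instead of the saved body. A sketch (the wget invocation with "$url" is illustrative, and headers.log below is a canned stand-in for real wget output):

```shell
#!/bin/sh
# Sketch of a junk check: log the response headers with
#   wget --server-response -O page.html "$url" 2> headers.log
# then flag anything whose status line says 302, 304, or 403.
cat > headers.log <<'EOF'
  HTTP/1.1 302 Found
  Location: http://example.invalid/ad.php
EOF
if grep -Eq 'HTTP/1\.[01] (30[24]|403)' headers.log; then
  echo junk
fi
```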
I exported all the non-CoDeeN proxies and used SwitchProxy, a FireFox plug-in, to check the junk factor. There's still a fair amount in there, but the next purge should take care of most of it.
It seems that Interesting Sites 1 and 2 are gone for good. No more 75,000+ proxy imports. I'm glad I got those when I could. Curious Site is still supplying proxies, and of course I still hit the other lists every night (but they have nothing). I'm running the Google Hack on and off but not getting much live data. I'm going to keep hitting it because that's where the Interesting Sites came from in the first place. Somewhere, there's an IS-3 out there.
I had a few hiccups at first. The page size was 40 instead of 75, which made for a 32-page list. I changed that and then noticed that version of the page code was still set up for 30-second timeouts, so I got some negative speed ratings.
In between I noticed I was doing the proxy count for every page, not just the first page (it takes a while for 1,000,000+ proxies), but by the third time it was back to normal. Now it's running swimmingly.
In this version, the live proxies found during the discovery cycle are moved into the gold database with a "Type" of "PENDING" as soon as they are found. Between page runs, when "Type" is changed to "Transparent", "Anonymous", etc., you (well, I - not you) can run a query on the gold database to see what's coming up for the next page run.
In the middle of hacking away at this code yesterday, both IS-1 and IS-2 went dark. IS-2 and its companion, "Curious Site 1" (CS-1), went SERVFAIL dark. That is, its host name was simply gone from DNS.
IS-2 and CS-1 finally came back online in the evening, much to my relief. IS-2 had moved to Frankfurt. I'm not sure what happened with CS-1, but it came up as well. That's compelling evidence that they're in cahoots.
IS-1 simply changed. Previously, there was no default page in the root, but they allowed folder browsing (a stupid thing, but that's how I found them in the first place). My code was using that "feature" to get the timestamp on the proxy file. Consequently, it didn't run right. I only noticed when I checked the page out in a browser (they're into some shady "PayDay Loan" scam now). But a new file was there in the same place and it had over 75,000 proxies in it.
If they ever change the file name, I'm screwed.
Now, if you've been following the Atrivo/EstHost shitstorm that has been going down in the last week, none of this is a real surprise. I'm certain a lot of shady Web sites were motivated to get out of Dodge but it surprises me that IS-2 moved to Germany, where they have some pretty serious anti-hacking laws. With that in mind, they may be moving again any day now.
There is no doubt in my mind that all three of these sites are up to no good. However, I don't work for an LEA (Law Enforcement Agency). I'm only in it for the research.
I spent all day hacking at the new page refresh code.
It's going to be a winner.
I have one more page to make the old-fashioned way and then I can switch over.
I did all my development on an old, reliable Ubuntu 6.06 LTS VM. Since I usually develop on the AMD64x2, which uses special credentials on the production database, I had to make sure I didn't screw it up. I edited all the scripts to point to the VM's copy of the database (it's 10 days old) but just to be sure I didn't miss anything I added some firewall rules to prevent the VM from talking to the production database.
And sure enough I didn't get them all. In fact what happened was I had a copy of everything in my /home folder, but I sudo'd into root without realizing I wasn't in root's folder. I also neglected to give the VM a decent amount of memory and left the query limits at the level of the AMD64x2.
I've never seen a session crash quite like that. The OS killed all my processes including the root sessions when it maxed out.
I got the resource issue squared away and removed all the script copies in my /home folder and hammered it out. There were a few hair-pulling bugs but by the third or fourth run of the page code it ran slicker'n shit.
The last old-fashioned run just wrapped up. The 4PM run will be on the new code.
We now have 1,089,613 proxies in the database, which is astounding considering two days ago there were 890,000.
That's slightly less than 100,000 proxies per day for the last two days, or about 4,000+ per hour. And as usual most of them were dead, but there were enough live ones to slow down publishing the list.
In fact there were only two page refreshes yesterday since it was taking so long to go through them all.
But we're all caught up now and back on the usual publishing schedule.
Meanwhile I'm working on the page refresh to make it faster so days like yesterday don't happen again. Proxies come and go so fast you need the freshest possible data. A twice-a-day refresh is not going to cut it.
Instead of moving the newly found proxies from the main table into the "gold" table before every page run, I will be putting them in both the main and "gold" table. This way I can run a modified version of the resurrection code on the new proxies, which will run much faster than the old sequential method.
For now we're on the old method because things have died down a bit. The file at IS-1 stopped refreshing last night. If it starts back up you will notice some page delays if I don't get this code working today.
We hit it, but the page doesn't show it yet.
I did a very ugly thing to compensate for GeoLite City's tendency to do nothing with an address it can't find. I ran an SQL statement on the entire database to fix the blank data. These days it's taking a long, long time.
It was a cheap hack, what can I say? In the end it took less time to alter the GeoCity code.
So I rewrote the test-geo-city.c program that comes with the binary version of their database to spit out the values I want. One more "clean" of the database and I can stop doing it.
Great, but right now it's almost 4PM and the 2PM run hasn't finished yet.
Also, I got a call from GoDaddy and they're moving me to another server. There may be some disruption in service.
The IS-1 suck is awesome. A few bugs to work out but it's running fine. It appears they reset the file every now and then so I have to hack around that.
I plan to rewrite the main page on the Web site to reflect the fact that most of the data no longer comes from proxy lists. The majority of all proxies in the database came from the Google Hack and the "Interesting Sites" found using it.
I am convinced now more than ever that all online proxy lists, with the exception of the Dinkster's, are PURE CRAP. They have nothing on me. I am the Proxy King.
I woke up this morning to find a new file on IS-1. I downloaded it and started banging on it.
An hour later I refreshed the page and the same file's timestamp had changed. I never noticed this before so I'm starting to wonder whether it hasn't done this all along. If so, this site has the richest supply of proxies on the Internet.
I'm at the limit of my processing power importing three files simultaneously on the AMD64x2 box, so I may have to enlist another VM if the file updates again today. Or I can just start stockpiling data and catch-as-catch-can.
-= UPDATE 12:00PM =-
I have implemented a check once every 15 minutes on this file and it appears it is refreshed every 30 minutes, like clockwork. It's not a new file, but an update. The file always has about 250,000 proxies so I'll need to hack out a diff to make this manageable.
-= UPDATE 1:15PM =-
I hacked out the diff. Using - surprise - diff!
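For what it's worth, comm(1) against sorted snapshots does the same job with less output to parse than diff: it emits only the lines unique to the newer file. A sketch with toy two-line snapshots standing in for the real 250,000-line files:

```shell
#!/bin/sh
# Sketch: extract only the proxies added since the last 30-minute
# snapshot. comm needs sorted input; -13 suppresses lines unique to
# old.txt and lines common to both, leaving only the new entries.
printf '1.2.3.4:80\n5.6.7.8:3128\n' | sort > old.txt
printf '1.2.3.4:80\n5.6.7.8:3128\n9.9.9.9:1080\n' | sort > new.txt
comm -13 old.txt new.txt
```

Only the new entries then have to go through the import, instead of the whole quarter-million-line file.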
This site just may max out my processing capabilities. Right now the page says we have 995,000 proxies, but we've probably already gone over a million.
The page updates are taking almost an hour with the extra data. The twelve o'clock run didn't make it to the server until 12:46. I may have to look at that code. It checks the new proxies sequentially and with a 45 second timeout that can slow things down considerably. There must be some multitasking opportunities in there somewhere.
While I was playing catch-up with IS-1 (Interesting Site #1), they sent another update. I'm cranking on it right now. The One Million Mark is in sight and approaching faster than I thought.
This project is, in fact, turning into IS-1's proxy list. That is, if they had a proxy list (they seem to be in the SPAM business). I'm not really keeping track, but at least two thirds of the database came from that site.
Thanks guys, whoever you are. I don't agree with what you do but your data is primo!
In other news, I got my first Unknownian site. Turns out it's in Argentina in a /15 CIDR block.
The system is now stable. No surprises in the morning, no sudden lockups. Life is good.
I have been playing around with a new version of nmap that is very slick. You can read about it here if you are so inclined. You'll have to compile it yourself if you want to check it out, but it's worth the effort (you'll need subversion).
I woke up this morning and decided to check up on Interesting Site I (IS-1). Sure enough, there was a new file, dated today!
I downloaded it and let the AMD64x2 have at it.
It's still running.
So far it's added ~50,000 proxies to the database (with ~200 good ones). Even proxies with "weird" ports are turning out to be OK, so I may revisit my decision not to add the other 400,000 proxies in the other files from IS-1. Plus there are a lot of port 1080 systems in there, so that could be more grist for the SOCKS mill.
If I decide to do it, we should hit the million mark by the weekend or early next week.
Yesterday I was hacking away like a madman on the code all morning. I was on a serious roll, boys and girls. Then, just about 1PM, the power went out (yes, I do this all from the comfort of my own home) and stayed out until 5PM.
And to top it off the lease expired on the IP address I've had since... well... since the last power outage, whenever that was (I have my gateway box on a UPS, but it's only good for about 90 minutes). So I lost half a day and then another hour and a half getting everything on the new IP address.
The good news is the 1G of RAM I took out seems to have stabilized the box running the system's VM. If it lasts through the weekend I'll probably put the 80G drive back in.
It never fails. You fix one problem, three more show up.
With the GeoLite City database issue fixed, a few dozen proxies that had been dropped with SQL errors every time they showed up over the past five months finally made it into the database.
With NULL column values. ARGH!
This screwed up a number of the page updates for the 23rd & 24th.
I hacked together a fix to update the database with non-NULL values and came up with a new problem: there was no flag icon for "unknown country". So I made one.
Right now I'm playing catch-up with yesterday's runs. Halfway there so far.
A couple of my proxy judges fell off the face of the planet so I took them off the list and ran the resurrection code against the database.
Since I moved to VMware Server, the time has been all screwed up. I just noticed this today. I forgot to run the VMware tools and sync the guest with the host.
Why was this never a problem with VMware Player?
I have had a longstanding problem with IPs that are not in the GeoLite City database. For some reason, long ago, I put the latitude & longitude of the IP addresses in the database. Well, if GeoLite City doesn't have the IP it returns NULL values for latitude & longitude and the SQL errors out (yes, I neglected to specify a default value). That's fixed now but I suspect some RFC 1918 addresses may find their way into the database (10.x.x.x and 192.168.x.x are hammered by the scripts but the 172.16-31.x.x addresses aren't in there yet). Also, some IANA reserved and unallocated IPv4 addresses may have the same issue.
These will never make it to the Web page, but I don't like them in the database.
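A sketch of the kind of filter that would keep them out before the import (the regex is my own; the real scripts don't necessarily work this way). The 172.16.0.0/12 block spans 172.16.x.x through 172.31.x.x, which is what the middle alternation spells out:

```shell
#!/bin/sh
# Drop RFC 1918 and loopback addresses from a candidate list
# before anything hits the database.
rfc1918='^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.|127\.)'
printf '10.1.2.3:8080\n172.20.0.5:3128\n172.32.0.1:80\n8.8.8.8:80\n' \
  | grep -Ev "$rfc1918"
```

Note that 172.32.0.1 survives the filter: it is outside the /12 and is a perfectly routable address.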
This hardware is still driving me nuts. Still hanging. Today I ripped out the two odd 512M sticks and took the box down to 2G of RAM (from 3G). If it hangs again, the add-in USB 2.0 card goes (the mobo only supports USB 1.1). If it hangs after that, the video card gets replaced.
I suspect the video card (ATI Radeon X1300 re-branded) because even if I shut down the box normally, when it reboots the event log says it rebooted after a bugcheck (STOP 0x0000007f).
Every single time. Epic FAIL!
Whenever I've seen a STOP 0x0000007f it's always been video related. Very rarely is it anything else. MS really fucked things up when they moved the GDI out of user space and into kernel space starting with NT4 (to get it ready for "DirectX"). This shit never happened in NT 3.51. Oh, BSODs happened now and then in the 3.51 days but never with the frequency they did on NT4 (or beyond). In fact, NT4 was a great shock and a tremendous disappointment to customers who had learned to love the stability of NT 3.51. It wasn't until SP3 that NT4 ran worth a diddly damn.
Yes, I go back that far. Mr. HinkyDink has been around the farm once or twice. How time flies, boys and girls!
ATI isn't helping with their Driver of the Fucking Month either. Forty-five fucking minutes of downtime whenever they put out a new driver. Ugh.
Regardless of the host system's problems, the stability of the VM and MySQL still amazes me (knock on wood). It's been mistreated, barfed on, run on a disk with bad sectors, and it still keeps kickin'. But there's no way in Hell I'm stopping the backups!
The 80G hard drive turned out to be a corker.
The weekend after I installed it I bought a 500G drive. Sheer coincidence. After the 80G died every night for about 3 or 4 days I moved everything over, diddled the drive letters and now we're back in business.
That was yesterday.
Tuesday, it died while I was at work. I spent most of the day trying out my Disaster Recovery Plan (don't tell my boss), which, as it turns out, leaves much to be desired. As luck would have it, though, the database backup ran just before the system crashed. But I was missing a few core utilities and only managed to run about four updates. When I switched back into production I didn't bother bringing the updated database over, assuming I'd get the data again, but we were down to fewer than 500 proxies, so I ran the resurrection script and brought it back up to over twelve hundred.
That huge increase in the number of proxies prompted me to take a look at the recheck code. I think I have fixed that issue but we'll just have to see how it goes.
Meanwhile, as I was going over my DRP, GoDaddy decided to migrate the Web site to a new server, so everything sort of worked out.
So... back in business and back in maintenance mode.
The Half Mill mark came much sooner than I thought it would but we hit it and kept on going, up to 512,000+ dead proxies, five months to the day after the start of the project.
The data (some 86,000 rows) that put us over that mark came from Interesting Site I. You may recall there is more data from that site that has never been entered, primarily because of the oddball ports listed in it. I may lift the ban on these in the near future, because the last batch had some hits on those oddball ports.
I finally upgraded the system to VMware Server. In addition I added an 80G hard drive and copied the VM files over, so now I have a complete backup of the system at the Half Mill mark, which is a good place to be.
Notice I didn't say "interesting"?
The other day, a run to Interesting Site II (IS-2) barfed while I wasn't looking. It was not a disaster and there was no great loss of data. In fact I wouldn't have noticed if I hadn't looked at the log for that event.
I send all the cron job email to my Yahoo account so I can look at it from anywhere and on this particular run, between hundreds of MySQL errors, was a URL. This in itself is curious, since anything from that site should have been chewed down to only IP:port strings.
Intrigued, I pasted it into a browser and was greeted with a few thousand proxies arranged end-to-end in a continuous string. I ran a one-time pass at the site and got a handful of new proxies.
I went back to the browser and refreshed the page.
There was new data on the page. I refreshed it again and there was more. New data was coming in every few seconds to this site.
Of course, I immediately put it into the rotation and dubbed the place "Curious Site I" (CS-1).
Upon investigation, it turned out to be a type of subscription-only proxy list. There is no login, your account is in the URL. The account had to be that of the operator of IS-2.
Whether CS-1 is the sole source of proxies for IS-2 remains to be seen, but now I'm not so sure that IS-2 is as "evil" as I assumed it to be in the beginning. Watching this play out should be educational.
IS-1, the original Interesting Site, had fresh data yesterday. A list with 115,000+ proxies in it. I've been processing it since yesterday and there is some good stuff in there.
This project is becoming less and less about the Other Lists. As a source of new data, they dried up weeks ago. With this new data from IS-1 we should hit the half million mark in the database before noon today.
I collected all 18,901 port 1080 proxies out of the database and let the sockcheck program have at it. Then I took a nap. The results:
These are virtually the same results that I got last year, only more of them. The ISA servers still intrigue me.
I've been using a SOCKS v4 server in Beijing for the last hour or so on the Ubuntu VM. Surprisingly fast. In my experience APAC (Asia/Pacific) servers have been historically dog shit slow.
If you're interested, the original "lost code" from September 2007 is here. It is ugly as fuck and I make no apologies for it. I'm putting it there as a backup before I hack it to bits for the project. It is GPL'd so do with it what you will but leave the copyright notice intact.
After firing up an old VM and putting Anjuta and the salvaged SOCKS code on it, I pulled ~19,000 probable SOCKS proxies out of the database and made a few preliminary random runs.
I tried three SOCKS v4 proxies. Two worked fine (in a browser) and one did nothing. Nothing except the SOCKS v4 handshake, that is.
Every SOCKS v5 server the code identified returned a "connection not allowed by ruleset" result.
Every ISA SOCKS server returned a 12202 error ("The ISA Server denied the specified Uniform Resource Locator"), which is typical for an http request. Often they'll let you do anything but http.
The ISA servers looked familiar. They will tell you their (NetBIOS) name when they send a deny and I swear I've seen a lot of those names before. A large percentage of these boxes have port 3389 (Terminal Services) open to the world.
The vast majority either timed out or showed up as closed. A smattering here and there refused a connection.
There was also junk - no known SOCKS responses but the port is open and something is listening on it. Most of the "SOCKS proxies" on port 7212 will do this.
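For reference, the probes behind all this boil down to a handful of fixed bytes on the wire. Here's a sketch that just builds them (socks4_connect and socks5_hello are hypothetical helpers, not the project's actual code; in practice you'd pipe their output to the target with nc and read back the reply):

```shell
#!/bin/sh
# Build the raw handshake bytes a SOCKS checker sends.
# SOCKS4 CONNECT: VN=4, CD=1, 2-byte port, 4-byte IP, empty USERID + NUL.
socks4_connect() {
  port=$2
  printf '\004\001'
  # port and IP octets converted to octal escapes, then emitted as bytes
  printf "$(printf '\\%03o\\%03o' $((port / 256)) $((port % 256)))"
  printf "$(printf '%s' "$1" | awk -F. '{printf "\\%03o\\%03o\\%03o\\%03o",$1,$2,$3,$4}')"
  printf '\000'
}
# SOCKS5 greeting: VER=5, NMETHODS=1, METHOD=0 (no authentication)
socks5_hello() { printf '\005\001\000'; }

socks4_connect 10.0.0.1 1080 | wc -c | tr -d ' '   # 9-byte request
socks5_hello | wc -c | tr -d ' '                   # 3-byte greeting
```

A v4 server answers a 9-byte CONNECT with VN=0 and a grant/deny code; a v5 server answers the greeting by echoing version 5 and the chosen method. That's the whole basis for telling them apart.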
Anyway, the code is way too verbose as it stands now. I'm going to poke around with it and meditate on what needs to be done.
The purge version of the proxy resurrection code is now in production and will run seven days per week at 6:45AM EDT.
The performance in the VM was excellent and the purge ran in approximately 10 minutes for ~1000 proxies. This is a tremendous improvement over the old purge, which typically took about 45 minutes for ~450 proxies. The difference is concurrent testing. The old purge checked each proxy sequentially. The AMD box was able to do 80 concurrent checks but I limited the VM to 30. I could probably kick that up to about 50 without a problem but I can live with a 10 minute processing time.
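The shape of that change can be sketched with xargs, with echo standing in for the real per-proxy test (the real check would be more like `http_proxy=http://$proxy wget -T 45 -q -O /dev/null http://judge/` against a proxy judge; the file name and judge URL here are made up):

```shell
#!/bin/sh
# Fan out up to 30 proxy checks at once instead of testing them
# one after another. Each input line becomes one check.
printf '10.0.0.%d:8080\n' 1 2 3 4 5 > candidates.txt
xargs -I{} -P30 echo "checked {}" < candidates.txt | sort
```

With a 45-second timeout per dead proxy, 30-way concurrency is the difference between a 45-minute purge and a 10-minute one.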
However, the junk filter is still killing me.
The resurrection code will probably be run on an as-needed basis from now on, or at least monthly. Or perhaps weekly, in chunks of about 3000 or so, although that would put it in perpetual competition with the purge.
Or - and I'm thinking out loud here - test a thousand or so random offline proxies with every purge. Eventually they'd all get checked and it would add a touch of equilibrium to The List.
Now there's an idea!
All I had to do was change a "0" to a "1" and the rechecker morphs into the purge.
Well, not that simple really. It was a ">0", not a "1".
It ran like a greased gopher on the AMD64x2 and hit all 1100+ proxies in less than five minutes. When the smoke cleared we dropped to around 770 proxies, which is still 500 more than we had before Resurrection Day.
Since it's that fast on the AMD, it should probably take no more than a half hour on the VM, so I'm going to move it over and run it once a day instead of M-F.
BTW, there is an ungodly number of CoDeeN servers that came back from the dead, but zero Bahrainian proxies.
I have inspected and re-compiled the SOCKS code and it's definitely going to need a re-write to fit into what the project has become since it was originally written. But it does discern between SOCKS4 and SOCKS5 (no intermediate versions, however) and can identify Windows ISA SOCKS servers. I also need to work on my SOCKS judge strategy, since it's not all about HTTP with SOCKS proxies (and with ISA >=2004 Servers you can't do HTTP over SOCKS if the Web Proxy service is enabled - which has sucked since Day 1 with those damned things).
BTW I have been using the Anjuta IDE for developing the code. I blogged about my search for an acceptable IDE last year when I did my original work with the lost SOCKS code. Anjuta is nice but Debian is way behind on support for the 2.x version. Debian and its bastard children (Ubuntu) only offer version 1.7 dot something. A lot of that has to do with the version of GNOME that comes with the stable Debian releases. If you want the latest, greatest, cutting edge version of almost anything, you have to run an unstable release, which should/could/might have the correct GNOME version. And you probably won't know that until you install it.
Which is why we have VMs. I may check into that this weekend. I've been waiting over a year for them to fix VNC4. Maybe this time things will be different.
But I'm not getting my hopes up.
IS-2 has been providing the greatest number of new proxies but it looks like it decided to go down for the weekend just before 11AM today. Since it did the same thing last weekend we can only hope that it's simply a reflection of its modus operandi. Only time will tell but I expect The List will be relatively quiet this weekend.
I took a break from the action to do a little hardware hacking and as a result managed to pull my old SOCKS code off a dead hard drive that crashed last year.
I tried to salvage that thing a half a dozen times since it died and I am amazed that I finally pulled it off. Several of the partitions were unreadable but /home was sitting pretty and in pristine condition.
I think the trick was in using a downlevel IDE controller. I had a 9 year old "HighPoint" dual IDE PCI ATA133 adapter in my computer junk box that I had a hard time getting Windows drivers for, so it collected dust for years. In a Linux box, it just works. I'm not sure how to describe it, but it just seemed to treat the drive "gentler" than the other EIDE controllers I had tried on it. Whatever the reason, it worked.
NEVER throw out a "crashed" hard drive... you may eventually get your stuff back.
Now I have to figure out where I left off and get moving with the SOCKS proxies, since that was the reason for all this in the first place. SOCKS proxies are rare and wonderful things and if my previous experience is valid there may be two hundred or so hiding in the database.
With all that happening I did not get around to rewriting the purge. There are now over 1100 proxies in The List and if I put it off any longer it might make maintenance difficult. But that's a quick hack and I have to think over my SOCKS strategy so it will probably get done first.
Sometimes I scare myself.
The re-check is still running and almost finished. I believe the issue was our old friend, html2text.
If you've been following this madness, I have noted several times that html2text outputs some seriously screwy things when you pipe it to a file, but looks fine when it outputs to the screen. I think that is where the lost proxies went. I trashed it for the re-check code (the proxy judges are very simple Web pages anyway) and now even the junk filter is working, so there will be fewer "Undefined" proxies in the list.
This has worked so well I'm going to rewrite the purge, which does basically the same thing with the live proxies. The problem is, the purge runs on the VM with the database, whereas I wrote the re-checker on the AMD64x2 Mythbuntu box. I doubt seriously if the VM can handle it. I'll work something out. Until it's rewritten I'm discontinuing the purges, although I'll probably run a few during testing.
html2text is also a big part of the Google Hack, so that's going to have to get rewritten as well.
Today I implemented the proxy re-check code. There are over 11,000 proxies in the "gold" table that used to be alive but didn't respond during the daily purge.
No response can be due to a number of things, including the proxy judge system being down.
A lot of proxies are actually coming back to life. These are very, very encouraging results. The first results will show up in this evening's 8PM run, which will probably run late since the code will re-re-check all the resurrected proxies.
I will be interested to see if the Bahrainian proxies have resurfaced.
Earlier today, I "snuck in" to a very active Members Only proxy forum, using a password from www.bugment.com (highly recommended) and snatched a few thousand or so proxies from various recent postings. The vast majority were already in my database. Nothing new there.
While I was there I did some reading on proxy judges and it turns out this is something of a cottage industry. I had to laugh because I've been using free, absolutely unknown proxy judges for months now. MONTHS! Amazing. These people are clueless.
True, they do disappear, but not as frequently as proxies, and when I throw the re-check into the schedule, that problem will take care of itself.
In typical Hinky Dink fashion I screwed up the speed ratings again.
So, to get things back to normal (-ish) I am running tomorrow's purge early while I go back over the code. By 8PM the first page should have the correct speed ratings. It should finish running by 10PM.
Upside: I added over 300 proxies with the re-checker. Slightly over half the dead ones were checked, so there should be another 300 in there somewhere.
Downside: a lot of CoDeeN proxies have resurfaced.
Up until a couple of days ago I was using a proxy in France somewhere for FF (allow me to be cryptic for now, but FF is listed on the main page and elsewhere). It started to get slow so I picked another from The List. It was a fast German proxy in Frankfurt and it worked great... for a couple of days.
I checked into it and it seems to be an "everything" server (DNS/SMTP/HTTP/POP/the works), like some newb webmaster decided to see how much crap he could pile on one box. And it wasn't some joker at his home playing around. It was located at a legitimate hosting provider.
So I pulled another one from The List, a "2 second" server in Romania, and it worked right off the bat!
Too easy! I run a damned good proxy list, if I do say so myself!
It brought back memories of the Good Old Days, back when almost every proxy at www.proxy4free.com actually worked. Don't bother going there, all they have is junk now, but back in the day they were the best. Nowadays they're just shilling for CGI and PHP proxies and hoping to pick up some click money in the process.
The speed issue is still eluding me. I thought maybe it was a DNS delay with the proxy judges, but those are picked at random so a cached DNS entry on the next check-and-purge doesn't make any sense. The calculations are all good, same as they were when the timeout was 30 seconds.
Anyway, if you start on Page 2, you will have better luck. Those proxies have already been double-checked and the speed should be correct. They are survivors - a lot of the Page 1 proxies don't last more than 24 hours - so they are more likely to be up and running.
Last Friday I didn't think this would happen this month, but after the links2 bug was fixed Google Hack data started going up again.
"Interesting Site II" continues to supply working proxies as well.
Bahrain hasn't shown up this week yet.
I've been looking all over hell for the speed bug but I can't find it. It's starting to get annoying.
SOCKS is still on my mind. After the SOCKS proxies are identified I can work on publishing my results. After that I will probably put a fork in this project. If the data keeps coming in - and I do believe it will finally dry up because this is the twilight of the open proxy - I may change the domain to proxyobession.com and put the whole thing into maintenance mode.
I found a nasty bug in the Google Hack. I was going to call it "interesting" but I've been overusing the hell out of that word and I'm trying to strike it from my vocabulary.
I have been using links2 to get rid of html markup. It works fine until you pipe it to a file. All sorts of crazy, subtle things happen.
It will translate a "?" to %3f, "=" to %3d, etc. , which is fine until you subsequently pipe that back to wget, which does not translate it back. So if you have a URL like...
which should be
... wget then sends it verbatim and the remote site chokes with a 404 Not Found.
This behavior in links2 is not observed when it displays in a terminal, only when it's piped to a file.
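The fix in my case is just to translate the escapes back before the URL goes to wget. A sketch covering only the escapes mentioned above (the example URL and the `$raw_url` variable are made up for illustration):

```shell
#!/bin/sh
# Undo the percent-escapes links2 leaves in piped output before
# handing the URL back to wget.
urlfix() {
  printf '%s\n' "$1" | sed -e 's/%3[fF]/?/g' -e 's/%3[dD]/=/g' -e 's/%26/\&/g'
}
urlfix 'http://example.com/list.php%3fpage%3d2'
# then: wget -q -O - "$(urlfix "$raw_url")"
```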
OK, nice catch. It means there is a little life left in the Google Hack, since it has not been getting any forum data since it was hatched. And there is tons of data in proxy forums (in fact the operators of such forums hate being mined and you usually need to be registered - sucking out of Google's cache can get around registration sometimes).
I was sure we were going to top out here pretty soon, but the database may make it to 400,000 rows yet.
I have my doubts about half a million, though.
BTW, it's Sunday. Will Bahrain make another appearance or has that particular pooch been screwed?
After all the fun with Interesting Site II, I went back to Interesting Site I to see if anything new (and/or "interesting") had happened.
There was one new file, dated yesterday, with 70,000+ IP:port combinations in it.
Like the files before it, it had a lot of suspect ports, so I trimmed it down to the "usual" proxy and SOCKS ports (a little over a fifth - 16,000+ lines - of the file) and threw it at the database.
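The trim itself is a one-liner. Something like this (sample data, not the actual 70K file) keeps only the lines on the "usual" ports:

```shell
# Sketch of the trim step: keep only addr:port lines whose port is one of
# the common proxy/SOCKS ports, and dedupe while we're at it. The sample
# addresses are made up.
grep -E ':(80|8080|3128|1080)$' <<'EOF' | sort -u
203.0.113.1:8080
203.0.113.2:54321
203.0.113.3:1080
203.0.113.1:8080
EOF
```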
One that I know of for sure. I got tired and went to bed.
The rest were mostly new entries, never before seen by the database.
IS-1 (Interesting Site I), you may recall, had over 460K total "proxies" in various text files, and over 14 million email addresses tucked away in two RAR archives.
Comparing IS-1 and IS-2, we can say they have at least one thing in common, besides stockpiling proxies:
They're both up to No Good.
Considering the sheer volume of data both sites have contributed to this project, we can safely say that the people who run Proxy Lists in general are amateurs.
It came back. Quite a surprise.
Once it was offline I went back to the Google Hack to try to grab some proxies from anywhere, but it appears I have tapped the Hack out. I have everything Google has to offer.
But on the last run I got the beeps. I knew it had to be Interesting Site II. There is nothing else.
And it is every bit as productive as it was before.
I feel somewhat apprehensive about offering a "botnet" proxy list, but content is content.
It's all part of the research. This has been a fascinating project.
Yes, I'm easily entertained.
That didn't take long at all.
I put the newest "interesting" site into the bi-hourly rotation, got about 50 proxies between two runs, and the place went seriously dark.
As in "port 80 closed" dark.
I hope this isn't permanent. It was such a good source. Something, somewhere was obviously feeding the site new data. I say that because it wasn't a proxy list. It was a PHP page that returned nothing but IP:port data without any html markup at all.
The box is still on the Net, and considering it's a DNS, SMTP/S, POP3/S, and IMAP/S server - all rolled into one - it may be coming back. That could be the reason the Google Hack dies on the weekend and resurrects itself Sunday evening.
Let me tell you what I've learned about this fellow.
His name is Nick. He owns 16 IP addresses (no, I haven't scoped them all out yet). The DNS name (a "dot-com") is registered in Australia.
Some fellow in the UK has evidence that Nick is a criminal.
The name on the Admin/Tech/Billing contact details of the domain whois record is associated with malware domains.
The IP address is alleged to be a "phone home" site for a botnet (makes sense if he's planting proxies all over the world for his own use).
His hosting provider is in the USA and it has captured the attention of a number of security researchers.
It seems to be part of the infamous "Russian Business Network".
I told you it was an interesting site.
I still have it in the rotation. The fact that it doesn't answer anymore doesn't affect the operation of the script, so if it comes back online, The List will devour the information.
This is one of the reasons it's generally not a Good Idea™ to use an open proxy. You don't know where they come from. You don't know where they've been. And you might make a Nasty Person mad at you if you use their proxy.
All week I've been running the Google Hack.
There are only so many ways you can search for proxies. I settled on a simple search early on. I just search for the most common ports:
:80 :8080 :1080 :3128
This gets a lot of results. The downside is, it gets the same results over and over and over, with minor variations depending on which Google site you pick (actually there is no "picking" since the entire hack is randomized to fly under the Google Anti-Bot radar). It takes 20-30 minutes to run. The results fly by on the AMD64x2 box. I had to add a beep to the script to indicate there was a hit, since almost everything came up as "already in database".
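The beep check itself is trivial. A reconstruction (the real script checks the database, not a text blob, and the addresses here are made up):

```shell
# Beep-on-new-hit sketch: check each candidate against the known entries
# and ring the terminal bell only when something new shows up.
known='203.0.113.1:8080
203.0.113.2:3128'
candidate='198.51.100.9:1080'
if ! printf '%s\n' "$known" | grep -qxF "$candidate"; then
    printf '\a' >&2             # the beep
    echo "new: $candidate"
fi
```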
I started doing back to back runs. I started hearing a lot of beeps. Each time I'd get 7-15 new, active proxies.
Each time, same search.
I knew I had struck a vein, but since I randomized each page of Google results I had no clue where they came from.
Until moments ago.
Once again we have what you would call an "interesting" site.
Somewhere my speed calculations are messed up, but after the daily check-and-purge the numbers should be correct. I'm having a heck of a time tracking this down and it happens whenever I start spreading code across multiple machines.
Bahrain Proxy Madness is spreading this week. Cities other than Manama are showing up, the most interesting (to me at least) being "Isa Town".
Interesting because "ISA" is Microsoft's Internet Security and Acceleration (ISA) Server, which is a proxy server. How fitting.
Because the purge ran and one third of the proxies are gone.
Also, this is the first run with the fixed speed calculations. Those are finally back to normal. You may recall I upped the TIMEOUT to 45 seconds (from 30) last week. Besides screwing up all the speed values it helped to add to the total proxy count. There are a few agonizingly slow proxies listed in there but they're not the majority and they might be of use to somebody, somewhere.
I have been using the DualCore AMD64 (my Mythbuntu system) all weekend for the Google Runs because it's just so darned fast. I can run about 70 database checks per second on it, even with the database on the other end of the network and on a VM. I may in fact turn the TIMEOUT up to 60 seconds and start retesting some old data.
You may well ask "How does slowing down a fast machine get more work done?"
It's all in the forks, boys and girls. The system can fork more processes even though they're only waiting for a TIMEOUT. The AMD64x2 has more RAM and more cycles to dedicate to that. The VM can't touch it in that regard.
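A toy version of the point (not the actual checker - the sleep stands in for a proxy test that's mostly waiting on a TIMEOUT, and the addresses are made up):

```shell
#!/bin/sh
# Each check is mostly idle waiting, so backgrounded forks let the
# wall-clock cost of N checks approach the cost of one - given enough
# RAM and cycles for the process table.
check() {
    sleep 1                     # stands in for waiting on a slow proxy
    echo "checked $1"
}
for proxy in 10.0.0.1:8080 10.0.0.2:3128 10.0.0.3:1080; do
    check "$proxy" &
done
wait                            # all three finish in about 1 second, not 3
```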
In fact, I'm just about Googled out. With the faster machine doing all the Google Hacking I'm getting more and more dry runs. Of course, this whole business is cyclical (look at Bahrain for instance), so just taking a break for a day or two is probably a good thing.
At least one of my "Proxy Judge" sites decided it was a REAL proxy judge after all and changed its format to be helpful. In so doing it turned itself into a worthless proxy judge, at least as far as I'm concerned. As a result, there are more "Undefined" servers than usual. That site is going out of the proxy judge rotation permanently.
This week I will be taking a closer look at the "Undefined" sites to see if I can get rid of them once and for all.
Because Bahrain's back again.
They'll all be gone by Friday and then the cycle will start all over.
The 460K Random Run has completed - faster than I anticipated - and the results are in.
Is that pathetic or what? Of the new proxies most were end-user type DSL or cable systems in South America, Poland, or Spain (judging by the FQDN).
Here is the interesting part: the 431K hosts with "CLOSED" ports are live hosts. Maybe they were proxies last week. Maybe they'll be proxies next week. Maybe they are simply IP addresses that have changed hands via DHCP.
This is also the reason it ran faster than I expected. It was programmed to bypass any testing on closed ports and just go to the next one.
I did a random sampling (nmap) of a few addresses and found - I hate to say it again - "interesting" results. One address was 100% filtered. The next had a single (non-proxy) port open. One had MySQL, VNC, NetBIOS, and HTTP ports open. That one smelled like a honeypot.
Very curious. And someone went to a lot of trouble to compile that list.
I ran across this unsort utility earlier in the week and it was perfect for the 460K Run (the proxy list from the "Interesting Site").
Perfect because there were thousands of dupes and in order to get rid of them the file had to be sorted for unique entries. After it was sorted it was, well... sorted. Testing all those ports sequentially is simply bad form. It sends warning signs to both my ISP and the remote ISP that Something's Up. Randomized, it's just so much background noise to the remote ISP.
Locally it doesn't really help with my ISP, but I've been doing this for three months and they don't seem to care, although it is a dramatic increase in activity.
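The dedupe-then-randomize step boils down to something like this (sample data; coreutils' `shuf` stands in here for the unsort utility):

```shell
# Dedupe first, then shuffle so the port probes don't walk sequentially
# through address space. Sample addresses are made up.
sort -u <<'EOF' | shuf
203.0.113.1:8080
203.0.113.1:8080
203.0.113.2:3128
203.0.113.3:1080
EOF
```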
I figure this should take about 5000-7500 seconds running on the DualCore AMD64 box. It's been running for about ten minutes and the vast majority of the ports are closed. The ones that are "open|filtered" (per nmap) are already in the database (whether they're active or not, regardless of how they're listed in the database, will be determined in a future run). So far there are no open ports I don't already have, but this is just the beginning.
I have a feeling this is a meaningless exercise in futility, but I have to get this list behind me.
It's an obsession.
The 460K Random Run has been running smoothly for three hours and I'm getting approximately 3-4 new live, open proxies per hour. That may not seem like a lot (and it isn't) but it's about what I expected.
If you dropped by last night you may have noticed some of the pages were broken. This is a recurring problem I am having with GoDaddy's hosting service. In fact, if you Google...
426 connection closed godaddy
You can read all about it. The first two hits will be me.
Easy come, easy go.
After this morning's purge there wasn't a single Bahrainian flag left in The List. Not one.
There was a bit of a bug in the page code and a lot of proxies added since last night showed up with a negative speed. I upped the timeout by 50%, from 30 seconds to 45 maximum, but missed one calculation. Every run after 10AM today is correct.
Why increase the timeout in the first place? Because it's an international list. It may take the system here in the USA 38 seconds to get a page from a proxy in Zimbabwe, but a user in Kenya may get it in 5. You never know. Plus, it boosts the proxy count and since the daily purge is so damned effective these days I need all the data I can get, even though I'm getting a lot of data.
I have enlisted my Mythbuntu system for some grunt work. AMD64 DualCore, 2G of RAM, and lots of cycles to spare when I'm not watching TV (plus, after I upgraded to Ubuntu v8.04 MythTV is broken anyway... I need to work on that). It is a lot more capable than the VM that has been running the show and I can get a lot more done.
The Google Hack took us to a quarter million and has now kicked us up to more than 320,000 IP:ports in the database.
That's 100K in a week. And without dipping into the 460K proxies from the "Interesting Site" I found earlier this week (I'm still trying to figure out how to crack that nut without pissing off my ISP).
As usual, the good proxies go as fast as they come. Some kind of cosmic proxy equilibrium going on there. If things go as usual by the end of tomorrow's purge we'll be back down to 150 working proxies.
Why do they light up and go dark so fast? Good question. A lot of it has to do with the Bahrainian proxies, since they're the biggest block. Once Bahrain Telecom gets their act together we may get a better picture of what's going on in the wild.
In total, the site I mentioned yesterday had over 450,000 proxies tucked away in text files. I thought a long time about adding that stuff to the database, but on closer inspection it looked like mostly junk.
I know. I've said it a hundred times. The database has mostly junk in it already. I just can't see tripling the size of it with this particular junk. There are far too many oddball ports for my taste. And there are IPs with 4 or more different ports listed. No, it just doesn't look right.
I had an idea to just run through it and find any and all open ports in that list and to Hell with the rest, so I cooked up some quick bash kiddie scripts and ran with it... for about five minutes. It simply ran too fast. That kind of activity throws up red flags, so I shut it down and backed off. But still... it's tempting. If the numbers I've run across are any indication, there could be anywhere from 300 to 600 live proxies in all that mess. I may chop it down into smaller files and give it another whack sometime. A slow, leisurely, measured whack. Or rather, whacks. Spread out over a few months. Sounds like a weekend project.
The other interesting thing about that site was 14 million email addresses stuffed into .RAR archives (I didn't count them but the filenames themselves indicated the total numbers).
OK, so we have:
Hmmm... ya think maybe there was some spamming going on here?
Those half a million proxies could have been a rented bot army, which would account for the oddball port numbers. The bot theory is good because I randomly tested a handful and found live hosts with closed ports. And the ones I tested all had ISP type DNS names.
You certainly can find some peculiar things on the Intertubes!
Once upon a time, when I was doing manual, ad hoc Google grazing, I ran across a list - actually a single text file - with over 70,000 proxies in it. Well, those went into the database long ago but today's Google Hack hit it again, so I had to kill the run and twiddle the hack so that it will ignore the site from now on.
70,000 proxies, at this point, is over a quarter of the total database and to process them would just be so much wheel-spinning, even with the recent performance enhancements (I didn't say... ugh... tweaks).
But this time I took a closer look at the site. I can only say it's... very, very interesting.
It looks like an abandoned Web site, but there are relatively fresh files just sitting there and most of them are text files full of proxies.
I think I will pull down the entire site and put the other VM to work on them to see what happens.
BTW, the Google Hack snarfed down about 120 Bahrainian proxies before I killed it.
I never cared for that word. You heard it a lot in the "Team OS/2" days. I never could stand those people. Now it's another word for "meth head". Fitting.
But "Performance Enhancements" was too long for a header, so WTF?
Today I did two things I should have done long ago.
I should have indexed the database 160,000 proxies ago. After we hit the Quarter Mill mark, checking the database for an existing IP:port slowed things down significantly, especially during those hard core Google Hack runs.
So I resurrected an old Ubuntu 6.10 VM, stretched its disk out about 10 extra gigs, and dumped a copy of the database into it. Then I indexed the copy to see how long it would take. I didn't really know what to expect but the index only took a minute at most and everything worked fine. Satisfied that I wouldn't trash the database (and if I did I had a solid copy), I indexed the production database and fired off another Google Hack.
The speed difference isn't earth-shattering, but it has made a big improvement. I'd say it's 4 or 5 times faster than before. Fast enough to run multiple concurrent Google Hacks.
As luck would have it, I was not dreaming.
Right now the daily purge is taking place and the new Bahrainian proxies are holding steady.
There's an old saying that's a favorite of mine. It goes something like this:
I've always enjoyed the subtlety of that statement.
Looking over last night's run, it is clear the Google Hack was responsible for snarfing all those proxies. I did an impromptu survey of "The Other Guys" and found they still have nothing but junk. And although the Bahrainian proxies are obviously listed somewhere on the Internet (since they were found through Google, after all), it appears I have the freshest list on the Internet.
I woke up this morning to 150 new Bahrainian proxies. Yesterday there were only a handful. Now, almost three and a half pages worth. All of them were collected with a Google Hack that's been running for about 12 hours.
I haven't discounted the possibility that I may be dreaming.
The Quarter Mill mark was hit sometime in the early hours of the 17th. Since then the Google Hack, as promising as it was in the beginning, virtually dried up and most new data is once again coming from the list raids.
I knew it was getting bad when it started hitting my own list and sites like Proxy Cemetary (why that guy bothers is beyond me but he does get a lot of Google hits).
On top of that, China seems to have dried up. The first Google Hack test runs ran against the .cn TLD with astounding results. Now any Google query against .cn returns only government sites. Crackdown perhaps? Maybe they'll ease up for the Olympics.
This week I also started running the dead proxy purge Monday through Friday instead of just Monday, Wednesday, and Friday. The result: there are usually less than 200 proxies total in the list. Proxies come and go but mostly they go... fast. I can see why the Other Guys pad their lists with junk.
[ NOTE: 11AM and an unproductive Google Hack run that's been going for four hours just started getting results... Maybe it's not dead quite yet... 1PM - the Google Hack added about 50 proxies total and it's still running... 2PM - make that 80!]
So to keep this puppy fresh I've got to come up with some new ideas. The most obvious would be the SOCKS project. There are thousands of port 1080 (the "traditional" SOCKS4 port) proxies in the database that have never been tested.
It also occurs to me there may be bouncing proxies in the wild, here today, gone tomorrow, back the next day. Testing that theory would be simple with the data already in the database.
Very soon we will hit the quarter million mark in the database.
The Google Hack from yesterday continues to churn although at this point no new proxies are getting added. However there are thousands of dead proxies getting added every hour and the total count in the database is slowly approaching a quarter of a million unique address:port combinations. There is less than five thousand to go. I consider this a significant milestone in the project.
It occurs to me that because of the Google Hack there is no way to be 100% positive that what is being added are actual dead or missing proxies because it is a Blind Hack. There is no way to know where the data actually came from. Anything that is in an address:port format is only assumed to be a proxy. Or to have been a proxy at some time in the past.
That is something of a conundrum. Some of these address:port combos could be part of some long lost TCP tutorial, Web server log, or technical publication, but as long as the Google Hack identifies real proxies I believe it's safe to assume the data is from proxy lists. Whatever their origin, the Google Hack works better than anything else I have designed to date for this project.
Luckily, the way the geolocation system works, RFC 1918, APIPA, and other bogus addresses are automatically filtered by generating a harmless SQL error that keeps them out of the database. If they are not real, at least they are not bogus.
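A belt-and-suspenders version of that filtering can be done before the database ever sees the data. This is a sketch of the idea, not the actual mechanism (the real system just lets the geolocation lookup error out), with made-up sample addresses:

```shell
# Drop RFC 1918 private ranges, loopback, and APIPA addresses before
# they reach the database.
grep -Ev '^(10\.|127\.|169\.254\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)' <<'EOF'
10.1.2.3:8080
172.20.1.1:80
192.168.0.5:3128
169.254.1.1:8080
203.0.113.7:80
EOF
```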
I slapped that RAM into the box that runs this project and was back up by 7PM yesterday. Shortly thereafter I set up a Google Hack run that is still scraping the net for proxies now, some twelve hours later.
Result: The List has more than doubled in size.
This has turned out to be much more efficient than raiding the "Big Boys" of the proxy list business and has positively demonstrated that those guys don't have a clue (but honestly, they all want to sell you something, so they keep "the Good Stuff" - if they have any good stuff - under wraps until you send the ten bucks per month to get on their mailing lists or buy their software or whatever).
The dead proxy purge should kick in soon, so we shall presently get some idea of the quality/longevity of these new proxies.
Global Google Proxy Sucking was somewhat disappointing until I pointed it back at China. Even though I hit China in early testing they still had some fresh addresses and active proxies only a few days later. I am seriously considering putting a random Google China Run in the daily rotation since it's every bit as productive as hitting the usual sites.
Some things I have learned:
The RAM upgrade should come today so there will be a short interruption in service while it is installed.
As part of my plan to bypass the Google Anti-Bot by going global, I have published the Definitive List of Global Googles for your education here.
I hacked together a little script to add all the two-character country TLDs (top level domains) to google.xx and www.google.com.xx and did nslookups on all of them to see if they were live. The result was 179 domains, most of which resolve to (or are canonical for) "www.l.google.com".
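The candidate-generation half looked roughly like this (a cut-down reconstruction, not the original script - the real run walked all ~250 ccTLDs and nslookup'd each one, which is omitted here):

```shell
# Generate the google.<cc> and www.google.com.<cc> candidates for each
# country code; the real script then ran nslookup on every name and kept
# the ones that resolved.
for cc in de fr jp; do
    printf 'google.%s\nwww.google.com.%s\n' "$cc" "$cc"
done
```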
Hmmm... except for the .cz (Czech Republic) domain... I'll fix that.
And the .cn (China) domain, which is of course "special" because China hates our freedumbs as much as Google does, which is why they track your every move in the first place.
They all resolve to the same three IP addresses but the Anti-Bot doesn't (at this time) respond with the "We're Sorry" page when you switch domain names. The search results aren't quite the same but for my purposes it doesn't matter that much.
It may not make any difference in the end because the anti-Anti-Bot (let's call it the AAB from now on), as it's coded now, doesn't hammer Google to get links and takes a page at a time to do its job.
We'll see how this plays out.
The List dropped down to four pages.
There isn't a single Bahrainian proxy in it. I was getting sick of all those little red & white flags anyway.
Of course, I predicted this back in June, but subsequent events tried to make a liar out of me. It was only a matter of time.
I'm always right about these things. Remember that, boys and girls.
This weekend I'm going back to my list raiding activities to bring the numbers back up. I've also ordered a RAM upgrade for the system and I will probably upgrade from VM Player to VM Server so I can restart everything automatically.
The purge for today just finished moments ago and although The List won't be updated for about an hour (we harvest on the odd hours and publish on the even hours), it looks like a bloodbath.
When it started there were 427 proxies, but when it finished there were only 185 verified "live" proxies left - the lowest number yet.
Oh, well... I only claim to have the highest percentage of active proxies... not the highest number. But, when you think about it, I probably do have the highest number of active proxies compared to any other list, considering they only have junk anyway.
There may also be a higher percentage of "Undefined" proxies this time around because one of my proxy judge sites got their domain parked, skewing the results.
I have a few links I'm planning for the sidebar over there on the left, such as an OP/ED piece on bash, a concise description of the harvesting process, and documentation on the system itself, for all you geeks and propeller heads out there.
The system's running cool again. No problems there.
Looking at the Google hits I realized Google is sending some stale hits, like a reference to "Page 17" long after the list had been shortened down to nine or so pages.
Since these are all static pages spit out by bash, I added a redirect back to Page 1 for any page that has expired.
Better than a 404 or a stale page, but there's got to be a better way.
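For the curious, the stub the page generator spits out for an expired page is presumably something like this (a reconstruction - the Page 1 URL here is a stand-in):

```shell
# Emit a zero-delay meta-refresh page that bounces the visitor back to
# Page 1. In the real setup this gets written over each expired page.
cat <<'HTML'
<html><head><meta http-equiv="refresh" content="0; url=index.html"></head>
<body>This page has expired. Taking you back to Page 1...</body></html>
HTML
```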
The box overheated.
In January I pulled three fans out to make the thing a little quieter.
Left home at about 7:40 this morning and by the time I got to work the system was down, so there are not going to be any updates today until around 6PM.
Sort of scratching my head here wondering what the problem could be, but it just so happens today is the dreaded The Day After Patch Tuesday.
I've been hammered numerous times on Post Patch Wednesday, so I have shut down automatic updates on every Windows machine I run. Or so I believe. The Linux machine that does all the updates to The List is an Xubuntu 6.10 VM on a Windows XP host system... I know, serious problem there, weakest link, all that... I may never learn. At the moment the host system has fallen off the network, probably with a "Reboot Now?" dialog box ready to be clicked.
Anyway... that's what I'm hoping has happened. That's the optimistic outlook. I've already had more hardware failures than I care to admit in the last year. Don't ask about backups.
Must inevitably come down.
Sunday there were over 800 address:port combos on The List. Now there are only 555. I expected much worse.
There are still around five SOLID pages of Bahrainian proxies even though I took a break from Operation Titan Bahrain yesterday.
I did a small raid against Japanese proxy lists, but was unimpressed. For some reason .jp Web admins like putting their logs online. Bad idea (from a security standpoint) and they fool Google into thinking they're proxy lists. Maybe some other time. Now I'm looking at what Russia has to offer (quite a lot, although I'm already harvesting a number of Russian sites).
But for now I'm going to give it a rest until the weekend. Hopefully most of those BH proxies will be gone by then.
Thanks to the China Run, The List now has so many Bahrainian proxies in it that it's starting to get embarrassing. As of now there are 16 pages, the first 8 nearly all from Bahrain.
The last time I checked there were 455 live Bahrainian proxies, as well as 1077 dead ones in the "Gold" database alone.
I expect most of them to clear out next week after the ADPE process runs again.
For now I'm going to put the China Run to bed, but it's obviously the place to harvest proxies from. They are kickin' ass!
Fishing was good! DAMNED GOOD! About 6800 address/port lines got us over 250 LIVE proxies. And that was only the first three pages of the search.
Now we know who's scanning Bahrain. That's one mystery that's over.
Maybe we should call it "Operation Titan Bahrain".
If you get the pun, good for you!
I was talking with my kid, Rinky Dink, yesterday. He's off at college, working like a slave and taking summer classes.
I guess they're keeping an eye on The Dinkster! Less than two weeks online and already I'm on their Shit List.
Anyway, I was distracted from my Anti Google Anti-Bot activities for most of yesterday. I decided to refine my Universal Proxy Harvester in preparation for it.
All of my site-specific harvesters are just that, one-time hacks for pulling info off a page (or pages) from a single site. My Java and GIF hacks are still required for those freakishly paranoid sites that can't be html-scraped. For 99.9% of other Proxy Lists you don't need to jump through all those hoops.
I have finally generalized a sed/grep/tr/cut routine that can scrape any site in any language, eliminating the tedious cut & paste of my ad hoc searches.
It is a First Class Hack, boys and girls.
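The guts of it are roughly this (a simplified reconstruction - the real routine layers in sed and cut as well, and the canned table below stands in for a stripped Web page):

```shell
# Universal harvester core: tokenize whatever survives markup stripping,
# then keep anything shaped like addr:port. Sample addresses are made up.
tr -cs '0-9.:' '\n' <<'EOF' | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]{2,5}' | sort -u
<tr><td>203.0.113.7:8080</td><td>anonymous</td></tr>
<tr><td>198.51.100.2:3128</td><td>transparent</td></tr>
EOF
```

Because it only cares about the shape of the data, the surrounding language and markup are irrelevant, which is what makes it work on any site.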
I did a Google search of BlogSpot to find all the proxy lists there and it worked perfectly. I found over 50 live proxies (out of over 15,000 addr:port combinations, which bumped the database up to nearly 225,000 proxy entries).
From there I decided to get all the proxy lists in China (just use site:.cn in your Google search for that).
And things sort of fell apart from there.
For one thing, the Chinese don't like colons. Most of the sites use "addr port" instead of "addr:port". Simple enough, even for multiple spaces.
The next issue was with html2text, which, as the name would imply, translates html to text. For some reason (documented somewhere I can't find), the -ascii switch doesn't work worth a diddly damn on Chinese Web pages.
This forced me to make a gawd awful, inefficient sed script to kill every character greater than ascii 127.
But after all that was taken care of, it works against any Web page, html table, or forum posting you can throw at it.
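The two fixes together amount to something like this (a reconstruction, not the actual script - the sample line is made up):

```shell
# Delete every byte outside the ASCII range (the blunt fix for html2text's
# -ascii problem), then turn "addr port" into "addr:port" for the
# space-separated Chinese lists.
tr -cd '\000-\177' <<'EOF' | sed -E 's/(([0-9]{1,3}\.){3}[0-9]{1,3})[[:space:]]+([0-9]{2,5})/\1:\3/'
代理 203.0.113.9   8080 端口
EOF
```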
I'm currently running my first Chinese Web scrape and it's showing great results. They sure know how to find proxies.
The next step is merging this with automatic Google queries.
It didn't take long for my Clever Plan to get axed by Google.
And, as usual, it was working sooooooooo well up until the point I got the sorry.google.com page.
All I could do was stare at the screen and pout.
I'm very good at that.
But suddenly... it came to me! The perfect solution!
No, no proxies involved... that would be too easy. Plus it might send up a Big Red Flag at the site the proxy's running on. Not a good outcome.
The solution is to be patient and go global.
It doesn't matter where the pages come from. As soon as one search goes "sorry" on you, which is easy enough to monitor, you simply continue by switching the country through which you hit Google. Just wait a few moments, change the Google URL, and carry on.
What's the URL? That's the next trick.
The ADPE ran and culled 160+ proxies. The List went from 11 pages to 8, or about 360 proxies. This number seems to be something of a fuzzy constant.
In anticipation of this I did a number of ad hoc queries and in the process I have found that this can be more fruitful than harvesting lists. This is something of a revelation. I had stopped doing ad hocs regularly before the List went online. I did it primarily to pump the database up, whether the proxies were dead or alive. As usual, most were dead, but that's just a reflection of the proxy problem at large.
Besides the extra proxies that the lists don't have, there a few other advantages to doing ad hoc entries:
The only disadvantage was that I was doing it all manually.
That is about to change.
The real challenge will be doing it under Google's radar. Whenever I start doing something like this they eventually accuse me of having/being a virus.
Which I'm not.
I'm just a plucky little security researcher.
Today will mark the 3rd invocation of the ADPE (Automatic Dead Proxy Eliminator). As you recall it runs Mondays, Wednesdays, and Fridays.
On Wednesday's run the list was chopped in half, down to ~320 proxies. This morning it's at 512, with a number of new Bahrainian proxies, although not showing as strongly as they were last week.
The proxy lists I harvest have been slacking. In the past two days I have been doing ad hoc searches through the Google referrer hits (they're still pretty stupid queries) and my own searches.
I have found one very active forum that requires a login and although BugMeNot has accounts for it they're kind of useless. Interestingly enough, if you get a Google hit the cached page is there for a quick cut & paste. Does the GoogleBot have an account there?
Anyway, you want proxies? Do a Google search on 80 8080 3128. You will get proxies. A lot of it is old crap but it still goes into the database so those entries don't get scanned again. That has the side effect of making the database (now with 215,000+ entries) harder to search, so I am considering ways around it, including splitting the database into hi/low address ranges or running a "diff" between the harvester's daily runs (it's the same data every day, with the newest proxies usually buried pretty deep). Organizing that is a bit of a problem.
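The daily-diff idea, sketched with made-up sample data (the real runs would be files, not printf):

```shell
# Lines only in today's harvest are the genuinely new candidates worth
# testing; comm -13 prints lines unique to the second (sorted) file.
printf '203.0.113.1:80\n203.0.113.2:8080\n' | sort > /tmp/yesterday.txt
printf '203.0.113.1:80\n198.51.100.9:3128\n' | sort > /tmp/today.txt
comm -13 /tmp/yesterday.txt /tmp/today.txt
```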
I'm also considering keeping statistics. There are definitely some data mining opportunities in this mess, but unfortunately I never started out with that in mind.
The ADPE (Automatic Dead Proxy Eliminator) kicked off this morning for its Wednesday run and the Bahrainian proxies on The List started dropping like flies. At least two full pages have disappeared. Expect more to drop out by the end of the week.
***UPDATE: This problem has been fixed.***
I ran across this on Digg.com today:
So I figure what the hell, give it a shot. And this is what you get:
Oh for crying out fuck. Why do programmers do this shit?
Back in the day (1994, I think it was) I applied for a job at a CD mail order company (remember those? Back in the days before the Internet?). The dipshit owner/programmer had based their production distribution CD on a demo copy of Visual BASIC. What a dickhead. In six months all his (shipped!) products DIED.
Consider if you will this Russian proxy list.
This fellow went to a lot of trouble to prevent people like me from harvesting his data.
All the address:port combinations are GIFs, which is bad enough, but that's not all!
The GIF file names are based on your session cookie and unique for every visit. So first you get your cookie, then request the page (it helps if you strip the Accept-Encoding: gzip,deflate header and use HTTP/1.0 instead of 1.1) and get the unique GIF names. Then you download the GIFs and throw them at your OCR program.
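Once you have the page, pulling out the per-session GIF names is a one-liner. A minimal sketch, assuming the list renders each entry as an ordinary `<img src="....gif">` tag (the markup pattern is a guess about that site, not something confirmed here):

```python
import re

# Rough sketch: extract every GIF filename referenced by an <img> tag,
# in page order. The src="...gif" pattern is an assumption about the
# Russian list's markup.

def gif_names(html):
    """Return the GIF filenames from all <img> tags in the page."""
    return re.findall(r'<img[^>]+src="([^"]+\.gif)"', html, re.IGNORECASE)

sample = '<td><img src="a1b2c3.gif"></td><td><img src="d4e5f6.gif"></td>'
print(gif_names(sample))  # ['a1b2c3.gif', 'd4e5f6.gif']
```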
GNU OCR (or simply "gocr") couldn't handle the 7's in these GIFs, but I piped them through a utility called "gifsicle" and scaled them up by a factor of 10. After that, it only had a problem with colons, but that was taken care of with a quick sed script.
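The "quick sed script" step could look something like this in Python. The actual substitution set is a guess at typical gocr misreads (semicolons and commas standing in for the colon); anything that still won't parse as ip:port after cleanup gets thrown away:

```python
import re

# Hypothetical stand-in for the colon-fixing sed script: patch likely
# OCR misreads of ':' and keep only lines that then parse as ip:port.
# The [;,] substitution set is an assumption, not the original script.

def clean_ocr(line):
    """Normalize one OCR'd line into ip:port, or None if it won't parse."""
    fixed = line.strip().replace(" ", "")   # drop stray OCR spaces
    fixed = re.sub(r"[;,]", ":", fixed)     # guessed colon misreads
    m = re.fullmatch(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})", fixed)
    return m.group(0) if m else None

print(clean_ocr("192.168.1.1;8080"))  # 192.168.1.1:8080
print(clean_ocr("garbage"))           # None
```

The nice part of a validating cleanup like this is that OCR junk lines disappear on their own instead of polluting the database.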
Most of the proxies were already in my database, but I got about 10 out of the 100 or so he had listed. A 10% hit rate is pretty damned good (almost unheard of, in my experience), so this site is going into the permanent rotation.
At this moment, the second run of the Automatic Dead Proxy Eliminator (ADPE) is about half-way done.
ADPE is currently scheduled to run on Mondays, Wednesdays, and Fridays and its purpose is to glean dead proxies out of The List.
There were 13 pages - 630+ proxies - before it started. Now it's down to 9 pages and 420-odd proxies. Proof positive, if you ever wondered about it in the first place, that they go fast!
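The kind of liveness test an ADPE pass presumably runs can be sketched in a few lines. This is only a reachability check, not the real ADPE: a proper pass would also speak the proxy protocol, but a port that won't even accept a TCP connection is dead by any definition.

```python
import socket

# Minimal dead-proxy check: try a TCP connect with a timeout.
# connect_ex returns 0 on success and an errno on failure, so a
# refused or timed-out connection reports the proxy as dead.

def is_alive(host, port, timeout=5):
    """True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# A local port nothing listens on should report dead:
print(is_alive("127.0.0.1", 1, timeout=2))
```

Run that over the whole list three times a week and the dead pages prune themselves.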
Earlier, I speculated that the Bahraini Proxy Problem might be taken care of on this run, but so far those boxes are still checking out as up and running. I'm still curious as to what, exactly, they are, since port scans seem to imply they're devices rather than computers.
A nice side effect from the Extreme-DM "Web bug" planted in the Proxy List has been the list of search engine referrers.
Not surprisingly, they are all Google hits so far (at this point the list has only been up for a week).
What is surprising is how exceedingly bad some of the search requests are, such as:
On crappy searches like this the List really shines, and it can be the only pertinent result on the first page. Weird.
Other search engine referrers have taken me to lists I haven't seen before (and with 208,000+ addresses in the database, I have seen quite a few). Some of these have borne fruit and have gone into the daily harvesting runs.
I chose this venue to split documentation off from my main blog and to have a place I can come to any time to add a few notes on the Proxy Project.
You see, I am the Network Nazi at work. Blogspot is banned at work and my own site, www.mrhinkydink.com, has been classified as "Games" by Websense, so it is blocked as well (and for the most part it redirects to BlogSpot anyway). For some reason GooglePages is open, so until Websense screws that particular pooch I can jot down a note or two on the fly at work without sending up a red flag.
Besides, I like GooglePages. It's perfect for things like this.
I suppose you may wonder if I'm the Network Nazi why I don't just unblock the site. I can, but that just wouldn't be right.
Besides, I have other ways. If you know what I mean.