The informal rules about what counts as acceptable use of someone else's web server are clear if you write a new browser. Nobody complained when Firefox came along, because there are real people reading the content that the server owners are paying to send.
The rules are also well understood if you write a new robot to crawl the web: it should tread very lightly indeed, respect the robots.txt file, and keep some delay between fetches, so as to avoid slowing down the server for the real traffic.
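As a rough sketch of that etiquette (illustrative Python only, not any real crawler's code, and the one-second delay is just an example value):

```python
import time
import urllib.parse
import urllib.request
import urllib.robotparser

def polite_fetch(urls, user_agent="ExampleBot", delay_seconds=1.0):
    """Fetch a list of pages while respecting robots.txt and pausing
    between requests so real visitors aren't slowed down."""
    pages = {}
    robots = urllib.robotparser.RobotFileParser()
    for url in urls:
        # Check the site's robots.txt for this user agent before fetching.
        robots.set_url(urllib.parse.urljoin(url, "/robots.txt"))
        robots.read()
        if not robots.can_fetch(user_agent, url):
            continue
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request) as response:
            pages[url] = response.read()
        # Leave a gap between fetches to keep the load light.
        time.sleep(delay_seconds)
    return pages
```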
SearchMash is somewhere in between these two extremes. Originally, it was a pure browser. It is still entirely user-directed, so there's a good chance that the bandwidth is going towards your target audience. On the other hand, an entire page of search results will be fetched at once, so it's not as user-directed as if a visitor had clicked directly on your link.
To keep the bandwidth demands as small as possible, I avoid fetching anything but the main HTML until the user requests a preview of the page, so no images are requested.
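Roughly, the preview fetch amounts to something like this; it's an illustrative Python sketch rather than SearchMash's actual code, and fetch_main_html is just a placeholder name:

```python
import urllib.request

def fetch_main_html(url, timeout=10):
    # Retrieve only the HTML document itself. Because the page is never
    # parsed to follow <img>, <script>, or stylesheet links, no images or
    # other sub-resources are ever requested from the target server.
    request = urllib.request.Request(url, headers={"Accept": "text/html"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```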
I know not everyone will agree that it's a net benefit, so I've made sure the User-Agent header is set to MashProxy on every request, which lets servers easily block my traffic. I also considered a whitelist system, since that would prevent intranet access as well, but I could see no practical way for it to gain adoption.
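Blocking it really is as simple as checking that one header. Here's a minimal illustration written as Python WSGI middleware; it's just a sketch of the idea, not something any server actually needs to install:

```python
def block_mashproxy(app):
    """WSGI middleware sketch: refuse any request identifying itself as
    MashProxy, and pass everything else through untouched."""
    def wrapper(environ, start_response):
        if environ.get("HTTP_USER_AGENT", "") == "MashProxy":
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"MashProxy requests are blocked on this server.\n"]
        return app(environ, start_response)
    return wrapper
```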