David Wallace Croft
Senior Intelligent Systems Engineer
Special Projects Division, Information Technology
Analytic Services, Inc. (ANSER)
croftd@nexos.anser.org
1997-10-08
The evolution of Internet Agents, bred by commercial interests to produce the quick and the thorough, must be tempered by the necessity to respect the boundaries of the environment.
The future belongs to neither the conduit nor content players,
but those who control the filtering, searching, and sense-making
tools we will rely on to navigate through the expanses of cyberspace.
Without a doubt, the agent arena will become a technological
battleground, as algorithms rather than content duel for market
dominance.
Paul Saffo, 1994, Wired
The increasing stresses on the Web's infrastructure caused by
both human and robot overuse deepened the growing split between
robot authors and webmasters. But that wasn't the only force
at work in an increasingly dire situation. Big business had
arrived on the Net. The speediest robot ensured the best
search engine -- so why worry about overloading Web servers
all over the Net? Entrepreneurs had come out of the closet
like roaches after dark. The flimsy fabric of the Robot
Exclusion Protocol offered little protection.
Andrew Leonard, 1997, Bots: The Origin of New Species
The Robots Exclusion Protocol has been the de facto standard for web spider etiquette since 1994.
A file called robots.txt within the top-level directory of any web server provides fair notice to any Internet software agent as to what is off-limits.
The format of the file details restrictions by agent name and by resource path. The following example, for instance, declares the entire site off-limits to every agent:
User-agent: *
Disallow: /
The Protocol is a strictly self-enforced prohibition with which all reputable agent developers voluntarily comply.
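As an illustration only, the following Java sketch shows one way an agent might honor the Protocol before requesting a page: it fetches /robots.txt, collects the Disallow prefixes in the record addressed to all agents, and refuses any path that falls under one of them. The class name and the simplifications (only the "User-agent: *" record, no comment handling, no caching of the file) are assumptions for this example, not part of the Protocol.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsCheck {

        // Returns false if the path falls under a Disallow rule in the
        // "User-agent: *" record of the host's robots.txt file.
        public static boolean isAllowed(String host, String path) throws Exception {
            List<String> disallowed = new ArrayList<String>();
            URL robots = new URL("http://" + host + "/robots.txt");
            BufferedReader in =
                new BufferedReader(new InputStreamReader(robots.openStream()));
            boolean allAgentsRecord = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                String lower = line.toLowerCase();
                if (lower.startsWith("user-agent:")) {
                    allAgentsRecord = line.substring(11).trim().equals("*");
                } else if (allAgentsRecord && lower.startsWith("disallow:")) {
                    String prefix = line.substring(9).trim();
                    if (prefix.length() > 0) {
                        disallowed.add(prefix);   // an empty Disallow imposes no restriction
                    }
                }
            }
            in.close();
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }
    }

A missing robots.txt file raises an exception in this sketch; since an absent file imposes no restrictions, a real crawler would treat that case as permission to proceed.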
An artificial web crawler can request so many resources simultaneously or in such rapid succession that the targeted web server becomes overloaded.
Such rapid-fire requests amount to a denial-of-service attack: for its duration, other clients, such as human customers, are denied access to the web server due to the sudden drop in responsiveness.
Internet agent developers can avoid incurring the wrath of webmasters by inserting a forced delay between the requests generated by their super-surfing automatons.
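A minimal sketch of such a delay in Java, with an assumed one-second interval (the courteous figure depends on the capacity of the target server):

    import java.io.InputStream;
    import java.net.URL;
    import java.util.List;

    public class PoliteFetcher {

        private static final long DELAY_MILLIS = 1000;   // assumed courtesy pause

        // Fetch each URL in turn, pausing between requests so that the
        // crawler never monopolizes the target server.
        public static void fetchAll(List<URL> urls) throws Exception {
            for (URL url : urls) {
                InputStream in = url.openStream();
                // ... read and process the document here ...
                in.close();
                Thread.sleep(DELAY_MILLIS);
            }
        }
    }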
The designs of some early web crawlers were so poor that they would occasionally consume excessive resources by trapping themselves in an infinite loop, requesting the same resources from an overloaded server again and again until mercifully terminated.
One mistake is the failure to record which pages have already been visited. By blindly following all hyperlinks, an agent may end up running in circles perpetually.
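One remedy is simple bookkeeping, sketched below in Java: a set of every URL already queued ensures that circular hyperlinks are followed at most once. The extractLinks() helper is a hypothetical placeholder for whatever routine fetches a page and returns its hyperlinks.

    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    public class LoopFreeCrawler {

        // Breadth-first crawl that records visited URLs so that pages
        // linked in a circle are fetched only once.
        public static void crawl(String startUrl) {
            Set<String> visited = new HashSet<String>();
            Queue<String> frontier = new LinkedList<String>();
            visited.add(startUrl);
            frontier.add(startUrl);
            while (!frontier.isEmpty()) {
                String url = frontier.remove();
                for (String link : extractLinks(url)) {
                    if (visited.add(link)) {   // add() returns false if already seen
                        frontier.add(link);
                    }
                }
            }
        }

        // Hypothetical placeholder: fetch the page and return its hyperlinks.
        private static List<String> extractLinks(String url) {
            return new java.util.ArrayList<String>();
        }
    }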
Another problem is black holes. These are web pages that are dynamically generated with hyperlinks to more web pages that are also created on the fly, so that each page leads to an apparently novel page without end. If you can imagine opening a door to find another door, and then opening that one to find the same, you get the picture.
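A record of visited pages is no defense here, because every generated page carries a genuinely new address. One precaution, sketched below with an assumed limit of ten hops, is to cap the depth of links followed from the starting page and abandon any branch that exceeds it; extractLinks() is the same hypothetical placeholder as above.

    import java.util.List;

    public class DepthLimitedCrawler {

        private static final int MAX_DEPTH = 10;   // assumed limit on link hops

        // Follow links recursively, but give up on any branch that runs
        // deeper than MAX_DEPTH; an endless chain of generated pages is
        // then abandoned after a bounded number of requests.
        public static void crawl(String url, int depth) {
            if (depth > MAX_DEPTH) {
                return;
            }
            for (String link : extractLinks(url)) {
                crawl(link, depth + 1);
            }
        }

        // Hypothetical placeholder: fetch the page and return its hyperlinks.
        private static List<String> extractLinks(String url) {
            return new java.util.ArrayList<String>();
        }
    }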
By researching the literature of pioneering web robot developers and by staying in touch with resources such as the WWW Robots Mailing List, agent developers can avoid repeating costly mistakes.
Web crawler technology has become so widespread that it is now possible for end-users with little to no domain knowledge to target software agents at Internet resources. Without realizing the consequences of their actions, users can, through a graphical interface, command software agents to consume resources repetitively and wastefully. The key to the prevention of such potentially harmful abuses is accountability.
First, users must be identified, validated, and authorized before they can access the software agents. By removing the anonymity granted to users of agents available to the general public, willful abuses can be terminated. Additionally, the operators can be made to understand that the behavior of the agents, whether under their direct control or simply by virtue of ownership, reflects upon them.
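By way of illustration, the sketch below gates agent creation behind such checks; the Directory interface and all of the names are invented for this example, standing in for whatever user database a site actually maintains.

    public class AgentGate {

        // Hypothetical hooks into the site's user database.
        public interface Directory {
            boolean isAuthentic(String userId, String credential);
            boolean isAuthorized(String userId);
        }

        // Minimal stand-in for an agent that carries its owner's identity,
        // so that every action it takes can be traced back to a person.
        public static class OwnedAgent {
            public final String ownerId;
            public OwnedAgent(String ownerId) { this.ownerId = ownerId; }
        }

        private final Directory directory;

        public AgentGate(Directory directory) { this.directory = directory; }

        // Refuse to create an agent for anyone who is not identified,
        // validated, and authorized.
        public OwnedAgent launchFor(String userId, String credential) {
            if (!directory.isAuthentic(userId, credential)) {
                throw new SecurityException("unvalidated user: " + userId);
            }
            if (!directory.isAuthorized(userId)) {
                throw new SecurityException("user not authorized for agents: " + userId);
            }
            return new OwnedAgent(userId);
        }
    }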
Second, users must be made to bear the true costs of the resources that they consume. With an accounting system in place to track the expenditure of bandwidth and other computational resources, agent behaviors will quickly become far more efficient, intelligent, and directed as their human masters seek to minimize those costs, translated into real dollars.
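A minimal accounting sketch along these lines, with an assumed rate of five cents per megabyte and invented names throughout: each fetch is charged to the commanding user, and the accumulated total translates into a dollar figure at billing time.

    import java.util.HashMap;
    import java.util.Map;

    public class BandwidthLedger {

        private static final double DOLLARS_PER_MEGABYTE = 0.05;   // assumed rate

        private final Map<String, Long> bytesByUser = new HashMap<String, Long>();

        // Record that an agent fetched the given number of bytes on a user's behalf.
        public void charge(String userId, long bytes) {
            Long total = bytesByUser.get(userId);
            bytesByUser.put(userId, (total == null ? 0L : total.longValue()) + bytes);
        }

        // Translate a user's accumulated consumption into real dollars.
        public double billFor(String userId) {
            Long total = bytesByUser.get(userId);
            long bytes = (total == null) ? 0L : total.longValue();
            return bytes / 1.0e6 * DOLLARS_PER_MEGABYTE;
        }
    }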
The trick, then, is to implement intelligent mechanisms that maximize effectiveness while minimizing friction. This equates to the old adage of working smarter rather than harder. Too often, however, the additional step of inserting elegant performance enhancements is not taken once the brute-force prototype implementation is completed. The consequence is that, as time marches on and web spiders crawl along, the cumulative cost of the inefficiencies explodes as usage increases exponentially. It thus becomes a moral imperative for the developer, who will often know of but not bear the cost of sub-optimal design, to adapt the initial implementation as soon as possible by inserting both common and creative smart technologies. For consideration, examples of such are listed below.
Martijn Koster
"Robots Exclusion"
http://info.webcrawler.com/mak/projects/robots/exclusion.html
1997-09-04
Andrew Leonard
Bots: The Origin of New Species, "Raising the Stakes"
HardWired: San Francisco
1997-07-01
Paul Saffo
"It's the Context, Stupid"
Wired
http://www.wired.com/wired/2.03/departments/idees.fortes/context.html
1994-03
http://www.alumni.caltech.edu/~croft/research/agent/etiquette/
David Wallace Croft,
croftd@nexos.anser.org
© 1997 Analytic Services, Inc. (ANSER)
Posted 1997-10-08.