David Wallace Croft
Senior Intelligent Systems Engineer
Special Projects Division, Information Technology
Analytic Services, Inc. (ANSER)
croftd@nexos.anser.org
1997-10-08
The evolution of Internet Agents, bred by commercial interests to produce the quick and the thorough, must be tempered by the necessity to respect the boundaries of the environment.
The future belongs to neither the conduit nor content players,
but those who control the filtering, searching, and sense-making
tools we will rely on to navigate through the expanses of cyberspace.
Without a doubt, the agent arena will become a technological
battleground, as algorithms rather than content duel for market
dominance.
Paul Saffo, 1994, Wired
The increasing stresses on the Web's infrastructure caused by
both human and robot overuse deepened the growing split between
robot authors and webmasters. But that wasn't the only force
at work in an increasingly dire situation. Big business had
arrived on the Net. The speediest robot ensured the best
search engine -- so why worry about overloading Web servers
all over the Net? Entrepreneurs had come out of the closet
like roaches after dark. The flimsy fabric of the Robot
Exclusion Protocol offered little protection.
Andrew Leonard, 1997, Bots: The Origin of New Species
The Robots Exclusion Protocol has been the de facto standard for web spider etiquette since 1994.
A file called robots.txt within the top-level directory of any web server provides fair notice to any Internet software agent as to what is off-limits.
The format of the file details restrictions by agent name and by resource path. The following example, for instance, declares the entire site off-limits to every agent:
User-agent: *
Disallow: /
The Protocol is a strictly self-enforced prohibition with which all reputable agent developers voluntarily comply.
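As an illustration only, the following Java sketch shows one way an agent might honor the Protocol before requesting a page: it fetches /robots.txt, collects the Disallow prefixes in the record addressed to all agents, and refuses any path that falls under one of them. The class name and the simplifications (only the "User-agent: *" record, no comment handling, no caching of the file) are assumptions for this example, not part of the Protocol.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsCheck {

        // Returns false if the path falls under a Disallow rule in the
        // "User-agent: *" record of the host's robots.txt file.
        public static boolean isAllowed(String host, String path) throws Exception {
            List<String> disallowed = new ArrayList<String>();
            URL robots = new URL("http://" + host + "/robots.txt");
            BufferedReader in =
                new BufferedReader(new InputStreamReader(robots.openStream()));
            boolean allAgentsRecord = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                String lower = line.toLowerCase();
                if (lower.startsWith("user-agent:")) {
                    allAgentsRecord = line.substring(11).trim().equals("*");
                } else if (allAgentsRecord && lower.startsWith("disallow:")) {
                    String prefix = line.substring(9).trim();
                    if (prefix.length() > 0) {
                        disallowed.add(prefix);   // an empty Disallow imposes no restriction
                    }
                }
            }
            in.close();
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }
    }

A missing robots.txt file raises an exception in this sketch; since an absent file imposes no restrictions, a real crawler would treat that case as permission to proceed.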
An artificial web crawler can request so many resources simultaneously or in such rapid succession that the targeted web server becomes overloaded.
Such rapid-fire requests amount to a denial-of-service attack: for its duration, other clients, such as human customers, are denied access to the web server due to the sudden drop in responsiveness.
Internet agent developers can avoid incurring the wrath of webmasters by inserting a forced delay between the requests generated by their super-surfing automatons.
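A minimal sketch of such a delay in Java, with an assumed one-second interval (the courteous figure depends on the capacity of the target server):

    import java.io.InputStream;
    import java.net.URL;
    import java.util.List;

    public class PoliteFetcher {

        private static final long DELAY_MILLIS = 1000;   // assumed courtesy pause

        // Fetch each URL in turn, pausing between requests so that the
        // crawler never monopolizes the target server.
        public static void fetchAll(List<URL> urls) throws Exception {
            for (URL url : urls) {
                InputStream in = url.openStream();
                // ... read and process the document here ...
                in.close();
                Thread.sleep(DELAY_MILLIS);
            }
        }
    }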
The designs of some early web crawlers were so poor that they would occasionally consume excessive resources by trapping themselves in an infinite loop, requesting the same resources from an overloaded server again and again until mercifully terminated.
One mistake is the failure to record which pages have already been visited. By blindly following all hyperlinks, an agent may end up running in circles perpetually.
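One remedy is simple bookkeeping, sketched below in Java: a set of every URL already queued ensures that circular hyperlinks are followed at most once. The extractLinks() helper is a hypothetical placeholder for whatever routine fetches a page and returns its hyperlinks.

    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    public class LoopFreeCrawler {

        // Breadth-first crawl that records visited URLs so that pages
        // linked in a circle are fetched only once.
        public static void crawl(String startUrl) {
            Set<String> visited = new HashSet<String>();
            Queue<String> frontier = new LinkedList<String>();
            visited.add(startUrl);
            frontier.add(startUrl);
            while (!frontier.isEmpty()) {
                String url = frontier.remove();
                for (String link : extractLinks(url)) {
                    if (visited.add(link)) {   // add() returns false if already seen
                        frontier.add(link);
                    }
                }
            }
        }

        // Hypothetical placeholder: fetch the page and return its hyperlinks.
        private static List<String> extractLinks(String url) {
            return new java.util.ArrayList<String>();
        }
    }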
Another problem is black holes. These are web pages that are dynamically generated with hyperlinks to more web pages that are also created on the fly, so that each page leads to an apparently novel page without end. If you can imagine opening a door to find another door, and then opening that one to find the same, you get the picture.
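A record of visited pages is no defense here, because every generated page carries a genuinely new address. One precaution, sketched below with an assumed limit of ten hops, is to cap the depth of links followed from the starting page and abandon any branch that exceeds it; extractLinks() is the same hypothetical placeholder as above.

    import java.util.List;

    public class DepthLimitedCrawler {

        private static final int MAX_DEPTH = 10;   // assumed limit on link hops

        // Follow links recursively, but give up on any branch that runs
        // deeper than MAX_DEPTH; an endless chain of generated pages is
        // then abandoned after a bounded number of requests.
        public static void crawl(String url, int depth) {
            if (depth > MAX_DEPTH) {
                return;
            }
            for (String link : extractLinks(url)) {
                crawl(link, depth + 1);
            }
        }

        // Hypothetical placeholder: fetch the page and return its hyperlinks.
        private static List<String> extractLinks(String url) {
            return new java.util.ArrayList<String>();
        }
    }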
By researching the literature of pioneering web robot developers and by staying in touch with resources such as the WWW Robots Mailing List, agent developers can avoid repeating costly mistakes.
Web crawler technology has become so widespread that it is now possible for end-users with little to no domain knowledge to target software agents at Internet resources. Without realizing the consequences of their actions, users can, through a graphical interface, command software agents to consume resources repetitively and wastefully. The key to the prevention of such potentially harmful abuses is accountability.
First, users must be identified, validated, and authorized before they can access the software agents. By removing the anonymity granted to users of agents available to the general public, willful abuses can be terminated. Additionally, the operators can be made to understand that the behavior of the agents, whether under their direct control or simply by virtue of ownership, reflects upon them.
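By way of illustration, the sketch below gates agent creation behind such checks; the Directory interface and all of the names are invented for this example, standing in for whatever user database a site actually maintains.

    public class AgentGate {

        // Hypothetical hooks into the site's user database.
        public interface Directory {
            boolean isAuthentic(String userId, String credential);
            boolean isAuthorized(String userId);
        }

        // Minimal stand-in for an agent that carries its owner's identity,
        // so that every action it takes can be traced back to a person.
        public static class OwnedAgent {
            public final String ownerId;
            public OwnedAgent(String ownerId) { this.ownerId = ownerId; }
        }

        private final Directory directory;

        public AgentGate(Directory directory) { this.directory = directory; }

        // Refuse to create an agent for anyone who is not identified,
        // validated, and authorized.
        public OwnedAgent launchFor(String userId, String credential) {
            if (!directory.isAuthentic(userId, credential)) {
                throw new SecurityException("unvalidated user: " + userId);
            }
            if (!directory.isAuthorized(userId)) {
                throw new SecurityException("user not authorized for agents: " + userId);
            }
            return new OwnedAgent(userId);
        }
    }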
Second, users must be made to bear the true costs of the resources that they consume. With an accounting system in place to track the expenditure of bandwidth and other computational resources, agent behaviors will quickly become far more efficient, intelligent, and directed as their human masters seek to minimize those costs, translated into real dollars.
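A minimal accounting sketch along these lines, with an assumed rate of five cents per megabyte and invented names throughout: each fetch is charged to the commanding user, and the accumulated total translates into a dollar figure at billing time.

    import java.util.HashMap;
    import java.util.Map;

    public class BandwidthLedger {

        private static final double DOLLARS_PER_MEGABYTE = 0.05;   // assumed rate

        private final Map<String, Long> bytesByUser = new HashMap<String, Long>();

        // Record that an agent fetched the given number of bytes on a user's behalf.
        public void charge(String userId, long bytes) {
            Long total = bytesByUser.get(userId);
            bytesByUser.put(userId, (total == null ? 0L : total.longValue()) + bytes);
        }

        // Translate a user's accumulated consumption into real dollars.
        public double billFor(String userId) {
            Long total = bytesByUser.get(userId);
            long bytes = (total == null) ? 0L : total.longValue();
            return bytes / 1.0e6 * DOLLARS_PER_MEGABYTE;
        }
    }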
The trick, then, is to implement intelligent mechanisms that maximize effectiveness while minimizing friction. This equates to the old adage of working smarter rather than harder. Too often, however, the additional step of inserting elegant performance enhancements is not taken once the brute-force prototype implementation is completed. The consequence is that, as time marches on and web spiders crawl along, the cumulative cost of the inefficiencies explodes as usage increases exponentially. It thus becomes a moral imperative for the developer, who will often know of but not bear the cost of sub-optimal design, to adapt the initial implementation as soon as possible by inserting both common and creative smart technologies. For consideration, examples of such are listed below.
Martijn Koster
"Robots Exclusion"
http://info.webcrawler.com/mak/projects/robots/exclusion.html
1997-09-04
Andrew Leonard
Bots: The Origin of New Species, "Raising the Stakes"
HardWired: San Francisco
1997-07-01
Paul Saffo
"It's the Context, Stupid"
Wired
http://www.wired.com/wired/2.03/departments/idees.fortes/context.html
1994-03
http://www.alumni.caltech.edu/~croft/research/agent/etiquette/
David Wallace Croft,
croftd@nexos.anser.org
© 1997 Analytic Services, Inc. (ANSER)
Posted 1997-10-08.