How the web was won

  Harnessing Collective Intelligence  


Google's success

Google's life began as a native web application, never sold or packaged, but delivered as a service with customers paying directly or indirectly for the use of that service. No scheduled software releases, just continuous improvement. No licensing or sale, just usage. No porting to different platforms so that customers can run the software on their own equipment, just a massively scalable collection of commodity PCs running open source operating systems plus homegrown applications and utilities that no one outside the company ever gets to see.

Google isn't just a collection of software tools, it's a specialized database. Without the data, the tools are useless; without the software, the data is unmanageable. Software licensing and control over APIs are irrelevant, because the software never needs to be distributed, only performed, and because without the ability to collect and manage the data, the software is of little use. In fact, the value of the software is proportional to the scale of the data it helps to manage.

Google's service is not a server, though it is delivered by a massive collection of internet servers, nor does its flagship search site host the content that it enables users to find. Much like a phone call, which happens not just on the phones at either end of the call but on the network in between, Google happens in the space between browser, search engine, and destination content server, as a middleman between the user and their online experience.

Google's success came from the collective power of the small sites that make up the bulk of the web's content. Google figured out how to enable ad placement on virtually any web page. What's more, they sidestepped publisher- and ad-agency-oriented advertising formats such as banner ads and popups in favor of minimally intrusive, context-sensitive, consumer-friendly text-based advertising.


Internet Decentralization

BitTorrent, like other pioneers in the P2P file sharing movement, took a radical approach to internet decentralization. Every client is also a server; files are broken up into fragments and served from multiple locations, transparently harnessing the network of downloaders to provide both bandwidth and data to other users. The more popular the file, in fact, the faster it can be served, as there are more users providing bandwidth and fragments of the complete file.

BitTorrent thus demonstrates a key principle of winning the web: the service automatically gets better the more people use it. Every BitTorrent consumer brings their own resources to the party. There's an implicit "architecture of participation", a built-in ethic of cooperation, in which the service acts primarily as a broker, connecting the edges to each other and harnessing the power of the users themselves.
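To make the fragment idea concrete, here is a minimal sketch in Python, with hypothetical helper names, of how a file might be split into fixed-size pieces and fingerprinted so that fragments downloaded from many different peers can be verified and reassembled. Real BitTorrent clients use a richer metainfo format and a peer protocol on top of this, so treat it only as an illustration of the principle.

    import hashlib

    PIECE_SIZE = 256 * 1024  # split files into 256 KB pieces (illustrative size)

    def split_into_pieces(path, piece_size=PIECE_SIZE):
        """Yield (index, piece_bytes, sha1_digest) for each piece of the file.

        The digests let a downloader verify a piece no matter which peer
        supplied it, which is what makes swarming safe.
        """
        with open(path, "rb") as f:
            index = 0
            while True:
                piece = f.read(piece_size)
                if not piece:
                    break
                yield index, piece, hashlib.sha1(piece).hexdigest()
                index += 1

    def reassemble(pieces, expected_digests, out_path):
        """Write pieces (possibly fetched from different peers) back to disk,
        refusing any piece whose hash does not match the published digest."""
        with open(out_path, "wb") as out:
            for index, piece in sorted(pieces.items()):
                if hashlib.sha1(piece).hexdigest() != expected_digests[index]:
                    raise ValueError(f"piece {index} failed verification")
                out.write(piece)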


Harnessing Collective Intelligence

The central principle behind the success of those winning the web is that they have embraced the power of the web to harness our collective intelligence:

Hyperlinking is the foundation of the web.  As users add new content and new sites, they become bound into the structure of the web by other users discovering their content and linking to it. Much as synapses form in the brain with associations becoming stronger through repetition and intensity, the web of connections continues to grow from the output of the collective activity of all web users.
 
Yahoo!, the first great internet success story, was born as a catalog, a directory of links, an aggregation of the best work of thousands, then millions, of web users. While Yahoo! has since moved into the business of creating other types of content, its role as a portal to the collective work of the net's users remains the core of its value.
 
Google's breakthrough in search, which quickly made it the undisputed search market leader, was PageRank, a method of using the link structure of the web rather than just the characteristics of documents to provide better search results.
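The idea behind PageRank can be sketched in a few lines. The following is a simplified illustration, not Google's production algorithm: each page's score is repeatedly redistributed along its outbound links, with a damping factor modeling a surfer who occasionally jumps to a random page.

    def pagerank(links, damping=0.85, iterations=50):
        """Simplified PageRank by power iteration.

        links maps each page to the list of pages it links to.
        Returns a dict of page -> rank.
        """
        pages = set(links) | {p for targets in links.values() for p in targets}
        n = len(pages)
        rank = {page: 1.0 / n for page in pages}

        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / n for page in pages}
            for page in pages:
                targets = links.get(page, [])
                if not targets:            # dangling page: spread its rank everywhere
                    share = damping * rank[page] / n
                    for p in pages:
                        new_rank[p] += share
                else:
                    share = damping * rank[page] / len(targets)
                    for target in targets:
                        new_rank[target] += share
            rank = new_rank
        return rank

    # Pages linked to by many others (or by important others) rank higher.
    example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(example))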
 
eBay's product is the collective activity of all its users; like the web itself, eBay grows in response to user activity, and the company's role is to provide the context in which that activity can happen. What's more, eBay's competitive advantage comes almost entirely from the critical mass of buyers and sellers, which makes any new site offering similar services significantly less attractive.
 
Amazon sells the same products as its competitors, and it receives the same product descriptions, cover images, and editorial content from its vendors. But Amazon has made a science of user engagement. It has an order of magnitude more user reviews, more invitations to participate in varied ways on virtually every page, and, more importantly, it uses user activity to produce better search results. While a Barnesandnoble.com search is likely to lead with the company's own products or sponsored results, Amazon always leads with "most popular", a real-time computation based not only on sales but on other factors that "flow" around products. With an order of magnitude more user participation, it's no surprise that Amazon's sales also outpace its competitors'.
 
When innovative companies pick up on this insight and extend it further, they start to make their own mark on the web:

Wikipedia, an online encyclopedia based on the unlikely notion that an entry can be added by any web user, and edited by any other, is a radical experiment in trust, applying the dictum (originally coined in the context of open source software) that "with enough eyeballs, all bugs are shallow," to content creation. Wikipedia is already in the top 100 websites, and many think it will be in the top ten before long. This is a profound change in the dynamics of content creation!
 
Sites like del.icio.us and Flickr pioneered a style of collaborative categorization of sites using freely chosen keywords, often referred to as tags, allowing for the kind of multiple, overlapping associations that the brain itself uses, rather than rigid categories. For example, a Flickr photo of a puppy might be tagged both "puppy" and "cute", allowing retrieval along the natural axes generated by user activity.
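A tag system like this is, at bottom, a very simple data structure: a map from each freely chosen keyword to the set of items carrying it, with retrieval by intersecting tag sets. The sketch below uses hypothetical names and is not Flickr's or del.icio.us's actual code; it only illustrates the idea.

    from collections import defaultdict

    class TagIndex:
        """A minimal folksonomy-style index: items carry arbitrary user-chosen
        tags, and queries intersect the item sets behind each tag."""

        def __init__(self):
            self._by_tag = defaultdict(set)

        def tag(self, item, *tags):
            for t in tags:
                self._by_tag[t.lower()].add(item)

        def find(self, *tags):
            """Return items carrying every requested tag."""
            sets = [self._by_tag.get(t.lower(), set()) for t in tags]
            return set.intersection(*sets) if sets else set()

    photos = TagIndex()
    photos.tag("IMG_1234.jpg", "puppy", "cute")
    photos.tag("IMG_5678.jpg", "puppy", "muddy")

    print(photos.find("puppy"))          # both photos
    print(photos.find("puppy", "cute"))  # only IMG_1234.jpg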
 
It is a truism that the greatest internet success stories don't advertise their products. Their adoption is driven by "viral marketing", that is, by recommendations propagating directly from one user to another. You can almost make the case that if a site or product relies on advertising to get the word out, it isn't going to win.
 
Even much of the infrastructure of the web - including the Linux, Apache, MySQL, and Perl, PHP, and Python code involved in most web servers - relies on the peer-production methods of open source, in themselves an instance of collective, net-enabled intelligence. There are more than 100,000 open source software projects listed on SourceForge.net. Anyone can add a project, anyone can download and use the code, and new projects migrate from the edges to the center as a result of users putting them to work, an organic software adoption process relying almost entirely on viral marketing.


Blogging and the Wisdom of Crowds

One of the most highly touted features of the web's current era is the rise of blogging. Personal home pages have been around since the early days of the web, and the personal diary and daily opinion column around much longer than that, so just what is new about blogging?

At its most basic, a blog is just a personal home page in diary format. But while the chronological organization of a blog may seem like a trivial difference, it drives an entirely different delivery, advertising, and value chain.

One of the things that has made a difference is a technology called RSS. RSS is the most significant advance in the fundamental architecture of the web since early hackers realized that CGI could be used to create database-backed websites. RSS allows someone to link not just to a page, but to subscribe to it, with notification every time that page changes. Call it the "live web".

Now, of course, "dynamic websites" replaced static web pages well over ten years ago. What's dynamic about the live web are not just the pages, but the links. A link to a weblog is expected to point to a perennially changing page, with "permalinks" for any individual entry and notification for each change. An RSS feed is thus a much stronger link than, say, a bookmark or a link to a single page.
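To see why a feed is a "stronger" link, consider what a subscriber actually does: it polls the feed URL and notices items it hasn't seen before. Here is a minimal sketch using only Python's standard library (the feed URL is a placeholder); production aggregators also handle the various RSS and Atom dialects, conditional GETs, and error recovery.

    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_items(feed_url):
        """Yield (title, link) pairs from a simple RSS 2.0 feed."""
        with urllib.request.urlopen(feed_url) as response:
            tree = ET.parse(response)
        for item in tree.iterfind(".//item"):
            yield item.findtext("title"), item.findtext("link")

    seen = set()

    def poll(feed_url):
        """Print only entries that have appeared since the last poll."""
        for title, link in fetch_items(feed_url):
            if link not in seen:
                seen.add(link)
                print("new entry:", title, link)

    # poll("https://example.com/feed.xml")  # placeholder URL; call periodically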

RSS also means that the web browser is not the only means of viewing a web page. While some RSS aggregators are web-based, others are desktop clients, and still others allow users of portable devices to subscribe to constantly updated content.

RSS is now being used to push not just notices of new blog entries, but also all kinds of data updates, including stock quotes, weather data, and photo availability. This use is actually a return to one of its roots: RSS was born in 1997 out of the confluence of Dave Winer's "Really Simple Syndication" technology, used to push out blog updates, and Netscape's "Rich Site Summary", which allowed users to create custom Netscape home pages with regularly updated data flows. Netscape lost interest, and the technology was carried forward by blogging pioneer Userland, Winer's company. In the current crop of applications, we see, though, the heritage of both parents.

But RSS is only part of what makes a weblog different from an ordinary web page. The other key innovation is the permalink, a permanent URL for each individual entry.

The permalink may seem like a trivial piece of functionality now, but it was effectively the device that turned weblogs from an ease-of-publishing phenomenon into a conversational mess of overlapping communities. For the first time it became relatively easy to gesture directly at a highly specific post on someone else's site and talk about it. Discussion emerged. Chat emerged. And, as a result, friendships emerged and became more entrenched. The permalink was the first, and most successful, attempt to build bridges between weblogs.

Not only can people subscribe to one another's sites and easily link to individual comments on a page, but, via a mechanism known as trackbacks, they can also see when anyone else links to their pages and can respond, either with reciprocal links or by adding comments.
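Mechanically, a trackback is nothing exotic: the linking blog sends an HTTP POST to a "ping URL" advertised by the post it is citing, carrying a few form fields that describe the citing entry. The following is a rough sketch, assuming the conventional field names from the original TrackBack specification and placeholder URLs; real blogging software also discovers the ping URL from the target page and checks the XML response for errors.

    import urllib.parse
    import urllib.request

    def send_trackback(ping_url, entry_url, title, excerpt, blog_name):
        """Notify another blog that we've linked to one of its posts."""
        payload = urllib.parse.urlencode({
            "url": entry_url,        # the post on *our* site that does the linking
            "title": title,
            "excerpt": excerpt,
            "blog_name": blog_name,
        }).encode("utf-8")
        request = urllib.request.Request(
            ping_url,
            data=payload,
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8", "replace")  # small XML reply

    # Example with placeholder URLs:
    # send_trackback("https://example.org/tb-ping/42",
    #                "https://myblog.example.com/2005/08/response.html",
    #                "A response to your post", "I think you're half right...",
    #                "My Weblog")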

If an essential part of winning the web is harnessing collective intelligence, turning the web into a kind of global brain, then the social blogosphere is just the equivalent of that constant mental chatter in the brain, the voice we hear in all of our heads. It may not reflect the deep structure of the brain, which is often unconscious anyway, but it's still the equivalent of conscious thought. And as a reflection of conscious thought, the blogosphere can have some powerful effects.

First, because search engines use link structure to help predict useful pages, bloggers, as the most prolific and timely linkers, have a disproportionate role in shaping search engine results. Second, because the blogging community is so highly self-referential, bloggers paying attention to other bloggers magnifies their visibility and power. The "echo chamber" is also an amplifier.

If it were merely an amplifier, blogging would be really uninteresting. But like Wikipedia, blogging harnesses collective intelligence as a kind of filter. This is where "the wisdom of crowds" comes into play, and much as PageRank produces better results than an analysis of any individual document, the collective attention of the blogosphere selects what has the most value.

While mainstream media may see individual blogs as competitors, the real competition is with the blogosphere as a whole. This is not just competition between sites, but competition between emerging business models. The world of winning the web is also the world in which the former audience, not just a few people in a back room, gets to decide what's hot and what's not.


Data: the Next Intel Inside

Every significant internet application to date has been backed by a specialized database: Google's web crawl, Yahoo!'s directory (and web crawl), Amazon's database of products, eBay's database of products and sellers, MapQuest's map databases, Napster's distributed song database.  Database management is a core competency of Web 2.0 companies, so much so that these applications are referred to as "infoware" rather than software.

This alone leads to a key question: Who owns the data?

In the internet era, one can already see a number of cases where control over the database has led to market control and outsized financial returns. The monopoly on domain name registry initially granted by government fiat to Network Solutions (later purchased by Verisign) was one of the first great moneymakers of the internet. While we've argued that business advantage via controlling software APIs is much more difficult in the age of the internet, control of key data sources is not, especially if those data sources are expensive to create.

Look at the copyright notices at the base of every map served by MapQuest, maps.yahoo.com, maps.msn.com, or maps.google.com, and you'll see the line "Maps copyright NavTeq, TeleAtlas," or with the new satellite imagery services, "Images copyright Digital Globe." These companies made substantial investments in their databases (NavTeq alone reportedly invested $750 million to build their database of street addresses and directions. Digital Globe spent $500 million to launch their own satellite to improve on government-supplied imagery.) NavTeq has gone so far as to imitate Intel's familiar Intel Inside logo: Cars with navigation systems bear the imprint, "NavTeq Onboard." Data is indeed the Intel Inside of these applications, a sole source component in systems whose software infrastructure is largely open source or otherwise commodified.

The now hotly contested web mapping arena demonstrates how a failure to understand the importance of owning an application's core data will eventually undercut its competitive position. MapQuest pioneered the web mapping category in 1995, yet when Yahoo!, and then Microsoft and Google, decided to enter the market, they were easily able to offer a competing application simply by licensing the same data.

Contrast, however, the position of Amazon.com. Like competitors such as Barnesandnoble.com, its original database came from ISBN registry provider R.R. Bowker. But unlike MapQuest, Amazon relentlessly enhanced the data, adding publisher-supplied data such as cover images, table of contents, index, and sample material. Even more importantly, they harnessed their users to annotate the data, such that after ten years, Amazon, not Bowker, is the primary source for bibliographic data on books, a reference source for scholars and librarians as well as consumers. Amazon also introduced their own proprietary identifier, the ASIN, which corresponds to the ISBN where one is present, and creates an equivalent namespace for products without one. Effectively, Amazon "embraced and extended" their data suppliers.
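The "embrace and extend" move on identifiers is easy to picture: use the industry-standard ISBN whenever one exists, and mint an internal identifier for everything else, so that every product lives in one namespace. The sketch below is purely illustrative and does not reproduce Amazon's actual ASIN scheme.

    import itertools

    class ProductCatalog:
        """Illustrative catalog that keys every product by a single identifier:
        the ISBN when the product has one, otherwise a locally minted ID."""

        def __init__(self):
            self._products = {}
            self._counter = itertools.count(1)

        def add(self, title, isbn=None):
            product_id = isbn if isbn else f"X{next(self._counter):09d}"
            self._products[product_id] = {"title": title, "isbn": isbn}
            return product_id

        def get(self, product_id):
            return self._products[product_id]

    catalog = ProductCatalog()
    book_id = catalog.add("The Cathedral & the Bazaar", isbn="9780596001087")
    toaster_id = catalog.add("Two-Slice Toaster")   # no ISBN: gets a local ID
    print(book_id, toaster_id)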

Imagine if MapQuest had done the same thing, harnessing their users to annotate maps and directions, adding layers of value. It would have been much more difficult for competitors to enter the market just by licensing the base data.

The introduction of Google Maps provides a living laboratory for the competition between application vendors and their data suppliers. Google's lightweight programming model has led to the creation of numerous value-added services in the form of mashups that link Google Maps with other internet-accessible data sources. 

The race is on to own certain classes of core data: location, identity, calendaring of public events, product identifiers and namespaces. In many cases, where there is significant cost to create the data, there may be an opportunity for an Intel Inside style play, with a single source for the data. In others, the winner will be the company that first reaches critical mass via user aggregation, and turns that aggregated data into a system service.

For example, in the area of identity, PayPal, Amazon's 1-click, and the millions of users of communications systems may all be legitimate contenders to build a network-wide identity database. (In this regard, Google's recent attempt to use cell phone numbers as an identifier for Gmail accounts may be a step towards embracing and extending the phone system.) Meanwhile, startups like Sxip are exploring the potential of federated identity, in quest of a kind of "distributed 1-click" that will provide a seamless, web-wide identity subsystem. While the jury's still out on the success of any particular startup or approach, it's clear that standards and solutions in these areas, effectively turning certain classes of data into reliable subsystems of the "internet operating system", will enable the next generation of applications.

A further point must be noted with regard to data, and that is user concerns about privacy and their rights to their own data. In many of the early web applications, copyright is only loosely enforced. For example, Amazon lays claim to any reviews submitted to the site, but in the absence of enforcement, people may repost the same review elsewhere. However, as companies begin to realize that control over data may be their chief source of competitive advantage, we may see heightened attempts at control.

Much as the rise of proprietary software led to the Free Software movement, expect the rise of proprietary databases to result in a Free Data movement within the next decade. One can see early signs of this countervailing trend in open data projects such as Wikipedia and the Creative Commons, and in software projects like Greasemonkey, which allows users to take control of how data is displayed on their computer.


End of the Software Release Cycle

One of the defining characteristics of internet-era software is that it is delivered as a service, not as a product. This leads to two fundamental changes in the business model of such a company:

1.  Operations must become a core competency. Google's and Yahoo!'s expertise in product development must be matched by an expertise in daily operations. So fundamental is the shift from software as a standalone artifact to software as a service that the software will cease to perform unless it is maintained on a daily basis!  Google must continuously crawl the web and update its indices, continuously filter out link spam and other attempts to influence its results, and continuously and dynamically respond to hundreds of millions of asynchronous user queries, simultaneously matching them with context-appropriate advertisements.

It's no accident that Google's system administration, networking, and load balancing techniques are perhaps even more closely guarded secrets than their search algorithms. Google's success at automating these processes is a key part of their cost advantage over competitors.

It's also no accident that scripting languages such as Perl, Python, PHP, and Ruby play such a large role in companies winning the web. Perl was famously described by Hassan Schroeder, Sun's first webmaster, as "the duct tape of the internet." Dynamic languages (often called scripting languages and looked down on by the software engineers of the era of software artifacts) are the tool of choice for system and network administrators, as well as for application developers building dynamic systems that require constant change.

2.  Users must be treated as co-developers, in a reflection of open source development practices (even if the software in question is unlikely to be released under an open source license). The open source dictum "release early and release often" has in fact morphed into an even more radical position, "the perpetual beta", in which the product is developed in the open, with new features slipstreamed in on a monthly, weekly, or even daily basis.

Real-time monitoring of user behavior, to see which new features are used and how they are used, becomes another required core competency for a company deploying new builds every day. This is clearly a radically different development model! While not all web applications are developed in as extreme a style, almost all web applications have a development cycle that is radically unlike anything from the PC or client-server era.
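In practice, "real-time monitoring" can start as something very small: instrument each feature, count how often it is actually used, and review the numbers before deciding whether the feature stays in the next daily build. The sketch below uses hypothetical names and an in-memory counter purely for illustration; real services feed such events into logging and analytics pipelines.

    from collections import Counter
    from datetime import datetime, timezone

    usage = Counter()

    def track(feature_name):
        """Decorator that records each invocation of a feature."""
        def wrap(func):
            def wrapper(*args, **kwargs):
                usage[feature_name] += 1
                return func(*args, **kwargs)
            return wrapper
        return wrap

    @track("saved_searches")
    def run_saved_search(user, query):
        ...  # the actual feature goes here

    def daily_report():
        """What the team reviews before deciding which features survive."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        for feature, count in usage.most_common():
            print(f"{stamp}  {feature}: {count} uses")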

   . . . and this is the reason why Microsoft won't be able to beat Google: 

  Microsoft's business model depends on everyone upgrading their computing environment every couple of years.

   Google's depends on everyone exploring what's new in their computing environment every hour of every day!

While Microsoft has demonstrated an enormous ability to ultimately beat its competition, there's no question that this time the competition will ultimately require Microsoft (and, by extension, every other software company) to become a more collective part of a more intelligent community.

Web-winning companies will always enjoy a natural advantage: they have no old baggage, in the form of legacy business models, to shed.