Monday 28 July 2008

Google, whacking and cuilness

Google have just announced that they think the internet has 1 trillion unique URLs (Uniform Resource Locators) – that is 1,000,000,000,000 (a million million) web addresses. A URL is the bit that you type in the address bar in Internet Explorer, like www.bbc.co.uk or www.microsoft.com.

So that is the number of 'websites', but how many unique pages are there?

In 1998, the first Google index contained 26,000,000 pages. By the year 2000 they were up to 1 billion pages. Today, the answer is that even Google don't know! Several billion new pages appear every day, so even if they started counting, new pages would be appearing faster than they could reasonably be counted.

In theory the internet is infinite. Take the example of a web calendar: each 'day' page has a link that leads to the next day, and of course that goes on forever :-). Pages like these are not meaningful to most people, so Google just ignores them.
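To see why a calendar site is effectively infinite, here is a toy sketch (my own illustration, not anything Google actually runs): each 'page' simply links on to the next day's page, so a crawler dutifully following links would never reach the end.

```python
import itertools
from datetime import date, timedelta

def calendar_pages(start=date(2008, 7, 28)):
    """Generate an endless stream of calendar 'pages'.
    Each day's page contains one link: to the next day's page,
    so following links never terminates."""
    day = start
    while True:
        next_day = day + timedelta(days=1)
        yield {"url": f"/calendar/{day.isoformat()}",
               "links": [f"/calendar/{next_day.isoformat()}"]}
        day = next_day

# Take just the first three pages -- the stream itself never ends
for page in itertools.islice(calendar_pages(), 3):
    print(page["url"])
# /calendar/2008-07-28
# /calendar/2008-07-29
# /calendar/2008-07-30
```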

Google has an index of all (ok then, most) of the pages on the internet. They start from certain pages, each of which has loads of links to other pages. They have a 'spider', sometimes called a 'crawler' or a 'bot', that follows these links; when it finds unique content it indexes it, reading the content and building the search index of keywords that lets you search for anything. Then it follows the links to all the pages linked from the newly indexed page and starts all over again. Eventually, after removing all the broken links and duplicate pages, it has an index of everything – their stated goal is to "index all the world's data".
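The crawling idea above can be sketched in a few lines. This is a deliberately tiny model, not Google's real spider: `fetch_links` here is a stand-in for downloading a page and extracting its outgoing links, and the "index" is just a list of visited URLs.

```python
from collections import deque

def crawl(start_urls, fetch_links, limit=100):
    """Toy breadth-first 'spider': follow links out from seed pages,
    index each unique page exactly once, and skip anything already
    seen (this is what kills duplicates and cycles)."""
    seen = set(start_urls)
    queue = deque(start_urls)
    index = []                       # pages in the order they were indexed
    while queue and len(index) < limit:
        url = queue.popleft()
        index.append(url)            # "read the content, index the keywords"
        for link in fetch_links(url):
            if link not in seen:     # ignore duplicates / already-queued pages
                seen.add(link)
                queue.append(link)
    return index

# A tiny fake web: each page mapped to its outgoing links
web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],      # a cycle back to the start -- dedup prevents a loop
    "d": [],
}
print(crawl(["a"], lambda u: web.get(u, [])))
# ['a', 'b', 'c', 'd'] -- each page indexed once, despite the cycle
```

The `seen` set is doing the important work: without it, the cycle a → c → a (or an infinite calendar) would keep the spider busy forever.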

Google updates their index continually, re-processing the web-link graph several times every day. This graph covers the 1 trillion URLs, and Google likens exploring it to exploring every junction of every road in the US – except 50,000 times bigger, and all this so that you can search for the latest gossip/pictures of Brad Pitt/Angelina Jolie/Big Brother housemates (take your pick).

So, now that we know the size of the internet, the next thing to do is to try to find a "Googlewhack" – a search for two words that returns a single result, i.e. the only page on the entire internet that contains your two chosen search terms. There is a website dedicated to these, along with the official rules and a page to allow you to post any Googlewhack you find (yes, I have found one and no, I can't remember what it was – it was about four years ago!). Of course, finding and posting a Googlewhack is self-defeating, as Google then indexes the new page recording it as a Googlewhack, so the search no longer returns a single result.

The comedian Dave Gorman created a stage show that toured the UK, France, Australia, Canada and the US, and has written a spin-off Sunday Times best-selling book about googlewhacking and googlewhackers.

Finally, just when you thought it was safe (and/or impossible), a new search engine has popped up with the intention of knocking Google off their perch. Cuil claim to index 121,617,892,992 web pages, and also that this is three times more than Google (who no longer say how many pages they index). Their indexing method is (allegedly) better/stronger/faster than Google's, their page is nicer and they are generally all-round nice guys, who will make you a cup of tea if you call in at their offices in California (ok, I made that up about the tea). Have a look at www.cuil.com (from the Gaelic for knowledge or hazel, and pronounced cool :-)). As a comparison, if you search for "Googlewhack Adventure" (Dave Gorman's book title), Cuil returns 23,098 pages, while Google returns "about 26,600". I will leave you to decide which is better…
