Google - How It Works

Hardware
To provide sufficient service capacity, Google’s physical infrastructure consists of clusters of computers situated around the world, known as server farms. These server farms consist of a large number of commodity-level computers running Linux-based systems on top of GFS, the Google File System, with the largest of these farms having over 1,000 storage nodes and over 300 TB of disk storage (Ghemawat, S., Gobioff, H., and Leung, S. T., 2003, p. 2).
It has been speculated that Google has the world’s largest computer. One estimate credits Google with up to:
• 899 racks
• 79,112 machines
• 158,224 CPUs
• 316,448 GHz of processing power
• 158,224 GB of RAM
• 6,180 TB of hard drive space

How Google Handles Search Queries
When a user enters a query into the search box at Google.com, it is sent at random to one of Google’s many clusters, and that cluster then handles the query on its own. A load balancer monitoring the cluster spreads the request across the servers in the cluster to keep the load on the hardware even. The actual search takes place in two phases (Barroso, L. A., Dean, J., and Hölzle, U., 2003, p. 23).
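Before looking at those two phases, the load-balancing step itself can be pictured with a rough sketch. The round-robin strategy, the LoadBalancer class and the server names below are all hypothetical; Google’s actual balancer logic is not public.

from itertools import cycle

# Hypothetical round-robin balancer: queries are handed to the
# cluster's servers in rotation so no single machine is overloaded.
class LoadBalancer:
    def __init__(self, servers):
        self._servers = cycle(servers)  # endless rotation over the servers

    def route(self, query):
        return next(self._servers), query

balancer = LoadBalancer(["server-01", "server-02", "server-03"])
print(balancer.route("how google works"))  # ('server-01', 'how google works')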
In the first phase, the words in the query are checked against lists held on index servers, which record the documents matching each word. Cross-referencing the query words identifies a relevant set of documents, and the PageRank system contributes to the score computed for each document, which in turn determines the document’s position on the results page (Barroso, L. A., Dean, J., and Hölzle, U., 2003, p. 23).
The results from the index servers are then sent to the document servers in the form of document identifiers, or docids (Barroso, L. A., Dean, J., and Hölzle, U., 2003, p. 23). Residing in the document servers is a copy of the World Wide Web, from which a summary of each matching page’s contents is retrieved. The results are then processed and returned as an HTML document that can be displayed in the user’s browser as a webpage.
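The two phases can be pictured with a toy example. The in-memory dictionaries standing in for the index servers and document servers below are invented for illustration; the real structures, scoring and distribution across machines are far more elaborate.

# Phase 1: index servers map each query word to (docid, score) postings.
index_servers = {
    "google": [(1, 0.9), (2, 0.4)],
    "search": [(1, 0.7), (3, 0.6)],
}

# Phase 2: document servers map docids to stored copies of the pages,
# from which result snippets are drawn.
document_servers = {
    1: "Google - How It Works ...",
    2: "Inside a Google server farm ...",
    3: "How search engines rank pages ...",
}

def search(query):
    scores = {}
    for word in query.lower().split():
        for docid, score in index_servers.get(word, []):
            scores[docid] = scores.get(docid, 0.0) + score
    ranked = sorted(scores, key=scores.get, reverse=True)  # best first
    return [(docid, document_servers[docid]) for docid in ranked]

print(search("google search"))  # docid 1 scores highest here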

The PageRank System
PageRank, named after Larry Page, who devised it, is one of the ways in which Google determines the importance of a page, which in turn decides where the page appears in the results list.
The exact PageRank algorithm, as given in “The Anatomy of a Large-Scale Hypertextual Web Search Engine” (Brin, S., Page, L., 2000), is as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank is based on an “intuitive” model of a web surfer’s behaviour: it estimates the probability that a surfer who randomly follows links will arrive at a given page, and it takes into account the pages that link to that page. Using Yahoo as an example, the justification is that if a page like Yahoo links directly to another page, that page is very likely to be of high quality (Brin, S., Page, L., 2000).
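The formula above can be evaluated iteratively: give every page an initial rank, then repeatedly apply the equation until the values settle. The sketch below does this for a made-up three-page link graph; the graph, the starting ranks and the iteration count are illustrative only.

# links[page] lists the pages that `page` points to (its outgoing links).
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting ranks
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to `page`.
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Toy web: A and B link to each other, and both link to C.
print(pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": []}))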

The Googlebot
Google’s servers are populated by Google’s web crawler, the Googlebot, which moves from site to site on the internet, downloading copies of web pages and saving them in the Google index (also known as the cache) for future reference (Google Guide, 2003).
The Googlebot itself runs on many computers that simultaneously access thousands of web pages. It emulates a browser, so most webmasters will find that its visits are recorded in the “browser” section of the web site’s log rather than in the “spider” section where most other web crawlers register (Sullivan, R., 2004).
There are two methods by which the Googlebot finds a web page: either it reaches the page by “crawling” through links, or it goes to the page after it has been submitted by the webmaster (Google Guide, 2003). Given a base link, for example http://wiki.media-culture.org.au/, the Googlebot will follow every link on the index page and on each subsequent page until the entire site has been indexed.
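That link-following behaviour can be sketched as a bare-bones, single-threaded crawler using only the Python standard library. Googlebot’s real implementation is massively parallel and not public, so this is purely an illustration of the idea.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects the href of every <a> tag encountered in a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(base_url, limit=10):
    index = {}                        # url -> cached page source
    queue, seen = [base_url], {base_url}
    while queue and len(index) < limit:
        url = queue.pop(0)
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue                  # skip unreachable pages
        index[url] = html             # keep a copy, like Google's cache
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay within the submitted site, as in the example above.
            if absolute.startswith(base_url) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index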

Google File System
The Google File System (GFS) is a proprietary file management system developed for Google by Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung as a means of handling the massive number of requests across its large number of server clusters.
Like most other distributed file systems, it was designed for performance, to handle the large number of users; scalability, to cope with inevitable expansion; reliability, to ensure maximum uptime; and availability, to ensure computers are on hand to handle queries (Ghemawat, S., Gobioff, H., and Leung, S. T., 2003).
Because of Google’s decision to use a large number of commodity-level computers instead of a smaller number of server-class systems, the Google File System had to be designed to handle system failures, which resulted in constant monitoring of systems, error detection, fault tolerance and automatic recovery being built in (Ghemawat, S., Gobioff, H., and Leung, S. T., 2003). That meant clusters would have to hold multiple replicas of the information created by Google’s web crawlers. This is especially relevant to Google’s newly implemented GMail, where users’ personal email must be backed up to prevent loss of information.
Because of the size of the Google database, the system also had to be designed to handle huge multi-gigabyte files totalling many terabytes. It was designed this way because storing the same data as kilobyte-sized files would mean managing billions of files, which would prove unwieldy (Ghemawat, S., Gobioff, H., and Leung, S. T., 2003). Another method employed to handle huge files is that changes (known as mutations) are appended to files rather than overwriting existing data, which minimizes file access times.
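The append-rather-than-overwrite idea can be shown with a toy example. The sketch below uses an ordinary local file as a stand-in for a GFS chunk; the real system appends records to replicated chunks spread across many machines, which this does not model.

def append_record(path, record):
    # Opening in append mode means existing data is never rewritten;
    # each mutation simply adds a record at the end of the file.
    with open(path, "a") as chunk:
        chunk.write(record + "\n")

append_record("crawl.log", "fetched http://wiki.media-culture.org.au/")
append_record("crawl.log", "refetched http://wiki.media-culture.org.au/")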

GFS being a proprietary system means that software applications can be custom-built around it (Ghemawat, S., Gobioff, H., and Leung, S. T., 2003), ensuring that Google has maximum control over the system while allowing it to remain flexible.
