Google File System (GFS)
Google requires a robust and very large data storage system to hold a great deal of data and to give its users the ability to access, create and alter that data. Google does not manage all this with an environment built from high-powered, expensive computers. Instead, it manages its data through its proprietary Google File System (GFS), a large distributed file system based on the principle of using inexpensive commodity components while allowing hundreds of clients to access the data.
Since GFS deals with large data files, the core concerns for its designers were the manageability, scalability, fault tolerance and consistency of the system. GFS was designed so that it could easily manage large data files and also give users quick access to the documents they want.
Structure of Google File System
It is well known that manipulating and accessing large data files is time-consuming and takes up a great deal of network bandwidth. To handle large data files efficiently and reduce access time for users, GFS divides each file into chunks of 64 megabytes (MB). Each chunk has a unique identification number (the chunk handle), and chunks are replicated on different computers to tolerate failures. Moreover, checksums are stored for the chunk data to ensure data integrity.
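The chunk arithmetic and the integrity check described above can be illustrated with a minimal sketch. The function names are hypothetical, and CRC32 is only an assumed stand-in for the checksum; the text above does not say which checksum GFS actually uses.

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk size, as stated in the text


def chunk_index(byte_offset, chunk_size=CHUNK_SIZE):
    """Map a byte offset in a file to (chunk index, offset inside that chunk)."""
    return divmod(byte_offset, chunk_size)


def chunk_count(file_size, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return (file_size + chunk_size - 1) // chunk_size


def checksum(data):
    """Assumed checksum: CRC32 over the given bytes."""
    return zlib.crc32(data)


def verify(data, expected):
    """Return True if the data still matches the checksum recorded earlier."""
    return checksum(data) == expected
```

For example, a 200 MB file needs four chunks, and byte offset 200 MB falls 8 MB into the chunk with index 3.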
Google File System is organized into clusters of computers, and within each cluster there is one master server, several chunk servers and several clients. Each file chunk is replicated three times on different chunk servers to attain a high level of reliability. For a given chunk, one replica is called the primary while the other two are called secondaries.
The master stores the file system metadata, which includes the namespace, access control information, the mapping from files to chunks and the current chunk locations. The master server communicates with the chunk servers through HeartBeat messages. Clients are applications such as Google Apps or Google Docs, which place file requests. The requested file data does not pass through the master server. Instead, the chunk servers transfer it directly to the client.
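The division of work between the master and the chunk servers can be sketched as follows. This is a toy in-memory model under stated assumptions: the class names, method names and dictionary layout are hypothetical and are not the real GFS interfaces. It only shows that the master answers metadata questions while the data itself comes from a chunk server.

```python
class Master:
    """Holds metadata only: which chunks make up a file and where they live."""

    def __init__(self):
        self.file_to_chunks = {}   # file name -> list of chunk handles
        self.chunk_locations = {}  # chunk handle -> list of chunk server names

    def lookup(self, filename, chunk_idx):
        handle = self.file_to_chunks[filename][chunk_idx]
        return handle, self.chunk_locations[handle]


class ChunkServer:
    """Holds the actual chunk data, keyed by chunk handle."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


def client_read(master, servers, filename, chunk_idx, offset, length):
    # Ask the master for metadata only.
    handle, locations = master.lookup(filename, chunk_idx)
    # Fetch the data directly from one of the chunk servers holding a replica.
    server = servers[locations[0]]
    return server.read(handle, offset, length)
```

Here `servers` is an assumed dictionary from chunk server name to `ChunkServer` object, standing in for the network connections a real client would open.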
Working of Google File System
Google File System works by using two core concepts: the mutation and the lease. A mutation is a change made to a chunk by a write or append operation. Leases are used to maintain a consistent mutation order across all the replicas. The master server grants the chunk lease to one replica, which acts as the primary. The primary picks a serial order for the mutations, and the secondary replicas follow the same order. Thus the global mutation order is defined first by the lease grant order chosen by the master and, within a lease, by the serial numbers assigned by the primary. In GFS, a write request from a client follows these numbered steps:
1. The client asks the master which chunk server holds the current lease for the chunk and where the other (secondary) replicas are located.
2. The master server replies with the locations of the primary and secondary replicas. The client caches these locations for future mutations and needs to contact the master again only when the primary replica becomes unreachable or no longer holds the lease.
3. The client pushes the data to the replicas and then sends a write request to the primary replica.
4. The primary replica assigns serial numbers to the mutations and forwards the same serial mutation order to the secondary replicas.
5. The secondary replicas reply to the primary, indicating that they have completed the write request in the order supplied by the primary.
6. The primary replica then informs the client that the write request has completed and, in case of errors, reports them as well.
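Steps 3 to 6 above can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the real protocol: the class and method names are hypothetical, the master lookup of steps 1 and 2 is left out, every mutation is modeled as a simple append, and error handling is reduced to a single success flag.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}          # data id -> bytes pushed by the client, not yet applied
        self.chunk = bytearray()  # chunk contents after applied mutations
        self.applied = []         # serial numbers in the order they were applied

    def push(self, data_id, data):
        # Step 3 (first half): the client pushes data to every replica.
        self.buffer[data_id] = data

    def apply(self, serial, data_id):
        # Apply a buffered mutation under the serial number chosen by the primary.
        self.chunk += self.buffer.pop(data_id)
        self.applied.append(serial)
        return True


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        # Step 4: assign a serial number and forward the same order to the secondaries.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        # Step 5: collect the acknowledgements from the secondaries.
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        # Step 6: report overall success to the client.
        return all(acks)


def client_write(primary, data_id, data):
    for replica in [primary] + primary.secondaries:
        replica.push(data_id, data)
    # Step 3 (second half): send the write request to the primary.
    return primary.write(data_id)
```

Because only the primary hands out serial numbers, every replica applies the mutations in the same order and ends up with identical chunk contents.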