Friday, April 11, 2014

Why Hadoop doesn't use Java Serialization?



Java comes with its own serialization mechanism, called Java Object Serialization (often referred to simply as "Java Serialization"), that is tightly integrated with the language, so it's natural to ask why this wasn't used in Hadoop. Here's what Doug Cutting said in response to that question:
Why didn’t I use Serialization when we first started Hadoop? Because it looked
big and hairy and I thought we needed something lean and mean, where we had
precise control over exactly how objects are written and read, since that is central
to Hadoop. With Serialization you can get some control, but you have to fight for
it.
The logic for not using RMI was similar. Effective, high-performance inter-process
communications are critical to Hadoop. I felt like we’d need to precisely control
how things like connections, timeouts and buffers are handled, and RMI gives you
little control over those.
The problem is that Java Serialization doesn't meet the criteria for a serialization format listed earlier: compact, fast, extensible, and interoperable.
Java Serialization is not compact: it writes the class name of each object being written to the stream; this is true of classes that implement java.io.Serializable or java.io.Externalizable. Subsequent instances of the same class write a reference handle to the first occurrence, which occupies only 5 bytes. However, reference handles don't work well with random access, since the referent class may occur at any point in the preceding stream, that is, there is state stored in the stream. Even worse, reference handles play havoc with sorting records in a serialized stream, since the first record of a particular class is distinguished and must be treated as a special case.

All of these problems are avoided by not writing the class name to the stream at all, which is the approach that Writable takes. This makes the assumption that the client knows the expected type. The result is that the format is considerably more compact than Java Serialization, and random access and sorting work as expected, since each record is independent of the others (so there is no stream state).
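To see the difference in compactness concretely, here is a minimal sketch (assuming hadoop-common, which provides org.apache.hadoop.io.IntWritable, is on the classpath) that serializes the same int value both ways and prints the number of bytes each produces. The exact counts vary by JVM and Hadoop version, but the Writable output is just the 4 raw bytes of the int, while Java Serialization adds tens of bytes of class metadata.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.io.IntWritable;

public class SerializationSizeDemo {
    public static void main(String[] args) throws Exception {
        // Java Serialization: the class descriptor precedes the value,
        // so a single Integer typically serializes to around 80 bytes.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Integer.valueOf(163));
        }

        // Writable: only the raw 4-byte int value is written, no metadata.
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(writableBytes)) {
            new IntWritable(163).write(dos);
        }

        System.out.println("Java Serialization: " + javaBytes.size() + " bytes");
        System.out.println("IntWritable:        " + writableBytes.size() + " bytes");
    }
}
```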
Java Serialization is a general-purpose mechanism for serializing graphs of objects, so it necessarily has some overhead for serialization and deserialization operations. What's more, the deserialization procedure creates a new instance for each object deserialized from the stream. Writable objects, on the other hand, can be (and often are) reused. For example, for a MapReduce job, which at its core serializes and deserializes billions of records of just a handful of different types, the savings gained by not having to allocate new objects are significant.
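As an illustration of that reuse, the following small sketch (again assuming hadoop-common is on the classpath) writes several LongWritable records to an in-memory stream and then deserializes all of them into a single instance, so no per-record allocation takes place:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.LongWritable;

public class WritableReuseDemo {
    public static void main(String[] args) throws Exception {
        // Write a few records to an in-memory stream.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (long i = 0; i < 5; i++) {
                new LongWritable(i).write(out);
            }
        }

        // Deserialize every record into the SAME instance: readFields()
        // simply overwrites the previous value, so nothing new is allocated.
        LongWritable reused = new LongWritable();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            for (int i = 0; i < 5; i++) {
                reused.readFields(in);
                System.out.println(reused.get());
            }
        }
    }
}
```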
In terms of extensibility, Java Serialization has some support for evolving a type, but it is brittle and hard to use effectively (Writables have no such support: the programmer has to manage versioning by hand).
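One common hand-rolled approach is to write a version number ahead of the fields and branch on it when reading. The class below is purely hypothetical (it is not part of Hadoop); it is only a sketch of what that manual bookkeeping looks like:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical record: the "email" field was added in version 2, so
// readFields must branch on a hand-written version byte -- the Writable
// interface itself gives no help with schema evolution.
public class UserRecordWritable implements Writable {
    private static final byte VERSION = 2;

    private Text name = new Text();
    private Text email = new Text();   // added in version 2

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        name.write(out);
        email.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        name.readFields(in);
        if (version >= 2) {
            email.readFields(in);      // newer data carries the extra field
        } else {
            email.set("");             // older data simply lacks it
        }
    }
}
```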
In principle, other languages could interpret the Java Serialization stream protocol (defined by the Java Object Serialization Specification), but in practice there are no widely used implementations in other languages, so it is a Java-only solution. The situation is the same for Writables. So, in summary:

Java Serializable
Serializable does not assume the class of stored values is known, so it tags each instance with its class, i.e., it writes metadata about the object, which includes the class name, field names and types, and its superclass. ObjectOutputStream and ObjectInputStream optimize this somewhat, so that 5-byte handles are written for instances of a class after the first. But object sequences with handles cannot then be accessed randomly, since they rely on stream state. This complicates things like sorting.
Hadoop Writable
When defining a "Writable", you know the expected class. So Writables don't store their type in the serialized representation, because when deserializing you already know what to expect. For example, if the input key is a LongWritable, an empty LongWritable instance is asked to populate itself from the input data stream. Since no meta information (class name, field names and types, superclasses) needs to be stored, the result is considerably more compact binary files, straightforward random access, and higher performance.
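To make that concrete, here is a hypothetical custom Writable (a made-up PointWritable, not a Hadoop class) whose serialized form is exactly the eight bytes of its two int fields, with no class name or field metadata. The reader simply constructs the right type and asks it to populate itself:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical 2-D point. write() emits exactly 8 bytes (two ints) and
// nothing else: no class name, no field names, no superclass info.
public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() { }                 // needed so readers can create an empty instance

    public PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // The reader already knows it is reading a PointWritable; it just
        // consumes the two ints in the order write() produced them.
        x = in.readInt();
        y = in.readInt();
    }
}
```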

Introduction to the Hadoop Ecosystem

Everyone knows data is getting bigger. We are stepping into the next generation of data analysis and storage. Apache Hadoop is an open source framework that aims at solving the problem of huge data storage and large-scale processing of data sets.

"Hadoop" was the name of its founder Doug Cutting's son's toy elephant. Doug used the name for his open source project because it was easy to pronounce and to Google.

Hadoop MapReduce – a programming model for large-scale data processing.

Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.
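To give a feel for the MapReduce programming model mentioned above, here is a minimal sketch of the classic word count example using the org.apache.hadoop.mapreduce API (the job driver that wires the classes together and submits the job is omitted):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();          // reused across map() calls

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sum the counts for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}
```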