Big Data
Google's MapReduce
MapReduce is built on the proven concept of divide and conquer: it’s much faster to break a massive task into smaller chunks and process them in parallel.
In MapReduce, task-based programming logic is placed as close to the data as possible. This technique works very nicely with both structured and unstructured data. It’s no surprise that Google chose to follow a divide-and-conquer approach, given its organizational philosophy of using lots of commodity computers for data processing and storage instead of focusing on fewer, more powerful (and expensive!) servers. Along with the MapReduce architecture, Google authored the Google File System, an innovative, powerful, distributed file system meant to hold enormous amounts of data. Google optimized this file system to meet its voracious information-processing needs. However, as we describe later, this was just the starting point.
Google’s MapReduce served as the foundation for subsequent technologies such as Hadoop, while the Google File System was the basis for the Hadoop Distributed File System.
How it works
Map
In contrast with traditional relational databases, which organize data into fairly rigid rows and columns stored in tables, MapReduce works with key/value pairs (that is, label–value pairs). In the Map phase of MapReduce, records from the data source are fed into the map() function as key/value pairs. The map() function then produces one or more intermediate values, along with an output key, from each input.
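To make the Map phase concrete, here’s a minimal sketch of a map() function written against the Java API of Hadoop, the open-source implementation introduced below. The word-count scenario is our own illustration, not something prescribed by the source: each input value is one line of text, and the mapper emits a (word, 1) pair for every word it finds.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: the line's byte offset in the file; input value: the line itself.
// Output key: a word; output value: the count 1.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE); // emit one (word, 1) pair per word
    }
  }
}
```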
Reduce
After the Map phase is over, all the intermediate values for a given output key are combined into a list. The reduce() function then condenses those intermediate values into one or more final values for the same output key.
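Continuing the hypothetical word-count example, the framework groups the mapper’s intermediate pairs by key, so the reducer receives each word together with the list of 1s emitted for it and simply sums them:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives one word plus all the counts emitted for it; emits (word, total).
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get(); // add up the intermediate values for this key
    }
    total.set(sum);
    context.write(key, total);
  }
}
```

For the input “to be or not to be”, the Map phase emits (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1); after grouping, the Reduce phase produces (be, 2), (not, 1), (or, 1), (to, 2).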
Apache Hadoop
Hadoop is a widely adopted, standards-based, open-source software framework built on the foundation of Google’s MapReduce and Google File System papers. It’s meant to harness the power of massively parallel processing to take advantage of Big Data, generally by using lots of inexpensive commodity servers.
The inventors of HDFS made a series of important design decisions:
- Files are stored as blocks: These are much larger than the blocks in most file systems, with a default of 128 MB (the sketch after this list shows how block size and replication surface in Hadoop’s configuration API).
- Reliability is achieved through replication: Each block is replicated across two or more DataNodes; the default is three.
- A single master NameNode coordinates access and metadata: This simplifies and centralizes management.
- No data caching: It’s not worth it, given the large data sets and sequential scans.
- There’s a familiar interface with a customizable API: This lets you simplify the problem and focus on distributed applications rather than performing low-level data manipulation.
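As a rough illustration of how two of these decisions surface in practice, here’s a hypothetical snippet that sets the block size and replication factor through Hadoop’s Java configuration API and then asks the NameNode where a file’s blocks live. The property names (dfs.blocksize, dfs.replication) are the standard HDFS settings in Hadoop 2.x; the file path is made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSettingsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
    conf.setInt("dfs.replication", 3);                 // three copies of each block

    // Ask the NameNode for the location of each block of a (made-up) file.
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/data/example.log"));
    for (BlockLocation block :
        fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(block); // offset, length, and the DataNodes holding it
    }
  }
}
```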
Distributed data storage
These technologies are used to maintain large datasets on commodity storage clusters. Given that this information forms the foundation of your MapReduce efforts, be sure that your Hadoop implementation is capable of working with all types of data storage; a short sketch of Hadoop’s storage-neutral FileSystem interface follows the list. Major providers include the following:
- Hadoop HDFS
- IBM GPFS
- Appistry CloudIQ Storage
- MapR Technologies
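What makes such alternatives practical is Hadoop’s storage-neutral FileSystem abstraction: client code addresses storage through a URI, and the matching implementation (HDFS itself, or an HDFS-compatible product) is loaded behind the scenes. Here’s a minimal sketch; the host name and path are made up.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StorageNeutralWrite {
  public static void main(String[] args) throws Exception {
    // The URI scheme selects the implementation: hdfs:// here, but
    // file:// or a vendor's HDFS-compatible scheme works the same way.
    FileSystem fs = FileSystem.get(
        URI.create("hdfs://namenode:8020/"), new Configuration());

    try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
      out.writeBytes("written through the FileSystem abstraction");
    }
  }
}
```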
Distributed MapReduce runtime
This software is assigned an extremely important responsibility: scheduling and distributing the jobs that consume information kept in distributed data storage (a minimal job-submission sketch follows the list). Some major suppliers are:
- Open-source Hadoop JobTracker
- IBM Platform Symphony MapReduce
- Oracle Grid Engine
- GridGain
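To show where the runtime fits, here’s a minimal, hypothetical driver that packages the word-count mapper and reducer sketched earlier and hands the job to the cluster’s scheduler (the JobTracker in classic Hadoop):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);   // from the Map sketch
    job.setReducerClass(WordCountReducer.class); // from the Reduce sketch
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory

    // Submit to the scheduler and block until the job finishes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```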
Supporting tools and applications
A broad range of technologies lets programmers and non-programmers alike derive value from Big Data, such as these well-known examples:
- Programming tools:
  - Apache Pig
  - Apache Hive
- Workflow scheduling:
  - Apache Oozie
- Data store:
  - Apache HBase (a short client sketch follows this list)
- Analytic and related tools:
  - IBM BigSheets
  - Datameer
  - Digital Reasoning
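As one concrete example from this list, HBase offers a Java client for reading and writing individual cells. This sketch uses the client API from HBase 1.0 and later (earlier releases used a slightly different interface); the table, row, and column names are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("metrics"))) {

      // Write one cell: row "sensor-1", column family "d", qualifier "temp".
      Put put = new Put(Bytes.toBytes("sensor-1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"),
          Bytes.toBytes("21.5"));
      table.put(put);

      // Read the same cell back.
      Result result = table.get(new Get(Bytes.toBytes("sensor-1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));
    }
  }
}
```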
Distributions
These provide a single, integrated offering of all components, pre-tested and certified to work together. Here are some of the best-known of the many providers:
- Apache Hadoop
- IBM InfoSphere BigInsights
- Cloudera
- Hortonworks
- MapR Technologies
Business intelligence and other tools
These are popular technologies that have been in use for years with traditional relational data, and they’ve now been extended to work with data that’s accessible via Hadoop. Here are three industry sub-segments, along with some of the best-known vendors in each one:
- Analytics:
  - IBM Cognos
  - IBM SPSS
  - MicroStrategy
  - Quest
  - SAS
- Business intelligence:
  - Jaspersoft
  - Pentaho
- Extract, transform, load (ETL):
  - IBM InfoSphere DataStage
  - Informatica
  - Pervasive
  - Talend
See also
- MapReduce: http://research.google.com/archive/mapreduce.html
- Hadoop: http://hadoop.apache.org
- Hadoop Distributed File System (HDFS): https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
- Apache Pig: https://pig.apache.org
- Apache Hive: https://hive.apache.org
- NoSQL databases:
  - Apache Cassandra: http://cassandra.apache.org