Using Hadoop for data migration

What is Hadoop?

Hadoop is a Java-based open source framework for storing and processing big data. The data is stored on inexpensive commodity servers that are connected in clusters. Its distributed file system is fault tolerant and enables parallel processing. Developed by Doug Cutting and Michael J. Cafarella, Hadoop leverages the MapReduce programming model to store and retrieve data from its nodes faster. The framework is administered by the Apache Software Foundation and is under the Apache License 2.0.

The computing power of application servers has been increasing enormously for years. The databases, on the other hand, lag behind due to their limited capacity and speed. But today, when huge amounts of data generated by numerous applications have to be processed, Hadoop is one of the central tools that are driving the urgently needed change in the world of databases.

From a business perspective, too, there are both direct and indirect benefits. Organizations can achieve significant cost savings by using open source technologies on inexpensive servers that are primarily hosted in the cloud (and sometimes on-premise).

The vast amounts of data companies collect today, and the valuable insights gained from them, also lead to better business decisions. Examples include eliminating or correcting faulty processes, optimizing workflows, delivering relevant search results, performing predictive analyses, and focusing on the right consumer segment.

How Hadoop improves traditional databases

Hadoop solves two key challenges faced by traditional databases:

1. Capacity: Hadoop stores large amounts of data.

With a distributed file system called HDFS (Hadoop Distributed File System), the data is distributed over several units and stored in clusters of commodity servers. Since these commodity servers use simple hardware configurations, they are economical and can be easily scaled with increasing amounts of data.

2. Speed: Hadoop stores and retrieves data faster.

Hadoop uses the functional programming model MapReduce to process data sets in parallel. When a query is sent to the database, the data is not processed sequentially. Instead, the tasks are shared and executed in parallel on distributed servers. The result of all tasks is then compiled and sent back to the application, which speeds up processing enormously.
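
To make the split-and-merge pattern concrete, here is a minimal Python sketch (plain Python, not Hadoop code; all names are illustrative): a query is divided into chunks, the chunks are processed by parallel workers, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(chunk, term):
    """Worker task: scan one chunk of records for a search term."""
    return sum(1 for record in chunk if term in record)

def parallel_query(records, term, workers=4):
    """Split the records into chunks, process them in parallel,
    then combine the partial results into one answer."""
    size = max(1, len(records) // workers)
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_matches, chunks, [term] * len(chunks))
        return sum(partials)
```

In real Hadoop the chunks live on different servers and the combining step runs across the network, but the logic is the same: no single machine ever scans the whole data set sequentially.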

5 benefits of Hadoop for big data

Hadoop is essential for big data and analytics. Data on people, processes, objects, tools, etc. are only useful if meaningful patterns can be derived from them, which in turn lead to better decisions. Hadoop helps tackle the massive data challenge:

  1. Resilience - Data stored on any node is replicated to other nodes in the cluster for fault tolerance. If a node fails, there is always a backup for the data in the cluster.
  2. Scalability - In contrast to traditional systems with limited data storage, Hadoop is scalable because it runs in a distributed environment. Depending on requirements, the setup can easily be expanded to include more servers that can store up to several petabytes of data.
  3. Low costs - Since Hadoop is an open source framework that does not require a license, the costs are significantly lower than with relational database systems. The use of inexpensive commodity hardware also helps ensure that the solution remains economical.
  4. Speed - Hadoop's distributed file system, parallel processing and MapReduce model enable complex queries to be performed in a matter of seconds.
  5. Data diversity - HDFS can store different data formats such as unstructured (e.g. videos), semi-structured (e.g. XML files) and structured information. No validation against a predefined schema is required. Instead, data can be stored in any format and is only parsed and mapped onto the desired schema when it is read. This allows a high degree of flexibility, so that different insights can be derived from the same data.
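
The "parse only when read" idea in point 5 is often called schema-on-read. A small Python sketch (illustrative, not HDFS code): raw records are stored exactly as they arrive, and each query projects them onto whatever schema it needs at read time.

```python
import json

# Raw records land in storage as-is; no schema is enforced on write.
raw_store = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "referrer": "ad"}',  # extra field is fine
]

def read_with_schema(store, fields):
    """Schema-on-read: parse each raw record at query time and
    project it onto the schema the current analysis needs."""
    rows = []
    for raw in store:
        doc = json.loads(raw)
        rows.append({f: doc.get(f) for f in fields})
    return rows
```

Two different analyses can thus read the same raw data with two different schemas, which is exactly the flexibility described above.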

The Hadoop Ecosystem: Core Components

Hadoop does not consist of a single application; rather, it is a platform with different components that enable distributed data storage and processing. Together, these components make up the Hadoop ecosystem.

Some components are core components and form the basis of the framework, while others are additions that provide add-on functions.

The core components of Hadoop include:

HDFS: Providing a distributed file system

HDFS is the part of Hadoop that maintains the distributed file system so that data can be stored and replicated on multiple servers.

HDFS has a NameNode and DataNodes. DataNodes are the commodity servers on which the data itself is stored. The NameNode, on the other hand, holds the metadata: information about what is stored on which nodes. The application only interacts with the NameNode, which in turn communicates with the data nodes as required.
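
The division of labor can be pictured with a toy Python model (the dictionaries and function below are illustrative, not the HDFS API): the NameNode holds only metadata mapping file blocks to DataNodes, while the block contents live on the DataNodes themselves.

```python
# NameNode metadata: which DataNodes hold which blocks of which file.
namenode = {
    "/logs/app.log": {"block_0": ["datanode1", "datanode3"],
                      "block_1": ["datanode2", "datanode3"]},
}

# DataNodes: the actual block contents, replicated for fault tolerance.
datanodes = {
    "datanode1": {"block_0": b"first half "},
    "datanode2": {"block_1": b"second half"},
    "datanode3": {"block_0": b"first half ", "block_1": b"second half"},
}

def read_file(path):
    """A client asks the NameNode where the blocks are, then reads
    each block from one of the DataNodes holding a replica."""
    blocks = namenode[path]
    data = b""
    for block_id in sorted(blocks):
        replica = blocks[block_id][0]  # pick the first listed replica
        data += datanodes[replica][block_id]
    return data
```

Note that the replica lists also show the resilience property from the benefits section: if datanode1 failed, block_0 could still be read from datanode3.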

YARN: Yet Another Resource Negotiator

YARN (Yet Another Resource Negotiator) manages and allocates cluster resources and decides what should happen on each data node. The central master node that manages all processing requests is called the Resource Manager. The Resource Manager interacts with the Node Managers; each slave data node has its own Node Manager to carry out tasks.


MapReduce: Parallel data processing

MapReduce is a programming model first used by Google to index its search operations. It provides the logic for breaking data up into smaller data sets and is based on the two functions Map() and Reduce(), which parse data quickly and efficiently.

First, the map function groups, filters and sorts several data records at the same time in order to generate tuples (key-value pairs). The Reduce function then aggregates the data from these tuples to obtain the desired result.
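
The classic illustration of this two-step pattern is a word count. The Python sketch below (plain Python, not Hadoop's Java API) shows the Map step emitting (key, value) tuples, the intermediate shuffle grouping values by key, and the Reduce step aggregating each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) tuple for every word in the input."""
    return [(word, 1) for word in document.split()]

def shuffle(tuples):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in tuples:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key into the result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data big insight")))
```

In a real cluster, many map tasks run in parallel on different DataNodes and the shuffle moves tuples across the network, but each phase does exactly what these small functions do.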

The Hadoop ecosystem: additional components

Here is a small selection of additional components widely used in the Hadoop ecosystem:

Hive: data warehousing

Hive is a data warehousing system that helps to query large data sets in HDFS. Before Hive existed, developers had to create complex MapReduce jobs to query the Hadoop data. Hive uses the query language HQL (Hive Query Language), which is similar to the syntax of SQL. Since most developers have a SQL background, Hive is easier for them to work with.

Hive has the advantage that a JDBC/ODBC driver serves as the interface between the application and HDFS. It exposes the Hadoop file system as tables and translates HQL into MapReduce jobs and back. This allows developers and database administrators to batch-process large data sets using simple, familiar queries. Hive was originally developed by the Facebook team but is now an open source technology.

Pig: Reduction of MapReduce functions

Pig was designed by Yahoo! and is similar to Hive in that it does not require writing MapReduce functions to query HDFS. As with HQL, its language ("Pig Latin") is closer to SQL. "Pig Latin" is a higher-level data-flow layer that sits on top of MapReduce.

Pig also has a runtime environment with an interface to HDFS. Scripts in languages such as Java or Python can also be embedded in Pig.

Hive vs. Pig

While Pig and Hive have similar functions, they differ in their effectiveness for certain scenarios.

Pig is very useful in the data preparation phase, as complex joins and queries can be implemented with it easily. The component also works well with different data formats such as semi-structured and unstructured information. Pig Latin is relatively similar to SQL, but there are enough differences to create a certain learning curve.

Hive, however, works well with structured data and is therefore more effective in data warehousing. This component is used on the server side of the cluster.

Researchers and programmers tend to use Pig on the client side of a cluster, while business intelligence users like data analysts prefer Hive.

Flume: Ingesting Big Data

Flume is a big data ingestion tool that serves as a delivery service between multiple data sources and HDFS. It collects, aggregates and moves huge amounts of streaming data (e.g. log files, events) generated by social media platforms, IoT apps, e-commerce portals and other applications into HDFS.

Flume has numerous functions. The tool:

  • has a distributed architecture.
  • ensures reliable data transfer.
  • is fault tolerant.
  • offers the flexibility to collect data in batches or in real time.
  • can be scaled horizontally to handle more traffic if necessary.

Data sources communicate with Flume agents, each of which has a source, a channel, and a sink. The source collects the data from the sender, the channel buffers it temporarily, and the sink finally delivers the data to the destination, a Hadoop server.
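
A toy Python model of that source → channel → sink pipeline (the class and method names are illustrative, not the actual Flume API): events arrive at the source, sit in the channel's buffer, and are later drained by the sink to the destination.

```python
from queue import Queue

class FlumeAgentSketch:
    """Minimal source -> channel -> sink pipeline, modeled on the
    Flume agent structure described above."""

    def __init__(self, destination):
        self.channel = Queue()          # buffers events between source and sink
        self.destination = destination  # stands in for an HDFS sink target

    def source_receive(self, event):
        """Source: accept an event from a sender and put it on the channel."""
        self.channel.put(event)

    def sink_drain(self):
        """Sink: deliver all buffered events to the destination."""
        while not self.channel.empty():
            self.destination.append(self.channel.get())
```

The channel is what decouples ingestion speed from delivery speed: the source can keep accepting events even while the sink is temporarily slow, which is one reason Flume can offer reliable transfer.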

Sqoop: data acquisition for relational databases

Sqoop (SQL-to-Hadoop) is a data acquisition tool like Flume. But while Flume is used for unstructured or semi-structured data, Sqoop handles data export from and import into relational databases. Since most company data is stored in relational databases, Sqoop is used to import this data into Hadoop so that analysts can work with it.

Database administrators and developers can export and import data with a simple command line interface. Sqoop converts these commands into MapReduce format and sends them to the HDFS with YARN. Sqoop is also fault tolerant and performs multiple operations at the same time like Flume.

Zookeeper: coordination of distributed applications

The Zookeeper service coordinates distributed applications. Within the Hadoop framework, it serves as an administration tool with a centralized registry that holds information about the cluster of distributed servers. Its most important functions include:

  • Maintenance of configuration information (shared state of configuration data)
  • Naming service (assigns a name to each server)
  • Synchronization service (handles locks, race conditions and data inconsistency)
  • Leader election (chooses a leader among the servers by consensus)

The server cluster on which the Zookeeper service runs is called an "ensemble". The ensemble elects a leader within the group; the others are followers. All write operations from clients must be routed through the leader, while read operations can go directly to any server.
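
A common way such an election is implemented (e.g. in ZooKeeper's sequential-node recipe) is that each server registers under an increasing sequence number and the lowest number wins. A toy Python sketch of that rule (illustrative only, not the ZooKeeper API):

```python
def elect_leader(ensemble):
    """Toy election: each server in the ensemble registered with a
    sequence number; the lowest number becomes leader, the rest
    are followers. Mirrors ZooKeeper's sequential-node recipe."""
    leader = min(ensemble, key=lambda server: ensemble[server])
    followers = sorted(s for s in ensemble if s != leader)
    return leader, followers
```

If the leader fails, its registration disappears and re-running the same rule over the remaining servers deterministically yields a new leader.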

Zookeeper offers high reliability and stability thanks to fail-safe synchronization, atomicity and serialization of messages.

Kafka: Faster data transfers

Kafka is a distributed publish-subscribe messaging system that is often used in conjunction with Hadoop for faster data transfers. A Kafka cluster consists of a group of servers that act as intermediaries between producers and consumers.

In the context of big data, a producer could, for example, be a sensor that records temperature data and sends it back to the server. The consumers are the Hadoop servers. The producers publish a message on a topic and the consumers pick up messages by listening to the topic.

A single topic can be divided into several partitions. All messages with the same key go to a specific partition. A consumer can listen to one or more partitions.

If messages are grouped under a key and a consumer supplies certain partitions, many consumers can listen to the same topic at the same time. In this way, a topic is parallelized, which increases the system throughput. Kafka is widely used because of its speed, scalability, and robust replication.
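
The key-to-partition rule can be sketched in a few lines of Python (Kafka's real default partitioner hashes keys with murmur2; the simple character-sum hash below merely stands in for it, and all names are illustrative):

```python
def assign_partition(key, num_partitions):
    """All messages with the same key land in the same partition.
    A toy deterministic hash stands in for Kafka's murmur2 here."""
    return sum(key.encode()) % num_partitions

# A topic with 3 partitions; producers publish keyed messages into it.
topic = {p: [] for p in range(3)}
for key, payload in [("sensor-a", 21.5), ("sensor-b", 19.0), ("sensor-a", 22.1)]:
    topic[assign_partition(key, 3)].append((key, payload))
```

Because the mapping is deterministic, both "sensor-a" readings end up in the same partition and thus reach the same consumer in order, while other keys can be consumed in parallel from other partitions.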

HBase: Non-relational database

HBase is a column-oriented, non-relational database that runs on top of HDFS. One of the challenges with HDFS is that it is limited to batch processing: even simple interactive queries have to be run as batch jobs, which leads to high latency.

HBase solves this challenge by allowing low-latency queries for individual rows in huge tables. It achieves this through the internal use of hash tables. HBase is modeled on Google's BigTable database system, which supports access to the Google File System (GFS).
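
The essence of that hash-table lookup can be shown with a toy Python model (illustrative only, not the HBase API): rows are indexed by row key, so fetching one row is a direct lookup rather than a batch scan.

```python
# Toy model of an HBase-style table: a hash index keyed by row key,
# each row holding a family:qualifier -> value mapping.
table = {}

def put(row_key, column, value):
    """Write one cell of a row."""
    table.setdefault(row_key, {})[column] = value

def get(row_key):
    """Fetch a single row directly by key -- no batch job required."""
    return table.get(row_key, {})
```

This is the difference from raw HDFS in miniature: instead of streaming through files to find one record, the key leads straight to the row.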

HBase is scalable, offers failure support if a node fails and is suitable for both unstructured and semi-structured data. Therefore, HBase is ideal for querying big data stores for analysis purposes.

Challenges of Hadoop

While Hadoop is widely recognized as the primary enabler for big data, there are still some challenges to consider. These are primarily related to its complex ecosystem and the extensive technical knowledge required to execute Hadoop functions. With the right integration platform and tools, however, the complexity can be significantly reduced and working with Hadoop can be made easier.

1. Steep learning curve

To query the Hadoop file system, programmers must write MapReduce functions in Java. This is not exactly easy and involves a steep learning curve. It also takes time to become familiar with the ecosystem and its many components.

2. Different data sets require different approaches

There is no one-size-fits-all solution in Hadoop. Most of the additional components described above were developed to fill an existing gap.

For example, Hive and Pig provide an easier way to query records. Data collection tools like Flume and Sqoop support data collection from multiple sources. In addition, there are numerous other components and it takes experience to select the right ones.

3. Limitations of MapReduce

MapReduce is an excellent programming model for processing big data sets in batches. But it also has its limits.

Its file-intensive approach, with multiple reads and writes, is not suitable for real-time interactive data analysis or iterative tasks. MapReduce is simply not efficient enough for such operations and results in high latencies. (There are workarounds for this problem; Apache Spark, for example, fills this gap in MapReduce.)

4. Data security

With the increasing shift of big data to the cloud, sensitive data is stored on Hadoop servers, so that the security of this data must be guaranteed. The extensive ecosystem consists of a large number of tools. It is therefore important to ensure that the correct data access rights are available for the individual tools. There needs to be adequate authentication, provisioning, data encryption, and frequent audits. Hadoop can handle this challenge, but it requires knowledge and care in execution.

While many tech giants use the Hadoop components listed here, the solution is still relatively new to the industry, and most of the challenges stem from that newness. However, a robust big data integration platform can solve or reduce them.

Hadoop vs. Apache Spark

Despite its many advantages, the MapReduce model is not an efficient solution for real-time data processing or interactive queries, since it is based on disk write operations between the individual processing phases.

The data processing engine Spark solves this challenge with an in-memory data store.

Spark is often built on top of HDFS, using only the storage layer of Hadoop. For processing, Spark uses its own libraries, which support SQL queries, streaming, machine learning, and graph processing.

Data scientists work a lot with Spark because the engine is extremely fast and has elegant, feature-rich APIs that make working with large data sets easy.

While Spark may seem a bit superior to Hadoop, the two can work together efficiently. Depending on the requirements and the type of data records, Hadoop and Spark complement each other. Spark does not have its own file system and must therefore use HDFS or similar solutions for its storage.

The real difference, however, is in the processing logic of Spark and the MapReduce model. If you have limited memory and jobs that run overnight, MapReduce is best. For streaming data, accessing machine learning libraries and fast real-time operations, however, Spark is the perfect choice.

A future with many possibilities

Within a single decade, Hadoop has become a fixture and a major driving force in the computing industry. The reason: With Hadoop, the possibilities of data analysis can finally be efficiently realized. The areas of application of Hadoop are diverse - from the analysis of page visits to fraud detection and banking applications.

With Talend Open Studio for Big Data, you can easily integrate your Hadoop setup into a wide variety of data architectures. Talend offers more integrated data connectors than any other data management solution and enables seamless data flows between Hadoop and all common file formats (CSV, XML, Excel, etc.), database systems (Oracle, SQL Server, MySQL, etc.), corporate application packages (SAP, SugarCRM, etc.) and even cloud data services such as Salesforce.

Learn more about Talend's big data solutions, or download Talend Open Studio for Big Data today and take advantage of the leading open source big data tool.