Apache Sqoop Cookbook

Integrating data from multiple sources is essential in the age of big data, but it can be a challenging and time-consuming task. This handy cookbook provides dozens of ready-to-use recipes for using Apache Sqoop, the command-line interface application that optimizes data transfers between relational databases and Hadoop.

Sqoop is both powerful and bewildering, but with this cookbook’s problem-solution-discussion format, you’ll quickly learn how to deploy and then apply Sqoop in your environment. The authors provide MySQL, Oracle, and PostgreSQL database examples on GitHub that you can easily adapt for SQL Server, Netezza, Teradata, or other relational systems.

  • Transfer data from a single database table into your Hadoop ecosystem
  • Keep table data and Hadoop in sync by importing data incrementally
  • Import data from more than one database table
  • Customize transferred data by calling various database functions
  • Export generated, processed, or backed-up data from Hadoop to your database
  • Run Sqoop within Oozie, Hadoop’s specialized workflow scheduler
  • Load data into Hadoop’s data warehouse (Hive) or database (HBase)
  • Handle installation, connection, and syntax issues common to specific database vendors

1. Getting Started

This chapter will guide you through the basic prerequisites of using Sqoop. You will learn how to download and install Sqoop on your computer or on any node of your Hadoop cluster. Sqoop comes with a very detailed User Guide describing all the available parameters and basic usage. Rather than repeating the guide, this book focuses on applying the parameters to real use cases and helping you to deploy and use Sqoop effectively in your environment.

You want to install Sqoop on your computer or on any node in your Hadoop cluster.

Sqoop supports the Linux operating system, and there are several installation options. One option is the source tarball that is provided with every release. This tarball contains only the source code of the project; you can't use it directly and will need to first compile the sources into binary executables. For your convenience, the Sqoop community provides a binary tarball for each major supported version of Hadoop along with the source tarball.

In addition to the tarballs, there are open source projects and commercial companies that provide operating system-specific packages. One such project, Apache Bigtop, provides rpm packages for Red Hat, CentOS, and SUSE, and deb packages for Ubuntu and Debian. The biggest benefit of using packages over tarballs is their seamless integration with the operating system: for example, configuration files are stored in /etc/ and logs in /var/log.
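If you go the tarball route, installation boils down to unpacking the archive and putting Sqoop's bin/ directory on your PATH. The following is a minimal sketch; the archive name is a placeholder and depends on the Sqoop and Hadoop versions you downloaded, and the target directory is an arbitrary choice:

    # Unpack the binary tarball (archive name varies by Sqoop/Hadoop version).
    tar -xzf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz -C /usr/lib/

    # Point SQOOP_HOME at the unpacked directory and add its bin/ to PATH.
    export SQOOP_HOME=/usr/lib/sqoop-1.4.4.bin__hadoop-2.0.4-alpha
    export PATH=$PATH:$SQOOP_HOME/bin

    # Verify the installation; this should print the Sqoop version banner.
    sqoop version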

 

2. Importing Data

The next few chapters, starting with this one, are devoted to transferring data from your relational database or warehouse system to the Hadoop ecosystem. In this chapter we will cover the basic use cases of Sqoop, describing various situations where you have data in a single table in a database system (e.g., MySQL or Oracle) that you want to transfer into the Hadoop ecosystem.

We will be describing various Sqoop features through examples that you can copy and paste to the console and then run. In order to do so, you will need to first set up your relational database. For the purpose of this book, we will use a MySQL database with the account sqoop and password sqoop. We will be connecting to a database named sqoop. You can easily create the credentials using the script mysql.credentials.sql uploaded to the GitHub project associated with this book.

You can always change the examples if you want to use different credentials or connect to a different relational system (e.g., Oracle, PostgreSQL, Microsoft SQL Server, or any others). Further details will be provided later in the book. As Sqoop is focused primarily on transferring data, we need to have some data already available in the database before running the Sqoop commands. To have something to start with, we've created the table cities containing a few cities from around the world (see Table 2-1). You can use the script mysql.tables.sql from the aforementioned GitHub project to create and populate all tables that are needed.
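As a first example, a minimal import of the cities table might look like the following sketch; it assumes a MySQL server running on localhost with the sqoop/sqoop credentials and the sqoop database described above, so adjust the JDBC URL for your environment:

    sqoop import \
      --connect jdbc:mysql://localhost/sqoop \
      --username sqoop \
      --password sqoop \
      --table cities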

 

3. Incremental Import

So far we've covered use cases where you had to transfer an entire table's contents from the database into Hadoop as a one-time operation. What if you need to keep the imported data on Hadoop in sync with the source table on the relational database side? While you could obtain a fresh copy every day by reimporting all data, that would not be optimal. The amount of time needed to import the data would increase in proportion to the amount of additional data appended to the table daily, and it would put an unnecessary performance burden on your database. Why reimport data that has already been imported? For transferring deltas of data, Sqoop offers the ability to do incremental imports.

Examples in this chapter use the table visits, which can be created by the script mysql.tables.sql described in Chapter 2.

You have a database table with an INTEGER primary key. You are only appending new rows, and you need to periodically sync the table's state to Hadoop for further processing.

Activate Sqoop's incremental feature by specifying the --incremental parameter. The parameter's value is the type of incremental import to perform. When your table is only getting new rows and the existing ones are not changed, use the append mode.
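A sketch of such an append-mode import of the visits table, assuming the MySQL setup from Chapter 2, that the primary key column is named id, and that rows with id up to 1 have already been transferred:

    sqoop import \
      --connect jdbc:mysql://localhost/sqoop \
      --username sqoop \
      --password sqoop \
      --table visits \
      --incremental append \
      --check-column id \
      --last-value 1

On subsequent runs you would pass the last imported value of the check column as --last-value, or save the import as a Sqoop job so that the value is tracked for you between runs.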

 

4. Free-Form Query Import

The previous chapters covered the use cases where you had an input table on the source database system and you needed to transfer the table as a whole or one part at a time into the Hadoop ecosystem. This chapter, on the other hand, will focus on more advanced use cases where you need to import data from more than one table or where you need to customize the transferred data by calling various database functions.

For this chapter we've slightly altered the test table cities (see Table 4-1), normalizing the country name into a standalone table called countries (see Table 4-2). The normalized variant of the table cities is called normcities and will be created and populated automatically via the script mysql.tables.sql, as described in Chapter 2.

Table 4-1. Normalized cities

  id | country_id | city
  ---|------------|----------
   1 |          1 | Palo Alto
   2 |          2 | Brno
   3 |          1 | Sunnyvale

Table 4-2. Countries

  country_id | country
  -----------|----------------
           1 | USA
           2 | Czech Republic

You need to import one main table; however, this table is normalized. The important values are stored in the referenced dictionary tables, and the main table contains only numeric foreign keys pointing to the values in the dictionaries rather than to natural keys as in the original cities table. You would prefer to resolve the values prior to running Sqoop and import the real values rather than the numerical keys for the countries.
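One way to resolve the foreign keys on the database side is Sqoop's free-form query import, sketched below. The column names follow Tables 4-1 and 4-2, the connection parameters reuse the MySQL setup from Chapter 2, and the literal $CONDITIONS placeholder is required so that Sqoop can split the query across mappers:

    sqoop import \
      --connect jdbc:mysql://localhost/sqoop \
      --username sqoop \
      --password sqoop \
      --query 'SELECT normcities.id, countries.country, normcities.city FROM normcities JOIN countries USING (country_id) WHERE $CONDITIONS' \
      --split-by id \
      --target-dir cities

Note that --target-dir is mandatory when importing with a free-form query, and --split-by tells Sqoop which column to use for partitioning the work among mappers.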

 

5. Export

The previous three chapters had one thing in common: they described various use cases of transferring data from a database server to the Hadoop ecosystem. What if you have the opposite scenario and need to transfer generated, processed, or backed-up data from Hadoop to your database? Sqoop also provides facilities for this use case, and the following recipes in this chapter will help you understand how to take advantage of this feature.

You have a workflow of various Hive and MapReduce jobs that are generating data on a Hadoop cluster. You need to transfer this data to your relational database for easy querying.

You can use Sqoop's export feature, which allows you to transfer data from the Hadoop ecosystem to relational databases. For example, to export data from the HDFS directory cities (the directory containing the source data, passed to Sqoop as the --export-dir) into the table cities (the table to populate in the database), you would use the following Sqoop command:
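A sketch of that command, reusing the MySQL credentials from the import chapters (adjust the JDBC URL, username, and password to match your environment):

    sqoop export \
      --connect jdbc:mysql://localhost/sqoop \
      --username sqoop \
      --password sqoop \
      --table cities \
      --export-dir cities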

Export works similarly to import, except that it transfers data in the other direction. Instead of transferring data from the relational database using SELECT queries, Sqoop transfers the data to the relational database using INSERT statements. Sqoop's export workflow matches the import case with slight differences. After you execute the Sqoop command, Sqoop will connect to your database to fetch various metadata about your table, including the list of all columns and their appropriate types. Using this metadata, Sqoop will generate and compile a Java class. The generated class will be used in the submitted MapReduce job that will export your data. As in the import case, no data is transferred through the Sqoop client itself; all transfers are done in the MapReduce job, with Sqoop overseeing the process from your machine.

 

6. Hadoop Ecosystem Integration

The previous chapters described the various use cases where Sqoop enables highly efficient data transfers between Hadoop and relational databases. This chapter will focus on integrating Sqoop with the rest of the Hadoop ecosystem: we will show you how to run Sqoop from within a specialized Hadoop scheduler named Oozie and how to load your data into Hadoop's data warehouse system, Apache Hive, and Hadoop's database, Apache HBase.

You are using Oozie in your environment to schedule Hadoop jobs and would like to call Sqoop from within your existing workflows.

Oozie includes a special Sqoop action that you can use to call Sqoop from within your existing workflows.

Starting from version 3.2.0, Oozie has built-in support for Sqoop. You can use the special sqoop action type in the same way you would execute a MapReduce action. You have two options for specifying Sqoop parameters. The first option is to use one tag, <command>, to list all the parameters, for example:
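A sketch of such an action; the action and transition names are arbitrary, the ${jobTracker} and ${nameNode} properties are assumed to be defined in your workflow configuration, and the schema version of the sqoop action element may differ in your Oozie release:

    <action name="sqoop-import-cities">
      <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>import --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table cities</command>
      </sqoop>
      <ok to="next-action"/>
      <error to="fail"/>
    </action>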

In this case, Oozie will take the entire content of the <command> tag and split it by spaces into a list of parameters. This list is then passed as is into Sqoop.

 

7. Specialized Connectors

Due to its versatility, Sqoop transfers data from a variety of relational database systems, such as Oracle, MySQL, PostgreSQL, and Microsoft SQL Server, as well as from enterprise data warehouses, such as Netezza and Teradata. While working with these database systems, you may encounter issues specific to a system vendor. This chapter guides you through common installation, connection, and syntax issues.

The PostgreSQL direct connector imports Boolean values as the strings TRUE and FALSE. If your subsequent processing expects different values, you need to override those defaults.

Specify the extra parameters --boolean-true-string and --boolean-false-string to override the defaults with different strings. For example, to use 0 for false and 1 for true, you could use the following Sqoop command:
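A sketch of that command, assuming a local PostgreSQL database named sqoop with matching credentials and a table named table_name; note that these connector-specific options go after the standalone -- separator, which marks the start of extra arguments passed through to the connector:

    sqoop import \
      --connect jdbc:postgresql://localhost/sqoop \
      --username sqoop \
      --password sqoop \
      --direct \
      --table table_name \
      -- \
      --boolean-true-string 1 \
      --boolean-false-string 0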

The PostgreSQL direct connector uses the COPY (SELECT QUERY) TO STDOUT clause to retrieve data from your database, which by default renders Boolean and Bit columns as the string constants TRUE and FALSE. The PostgreSQL direct connector only supports import and delegates export to the nondirect JDBC connector. Therefore, both parameters, --boolean-true-string and --boolean-false-string, are applicable only to import and will be ignored during export operations.

 
