Hi, this is Bill Romine from Dell Software. Today, I'll show you how to get started with the SharePlex Connector for Hadoop. First, let's take a look at the high-level architecture. The SharePlex Connector for Hadoop is used to replicate data from Oracle to Hadoop. I assume that you already have an Oracle database and, of course, a Hadoop cluster. The replication stream consists of our SharePlex for Oracle product, a JMS queue, and the SharePlex Connector for Hadoop. The SharePlex Connector for Hadoop comes with SQOOP, so we won't need to install SQOOP independently.
Let's get started, first, with the JMS queue. There are many JMS providers; for my demo today, I'll be using Apache ActiveMQ. You can download it from the Apache website, and it is licensed under the Apache license. To install the provider, I'll log in to my source system, where SharePlex is installed. The JMS provider gets installed into my SharePlex product directory, under lib/providers.
First, I'll extract the binary. Then I'll rename the directory so the path is a bit shorter. Next, I'll copy the jar files from the lib directory to the top level directory. And let's go ahead and start up the provider.
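The steps above might look like the following at the shell. This is only a sketch: the ActiveMQ version, the SharePlex product path, and the directory names are assumptions for illustration.

```shell
# Work from the providers directory under the SharePlex product
# directory (path is illustrative)
cd /opt/shareplex/lib/providers

# Extract the ActiveMQ binary distribution (version is an assumption)
tar xzf apache-activemq-5.9.0-bin.tar.gz

# Rename the directory so the path is a bit shorter
mv apache-activemq-5.9.0 activemq

# Copy the jar files from the lib directory to the top-level directory
cp activemq/lib/*.jar activemq/

# Start up the provider (the ActiveMQ broker)
activemq/bin/activemq start
```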
Next, I'll need to configure SharePlex to use ActiveMQ as the provider. The following commands tell SharePlex the provider class, the URL, and the provider's library location. That completes the setup of the JMS provider.
So we've set up the JMS queue. Let's focus on the SharePlex Connector for Hadoop. The SharePlex Connector for Hadoop is installed on my target system, so I'll log in to the system hosting my Hadoop cluster. The connector is distributed as a tarball. Inside the tarball is an install.sh script, which I'll use to install the product.
Since the SharePlex Connector for Hadoop ships with SQOOP, and SQOOP is licensed under the Apache license, I need to accept the terms of that license. To configure the connector, I'll cd into the connector's bin directory and run the connector setup script.
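A sketch of the install and setup steps just described. The tarball file name and the setup script name are assumptions; install.sh is the installer named above.

```shell
# Extract the connector tarball on the Hadoop host
# (file name is illustrative)
tar xzf shareplex-hadoop-connector.tar.gz
cd shareplex-hadoop-connector

# Run the installer shipped inside the tarball; it prompts you to
# accept the Apache license terms for the bundled SQOOP
./install.sh

# Run the connector setup script from the bin directory
# (script name may differ by version)
cd bin
./conn_setup.sh
```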
The connector can replicate to HDFS, HBase, or both. The first two questions ask which target types I'd like to configure. For the purposes of my demo, I'll select both.
Next, if I selected HBase as a target, I'm prompted for the column family name that will be used for the replicated columns. If I selected HDFS replication, then I'm prompted for the target path in HDFS and for two parameters that control how often the connector rebuilds the tables in HDFS. For my demo, I'll choose to rebuild the tables after the first change. In practice, you'll need to set these parameters much higher on a production system to avoid rebuilding the tables too often.
Next, I'm prompted for connection information for the JMS provider. And finally, I'm prompted for connection information to my Oracle database. So now, the SharePlex connector for Hadoop is installed and configured.
Now that everything is installed, let's go back to SharePlex for Oracle and start replication. First, I'll need a table to replicate. Let's create a small table with a primary key. A primary key is required for replication to Hadoop.
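A small table like the one described can be created with sqlplus. The schema, table name, columns, and credentials here are all illustrative; the only requirement from the demo is the primary key.

```shell
# Create a small demo table with a primary key on the Oracle source
# (credentials, connect string, and schema are illustrative)
sqlplus demo/demo@ORCL <<'EOF'
CREATE TABLE demo_tab (
  id    NUMBER PRIMARY KEY,   -- a primary key is required for replication to Hadoop
  name  VARCHAR2(50),
  price NUMBER(10,2)
);
EOF
```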
Next, let's tell SharePlex to begin replication by activating a configuration file. Typically, I'd use the edit config command to create a configuration file, but for my demo, I'll use a file I previously created. Then I need to activate this configuration.
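The activation step might look like this in the SharePlex command-line interface. The configuration file name and its contents are illustrative; the exact routing syntax for the JMS target depends on your SharePlex version, so it is shown only as a placeholder.

```shell
# On the source, open the SharePlex command-line interface
./sp_ctrl

# At the sp_ctrl prompt, a configuration file names the datasource and
# maps source tables to targets, roughly:
#
#   datasource:o.ORCL
#   demo.demo_tab    demo.demo_tab    <routing to the JMS queue>
#
# Then activate it (file name is illustrative):
#
#   sp_ctrl> activate config hadoop_demo
```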
Now that replication has started, let's go back to the SharePlex Connector for Hadoop. To take a snapshot of the table, I'll run the connector snapshot command. I'll specify the table name and a semicolon as the field separator, and I'll use the -e flag so that the command also creates tables in Hive.
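An invocation along these lines. The script name and the table/separator flags are assumptions for illustration; the -e flag for creating Hive tables is the one named in the demo.

```shell
# Take an initial snapshot of the source table
# (script name and -t/-s flags are assumptions; table name is illustrative)
./conn_snapshot.sh -t demo.demo_tab -s ';' -e
```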
OK, the copy is complete. Let's take a look at the tables in Hive. Since I chose to replicate to both HDFS and HBase, I'll have two tables in Hive. The table in HDFS has a schema that matches the source table. The HBase version, however, has a key column; the connector uses the primary key as the key column in HBase, and all of the table columns are replicated as data values. And finally, let's start up the connector to complete the replication pipeline.
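The two Hive tables can be inspected with the Hive CLI. The table names here are illustrative; what matters is the schema difference between the HDFS-backed and HBase-backed versions described above.

```shell
# List the tables the snapshot created in Hive
hive -e "SHOW TABLES;"

# HDFS-backed table: schema matches the source table
hive -e "DESCRIBE demo_tab;"

# HBase-backed table: a key column (from the primary key),
# with the source columns stored as data values
hive -e "DESCRIBE demo_tab_hbase;"
```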
So now that everything is in place, let's make some changes on the source and see that they get replicated over to the target. I'll go back to my source and run a script that makes some changes to my source table. Then, back on my target, let's take a look in Hive and see that the changes were replicated. To learn more about SharePlex, please visit software.Dell.com/products/SharePlex.