Hi. This is Bill Romine with Dell Software. Today, I'm going to discuss synchronizing data from Oracle to Hadoop. Often, we wish to maintain a copy of data from an RDBMS such as Oracle in Hadoop, either in HDFS, in HBase, or in both.
Typically, the tool we'd use is Apache Sqoop. Sqoop will read the table data from Oracle and make a copy in HDFS or HBase, while optionally creating a schema in Hive. This creates a snapshot of the table as of the time of the copy. But in an active system, users will continuously make changes to the table, and the data in Hadoop will quickly become out of date.
It may not be practical to frequently refresh the snapshot using Sqoop. SharePlex may be used to replicate changes from Oracle to Hadoop and keep the copy up to date with minimal overhead. SharePlex works by replicating activity from Oracle's redo logs. As changes are made to the Oracle database, they are written to the redo log.
SharePlex captures these changes and reconstructs records representing the SQL operations. These records are sent via a JMS queue to our connector for Hadoop. The connector then posts these changes to HBase and HDFS. This process continually monitors for changes and keeps the tables in Hadoop in sync.
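
To make the flow above concrete, here is a minimal sketch of the general idea: start from a snapshot, then replay a queue of change records against it. This is not SharePlex's record format or the connector's code. The record fields (`op`, `key`, `columns`), the helper names `apply_change` and `drain`, and the in-memory dictionary standing in for the Hadoop copy are all assumptions made for illustration, and a standard-library `Queue` stands in for the JMS queue.

```python
from queue import Queue


def apply_change(table, change):
    """Apply one change record to an in-memory table keyed by primary key.

    The record layout (op / key / columns) is hypothetical, not SharePlex's.
    """
    op = change["op"]
    key = change["key"]
    if op == "insert":
        table[key] = dict(change["columns"])
    elif op == "update":
        # Only the changed columns are carried in the record.
        table.setdefault(key, {}).update(change["columns"])
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def drain(queue, table):
    """Apply every queued change in arrival order; return how many were applied."""
    applied = 0
    while not queue.empty():
        apply_change(table, queue.get())
        applied += 1
    return applied


# Snapshot of the table, as an initial bulk copy might have left it.
hadoop_copy = {
    101: {"name": "Ann", "city": "Austin"},
    102: {"name": "Bob", "city": "Boston"},
}

# Changes made on the source after the snapshot, in commit order.
changes = Queue()
changes.put({"op": "update", "key": 101, "columns": {"city": "Dallas"}})
changes.put({"op": "insert", "key": 103, "columns": {"name": "Cy", "city": "Chicago"}})
changes.put({"op": "delete", "key": 102})

applied_count = drain(changes, hadoop_copy)
```

Replaying the records in order matters in this sketch: an update followed by a delete of the same row leaves a different result than the reverse, so a real pipeline would need to preserve the order in which the source committed the changes.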