Hadoop provides a highly scalable computing and storage cluster that is well suited for data analytics and data warehousing. Although Hadoop often houses data collected from a variety of sources, traditional relational database management systems (RDBMs) remain a critical source of data for analytics.
To copy data from an RDBMS into Hadoop, organizations typically use Apache SQOOP that takes a snapshot of the data at a particular point in time. However, in the case of an active RDBMS, such as one that supports an online transaction processing (OLTP) application, the snapshots will quickly become out of date. Therefore, the data must be refreshed periodically to be useful for data analytics. But if the data set is large, the refresh process may be time-consuming and resource-intensive, to both the RDBMS and the Hadoop cluster.
This paper explores a better alternative which replicates changes from an Oracle database to a Hadoop cluster, maintaining a real-time or near realtime copy of the source tables that your organization needs to meet its data analytics requirements.