How Should Row Keys Be Generated In Hbase

14.12.2020 by admin

Since regions are seldom handled directly in client code and region names may change over time, coprocessor RPC calls use row keys to identify which regions should be used for the method invocations. Clients can call coprocessor Service methods against, for example, a single region by calling Table.coprocessorService(byte[]) with a single row key.

When using Spark with HBase and salted row keys, first convert the requested interval into keys, and then use the HBase Scan API to scan all the rows belonging to each key interval; you need to generate one scan for each interval. HBase uses an automatic sharding mechanism to distribute the load across the multiple servers belonging to the cluster.

HBase only supports a single row key per row, and it cannot be empty or null. The HBase Handler maps the primary key value into the HBase row key value. If the source table has multiple primary keys, then the primary key values are concatenated, separated by a pipe delimiter (|).

At the very least, a row key has to be a WritableComparable, so the most general case is either hadoop.io.BytesWritable or hbase.io.ImmutableBytesWritable.

Dec 28, 2014: HBase shards rows by regions, which are defined by a range of row keys. Every region in an HBase cluster is managed by a RegionServer process. Typically, there is a single RegionServer process per HBase node. As the amount of data grows, HBase splits regions and migrates the associated data to different nodes in the cluster for balancing purposes.
Jan 22, 2020: HBase row key design and generating UUIDs. To create UUIDs that avoid hotspotting, follow the HBase row key design patterns. A hotspot makes one node do all the work, which results in a long loading process.
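The hotspotting and salting ideas above can be illustrated with a small sketch. It is not taken from HBase or from the HBase Handler: the bucket count, the two-digit salt prefix, the "-" separator, and all function names are assumptions chosen for illustration. It shows a deterministic salt derived from a hash of the natural key, the one-scan-per-salt consequence for range reads, and a random UUID key as an alternative.

```python
import hashlib
import uuid

NUM_BUCKETS = 16  # hypothetical number of salt buckets


def salted_row_key(natural_key, buckets=NUM_BUCKETS):
    """Prefix the natural key with a salt derived from its hash.

    The same natural key always gets the same salt, so point lookups
    still work, while sequential keys spread across the buckets.
    """
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}-{natural_key}".encode("utf-8")


def salted_scan_ranges(start_key, stop_key, buckets=NUM_BUCKETS):
    """Return one (start, stop) row key pair per salt bucket.

    Reading a key interval from a salted table needs one scan per
    bucket, because the interval is scattered across all prefixes.
    """
    return [
        (f"{salt:02d}-{start_key}".encode("utf-8"),
         f"{salt:02d}-{stop_key}".encode("utf-8"))
        for salt in range(buckets)
    ]


def uuid_row_key():
    """Return a random 16-byte key; random keys avoid sequential hotspots."""
    return uuid.uuid4().bytes
```

The trade-off in this sketch is the usual one for salting: writes are spread evenly, but every interval read fans out into as many scans as there are buckets.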

Learn how to use the HBase Handler to populate HBase tables from existing Oracle GoldenGate supported sources.

6.1 Overview

HBase is an open source Big Data application that emulates much of the functionality of a relational database management system (RDBMS). Hadoop is specifically designed to store large amounts of unstructured data. Conversely, data stored in databases and replicated through Oracle GoldenGate is highly structured. HBase provides a way to maintain the important structure of data while taking advantage of the horizontal scaling that is offered by the Hadoop Distributed File System (HDFS).

6.2 Detailed Functionality

The HBase Handler takes operations from the source trail file and creates corresponding tables in HBase, and then loads change capture data into those tables.

Table names created in HBase map to the corresponding table names of the operations from the source trail file. Table names are case-sensitive.

For two-part table names (schema name and table name), the schema name maps to the HBase table namespace. For a three-part table name like Catalog.Schema.MyTable, the created HBase namespace would be Catalog_Schema. HBase table namespaces are case-sensitive. A null schema name is supported and maps to the default HBase namespace.

HBase has a concept similar to the database primary key, called the HBase row key. The HBase row key is the unique identifier for a table row. HBase only supports a single row key per row, and it cannot be empty or null. The HBase Handler maps the primary key value into the HBase row key value. If the source table has multiple primary keys, then the primary key values are concatenated, separated by a pipe delimiter (|). You can configure the HBase row key delimiter.
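The concatenation rule can be sketched as follows. This is only an illustration of the behavior described above, not the handler's code: the function name, the UTF-8 encoding, and the sample column values are assumptions.

```python
def build_row_key(pk_values, delimiter="|"):
    """Join primary key values into a single row key.

    Mirrors the rule in the text: multiple primary key values are
    concatenated with a configurable delimiter (pipe by default),
    and the resulting row key may not be empty.
    """
    if not pk_values:
        raise ValueError("at least one primary key value is required")
    key = delimiter.join(str(value) for value in pk_values)
    if key == "":
        raise ValueError("HBase row key cannot be empty")
    # Encoding is an assumption; HBase itself only sees bytes.
    return key.encode("utf-8")
```

For example, `build_row_key(["WILL", "1994-09-30", 144])` yields `b"WILL|1994-09-30|144"`, and passing `delimiter="#"` changes only the separator.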

If there are no primary or unique keys on the source table, then Oracle GoldenGate behaves as follows:

If KEYCOLS is specified, then it constructs the key based on the specifications defined in the KEYCOLS clause.
If KEYCOLS is not specified, then it constructs a key based on the concatenation of all eligible columns of the table.

The result is that the value of every column is concatenated to generate the HBase rowkey. However, this is not a good practice.

Workaround: Use the replicat mapping statement to identify one or more primary key columns. For example: MAP QASOURCE.TCUSTORD, TARGET QASOURCE.TCUSTORD, KEYCOLS (CUST_CODE);

HBase has the concept of a column family. A column family is a way to group column data, and every HBase column must belong to a single column family. The HBase Handler supports only a single column family per table, which defaults to cf. You can configure the column family name. However, after a table is created with a specific column family name, reconfiguring the column family name in the HBase Handler without first modifying or dropping the table results in an abend of the Oracle GoldenGate Replicat processes.

6.3 Setting Up and Running the HBase Handler

HBase must run either collocated with the HBase Handler process or on a machine that is reachable over the network from the machine hosting the HBase Handler process. The underlying HDFS single instance or clustered instance serving as the repository for HBase data must also be running.

Instructions for configuring the HBase Handler components and running the handler are described in this section.

6.3.1 Classpath Configuration

For the HBase Handler to connect to HBase and stream data, the hbase-site.xml file and the HBase client jars must be configured in the gg.classpath variable. The HBase client jars must match the version of HBase to which the HBase Handler is connecting. The HBase client jars are not shipped with the Oracle GoldenGate for Big Data product.

HBase Handler Client Dependencies lists the required HBase client jars by version.

The default location of the hbase-site.xml file is HBase_Home/conf.

The default location of the HBase client JARs is HBase_Home/lib/*.

If the HBase Handler is running on Windows, follow the Windows classpathing syntax.

The gg.classpath must be configured exactly as described. The path to the hbase-site.xml file must contain only the directory path, with no wildcard appended; including the * wildcard in the path to the hbase-site.xml file makes the file inaccessible. Conversely, the path to the dependency jars must include the * wildcard character so that all the jar files in that directory are included in the classpath. Do not use *.jar. The following is an example of a correctly configured gg.classpath variable:

gg.classpath=/var/lib/hbase/lib/*:/var/lib/hbase/conf
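The three rules above are easy to get wrong, so a small pre-flight check can help. The following is a sketch only: `check_gg_classpath` is a hypothetical helper, not part of Oracle GoldenGate, and it checks just the rules stated in this section.

```python
def check_gg_classpath(value, sep=":"):
    """Return a list of problems found in a gg.classpath value.

    An empty list means the value follows the rules in the text:
    a bare '*' wildcard for the jar directory, never '*.jar', and a
    plain directory (no wildcard) for hbase-site.xml.
    Use sep=";" for Windows-style classpaths.
    """
    problems = []
    entries = [entry for entry in value.split(sep) if entry]
    for entry in entries:
        if entry.endswith("*.jar"):
            problems.append(f"{entry}: use '*' rather than '*.jar'")
    if not any(entry.endswith("*") for entry in entries):
        problems.append("no wildcard entry for the HBase client jars")
    if not any("*" not in entry for entry in entries):
        problems.append("no plain directory entry for hbase-site.xml")
    return problems
```

Calling it with the example value `/var/lib/hbase/lib/*:/var/lib/hbase/conf` returns an empty list.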

6.3.2 HBase Handler Configuration

The following are the configurable values for the HBase Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).

To enable the selection of the HBase Handler, you must first configure the handler type by specifying gg.handler.name.type=hbase and the other HBase properties as follows:

Table 6-1 HBase Handler Configuration Properties

Property	Required/Optional	Legal Values	Default	Explanation
`gg.handlerlist`	Required	Any string.	None	Provides a name for the HBase Handler. The HBase Handler name then becomes part of the property names listed in this table.
`gg.handler.name.type`	Required	`hbase`	None	Selects the HBase Handler for streaming change data capture into HBase.
`gg.handler.name.hBaseColumnFamilyName`	Optional	Any string legal for an HBase column family name.	`cf`	Column family is a grouping mechanism for columns in HBase. The HBase Handler only supports a single column family in the 12.2 release.
`gg.handler.name.HBase20Compatible`	Optional	`true` | `false`	`false` (HBase 1.0 compatible)	HBase 2.x removed methods and changed object hierarchies, which broke binary compatibility with HBase 1.x. Set this property to `true` to correctly interface with HBase 2.x; otherwise HBase 1.x compatibility is used.
`gg.handler.name.includeTokens`	Optional	`true` | `false`	`false`	Using `true` indicates that token values are included in the output to HBase. Using `false` means token values are not included.
`gg.handler.name.keyValueDelimiter`	Optional	Any string.	`=`	Provides a delimiter between a key and its value in a map. For example, `key=value,key1=value1,key2=value2`. Tokens are mapped values. Configuration value supports `CDATA[]` wrapping.
`gg.handler.name.keyValuePairDelimiter`	Optional	Any string.	`,`	Provides a delimiter between key value pairs in a map. For example, `key=value,key1=value1,key2=value2`. Tokens are mapped values. Configuration value supports `CDATA[]` wrapping.
`gg.handler.name.encoding`	Optional	Any encoding name or alias supported by Java (Footnote 1). For a list of supported options, see `https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html`.	The native system encoding of the machine hosting the Oracle GoldenGate process	Determines the encoding of values written to HBase. HBase values are written as bytes.
`gg.handler.name.pkUpdateHandling`	Optional	`abend` | `update` | `delete-insert`	`abend`	Configures how the HBase Handler handles update operations that change a primary key. Primary key operations can be problematic for the HBase Handler and require special consideration. `abend`: the process ends abnormally. `update`: the process treats this as a normal update. `delete-insert`: the process treats this as a delete and an insert. The full before image is required for this feature to work properly. This can be achieved by using full supplemental logging in Oracle Database. Without full before and after row images, the insert data will be incomplete.
`gg.handler.name.nullValueRepresentation`	Optional	Any string.	`NULL`	Allows you to configure what is sent to HBase in the case of a NULL column value. The default is `NULL`. Configuration value supports `CDATA[]` wrapping.
`gg.handler.name.authType`	Optional	`kerberos`	None	Setting this property to `kerberos` enables Kerberos authentication.
`gg.handler.name.kerberosKeytabFile`	Optional (Required if `authType=kerberos`)	Relative or absolute path to a Kerberos `keytab` file.	-	The `keytab` file allows the HBase Handler to access a password to perform a `kinit` operation for Kerberos security.
`gg.handler.name.kerberosPrincipal`	Optional (Required if `authType=kerberos`)	A legal Kerberos principal name (for example, `user/FQDN@MY.REALM`)	-	The Kerberos principal name for Kerberos authentication.
`gg.handler.name.rowkeyDelimiter`	Optional	Any string.	`|`	Configures the delimiter between primary key values from the source table when generating the HBase `rowkey`. This property supports `CDATA[]` wrapping of the value to preserve whitespace if you want to delimit incoming primary key values with one or more whitespace characters.
`gg.handler.name.setHBaseOperationTimestamp`	Optional	`true` | `false`	`true`	Set to `true` to set the timestamp for HBase operations in the HBase Handler instead of allowing HBase to assign the timestamps on the server side. This property can be used to solve the problem of a row delete followed by an immediate reinsert of the row not showing up in HBase; see HBase Handler Delete-Insert Problem.
`gg.handler.name.omitNullValues`	Optional	`true` | `false`	`false`	Set to `true` to omit null fields from being written.

Footnote 1: See Java Internationalization Support at https://docs.oracle.com/javase/8/docs/technotes/guides/intl/.

6.3.3 Sample Configuration

The following is a sample configuration for the HBase Handler from the Java Adapter properties file:
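As an illustration only, a minimal configuration can be assembled from the properties in Table 6-1. The handler name hbase is an assumption (any string is legal for gg.handlerlist), and the values shown are either the documented defaults or arbitrary examples rather than recommendations.

```
# Hypothetical handler name "hbase"; it replaces "name" in the property names.
gg.handlerlist=hbase
gg.handler.hbase.type=hbase
gg.handler.hbase.hBaseColumnFamilyName=cf
gg.handler.hbase.includeTokens=true
gg.handler.hbase.keyValueDelimiter=CDATA[=]
gg.handler.hbase.keyValuePairDelimiter=CDATA[,]
gg.handler.hbase.encoding=UTF-8
gg.handler.hbase.pkUpdateHandling=abend
gg.handler.hbase.nullValueRepresentation=CDATA[NULL]
```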

6.3.4 Performance Considerations

At each transaction commit, the HBase Handler performs a flush call to flush any buffered data to the HBase region server. This must be done to maintain write durability. Flushing to the HBase region server is an expensive call, and performance can be greatly improved by using the Replicat GROUPTRANSOPS parameter to group multiple smaller transactions in the source trail file into a larger single transaction applied to HBase. You enable this Replicat-based batching by adding the GROUPTRANSOPS parameter to the Replicat configuration file.

Operations from multiple transactions are grouped together into a larger transaction, and the transaction is committed only at the end of the grouped transaction.
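The effect of grouping on the number of flush calls can be estimated with a simplified model. The sketch below assumes one flush per committed (grouped) transaction and assumes that source transactions are accumulated until their combined operation count reaches the grouping threshold; `count_flushes` and its parameters are hypothetical names, and real Replicat behavior may differ in detail.

```python
def count_flushes(ops_per_txn, group_ops):
    """Estimate flush calls when source transactions are grouped.

    ops_per_txn: operation counts of the source transactions, in order.
    group_ops:   threshold of operations per grouped transaction.
    """
    flushes = 0
    pending = 0
    for ops in ops_per_txn:
        pending += ops
        if pending >= group_ops:
            flushes += 1      # grouped transaction commits, one flush
            pending = 0
    if pending:
        flushes += 1          # final partial group still has to commit
    return flushes
```

Under this model, 1000 single-operation transactions cost 1000 flushes with a threshold of 1 and 10 flushes with a threshold of 100.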

6.4 Security

You can secure HBase connectivity using Kerberos authentication. Follow the associated documentation for the HBase release to secure the HBase cluster. The HBase Handler can connect to Kerberos-secured clusters. The HBase hbase-site.xml file must be in the handler's classpath, with the hbase.security.authentication property set to kerberos and the hbase.security.authorization property set to true.

You have to include the directory containing the HDFS core-site.xml file in the classpath. Kerberos authentication is performed using the Hadoop UserGroupInformation class. This class relies on the Hadoop configuration property hadoop.security.authentication being set to kerberos to successfully perform the kinit command.

Additionally, you must set the following properties in the HBase Handler Java configuration file:
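Based on the Kerberos rows of Table 6-1, the handler-side settings would look something like the following sketch. The handler name hbase, the keytab path, and the principal are placeholders, not values from the original document.

```
# Hypothetical handler name and placeholder values.
gg.handler.hbase.authType=kerberos
gg.handler.hbase.kerberosKeytabFile=/path/to/user.keytab
gg.handler.hbase.kerberosPrincipal=user/FQDN@MY.REALM
```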

You may encounter the inability to decrypt the Kerberos password from the keytab file. This causes the Kerberos authentication to fall back to interactive mode which cannot work because it is being invoked programmatically. The cause of this problem is that the Java Cryptography Extension (JCE) is not installed in the Java Runtime Environment (JRE). Ensure that the JCE is loaded in the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.

6.5 Metadata Change Events

The HBase Handler seamlessly accommodates metadata change events, including adding a column or dropping a column. The only requirement is that the source trail file contains the metadata.

6.6 Additional Considerations

Classpath issues are common during the initial setup of the HBase Handler. The typical indicators are occurrences of ClassNotFoundException in the Java log4j log file, which mean that either the hbase-site.xml file or one or more of the required client jars are not included in the classpath. The HBase client jars do not ship with Oracle GoldenGate for Big Data, so you must resolve the required HBase client jars yourself. HBase Handler Client Dependencies includes a list of HBase client jars for each supported version. For instructions on configuring the classpath of the HBase Handler, see Classpath Configuration.

6.7 Troubleshooting the HBase Handler

Troubleshooting of the HBase Handler begins with the contents of the Java log4j file. Follow the directions in the Java Logging Configuration to configure the runtime to correctly generate the Java log4j log file.

6.7.1 Java Classpath

Issues with the Java classpath are common. A ClassNotFoundException in the Java log4j log file indicates a classpath problem. You can use the Java log4j log file to troubleshoot this issue. Setting the log level to DEBUG logs each of the jars referenced in the gg.classpath object to the log file. You can make sure that all of the required dependency jars are resolved by enabling DEBUG level logging, and then searching the log file for messages like the following:

6.7.2 HBase Connection Properties

The contents of the HBase hbase-site.xml file (including default settings) are output to the Java log4j log file when the logging level is set to DEBUG or TRACE. This output shows the connection properties to HBase. Search for the following in the Java log4j log file.

A common problem is that the hbase-site.xml file is not included in the classpath, or the path to the hbase-site.xml file is incorrect. In this case, the HBase Handler cannot establish a connection to HBase, and the Oracle GoldenGate process abends. The following error is reported in the Java log4j log.

Verify that the classpath correctly includes the hbase-site.xml file and that HBase is running.

6.7.3 Logging of Handler Configuration

The Java log4j log file contains information on the configuration state of the HBase Handler. This information is output at the INFO log level. The following is a sample output:

6.7.4 HBase Handler Delete-Insert Problem

If you are using the HBase Handler with the gg.handler.name.setHBaseOperationTimestamp=false configuration property, then the source database may get out of sync with data in the HBase tables. This is caused by the deletion of a row followed by the immediate reinsertion of the row. HBase creates a tombstone marker for the delete that is identified by a specific timestamp. This tombstone marker marks as deleted any row records in HBase that have the same row key and a timestamp before or equal to that of the tombstone marker. The problem can occur when the deleted row is immediately reinserted: the insert operation can inadvertently have the same timestamp as the delete operation, so the delete operation causes the subsequent insert operation to incorrectly appear as deleted.
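The masking behavior can be reproduced with a toy model of a single row. This is a deliberately simplified sketch of the tombstone rule described above, not HBase code; the `Row` class and its methods are hypothetical.

```python
class Row:
    """Toy model of one HBase row with timestamped puts and deletes."""

    def __init__(self):
        self.cells = []        # list of (timestamp_ms, value)
        self.tombstones = []   # timestamps of row deletes

    def put(self, ts, value):
        self.cells.append((ts, value))

    def delete(self, ts):
        self.tombstones.append(ts)

    def get(self):
        """Return the newest value not masked by a tombstone, or None."""
        cutoff = max(self.tombstones, default=-1)
        # A tombstone hides every cell with timestamp <= its own.
        live = [cell for cell in self.cells if cell[0] > cutoff]
        if not live:
            return None
        return max(live, key=lambda cell: cell[0])[1]
```

In this model, a put at timestamp 1005 issued right after a delete at 1005 stays invisible, whereas a put at 1006 is returned by `get()`.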

To work around this issue, set gg.handler.name.setHBaseOperationTimestamp=true, which does two things:

Sets the timestamp for row operations in the HBase Handler.
Detects a delete-insert operation and ensures that the insert operation has a timestamp that is after the delete.

The default for gg.handler.name.setHBaseOperationTimestamp is true, which means that the HBase Handler, rather than the HBase server, supplies the timestamp for a row operation. This prevents the HBase delete-reinsert out-of-sync problem.

Setting the row operation timestamp in the HBase Handler can have these consequences:

Since the timestamp is set on the client side, this could create problems if multiple applications are feeding data to the same HBase table.
If delete and reinsert is a common pattern in your use case, then the HBase Handler has to increment the timestamp 1 millisecond each time this scenario is encountered.

Processing cannot be allowed to get too far into the future, so the HBase Handler only allows the timestamp to increment 100 milliseconds into the future before it pauses processing so that the client-side HBase operation timestamp and real time are back in sync. When a delete-insert is used instead of an update in the source database, this sync scenario can be quite common. If it is, processing speeds may be affected, because the HBase timestamp is not allowed to go more than 100 milliseconds into the future.
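One way to picture this behavior is a client-side clock that hands out strictly increasing timestamps and waits when it runs too far ahead of real time. The sketch below is an interpretation of the description above, not the HBase Handler's implementation: the `OperationClock` class, its injectable `now_ms` and `sleep` callables, and the choice to wait for the full drift are all assumptions.

```python
import time

MAX_DRIFT_MS = 100  # cap stated in the text


class OperationClock:
    """Hands out strictly increasing millisecond timestamps."""

    def __init__(self, now_ms=lambda: int(time.time() * 1000), sleep=time.sleep):
        self._now_ms = now_ms    # returns current time in milliseconds
        self._sleep = sleep      # takes a duration in seconds
        self._last = -1

    def next_timestamp(self):
        now = self._now_ms()
        # Never reuse a timestamp: bump by 1 ms if real time has not advanced.
        candidate = max(now, self._last + 1)
        drift = candidate - now
        if drift > MAX_DRIFT_MS:
            # Too far into the future: wait so real time can catch up.
            self._sleep(drift / 1000.0)
        self._last = candidate
        return candidate
```

With a frozen clock, the first 101 calls return consecutive timestamps without waiting, and the 102nd call triggers a wait, which illustrates why frequent delete-insert pairs can slow processing.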