[Hive] Accessing Your Own Hive Metastore from Spark

Running your own Hive Metastore makes it easier to manage your own data: the metastore links table definitions to the data sitting on your big-data platform. It can be deployed on top of different databases, such as MySQL or Microsoft SQL Server.

Initializing the Hive Metastore

You can use schematool, a utility that ships with the Apache Hive project. First download a suitable Apache Hive release from https://dlcdn.apache.org/hive/; taking Apache Hive 3.1.3 as an example, run the following commands:

wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xvzf apache-hive-3.1.3-bin.tar.gz
cd apache-hive-3.1.3-bin/conf
cp hive-default.xml.template hive-site.xml
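
schematool relies on a local Hadoop installation being resolvable. Before running it, set up the environment along these lines; the paths below are assumptions taken from the logs later in this post, so adjust them to your machine:

```shell
# Assumed install locations (from the console logs in this post) - adjust as needed
export HADOOP_HOME=/usr/local/hadoop-3.3.0
export HIVE_HOME=/tmp/apache-hive-3.1.3-bin
export PATH="$HIVE_HOME/bin:$HADOOP_HOME/bin:$PATH"
```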

As covered in a separate post on pointing Hive at MySQL: after copying hive-site.xml, replace the parameters listed below with values for your own database, then run the schematool command that follows.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastoreserver.mysql.database.azure.com:3306/metastoredb</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>admin</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>

Finally, run:

schematool -dbType mysql -initSchema
Note: running this last schematool command triggered a series of problems.

guava-19.0.jar version conflict

The apache-hive-3.1.3-bin/lib directory ships a guava-19.0.jar by default, which can conflict with the guava version already on the classpath (for example the one bundled with Hadoop) and produce the error message below; removing the jar resolves the error.

root@c4916eacdf8342e0b5ba9cf10a601a11000000:/tmp/apache-hive-3.1.3-bin/bin# ./schematool -dbType mysql -userName atgenomix@metastoretest -initSchema
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
	at org.apache.hadoop.conf.Configuration.set(Configuration.java:1380)
	at org.apache.hadoop.conf.Configuration.set(Configuration.java:1361)
	at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:536)
	at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:554)
	at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:448)
	at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5144)
	at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:5107)
	at org.apache.hive.beeline.HiveSchemaTool.<init>(HiveSchemaTool.java:96)
	at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1473)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
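
Concretely, the fix is to delete the guava-19.0.jar bundled with Hive. A sketch, assuming Hadoop 3.3.0, whose common/lib directory ships guava-27.0-jre.jar as an optional newer replacement (check the exact version in your own installation):

```shell
# Remove the conflicting guava bundled with Hive
rm /tmp/apache-hive-3.1.3-bin/lib/guava-19.0.jar
# Optionally reuse the guava that ships with Hadoop (version is an assumption for Hadoop 3.3.0)
cp /usr/local/hadoop-3.3.0/share/hadoop/common/lib/guava-27.0-jre.jar /tmp/apache-hive-3.1.3-bin/lib/
```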

ClassNotFoundException: com.mysql.cj.jdbc.Driver

To fix this, download the MySQL JDBC driver jar into the Hive lib directory (note that the logs from here on were captured with the apache-hive-4.0.0-alpha-1-bin package):

cd apache-hive-4.0.0-alpha-1-bin/lib
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.30/mysql-connector-java-8.0.30.jar

Before the driver is in place, schematool fails like this:

root@c4916eacdf8342e0b5ba9cf10a601a11000000:/tmp/apache-hive-4.0.0-alpha-1-bin/bin# ./schematool -dbType mysql -initSchema
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/apache-hive-4.0.0-alpha-1-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Initializing the schema to: 4.0.0-alpha-1
Metastore connection URL:	 jdbc:mysql://metastoreserver.mysql.database.azure.com:3306/metastoredb
Metastore connection Driver :	 com.mysql.cj.jdbc.Driver
Metastore connection User:	 admin
Failed to load driver
Underlying cause: java.lang.ClassNotFoundException : com.mysql.cj.jdbc.Driver
Use --verbose for detailed stacktrace.
*** schemaTool failed ***

Output of a successful database initialization:

root@c4916eacdf8342e0b5ba9cf10a601a11000000:/tmp/apache-hive-4.0.0-alpha-1-bin/bin# ./schematool -dbType mysql -userName atgenomix@metastoretest -initSchema
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/apache-hive-4.0.0-alpha-1-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Initializing the schema to: 4.0.0-alpha-1
Metastore connection URL:	 jdbc:mysql://metastoreserver.mysql.database.azure.com:3306/metastoredb
Metastore connection Driver :	 com.mysql.cj.jdbc.Driver
Metastore connection User:	 admin@metastoreserver
Starting metastore schema initialization to 4.0.0-alpha-1
Initialization script hive-schema-4.0.0-alpha-1.mysql.sql

Initialization script completed
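
After a successful initialization you can double-check what was written. schematool also has an -info mode that prints the connection settings and the current schema version; a quick sanity check (output depends on your setup):

```shell
# Print metastore connection info and schema version (run from the Hive bin directory)
./schematool -dbType mysql -info
```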

Below is a screenshot of the MySQL database: with Hive schema version 4.0.0 there is an additional catalogs level!
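
You can also inspect this without a GUI: in recent metastore schemas the catalog level is stored in a table named CTLGS (table and column names taken from the upstream MySQL schema script; verify them against your own database):

```shell
# List the catalogs recorded in the metastore database (host and credentials assumed from above)
mysql -h metastoreserver.mysql.database.azure.com -u admin -p \
  -e "SELECT CTLG_ID, NAME, LOCATION_URI FROM metastoredb.CTLGS;"
```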

Spark configuration for connecting to the Hive Metastore

When working with Spark SQL these days, many applications use Delta Lake, which requires the following settings:

spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.jars.packages io.delta:delta-core_2.12:2.1.0

This example uses Spark 3.3.0; the metastore.version bundled via spark-hive_2.12-3.3.0.jar is 2.3.9.

spark.sql.hive.metastore.version 2.3.9
spark.hadoop.datanucleus.autoCreateSchema true
spark.hadoop.datanucleus.fixedDatastore false
spark.hadoop.datanucleus.schema.autoCreateTables true
spark.databricks.delta.schema.autoMerge.enabled true

To have Spark connect to the Hive Metastore, the following settings are also required:

spark.hadoop.javax.jdo.option.ConnectionUserName admin@metastoreserver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://metastoreserver.mysql.database.azure.com:3306/metastore?createDatabaseIfNotExist=true&serverTimezone=UTC
spark.hadoop.javax.jdo.option.ConnectionPassword password
spark.hadoop.javax.jdo.option.ConnectionDriverName com.mysql.cj.jdbc.Driver
spark.sql.warehouse.dir abfss://container@storageaccount.dfs.core.windows.net/user/hive/warehouse/
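
Put together, these settings can live in spark-defaults.conf or be passed on the command line. A sketch of launching spark-sql with the values from this post (package versions, hostnames, and credentials are the assumptions made above):

```shell
# Launch spark-sql against the external metastore, using the configs from this post.
# Quoting the ConnectionURL matters because of the '&' in the JDBC options.
spark-sql \
  --packages io.delta:delta-core_2.12:2.1.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.sql.hive.metastore.version=2.3.9 \
  --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.cj.jdbc.Driver \
  --conf "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://metastoreserver.mysql.database.azure.com:3306/metastore?createDatabaseIfNotExist=true&serverTimezone=UTC" \
  --conf spark.hadoop.javax.jdo.option.ConnectionUserName=admin@metastoreserver \
  --conf spark.hadoop.javax.jdo.option.ConnectionPassword=password \
  -e "SHOW DATABASES;"
```

If everything is wired up correctly, the databases registered in the external metastore should be listed.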