[Spark] Common Problems and Discussion

This post is an ongoing record of the problems encountered while developing with Spark and Hadoop, together with the corresponding reference material, so that developers who run into similar issues have a third-party viewpoint to draw inspiration from!

Performance Tuning

https://spark.apache.org/docs/latest/sql-performance-tuning.html

ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)

When this happens, try increasing the timeout through spark.sql.broadcastTimeout; the default is 300 seconds!
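
A minimal sketch, assuming Java (the app name and the 600-second value are illustrative):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("broadcast-timeout-demo")            // hypothetical app name
        .config("spark.sql.broadcastTimeout", "600")  // raise from the 300s default
        .getOrCreate();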

Kerberos

https://spark.apache.org/docs/latest/running-on-yarn.html#kerberos

Sometimes you will run into the following access-control error:

WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 13, executor 1): org.apache.hadoop.security.AccessControlException: Authentication required
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:476)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:119)

This most likely means Kerberos authentication is not set up correctly; try listing the concrete file system addresses in spark.yarn.access.hadoopFileSystems.
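
A minimal sketch of the submit command, assuming a keytab-based setup (the principal, keytab path, and namenode addresses are all hypothetical): list every file system the job will touch so YARN can obtain delegation tokens for each of them.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal etl-user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/etl-user.keytab \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://nn1.example.com:8020,webhdfs://nn2.example.com:50070 \
  --class com.example.MyApp my-app.jar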

SparkContext Initialization

Sometimes we get the following error message. It happens when the SparkSession is initialized with setMaster("local[*]") in the code, but the job is then submitted with deploy-mode cluster; a sketch of the fix follows the log below.

INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13 
ERROR ApplicationMaster:91 - Uncaught exception: java.lang.IllegalStateException: User did not initialize spark context! 
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498) 
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
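
A minimal sketch, assuming Java: keep setMaster() out of the code entirely and let spark-submit decide, so the same jar runs locally with --master local[*] and on YARN with --deploy-mode cluster.

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("cluster-safe-app")  // hypothetical app name; note: no setMaster() here
        .getOrCreate();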

Incorrect hive-site.xml Information

When connecting to a Hive database, the connection goes through the Metastore. The Metastore information normally lives in hive-site.xml, so if hive-site.xml contains wrong information, an error like the following is produced:

java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/session/SessionState when creating Hive client 
using classpath: file:/home/mountain/hv/lib/, file:/home/mountain/hp/lib/
Please make sure that jars for your version of hive and hadoop are 
included in the paths passed to spark.sql.hive.metastore.jars.
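
If the metastore information is correct but the client jars cannot be loaded, the classpath can also be passed explicitly at submit time. A minimal sketch reusing the paths from the error above (the metastore version is hypothetical and must match the actual Hive installation):

spark-submit \
  --conf spark.sql.hive.metastore.version=1.2.1 \
  --conf "spark.sql.hive.metastore.jars=/home/mountain/hv/lib/*:/home/mountain/hp/lib/*" \
  --class com.example.MyApp my-app.jar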

Reading Multiple Parquet Paths at Once

Sometimes we need to read multiple files with sparkSession.read().parquet(). We can list the exact file paths we want to read:

Dataset<Row> input = sparkSession.read().parquet("/root/path1", "/root/path2");

But if we want to read multiple paths and use * as a wildcard over all possible partition directories, as in the following example:

Dataset<Row> input = sparkSession.read().parquet("/root/date=*/src=test/");

then the following error message appears:

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.

In that case we need to add the basePath option where the parquet files are read:

Dataset<Row> input = sparkSession.read().option("basePath", "/root/").parquet("/root/date=*/src=test/");

What these Spark parameters do:

spark.sql.parquet.mergeSchema: when true, merges the schemas gathered from all Parquet part-files while reading (default false).

spark.sql.hive.convertMetastoreParquet: when true (the default), Spark uses its built-in Parquet reader and writer for Hive Parquet tables instead of the Hive SerDe.
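
As a minimal sketch (reusing the hypothetical paths above), schema merging can also be enabled for a single read instead of session-wide:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> merged = sparkSession.read()
        .option("mergeSchema", "true")  // per-read equivalent of spark.sql.parquet.mergeSchema
        .option("basePath", "/root/")
        .parquet("/root/date=*/src=test/");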

Other useful references:

https://www.jianshu.com/p/0de92480b8b6

On the mechanics behind MapReduce and YARN:

https://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-yarn/

Cannot Connect to the Spark WebUI

On Linux, the author could not reach the WebUI because spark-submit itself failed at startup with the following error:

Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.spark.SparkConf$.<init>(SparkConf.scala:668)
	at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
	at org.apache.spark.SparkConf$$anonfun$getOption$1.apply(SparkConf.scala:375)
	at org.apache.spark.SparkConf$$anonfun$getOption$1.apply(SparkConf.scala:375)
	at scala.Option.orElse(Option.scala:289)
	at org.apache.spark.SparkConf.getOption(SparkConf.scala:375)
	at org.apache.spark.SparkConf.get(SparkConf.scala:250)
	at org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopConfigurations(SparkHadoopUtil.scala:473)
	at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:446)
	at org.apache.spark.deploy.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:383)
	at org.apache.spark.deploy.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:383)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(SparkSubmit.scala:383)
	at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:250)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketException: Bad address (ioctl(SIOCGIFCONF) failed)
	at java.net.NetworkInterface.getAll(Native Method)
	at java.net.NetworkInterface.getNetworkInterfaces(NetworkInterface.java:355)
	at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:922)
	at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:908)
	at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:908)
	at org.apache.spark.util.Utils$$anonfun$localCanonicalHostName$1.apply(Utils.scala:965)
	at org.apache.spark.util.Utils$$anonfun$localCanonicalHostName$1.apply(Utils.scala:965)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.util.Utils$.localCanonicalHostName(Utils.scala:965)
	at org.apache.spark.internal.config.package$.<init>(package.scala:282)
	at org.apache.spark.internal.config.package$.<clinit>(package.scala)
	... 17 more
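
The trace shows Utils.findLocalInetAddress failing while enumerating network interfaces. A possible workaround (the address below is a placeholder; use the host's real IP) is to pin the local address so Spark never needs to enumerate interfaces:

export SPARK_LOCAL_IP=127.0.0.1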

The error message raised on the Python side when initializing Spark:

Traceback (most recent call last):
  File "LanguageAutoInsertionSingle.py", line 19, in <module>
    sc = SparkContext("local", arguments["format"]+"-load-app")
  File "/usr/local/lib/python3.6/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/usr/local/lib/python3.6/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/usr/local/lib/python3.6/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/usr/local/lib/python3.6/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
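
The message only says that the JVM side died before the Py4J gateway reported its port. A common first thing to check (the path below is hypothetical) is that JAVA_HOME points at a working JDK visible to the Python process:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
export PATH=$JAVA_HOME/bin:$PATH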