[BigData] Building a Custom Apache Spark Distribution for ARM64
When building Spark-based applications, most people simply download the pre-built binaries from the official Spark website. However, while starting a Spark Standalone deployment in production, the following error appeared, related to the External Shuffle Service. The same error does not occur when Spark is run on Kubernetes, which is why 檸檬爸 needed to use the make-distribution.sh script that Spark provides to build a custom Apache Spark distribution for the ARM64 architecture.
01:52:12.422 ERROR SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[main,5,main]
java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /tmp/libleveldbjni-64-1-1645791443138296863.8: /tmp/libleveldbjni-64-1-1645791443138296863.8: cannot open shared object file: No such file or directory (Possible cause: can't load AMD 64-bit .so on a AARCH64-bit platform)]
at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182) ~[jline-2.14.6.jar:?]
at org.fusesource.hawtjni.runtime.Library.load(Library.java:140) ~[jline-2.14.6.jar:?]
at org.fusesource.leveldbjni.JniDBFactory.<clinit>(JniDBFactory.java:48) ~[leveldbjni-all-1.8.jar:1.8]
at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48) ~[spark-network-common_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:126) ~[spark-network-shuffle_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:99) ~[spark-network-shuffle_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.network.shuffle.ExternalBlockHandler.<init>(ExternalBlockHandler.java:81) ~[spark-network-shuffle_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.deploy.ExternalShuffleService.newShuffleBlockHandler(ExternalShuffleService.scala:82) ~[spark-core_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.deploy.ExternalShuffleService.<init>(ExternalShuffleService.scala:56) ~[spark-core_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:183) ~[spark-core_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:966) ~[spark-core_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:934) ~[spark-core_2.12-3.3.0.jar:3.3.0]
at org.apache.spark.deploy.worker.Worker.main(Worker.scala) ~[spark-core_2.12-3.3.0.jar:3.3.0]
01:52:12.428 INFO ShutdownHookManager - Shutdown hook called
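The key clue is buried in the UnsatisfiedLinkError: leveldbjni-all-1.8 ships only x86 native libraries, so on an aarch64 host the extracted .so simply cannot be loaded. As a quick sanity check (a sketch, assuming you run it from the unpacked Spark distribution, where the jar sits under jars/), you can list the natives the jar actually bundles:
# List the native libraries packed inside leveldbjni-all-1.8.jar;
# the stock artifact has no aarch64 entry, which is exactly why the
# load fails on an ARM64 host.
unzip -l jars/leveldbjni-all-1.8.jar | grep -Ei '\.(so|dll|jnilib)'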
To rebuild Spark from source for aarch64, note in particular that the following command must be executed under an arm64 Java runtime:
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Phadoop-3 -Dhadoop.version=3.3.4 -e -DskipTests
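Before kicking off the build, it is worth double-checking that both the shell and the JVM really are arm64. A minimal check with standard tools, nothing Spark-specific:
# macOS on Apple Silicon reports arm64; Linux reports aarch64
uname -m
# The active JDK must also be an arm64 build; expect "os.arch = aarch64"
java -XshowSettings:properties -version 2>&1 | grep 'os.arch'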
When the build starts, the following messages let you verify that the correct build environment is in use. Since the goal here is to package for ARM64, os.detected.arch must be aarch_64.
[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: osx
[INFO] os.detected.arch: aarch_64
[INFO] os.detected.version: 14.6
[INFO] os.detected.version.major: 14
[INFO] os.detected.version.minor: 6
[INFO] os.detected.classifier: osx-aarch_64
[INFO] ------------------------------------------------------------------------
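If os.detected.arch instead comes back as x86_64 on an Apple Silicon machine, the build was most likely launched from an x86_64 JDK running under Rosetta. One way to force the arm64 slice on macOS (assuming an arm64 JDK is installed) is:
# macOS only: run the build explicitly as arm64 to bypass Rosetta
arch -arm64 ./dev/make-distribution.sh --name custom-spark --pip --r --tgz \
  -Psparkr -Phive -Phive-thriftserver -Pyarn -Pkubernetes \
  -Phadoop-3 -Dhadoop.version=3.3.4 -e -DskipTests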
If the Maven compilation succeeds, a summary like the following is displayed:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.5.3:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 2.998 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 4.224 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 2.017 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 3.241 s]
[INFO] Spark Project Common Utils ......................... SUCCESS [ 8.526 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 4.579 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 3.266 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 2.728 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 2.892 s]
[INFO] Spark Project Core ................................. SUCCESS [01:08 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 11.242 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 15.844 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 15.650 s]
[INFO] Spark Project SQL API .............................. SUCCESS [ 16.183 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [ 43.525 s]
[INFO] Spark Project SQL .................................. SUCCESS [ 42.269 s]
[INFO] Spark Project ML Library ........................... SUCCESS [ 48.001 s]
[INFO] Spark Project Tools ................................ SUCCESS [ 4.290 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 19.292 s]
[INFO] Spark Project REPL ................................. SUCCESS [ 7.000 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 32.032 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 11.904 s]
[INFO] Spark Project Kubernetes ........................... SUCCESS [ 11.906 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [ 14.014 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 4.154 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 6.091 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 9.020 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 11.082 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 26.970 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 2.161 s]
[INFO] Spark Avro ......................................... SUCCESS [ 9.767 s]
[INFO] Spark Project Connect Common ....................... SUCCESS [ 24.701 s]
[INFO] Spark Protobuf ..................................... SUCCESS [ 8.417 s]
[INFO] Spark Project Connect Server ....................... SUCCESS [ 21.896 s]
[INFO] Spark Project Connect Client ....................... SUCCESS [ 25.447 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 09:06 min
[INFO] Finished at: 2025-07-18T23:20:09+08:00
[INFO] ------------------------------------------------------------------------
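Once make-distribution.sh finishes, the tarball is written to the root of the Spark source tree, named spark-<version>-bin-<name>.tgz after the --name flag. A quick way to locate and unpack it (the glob below assumes the default naming scheme):
# The --tgz flag drops the distribution tarball in the repo root
ls spark-*-bin-custom-spark.tgz
tar -xzf spark-*-bin-custom-spark.tgz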
The script then packages the remaining client components such as pyspark and SparkR. When 檸檬爸 first ran the build, the packaging failed with the messages below, which at face value looked like syntax errors in a few Rd files.
Use inherits() (or maybe is()) instead.
* checking Rd files ... NOTE
checkRd: (-1) column_collection_functions.Rd:324: Lost braces
324 | \url{https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option}{Data Source Option}
| ^
checkRd: (-1) column_collection_functions.Rd:332: Lost braces
332 | \url{https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option}{Data Source Option}
| ^
checkRd: (-1) read.jdbc.Rd:45: Lost braces
45 | \url{https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option}{Data Source Option} in the version you use.
| ^
checkRd: (-1) read.json.Rd:14: Lost braces
14 | \url{https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option}{Data Source Option} in the version you use.}
| ^
checkRd: (-1) read.orc.Rd:14: Lost braces
14 | \url{https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option}{Data Source Option} in the version you use.}
| ^
checkRd: (-1) read.parquet.Rd:14: Lost braces
14 | \url{https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option}{Data Source Option} in the version you use.}
| ^
checkRd: (-1) read.text.Rd:14: Lost braces
14 | \url{https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option}{Data Source Option} in the version you use.}
| ^
checkRd: (-1) repartition.Rd:24: Lost braces in \itemize; meant \describe ?
checkRd: (-1) repartition.Rd:25-26: Lost braces in \itemize; meant \describe ?
checkRd: (-1) repartition.Rd:27-28: Lost braces in \itemize; meant \describe ?
checkRd: (-1) repartitionByRange.Rd:24-25: Lost braces in \itemize; meant \describe ?
checkRd: (-1) repartitionByRange.Rd:26-27: Lost braces in \itemize; meant \describe ?
checkRd: (-1) spark.kmeans.Rd:69: Lost braces; missing escapes or markup?
69 | (cluster centers of the transformed data), {is.loaded} (whether the model is loaded
| ^
checkRd: (-1) write.jdbc.Rd:28-29: Lost braces
28 | \url{https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option}{
| ^
checkRd: (-1) write.json.Rd:19: Lost braces
19 | \url{https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option}{Data Source Option} in the version you use.}
| ^
checkRd: (-1) write.orc.Rd:19: Lost braces
19 | \url{https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option}{Data Source Option} in the version you use.}
| ^
checkRd: (-1) write.parquet.Rd:19: Lost braces
19 | \url{https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option}{Data Source Option} in the version you use.}
| ^
checkRd: (-1) write.text.Rd:19: Lost braces
19 | \url{https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option}{Data Source Option} in the version you use.}
|
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking installed files from ‘inst/doc’ ... OK
* checking files in ‘vignettes’ ... OK
* checking examples ... OK
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ... SKIPPED
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes ... OK
* checking re-building of vignette outputs ... [8s/47s] OK
* checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
* checking PDF version of manual without index ... ERROR
Re-running with no redirection of stdout/stderr.
Hmm ... looks like a package
Converting parsed Rd's to LaTeX ..........................
Creating pdf output from LaTeX ...
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
pdflatex is not available
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
pdflatex is not available
Error in running tools::texi2pdf()
...
* checking for non-standard things in the check directory ... NOTE
Found the following files/directories:
‘.DS_Store’ ‘SparkR-manual.tex’
* checking for detritus in the temp directory ... OK
* DONE
Status: 1 ERROR, 1 WARNING, 8 NOTEs
Solution: install TinyTeX
As the log shows, the real blocker is not the Rd "Lost braces" NOTEs but the ERROR caused by the missing pdflatex: R CMD check needs a LaTeX toolchain to build the PDF version of the SparkR manual. Install TinyTeX in the local R environment with the following commands:
install.packages("tinytex")   # install the tinytex R package
tinytex::install_tinytex()    # download and install the TinyTeX LaTeX distribution
Then use the following command to verify the installation and make sure TinyTeX's binaries are on the PATH:
tinytex::tlmgr_path()   # put TinyTeX's binaries (tlmgr, pdflatex) on the PATH
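Since the original ERROR was literally "pdflatex is not available", a final sanity check from the shell (in a fresh terminal, so the updated PATH is picked up) confirms that R CMD check will now find it:
# pdflatex must resolve on the PATH for the PDF manual to build
which pdflatex
pdflatex --version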