The following describes in detail how to use Spark's distributed capabilities for data deduplication.
First, build a Spark standalone cluster (1 master, 2 workers)
- Install the JDK
a. Download the JDK package
Download address
b. Extract the archive
tar -zxvf ./jdk-8u181-linux-x64.tar.gz
c. Configure the environment
Add the following lines to the end of ~/.bashrc, then run source ~/.bashrc for them to take effect:
export JAVA_HOME=/xxx/jdk1.8.0_181
export PATH=$PATH:${JAVA_HOME}/bin
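If this step is scripted, it helps to make it safe to run more than once. The following is a minimal sketch, not part of the original procedure: `append_once` is a hypothetical helper name, and the demo writes to a scratch file unless `RC_FILE` is pointed at ~/.bashrc.

```shell
#!/usr/bin/env bash
# append_once FILE LINE: append LINE to FILE unless an identical line already exists.
# (append_once is a hypothetical helper, not a standard tool.)
append_once() {
    local file="$1" line="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demo on a scratch file; set RC_FILE=~/.bashrc to apply it for real.
RC_FILE="${RC_FILE:-$(mktemp)}"
append_once "$RC_FILE" 'export JAVA_HOME=/xxx/jdk1.8.0_181'
append_once "$RC_FILE" 'export PATH=$PATH:${JAVA_HOME}/bin'
append_once "$RC_FILE" 'export JAVA_HOME=/xxx/jdk1.8.0_181'   # already present, nothing is added
cat "$RC_FILE"
```

The single quotes keep `$PATH` and `${JAVA_HOME}` unexpanded, so they are evaluated when ~/.bashrc is sourced rather than when the line is written.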
d. Verify the installation
java -version #Shows the version number
- Install the Spark cluster
a. Download the Spark package
Download address
b. Extract and rename
tar -zxvf spark-2.3.1-bin-hadoop2.6.tgz
mv spark-2.3.1-bin-hadoop2.6 spark-2.3.1
c. Modify the configuration file
d. Configure
i. cd spark-2.3.1
ii. cp
iii. vim
SPARK_MASTER_PORT=7077 #master service port
SPARK_MASTER_HOST= #master node IP; find it with the ifconfig command. If ifconfig is not found, install it with apt install net-tools and run ifconfig again
JAVA_HOME=/data1/jdk- #JDK path on the master node; find it with echo $JAVA_HOME
PYSPARK_PYTHON=/data1/miniconda3/bin/python #Python interpreter
SPARK_MASTER_WEBUI_PORT=50010 #master web UI port; check for port conflicts with lsof -i:50010
SPARK_WORKER_WEBUI_PORT=50011 #worker web UI port
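The host and port settings combine into the master URL, which has the form spark://HOST:PORT. A minimal sketch, assuming the file being edited is conf/spark-env.sh (the file name is cut off in the steps above); the IP address is a placeholder and `CONF_DIR` defaults to a scratch directory:

```shell
#!/usr/bin/env bash
# Write a few of the settings to a scratch spark-env.sh and derive the master URL.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
cat > "$CONF_DIR/spark-env.sh" <<'EOF'
SPARK_MASTER_HOST=192.0.2.10      # placeholder IP from the documentation range
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=50010
SPARK_WORKER_WEBUI_PORT=50011
EOF

# Source the file inside a subshell so the current shell is left untouched.
MASTER_URL="$( . "$CONF_DIR/spark-env.sh" && echo "spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}" )"
echo "$MASTER_URL"
```

Because spark-env.sh is an ordinary shell script, a `#` comment after a value is allowed in it.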
Use which python to determine the path for PYSPARK_PYTHON.
If there is no Python environment, it is recommended to manage one with conda:
Install Miniconda on Linux:
1. For Linux systems, download the Miniconda installation script using the following command:
2. Next, run the following command to execute the installation script:
3. Follow the instructions of the installer to install. Install according to the default settings, or customize the settings as needed.
4. After the installation is complete, you may need to activate Miniconda. You can activate Miniconda by executing the following command:
Add the following to the end of ~/.bashrc (vim ~/.bashrc):
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/root/miniconda3/etc/profile.d/" ]; then
        . "/root/miniconda3/etc/profile.d/"
    else
        export PATH="/root/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
source ~/.bashrc
5. Now, to verify that Miniconda is installed successfully, try the following command to check the version of Miniconda:
conda --version
If the installation is successful, you can see the version number of the installed Miniconda.
After installation, you can use Miniconda to manage your Python environment and install various packages and dependencies.
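The download and install commands are missing from steps 1 and 2 above. The following sketch shows one common way to do it and is an assumption, not the original commands: it uses the installer published at repo.anaconda.com for x86_64 Linux and the installer's `-b` (batch) and `-p` (prefix) options. The `run` wrapper is a hypothetical helper that only prints each command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# run CMD...: print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

INSTALLER="Miniconda3-latest-Linux-x86_64.sh"            # x86_64 build assumed
URL="https://repo.anaconda.com/miniconda/${INSTALLER}"
PREFIX="${PREFIX:-$HOME/miniconda3}"                     # install location (assumption)

run wget "$URL" -O "/tmp/${INSTALLER}"
run bash "/tmp/${INSTALLER}" -b -p "$PREFIX"             # -b: no prompts, -p: install prefix
run "$PREFIX/bin/conda" --version
```

Batch mode skips the interactive prompts of step 3, so the ~/.bashrc block from step 4 still has to be added afterwards.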
e. Configure spark-defaults.conf
i. cp spark-defaults.conf.template spark-defaults.conf
ii. vim spark-defaults.conf
# Master URL: spark://<master node IP>:<port>
spark.master spark://
# Enable the event log
spark.eventLog.enabled true
# Event log directory
spark.eventLog.dir file://your_path/spark/eventLog
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.executor.instances 5
# Spark driver memory; the value from the earlier example is given as a reference
spark.driver.memory 32g
# spark.executor.memory 340g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# Executor JVM option that configures the Netty network library to improve network performance
spark.executor.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true
spark.memory.offHeap.enabled true
# Off-heap memory size
spark.memory.offHeap.size 4g
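Spark reads spark-defaults.conf as a Java properties file, in which `#` starts a comment only at the beginning of a line, so a comment placed after a value is likely to be read as part of that value. The check below is a heuristic sketch (`find_trailing_comments` is a hypothetical helper, and a value may legitimately contain `#`); it lists setting lines that still carry a `#`.

```shell
#!/usr/bin/env bash
# find_trailing_comments FILE: print "lineno:line" for setting lines that contain
# a '#' after the key. Full-line comments are ignored.
find_trailing_comments() {
    grep -nE '^[[:space:]]*[^#[:space:]][^#]*#' "$1"
}

# Demo on a scratch file.
DEMO_CONF="$(mktemp)"
cat > "$DEMO_CONF" <<'EOF'
# a full-line comment is fine
spark.eventLog.enabled true
spark.driver.memory 32g # driver memory
EOF

find_trailing_comments "$DEMO_CONF" || echo "no trailing comments found"
```

Running it against conf/spark-defaults.conf before starting the cluster gives a quick list of lines to move onto their own comment line.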
f. Configure the Spark worker nodes
i. Passwordless SSH login
- Add an entry for each worker to ~/.ssh/config. Also map the host names in the local hosts file (vim /etc/hosts), and run hostname on each machine to verify its name. Then edit the SSH config with vim ~/.ssh/config:
Host spark-worker0
HostName # worker IP; find it with the ifconfig command
User root
Port 22 # SSH login port of the machine
Host spark-worker1
User root # use root here, not ubuntu
Port 22
Port 22
Host spark-master-langchao
User root
Port 60022
Host spark-worker-langchao
User root
Port 22
- Copy the public key to the worker for passwordless login (run as the root user): ssh-copy-id -i ~/.ssh/ spark-worker1
If the master has no public key file, follow these steps to check for one and generate it:
1. First, check to see if a public key file named `id` already exists. You can execute the following command to check:
ls ~/.ssh/
2. If the file does not exist, a new SSH key pair can be generated using the `ssh-keygen` command. Execute the following command:
Follow the prompts to enter information such as path and password to generate a new SSH key pair.
3. Copy the public key from the generated public key file to the target host with the `ssh-copy-id` command, replacing `<your username>` and `<remote host>` with the correct user name and remote host name:
ssh-copy-id -i ~/.ssh/ spark-worker1
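The key steps above can be sketched as one script. The key file name is cut off in the text, so this sketch assumes an RSA key at ~/.ssh/id_rsa; the host names come from the ~/.ssh/config example, and `run` and `setup_passwordless` are hypothetical helpers. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# run CMD...: print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

KEY="${KEY:-$HOME/.ssh/id_rsa}"     # assumed private key path; the public key is $KEY.pub

# setup_passwordless HOST...: create a key pair if needed, then copy the public key to each host.
setup_passwordless() {
    local host
    # Generate a key pair (empty passphrase) only when no public key exists yet.
    [ -f "$KEY.pub" ] || run ssh-keygen -t rsa -b 4096 -N "" -f "$KEY"
    for host in "$@"; do
        run ssh-copy-id -i "$KEY.pub" "$host"
    done
}

setup_passwordless spark-worker0 spark-worker1
```

After a real run, `ssh spark-worker0 hostname` should return without asking for a password.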
ii. Configure the workers file (named slaves in Spark releases before 3.1): cp workers.template workers, then vim workers and append at the end:
spark-master #the master node can also act as a worker node at the same time
g. Copy the configured Spark package to each worker node
scp -r spark-2.3.1 <your username>@<remote host>:spark-standalone
The Spark path on every worker must be the same as on the master, or the worker cannot be started. Every worker must also be able to access the files being read (through a shared data directory or HDFS).
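With more than a couple of workers, the copy can be looped over the hosts listed in the workers file. A minimal sketch under stated assumptions: `list_workers` is a hypothetical helper, the demo workers file and `SPARK_DIR` are placeholders, and the host names are expected to resolve through the ~/.ssh/config entries. The loop only prints the scp commands.

```shell
#!/usr/bin/env bash
# list_workers FILE: print host names from a workers file, dropping comments and blank lines.
list_workers() {
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Demo workers file in a scratch location.
WORKERS_FILE="$(mktemp)"
printf '%s\n' 'spark-worker0 #first worker' '' 'spark-worker1' > "$WORKERS_FILE"

SPARK_DIR="${SPARK_DIR:-$HOME/spark-2.3.1}"    # assumed Spark path on the master
for host in $(list_workers "$WORKERS_FILE"); do
    # Copy into the same parent directory so the path matches on every node.
    # Remove "echo" to run the copy.
    echo scp -r "$SPARK_DIR" "${host}:$(dirname "$SPARK_DIR")/"
done
```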
- Start the Spark cluster
Start the master
./sbin/ #Stop the master
Start the workers
./sbin/ #Stop the workers
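The script names are cut off in the lines above. Spark's sbin directory provides start-master.sh and stop-master.sh for the master, and start-workers.sh and stop-workers.sh for the workers (named start-slaves.sh and stop-slaves.sh before Spark 3.1). Whether these are the scripts the original steps used is an assumption. The hypothetical helper below picks whichever worker script exists in a given install.

```shell
#!/usr/bin/env bash
# pick_worker_script SPARK_HOME: print the worker start script found in this install.
pick_worker_script() {
    local sbin="$1/sbin"
    if [ -f "$sbin/start-workers.sh" ]; then
        echo "$sbin/start-workers.sh"      # Spark 3.1 and later
    elif [ -f "$sbin/start-slaves.sh" ]; then
        echo "$sbin/start-slaves.sh"       # earlier releases such as 2.3.1
    else
        echo "no worker start script under $sbin" >&2
        return 1
    fi
}

# Demo against a scratch directory laid out like a Spark 2.x install.
DEMO_HOME="$(mktemp -d)"
mkdir -p "$DEMO_HOME/sbin"
touch "$DEMO_HOME/sbin/start-master.sh" "$DEMO_HOME/sbin/start-slaves.sh"
pick_worker_script "$DEMO_HOME"
```

On a real master node the call would be `"$(pick_worker_script /path/to/spark)"`, run after start-master.sh.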
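- Visit the web UI (the master web UI listens on SPARK_MASTER_WEBUI_PORT, 50010 in the configuration above)
- Submit a Spark task
Standalone submit command (use the examples jar that ships with your Spark version)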
cd bin
./spark-submit --master spark:// --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.12-3.4.0.jar 10000
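The examples jar name changes with the Spark and Scala versions, so a hard-coded name may not match the installed release. The sketch below builds the same SparkPi command from whatever jar is present; `find_examples_jar` and `build_submit` are hypothetical helpers, and the host, port, and demo directory are placeholders.

```shell
#!/usr/bin/env bash
# find_examples_jar SPARK_HOME: print the first examples jar shipped with the install.
find_examples_jar() {
    local jar
    for jar in "$1"/examples/jars/spark-examples_*.jar; do
        [ -f "$jar" ] && { echo "$jar"; return 0; }
    done
    return 1
}

# build_submit HOST PORT SPARK_HOME: print the SparkPi submit command for this install.
build_submit() {
    local jar
    jar="$(find_examples_jar "$3")" || { echo "examples jar not found under $3" >&2; return 1; }
    echo "$3/bin/spark-submit --master spark://$1:$2 --class org.apache.spark.examples.SparkPi $jar 10000"
}

# Demo against a scratch directory that mimics a Spark 2.3.1 layout.
DEMO_HOME="$(mktemp -d)"
mkdir -p "$DEMO_HOME/examples/jars"
touch "$DEMO_HOME/examples/jars/spark-examples_2.11-2.3.1.jar"
build_submit 192.0.2.10 7077 "$DEMO_HOME"
```

The printed command can be reviewed and then pasted into a shell on the master node.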