Managing wafer yield is one of the most important tasks for semiconductor manufacturers, and a great deal of effort has gone into yield enhancement in both industry and academia. Thanks to advances in IoT and data analytics, huge amounts of process operational data, such as process parameter indices, equipment condition data, and historical manufacturing records, are collected and analyzed in real time. In this project I analyze wafer manufacturing big data for troubleshooting and fault detection using PySpark, in several steps:
- Pearson correlation
- Box plot (by median gap)
- PLSR
Semiconductor manufacturing is one of the most complex manufacturing processes: it involves hundreds of process steps, several kinds of wafers, many machines, re-entrant flows, and innumerable process parameters, so completing the whole process takes a few months. Also, since the manufacturing process is very sensitive at every step, yield management is one of the most important issues and is directly connected to the survival of a company. Here I briefly introduce the manufacturing process of a wafer.
- Wafer Manufacturing: the process of fabricating silicon wafers from the IC design, using masks and semiconductor processing machine tools.
- Wafer Processing:
  - Wafer fabrication: the process is roughly divided into several stations, each of which corresponds to several stages. Each stage corresponds to a certain type of machine processing (there are several machines of each type in the factory). A given machine type is used in many stages, and every wafer must be processed in sequence from start to finish.
  - Wafer probe: a wafer prober is a machine used to test integrated circuits. In this project we will focus on the parameters called WAT (Wafer Acceptance Test).
What are Spark and Hadoop?
Hadoop is an extremely powerful tool for distributed, scalable, and economical data storage, processing, and analysis. The volume of semiconductor manufacturing data is tremendous, so I chose Hadoop (HDFS) to store the data and PySpark to analyze it.
Simple commands
Copy file data.csv from the local disk to the user's home directory in HDFS
$ hdfs dfs -put data.csv data.csv
Get a directory listing of the user's home directory in HDFS
$ hdfs dfs -ls
Display the contents of the HDFS file /user/semiconductor/data.csv
$ hdfs dfs -cat /user/semiconductor/data.csv
Set up the PySpark environment
```python
#!/usr/bin/env python
# coding: utf-8

import os
import sys

# this part is used for pyspark submit
os.environ['PYSPARK_SUBMIT_ARGS'] = '--verbose --master=yarn --queue test pyspark-shell'
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/'
os.environ['YARN_CONF_DIR'] = '/etc/alternatives/hadoop-conf/'

# this line is used for spark 1.6
#os.environ['SPARK_HOME'] = '/opt/cloudera/parcels/CDH/lib/spark'
# this line is used for spark 2.2
os.environ['SPARK_HOME'] = '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2'

# this line is used for python 2.7
#os.environ['PYSPARK_PYTHON'] = '/usr/bin/python'
# this line is used for python 3.5
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

# start the PySpark shell (this creates the `spark` session and `sc` context)
#execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))   # python 2
exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())
```
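Running `shell.py` starts a PySpark session and exposes the `spark` object. As a quick sanity check, here is a minimal sketch that reads a CSV file from HDFS into a DataFrame; the path simply reuses the HDFS example above and should be adjusted to the real data location.

```python
# Sketch: read a CSV file stored in HDFS into a Spark DataFrame and take a first look.
# The path reuses the HDFS example above; adjust it to the actual data location.
df = spark.read.csv('/user/semiconductor/data.csv', header=True, inferSchema=True)
df.printSchema()
df.show(5)
```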
Let's recall our mission again:
Find out the relevant factors that affect yield. To do that, we can first delete some clearly irrelevant data manually.
In the directory wat_data:
| wat_data/Parameter_set | wat_data/Wat_root_cause |
|---|---|
| ![]() | ![]() |
The aa column in Parameter_set contains arbitrary numbers, and so does the Range column. The other columns in Wat_root_cause aren't really meaningful to the process either.
In the directory wat_data/wat:
As mentioned above, WAT is a test key that affects yield.
| wat.header | wat.first_raw |
|---|---|
| ![]() | ![]() |
In the directory FDC_data/stageXX:
SVID (Status Variables Identification) is the physical data collected by sensors embedded in the machines during the manufacturing process. To capture the physical nature of a given SVID, we usually transform it into Fault Detection and Classification (FDC) parameters using statistical indicators.
| stageXX.header and stageXX.row |
|---|
| ![]() |
From this picture we can see what the columns (toolid, chamberid, process, stage) represent. A sketch of the SVID-to-FDC transformation follows.
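As a minimal sketch of that transformation, raw sensor traces can be aggregated into statistical indicators per wafer, step, and SVID. The file path and the column names (waferid, step, svid, value) are assumptions for illustration; the provided stageXX files may already contain such indicators.

```python
# Sketch: summarize raw SVID sensor traces into FDC statistical indicators.
# Column names (waferid, step, svid, value) and the path are assumed for illustration.
import pyspark.sql.functions as func

svid_raw = spark.read.csv('FDC_data/stage01.csv', header=True, inferSchema=True)

fdc = (svid_raw
       .groupBy('waferid', 'step', 'svid')
       .agg(func.mean('value').alias('mean'),
            func.stddev('value').alias('std'),
            func.min('value').alias('min'),
            func.max('value').alias('max')))
fdc.show(5)
```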
Here is the code!!
Data preprocessing
- Merge all the DataFrames into one (a join sketch is shown below)
- Drop unnecessary columns
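A minimal sketch of this step, assuming the WAT measurements and the yield values can each be read into a DataFrame keyed by waferid; the file names and the dropped column names are assumptions.

```python
# Sketch: join the WAT measurements with the yield values on waferid,
# then drop columns that carry no useful information.
# File names and dropped column names are assumptions for illustration.
wat_df = spark.read.csv('wat_data/wat/wat.csv', header=True, inferSchema=True)
yield_df = spark.read.csv('wat_data/yield.csv', header=True, inferSchema=True)

merged = (wat_df.join(yield_df, on='waferid', how='inner')
                .drop('aa', 'Range'))   # columns identified as arbitrary earlier
```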
Pearson correlation
- Compute the Pearson correlation of each WAT column with yield and print the top ten (a sketch is shown below)
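Continuing from the join sketch above, one way to get the top-ten list is to compute the Pearson correlation of each WAT column against yield with `DataFrame.stat.corr` and sort by absolute value:

```python
# Sketch: Pearson correlation of every WAT column with yield, top ten by |r|.
# Assumes `merged` holds numeric WAT columns plus a numeric "yield" column.
wat_cols = [c for c in merged.columns if c.startswith('WAT')]
corr = {c: merged.stat.corr(c, 'yield') for c in wat_cols}

top10 = sorted(corr.items(), key=lambda kv: abs(kv[1]), reverse=True)[:10]
for name, r in top10:
    print(name, round(r, 3))
```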
Draw correlation plot
- Plot specific WAT columns against yield (a plotting sketch is shown below)
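A minimal plotting sketch, collecting one highly correlated WAT column (WAT1036, taken from the result below) to the driver and drawing it against yield with matplotlib:

```python
# Sketch: scatter plot of one top-correlated WAT column against yield.
import matplotlib.pyplot as plt

pdf = merged.select('WAT1036', 'yield').toPandas()
plt.scatter(pdf['WAT1036'], pdf['yield'], s=5)
plt.xlabel('WAT1036')
plt.ylabel('yield')
plt.title('WAT1036 vs. yield')
plt.show()
```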
Result
- Correlation matrix, top ten WAT: WAT1036, WAT2985, WAT2848, WAT748, WAT517, WAT1477, WAT33, WAT2064, WAT2086
Here is the code!!
Data preprocessing
- Transform data types
- Merge all the DataFrames
- Drop unnecessary columns and rename the remaining ones
Compute median
- Compute the median yield in each process stage first
- Then compute the median yield for each toolid
- Compute the median gap between the two (a sketch is shown below)

`from pyspark.sql.window import Window` and `import pyspark.sql.functions as func` are two useful tools in PySpark here.
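Here is a sketch of the median-gap computation under assumed column names (stage, toolid, yield); `percentile_approx` is used as a SQL expression for the group-wise median, and the same result could also be obtained with Window functions.

```python
# Sketch: median yield per process stage, median yield per toolid within a stage,
# and the gap between them. Column names (stage, toolid, yield) are assumptions.
import pyspark.sql.functions as func

stage_median = (df.groupBy('stage')
                  .agg(func.expr('percentile_approx(yield, 0.5)').alias('stage_median')))

tool_median = (df.groupBy('stage', 'toolid')
                 .agg(func.expr('percentile_approx(yield, 0.5)').alias('tool_median')))

median_gap = (tool_median.join(stage_median, on='stage')
                         .withColumn('median_gap',
                                     func.col('tool_median') - func.col('stage_median'))
                         .orderBy(func.abs(func.col('median_gap')).desc()))
median_gap.show(5)
```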
Draw box plot
- We can observe (or determine programmatically) the worst toolid in each step, and we list the top five toolids in specific process stages that influence the yield the most (a plotting sketch is shown below).
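A plotting sketch for one stage (Stage209, from the result below): the toolid groups are collected to the driver and drawn as box plots with matplotlib. Column names are the same assumptions as above.

```python
# Sketch: box plot of yield per toolid within one process stage.
import matplotlib.pyplot as plt
import pyspark.sql.functions as func

pdf = (df.filter(func.col('stage') == 'Stage209')
         .select('toolid', 'yield')
         .toPandas())

tools = sorted(pdf['toolid'].unique())
plt.boxplot([pdf.loc[pdf['toolid'] == t, 'yield'].values for t in tools], labels=tools)
plt.xlabel('toolid')
plt.ylabel('yield')
plt.title('Yield by toolid in Stage209')
plt.xticks(rotation=45)
plt.show()
```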
Result
- Top five toolids (by process stage): Stage209, Stage102, Stage200, Stage207, Stage95
Here is the code!! After the previous steps, we can concentrate on specific stages and toolids for further analysis, so that we can precisely figure out which SVID steps affect the yield.
Data preprocessing
- Transform data types
- Merge all the DataFrames (Stage2_SVIDX_StepX~Stage300_SVIDX_StepX, waferid, yield)
- Drop unnecessary columns and duplicate headers
- Drop missing values (NaN)
Using PLSR
- PLSR (Partial Least Squares Regression) is a linear regression model with multiple inputs X and multiple outputs Y
- The inputs X and outputs Y are first projected onto latent components through an axis rotation similar to principal component analysis (PCA), and the coefficients of the linear regression model are then estimated in that latent space
- Show the VIP score of each SVID step (a sketch is shown below)
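A sketch of the PLSR step using scikit-learn on data collected to the driver (workable when the wafer-by-feature matrix is comparatively small). The wide table name `fdc_wide` and the number of components are assumptions; the VIP score weights each feature's squared loading by the y-variance explained per component.

```python
# Sketch: PLSR on the wafer-by-SVID-step matrix and VIP scores per SVID step.
# `fdc_wide` (one row per wafer: SVID-step features + yield) is an assumed name.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

pdf = fdc_wide.toPandas()
feature_cols = [c for c in pdf.columns if c not in ('waferid', 'yield')]
X = pdf[feature_cols].values
y = pdf['yield'].values

pls = PLSRegression(n_components=5)   # number of latent components is illustrative
pls.fit(X, y)

# VIP: weight squared x-weights by the y-variance explained by each component.
t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
p = w.shape[0]
ssy = np.diag(t.T @ t @ q.T @ q)              # explained y-variance per component
vip = np.sqrt(p * (w ** 2 @ ssy) / ssy.sum())

for idx in np.argsort(vip)[::-1][:20]:        # top 20 VIP scores
    print(feature_cols[idx], round(vip[idx], 3))
```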
Result
- Top 20 VIP scores of SVID steps
I propose an approach to yield management that can handle the complex semiconductor manufacturing process and detect where faults occur.
- One limitation: it is impossible to restore the compressed (latent) features back to the original features for accurate troubleshooting.
- We know that Random Forest also provides feature importance, which tells us which features are used as root nodes (the most important ones).
- We must classify the data first, e.g. (yield > 70 → 1, yield ≤ 70 → 0), but the features chosen as the root depend greatly on how we classify the data.
- There are many parameters in model training, such as num_trees, max_depth, and so on, and they also greatly affect the outcome of feature selection (a sketch follows).
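A sketch of the Random Forest alternative with PySpark ML, using the yield threshold from the text as the label; the DataFrame name, the feature columns, and the parameter values are illustrative assumptions.

```python
# Sketch: Random Forest feature importance as an alternative to PLSR/VIP.
# `merged` (features + numeric yield) is an assumed DataFrame; parameters are illustrative.
import pyspark.sql.functions as func
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

labeled = merged.withColumn('label', (func.col('yield') > 70).cast('double'))

feature_cols = [c for c in labeled.columns if c.startswith('WAT')]
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')

rf = RandomForestClassifier(labelCol='label', featuresCol='features',
                            numTrees=100, maxDepth=5)
model = rf.fit(assembler.transform(labeled))

# Rank features by the importance the forest assigns to them.
importances = sorted(zip(feature_cols, model.featureImportances.toArray()),
                     key=lambda kv: kv[1], reverse=True)
print(importances[:10])
```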