Managing wafer yield is one of the most important tasks for semiconductor manufacturers, and a great deal of effort has gone into yield enhancement in both industry and academia. Thanks to advances in IoT and data analytics, huge amounts of process operational data, such as process parameter indices, equipment condition data, and historical manufacturing records, are collected and analyzed in real time. In this project I analyze wafer manufacturing big data for troubleshooting and fault detection using PySpark, in several steps:
- Pearson correlation
- Box plot (by median gap)
- PLSR
Semiconductor manufacturing is one of the most complex manufacturing processes: it involves hundreds of process steps, several kinds of wafers, many machines, re-entrant flows, and innumerable process parameters, so completing the whole process takes a few months. Also, since the manufacturing process is very sensitive at every step, yield management is one of the most important issues and is directly connected to the survival of a company. Here I briefly introduce the manufacturing process of a wafer.
- Wafer Manufacturing: the process of fabricating silicon wafers from the IC design, using masks and semiconductor processing machine tools.
- Wafer Processing:
  - Wafer fabrication: the process is roughly divided into several stations, each of which corresponds to several stages. Each stage corresponds to a certain type of machine processing (there are several machines of each type in the factory). A given machine type is used in many stages, and every wafer must be processed in sequence from start to finish.
  - Wafer probe: a wafer prober is a machine used to test integrated circuits. In this project we will focus on the parameters called WAT (Wafer Acceptance Test).
What are Spark and Hadoop?
Hadoop is an extremely powerful tool for distributed, scalable, and economical data storage, processing, and analysis. The volume of semiconductor manufacturing data is tremendous, so I chose Hadoop (HDFS) to store the data and PySpark to analyze it.
Simple commands
Copy file data.csv from the local disk to the user's home directory in HDFS
$ hdfs dfs -put data.csv data.csv
Get a directory listing of the user's home directory in HDFS
$ hdfs dfs -ls
Display the contents of the HDFS file /user/semiconductor/data.csv
$ hdfs dfs -cat /user/semiconductor/data.csv
Set up the PySpark environment
```python
#!/usr/bin/env python
# coding: utf-8

import os
import sys

# this part is used for pyspark submit
os.environ['PYSPARK_SUBMIT_ARGS'] = '--verbose --master=yarn --queue test pyspark-shell'
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/'
os.environ['YARN_CONF_DIR'] = '/etc/alternatives/hadoop-conf/'

# this line is used for spark 1.6
#os.environ['SPARK_HOME'] = '/opt/cloudera/parcels/CDH/lib/spark'
# this line is used for spark 2.2
os.environ['SPARK_HOME'] = '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2'

# this line is used for python 2.7
#os.environ['PYSPARK_PYTHON'] = '/usr/bin/python'
# this line is used for python 3.5
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

# start the PySpark shell (this creates the `spark` session and `sc` context)
#execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))   # python 2
exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())
```
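Running `shell.py` starts a PySpark session and exposes the `spark` object. As a quick sanity check, here is a minimal sketch that reads a CSV file from HDFS into a DataFrame; the path simply reuses the HDFS example above and should be adjusted to the real data location.

```python
# Sketch: read a CSV file stored in HDFS into a Spark DataFrame and take a first look.
# The path reuses the HDFS example above; adjust it to the actual data location.
df = spark.read.csv('/user/semiconductor/data.csv', header=True, inferSchema=True)
df.printSchema()
df.show(5)
```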
Let's recall our mission again:
Find out the relevant factors that affect yield. To do that, we can first delete some clearly irrelevant data manually.
In the directory wat_data:
| wat_data/Parameter_set | wat_data/Wat_root_cause |
|---|---|
| ![]() | ![]() |
The aa column in Parameter_set contains arbitrary numbers, and so does the Range column. The other columns in Wat_root_cause aren't really meaningful to the process either.
In the directory wat_data/wat:
As mentioned above, WAT is a test key that affects yield.
| wat.header | wat.first_raw |
|---|---|
| ![]() | ![]() |
In the directory FDC_data/stageXX:
SVID (Status Variables Identification) is the physical data collected by sensors embedded in the machines during the manufacturing process. To capture the physical nature of a given SVID, we usually transform it into Fault Detection and Classification (FDC) parameters using statistical indicators.
| stageXX.header and stageXX.row |
|---|
| ![]() |
From this picture we can see what the columns (toolid, chamberid, process, stage) represent. A sketch of the SVID-to-FDC transformation follows.
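As a minimal sketch of that transformation, raw sensor traces can be aggregated into statistical indicators per wafer, step, and SVID. The file path and the column names (waferid, step, svid, value) are assumptions for illustration; the provided stageXX files may already contain such indicators.

```python
# Sketch: summarize raw SVID sensor traces into FDC statistical indicators.
# Column names (waferid, step, svid, value) and the path are assumed for illustration.
import pyspark.sql.functions as func

svid_raw = spark.read.csv('FDC_data/stage01.csv', header=True, inferSchema=True)

fdc = (svid_raw
       .groupBy('waferid', 'step', 'svid')
       .agg(func.mean('value').alias('mean'),
            func.stddev('value').alias('std'),
            func.min('value').alias('min'),
            func.max('value').alias('max')))
fdc.show(5)
```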
Here is the code!!
Data preprocessing
- Merge all the DataFrames into one (a join sketch is shown below)
- Drop unnecessary columns
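A minimal sketch of this step, assuming the WAT measurements and the yield values can each be read into a DataFrame keyed by waferid; the file names and the dropped column names are assumptions.

```python
# Sketch: join the WAT measurements with the yield values on waferid,
# then drop columns that carry no useful information.
# File names and dropped column names are assumptions for illustration.
wat_df = spark.read.csv('wat_data/wat/wat.csv', header=True, inferSchema=True)
yield_df = spark.read.csv('wat_data/yield.csv', header=True, inferSchema=True)

merged = (wat_df.join(yield_df, on='waferid', how='inner')
                .drop('aa', 'Range'))   # columns identified as arbitrary earlier
```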
Pearson correlation
- Compute the Pearson correlation of each WAT column with yield and print the top ten (a sketch is shown below)
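Continuing from the join sketch above, one way to get the top-ten list is to compute the Pearson correlation of each WAT column against yield with `DataFrame.stat.corr` and sort by absolute value:

```python
# Sketch: Pearson correlation of every WAT column with yield, top ten by |r|.
# Assumes `merged` holds numeric WAT columns plus a numeric "yield" column.
wat_cols = [c for c in merged.columns if c.startswith('WAT')]
corr = {c: merged.stat.corr(c, 'yield') for c in wat_cols}

top10 = sorted(corr.items(), key=lambda kv: abs(kv[1]), reverse=True)[:10]
for name, r in top10:
    print(name, round(r, 3))
```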
Draw correlation plot
- Plot specific WAT columns against yield (a plotting sketch is shown below)
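A minimal plotting sketch, collecting one highly correlated WAT column (WAT1036, taken from the result below) to the driver and drawing it against yield with matplotlib:

```python
# Sketch: scatter plot of one top-correlated WAT column against yield.
import matplotlib.pyplot as plt

pdf = merged.select('WAT1036', 'yield').toPandas()
plt.scatter(pdf['WAT1036'], pdf['yield'], s=5)
plt.xlabel('WAT1036')
plt.ylabel('yield')
plt.title('WAT1036 vs. yield')
plt.show()
```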
Result
- Correlation matrix, top ten WAT: WAT1036, WAT2985, WAT2848, WAT748, WAT517, WAT1477, WAT33, WAT2064, WAT2086
Here is the code!!
Data preprocessing
- Transform data types
- Merge all the DataFrames
- Drop unnecessary columns and rename the remaining ones
Compute median
- Compute the median yield in each process stage first
- Then compute the median yield for each toolid
- Compute the median gap between the two (a sketch is shown below)

`from pyspark.sql.window import Window` and `import pyspark.sql.functions as func` are two useful tools in PySpark here.
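Here is a sketch of the median-gap computation under assumed column names (stage, toolid, yield); `percentile_approx` is used as a SQL expression for the group-wise median, and the same result could also be obtained with Window functions.

```python
# Sketch: median yield per process stage, median yield per toolid within a stage,
# and the gap between them. Column names (stage, toolid, yield) are assumptions.
import pyspark.sql.functions as func

stage_median = (df.groupBy('stage')
                  .agg(func.expr('percentile_approx(yield, 0.5)').alias('stage_median')))

tool_median = (df.groupBy('stage', 'toolid')
                 .agg(func.expr('percentile_approx(yield, 0.5)').alias('tool_median')))

median_gap = (tool_median.join(stage_median, on='stage')
                         .withColumn('median_gap',
                                     func.col('tool_median') - func.col('stage_median'))
                         .orderBy(func.abs(func.col('median_gap')).desc()))
median_gap.show(5)
```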
Draw box plot
- We can observe (or determine programmatically) the worst toolid in each step, and we list the top five toolids in specific process stages that influence the yield the most (a plotting sketch is shown below).
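A plotting sketch for one stage (Stage209, from the result below): the toolid groups are collected to the driver and drawn as box plots with matplotlib. Column names are the same assumptions as above.

```python
# Sketch: box plot of yield per toolid within one process stage.
import matplotlib.pyplot as plt
import pyspark.sql.functions as func

pdf = (df.filter(func.col('stage') == 'Stage209')
         .select('toolid', 'yield')
         .toPandas())

tools = sorted(pdf['toolid'].unique())
plt.boxplot([pdf.loc[pdf['toolid'] == t, 'yield'].values for t in tools], labels=tools)
plt.xlabel('toolid')
plt.ylabel('yield')
plt.title('Yield by toolid in Stage209')
plt.xticks(rotation=45)
plt.show()
```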
Result
- Top five toolids (by process stage): Stage209, Stage102, Stage200, Stage207, Stage95
Here is the code!! After the previous steps, we can concentrate on specific stages and toolids for further analysis, so that we can precisely figure out which SVID steps affect the yield.
Data preprocessing
- Transform data types
- Merge all the DataFrames (Stage2_SVIDX_StepX~Stage300_SVIDX_StepX, waferid, yield)
- Drop unnecessary columns and duplicate headers
- Drop missing values (NaN)
Using PLSR
- PLSR (Partial Least Squares Regression) is a linear regression model with multiple inputs X and multiple outputs Y
- The inputs X and outputs Y are first projected onto latent components through an axis rotation similar to principal component analysis (PCA), and the coefficients of the linear regression model are then estimated in that latent space
- Show the VIP score of each SVID step (a sketch is shown below)
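A sketch of the PLSR step using scikit-learn on data collected to the driver (workable when the wafer-by-feature matrix is comparatively small). The wide table name `fdc_wide` and the number of components are assumptions; the VIP score weights each feature's squared loading by the y-variance explained per component.

```python
# Sketch: PLSR on the wafer-by-SVID-step matrix and VIP scores per SVID step.
# `fdc_wide` (one row per wafer: SVID-step features + yield) is an assumed name.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

pdf = fdc_wide.toPandas()
feature_cols = [c for c in pdf.columns if c not in ('waferid', 'yield')]
X = pdf[feature_cols].values
y = pdf['yield'].values

pls = PLSRegression(n_components=5)   # number of latent components is illustrative
pls.fit(X, y)

# VIP: weight squared x-weights by the y-variance explained by each component.
t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
p = w.shape[0]
ssy = np.diag(t.T @ t @ q.T @ q)              # explained y-variance per component
vip = np.sqrt(p * (w ** 2 @ ssy) / ssy.sum())

for idx in np.argsort(vip)[::-1][:20]:        # top 20 VIP scores
    print(feature_cols[idx], round(vip[idx], 3))
```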
Result
- Top 20 VIP scores of SVID steps
I propose an approach to yield management that can handle the complex semiconductor manufacturing process and detect where faults occur.
- One limitation: it is impossible to restore the compressed (latent) features back to the original features for accurate troubleshooting.
- We know that Random Forest also provides feature importance, which tells us which features are used as root nodes (the most important ones).
- We must classify the data first, e.g. (yield > 70 → 1, yield ≤ 70 → 0), but the features chosen as the root depend greatly on how we classify the data.
- There are many parameters in model training, such as num_trees, max_depth, and so on, and they also greatly affect the outcome of feature selection (a sketch follows).
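A sketch of the Random Forest alternative with PySpark ML, using the yield threshold from the text as the label; the DataFrame name, the feature columns, and the parameter values are illustrative assumptions.

```python
# Sketch: Random Forest feature importance as an alternative to PLSR/VIP.
# `merged` (features + numeric yield) is an assumed DataFrame; parameters are illustrative.
import pyspark.sql.functions as func
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

labeled = merged.withColumn('label', (func.col('yield') > 70).cast('double'))

feature_cols = [c for c in labeled.columns if c.startswith('WAT')]
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')

rf = RandomForestClassifier(labelCol='label', featuresCol='features',
                            numTrees=100, maxDepth=5)
model = rf.fit(assembler.transform(labeled))

# Rank features by the importance the forest assigns to them.
importances = sorted(zip(feature_cols, model.featureImportances.toArray()),
                     key=lambda kv: kv[1], reverse=True)
print(importances[:10])
```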