Spark-big-data-project

In this project, I use several techniques to analyze TSMC wafer-fabrication data.
Abstract

Managing wafer yield is one of the most important tasks for semiconductor manufacturers, and a great deal of effort to improve it has been made in both industry and academia. Thanks to advances in IoT and data analytics, huge amounts of process operational data, such as process-parameter indices, equipment condition data, and historical manufacturing data, are collected and analyzed in real time. In this project I analyze wafer-manufacturing big data for troubleshooting and fault detection using PySpark, in several steps.

Outline

Wafer processing introduction

Spark and Hadoop introduction

Wafer processing Data introduction

Data analysis

  • Pearson Correlation

  • Box plot (by median gap)

  • PLSR

Conclusion

Wafer processing introduction

Semiconductor manufacturing is one of the most complex production processes: it involves hundreds of process steps, several kinds of wafers, many types of machinery, re-entrant flow, and innumerable process parameters, so completing the whole flow takes a few months. Since the process is also very sensitive to in-line variation, yield management is one of the most important issues, directly tied to a company's survival. Here I briefly introduce the wafer manufacturing process.

  • Wafer Manufacturing

    The process of fabricating silicon wafers according to the IC design, using masks and semiconductor processing machine tools.


  • Wafer Processing

    • Wafer fabrication

      The process is roughly divided into several stations, each of which corresponds to several stages. Each stage corresponds to a certain type of machine processing (the factory has several machines of each type). A given machine type is used in many stages, and every wafer must be processed in sequence from start to finish.

    • Wafer probe

      A wafer prober is a machine used to test integrated circuits. In this project, we focus on the parameters from the WAT (Wafer Acceptance Test).


Spark and Hadoop introduction

  • What is Spark and Hadoop?

    Hadoop is an extremely powerful tool for distributed, scalable, and economical data storage, processing, and analysis. Semiconductor manufacturing data is enormous, so I use Hadoop (HDFS) to store it and PySpark to analyze it.


  • Simple HDFS commands

    Copy file data.csv from local disk to the user’s directory in HDFS

    $ hdfs dfs -put data.csv data.csv

    Get a directory listing of the user’s home directory in HDFS

    $ hdfs dfs -ls

    Display the contents of the HDFS file /user/semiconductor/data.csv

    $ hdfs dfs -cat /user/semiconductor/data.csv
  • Set up the PySpark environment

    #!/usr/bin/env python
    # coding: utf-8
    
    # In[ ]:
    
    
    import os
    import sys
    
    #this part is used for pyspark submit
    os.environ['PYSPARK_SUBMIT_ARGS']='--verbose --master=yarn --queue test pyspark-shell'
    
    os.environ['JAVA_HOME']='/usr/lib/jvm/java-8-openjdk-amd64/'
    os.environ['YARN_CONF_DIR']='/etc/alternatives/hadoop-conf/'
    
    #this line is used for spark1.6
    #os.environ['SPARK_HOME']='/opt/cloudera/parcels/CDH/lib/spark'
    
    #this line is used for spark2.2
    os.environ['SPARK_HOME']='/opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2'
    
    # this line is used for python2.7
    #os.environ['PYSPARK_PYTHON']='/usr/bin/python'
    
    #this line is used for python3.5
    os.environ['PYSPARK_PYTHON']='/usr/bin/python3'
    
    spark_home = os.environ.get('SPARK_HOME', None)
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))  
    #execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
    exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())
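
  • Load data from HDFS

    Once the shell is initialized, the CSV uploaded earlier can be read into a Spark DataFrame. This is a minimal sketch, assuming the path /user/semiconductor/data.csv from the commands above and that spark is the SparkSession created by shell.py:

    # Read the CSV stored in HDFS into a Spark DataFrame.
    df = spark.read.csv('/user/semiconductor/data.csv',
                        header=True,        # first row holds the column names
                        inferSchema=True)   # let Spark infer numeric column types

    df.printSchema()   # inspect the inferred schema
    df.show(5)         # preview the first few rows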

Wafer processing Data introduction

Let's recall our mission again:

Find the relevant factors that affect yield; with that goal in mind, we can first delete some irrelevant data manually.

In the directory wat_data:

wat_data/Parameter_set and wat_data/Wat_root_cause (screenshots in the repository)

The aa column in Parameter_set contains arbitrary numbers, and so does the Range column. The other columns in Wat_root_cause are not really meaningful to the process either.

Here we move on to the two most important data sets.

In the directory wat_data/wat:

As we mentioned, WAT is a test key which affects yield.

wat.header and wat.first_row (screenshots in the repository)

In the directory FDC_data/stageXX:

SVID (Status Variables Identification) is the physical data collected by sensors embedded in the machines during the manufacturing process. To describe the physical nature of a given SVID, we usually transform it into Fault Detection and Classification (FDC) parameters using statistical indicators (a short sketch follows below).

stageXX.header and stageXX.row (screenshot in the repository)

From this screenshot we can see what the columns (toolid, chamberid, process, stage) represent.
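
To make the SVID-to-FDC transformation above concrete, here is a minimal PySpark sketch. It assumes a raw sensor DataFrame svid_df with one row per sample and columns waferid, svid, and value (hypothetical names; the repository's schema may differ); the FDC parameters are simple per-wafer, per-SVID statistical indicators:

    import pyspark.sql.functions as func

    # Assumed: svid_df has one row per sensor sample (waferid, svid, value).
    fdc_df = (svid_df
              .groupBy('waferid', 'svid')
              .agg(func.mean('value').alias('fdc_mean'),     # average reading
                   func.stddev('value').alias('fdc_std'),    # spread of the reading
                   func.max('value').alias('fdc_max'),
                   func.min('value').alias('fdc_min')))

    fdc_df.show(5)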

Data analysis

Step 1 - Pearson Correlation

Here is the code!!

  • Data preprocessing

    • Merge all the dataframes
    • Drop unnecessary columns
  • Pearson correlation

    • Compute the Pearson correlation matrix and print the top ten (a sketch follows the results below)
  • Draw correlation plot

    • Plot specific WAT columns against yield
  • Result

    Correlation matrix and top-ten WAT table (screenshots in the repository)
    Plots of the top WAT parameters against yield: WAT1036, WAT2985, WAT2848, WAT748, WAT517, WAT1477, WAT33, WAT2064, WAT2086 (screenshots in the repository)
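
A minimal sketch of this step, assuming a merged DataFrame df whose numeric columns are the WAT parameters plus a yield column (the actual column names and join keys in the repository may differ). The repository computes a full correlation matrix; this sketch only computes each parameter's correlation with yield, which is enough for the top-ten ranking:

    # Assumed: df is the merged DataFrame with numeric WAT columns and a 'yield' column.
    wat_cols = [c for c in df.columns if c.startswith('WAT')]

    # Pearson correlation of every WAT parameter against yield.
    corr_rows = [(c, df.stat.corr(c, 'yield')) for c in wat_cols]

    # Keep the ten parameters with the largest absolute correlation.
    top_ten = sorted(corr_rows, key=lambda kv: abs(kv[1]), reverse=True)[:10]
    for name, corr in top_ten:
        print(name, corr)

    # Plot one of the top parameters (e.g. WAT1036 from the results above) against yield,
    # collecting only the two needed columns to the driver.
    pdf = df.select('WAT1036', 'yield').dropna().toPandas()
    pdf.plot.scatter(x='WAT1036', y='yield')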

Step 2 - Box plot (by median gap)

Here is the code!!

  • Data preprocessing

    • Data type transformation
    • Merge all the dataframes
    • Drop unnecessary columns and rename the columns
  • Compute median

    • Compute the median within each process stage first
    • Then compute the median for each toolid
    • Compute the median gap (a sketch follows the results below)
    • from pyspark.sql.window import Window and import pyspark.sql.functions as func are two useful tools in PySpark
  • Draw box plot

    • We can observe (or compute) the worst toolid in each stage, and we list the top five toolid in the specific process stages that influence yield the most.
  • Result

    Top five toolid (screenshot in the repository)
    Box plots for Stage209, Stage102, Stage200, Stage207, and Stage95 (screenshots in the repository)
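
A minimal sketch of the median-gap computation, assuming a merged DataFrame df with columns process_stage, toolid, and yield (hypothetical names). The repository's code relies on Window and pyspark.sql.functions as noted above; for clarity this sketch uses an equivalent groupBy/join, with percentile_approx standing in for the median since Spark has no exact median aggregate:

    import pyspark.sql.functions as func

    # Approximate median yield per process stage and per (stage, toolid) pair.
    stage_median = (df.groupBy('process_stage')
                      .agg(func.expr('percentile_approx(`yield`, 0.5)').alias('stage_median')))
    tool_median = (df.groupBy('process_stage', 'toolid')
                     .agg(func.expr('percentile_approx(`yield`, 0.5)').alias('tool_median')))

    # Gap between each tool's median yield and its stage's median yield.
    gap = (tool_median.join(stage_median, on='process_stage')
                      .withColumn('median_gap',
                                  func.col('tool_median') - func.col('stage_median')))

    # Tools whose wafers sit furthest below the stage median are the prime suspects.
    gap.orderBy('median_gap').show(5)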

Step 3 - PLSR

Here is the code!! After the previous steps, we can concentrate on specific stages and toolid for further analysis; with luck, we can pinpoint precisely which SVID steps affect the yield.

  • Data preprocessing

    • Data type transformation
    • Merge all the dataframes (Stage2_SVIDX_StepX ~ Stage300_SVIDX_StepX, waferid, yield)
    • Drop unnecessary columns and duplicated headers
    • Drop rows with missing values (NaN)
  • Using PLSR

    • PLSR (partial least squares regression) is a linear regression model with multiple input variables X and multiple output variables Y
    • We first project the inputs X and outputs Y onto latent components (an axis rotation similar to PCA) and then estimate the linear-regression coefficients in that latent space
    • Show the VIP (variable importance in projection) score of each SVID step (a sketch follows the results below)
  • Result

    Top 20 VIP scores of the SVID steps (screenshot in the repository)
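
A minimal sketch of the PLSR and VIP-score computation, assuming the merged SVID-step features have been collected to the driver as a pandas DataFrame pdf with a wafer_yield target column (hypothetical names), and using scikit-learn's PLSRegression; the VIP formula below is the standard one and may differ in detail from the repository's implementation:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    # Assumed: pdf holds the SVID-step features plus a 'wafer_yield' column.
    X = pdf.drop(columns=['wafer_yield']).values
    y = pdf['wafer_yield'].values

    pls = PLSRegression(n_components=5)   # number of latent components is a tuning choice
    pls.fit(X, y)

    # Standard VIP (variable importance in projection) scores.
    T = pls.x_scores_    # latent scores,  shape (n_samples, n_components)
    W = pls.x_weights_   # X weights,      shape (n_features, n_components)
    Q = pls.y_loadings_  # Y loadings,     shape (n_targets, n_components)

    p = W.shape[0]
    ssy = np.sum(T ** 2, axis=0) * np.sum(Q ** 2, axis=0)   # y-variance explained per component
    wnorm = W / np.linalg.norm(W, axis=0)                   # normalized weight vectors
    vip = np.sqrt(p * (wnorm ** 2 @ ssy) / ssy.sum())

    # Rank the SVID steps by VIP score and print the top 20.
    cols = pdf.drop(columns=['wafer_yield']).columns
    for name, score in sorted(zip(cols, vip), key=lambda kv: -kv[1])[:20]:
        print(name, round(score, 3))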

Conclusion

I propose a yield-management approach that can cope with the complexity of the semiconductor manufacturing process and detect where faults occur.

Why do I use PLSR instead of PCA?

  • With PCA it is impossible to map the compressed features back to the original features, which is what accurate troubleshooting requires.

Why don't I use Random Forest?

We know that Random Forest also provides feature importance, which tells us which features are chosen as root nodes (the most important ones).

But!

  • We must first turn the data into classes, e.g. yield > 70 → 1, yield ≤ 70 → 0, but the features chosen for the root depend heavily on how we define those classes.
  • There are many hyperparameters in model training, such as num_trees and max_depth, and they also greatly affect the outcome of feature selection.

The result is unstable and the process is time-consuming.
