PublishedPaper 2020-APCSM MachineLearning
Abstract: In semiconductor manufacturing, a low defect rate for manufactured integrated circuits is crucial. To minimize outgoing device defectivity, thousands of electrical tests are run, measuring tens of thousands of parameters, with die falling outside specified parameter limits considered failures. However,
conventional test techniques often fall short of guaranteeing acceptable quality levels. Given the large
number of electrical tests, it can be difficult to determine which electrical test to rely upon for die quality
screening. To address these issues, semiconductor companies have recently begun leveraging artificial
intelligence and machine learning to better identify defective devices while minimizing the fallout of good
die from electrical tests. To implement these advanced machine learning applications, a novel remote
inference capability is also proposed. By placing an inference engine and corresponding machine
learning models at the assembly and test house, inferences can be made without any sensitive data
leaving the assembly and test house. The result is faster turnaround times on inferences, reduced data
loss, increased security, and the enablement of advanced machine learning capabilities for real-time
solutions such as adaptive testing.
Motivation
Conventional test techniques, however, often fall short of guaranteeing acceptable quality levels. In today's advanced semiconductor environment, thousands of electrical tests
are performed to determine the quality of the die. Given this large number of electrical tests, it can be
difficult to determine which electrical test to rely upon for die quality screening. Furthermore, many conventional screening techniques assume a normal distribution of the electrical test measurement results across a die population. Finally, device quality may not be a function of any single electrical test result, but could be multivariate in nature.
To address the issues of a high-dimensional test measurement space, the possibly non-normal distribution of electrical tests, and the multi-order interactions of individual electrical test measurements, fabless semiconductor manufacturers and integrated device manufacturers (IDMs) have recently begun
leveraging artificial intelligence and machine learning to better identify defective devices while minimizing
the fallout of good die from electrical tests. The end result is a reduction in testing costs and improved
product quality. In this publication, we will explore the application of modern machine learning techniques
to predict devices at risk while concurrently expediting and/or reducing tests for die with little risk of
defectivity. By employing more advanced, multivariate outlier screening techniques powered by machine
learning, defective chips can be identified more efficiently with less fallout. Additionally, the residual cost
of test can be invested to more thoroughly screen devices exhibiting marginal quality to increase overall
outgoing quality.
To implement these advanced machine learning applications, a novel remote inference capability is also
proposed. By placing an inference engine and corresponding machine learning models at the assembly
and test house, inferences can be made without any data leaving the assembly and test house, where
much of the data used to make inferences already resides. Furthermore, machine learning models that
contain sensitive intellectual property remain secure within the assembly and test house. The result is
faster turnaround times on inferences, reduced data loss, increased security, and the enablement of
advanced machine learning capabilities for real-time solutions such as adaptive testing.
Approach
In order to implement the pass/fail classification of chips correctly, it would be ideal to generate a multiclass classifier for each of the failure types. However, there usually are not enough training samples for each type of failure, as in the case of failed field returns, or RMAs. These types of situations are well-suited for a branch of machine learning called Anomaly Detection, which defines a boundary of what is normal and treats anything outside of this boundary as abnormal.
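As a minimal sketch of this framing, the example below fits an anomaly-detection boundary on parametric measurements from known-good die and flags new die that fall outside it. The data shapes and the choice of scikit-learn's IsolationForest are illustrative assumptions, not the specific model used in this work.

```python
# Minimal anomaly-detection sketch: learn a boundary of "normal" parametric
# behavior, then flag die outside that boundary. Data is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 20))  # measurements from known-good die
X_new = rng.normal(size=(200, 20))     # die to be screened

# contamination sets the expected outlier fraction, i.e., the overkill budget.
model = IsolationForest(contamination=0.01, random_state=0).fit(X_train)
flags = model.predict(X_new)           # +1 = normal, -1 = outside the boundary
print(f"{(flags == -1).sum()} of {len(X_new)} die flagged as anomalous")
```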
Many univariate outlier screening techniques, such as PAT, are used today in the semiconductor industry
for the purpose of outlier screening and can be considered within the field of Anomaly Detection. Some
multivariate Anomaly Detection techniques already exist in the semiconductor industry to find outliers in
wafer sort data [3]. However, these techniques typically use a principal component analysis (PCA) to
transform the measurement parameters into a reduced set of new parameters with removed correlations.
The same univariate method is then used to find outliers. A limitation of PCA is that it can only remove the
linear dependence between parameters.
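The PCA-plus-univariate flow described above can be sketched as follows. The parameter counts, component count, and the 6-sigma robust threshold are assumptions for illustration; the per-component screen follows the DPAT-style robust-sigma idea.

```python
# Sketch of PCA-based screening: decorrelate the parameters (linearly), then
# apply a robust univariate sigma screen to each principal component.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(2000, 50))  # raw test parameters

# Transform to a reduced set of decorrelated components.
Z = PCA(n_components=10).fit_transform(StandardScaler().fit_transform(X))

# Flag any die beyond k robust sigmas of the median on any component.
k = 6.0
med = np.median(Z, axis=0)
sigma = 1.4826 * np.median(np.abs(Z - med), axis=0)   # MAD-based robust sigma
outliers = np.any(np.abs(Z - med) > k * sigma, axis=1)
print(f"{outliers.sum()} of {len(Z)} die flagged")
```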
Multivariate Anomaly Detection defines normal ranges, while allowing for correlated multimodal
distributions for normal chips. For Multivariate Anomaly Detection to work well, it is optimal to rely on
selected features that have the most predictive power. There are a number of ways to accomplish this
input parameter selection. One technique is to employ univariate feature selection using a failure label (e.g., a field return or burn-in failure). By doing so, we isolate the measurements that are most critical to predicting a failure.
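A minimal sketch of this label-driven feature selection, assuming scikit-learn's SelectKBest with an ANOVA F-score, is shown below; the data shapes, failure rate, and k are placeholder assumptions.

```python
# Univariate feature selection: score each test parameter independently
# against the failure label and keep the top k. Data is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 1000))          # wafer-sort parameters (paper: ~10,000)
y = (rng.random(5000) < 0.01).astype(int)  # 1 = labeled failure, 0 = good die

selector = SelectKBest(score_func=f_classif, k=50).fit(X, y)
keep = selector.get_support(indices=True)  # indices of the most predictive tests
print(f"kept {len(keep)} of {X.shape[1]} parameters")
```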
It can also be important to choose methods that accommodate non-Gaussian distributions of the parameter population. These methods can find outliers that are not visible in a univariate analysis, e.g., for non-linearly dependent features.
It should be noted that output from univariate Anomaly Detection methods can be used as input to multivariate approaches (rather than just the raw test parameter values). Table 1 describes the proposed Multivariate Anomaly Detection techniques, summarizing the pros and cons of each.
A technique is proposed which identifies chips that are considered higher risk for failure by the machine learning model compared to typical die on the same spatial locations. The basic algorithm approach, herein referred to as Modeled Yield, is to develop two sets of models: 1) models that predict die yield from spatial information only (e.g., the die's location on the wafer), and 2) models that predict die yield from both spatial information and parametric test values.
An ensemble of these two models identifies which die are likely defective, or low yielding. Yielding die
with low predicted yield are identified as likely candidates for early lifetime failure. If the predicted yield
considering only spatial information is high while the predicted yield including parametric values is low,
the die are extremely likely to become failures. Experience has shown that die with low predicted yield
can be an order of magnitude more likely to become field returns. If the spatial only model predicts that
the die should be high yielding, the increased likelihood of a field return increases to nearly two orders of
magnitude.
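The ensemble logic described above can be sketched as follows. The features, regressor choice, yield targets, and thresholds are all illustrative assumptions; the point is the comparison between the spatial-only prediction and the prediction that also includes parametric values.

```python
# Sketch of the Modeled Yield ensemble: one yield model from spatial features
# only, one from spatial plus parametric features. Data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X_spatial = rng.normal(size=(3000, 3))       # e.g., die x/y and wafer radius
X_param = rng.normal(size=(3000, 40))        # selected parametric measurements
y_yield = rng.uniform(0.8, 1.0, size=3000)   # local-neighborhood yield target

spatial_model = GradientBoostingRegressor().fit(X_spatial, y_yield)
X_full = np.hstack([X_spatial, X_param])
full_model = GradientBoostingRegressor().fit(X_full, y_yield)

pred_spatial = spatial_model.predict(X_spatial)
pred_full = full_model.predict(X_full)

# Die predicted to yield well spatially but poorly once parametrics are
# included are the strongest early-lifetime-failure candidates.
at_risk = (pred_spatial > 0.95) & (pred_full < 0.85)
print(f"{at_risk.sum()} die flagged as at-risk")
```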
Results
The above-mentioned multivariate anomaly detection and modeled yield techniques were evaluated using
an actual production dataset which contained roughly 20,000 total chips of which roughly 50 chips were
field returns (not simulated). To predict field returns, input parameters were obtained from a wafer sort
test insertion. There were roughly 10,000 raw input parameters. As a baseline to compare against, the
industry standard DPAT outlier screening technique was used.
The comparison metrics are False Positive Rate (FPR) and True Positive Rate (TPR). FPR and TPR are
common machine learning metrics and are calculated as follows:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

where Positive = field-returned chip and Negative = good die. Thus TPR is the percentage of all actual field returns that are correctly predicted as defective, and FPR is the percentage of good die that are incorrectly predicted as defective. FPR can be interpreted as the amount of "overkill", i.e., the good-die population sacrificed to screen defective die at a given defect capture rate.
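A small worked example of these definitions, using scikit-learn's confusion matrix on made-up labels:

```python
# Worked TPR/FPR example. Labels: 1 = field return, 0 = good die.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 0, 1])  # 1 = predicted defective

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # fraction of actual field returns caught
fpr = fp / (fp + tn)  # fraction of good die sacrificed ("overkill")
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.67, FPR = 0.20
```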
To demonstrate the true production decision making process, the validation methodology shown in Fig. 1
was employed to train and test each of the multivariate anomaly detection techniques.
Figure 1 Validation approach to evaluate the multivariate anomaly detection techniques. A sliding window
of training and testing data, partitioned by time, is applied to the entire dataset to simulate a real
production scenario. nTrain = 20 timestamps and nTest = 5 timestamps, where each timestamp contained a virtually equivalent die count.
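A minimal sketch of this time-partitioned sliding window, assuming the nTrain/nTest settings from Fig. 1 and a placeholder sequence of timestamps:

```python
# Sliding-window validation: a contiguous train window followed by a test
# window, advanced through time to mimic production deployment.
import numpy as np

n_train, n_test = 20, 5
timestamps = np.arange(100)  # ordered production timestamps (placeholder)

for start in range(0, len(timestamps) - n_train - n_test + 1, n_test):
    train_ts = timestamps[start : start + n_train]
    test_ts = timestamps[start + n_train : start + n_train + n_test]
    # Fit the anomaly model on die from train_ts, evaluate on die from test_ts.
    print(f"train {train_ts[0]}-{train_ts[-1]}, test {test_ts[0]}-{test_ts[-1]}")
```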
For Modeled Yield, there are no training/testing partitions, as this approach does not rely on a rank ordering or selection of input parameters determined by a correlation to failures.
For the baseline DPAT comparison, there were 38 total significant parameters identified by the multivariate approach in each sliding window over the duration of the dataset. The recurrence of each significant parameter across the duration of the dataset was calculated, and those parameters that appeared at least 25% of the time were selected. The 25% threshold was chosen so that there would be a substantial number of parameters on which to run DPAT.
Results for all the testing windows are shown in Table 2 for FPR = 1%, 2%, 4%, and 6%.
Results in Table 2 show that the MV-1, MV-4, and Modeled Yield techniques are all able to capture more field returns than the DPAT baseline at equivalent levels of overkill (the portion of the total chip population identified as at-risk). Overall, the
Modeled Yield technique performed the best for this dataset. This would imply that for this dataset, yield
at wafer sort is a good predictor for field returns. It is important to note that this is not always the case, as
it has been observed that for some datasets Modeled Yield does not outperform the other multivariate
techniques. Additionally, we have observed that different Multivariate Anomaly Detection techniques
outperform others for different datasets and/or chip product lines.
Deployment
Once perfected and trained, machine learning models must be deployed to and integrated with the overall
product manufacturing flow. This likely requires deployment of prediction models to multiple remote
facilities in today’s distributed manufacturing ecosystem.
The term “Edge Prediction” as used in this paper refers to deployment of machine learning to facilities
where production test and assembly operations are performed. Distributed machine learning requires
reliable mechanisms to transport and update prediction models, compute infrastructure at remote
facilities, timely access to test data, and potentially, integration with factory process automation and
control systems.
Use Cases
Deployment Challenges
There are numerous considerations and related challenges when contemplating the effective deployment
of machine learning.
- What are the timing constraints for prediction?
- How will the necessary input data be collected, merged and sourced to the prediction model?
- What are the confidentiality and security requirements for the prediction model?
- When can and should predictions be run?
3) Model Confidentiality
There are several potential platforms upon which machine learning computations can be based. Some
of these are inherently more opaque than others. For instance, a model compiled from C-
language will be binary in nature and would require significant effort to reverse engineer.
Other machine learning platforms are, by default, much less opaque. For example, the greater Python data science ecosystem is perhaps the most popular machine learning platform, and given Python's open-source heritage, it is difficult to produce an intractably opaque executable model. For device manufacturers concerned with the confidentiality of
their prediction models, deployment must also include forms of strong code obfuscation,
encryption and/or server security.
4) Prediction Timing
The timing of real-time predictions is rather obvious: compute as the data is produced by the test system. Timing the execution of post-process predictions is not as straightforward. Post-process computations are usually needed at manufacturing operation boundaries. Examples include when a wafer test is complete, after partial wafer tests have been merged, when a wafer lot has completed the wafer sort test operation or visual inspection operations, or before the start of the assembly operation. Simple observation of incoming data does not
provide a reliable trigger for when prediction models should be executed. Machine learning
can be thought of as a virtual test operation with associated yield loss, and therefore, requires
integration with the overall manufacturing flow much like physical test operations. This
implies that a successful edge prediction deployment must also have integration points with
the test and assembly facility manufacturing execution system (MES).
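As a hedged sketch of this MES integration point, the snippet below triggers a post-process prediction from an operation-boundary event rather than from incoming data alone. The MES event names, payload fields, and run_prediction() are hypothetical placeholders, not a real MES API.

```python
# Sketch: dispatch edge predictions at manufacturing operation boundaries,
# driven by (hypothetical) MES events rather than raw data arrival.
def on_mes_event(event: dict) -> None:
    """Run a prediction when the MES signals an operation boundary."""
    triggers = {"WAFER_SORT_COMPLETE", "LOT_MERGE_COMPLETE", "PRE_ASSEMBLY_START"}
    if event.get("type") in triggers:
        run_prediction(lot_id=event["lot_id"], operation=event["type"])

def run_prediction(lot_id: str, operation: str) -> None:
    # Placeholder: load the deployed model, gather the lot's test data from
    # local storage at the OSAT, and record dispositions back to the MES.
    print(f"running edge prediction for lot {lot_id} at {operation}")

# Example: the MES notifies us that a wafer lot finished wafer sort.
on_mes_event({"type": "WAFER_SORT_COMPLETE", "lot_id": "LOT1234"})
```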
Conclusion
In this publication, we have demonstrated the possible benefits of employing more advanced, multivariate
outlier screening techniques powered by machine learning. In order to make multivariate screening work
well, prescreening of test measurements is required to reduce the noise. We have also demonstrated a second approach, Modeled Yield, based on creating a proxy for the target variable when the target variable itself is difficult to obtain. By using these approaches, the residual cost of test can be re-invested to more thoroughly
screen devices exhibiting marginal quality to increase overall outgoing quality.
Additionally, this paper discussed an approach for implementing a novel remote inference engine.
Inferences can be made without any data leaving the assembly and test house by placing an inference
engine and corresponding machine learning models at the OSAT. This approach ensures faster predictions by reducing unnecessary data transfer over the internet. Furthermore, sensitive intellectual
property including test data and machine learning models remain secure within the assembly and test
house.
Acknowledgment
The authors of this paper would like to thank Kenneth Harris of PDF Solutions, Dennis Ciplickas of PDF
Solutions, Hanna Bullata of Progineer, and Rauf Afifi of Progineer for their contributions and support in
making the findings in this paper possible.
References
[1] Manuel J. Moreno-Lizaranzu and Federico Cuesta, Sensors, vol. 13, no. 10, pp. 13521-13542, 2013; doi:10.3390/s131013521
[2] Automotive Electronics Council, AEC-Q001.
[3] Jeff Tikkanen, Nik Sumikawa, Li-C. Wang, and Magdy S. Abadir, "Multivariate Outlier Modeling for Capturing Customer Returns – How Simple It Can Be," 2014. http://mtv.ece.ucsb.edu/licwang/PDF/2014-IOLTS.pdf