"Improvements over the duplicate performance method for outlier detection in categorical multivariate surveys"

Carlos López

Ingenieros Consultores Asociados Cerro Largo 1321

Montevideo, URUGUAY


The duplicate performance method is commonly used to detect errors (outliers included) in categorical multivariate data. It typically involves typing the same data twice, with no particular ordering of the records. If the errors are uniformly distributed among individuals, retyping a given fraction of the records will typically remove about the same fraction of the errors. This paper presents a new method that improves on that procedure by sorting the records so that the most unlikely ones come first. The method was tested by Monte Carlo simulation on an existing database of categorical answers on housing characteristics in Uruguay: the database was first randomly contaminated, and the proposed procedure was then applied. The results show that a partial retyping that follows the proposed order can remove about 50% of the errors while keeping the retyping effort between 4% and 14% of the dataset, whereas the standard methodology requires processing 50% of the database (on average) to attain a similar result. The new ordering is based on the unrotated Principal Component Analysis (PCA) transformation of the previously coded data. No special shape of the multivariate distribution function is assumed or required.
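To make the idea of a PCA-based retyping order concrete, the following Python sketch ranks numerically coded records from most to least unusual. It is an illustration under stated assumptions, not the procedure of the paper: the abstract does not give the scoring rule, so the score used here (squared principal component scores divided by their variances, i.e. a Mahalanobis-type distance) is an assumption, and the names pca_outlier_order and errors_caught are hypothetical.

```python
import numpy as np

def pca_outlier_order(coded, eps=1e-12):
    """Return record indices sorted from most to least unusual.

    `coded` holds one row per record and one numerically coded
    categorical variable per column. The scoring rule is an
    assumption made for illustration only.
    """
    X = np.asarray(coded, dtype=float)
    # Center and scale each column; constant columns are left at zero.
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd < eps] = 1.0
    Z = (X - mu) / sd
    # Unrotated PCA through the SVD of the standardized data.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigvals = s**2 / (len(Z) - 1)
    keep = eigvals > eps  # drop numerically null components
    scores = Z @ Vt[keep].T
    # Assumed score: squared component scores weighted by 1/variance.
    dist = (scores**2 / eigvals[keep]).sum(axis=1)
    return np.argsort(-dist, kind="stable")

def errors_caught(order, error_idx, fraction):
    """Share of known errors found when retyping the first `fraction` of `order`."""
    n_retyped = int(round(fraction * len(order)))
    retyped = set(order[:n_retyped].tolist())
    return len(retyped & set(error_idx)) / len(error_idx)

# Toy data: two coded answers that normally agree, plus one
# inconsistent record at index 20 standing in for a typing error.
records = [[1, 1]] * 10 + [[2, 2]] * 10 + [[1, 2]]
order = pca_outlier_order(records)
print(order[:3])                         # first records proposed for retyping
print(errors_caught(order, [20], 0.10))  # share of errors in the first 10%
```

In this toy case the inconsistent record is ranked first, so retyping only the leading records would already reach it. In a Monte Carlo test of the kind described above, errors_caught would be evaluated over many randomly contaminated copies of the database.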

Published in:

Journal of the Italian Statistical Society, 1996, 5, 2, 211-228

If you are still interested, the technical report is available as PostScript (.PS, 410 KB) or as PDF (322 KB).