Quality of geographic data - Detection of outliers and imputation of missing values: PhD thesis of Carlos López

November 1997, Royal Institute of Technology - Centre for Geoinformatics, Stockholm, Sweden

In typical Geographic Information System (GIS) applications, data usually comes from a wide range of providers. Such data is of variable quality, and the end user typically has limited access, if any, to the original source. Among other problems, these datasets may have missing values and may also be affected by outliers. Missing values are common in tabular datasets (population censuses, meteorological records, etc.), and the end user is forced to apply some methodology to fill the gaps. The data producer cannot recover the missing values and typically does not assign or suggest alternative ones. Outliers may arise from careless measurements, instrument malfunction, wrong data processing routines, etc. Current systems give little help to the end user, whereas the data producer might go back and make another reading, or check the original records if available.

This thesis is concerned with the development and testing of tools intended for two purposes: a) given some dataset, to point out dubious values, and b) to suggest a procedure for assigning suitable values to those in doubt or missing. The algorithms were designed to be useful to end users as well as data producers.

Only some of the data types usually found in GIS applications have been analyzed, namely tabular categorical data, tabular quantitative data and raster quantitative data. For all of them we suggested new methods and made extensive comparisons with traditional alternatives.

For the problem of outlier detection we applied a number of known and new techniques to tabular quantitative data. The examples are from daily precipitation and hourly surface wind records. For raster quantitative datasets we developed and analyzed a new general method suitable for detecting outliers, using Digital Elevation Models (DEM) as an example. Tabular categorical data (e.g. census data, opinion polls, economic surveys) is also extensively used in GIS applications. Unfortunately, the procedure developed for it cannot be applied to other categorical data typically available in GIS (such as a geological or land-use map). For the missing value problem we treat only the case of quantitative tabular data. Most of the methods considered are general purpose and can be regarded as independent of the dataset. They can be used by the end user as well as the data producer. All the experiments were carried out using MATLAB on UNIX workstations.

KEY WORDS: outliers, blunders, missing values, precipitation, wind, digital elevation models, DEM, categorical data, error model, Geographic Information Systems, GIS.
The complete thesis (plus papers) in PDF format (4.4 MB)

Individual papers:

  1. "Análisis por componentes principales de datos pluviométricos. a) Aplicación a la detección de datos anómalos" López, C., González, E. and Goyret, Estadística (1994) 46,146,147, pp. 25-54 (English translation also available!)
  2. "Análisis por componentes principales de datos pluviométricos. b) Aplicación a la eliminación de ausencias" López, C., González, J. F. and Curbelo, R., Estadística (1994) 46,146,147, pp. 25-54 (English translation also available!)
  3. "Application of Artificial Neural Networks to the prediction of missing daily precipitation records, and comparison against linear methodologies" López, C.; In Proceedings of the International Conference on Engineering Applications of Neural Networks, Stockholm, Sweden, 16-18 June, 1997, edited by A. B. Bulsari and S. Kallio, 337-340
  4. "Locating some types of random errors in Digital Terrain Models" López, C. International Journal of Geographic Information Science 1997, 11, 7, 677-689
  5. "On the improving of height accuracy of Digital Terrain Models: a comparison of some error detection procedures" López, C.; Presented at the Sixth Scandinavian Research Conference on Geographical Information Systems. Stockholm, Sweden, 1-3 June, 1997
  6. "Improvements over the duplicate performance method for outlier detection in categorical multivariate surveys" López, C. To appear in the Journal of the Italian Statistical Society
  7. "A general purpose procedure for locating outliers in multivariate time series: Application to an hourly wind dataset" López, C., Kaplan, E. To be submitted
  8. "A new technique for imputation of multivariate time series: Application to an hourly wind dataset" López, C., Kaplan, E. In Proceedings of the Tenth Brazilian Meteorological Conference. Brasilia, Brazil 26-30 October, 1998
  9. "An error model for daily rain records" López, C. In Proceedings of the Tenth Brazilian Meteorological Conference. Brasilia, Brazil 26-30 October, 1998