Look for Duplicates
Objectives
The Look for Duplicates functionality is designed to find duplicated samples, meaning samples at the same location (or very close to each other). These samples may cause problems during the kriging matrix inversion.
Two samples are said to be duplicates if, for a given variable, their values are not undefined and the distance between the two samples is less than a given value. One sample can have more than one duplicate and the duplicates related to a single sample constitute a group.
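The definition above can be illustrated with a minimal Python sketch. It is not the software's implementation: the function name `duplicate_pairs`, the use of `None` for an undefined value, and the brute-force pairwise loop are all assumptions made for illustration.

```python
import math
from itertools import combinations

def duplicate_pairs(coords, values, min_distance):
    """Return index pairs (i, j) that are duplicates under the definition above.

    Hypothetical helper: `None` stands in for an undefined value.
    """
    pairs = []
    for i, j in combinations(range(len(coords)), 2):
        # Both samples need a defined value for the variable of interest.
        if values[i] is None or values[j] is None:
            continue
        # Duplicates are strictly closer than the chosen distance.
        if math.dist(coords[i], coords[j]) < min_distance:
            pairs.append((i, j))
    return pairs

coords = [(0.0, 0.0), (0.0, 0.5), (10.0, 10.0), (0.3, 0.0)]
values = [1.2, 1.4, 3.0, None]  # the last sample has an undefined value
pairs = duplicate_pairs(coords, values, min_distance=1.0)
```

Here only samples 0 and 1 form a duplicate pair: sample 3 is close to both, but its value is undefined, so it is ignored.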
Interface
- Input:
- Data table: Choice of the data table on which the Look for Duplicates task will be applied.
- Selection: Choice of the selection (a restriction of the Input Data table) on which the Look for Duplicates task will be applied.
- Minimum Distance between two samples: Enter the distance threshold. Two samples closer to each other than this distance are considered duplicates, and only one of them is kept.
- Algorithm: Two algorithms are available to find the duplicates.
- Sampling algorithm: The algorithm starts with the first sample. It creates a cluster containing that sample and all the other points located within a circle (or sphere) centered on it, whose radius is the minimum distance chosen by the user. It then moves to the second sample and applies the same process.
Points already included in a previous cluster are not processed again. A point that could belong to two different clusters is assigned only to the first one created.
This is the recommended algorithm, as it is the fastest.
Its main limitation is memory usage, which may become a problem with very large data sets.
- Declustering algorithm: The algorithm starts with the first sample. It then calculates the distance between each pair of samples and compares it to the chosen minimum distance. If the distance is smaller, the point is added to the cluster. The cluster keeps growing until no more samples can be reached.
This algorithm can be slow, since it calculates all the distances between the points.
A problem may occur when the data samples are highly connected: the algorithm may then group all the points into a single cluster.
In the following example, there are two clusters located at the extremities of the data set, but they are linked by some intermediate samples. If the chosen minimum distance is large enough to reach a point along the trajectory, all the points end up in a single large cluster.
However, if the minimum distance is small and does not reach the trajectory points, the intermediate points are not included in the first cluster and the result is two distinct clusters, one at each extremity.
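To make the difference between the two algorithms concrete, here is a minimal Python sketch of both, written from the descriptions above. It is an interpretation, not the software's actual code: the function names, the strict `<` comparison and the brute-force distance loops are assumptions, and the sample coordinates are invented to mimic two groups linked by intermediate points.

```python
import math

def sampling_clusters(coords, min_distance):
    """Each unassigned sample seeds a cluster that absorbs every still-unassigned
    sample within min_distance of the seed (assumed reading of the text)."""
    assigned = [False] * len(coords)
    clusters = []
    for seed in range(len(coords)):
        if assigned[seed]:
            continue  # already in an earlier cluster, not processed again
        assigned[seed] = True
        members = [seed]
        # Samples before the seed are all assigned already, so look ahead only.
        for other in range(seed + 1, len(coords)):
            if not assigned[other] and math.dist(coords[seed], coords[other]) < min_distance:
                assigned[other] = True
                members.append(other)
        clusters.append(members)
    return clusters

def declustering_clusters(coords, min_distance):
    """Grow each cluster through chains of close neighbours until nothing
    more can be reached (assumed reading of the text)."""
    n = len(coords)
    visited = [False] * n
    clusters = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        members, frontier = [], [start]
        while frontier:
            current = frontier.pop()
            members.append(current)
            for other in range(n):
                if not visited[other] and math.dist(coords[current], coords[other]) < min_distance:
                    visited[other] = True
                    frontier.append(other)
        clusters.append(sorted(members))
    return clusters

# Hypothetical data: tight groups near x = 0 and x = 4, joined by intermediate samples.
chain = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (3.8, 0.0), (4.0, 0.0)]

sampling = sampling_clusters(chain, min_distance=1.1)
declustering_large = declustering_clusters(chain, min_distance=1.1)
declustering_small = declustering_clusters(chain, min_distance=0.5)
```

With a distance of 1.1, the sampling sketch yields three clusters, while the declustering sketch chains every sample into a single cluster. With a distance of 0.5, the declustering sketch leaves the intermediate samples on their own and finds one group of duplicates at each extremity.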
- Selection option: To choose which sample is kept among the duplicates, the following options are available:
- Keep the smallest sample number: the sample that is defined first in the input file is kept.
- Keep the highest sample number: the sample that is defined last in the input file is kept.
- Keep the minimum value: the sample with the lowest value of the chosen variable is kept.
- Keep the maximum value: the sample with the greatest value of the chosen variable is kept.
- Variable for Minimum/Maximum: Choice of the variable that will be used if the Selection option "Keep the minimum value" or "Keep the maximum value" has been selected.
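The four selection options can be sketched as a small Python function applied to one group of duplicates. The option strings and the function name `keep_sample` are hypothetical labels chosen for this example; they are not identifiers used by the software.

```python
def keep_sample(group, values, option):
    """Pick the sample number to keep from one group of duplicates.

    `group` holds sample numbers; `values` maps a sample number to the value
    of the variable chosen for the minimum/maximum options.
    """
    if option == "smallest_number":
        return min(group)
    if option == "highest_number":
        return max(group)
    if option == "minimum_value":
        return min(group, key=lambda i: values[i])
    if option == "maximum_value":
        return max(group, key=lambda i: values[i])
    raise ValueError(f"unknown option: {option}")

group = [2, 5, 7]
values = {2: 4.1, 5: 0.9, 7: 6.3}

kept = {opt: keep_sample(group, values, opt)
        for opt in ("smallest_number", "highest_number", "minimum_value", "maximum_value")}
```

In this example the minimum-value option keeps sample 5, whereas the highest-number and maximum-value options both happen to keep sample 7.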
- Output:
- Mode:
- Save Selection: a selection is created in the input data table, containing only the selected samples (without the duplicated samples).
- Extract Selected Points: a new file is created containing only the selected samples (without the duplicated samples).
- File: the name of the new file
- Compute Cluster's Center of Gravity: if activated, the output samples will be located at the gravity center of the group they belong to. If not activated, the output samples keep the same coordinates as the selected samples.
- Variables and Statistics Mode: lets you compute statistics (minimum, maximum, mean and migration) for several variables.
- Copy Other Variables: if activated, all the variables in the file are copied.
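As an illustration of the center of gravity and statistics outputs, the sketch below computes them for one group of duplicates. It assumes the center of gravity is the unweighted mean of the members' coordinates, which the text does not state explicitly. The migration statistic is left out because it is not defined here, and the function names are hypothetical.

```python
from statistics import mean

def gravity_center(group, coords):
    """Unweighted mean of the coordinates of the group members (assumption)."""
    dims = len(coords[group[0]])
    return tuple(mean(coords[i][d] for i in group) for d in range(dims))

def group_statistics(group, values):
    """Minimum, maximum and mean of one variable over the group."""
    vals = [values[i] for i in group]
    return {"minimum": min(vals), "maximum": max(vals), "mean": mean(vals)}

coords = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
values = [2.0, 4.0, 6.0]
group = [0, 1, 2]

center = gravity_center(group, coords)
stats = group_statistics(group, values)
```

When the center of gravity option is off, the output sample would instead keep the coordinates of whichever sample the selection option retained.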