# How we evaluate the performance of our classification software

People often wonder what the success rate (or the accuracy) of our automated classification software is. Unfortunately, there is no straightforward answer to that question. In this article, we describe what we believe is the most rigorous way to assess classification performance, and how some of our clients choose to complement the quantitative indices with evaluations of their own.

## Performance metrics

The two most common performance metrics in a classification context are precision and recall. Precision $$P$$ is the fraction of actually relevant instances among the instances the algorithm proposed (“is the algorithm raising too many false alarms?”), while recall $$R$$ is the fraction of the relevant instances present on the scene that have been retrieved (“is the algorithm missing objects of interest?”).

For a given class, let $$T_p$$, $$F_p$$, $$T_n$$ and $$F_n$$ denote the number of true positives, false positives, true negatives and false negatives, respectively. Precision can be computed as

$$P = \frac{T_p}{T_p + F_p}$$

and recall as

$$R = \frac{T_p}{T_p + F_n}$$
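The two formulas above can be sketched in a few lines of Python (the counts are illustrative, not taken from a real dataset):

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# e.g. 90 true positives, 10 false alarms, 30 missed instances
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p)  # 0.9
print(r)  # 0.75
```

Note that both functions guard against empty denominators, which occur for classes absent from both the prediction and the ground truth.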

When tuning the parameters of a detection algorithm, there is usually an inverse relationship between precision and recall: increasing the sensitivity raises recall at the cost of precision, and lowering it does the opposite.

The $$F_1$$-score and the Intersection over Union $$IoU$$ are examples of measures that combine precision and recall into a single figure. The $$F_1$$-score is the harmonic mean of the two:

$$F_1 = 2 \frac{P \times R}{P + R}$$

and the $$IoU$$ is computed as

$$IoU = \frac{T_p}{T_p + F_p + F_n}$$
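A minimal sketch of both combined measures, using the same illustrative counts as above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def iou(tp, fp, fn):
    """Intersection over Union from raw counts."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

print(f1_score(0.9, 0.75))  # ~0.818
print(iou(90, 10, 30))      # ~0.692
```

Both figures penalize false positives and false negatives simultaneously, which is why a single number can summarize the precision/recall trade-off.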

However, the most comprehensive tool for discussing the results of a classification algorithm is the confusion matrix: a square table where each row depicts the instances of an actual class, while each column represents the instances of a predicted class. The closer it is to a diagonal matrix, the better the algorithm performed.
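Such a matrix can be built directly from two lists of labels. The class names and toy label sequences below are hypothetical, chosen only to illustrate the row/column convention described above:

```python
from collections import Counter

classes = ["ground", "building", "cable"]

# Toy ground-truth and predicted labels, one entry per point
truth = ["ground", "ground", "building", "cable", "cable", "cable"]
pred  = ["ground", "building", "building", "cable", "cable", "ground"]

# matrix[actual][predicted] = number of points
pairs = Counter(zip(truth, pred))
matrix = [[pairs[(a, p)] for p in classes] for a in classes]

for cls, row in zip(classes, matrix):
    print(f"{cls:>8}: {row}")

# Per-class recall is the diagonal cell divided by its row sum
recall_cable = matrix[2][2] / sum(matrix[2])
```

Reading the matrix, precision for a class is its diagonal cell divided by the column sum, and recall is the diagonal cell divided by the row sum.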

## Scale of the study

In a classified point cloud, each point is assigned a label. As a result, it is possible to compute performance metrics at different scales:

• At the scale of points. In this case, each point is considered an instance. This approach is relevant for applications where each individual misclassified point can lead to an erroneous analysis, or for classes with a large spatial extent (ground, building, cable, ...). However, it is not robust to clouds with varying point densities.
• At the scale of objects. In some applications, what matters most is to detect instances of objects; it is unnecessary to retrieve every point that composes them. For example, when studying traffic signs or transmission towers, the goal is to detect enough points to locate them, model them or identify their type; chasing every single point is pointless. In this kind of application, the performance metrics can be computed at the scale of objects. The only difference is that we need to define what a true positive or a false negative is at that scale. For example, we can decide that an object is detected (i.e. is a true positive) if 90% of its points (or 85% of its hull) are found, and is missed (i.e. is a false negative) otherwise.
• Specific cases. Some applications require custom rules. For example, for linear objects (powerlines, rails, ...), these counts can represent the length of correctly detected segments instead of numbers of instances.
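The object-scale rule mentioned above (an object counts as detected when 90% of its points are found) can be sketched as follows; the traffic-sign counts are made up for illustration:

```python
def object_detected(n_points_found, n_points_total, threshold=0.9):
    """True positive at object scale: at least `threshold` of the
    object's points were retrieved (the 90% rule from the text)."""
    return n_points_found / n_points_total >= threshold

# Hypothetical (points found, points total) per traffic sign
signs = [(95, 100), (40, 100), (88, 90)]

tp = sum(object_detected(found, total) for found, total in signs)
fn = len(signs) - tp
print(tp, fn)  # 2 objects detected, 1 missed
```

From these object-level counts, precision and recall are then computed with the same formulas as at the point scale.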

## Testing set

In practice, such performance metrics are computed on a sample dataset. The results we obtain automatically are compared to a ground truth: a reference version of the testing set in which every point is perfectly classified.

Unfortunately, the only way to obtain such a ground truth is to perform the classification by hand, having the labels assigned manually by a specialist of the data at hand. This is very tedious work, which explains why providing precision and recall figures is sometimes difficult: in most cases, no representative ground truth is available and it would take too much time to build one.

*Example of a testing set in a powerline environment. Left: point cloud classified automatically. Right: reference point cloud manually classified by a specialist.*

## Industrial applications

In some industrial applications, the indices defined above do not really matter; a qualitative analysis is more important to our clients.

Moreover, not all classification errors are equally important to them: some are critical and put the rest of the processing chain at risk, while others are insignificant and have no influence on the remainder of the process. For example, if the final application is vegetation monitoring, a cable point classified as vegetation will raise a false alarm, whereas a building point left unclassified is harmless.

This is why some of our clients decide to add a study of the manual fixing time: they measure how long it takes an operator to correct the output of our algorithms so that it is fit for their industrial use. Better precision and recall scores do not always mean shorter fixing times.

## The importance of designing a learning-free algorithm

Ultimately, even if precision, recall and the $$F_1$$-score are useful working tools, our goal is to reduce the manual fixing time and make it ever more straightforward for our clients to integrate our software into their processing chains.

This is achieved by reducing what our clients see as critical errors, that is to say, by driving down particular cells of the confusion matrix.
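One way to make this concrete is to weight each off-diagonal cell of the confusion matrix by an application-specific cost. The class names, costs and counts below are hypothetical, echoing the vegetation-monitoring example from the previous section:

```python
# Hypothetical cost per (actual, predicted) cell: critical confusions
# (e.g. a cable point labelled as vegetation) weigh more than harmless ones.
cost = {
    ("cable", "vegetation"): 10.0,      # false alarm in vegetation monitoring
    ("building", "unclassified"): 0.0,  # harmless in this application
}

def weighted_error(confusion, cost, default=1.0):
    """Sum the off-diagonal confusion counts, weighted by application cost."""
    total = 0.0
    for (actual, predicted), count in confusion.items():
        if actual != predicted:
            total += count * cost.get((actual, predicted), default)
    return total

# Toy confusion counts stored as {(actual, predicted): n_points}
confusion = {
    ("cable", "vegetation"): 3,
    ("building", "unclassified"): 50,
    ("vegetation", "vegetation"): 1000,  # diagonal cell, ignored
}
print(weighted_error(confusion, cost))  # 30.0
```

Under such a cost structure, three critically misclassified cable points outweigh fifty harmlessly unclassified building points, which matches how our clients actually perceive the errors.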

We could have designed a theory-of-everything algorithm based on learning techniques*: it would have maximized our "accuracy" scores, minimized our efforts and sounded very sexy. However, our learning-free approach has two noteworthy benefits:

1. There is no need for you to provide a manually labelled training set; our software is self-contained.
2. Our engineers master every layer of the algorithms; they know how to treat the specific errors you might find in our results. This would be impossible with a black-box classifier that no human being knows how to influence.

*: Actually, we did it. Twice. But we won't use it in production. [Handcrafted features][Deep Learning]
