Deep learning has revolutionized computer vision, and Convolutional Neural Networks (CNNs) now achieve impressive results on tasks such as semantic segmentation and object classification for 2D images. Convolution is the fundamental building block of a neural network for image processing, but its definition does not carry over directly to point clouds. In this article, we present a convolution operator that operates on 3D point clouds without any intermediate representation. It can be used to design networks performing segmentation or classification tasks on point clouds.

## Convolutions and image processing

Discrete convolution is the basic operation of CNNs and image processing. This linear spatial filter combines data from local neighborhoods on a 2D grid. Given a kernel \(g\) (a set of weights localized by pixel coordinates), the convolution of an image \(I\) with \(g\) at pixel \(x\) is defined by:

\begin{equation}

(I \ast g)(x)=\sum_{y \in N_x}g(x-y)I(y)

\end{equation}

where \(N_x\) is the neighborhood of \(x\), whose size depends on the choice of kernel \(g\).
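As a concrete illustration, here is a minimal NumPy sketch of this discrete convolution. The function name `conv2d` and the valid-mode output size are choices of this example, not part of the definition above:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution of a single-channel image, as in Eq. (1).

    For each output pixel x, sums g(x - y) * I(y) over the neighborhood
    N_x covered by the kernel. The kernel is flipped, per the convolution
    definition; a correlation would omit the flip.
    """
    kh, kw = kernel.shape
    g = kernel[::-1, ::-1]  # flip to implement convolution, not correlation
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(g * image[i:i + kh, j:j + kw])
    return out
```

In a real CNN this loop is replaced by heavily optimized GPU kernels, which is precisely what the grid structure of images makes possible.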

In CNNs, convolution weights (the values of kernels \(g\) for each pixel) are learned at various layers (corresponding to various scales), allowing the network to respond to geometric features of the input.

Thanks to the regular structure of images, convolutions can be computed with high efficiency on modern hardware. However, 3D point clouds have no such regular structure.

## Definition of a convolution operator for point clouds

Inspired by image convolutions, we can define a convolution operator that operates directly on points.

In an image, each pixel is associated with a color, a gray level or, more generally, a vector of features. Likewise, in a point cloud, each point can be associated with a vector of features. Let \(P \subset \mathbb{R}^3\) be a point cloud and \(F = \left\{ f(x) ~|~ x \in P \right\}\) be the set of features held by the points of \(P\). The generic definition of a convolution of \(P\) (or rather of \(F\)) with kernel \(g\) at point \(x \in \mathbb{R}^3\) would be:

\begin{equation}

(F \ast g)(x)=\sum_{y \in N_x}g(x-y)f(y).

\end{equation}

By analogy, we would like \(g\) to apply different weights to different areas in a well-chosen neighborhood of \(x\). We fix a radius \(r\) and try to design a kernel \(g\) whose domain of definition is the ball \(B_r^3 = \left\{ x \in \mathbb{R}^3 ~|~ \left\| x \right\| \leq r \right\}\).
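The neighborhood \(N_x\) restricted to this ball can be sketched in NumPy with a brute-force search; the helper name `radius_neighbors` is hypothetical, and a real implementation would use a k-d tree or grid subsampling for efficiency:

```python
import numpy as np

def radius_neighbors(points, x, r):
    """Indices of the points of P lying in the ball of radius r around x.

    Brute-force O(N) search over an (N, 3) array of points, kept simple
    for illustration.
    """
    dists = np.linalg.norm(points - x, axis=1)
    return np.where(dists <= r)[0]
```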

There are many ways to define areas in 3D space, but relating the definition to points may be the most intuitive, as features are also localized by points. Let \( (\tilde{x}_k)_{k < K} \subset B_r^3\) be a set of \(K\) kernel points and \((W_k)_{k < K}\) be their associated weight matrices. We define the kernel function \(g\) for any point \(x \in B_r^3\) as:

\begin{equation}

g(x)=\sum_{k < K}h(x,\tilde{x}_k)W_k

\end{equation}

where \(h\) is a correlation between \(x\) and a given kernel point \(\tilde{x}_k\) that should increase when the distance from \(x\) to \(\tilde{x}_k\) decreases. For its simplicity, a linear correlation is a reasonable choice:

\begin{equation}

h(x,\tilde{x}_k) = \max\left(0, 1 - \frac{\left\| x - \tilde{x}_k \right\|}{\sigma}\right)

\end{equation}

where \(\sigma\) is the influence distance of the kernel points. \(\sigma\) must be chosen according to the density of the input point cloud.
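The kernel function \(g\) and the linear correlation \(h\) take only a few lines of NumPy. This is an illustrative sketch: `linear_correlation` and `kernel_value` are names chosen here, and the weight matrices \(W_k\) are stored as a single \((K, D_{in}, D_{out})\) array:

```python
import numpy as np

def linear_correlation(x, kernel_points, sigma):
    """h(x, x_k) = max(0, 1 - ||x - x_k|| / sigma) for every kernel point."""
    dists = np.linalg.norm(kernel_points - x, axis=1)
    return np.maximum(0.0, 1.0 - dists / sigma)

def kernel_value(x, kernel_points, weights, sigma):
    """g(x) = sum_k h(x, x_k) W_k, a (D_in, D_out) matrix.

    kernel_points: (K, 3) positions, weights: (K, D_in, D_out) matrices.
    """
    h = linear_correlation(x, kernel_points, sigma)  # shape (K,)
    return np.einsum('k,kio->io', h, weights)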

Now that we have defined a generic convolution operator, it is possible to design CNNs operating directly on points where

- the radius \(r\) and influence distance \(\sigma\) are fixed for each layer,
- the number \(K\) and positions of the kernel points \((\tilde{x}_k)_{k < K}\) are fixed for each layer,
- the associated weight matrices \((W_k)_{k < K}\) are learned.
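Putting the pieces together, a single evaluation of \((F \ast g)(x)\) can be sketched as follows. This is an unoptimized, illustrative NumPy version (brute-force neighborhood search, one query point at a time), not the actual KPConv implementation:

```python
import numpy as np

def point_conv(query, points, features, kernel_points, weights, r, sigma):
    """Evaluate (F * g)(x) at a single query point x, following Eq. (2)-(4).

    query:         (3,)              point x where the convolution is evaluated
    points:        (N, 3)            point cloud P
    features:      (N, D_in)         features f(y) carried by the points
    kernel_points: (K, 3)            kernel point positions inside the ball B_r
    weights:       (K, D_in, D_out)  learned weight matrices W_k
    """
    # Radius neighborhood N_x (brute force; use a k-d tree in practice)
    rel = query - points                                  # x - y, argument of g
    mask = np.linalg.norm(rel, axis=1) <= r
    rel, neigh_feats = rel[mask], features[mask]

    # Linear correlation h between each relative position and each kernel point
    d = np.linalg.norm(rel[:, None, :] - kernel_points[None, :, :], axis=2)
    h = np.maximum(0.0, 1.0 - d / sigma)                  # shape (n, K)

    # Sum over neighbors y and kernel points k of h * f(y) W_k
    return np.einsum('nk,ni,kio->o', h, neigh_feats, weights)
```

In a network layer, this evaluation is batched over many query points, and the `weights` array is the learned parameter while `r`, `sigma`, `K` and the kernel point positions stay fixed.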

## Geometric features learned in a Convolutional Neural Network

We can visualize the geometric features learned by the network by coloring the points according to their level of activation for a given feature at a given layer.

In the following figures, we show the low and high-level features learned by a CNN constructed around the convolution operator we defined. We observe that, in its first layer, the network is able to learn low-level features like vertical/horizontal planes, linear structures or corners. In the later layers, the network detects more complex shapes like small buttresses, balls, cones, or stairs.

## References

For a full description of this convolution operator and an overview of the applications it allows through neural networks, please read: Thomas, H. et al. *KPConv: Flexible and Deformable Convolution for Point Clouds*, In International Conference on Computer Vision (ICCV), 2019. [arXiv] [Code]