CNN-based Density Estimation and Crowd Counting

Jun 19, 2021

A survey of recent advances in CNN-based single image crowd counting and density estimation

What is crowd counting?

Crowd Counting is a technique to count or estimate the number of people in an image.

In principle, the people in an image can be counted directly, one by one. In densely crowded scenes, however, this is nearly impossible.

We do not yet have an algorithm that calculates the exact number of people in a crowd image. Most computer vision techniques return an approximate count instead.

Why crowd counting?

There are plenty of scenarios where crowd counting algorithms are changing the way industries work:

Counting the number of people attending a sporting event
Estimating how many people attended an inauguration or a march (political rallies, perhaps)
Monitoring of high-traffic areas
Helping with staff allocation and resource allotment

Many methods have been proposed for crowd counting.

Evolution of crowd counting methods.

There are 4 types of crowd counting methods.

Detection-based methods
Cluster-based methods
Feature regression-based methods
CNN-based methods

Detection-based methods

This is a supervised method. It applies an object detector to identify each person in the image and then counts the detections.

This type of method is good at detecting faces, but it fails in crowded scenes because:

The appearance of people varies widely with body pose and clothing.
People in crowd scenes are often only partially visible, which hurts detection.
Objects are hard to detect when the image resolution is low.
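
As a rough illustration of the detection-then-count idea, the sketch below assumes a detector has already produced bounding boxes with confidence scores. The box format, the thresholds, and the function names `iou` and `count_people` are hypothetical choices made for this example and do not come from a specific library.

```python
from typing import List, Tuple

# Hypothetical detection format: (x1, y1, x2, y2, score).
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_people(boxes: List[Box], score_thr: float = 0.5, iou_thr: float = 0.5) -> int:
    """Count detections after score filtering and greedy non-maximum suppression."""
    # Keep confident detections, strongest first.
    candidates = sorted(
        (b for b in boxes if b[4] >= score_thr), key=lambda b: b[4], reverse=True
    )
    kept: List[Box] = []
    for box in candidates:
        # Drop a box that overlaps too much with an already accepted one.
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return len(kept)


detections = [
    (10, 10, 50, 90, 0.95),
    (12, 12, 52, 92, 0.80),     # near-duplicate of the first box
    (100, 20, 140, 100, 0.90),
    (200, 30, 240, 110, 0.30),  # below the score threshold
]
print(count_people(detections))  # 2
```

The last detection shows the weakness listed above: a person who is partially hidden or very small tends to receive a low score and drops out of the count.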

Cluster-based methods

This is an unsupervised method. It groups tracked visual features into clusters, each of which represents an independently moving entity. Features are grouped by how individuals move or by other visual cues such as clothing. The method requires the targets to be in continuous motion.

This method fails because:

It relies on the continuous motion of the targets, so inaccuracies arise when people in the scene are static.
Inaccuracies also arise when two or more targets share common features or move in the same way over time.
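
The grouping idea can be sketched with a toy example. It assumes that feature points have already been tracked and that each track is summarized by a position and a velocity. The `Track` format, both thresholds, and the function `count_moving_entities` are illustrative assumptions, not a published algorithm.

```python
from math import hypot
from typing import List, Tuple

# Hypothetical track summary: (x, y, vx, vy) for one tracked feature point.
Track = Tuple[float, float, float, float]


def count_moving_entities(
    tracks: List[Track], dist_thr: float = 30.0, vel_thr: float = 1.0
) -> int:
    """Group tracks that are close together and move alike; return the number of groups."""
    parent = list(range(len(tracks)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            xi, yi, vxi, vyi = tracks[i]
            xj, yj, vxj, vyj = tracks[j]
            close = hypot(xi - xj, yi - yj) <= dist_thr
            coherent = hypot(vxi - vxj, vyi - vyj) <= vel_thr
            if close and coherent:
                parent[find(i)] = find(j)  # merge the two groups

    return len({find(i) for i in range(len(tracks))})


tracks = [
    (10, 10, 2.0, 0.0), (15, 20, 2.1, 0.0), (12, 30, 1.9, 0.1),  # moving right
    (20, 15, -2.0, 0.0), (25, 25, -2.0, 0.1),                    # moving left
    (200, 50, 0.0, 1.5),                                         # far away
]
print(count_moving_entities(tracks))  # 3
```

In this sketch, two people standing still next to each other would both have zero velocity and would merge into a single group, which is the kind of inaccuracy described above.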

Feature regression-based methods

This is a supervised method. These methods commonly have four main steps:

Define the region of interest.
Compute a perspective map for the previously defined region.
Extract low-level features, such as foreground pixels and edges, from the input image.
Pass the features as inputs to the regression model.
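
The four steps can be sketched on toy data. The example below makes several simplifying assumptions: the only feature is the perspective-weighted foreground area, the regression model is a straight line fitted by ordinary least squares, and the masks, weights, and training pairs are invented for illustration.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def weighted_foreground(fg: Grid, roi: Grid, perspective: Grid) -> float:
    """Sum foreground pixels inside the ROI, each weighted by the perspective map."""
    total = 0.0
    for fg_row, roi_row, w_row in zip(fg, roi, perspective):
        for f, r, w in zip(fg_row, roi_row, w_row):
            total += f * r * w
    return total


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


# Step 1: region of interest (the top row is excluded).
roi = [[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
# Step 2: perspective map; rows farther from the camera get larger weights.
perspective = [[3, 3, 3, 3], [2, 2, 2, 2], [1, 1, 1, 1]]
# Step 3: low-level feature from a binary foreground mask.
foreground = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 1, 0]]
feature = weighted_foreground(foreground, roi, perspective)

# Step 4: regression fitted on invented (feature, count) training pairs.
a, b = fit_line([2.0, 4.0, 8.0], [1.0, 2.0, 4.0])
predicted_count = a * feature + b
print(feature, round(predicted_count, 3))  # 6.0 3.0
```

Both the ROI and the perspective weights are fixed per scene in this sketch, so they would have to be redefined before the fitted line could be reused on footage from a different camera.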

The weakness of this method is its dependence on the predefined perspective map. When the model is applied to another scene, the perspective no longer matches and the results become much less accurate.

CNN (Convolutional Neural Network)-based methods

A neural network can extract meaningful features from an image. During training, the network can find patterns that are hard to see intuitively or hard to handle by hand. Compared with the other methods, CNN-based methods therefore allow the image to be represented better within the network.

CNN-based algorithms are currently very popular in crowd counting because they offer better accuracy and flexibility. For the methods described above, foreground segmentation is indispensable but difficult to implement. In contrast, systems based on deep CNNs (DCNNs) require neither foreground segmentation nor hand-crafted feature extraction.

Deep learning has already produced many breakthroughs in crowd counting. The rest of this article therefore looks more closely at CNN-based algorithms and discusses their structures and innovations.

Categories of CNN-based crowd counting methods

Architecture for crowd counting
Learning paradigm of the method
Inference manner of the network
Supervision form of the network
Domain adaptation
Instance-/image-based supervision

Architecture for crowd counting

In view of different types of network architectures, we divide crowd counting models into three categories:

Basic CNN-based methods
Multi-column-based methods
Single-column-based methods.

Basic CNN-based methods

This network architecture adopts only the basic CNN layers, namely convolutional layers, pooling layers, and fully connected layers, and requires no additional feature information. Such networks generally appear in the initial works that used CNNs for density estimation and crowd counting.

Multi-column-based methods

These network architectures usually adopt different columns to capture multi-scale information corresponding to different receptive fields.
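
To make the link between columns and receptive fields concrete, the sketch below computes the receptive field of three hypothetical columns that differ only in kernel size. The layer lists are illustrative assumptions loosely modeled on multi-column designs and are not the exact configuration of any published network.

```python
from typing import List, Tuple

# Each layer is described as (kernel_size, stride).
Layer = Tuple[int, int]


def receptive_field(layers: List[Layer]) -> int:
    """Side length, in input pixels, seen by one output unit of a layer stack."""
    rf, jump = 1, 1  # jump = spacing of this layer's units measured in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def column(first: int, rest: int) -> List[Layer]:
    """Hypothetical column: conv, 2x2 pool, conv, 2x2 pool, conv, conv."""
    return [(first, 1), (2, 2), (rest, 1), (2, 2), (rest, 1), (rest, 1)]


columns = {
    "large": column(9, 7),
    "medium": column(7, 5),
    "small": column(5, 3),
}
for name, layers in columns.items():
    print(name, receptive_field(layers))
# large 72
# medium 50
# small 28
```

Under these assumptions the large column sees a 72-pixel window and the small one a 28-pixel window, which is one way to read the claim that different columns respond to heads of different sizes.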

Single-column-based methods.

The single-column-based approach usually deploys a single, deeper CNN rather than the wide structure of multi-column networks such as MCNN.

Learning paradigm of the method

From the view of different paradigms, crowd counting networks can be divided into two categories:

Single-task-based method
Multi-task-based method

Single-task-based method

Most CNN-based crowd counting methods belong to this paradigm. They generally either generate a density map and then sum all of its pixels to obtain the total count, or predict the count directly.
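
The relation between a density map and the count can be shown with a small sketch. It assumes the common practice of placing one Gaussian blob per annotated head and normalizing each blob so that it sums to 1. The function name `density_map`, the fixed `sigma`, and the toy head positions are choices made for this example.

```python
from math import exp
from typing import List, Tuple


def density_map(
    points: List[Tuple[int, int]], height: int, width: int, sigma: float = 2.0
) -> List[List[float]]:
    """Build a density map with one normalized Gaussian per (x, y) head annotation."""
    dmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        # Unnormalized Gaussian centered on the annotated head position.
        kernel = [
            [exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2)) for x in range(width)]
            for y in range(height)
        ]
        # Normalize over the image so that each person contributes exactly 1.
        norm = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / norm
    return dmap


heads = [(5, 5), (6, 5), (20, 12)]  # toy (x, y) head annotations
dmap = density_map(heads, height=16, width=32)
count = sum(map(sum, dmap))         # summing all pixels recovers the count
print(round(count, 6))              # 3.0
```

A network trained to regress such a map keeps the rough location of each person, while a network that predicts the count directly outputs only the final number.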

Multi-task-based method

This method achieves good performance by combining density estimation with other tasks such as classification, detection, and segmentation. Multi-task-based methods are generally designed with multiple subnets. In contrast to a pure single-column architecture, they may also have additional branches that correspond to the different tasks.

Multi-task architectures can be seen as a cross between multi-column-based and single-column-based methods, borrowing ideas from both.
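
One common way to train such a network is to add the losses of the individual tasks with a weighting factor. The sketch below assumes a density-map branch trained with mean squared error and an auxiliary classification branch trained with cross-entropy. The loss choices, the weight `aux_weight`, and all example values are assumptions made for illustration.

```python
from math import log
from typing import List, Sequence

Grid = List[List[float]]


def density_mse(pred: Grid, target: Grid) -> float:
    """Mean squared error between predicted and ground-truth density maps."""
    diffs = [(p - t) ** 2 for p_row, t_row in zip(pred, target) for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the true class under the predicted probabilities."""
    return -log(probs[label])


def multi_task_loss(
    pred_map: Grid,
    gt_map: Grid,
    class_probs: Sequence[float],
    class_label: int,
    aux_weight: float = 0.1,
) -> float:
    """Density loss plus a weighted auxiliary classification loss."""
    return density_mse(pred_map, gt_map) + aux_weight * cross_entropy(class_probs, class_label)


pred_map = [[0.0, 0.5], [0.5, 0.0]]
gt_map = [[0.0, 1.0], [0.0, 0.0]]
class_probs = [0.2, 0.7, 0.1]  # e.g. sparse / medium / dense crowd
loss = multi_task_loss(pred_map, gt_map, class_probs, class_label=1)
print(round(loss, 4))  # 0.1607
```

The weight controls how strongly the auxiliary branch influences the shared features, and it usually has to be tuned per dataset.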

Inference manner of the network

Based on the training manner, CNN-based crowd counting approaches can be divided into two categories:

Patch-based methods
Whole image-based methods

Patch-based methods

The model is trained on patches randomly cropped from the image. In the test phase, a sliding window is moved over the whole image.
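
The two phases can be sketched as follows. The example assumes non-overlapping windows at test time and uses a stand-in `oracle_count` function in place of a trained network. Both function names and the toy image are hypothetical.

```python
import random
from typing import Callable, List

Grid = List[List[float]]


def random_crop(image: Grid, ph: int, pw: int, rng: random.Random) -> Grid:
    """Training phase: cut a random ph x pw patch out of the image."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - ph + 1)
    left = rng.randrange(w - pw + 1)
    return [row[left:left + pw] for row in image[top:top + ph]]


def sliding_window_count(
    image: Grid, ph: int, pw: int, predict_count: Callable[[Grid], float]
) -> float:
    """Test phase: tile the image with non-overlapping windows and add up the counts."""
    h, w = len(image), len(image[0])
    total = 0.0
    for top in range(0, h, ph):
        for left in range(0, w, pw):
            # Slicing clips the window at the image border.
            patch = [row[left:left + pw] for row in image[top:top + ph]]
            total += predict_count(patch)
    return total


def oracle_count(patch: Grid) -> float:
    """Stand-in for a trained model: the toy image already stores one 1 per person."""
    return sum(map(sum, patch))


image = [[0] * 8 for _ in range(6)]
image[1][2] = image[4][5] = image[5][7] = 1

patch = random_crop(image, 4, 4, random.Random(0))
print(len(patch), len(patch[0]))                     # 4 4
print(sliding_window_count(image, 4, 4, oracle_count))  # 3.0
```

With overlapping windows, the per-window predictions would have to be averaged in the overlapping regions before summing. The sketch avoids this by using a stride equal to the patch size.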

Whole image-based method

The whole image is taken as input, and the network outputs the density map and/or the count. The weakness of this method is the loss of local information.

Supervision form of the network

According to whether human-labeled annotations are used for training, crowd counting methods can be classified into two categories:

Fully supervised methods
Un-/self-/semi-supervised methods

Domain adaptation

Almost all existing counting methods are designed for a specific domain. Designing a counting model that can count objects from any domain is therefore a challenging yet meaningful task, and domain adaptation may be a powerful tool for tackling it.

Instance-/image-based supervision

The aim of object counting is to estimate the number of objects. If the ground truth is labeled with a point or bounding box for each object, the method uses instance-level supervision. In contrast, image-level supervision only needs the number of object instances in each image as its label, without their locations.

References

https://github.com/gjy3035/Awesome-Crowd-Counting

Written by Rukmal Senavirathne