VAR-CNN(Football): Foul or Clean Tackle

Jan 5, 2022

Abstract

Football is the most followed sport in the world, played in more than 200 countries. The sport has developed a great deal over the past century, and so has the technology involved in the game. The Video Assistant Referee (VAR) is one such technology and has had a large impact on the game. The role of VAR is simple to state yet complex in practice: to intervene in play when the referees make a wrong decision or cannot make one. A specific scenario arises when they have to decide whether a sliding tackle inside the box was clean or should result in a penalty for the opposition. The technology lets officials replay the moment of the tackle, but the decisions are still made by humans and can therefore be biased. I propose a CNN-based foul detector built on the principle of the initial point of contact.

Introduction

Modern football originated in England in 1863, nearly 160 years ago. Since then, it has become one of the biggest entertainment sports in history. According to FIFA, there are more than 265M players, both men and women, and more than 3.5B people call themselves football fans. This hugely popular sport has come a long way, but all 17 laws have remained intact. Until recently, the decision to award a penalty, generally after a defender tackles an opponent inside the defender's own penalty box, was made solely by the on-field referee. In the 2012–13 season, the Video Assistant Referee (VAR) was introduced in the Dutch league; like any other technology that people use to make their lives easier, it was meant to make the referees' job easier. According to the official FIFA website, the goals of VAR for different scenarios are:

1. Goals

The role of the VAR is to assist the referee to determine whether there was an infringement that means a goal should not be awarded. As the ball has crossed the line, play is interrupted so there is no direct impact on the game.

2. Penalty Decisions

The role of the VAR is to ensure that no clearly wrong decisions are made in conjunction with the award or non-award of a penalty kick.

3. Direct Red Card Incidents

The role of the VAR is to ensure that no clearly wrong decisions are made in conjunction with sending off or not sending off a player.

4. Mistaken Identity

The referee cautions or sends off the wrong player, or is unsure which player should be sanctioned. The VAR will inform the referee so that the correct player can be disciplined.

In this project, we concentrate on the second use case, penalty decisions. Although these decisions are made after a thorough check of replays of the tackle from different angles, the task still depends on humans and may carry bias. To automate this process, I propose a Convolutional Neural Network that takes an image of the initial point of contact as input and predicts whether a foul has been committed. Penalty decisions could then be automated rather than based on human investigation.

The VAR-CNN

The model we propose is a Convolutional Neural Network that takes images of the initial contact and classifies them. In this section, we discuss the data, the model architecture, the results and the inferences drawn from our model. The model is a small stand-in for the actual video assistant referee, hence the name VAR-CNN. Without further ado, let's get started.

Data

Collecting the data was a laborious task, as there is no open-source resource for this kind of data for any league. The only available sources are video clips of European matches and compilations of tackles and fouls on YouTube. A small portion of the data was also taken from the paper Soccer Event Detection Using Deep Learning.

The variations in data can be observed above. A total of 1200+ images were scraped for two classes, Clean Tackles and Fouls. A clean tackle, as the name suggests, is when the defender gets the ball first, so the initial contact is with the ball. A foul, on the contrary, is when the defender makes contact with the player first. This rule was the basis of this study and of the data collection. The dataset records the moment of initial contact and the moment just after it. The dataset will be made available in the GitHub repository that will be shared with this article.

Model & Architecture

The model used in this study is a convolutional neural network implemented in Python using TensorFlow. CNNs are based on the principle that the inputs (images) are convolved with the kernels inside each filter, which in turn generates a feature map. At each position, the convolution multiplies the kernel weights element-wise with the pixels under the kernel and sums the products. A filter has a separate kernel for each input channel, and the sum of the kernel outputs across channels is the corresponding pixel value on the feature map.

The model was quite simple, with 650k+ parameters and only dropout used for regularization. Other regularization options such as batch normalization and L1 and L2 penalties were also tried, but dropout was the most successful in terms of generalization, although with BatchNorm the loss clearly converged much faster for the same hyperparameter settings. The dropout rate was kept at 0.5 in the dense layers. The advantage of dropout is that it prevents overfitting by making the neurons learn individually rather than co-dependently: for each batch/example in training it randomly drops 50% (dropout rate 0.5) of the neurons, which in effect gives a new neural network to each batch, and the final model approximates the average prediction of all those possible combinations.
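To make the convolution step described in this section concrete, here is a minimal NumPy sketch of one filter applied to a multi-channel image. The function name `conv2d_single_filter` and the toy shapes are illustrative assumptions, not code from the project; like most deep learning libraries, it computes a cross-correlation (no kernel flip), with no padding and a stride of 1.

```python
import numpy as np

def conv2d_single_filter(image, kernels):
    """Apply one filter to an (H, W, C) image.

    `kernels` has shape (k, k, C): one k x k kernel per input channel.
    Hypothetical helper for illustration only.
    """
    h, w, _ = image.shape
    k = kernels.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k, :]
            # Element-wise product with each channel's kernel, then a sum
            # over all positions and channels gives one feature-map pixel.
            out[i, j] = np.sum(patch * kernels)
    return out

# Toy input: a 4x4 RGB image of ones and a 3x3 filter of ones.
image = np.ones((4, 4, 3))
kernels = np.ones((3, 3, 3))
feature_map = conv2d_single_filter(image, kernels)
print(feature_map.shape)  # (2, 2)
print(feature_map)        # every entry is 27.0 (3 * 3 * 3 products of 1)
```

A real layer stacks many such filters (64 in the early layers of this model) and adds a bias and an activation, which this sketch leaves out.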

The initial convolution layers had a kernel size of 5 and 64 filters each, followed by max-pooling, while the later layers had a decreasing number of filters, a kernel size of 3 and a dilation rate of 2, followed by max-pooling. The dilation rate was kept at this standard value, and the advantage of dilated convolutions is clear: they provide a larger receptive field. An example of dilated convolutions is shown below:
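As a rough numerical illustration of why dilation enlarges the receptive field, the sketch below computes the span covered by a single dilated kernel using the standard relation k_eff = k + (k - 1)(d - 1). The helper names are hypothetical and chosen only for this example.

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Span (in pixels) covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)

def sampled_offsets(kernel_size, dilation_rate):
    """Input offsets that the kernel taps actually read along one axis."""
    return [i * dilation_rate for i in range(kernel_size)]

print(effective_kernel_size(3, 1))  # 3  (ordinary 3x3 convolution)
print(effective_kernel_size(3, 2))  # 5  (3x3 kernel, dilation rate 2)
print(sampled_offsets(3, 2))        # [0, 2, 4]
```

With a kernel size of 3 and a dilation rate of 2, each kernel therefore covers a 5x5 window while still using only 9 weights per channel.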

The model architecture is given below:

The last 3 convolutional blocks use dilation, the dense layers use ReLU activations and the output uses a sigmoid activation, since this is a binary classification model. The input images were resized to 256×256 using nearest-neighbor interpolation, and data augmentation applied techniques such as rotation, horizontal flip, vertical flip and brightness changes. During training, an early stopping callback monitored the validation loss with a patience of 10 epochs and restored the best weights. The training accuracy achieved was 76.6% and the validation accuracy was 78%. These accuracies are low but acceptable given the size and complexity of the dataset.
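The description above can be sketched in Keras roughly as follows. This is a minimal sketch, not the exact project code: the number of blocks, the filter counts of the later layers (48, 32, 16), the ReLU activations in the convolution layers, the `"same"` padding, the dense width of 64 and the Adam optimizer are all assumptions made for illustration, so the parameter count will not match the 650k+ quoted earlier.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_var_cnn(input_shape=(256, 256, 3)):
    """Hypothetical VAR-CNN-style binary classifier (foul vs clean tackle)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Early blocks: kernel size 5, 64 filters, each followed by max-pooling.
        layers.Conv2D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        # Last 3 blocks: kernel size 3, dilation rate 2, decreasing filters
        # (the exact filter counts here are assumed).
        layers.Conv2D(48, 3, dilation_rate=2, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, dilation_rate=2, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(16, 3, dilation_rate=2, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        # Dense head: ReLU with dropout 0.5, sigmoid output for two classes.
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping as described: watch validation loss, patience of 10 epochs,
# and restore the best weights seen during training.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)

model = build_var_cnn()
model.summary()
```

The callback would then be passed to training via `model.fit(..., callbacks=[early_stop])`, with the training and validation data supplied by whatever augmentation pipeline is in use.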

Overfitting was observed in every model regardless of the regularization combination, but with dropout it set in latest, and because we used early stopping the best weights were restored.
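For readers who want to see what a dropout rate of 0.5 does to a layer's activations, here is a small NumPy sketch of inverted dropout, the variant Keras applies: surviving activations are scaled up during training so that nothing needs rescaling at inference. The function name and the toy input are assumptions for illustration.

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero a random fraction `rate` of the activations
    during training and rescale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        # At inference time the layer passes activations through unchanged.
        return activations
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

x = np.ones(8)
train_out = dropout(x, rate=0.5, rng=np.random.default_rng(0))
infer_out = dropout(x, training=False)
print(train_out)  # each entry is either 0.0 (dropped) or 2.0 (kept, rescaled)
print(infer_out)  # identical to x
```

Because a fresh mask is drawn for every batch, each training step effectively updates a different sub-network, which is the averaging effect described earlier.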

Inference

The inferences were drawn using Grad-CAM++ via the tf-keras-vis module in Python. Several interesting insights emerged while drawing them. 1) The model does take into account the initial point of contact. 2) The model also considers the pose (the slide of the defender) in making the decision. 3) When the model made wrong decisions, it was not focusing on the above two but on the surroundings or the green grass.

The above inference is a case where the model predicted the class correctly. The focus is on the players' postures and the initial contact. In Figure 4, you can clearly see that it takes into account both the players' postures and the initial point of contact. Figure 3 shows that the initial point of contact with the opposition player as well as with the ball is taken into account in the decision.

In Figure 5, the original image corresponds to a foul but is classified as a clean tackle. Observe that the initial point of contact is not considered at all; some focus is on the postures, but most is on the green grass. This is quite common in misclassified images. The issue can likely be resolved if more data becomes available for both classes and the quality of the data improves.

The V in VAR stands for video, so what use is our model if we cannot produce predictions and inferences in real time? We are skipping predictions for now, as they will become relevant once more data is acquired and better accuracy is achieved. We can still answer the question of why by using the above inferences, and making them real-time is the task at hand. Here is an example of real-time inferences.

The better the model, the better the inferences. The noticeable fact is that the model is able to focus on the players' legs and the initial point of contact with the ball. The inferences do wander, but the model is getting where we want it to.

Future Work

Future work involves improving the model by increasing the volume of the data as well as the variety of fouls. In this project, we studied only sliding tackles. Once a model with better accuracy is achieved, it may become the next advancement in football’s decision making.

The code for this project will be available here on GitHub. The data has been made available there.