Analysing cinema is a time-consuming process. In the
cinematography domain alone, there are many factors to
consider: shot scale, shot composition, camera movement,
color, lighting, and so on.
Whatever you shoot is in some way influenced by what
you've watched. There's only so much one can watch, and
even less that one can analyse thoroughly.
This is where neural networks offer ample promise. They
can recognise patterns in images in ways that weren't
possible until less than a decade ago, offering an
enormous speed-up in analysing cinema.
I've developed a neural network that focuses on
one fundamental element of visual grammar:
shot types.
It's capable of recognising 6 unique shot types, and is
~91% accurate. The pretrained model, the validation
dataset (the set of images used to determine its
accuracy), the code used to train the network, and code
to classify your own images are all freely available.
What is Visual Language, and Why Does it Matter?
When you're writing something (an email, an essay,
a report, a paper), you're using the rules of
grammar to put forth your point. Your choice of words,
the way you construct the sentence, correct use of
punctuation, and most importantly, what you have to
say all contribute to the effectiveness of your
message.
Cinema is about how ideas and emotions are expressed
through a visual form. It's a visual language, and just
like any written language, your choice of words (what
you put in the shot/frame), the way you construct the
sentence (the sequence of shots), correct use of
punctuation (editing & continuity) and what you have
to say (the
story) are key factors of creating effective cinema.
The comparison doesn't apply rigidly, but it's a good
starting point for thinking about cinema as a
language.
The most basic element of this language is a shot.
There are many factors to consider while filming a shot:
how big should the subject be?
Should the camera be placed above or below the subject?
How long should the shot be?
Should the camera remain still or move with the
subject, and if it moves, how should it move?
Should it follow the subject, or observe it from a
fixed point while panning left/right or tilting
up/down? Should the movement be smooth or jerky?
There are other major visual factors, such as color and
lighting, but we'll restrict our scope to these. A
filmmaker chooses how to construct a shot based on what
they want to convey, and then juxtaposes shots effectively
to drive home the message.
Neural Networks 101
'AI' is most often a buzzword for
deep learning, the field that uses
neural networks to learn from data.
The key idea is that instead of explicitly specifying patterns
to look for, you specify the rules for the neural network to
autonomously detect patterns from data. The data could be something
structured, like a
database of customers' purchasing decisions, or
something unstructured, like images, audio clips,
medical scans, or video. Neural networks are good at tasks
like predicting a customer's desired products,
differentiating images of dogs and cats, the mating calls
of dolphins and whales, or a video of a goal being scored
vs. the goalkeeper saving the day, and determining
whether a tumor is benign or malignant.
With a large enough labelled dataset (say 1000
images of dogs and cats stored separately), you could use
a neural network to learn patterns from these images. The
network puts the image through a pile of computation and
spits out two probabilities: P(cat) and P(dog).
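As a toy illustration of those two probabilities: a classifier's final layer typically produces raw scores, and a softmax turns them into probabilities that sum to 1. A minimal sketch in plain Python (the scores here are made up, not from a real network):

```python
import math

def softmax(logits):
    """Convert raw network outputs (logits) into probabilities."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from the final layer, for [cat, dog].
p_cat, p_dog = softmax([2.0, 0.5])
print(round(p_cat, 3), round(p_dog, 3))  # prints 0.818 0.182
```

Whatever the raw scores, the outputs are positive and sum to 1, which is what lets us read them as P(cat) and P(dog).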
You calculate how wrong the network was using a loss
function, then use calculus (the chain rule) to tweak this
pile of computation to produce a lower loss (a more
correct output). Training a neural network is nothing but
a sophisticated process of optimising this function.
If the network's output is far off
from the truth, the loss is larger, and so the tweak
made is also larger. Tweaks that are too large are bad, so
you multiply the tweaking factor by a
tiny number known as the learning rate.
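To make the loss-and-tweak loop concrete, here's a bare-bones sketch of gradient descent on a single made-up weight (real networks do the same for millions of weights at once):

```python
# Toy loss: (w - 3)^2 is smallest when w = 3, our imaginary "truth".
def loss(w):
    return (w - 3.0) ** 2

# Calculus gives the gradient of this loss: 2 * (w - 3).
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0               # start with a bad guess
learning_rate = 0.1   # the tiny number that scales each tweak

for _ in range(100):
    # The further w is from the truth, the larger the gradient,
    # and so the larger the tweak -- exactly as described above.
    w -= learning_rate * gradient(w)

print(round(w, 4))  # prints 3.0
```

Try a learning rate of 1.1 instead: each "tweak" overshoots and w diverges, which is why the scaling factor is kept tiny.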
One pass through the entire dataset is known as an epoch.
You'd probably run through many epochs to reach a good
solution. It's a good idea to tweak the images
non-invasively (such as flipping them horizontally), so
that the network sees different numbers for the same image
and can more robustly detect patterns. This is known as
data augmentation.
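Augmentation can be as simple as mirroring the pixel grid. A sketch with a tiny made-up "image" as nested lists (real pipelines use library transforms, but the idea is the same):

```python
# A tiny 2x3 "image": each number is a pixel intensity.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

def horizontal_flip(img):
    """Mirror the image left-to-right by reversing each row."""
    return [row[::-1] for row in img]

flipped = horizontal_flip(image)
print(flipped)  # prints [[3, 2, 1], [6, 5, 4]]
```

The flipped image contains exactly the same content to a human eye, but entirely different numbers to the network, which is what makes it useful extra training data.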
Neural networks can transfer knowledge from one project to
another. It's very common to take a network that's been
trained on 14 million images of a thousand common objects
(ImageNet), and then tweak it to adapt it to your project.
This works because the network has already learnt basic
visual concepts like curves, edges, textures, eyes, etc.,
which come in handy for any visual task. This process is
known as transfer learning.
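One common transfer-learning recipe, sketched with PyTorch (illustrative only, not the exact code behind this project): freeze the pretrained "backbone" and train only a small new head for your own classes.

```python
import torch.nn as nn

# A stand-in for a pretrained backbone; in practice you'd load one
# trained on ImageNet, e.g. via torchvision.models.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Freeze the backbone: it already knows curves, edges, textures, etc.
for param in backbone.parameters():
    param.requires_grad = False

# New head for our task: 6 shot types instead of ImageNet's 1000 classes.
model = nn.Sequential(backbone, nn.Linear(16, 6))

# Only the new head's weight and bias will be updated during training.
trainable = [p for p in model.parameters() if p.requires_grad]
```

Because only the tiny head is trained, this converges quickly even on a modest dataset.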
Rinse and repeat this process carefully, and you have in your
hands an 'AI' solution to your problem.*
If that piqued your interest, I suggest you watch this
(~19 mins) for a fairly detailed explanation of how a
neural network works. If you're bursting with excitement,
follow through with this course.
Neural networks
burst into popularity in the past decade with the
development of large datasets and the ability to
leverage GPUs (graphics cards) for the heavy
computation demanded by neural nets.
Rapid advances on the technical side open up
opportunities to solve novel problems like the one
presented in this post. Shot scale recognition is
one of many possible applications to film. It's
possible to recognise camera movement with 3D CNNs
(convolutional neural networks); the only missing
piece is the dataset. Camera angles could be detected
with the same methodology as this project, but the
dataset for that doesn't exist either.
Cut detection (the transition from one shot to the next)
has been worked on extensively, and can be adapted to
film*.