Analysing cinema is a time-consuming process. In the cinematography domain alone, there are many factors to consider, such as shot scale, shot composition, camera movement, color, and lighting. Whatever you shoot is in some way influenced by what you've watched. There's only so much one can watch, and even less that one can analyse thoroughly.
This is where neural networks offer ample promise. They can recognise patterns in images in ways that weren't possible until less than a decade ago, which can dramatically speed up the analysis of cinema. I've developed a neural network that focuses on one fundamental element of visual grammar: shot types. It's capable of recognising 6 unique shot types, and is ~91% accurate. The pretrained model, validation dataset (the set of images used to determine its accuracy), code used to train the network, and some more code to classify your own images are freely available.
What is Visual Language, and Why Does it Matter?
When you're writing something (an email, an essay, a report, a paper, etc.), you're using the rules of grammar to put forth your point. Your choice of words, the way you construct the sentence, correct use of punctuation, and most importantly, what you have to say, all contribute towards the effectiveness of your message.
Cinema is about how ideas and emotions are expressed through a visual form. It's a visual language, and just like any written language, your choice of words (what you put in the shot/frame), the way you construct the sentence (the sequence of shots), correct use of punctuation (editing & continuity) and what you have to say (the story) are key factors in creating effective cinema. The comparison doesn't apply rigidly, but it's a good starting point for thinking about cinema as a language.
The most basic element of this language is a shot. There are many factors to consider while filming a shot: how big should the subject be? Should the camera be placed above or below the subject? How long should the shot be? Should the camera remain still or move with the subject? If it's moving, how should it move: should it follow the subject, or observe it from a fixed point while turning right/left or up/down? Should the movement be smooth or jerky? There are other major visual factors, such as color and lighting, but we'll restrict our scope to these factors only. A filmmaker chooses how to construct each shot based on what they want to convey, and then juxtaposes the shots effectively to drive home the message.
Neural Networks 101
'AI' is most often a buzzword for deep learning, the field that uses neural networks to learn from data.
The key idea is that instead of explicitly specifying patterns to look for, you specify the rules by which the neural network autonomously detects patterns in data. The data could be something structured, like a database of customers' purchasing decisions, or something unstructured, like images, audio clips, medical scans, or video. Neural networks are good at tasks like predicting which products a customer wants, telling an image of a dog from one of a cat, distinguishing the mating calls of dolphins from those of whales, telling a video of a goal being scored from one of the goalkeeper saving the day, or deciding whether a tumor is benign or malignant.
With a large enough labelled dataset (say 1000 images of dogs and cats stored separately), you could use a neural network to learn patterns from these images. The network puts the image through a pile of computation, and spits out two probabilities: P(cat) and P(dog).
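To make those two probabilities a little more concrete, here's a minimal sketch of how raw scores are commonly turned into P(cat) and P(dog) using a softmax. The raw scores below are made up for illustration, and softmax is an assumption on my part about the final step; it's a very common choice, not necessarily what any particular network uses.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the largest score first for numerical stability.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores the network might produce for one image.
raw_scores = {"cat": 2.0, "dog": 0.5}
probs = dict(zip(raw_scores, softmax(list(raw_scores.values()))))
print(probs)
```

The higher raw score ends up with the higher probability, and the two probabilities always add up to 1.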
You calculate how wrong the network was using a loss function, then use calculus (chain rule) to tweak this pile of computation to produce a lower loss (a more correct output). Neural networks are nothing but a sophisticated mechanism of optimising this function.
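As a rough illustration of "how wrong the network was", here's a sketch of one widely used loss function, cross-entropy. I'm assuming cross-entropy purely as an example, and the probabilities are invented; the point is only that a confident wrong answer is punished far more than a confident right one.

```python
import math

def cross_entropy(predicted_probs, true_label):
    """Loss is small when the true label gets a high probability."""
    return -math.log(predicted_probs[true_label])

# Two hypothetical outputs for an image that really is a cat.
confident_and_right = {"cat": 0.9, "dog": 0.1}
confident_and_wrong = {"cat": 0.1, "dog": 0.9}

loss_right = cross_entropy(confident_and_right, "cat")
loss_wrong = cross_entropy(confident_and_wrong, "cat")
print(loss_right, loss_wrong)
```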
If the network's output is far off from the truth, the loss is larger, and so the tweak made is also larger. Tweaks that are too large are bad, so you multiply the tweaking factor by a tiny number known as the learning rate.
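Here's a toy sketch of that tweak-and-scale loop, shrunk down to a "network" with a single weight. The model, the squared-error loss, the training example and the learning rate of 0.05 are all assumptions chosen to keep the arithmetic readable; real networks do the same thing across millions of weights.

```python
def loss(w, x, y):
    """Squared error of a one-weight model that predicts w * x."""
    return (w * x - y) ** 2

def gradient(w, x, y):
    """Derivative of the loss with respect to w, via the chain rule."""
    return 2 * (w * x - y) * x

x, y = 2.0, 6.0          # one training example; the ideal weight is 3.0
learning_rate = 0.05     # the tiny number that scales each tweak
w = 0.0                  # start from a deliberately bad weight

history = []
for step in range(20):
    history.append(loss(w, x, y))
    # A larger error gives a larger gradient, hence a larger tweak.
    w -= learning_rate * gradient(w, x, y)

print(w, history[0], history[-1])
```

The further `w` is from the ideal value, the larger the gradient and the bigger the step; as the loss shrinks, so do the steps.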
One pass through the entire dataset is known as an epoch. You'd probably run through many epochs to reach a good solution. It's a good idea to tweak the images non-invasively on each pass (such as flipping them horizontally), so that the network sees different numbers for the same image and can more robustly detect patterns. This is known as data augmentation.
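A horizontal flip is simple enough to sketch by hand. The example below treats an image as a list of pixel rows and randomly flips it on each epoch; the tiny 2x3 "frame" and the 50% flip chance are assumptions for illustration, and image libraries offer ready-made versions of this.

```python
import random

def flip_horizontal(image):
    """Mirror an image left to right; each row is a list of pixel values."""
    return [list(reversed(row)) for row in image]

def augment(image, rng):
    """Flip the image half of the time, otherwise return a copy."""
    if rng.random() < 0.5:
        return flip_horizontal(image)
    return [list(row) for row in image]

# A tiny 2x3 grayscale "image" standing in for a real film frame.
frame = [
    [0, 50, 100],
    [150, 200, 255],
]

rng = random.Random(0)  # fixed seed so the run is repeatable
for epoch in range(3):
    seen = augment(frame, rng)  # what the network would see this epoch
    print(epoch, seen)
```

The content of the frame is unchanged, but the numbers arrive in a different order, which is exactly what nudges the network towards patterns rather than memorised pixels.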
Neural networks can transfer knowledge from one project to another. It's very common to take a network that's been trained with 14 million images of a thousand common objects (ImageNet), and then tweak it to adapt to your project. It works because it has already learnt basic visual concepts like curves, edges, textures, eyes, etc., which come in handy for any visual task. This process is known as transfer learning.
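The mechanics of that idea can be sketched in miniature: keep the already-learnt part frozen, and train only a small new piece on top. Everything below is a toy stand-in. The two "pretrained" filters are hand-written edge detectors rather than real ImageNet weights, the 2x2 images and labels are invented, and in practice you'd load a pretrained model from a deep learning library instead.

```python
import math

# Stand-in for a pretrained network: these weights are fixed and never updated.
FROZEN_FILTERS = [
    [1.0, -1.0, 1.0, -1.0],   # responds to left/right contrast (a vertical edge)
    [1.0, 1.0, -1.0, -1.0],   # responds to top/bottom contrast (a horizontal edge)
]

def extract_features(pixels):
    """Apply the frozen filters to a flattened 2x2 image; keep response strength."""
    return [abs(sum(w * p for w, p in zip(f, pixels))) for f in FROZEN_FILTERS]

def predict(head, features):
    """New trainable head: logistic regression on the frozen features."""
    score = sum(w * f for w, f in zip(head, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical toy task: label 1 = vertical edge, label 0 = horizontal edge.
# Pixels are ordered top-left, top-right, bottom-left, bottom-right.
data = [
    ([1.0, 0.0, 1.0, 0.0], 1),
    ([0.0, 1.0, 0.0, 1.0], 1),
    ([1.0, 1.0, 0.0, 0.0], 0),
    ([0.0, 0.0, 1.0, 1.0], 0),
]

head = [0.0, 0.0]        # only these two weights are trained
learning_rate = 0.5
for epoch in range(50):
    for pixels, label in data:
        features = extract_features(pixels)
        error = predict(head, features) - label
        # Gradient of the cross-entropy loss with respect to each head weight.
        head = [w - learning_rate * error * f for w, f in zip(head, features)]

print(head)
```

Because the filters already pick out edges, the new head only has to learn how to weigh them, which is why adapting a pretrained network usually needs far less data and time than training from scratch.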
Rinse and repeat this process carefully, and you have in your hands an 'AI' solution to your problem.*
If that piqued your interest, I suggest you watch this (~19 mins) for a fairly detailed explanation of how a neural network works. If you're bursting with excitement, follow through with this course.
Neural networks burst into popularity in the past decade with the development of large datasets and the ability to leverage GPUs (graphics cards) for the heavy computation demanded by neural nets.
*Frederic Brodbeck's Cinemetrics project does something similar, and is worth looking at.
Barry Salt's Database also does something similar, but at a higher level (summary statistics for the entire movie); a filmmaker would find individual shot analysis more useful.
Source: https://rsomani95.github.io/ai-film-1.html
Primitive Source: https://towardsdatascience.com/a-i-for-filmmaking-f2a2197020aa