At the moment, we use a simple model (feel free to change it!).
We calculate the first two "central sample moments" of the video data.
In plain language, this is the mean and the covariance.
Since all the statistics are normalized, this ends up being the same thing as the sample correlation, which might be more familiar.
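For reference, these are the standard definitions we have in mind (the exact normalization in the code may differ): for signals $$ x $$ and $$ y $$ with $$ N $$ samples each,

$$ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \operatorname{Cov}(x,y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}), \qquad \operatorname{Corr}(x,y) = \frac{\operatorname{Cov}(x,y)}{\sqrt{\operatorname{Cov}(x,x)\operatorname{Cov}(y,y)}}. $$

So if the signals are rescaled to have unit variance, covariance and correlation are the same number.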
Approximately 25 times per second we receive three matrices full of pixel information from the video camera. Depending on the camera, this could be 1 megapixel, or approximately 3 megabytes of data per frame.
This is too much data for such simple statistics, however, so we preprocess it in the following ways.
First, we downsample the video and chop it into a 64x64 pixel square.
Second, we convert the colorspace from Red/Green/Blue to Y/Cb/Cr
(because otherwise blue and green are too similar to each other,
and everything is too similar to brightness).
Approximately speaking, "Y" is "brightness", "Cb" is "blueness",
and "Cr" is "redness".
At this stage we have three 64x64 matrices: $$ Y, C^b, C^r. $$
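A minimal sketch of that preprocessing, assuming the frame arrives as an RGBA byte array (as from a canvas `getImageData` call) and using the common BT.601-style conversion coefficients; the app's actual downsampling method and coefficients may differ:

```javascript
// Convert one RGB pixel (each channel in [0, 1]) to Y/Cb/Cr.
// Y lands in [0, 1]; Cb and Cr are centred on 0, roughly in [-0.5, 0.5].
// Coefficients are the common BT.601-style ones; the app may use others.
function rgbToYCbCr(r, g, b) {
  const y  =  0.299 * r + 0.587 * g + 0.114 * b;
  const cb = -0.168736 * r - 0.331264 * g + 0.5 * b;
  const cr =  0.5 * r - 0.418688 * g - 0.081312 * b;
  return [y, cb, cr];
}

// Naively downsample an RGB frame to a 64x64 square by sampling one
// source pixel per target pixel (real code might average blocks instead).
// `frame` is assumed to be {width, height, data} with `data` an RGBA byte
// array as returned by canvas getImageData -- an assumption, not the API.
function toYCbCr64(frame) {
  const N = 64;
  const Y = [], Cb = [], Cr = [];
  for (let row = 0; row < N; row++) {
    for (let col = 0; col < N; col++) {
      const sx = Math.floor(col * frame.width / N);
      const sy = Math.floor(row * frame.height / N);
      const i = 4 * (sy * frame.width + sx);
      const [y, cb, cr] = rgbToYCbCr(
        frame.data[i] / 255, frame.data[i + 1] / 255, frame.data[i + 2] / 255);
      Y.push(y); Cb.push(cb); Cr.push(cr);
    }
  }
  return { Y, Cb, Cr };  // each is a flat array of 64*64 = 4096 values
}
```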
For each of these we calculate the mean: a Y-mean $$ \bar{Y}, $$ a Cb-mean $$ \bar{C}^b, $$ and a Cr-mean $$ \bar{C}^r. $$
These are the first moments.
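Spelled out for the Y channel (the Cb- and Cr-means are analogous), this is just the average over all 64x64 pixels:

$$ \bar{Y} = \frac{1}{64 \times 64} \sum_{u=1}^{64} \sum_{v=1}^{64} Y_{u,v}. $$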
Now for the second moments....
We add spatial coordinates to the pixels: a pixel on the left side of the screen has a $$ V $$ coordinate of 0, and on the right side, 64; on the bottom of the screen it has a $$ U $$ coordinate of 0, and on the top, 64. Now we "unpack" these matrices (the three colour channels plus the two coordinate grids) into a 5x4096 sample matrix $$ X, $$ with one row per channel and one column per pixel. We call the rows $$ \mathbf{y}, \mathbf{c}^b, \mathbf{c}^r, \mathbf{u}, \mathbf{v}. $$
The second moments are the sample covariances/correlations between the rows of this unpacked matrix $$ X. $$
The $$ \mathbf{u} $$ and $$ \mathbf{v} $$ rows aren't interesting on their own, because we just made them up, so we do not calculate $$ \operatorname{Cov}(\mathbf{u},\mathbf{u}), \operatorname{Cov}(\mathbf{u},\mathbf{v}), \operatorname{Cov}(\mathbf{v},\mathbf{v}). $$
But for the remaining pairs we can calculate covariances, e.g. $$ \operatorname{Cov}(\mathbf{y},\mathbf{u}), $$ the covariance between brightness and vertical position.
We can also calculate self-covariances, e.g. $$ \operatorname{Cov}(\mathbf{y},\mathbf{y}), $$ which is just the usual variance.
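Continuing the sketch above, the unpack-and-covariance step might look something like this (variable names and the covariance normalization are illustrative, not necessarily what the code does):

```javascript
// Sample mean and (population-normalized) sample covariance of flat arrays.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}
function cov(xs, ys) {
  const mx = mean(xs), my = mean(ys);
  let s = 0;
  for (let i = 0; i < xs.length; i++) s += (xs[i] - mx) * (ys[i] - my);
  return s / xs.length;
}

// Build the made-up coordinate rows u (vertical) and v (horizontal) to go
// with the 4096-element Y/Cb/Cr arrays from toYCbCr64, then take a few of
// the interesting covariances.
function secondMoments({ Y, Cb, Cr }) {
  const N = 64;
  const u = [], v = [];
  for (let row = 0; row < N; row++) {
    for (let col = 0; col < N; col++) {
      u.push(N - row);  // assuming row 0 is the top: bottom -> 0, top -> 64
      v.push(col);      // left of the screen -> 0, right -> 64 (roughly)
    }
  }
  return {
    yy: cov(Y, Y),    // plain variance of brightness
    ycb: cov(Y, Cb),
    ycr: cov(Y, Cr),
    yu: cov(Y, u),    // is the top of the frame brighter than the bottom?
    yv: cov(Y, v),    // is the right of the frame brighter than the left?
  };
}
```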
NB this part is constantly changing.
If you notice that I've flipped green and red, or up and down, then feel free to update the documentation ;-)
The problem now is that these values are arbitrary. Who cares what the mean brightness of a frame happens to be?
What MIDI note is that?
What tempo? How can we relate statistics to parameters?
The answer is that we choose a common language.
In Synestizer/LTC, all the numbers are transformed so that they range between -1 and +1. And all musical parameters expect a number from -1 to +1... Then we map -1 to, say, a low note like C2 and +1 to a high note like C4, and everything else in between to notes in between, e.g. 0 goes to C3 and 0.5 to G3 and so on.
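A sketch of a pitch mapping along those lines (assuming the convention where middle C is C4 = MIDI note 60; the real mapping may snap notes to a scale rather than raw semitones):

```javascript
// Map a signal value in [-1, +1] linearly onto MIDI notes between
// C2 (36) and C4 (60), rounding to the nearest semitone.
function signalToMidiNote(x, low = 36, high = 60) {
  const clamped = Math.max(-1, Math.min(1, x));  // keep inside [-1, +1]
  const t = (clamped + 1) / 2;                   // rescale to [0, 1]
  return Math.round(low + t * (high - low));
}

signalToMidiNote(-1);  // 36 -> C2
signalToMidiNote(0);   // 48 -> C3
signalToMidiNote(1);   // 60 -> C4
```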
We can also make combination parameters from the existing parameters. When you use the Patching
interface to create a MIDI CC output, or a combination signal... this is how it works. Let's say you have a combination signal $$ \alpha. $$
Then you can choose some scale parameters and some other inputs, and combine them as a weighted sum squashed by a sigmoid, for example
$$ \alpha = \tanh\left(\sum_i s_i x_i\right). $$
Here the $$ x_i $$ are the input signals and the $$ s_i $$ are the scale parameters.
Sigmoid functions 'squash' a possibly-infinite number range into the range from -1 to +1.
In Synestizer, these $$ s_i $$ parameters are represented as the slider positions in the patching interface.
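A sketch of how such a combination signal could be computed, assuming the weighted-sum-then-sigmoid form above (`inputs` and `sliders` are illustrative names, not the actual API):

```javascript
// Combine several input signals (each in [-1, +1]) into one output signal
// in (-1, +1), using slider positions as weights and tanh as the sigmoid.
function combinationSignal(inputs, sliders) {
  let sum = 0;
  for (let i = 0; i < inputs.length; i++) {
    sum += sliders[i] * inputs[i];
  }
  return Math.tanh(sum);  // squashes any real number into (-1, +1)
}

// e.g. mix mean brightness and one of the covariance features
combinationSignal([0.7, -0.2], [1.0, 0.5]);  // tanh(0.6) ≈ 0.54
```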
Ideas for other features we could extract instead of (or as well as) these simple moments:

- nearest-neighbours in color space
- image descriptors
- PCA
- Haar cascade
- random IIR filters
- cascaded IIR lowpass filters at successively lower resolution
    - You know what could lower the dimension further? Reporting only the n most extreme extrema of the filtered fields, and their coordinates.
    - Also arbitrary differences between layers.
    - Would the color coordinates interact at all?
    - Could even take squared-difference features to extract localised frequency.
- autocorrelation
- particle filters
- FFT features (or something else translation/phase-invariant?)
- inner products with desired eigen-features
- user interaction: the user chooses a few key scenes, and we try to measure distance from those scenes
- other clustering, say, spectral?
- neural networks?
    - We can train them online, e.g. with https://github.com/karpathy/recurrentjs
    - Or use high-performance pre-trained JS prediction models via neocortex.js? Note this would also get us features for free even if we ignore the model.
    - Can we do recursive NNs for realtime stuff this way?
    - Examples from keras: https://github.com/fchollet/keras/tree/master/examples
    - torch examples: http://www.di.ens.fr/willow/research/weakcnn/ https://hal.inria.fr/hal-01015140
- random forests