
Machine intelligence

Incomplete State-of-the-art

Reinforcement learning scenario (Wikipedia)

The term deep learning is often used in machine intelligence, and tends to refer to artificial neural networks with a large number of layers, each with a specific non-linear element or purpose. Examples include variational autoencoders (VAE), convolutional neural networks (CNN), long short-term memory networks (LSTM), recurrent neural networks (RNN), and generative adversarial networks (GAN).

IEEE Spectrum article[moore:_how_deep_learn_works] (Recommended)

Linear classifier

Problems

Probability density functions (pdf)

The probability that some random variable $x$ lies between $a$ and $b$ is

\[ P(a < x < b)=\int_a^b f(x)\,dx \]

where $f(x)$ is the probability density function. For any probability density function $P(-\infty < x <\infty)=1$, i.e. the event must occur at some point.
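This integral can be checked numerically. A minimal sketch for the standard normal distribution, assuming the Statistics Toolbox function pdf is available:

>> a=-1; b=1; % the probability of lying within one standard deviation of the mean
>> P=integral(@(x) pdf('norm',x,0,1),a,b) % numerical integral of the pdf, approx 0.6827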

Relation to a histogram

A probability density function is an idealisation of a histogram, a bar chart that gives the proportion of observations in each bin. For many experiments, as the bins get smaller and the number of samples grows, the normalised histogram looks more and more like the probability density function.

You can do this easily for the normal distribution with the following MATLAB code. (You could also try to find other data and compare its histogram with a distribution; for example, the interspike interval between two spikes in a spike train might follow a gamma distribution, while the spike count in a fixed window might follow a Poisson distribution.)

>> x=-4:.1:4;
>> data=(randn(1,10000)); % generate random numbers with zero mean std 1
>> histogram(data,10,'normalization','pdf') % plot on a histogram with 10 bins
>> hold on
>> histogram(data,100,'normalization','pdf') % overlay a finer histogram with 100 bins
>> l=plot(x,pdf('norm',x,0,1)); % compare with the normal distribution
>> l.LineWidth=4;
>> hold off
Example histogram of normally distributed data with 10 bins, 100 bins, and comparison with the probability density function.
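Following the suggestion above, the same comparison can be made for non-normal data. A sketch for gamma-distributed interspike intervals; the shape and scale parameters (2 and 0.05 s) are made-up values for illustration:

>> isi=gamrnd(2,0.05,1,10000); % hypothetical interspike intervals, shape 2, scale 0.05 s
>> histogram(isi,100,'normalization','pdf')
>> hold on
>> t=0:.001:.5;
>> l=plot(t,pdf('gamma',t,2,0.05)); % compare with the gamma pdf
>> l.LineWidth=4;
>> hold off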

Cumulative distribution function (cdf)

The cumulative distribution function is the probability that a random variable has a value less than a point $b$. That is

\[ P(-\infty < x < b)=\int_{-\infty}^b f(x)dx \]

It is sometimes easier to work with the cdf, as probabilities can be read directly from the graph. By definition, for small values of $b$ the cdf approaches 0 (the event is very unlikely) and for large values of $b$ the cdf approaches 1 (very likely).

The cdf also has the property that when differentiated with respect to the independent variable ($x$) it gives the pdf.
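A quick numerical check of this property for the standard normal distribution, assuming the Statistics Toolbox pdf and cdf functions:

>> x=-4:.1:4;
>> dFdx=gradient(cdf('norm',x,0,1),x); % numerical derivative of the cdf
>> plot(x,dFdx,x,pdf('norm',x,0,1),'--') % the two curves should coincide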

The Normal distribution as a classifier

A normal distribution is often a sensible default assumption for random measurement data: by the central limit theorem, the sum of many small independent effects tends towards a normal distribution.

Examples that might have a normal distribution: the heights of people (or of pets), as in the example below.

Counter examples: quantities that cannot be negative or are strongly skewed, such as the interspike intervals mentioned above.

For a normal distribution

\begin{equation} \mathcal{N}(\mu,\sigma^{2})= f(x\mid \mu ,\sigma ^{2})= {\frac {1}{\sqrt {2\pi \sigma ^{2}}}} e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}} \label{eq:pdfnormal} \end{equation}

where $\mu$ is the mean (sometimes called the expected value) and $\sigma$ is the standard deviation,

or in vector form

\[ \mathcal{N}(\vec{\mu},\vec\Sigma)= \frac1{\sqrt {(2\pi )^{k}|{\vec {\Sigma }}|}} e^{\left(-{\frac {1}{2}}({\vec {x} }-{\vec {\mu }})^{\mathrm {T} }{\vec {\Sigma }}^{-1}({\vec {x} }-{\vec {\mu }})\right)} \]

The matrix $\vec\Sigma$ is the symmetric covariance matrix and $k$ is the dimension of the space (the number of elements in $\vec\mu$). If the vector has only one element ($k=1$), then $\vec\Sigma$ reduces to the scalar variance $\sigma^2$.
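A minimal sketch evaluating the vector form directly and comparing it with the Statistics Toolbox function mvnpdf; the mean and covariance values here are made up for illustration:

>> mu=[1 2]; Sigma=[1 .5; .5 2]; k=2; % hypothetical mean and covariance
>> xv=[1.5 2.5]; % point at which to evaluate the density
>> exp(-(xv-mu)/Sigma*(xv-mu)'/2)/sqrt((2*pi)^k*det(Sigma)) % the formula above
>> mvnpdf(xv,mu,Sigma) % should give the same value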

The Greek symbols:

$\mu$: mu, the mean
$\sigma$: sigma, the standard deviation
$\Sigma$: Sigma, the covariance matrix
$P(a < x < b)$: the probability that an event that depends on $x$ lies between the values $a$ and $b$

Example (1D) Pets and people

How can we classify people and pets from their heights? Start by guessing means and standard deviations.

Figure 1: Algorithm to distinguish pets from people
>> muppl=1.700; sigppl=.400; % mean and standard deviation of the height of all people (a guess)
>> mupets=.300; sigpets=.100; % mean and standard deviation of the height of all pets (a guess)

First create some data (10 people and 10 pets)

>> people=randn(10,1)*sigppl+muppl; % start with a standard normal (mean 0, std 1) and scale to mean muppl and std sigppl
>> pets=randn(10,1)*sigpets+mupets; % ditto pets
>> ht=1; plot(pets,ht*ones(size(pets)),'x',people,ht*ones(size(people)),'o')
>> legend({'pets','people'}); xlabel('length (m)'); pause

How does this look as a probability density function (PDF)?

>> x=0:max(people)/100:max(people); % grid of heights from 0 to the tallest person
>> Npeople=exp(-(x-muppl).^2/(2*sigppl^2))/sqrt(2*pi*sigppl^2); % the normal pdf above, for people
>> Npets=exp(-(x-mupets).^2/(2*sigpets^2))/sqrt(2*pi*sigpets^2); % ditto for pets
>> plot(pets,ht*ones(size(pets)),'x',people,ht*ones(size(people)),'o',x,Npets,x,Npeople)
>> legend({'pets','people'}); xlabel('length (m)');ylabel('probability density');pause

Figure 1 above shows this result.

The probability is the integral of the PDF. We can plot the cumulative distribution function $\mathrm{cdf}(x)$ as the probability that the event has occurred somewhere between $-\infty$ and $x$

>> dx=x(2)-x(1); % grid spacing, needed so that cumsum approximates the integral
>> plot(pets,ht*ones(size(pets)),'x',people,ht*ones(size(people)),'o',x,cumsum(Npets)*dx,...
>> x,cumsum(Npeople)*dx)
>> xlabel('length (m)'); ylabel('probability pet/person is less than height x');pause
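These two distributions can be turned into a classifier by assigning each height to whichever pdf is larger. A sketch that finds the decision boundary, the height at which the two densities cross:

>> [~,cls]=max([Npets;Npeople]); % for each height, which density is larger (1=pet, 2=person)
>> i=find(diff(cls),1); % index where the more likely class changes
>> boundary=(x(i)+x(i+1))/2 % classify as pet below this height, person above

This maximum-likelihood rule implicitly assumes pets and people are equally likely a priori; unequal prior probabilities would shift the boundary.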

Example (2D) Iris species

Using the Fisher Iris data set[Fisher1936] on the petal and sepal widths and lengths for three species of iris, setosa, versicolor and virginica.

Can we identify an algorithm to classify the species based on measurements of two parameters?

The meas columns are sepal length, sepal width, petal length and petal width. The species are stored in blocks of 50 rows, in the order

>> setosa=1:50;versi=51:100;virg=101:150; % row indices of each species
>> T = readtable('fisheriris.csv','format','%f%f%f%f%C');
>> meas=T{:,1:4};
>> plot(meas(setosa,3),meas(setosa,4),'x',meas(versi,3),meas(versi,4),'o',meas(virg,3),...
>> meas(virg,4),'d')
>> legend({'setosa','versicolor','virginica'},'Location','southeast')
>> title('petal length vs petal width');xlabel('petal length (cm)');
>> ylabel('petal width (cm)'); pause

A linear classifier needs the class mean of each group; the means alone are sufficient to attempt a classification. While we are calculating the means we can also calculate the covariances.

>> muset=mean(meas(setosa,:)); covset=cov(meas(setosa,:));
>> muversi=mean(meas(versi,:)); covversi=cov(meas(versi,:));
>> muvirg=mean(meas(virg,:)); covvirg=cov(meas(virg,:));
>> muall=[muversi;muset;muvirg];

We can use the Voronoi method to work out the boundaries based on the means (nearest-mean classification, a simple form of linear discriminant analysis)

>> plot(meas(setosa,3),meas(setosa,4),'x',meas(versi,3),meas(versi,4),'o',meas(virg,3),meas(virg,4),'d')
>> hold on; voronoi(muall(:,3),muall(:,4)); hold off; pause
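The Voronoi cells are exactly the regions assigned by a nearest-mean classifier. A sketch that classifies every flower by its nearest class mean in the petal-length/petal-width plane and counts the misclassifications (pdist2 is a Statistics Toolbox function):

>> names={'versicolor';'setosa';'virginica'}; % row order of muall
>> [~,idx]=min(pdist2(meas(:,3:4),muall(:,3:4)),[],2); % nearest class mean for each flower
>> sum(~strcmp(names(idx),cellstr(T{:,5}))) % number of misclassified flowers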

We can see how well it does on other pairs, e.g. sepal length and petal length

>> plot(meas(setosa,1),meas(setosa,3),'x',meas(versi,1),meas(versi,3),'o',meas(virg,1),meas(virg,3),'d')
>> hold on; voronoi(muall(:,1),muall(:,3)); hold off;
>> legend({'setosa','versicolor','virginica'})
>> xlabel('sepal length (cm)'); ylabel('petal length (cm)');pause

Figure 2 below shows the results.

Figure 2: Best linear boundaries between classes

It is probably easiest to use libraries and programs such as MATLAB to extend this to more than two dimensions. MATLAB has a linear classifier fitcdiscr that does much of the work needed.

>> d = fitcdiscr(T,'Species'); % fit the discriminator
>> [cm,order]=confusionmat(T.Species,predict(d,T)) % predict from the original data and compute a confusion matrix
Confusion matrix for a linear classifier applied to Fisher Iris data
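The confusion matrix above is computed on the training data, which flatters the classifier. A sketch of a fairer estimate using 10-fold cross-validation (crossval and kfoldLoss are Statistics Toolbox functions):

>> cvd=crossval(d); % 10-fold cross-validated discriminant
>> kfoldLoss(cvd) % estimated out-of-sample misclassification rate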

Decision trees

Training is done by computing the reduction in entropy (the information gain) for each candidate rule and choosing the rule with the highest information gain. The process is then repeated at the next level of the tree, until no new rule gives any further information gain.
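A minimal sketch of the information-gain calculation for a single candidate rule on the iris data; the rule PL < 2.45 is the root split found by the fitted tree below:

>> load fisheriris
>> p=@(y) nonzeros(countcats(categorical(y)))/numel(y); % class proportions
>> H=@(y) -sum(p(y).*log2(p(y))); % entropy in bits
>> L=meas(:,3)<2.45; % candidate rule: petal length < 2.45
>> IG=H(species)-(mean(L)*H(species(L))+mean(~L)*H(species(~L))) % information gain, approx 0.92 bits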

MATLAB has a function to fit a classification tree called fitctree. This can again be illustrated using the Fisher iris data set.

>> load fisheriris
>> t4 = fitctree(meas, species,'PredictorNames',{'SL' 'SW' 'PL' 'PW'});
>> [cm4,order4]=confusionmat(species,predict(t4,meas)) % confusion matrix on the training data
>> m=confusionchart(cm4,order4) % display the confusion matrix
>> view(t4,'mode','graph') % draw the fitted tree

Using all measurements as predictors we get the following rules and results.

  Decision tree for classification
      1  if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
      2  class = setosa
      3  if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
      4  if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
      5  class = virginica
      6  if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
      7  class = virginica
      8  class = versicolor
      9  class = virginica

This can be drawn as a tree diagram.

Confusion matrix and Decision tree for Fisher Iris data predicting from petal length (PL) and petal width (PW)
Confusion matrix and decision trees for Fisher Iris predicting from sepal length (SL) and sepal width (SW)

Potential issues with decision trees include overfitting to the training data and instability: a small change in the data can produce a very different tree.
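One way to probe overfitting is to compare the cross-validated loss of the full tree with a deliberately restricted one; MaxNumSplits is a standard fitctree option, and the depth limit here is an arbitrary choice for illustration:

>> t5=fitctree(meas,species,'MaxNumSplits',2); % a deliberately shallow tree
>> rng(1); kfoldLoss(crossval(t4)) % 10-fold loss of the full tree
>> rng(1); kfoldLoss(crossval(t5)) % loss of the shallow tree, for comparison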