In this project, students were given the opportunity to apply computer vision to problems that are both real-world and highly relevant. Face recognition is a common technology in modern smartphones and in laptops with webcams. These systems extract visual keypoints on a person's face and match those features against an existing representation of the user's face. In state-of-the-art applications, these systems are driven by deep computer vision. In this project, we tackled two related but simpler vision problems and deployed our algorithms and networks on a Raspberry Pi with a camera.
Face Detection
In the first part of this project, I used PCA on a large dataset of people's faces to measure the L2-distance between a sample image and its projection into the PCA-space spanned by the so-called eigenfaces. Eigenfaces are just the eigenvectors that span the PCA-space. By thresholding the distance between a sample and its projection onto the eigenfaces, we can create a classifier for whether or not there is a face present in the picture. A smaller L2-distance means that whatever pattern is in the image has features that resemble a human face, while a larger L2-distance indicates that whatever is in the image does not have much similarity to a human face. Shown below is a sample video of my detector deployed on a Raspberry Pi camera:
The first step, as with any ML/CV project, is preprocessing. The dataset prepared for students contains 2000 images of different human faces at various scales, all of which are 64x64 grayscale images. I vectorize all of the images in the dataset and then compute the average for each pixel over all vectorized faces, $\mu$. To make the data 0-centered, I subtract the average face vector from all samples. Then, we can use our 0-centered data matrix to compute the covariance matrix of our data, with the following formula:
$$ Cov(X,X) = E[(X - \mu)(X - \mu)^T] $$
Estimating the expectation with an average over the dataset, and dropping the constant scale factor (which rescales the eigenvalues but leaves the eigenvectors unchanged), our expression is simply:
$$ Cov(X,X) = (X - \mu)(X - \mu)^T $$
By computing the eigenvectors of this matrix, we obtain a set of basis vectors that describe our PCA-space. Since the data matrix is of shape (2000, 4096), the product above is a 2000x2000 matrix, so the PCA-space spanned by its eigenvectors has at most 2000 dimensions. Eigenfaces are then generated by multiplying the transpose of our 0-centered data matrix by the eigenvectors, which maps them back into the 4096-dimensional pixel space. Eigenfaces with larger eigenvalues capture more variance than those with smaller eigenvalues, so we only really need the eigenfaces with the highest eigenvalues in order to correctly identify whether there is a face present in an image. Using the eigenface matrix, we can now compute the projection of an input image onto the span of the eigenbasis. This is done by:
$$ \text{eigenfaces} = ((X - \mu)^T \cdot \text{eigenvectors})^T $$
$$ \omega = \text{eigenfaces} \cdot (v - \mu)^T $$
$$ p = \omega^T \cdot \text{eigenfaces} $$
where $v$ is the vectorized sample image, $\omega$ holds the weights of the sample along each eigenface, and $p$ is the vector projection of the sample image in the PCA-space. Finally, we measure the L2 distance between the PCA-space projected vector $p$ and the 0-centered sample $v - \mu$.
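The steps above can be sketched in NumPy roughly as follows. This is a minimal illustration and not the project's actual code: the function names `compute_eigenfaces` and `projection_distance` are hypothetical, and rescaling each eigenface to unit length is an assumption added here so that $\omega$ and $p$ form an orthogonal projection.

```python
import numpy as np


def compute_eigenfaces(X, num_components):
    """Return (mu, eigenfaces) for X of shape (n_samples, n_pixels)."""
    X = np.asarray(X, dtype=np.float64)
    mu = X.mean(axis=0)                      # average face
    A = X - mu                               # 0-centered data matrix

    # (X - mu)(X - mu)^T: symmetric, shape (n_samples, n_samples)
    small = A @ A.T
    # eigh handles symmetric matrices; eigenvalues are returned ascending
    eigenvalues, eigenvectors = np.linalg.eigh(small)

    # Keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:num_components]
    eigenvectors = eigenvectors[:, order]

    # eigenfaces = ((X - mu)^T . eigenvectors)^T, shape (k, n_pixels)
    eigenfaces = (A.T @ eigenvectors).T

    # Assumption (not stated in the text): rescale each eigenface to unit length
    eigenfaces /= np.linalg.norm(eigenfaces, axis=1, keepdims=True)
    return mu, eigenfaces


def projection_distance(v, mu, eigenfaces):
    """L2 distance between the 0-centered sample and its PCA projection."""
    centered = np.asarray(v, dtype=np.float64) - mu
    omega = eigenfaces @ centered            # weight along each eigenface
    p = omega @ eigenfaces                   # projection back in pixel space
    return float(np.linalg.norm(p - centered))
```

With the dataset described above, `X` would have shape (2000, 4096). Keeping `num_components` well below the number of samples avoids dividing by the near-zero norms that come with near-zero eigenvalues.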
Next comes the implementation on the Raspberry Pi. After SSH-ing into the Raspberry Pi and connecting to a server to stream camera data, we can process the frames using the functions above and OpenCV. OpenCV comes in handy for reading the streamed video frames from the Raspberry Pi, but most of the processing is done by the functions we wrote above. First, we constrain ourselves to detecting within a small rectangular region of the frame, shown visually by the square in the above video. We also limit the number of eigenfaces used to compute the projection distance for the image in the rectangular region. Tuning the number of eigenfaces used for measurement, as well as the L2-distance threshold, is left to students. I found that using 20 eigenfaces and an L2-threshold of 47000 worked well, which is shown in the video above.
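A per-frame check along these lines might look like the sketch below. It is only an illustration: `detect_face` and its parameters are hypothetical names, and the eigenfaces are assumed to be unit-length rows sorted by decreasing eigenvalue. Reading frames with OpenCV and resizing the window to 64x64 are omitted, and the defaults of 20 eigenfaces and a threshold of 47000 are simply the values reported above.

```python
import numpy as np


def detect_face(frame_gray, mu, eigenfaces, top_left=(0, 0), size=64,
                num_eigenfaces=20, threshold=47000.0):
    """Return (is_face, distance) for a size x size window of a grayscale frame."""
    row, col = top_left
    window = frame_gray[row:row + size, col:col + size]
    if window.shape != (size, size):
        raise ValueError("window falls outside the frame")

    # Vectorize the window and subtract the average face
    centered = window.astype(np.float64).ravel() - mu

    # Use only the leading eigenfaces (rows assumed sorted by eigenvalue)
    basis = eigenfaces[:num_eigenfaces]
    omega = basis @ centered
    p = omega @ basis

    # Small distance: the window is well explained by the face subspace
    distance = float(np.linalg.norm(p - centered))
    return distance < threshold, distance
```

The threshold depends on the pixel scale and on how many eigenfaces are kept, so it would need retuning if either changes.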
A video is all well and good when it looks nice, but there are definitely some caveats to this method. The first limitation is that we can only really hope to accurately detect faces in a region that is about the size of our training images (eigenfaces). The PCA face detector thus only works well when the face is about 5 feet away and centered in the frame. Additionally, the background needs to provide contrast to whatever face is in the frame. Training performance may not reflect generalization performance, especially in the realm of real-time image processing and potentially noisy sensor data. Furthermore, balancing real-time processing speed with quality of prediction is one of the cruxes of modern computer vision. One way to approach this performance vs. speed tradeoff is to use neural networks with a low-cost camera. A low-cost camera may introduce the problem of noisy sensor data, but one of the defining characteristics of modern deep computer vision is that it works well in the presence of noise by extracting features at multiple levels, as a CNN does. In the next part, we take on the challenge of deploying such a CNN for a slightly different task.
Mask Detection
In this section, the goal is to visually determine whether or not a person's mask is on. For this part, we had access to another dataset with around 2400 images of human faces with masks on and around 2200 images of human faces without masks on. Unlike in the first part, we preprocess this dataset to be passed into supervised learning algorithms: we convert all grayscale images into RGB images, scale them to 128x128x3, and normalize pixel values from [0, 255] to [-1, 1]. To better understand this problem, it is good practice to start with a simple algorithm like the perceptron, a simple linear classification algorithm. I use the sklearn perceptron classifier, training the model on a similarly-distributed 80% subset of the dataset and evaluating its performance on the held-out 20% subset. After flattening the images to vectors, the vanilla sklearn perceptron can achieve about 82% training accuracy and 75% validation accuracy. This result helps to confirm that our dataset is well prepared, but the performance itself is nothing special.
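As a rough sketch of this baseline, the snippet below normalizes images and fits scikit-learn's `Perceptron` on an 80/20 stratified split. The helper names `to_rgb_normalized` and `train_perceptron` are hypothetical, resizing to 128x128 is assumed to have happened already, and using `stratify` is one possible reading of a "similarly-distributed" split.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split


def to_rgb_normalized(image):
    """Convert a uint8 image (H, W) or (H, W, 3) to float RGB in [-1, 1]."""
    image = np.asarray(image)
    if image.ndim == 2:
        # Grayscale: repeat the single channel three times
        image = np.stack([image] * 3, axis=-1)
    # Map [0, 255] to [-1, 1]
    return image.astype(np.float32) / 127.5 - 1.0


def train_perceptron(images, labels, seed=0):
    """Fit a Perceptron on flattened images; return (model, train_acc, val_acc)."""
    X = np.stack([to_rgb_normalized(img).ravel() for img in images])
    y = np.asarray(labels)

    # stratify=y keeps the class balance similar in both splits
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)

    model = Perceptron(random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_train, y_train), model.score(X_val, y_val)
```

Here `labels` would be 0 for no-mask and 1 for mask, or the reverse; the encoding is a free choice and does not affect the accuracy.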
Now, we attack this problem with deep learning using Keras and TensorFlow. Keras is a very organized deep learning API with high-speed prototyping potential, and is built on Google's ML platform TensorFlow. Coming from a background mostly in PyTorch, I found Keras extremely easy to start up and learn quickly. Honestly, I feel like Keras almost makes things TOO easy, and I think controlling more intricate settings might require reaching back into the seemingly infinite TensorFlow documentation. Anyway, making a simple CNN was easy enough. For feature extraction, I use two convolutional layers, each followed by a max-pooling operation. For the classification head, I use two fully-connected layers, the first with a ReLU activation function and the second with a softmax activation function applied to the raw scores for the two classes. With this architecture, we train using mini-batch SGD and cross-entropy loss. Trained for 10 epochs, this architecture achieves about 91% training accuracy and 88% validation accuracy. This is great performance, but what if we want better? What if we want to say that our model correctly identifies mask and no-mask faces with 95% accuracy?
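A network of the shape described above could be written in Keras roughly as follows. This is a sketch under assumptions, not the original model: the filter counts, kernel size, hidden width, and learning rate are illustrative guesses, and `build_simple_cnn` is a hypothetical name.

```python
import tensorflow as tf


def build_simple_cnn(input_shape=(128, 128, 3), num_classes=2):
    """Two conv + max-pool stages followed by a two-layer classifier head."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        # Feature extractor (filter counts and kernel size are assumptions)
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        # Classification head: ReLU layer, then softmax over the two classes
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        # Integer labels (0 or 1) with softmax outputs
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training would then be a call such as `model.fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val))`, where the array names are placeholders for the preprocessed splits.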
One way to improve performance could be to fine-tune our own architecture and training scheme. We could try adding a few layers to our feature extractor or classifier, choosing a different optimizer, or changing hyperparameters like batch size, learning rate, momentum, etc. Choosing the right hyperparameters can give small but compounding improvements to model performance. For effective sweeping of hyperparameters, look to tools like Weights and Biases Sweeps, which can integrate with distributed training across multiple GPUs. In this project, I don't think hyperparameter tuning would be enough to improve the performance of this model without substantially changing the network architecture. As such, I opt to utilize some of the power of the built-in models in TensorFlow. I choose the MobileNetV2 architecture because it is built for deployment in mobile and embedded applications, which is exactly our end goal here. I also opt to use only the pretrained backbone from the TensorFlow MobileNetV2 implementation, on top of which I add a 2-layer fully connected network for the binary classifier. This model can achieve real-time performance, which is demonstrated in the video below, while also achieving 97% training accuracy and 95% validation accuracy. Shown below is a video of a trained mask detection model deployed on frames streamed over Wi-Fi from the Raspberry Pi and processed in OpenCV.
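A transfer-learning setup like the one described can be sketched with `tf.keras.applications.MobileNetV2`. Several details here are assumptions, since the text does not specify them: freezing the backbone, the global average pooling layer, the hidden width of 64, and the Adam optimizer. The name `build_mobilenet_classifier` is hypothetical.

```python
import tensorflow as tf


def build_mobilenet_classifier(input_shape=(128, 128, 3), weights="imagenet",
                               num_classes=2):
    """MobileNetV2 backbone with a small 2-layer classifier on top."""
    # include_top=False drops the ImageNet classifier and keeps the backbone
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights)
    # Assumption: keep the pretrained features fixed and train only the head
    backbone.trainable = False

    model = tf.keras.Sequential([
        backbone,
        # Assumption: pool the feature map to a vector before the dense layers
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

MobileNetV2 expects inputs scaled to [-1, 1], which matches the normalization used for this dataset. Passing `weights=None` builds the same architecture without downloading the pretrained weights, which is handy for a quick shape check.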