Fri Sep 20 06:16:51 PM IST 2024
These are two python scripts, one that takes a livestream and one that takes a video file as input. The main difference between them (other than turning on your camera etc.) is that in livestream mode the mediapipe detector will drop (i.e. ignore) frames if it can't process them fast enough, while in video mode it will just slow down. I compared this against calling the C++ code directly and the C++ version is at least 5x faster, so poor performance in these python scripts won't necessarily carry over to the real thing; don't worry about it too much. At this stage we're just checking for any small problems in our approach, and the accuracy of the model. We don't have the object detection model yet, so we'll only be checking for the hand going over the mouth.
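Roughly, the difference between the two running modes in the tasks API looks like the sketch below (hand landmarker used as the example; the model path is a placeholder and option names should be checked against the scripts/docs):

    from mediapipe.tasks import python as mp_python
    from mediapipe.tasks.python import vision

    # Placeholder path; point this at wherever you put the .task file.
    base = mp_python.BaseOptions(model_asset_path="hand_landmarker.task")

    # VIDEO mode: you call detect_for_video() yourself for every frame, so
    # nothing is dropped -- processing just takes as long as it takes.
    video_options = vision.HandLandmarkerOptions(
        base_options=base,
        running_mode=vision.RunningMode.VIDEO,
    )

    # LIVE_STREAM mode: you hand frames over with detect_async() and results
    # come back through a callback; if frames arrive faster than they can be
    # processed, mediapipe skips some of them.
    def on_result(result, output_image, timestamp_ms):
        print(timestamp_ms, len(result.hand_landmarks))

    live_options = vision.HandLandmarkerOptions(
        base_options=base,
        running_mode=vision.RunningMode.LIVE_STREAM,
        result_callback=on_result,
    )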
The models generate 478 face landmarks and 21 hand landmarks. If necessary, it should be possible to use a cheaper model to detect the face without the detailed landmarks and only switch to the full landmarker when the hand gets close. We can look into that later when we start thinking more about performance.
The tasks that should be done are:
- Get the code running on your computer. You'll need to download the ".task" files from the documentation (really they're just zip files), then change the model_path variables at the beginning of each script to point to wherever you've placed them. Then, to run the video python script, do "./video.py ".
- Make a function that checks whether the hand is in front of the lips and see how accurate it is. I think you can use opencv's pointPolygonTest (https://stackoverflow.com/questions/13786088/determine-if-a-point-is-inside-or-outside-of-a-shape-with-opencv) and check whether the finger landmarks are inside the polygon made from the lip landmarks. Unfortunately I didn't find a list of which landmarks form the border of the lips. There are labelled images (https://storage.googleapis.com/mediapipe-assets/documentation/mediapipe_face_landmark_fullsize.png and https://i.sstatic.net/T1ypF.jpg), so you can make a list of those 20 or so points by hand from them. I think there's a list somewhere in the source code where they draw a line around the lips, so if it's too difficult I can find that. There's a rough sketch of this check after the list.
- Think about how you would detect gestures like pinching. One way to detect a pinch is simply to divide the distance between the tips of the index finger and thumb by the length of the index finger or thumb, and if that ratio is small enough, consider it a pinch. I don't know how accurate the predicted z-coordinates of the hand landmarks are, though, so that might be a problem when the hand is nearly perpendicular to the camera's plane. Maybe you could average some other lengths, or take a maximum, to get some measure of the size of the hand. There is a labelled image for the hand landmarks (https://ai.google.dev/static/edge/mediapipe/images/solutions/hand-landmarks.png). Mediapipe has its own gesture recognition solution that adds a small ML model after the hand landmarker. We can try that, or make an ML model outside mediapipe if that's easier, if the manual approach doesn't work. So I'd like you to try a manual function like I described (sketched after the list) and see how well that works.
- Similarly, there should be a way to detect whether the mouth is open. Mediapipe has "blendshapes" for this, but it might be quicker to disable computing them and do it manually. To be decided depending on how well the manual approach works, as above (there's a sketch of a manual check below as well).
- Consider how to use background subtraction to detect the pill falling. We don't have a video of a pill falling right now. The idea is simply to subtract the current frame from the previous one and then search for large changes. Since a falling pill moves quite quickly, the difference should be easy to spot. So read about that and see if you can think of a better method (a frame-differencing sketch is below).
Currently we only have the one video from the whatsapp group (I'll send it again there). If you want more footage, you can buy skittles for 10 inr at shetty's and just record yourself on your phone. We don't need that many videos yet, and it's a very easy process: you get 9 videos per pack and each one should only take roughly a minute. If you don't need more videos, you don't need to take them.
The documentation for mediapipe tasks is at https://ai.google.dev/edge/mediapipe/solutions/guide.
There are also some parameters you can play with for how confident the model needs to be before it marks a hand/face as detected, and for whether it tracks across frames or re-detects from scratch each frame. You can read the documentation for those and change them in the options part of the scripts, or just ask me. A lower confidence threshold means the model is more likely to produce false positives. The relevant option names are sketched below.
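For reference, these are the options I mean for the hand landmarker (the face landmarker has analogous min_face_* parameters plus an output_face_blendshapes flag). The values shown are roughly the defaults and the model path is a placeholder.

    from mediapipe.tasks import python as mp_python
    from mediapipe.tasks.python import vision

    options = vision.HandLandmarkerOptions(
        base_options=mp_python.BaseOptions(model_asset_path="hand_landmarker.task"),
        running_mode=vision.RunningMode.VIDEO,
        num_hands=2,
        # How confident the palm detector must be before it reports a hand at all.
        min_hand_detection_confidence=0.5,
        # How confident the landmark model must be that the hand is still in the
        # tracked region; below this it falls back to palm detection next frame.
        min_hand_presence_confidence=0.5,
        # How good tracking has to be before re-detection kicks in.
        min_tracking_confidence=0.5,
    )
    landmarker = vision.HandLandmarker.create_from_options(options)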