DIY AI for Home Automation - A Roadmap

I was reading the news this morning when I came across a trending article publicizing Mark Zuckerberg's (creator of Facebook) newfound goal and personal challenge: to create an AI butler. I found this rather intriguing, because it is an idea that I've been drawn to for some time. In fact, I've already done some experimentation with the concept, and this news story has renewed my interest. In this article, I'm going to explore some of my own thinking on the problem, describe what I've already done, and lay out a sort of roadmap for how I might like to tackle this challenge moving forward.

Two thoughts came to mind when I first read the headline "Mark Zuckerberg to build AI to help at home and work". First: there's been a lot of discourse and uncertainty about the potential (both positive and negative) of AI, and a lot of great minds (like Stephen Hawking and Elon Musk) have already voiced their opinions on the subject. Second: it seems unlikely that Zuckerberg is going to build a full-blown AI; instead, it seems he plans to build software with predictive potential in a number of specific situations. In this sense, he has likely set out to create a set of machine learning models which can be used in his everyday life. Nonetheless, I think that this sort of project is a lot of fun, and a great learning experience!

Zuckerberg outlines his goals for the project here. I'm going to attempt to break these goals into manageable parts and discuss potential DIY approaches where I can think of them! Having approached this type of project before, I think it is extremely useful to draw up a roadmap of the features and components that will go into the system before beginning development. This can save a lot of time and headaches down the road, while also making for a more elegant and efficient final product.

Dorm Room Automation

Zuckerberg's first goal may seem familiar in this time of increasing home automation, and incorporates many of the hardware aspects of these systems - for instance, controlling lights and temperature. I built such a system last year, so I have already made some strides towards controlling my room lights and temperature. Notably, given that I live in an on-campus dorm room, my options for controlling my room hardware are significantly more limited than if I owned my residence (and could play around with the light switches and other electrical systems).

[Image: RRAD Control Panel] An image of the control panel from my first attempt at a dorm room automation system.

You can take a look at what I accomplished in my dorm room last school year (my sophomore year) on my project page. I used some not-so-sophisticated hardware, like home-made relay boxes for switching wall-outlet voltage (120V AC), which let me control lamps and fans (for cooling in the summer). About a week ago, I ordered a set of remote control wall-outlets when they were on a big sale (I got a set of 3 outlets for $14); I hope to tinker with them when I get back to school. I also bought a big pack of 433MHz transmitters and receivers, which I should be able to use to control these outlets from my Raspberry Pi (which I used as the hub of my original automation system). I think that the Raspberry Pi is a great platform for use as a home automation hub (master) and controller (slave). These little computers have a small form-factor, and can control hardware using the built-in GPIOs. In addition to being inexpensive, they can also connect over both WiFi and Bluetooth (using USB transceivers), allowing for intricate wireless control systems. Should more processing power be required, tasks can be offloaded to other computers/servers.
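To make the hub idea a bit more concrete, here is a minimal sketch of how a controller might track and switch relay-driven outlets. Everything in it is an assumption for illustration: the names OutletController, OUTLET_PINS and log_write are hypothetical, the pin numbers are made up, and this is not the code from my original system. The pin-writing function is passed in so the sketch runs on any machine; on an actual Pi one could pass a function that wraps RPi.GPIO.output after the usual GPIO.setmode and GPIO.setup calls.

```python
from typing import Callable, Dict

# Hypothetical mapping of outlet names to GPIO pin numbers (made up for this sketch).
OUTLET_PINS: Dict[str, int] = {"lamp": 17, "fan": 27}


class OutletController:
    """Tracks and switches relay-driven outlets through a pin-writing function."""

    def __init__(self, pins: Dict[str, int], write_pin: Callable[[int, bool], None]):
        self.pins = dict(pins)
        self.write_pin = write_pin
        # Assume every outlet starts switched off.
        self.state = {name: False for name in self.pins}

    def switch(self, name: str, on: bool) -> None:
        """Drive the relay for one outlet and remember its new state."""
        if name not in self.pins:
            raise KeyError(f"unknown outlet: {name}")
        self.write_pin(self.pins[name], on)
        self.state[name] = on


def log_write(pin: int, on: bool) -> None:
    """Stand-in for real GPIO output: just report what would be written."""
    print(f"pin {pin} -> {'HIGH' if on else 'LOW'}")


if __name__ == "__main__":
    controller = OutletController(OUTLET_PINS, log_write)
    controller.switch("lamp", True)
    controller.switch("fan", False)
```

Keeping the hardware call behind a plain function also means the same controller could later drive the 433MHz outlets by swapping in a different write function.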

Goal #1: Home Automation

In thinking about the sort of system described above, I have identified several key sub-goals and requirements that I would prioritize:

Goal #2: Answering the Door and Other Security

Another of Zuckerberg's goals is to secure his home and only allow in certain visitors. He describes his vision of a smart doorbell which would use face recognition to let in specific guests. There are two major components to this phase of the project: hardware and software. The hardware for this goal seems to already exist. There are a number of WiFi-connected doorbells on the market which incorporate both a camera and an intercom (like Ring and SkyBell). These devices would provide the images needed for face detection, and would let Zuckerberg answer the door using his own voice or a Text-to-Speech (TTS) generated voice.

My major concern with this project goal is security: building an accurate face-recognition algorithm and training it for use with specific guests might be challenging, and inaccuracies or failures in the system could result in serious security breaches. Moreover, spoofing a face could be as easy as holding up a printed picture. Instead, I propose a slightly different design which uses the same hardware together with two-factor authentication:

  1. Users would first approach the door, ring the doorbell, and present their face for identification and security logs.
  2. A TTS voice would then instruct users to present an entry pass (an encoded pass only given to specific individuals). I envision a mobile phone app which would generate a QR code based upon a set of credentials (username & password), combined with a hashing or other time-based function which would generate unique and unpredictable entry keys. This approach, known as the Time-based One-Time Password (TOTP) algorithm, is well recognized and currently in use by large companies like Google (for their two-factor authentication system) and Blizzard.
  3. Users could then present their pass for inspection (I envision using the camera on the doorbell for reading the pass).
  4. Software would then process the image of the pass and determine if the user is authorized for entry.
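The time-based part of the steps above can be sketched in a few lines. This is only a sketch under stated assumptions: it follows the standard TOTP construction (an HMAC-SHA1 over a 30-second time counter, as specified in RFC 6238), the function names hotp, totp and verify_pass are my own, and the shared secret here is the RFC's test value rather than anything derived from real credentials. Encoding the resulting code into a QR image and reading it back from the doorbell camera is left out.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password for a given counter value."""
    msg = struct.pack(">Q", counter)  # counter as an 8-byte big-endian integer
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation: low 4 bits pick the offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: float = None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password: HOTP over the number of elapsed time steps."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)


def verify_pass(secret: bytes, presented: str, now: float = None,
                step: int = 30, window: int = 1) -> bool:
    """Accept a pass if it matches the current time step or a neighbouring one."""
    if now is None:
        now = time.time()
    counter = int(now // step)
    for drift in range(-window, window + 1):
        if counter + drift < 0:
            continue  # no time step exists before the epoch
        if hmac.compare_digest(hotp(secret, counter + drift), presented):
            return True
    return False


if __name__ == "__main__":
    shared_secret = b"12345678901234567890"  # RFC 6238 test secret, not a real key
    entry_code = totp(shared_secret)
    print(entry_code, verify_pass(shared_secret, entry_code))
```

The small window of neighbouring time steps is a design choice to tolerate clock drift between the phone and the door; a wider window is more forgiving but gives a stolen code a longer useful life.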

Even with a stronger system, such as the one described above, no security is foolproof. Secondary security measures would be essential: at the least, I would recommend security cameras. In the past I have used a piece of software called Motion (running on a Raspberry Pi) to detect motion from webcams and record security footage. It is a fine, open-source solution, but it is neither terribly sophisticated nor feature-rich; I actually ended up developing my own viewer for Motion, though it is not very polished (repository here). There are numerous commercial alternatives which would likely work better; in the past I've used Sighthound Video, previously Vitamin D Video, and liked it a lot.

Goal #3: Voice Recognition

In his posting on automation, Zuckerberg evoked a vision of a virtual butler which can respond to verbal commands. Indeed, I think that being able to interact with a system using spoken commands, and getting auditory responses/feedback, is a great feature (so futuristic)! This is a problem that I've approached a number of times, though I've yet to find a satisfactory answer. Writing this post renewed my interest in the pursuit, and I was able to find some new research and publications on the subject. I'm going to summarize my latest findings for you here.

There are three major approaches to the problem of speech recognition:

  1. Develop your own software: This is the least practical of the three approaches, though it may yield the most knowledge and learning potential. I would caution against this if it's your first time working with speech-recognition software, also called Speech-to-Text (STT) or Automatic Speech Recognition (ASR/SR). There are already some very mature solutions available for free or for $$$.
  2. Open-source software: This is the approach which I'm personally most interested in - why pay for software when there's a free alternative! I've come across four software solutions in this category so far: Sphinx, Kaldi, Julius, and HDecode. I have not personally tested these packages against one another (I've only used PocketSphinx on Raspberry Pi), but others have. You can find a comparison in a study by Christian Gaida, et al., which can be found here. The group found that Kaldi had the highest accuracy on both a German and an English corpus. That said, there are numerous other possibilities out there; in researching this article I found several, including one promising API-based service. In addition, there is a specific toolset called Jasper which will quickly get you up and running with speech recognition on Raspberry Pi. I don't have room to cover all of these solutions, so you might want to do some further research on your own.
  3. Paid software: I'm not going to cover this type of software in much depth; however, there are a number of paid solutions which can perform STT. Johanna Bjorklund of Mico wrote about her own experimentation with four software solutions (two open-source, two proprietary), and found that a proprietary package called VoxSigma produced the most accurate transcriptions of English YouTube videos. Notably, Kaldi again outperformed Sphinx on average in this experiment.
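If you want to compare engines yourself, a common way to score transcription accuracy is the word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the engine's output into the reference transcript, divided by the length of the reference. The sketch below is my own illustration of that metric (the function name word_error_rate is hypothetical), and I'm assuming rather than asserting that it matches how the comparisons mentioned above were scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("turn on the lamp", "turn off the lamp"))
```

A lower score is better; recording a handful of the commands you actually plan to use and scoring each engine on them is probably more informative for a home setup than a general benchmark.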

With regard to Text-to-Speech (TTS), I only have experience using Festival on Linux, which seems to work pretty well for my purposes. That said, there are numerous other options for TTS. I'll have to do some further tinkering, and write more on this at a later time.
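For reference, here is a minimal sketch of calling Festival from Python by piping text to the festival --tts command. The speak function and its runner and which parameters are hypothetical names of my own; they exist only so the sketch can be exercised on a machine without Festival or a sound card. I'm assuming Festival is installed and on the PATH when it is used for real.

```python
import shutil
import subprocess


def speak(text: str, runner=subprocess.run, which=shutil.which) -> bool:
    """Speak text through Festival; return False if the festival binary is not found."""
    if which("festival") is None:
        return False
    # "festival --tts" reads the text to speak from standard input.
    runner(["festival", "--tts"], input=text.encode("utf-8"), check=True)
    return True


if __name__ == "__main__":
    if not speak("The front door is open."):
        print("Festival does not appear to be installed.")
```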

Closing Thoughts

In this article, I've explored some of my past experience with, and thoughts on, the design and development of home automation systems. I think that home automation is a great project to tackle: it's a lot of fun and has huge practical benefits. In the future I might write more on the subject. I didn't cover all of the aspects of Zuckerberg's proposed system, as the remaining components would require extensive application of computer vision, which is a subject that I am still in the process of studying. If you've got any questions, thoughts, or suggestions, please drop them in the comments section below. Thanks for reading!