Big Data Digest
Source: Medium
Translated by: Chen Zhiyan
As a new dad and a programmer, the question I ponder most often in my new role is: "Can the work of caring for a baby really not be automated?"
Of course, some of it probably could be. Yet even if diaper-changing robots existed (assuming enough parents agreed to test such devices on their toddlers), very few parents would actually be willing to automate baby care.
As a father, the first thing I realized is that babies cry a lot, and even when I’m home, I can’t always hear my child crying.
Typically, commercial baby monitors fill this gap; they serve as intercoms, allowing you to hear the baby’s cries from another room.
But I quickly realized that commercial baby monitors are not as smart as I imagined:
- They act only as transmitters: they carry sound from the source to the speaker, but cannot detect the meaning of the child's cries;
- When parents move to another room, they have to carry the speaker with them, and the sound cannot be played on any other existing audio device;
- The speakers are usually low-power devices and cannot be connected to external speakers, which means that if I'm playing music in another room I might not hear the baby's cries, even if the monitor and I are in the same room;
- Most of them operate on low-power radio waves, which means they usually stop working if the baby is in his/her room and you walk downstairs.
Therefore, I came up with the idea of making a better “smart baby monitor” myself.
Without further ado, I defined some necessary features for this "smart baby monitor":
- It should run on an inexpensive Raspberry Pi with a cheap USB microphone.
- It should detect the child's cries and notify me when they start/stop crying (ideally on my phone), or track the data points on my dashboard, or run corresponding tasks. It should not just be an intercom that simply relays sound from one source to another compatible device.
- It should be able to stream the audio to devices such as speakers, smartphones, and computers.
- It should work regardless of the distance between the source and the speaker, with no need to move the speaker around the house.
- It should also have a camera, allowing real-time monitoring of the child. When he/she starts crying, I can capture images or short videos of the crib to check for any issues.
Let’s see how a new dad uses his engineer’s brain and open-source tools to accomplish this task.
Collecting Audio Samples
First, get a Raspberry Pi and burn a Linux operating system onto an SD card (a Raspberry Pi 3 or later is preferable for running the TensorFlow model). You will also need a USB microphone compatible with the Raspberry Pi.
Then install the necessary dependencies:
[sudo] apt-get install ffmpeg lame libatlas-base-dev alsa-utils
[sudo] pip3 install tensorflow
The first step is to record enough audio samples of when the baby cries and when he/she does not cry. These samples will later be used to train the audio detection model.
Note: In this example I show how to use sound detection to identify a baby's cries, but the exact same procedure can be used to detect any other type of sound, as long as it lasts long enough (e.g., an alarm going off, or the neighbor's drilling).
First, check the audio input devices:
arecord -l
On the Raspberry Pi, you will get the following output (note, there are two USB microphones):
**** List of CAPTURE Hardware Devices ****
card 1: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio]
  Subdevices: 0/1
  Subdevice #0: subdevice #0
card 2: Device_1 [USB PnP Sound Device], device 0: USB Audio [USB Audio]
  Subdevices: 0/1
  Subdevice #0: subdevice #0
I use the second microphone to record sound, i.e., card 2, device 0. It can be addressed in ALSA either as hw:2,0 (which accesses the hardware device directly) or as plughw:2,0 (which adds sample rate and format conversion plugins if needed). Make sure there is enough space on the SD card, then start recording some audio:
arecord -D plughw:2,0 -c 1 -f cd | lame - audio.mp3
With the child in the same room, record a few minutes or hours of audio, preferably including long periods of silence, baby cries, and other unrelated sounds; press Ctrl-C when the recording is done. Repeat this process as many times as possible, at different times of day or on different days, to obtain a variety of audio samples.
Labeling Audio Samples
Once you have enough audio samples, you can copy them to your computer to train the model—you can use SCP to copy files or directly copy from the SD card.
Store them all in the same directory, e.g., ~/datasets/sound-detect/audio. Create a new subfolder for each recording; each subfolder should contain the audio file (named audio.mp3) and a label file (named labels.json), which you will use to mark the positive/negative segments of the recording. The structure of the raw dataset is as follows:
~/datasets/sound-detect/audio
  -> sample_1
    -> audio.mp3
    -> labels.json
  -> sample_2
    -> audio.mp3
    -> labels.json
  ...
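For example, one way to lay out the first recording from the shell (assuming audio.mp3 sits in the current directory; the paths simply mirror the structure above):
mkdir -p ~/datasets/sound-detect/audio/sample_1
mv audio.mp3 ~/datasets/sound-detect/audio/sample_1/audio.mp3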
Next comes the labeling of the recorded audio files; if they contain hours of your baby's cries, this can be particularly tedious. Open each audio file in your favorite audio player or in Audacity and create a new labels.json file in each sample directory. Identify the exact start and end times of the cries and record them in labels.json as a key-value structure of the form time_string -> label. Example:
{ "00:00": "negative", "02:13": "positive", "04:57": "negative", "15:41": "positive", "18:24": "negative"}
In the example above, all audio segments from 00:00 to 02:12 will be labeled as negative, from 02:13 to 04:56 as positive, and so on.
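To make the mapping concrete, here is a minimal, purely illustrative Python sketch of how such time-keyed labels can be expanded into per-segment labels (this is not micmon's actual code, and the function and parameter names are made up for the example):
import json

def load_segments(labels_file, sample_duration=2, total_duration=20 * 60):
    # Parse the "MM:SS" -> label map and sort the marks by time (in seconds)
    with open(labels_file) as f:
        labels = json.load(f)

    marks = sorted(
        (int(m) * 60 + int(s), label)
        for t, label in labels.items()
        for m, s in [t.split(':')]
    )

    # Each fixed-length segment gets the label of the last mark at or before its start
    segments = []
    for start in range(0, total_duration, sample_duration):
        label = next((lbl for ts, lbl in reversed(marks) if ts <= start), 'negative')
        segments.append((start, label))
    return segments
With the labels above, a 2-second segment starting at 02:14 would come out as positive, while one starting at 05:00 would come out as negative.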
Generating the Dataset
After labeling all the audio samples, the next step is to generate the dataset that will eventually be fed to the TensorFlow model. For this I created micmon, a generic library and set of utilities for sound monitoring. Install it as follows:
git clone git@github.com:/BlackLight/micmon.git
cd micmon
[sudo] pip3 install -r requirements.txt
[sudo] python3 setup.py build install
The model is designed to work on frequency samples of the audio rather than on raw audio, because the sound we want to detect has a specific "spectral" signature, namely: a fundamental frequency (or a narrow band around it) and a set of specific harmonics. The ratios of these harmonic frequencies to the fundamental are affected neither by amplitude (the frequency ratios are constant regardless of the input volume) nor by phase (consecutive recordings of the same sound will have the same spectral signature no matter when they were recorded).
This amplitude- and phase-invariant property makes this approach more likely to yield a robust sound-detection model than simply feeding raw audio samples into the model. In addition, the model can be simpler (multiple frequencies can be grouped into bins without affecting performance, effectively achieving dimensionality reduction): regardless of the sample duration, the model only needs 50-100 frequency bands as input, whereas one second of raw audio typically contains 44100 data points and the input length grows with the sample duration. A smaller, fixed-size input also makes the model less prone to overfitting.
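As a rough numpy illustration of this dimensionality reduction (this is not micmon's actual implementation), here is how a 2-second mono segment at 44100 Hz can be reduced to 100 averaged frequency bins between 250 and 2500 Hz:
import numpy as np

sample_rate = 44100
segment = np.random.randn(2 * sample_rate)   # stand-in for a 2-second mono audio segment

# Magnitude spectrum of the segment and the frequency of each FFT coefficient
spectrum = np.abs(np.fft.rfft(segment))
freqs = np.fft.rfftfreq(len(segment), d=1 / sample_rate)

# Keep only the 250-2500 Hz range, then average it down to 100 bins:
# ~88200 raw samples become a 100-value input vector.
low, high, n_bins = 250, 2500, 100
band = spectrum[(freqs >= low) & (freqs < high)]
bins = np.array([chunk.mean() for chunk in np.array_split(band, n_bins)])
print(bins.shape)   # (100,)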
micmon computes the FFT (Fast Fourier Transform) of segments of the audio samples, groups the resulting spectrum into bands using low-pass and high-pass filters, and saves the results into a set of compressed numpy (.npz) files. This can be done from the command line with the micmon-datagen command:
micmon-datagen \
--low 250 --high 2500 --bins 100 \
--sample-duration 2 --channels 1 \
~/datasets/sound-detect/audio ~/datasets/sound-detect/data
In the example above, we generate a dataset from the raw audio samples stored in ~/datasets/sound-detect/audio and store the resulting spectral data in ~/datasets/sound-detect/data. --low and --high indicate the lowest and highest frequencies respectively; the default lowest frequency is 20 Hz (the lowest frequency audible to the human ear) and the default highest frequency is 20 kHz (the highest frequency audible to a healthy young person).
You will usually want to narrow down this range in order to exclude as much unrelated background audio and harmonics as possible. In this case, the 250-2500 Hz range is sufficient to detect a baby's cry.
Baby cries are typically high-pitched (for comparison, the highest note an operatic soprano can reach is around 1000 Hz), so we set the highest frequency to at least double that to make sure we capture enough of the higher harmonics (the harmonics are the higher frequencies), while avoiding setting it too high and picking up harmonics from other background sounds. I also cut off everything below 250 Hz: a baby's cry is unlikely to contain much energy in the low-frequency range. A good approach is to open a few positive samples in an equalizer/spectrum analyzer, check which frequencies dominate, and center the dataset on those frequencies.

--bins specifies the number of groups in the frequency space (default: 100). A larger value means higher frequency resolution/granularity, but if it is too high it may make the model prone to overfitting.
The script splits the raw audio into smaller segments and computes the spectral signature of each segment. --sample-duration specifies how long each segment should be (default: 2 seconds). A higher value works better for sounds that last longer, but it also increases detection latency and may fail on short sounds; a lower value works better for short sounds, but the captured segments may not carry enough information to reliably identify the sound.
In addition to the micmon-datagen script, you can also use the micmon API to write scripts to generate datasets. Example:
import os
from micmon.audio import AudioDirectory, AudioPlayer, AudioFile
from micmon.dataset import DatasetWriter
basedir = os.path.expanduser('~/datasets/sound-detect')
audio_dir = os.path.join(basedir, 'audio')
datasets_dir = os.path.join(basedir, 'data')
cutoff_frequencies = [250, 2500]
# Scan the base audio_dir for labelled audio samples
audio_dirs = AudioDirectory.scan(audio_dir)
# Save the spectrum information and labels of the samples to a
# different compressed file for each audio file.
for audio_dir in audio_dirs:
    dataset_file = os.path.join(datasets_dir, os.path.basename(audio_dir.path) + '.npz')
    print(f'Processing audio sample {audio_dir.path}')

    with AudioFile(audio_dir) as reader, \
            DatasetWriter(dataset_file,
                          low_freq=cutoff_frequencies[0],
                          high_freq=cutoff_frequencies[1]) as writer:
        for sample in reader:
            writer += sample
Whether you use micmon-datagen or the micmon Python API to generate the dataset, at the end of the process you should find a set of .npz files in the ~/datasets/sound-detect/data directory, one for each labeled raw audio file. You can then use this dataset to train the neural network for sound detection.
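As a quick sanity check, you can open one of the generated files with numpy and list the arrays it contains (the exact key names depend on micmon's storage format):
import os
import numpy as np

dataset_file = os.path.expanduser('~/datasets/sound-detect/data/sample_1.npz')
data = np.load(dataset_file)
print(data.files)   # names of the arrays stored in this dataset file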
Training the Model
micmon uses TensorFlow+Keras to define and train the model. With the Python API, it can be easily implemented. For example:
import os
from tensorflow.keras import layers
from micmon.dataset import Dataset
from micmon.model import Model
# This is a directory that contains the saved .npz dataset files
datasets_dir = os.path.expanduser('~/datasets/sound-detect/data')
# This is the output directory where the model will be saved
model_dir = os.path.expanduser('~/models/sound-detect')
# This is the number of training epochs for each dataset sample
epochs = 2
# Load the datasets from the compressed files.
# 70% of the data points will be included in the training set,
# 30% of the data points will be included in the evaluation set
# and used to evaluate the performance of the model.
datasets = Dataset.scan(datasets_dir, validation_split=0.3)
labels = ['negative', 'positive']
freq_bins = len(datasets[0].samples[0])
# Create a network with 4 layers (one input layer, two intermediate layers and one output layer).
# The first intermediate layer in this example will have twice the number of units as the number
# of input units, while the second intermediate layer will have 75% of the number of
# input units. We also specify the names for the labels and the low and high frequency range
# used when sampling.
model = Model(
    [
        layers.Input(shape=(freq_bins,)),
        layers.Dense(int(2 * freq_bins), activation='relu'),
        layers.Dense(int(0.75 * freq_bins), activation='relu'),
        layers.Dense(len(labels), activation='softmax'),
    ],
    labels=labels,
    low_freq=datasets[0].low_freq,
    high_freq=datasets[0].high_freq
)
# Train the model
for epoch in range(epochs):
    for i, dataset in enumerate(datasets):
        print(f'[epoch {epoch+1}/{epochs}] ')
        model.fit(dataset)
        evaluation = model.evaluate(dataset)
        print(f'Validation set loss and accuracy: {evaluation}')

# Save the model
model.save(model_dir, overwrite=True)
After running this script (and once you are satisfied with the model's accuracy), you will find the newly saved model in the ~/models/sound-detect directory. In my case, about 5 hours of recorded sound was enough to train a model over the optimized frequency range with an accuracy greater than 98%. If you trained the model on your computer, simply copy it to the Raspberry Pi and you are ready for the next step.
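For example, assuming the Raspberry Pi is reachable as raspberry-pi (an assumed hostname) and you keep the same path layout on both machines, copying the trained model might look like this:
scp -r ~/models/sound-detect pi@raspberry-pi:/home/pi/models/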
Using the Model for Prediction
At this point, create a script that uses the previously trained model to notify us when the baby starts crying:
import os
from micmon.audio import AudioDevice
from micmon.model import Model
model_dir = os.path.expanduser('~/models/sound-detect')
model = Model.load(model_dir)
audio_system = 'alsa' # Supported: alsa and pulseaudio
audio_device = 'plughw:2,0'  # Get the list of recognized input devices with arecord -l

with AudioDevice(audio_system, device=audio_device) as source:
    for sample in source:
        source.pause()  # Pause recording while we process the frame
        prediction = model.predict(sample)
        print(prediction)
        source.resume()  # Resume recording
Run the script on the Raspberry Pi and let it run for a while: it will print negative to standard output if no cry was detected over the past 2 seconds, and positive otherwise.
However, simply printing messages to standard output when the baby cries is not very useful—we want to receive explicit real-time notifications!
This can be achieved with Platypush. In this example we will use the Pushbullet integration to send a message to our phone when a cry is detected. Install Redis (which Platypush uses to receive messages) and Platypush with the HTTP and Pushbullet integrations:
[sudo] apt-get install redis-server
[sudo] systemctl start redis-server.service
[sudo] systemctl enable redis-server.service
[sudo] pip3 install 'platypush[http,pushbullet]'
Install the Pushbullet application on your smartphone, and go to pushbullet.com to get the API token. Then create a ~/.config/platypush/config.yaml file that enables HTTP and Pushbullet integration:
backend.http:
    enabled: True

pushbullet:
    token: YOUR_TOKEN
Next, modify the previous script so that, instead of printing messages to standard output, it triggers a CustomEvent that can be captured by a Platypush hook:
#!/usr/bin/python3
import argparse
import logging
import os
import sys
from platypush import RedisBus
from platypush.message.event.custom import CustomEvent
from micmon.audio import AudioDevice
from micmon.model import Model
logger = logging.getLogger('micmon')
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('model_path', help='Path to the file/directory containing the saved Tensorflow model')
    parser.add_argument('-i', help='Input sound device (e.g. hw:0,1 or default)', required=True, dest='sound_device')
    parser.add_argument('-e', help='Name of the event that should be raised when a positive event occurs', required=True, dest='event_type')
    parser.add_argument('-s', '--sound-server', help='Sound server to be used (available: alsa, pulse)', required=False, default='alsa', dest='sound_server')
    parser.add_argument('-P', '--positive-label', help='Model output label name/index to indicate a positive sample (default: positive)', required=False, default='positive', dest='positive_label')
    parser.add_argument('-N', '--negative-label', help='Model output label name/index to indicate a negative sample (default: negative)', required=False, default='negative', dest='negative_label')
    parser.add_argument('-l', '--sample-duration', help='Length of the FFT audio samples (default: 2 seconds)', required=False, type=float, default=2., dest='sample_duration')
    parser.add_argument('-r', '--sample-rate', help='Sample rate (default: 44100 Hz)', required=False, type=int, default=44100, dest='sample_rate')
    parser.add_argument('-c', '--channels', help='Number of audio recording channels (default: 1)', required=False, type=int, default=1, dest='channels')
    parser.add_argument('-f', '--ffmpeg-bin', help='FFmpeg executable path (default: ffmpeg)', required=False, default='ffmpeg', dest='ffmpeg_bin')
    parser.add_argument('-v', '--verbose', help='Verbose/debug mode', required=False, action='store_true', dest='debug')
    parser.add_argument('-w', '--window-duration', help='Duration of the look-back window (default: 10 seconds)', required=False, type=float, default=10., dest='window_length')
    parser.add_argument('-n', '--positive-samples', help='Number of positive samples detected over the window duration to trigger the event (default: 1)', required=False, type=int, default=1, dest='positive_samples')

    opts, args = parser.parse_known_args(sys.argv[1:])
    return opts


def main():
    args = get_args()
    if args.debug:
        logger.setLevel(logging.DEBUG)

    model_dir = os.path.abspath(os.path.expanduser(args.model_path))
    model = Model.load(model_dir)
    window = []
    cur_prediction = args.negative_label
    bus = RedisBus()

    with AudioDevice(system=args.sound_server,
                     device=args.sound_device,
                     sample_duration=args.sample_duration,
                     sample_rate=args.sample_rate,
                     channels=args.channels,
                     ffmpeg_bin=args.ffmpeg_bin,
                     debug=args.debug) as source:
        for sample in source:
            source.pause()  # Pause recording while we process the frame
            prediction = model.predict(sample)
            logger.debug(f'Sample prediction: {prediction}')
            has_change = False

            if len(window) < args.window_length:
                window += [prediction]
            else:
                window = window[1:] + [prediction]

            positive_samples = len([pred for pred in window if pred == args.positive_label])
            if args.positive_samples <= positive_samples and \
                    prediction == args.positive_label and \
                    cur_prediction != args.positive_label:
                cur_prediction = args.positive_label
                has_change = True
                logging.info(f'Positive sample threshold detected ({positive_samples}/{len(window)})')
            elif args.positive_samples > positive_samples and \
                    prediction == args.negative_label and \
                    cur_prediction != args.negative_label:
                cur_prediction = args.negative_label
                has_change = True
                logging.info(f'Negative sample threshold detected ({len(window)-positive_samples}/{len(window)})')

            if has_change:
                evt = CustomEvent(subtype=args.event_type, state=prediction)
                bus.post(evt)

            source.resume()  # Resume recording


if __name__ == '__main__':
    main()
Save the above script as ~/bin/micmon_detect.py. The script only triggers an event if at least the required number of positive samples is detected within the sliding window (this reduces the noise caused by wrong predictions or temporary glitches), and only when the current state changes (e.g., from negative to positive); the event is then dispatched to Platypush over the Redis bus. The script is also generic: it works with other sound models (not necessarily crying babies), other positive/negative labels, other frequency ranges, and other types of output events.
Create a Platypush hook to respond to the event and send notifications to the device. First, create a Platypush script directory:
mkdir -p ~/.config/platypush/scripts
cd ~/.config/platypush/scripts

# Define the directory as a module
touch __init__.py

# Create a script for the baby-cry events
vi babymonitor.py
The content of babymonitor.py is as follows:
from platypush.context import get_plugin
from platypush.event.hook import hook
from platypush.message.event.custom import CustomEvent
@hook(CustomEvent, subtype='baby-cry', state='positive')
def on_baby_cry_start(event, **_):
    pb = get_plugin('pushbullet')
    pb.send_note(title='Baby cry status', body='The baby is crying!')
@hook(CustomEvent, subtype='baby-cry', state='negative')
def on_baby_cry_stop(event, **_):
    pb = get_plugin('pushbullet')
    pb.send_note(title='Baby cry status', body='The baby stopped crying - good job!')
Create a systemd service file for Platypush and start/enable the service so that it runs automatically:
mkdir -p ~/.config/systemd/user
wget -O ~/.config/systemd/user/platypush.service \
https://raw.githubusercontent.com/BlackLight/platypush/master/examples/systemd/platypush.service
systemctl --user start platypush.service
systemctl --user enable platypush.service
Create a service file for the baby monitor, such as:
~/.config/systemd/user/babymonitor.service:
[Unit]
Description=Monitor to detect my baby's cries
After=network.target sound.target
[Service]
ExecStart=/home/pi/bin/micmon_detect.py -i plughw:2,0 -e baby-cry -w 10 -n 2 ~/models/sound-detect
Restart=always
RestartSec=10
[Install]
WantedBy=default.target
This service starts the microphone monitor on the ALSA device plughw:2,0. If at least 2 positive 2-second samples were detected over the previous 10 seconds and the previous state was negative, it triggers an event with state=positive; if fewer than 2 positive samples were detected over the previous 10 seconds and the previous state was positive, it triggers state=negative. You can then start/enable the service:
systemctl --user start babymonitor.service
systemctl --user enable babymonitor.service
You should receive a notification on your phone as soon as the baby starts crying. If you do not, review the labels you applied to the audio samples, the architecture and parameters of the neural network, or the sample length/window/frequency band parameters.
Additionally, this is a relatively basic automation example, and you can attach more automation tasks to it. For example, you can send a request to another Platypush device (e.g., in the bedroom or living room) to announce aloud, via the TTS plugin, that the baby is crying (see the sketch after this paragraph). You can also extend the micmon_detect.py script so that the captured audio samples are streamed over HTTP, e.g., using a Flask wrapper and ffmpeg for the audio conversion.

Another interesting use case is to push data points to a local database whenever the baby starts/stops crying (see my previous article on how to create a flexible and self-managing dashboard using Platypush + PostgreSQL + Mosquitto + Grafana: https://towardsdatascience.com/how-to-build-your-home-infrastructure-for-data-collection-and-visualization-and-be-the-real-owner-af9b33723b0c): this is a very useful data set for tracking when the baby sleeps, wakes up, or needs feeding. Although monitoring the baby was my original motivation for developing micmon, the same procedure can be used to train and run detection models for other types of sounds. Finally, consider using a good power supply or a lithium battery pack so that the monitor can be carried around.
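If you want the cry announced aloud, a hook similar to the Pushbullet one could call the TTS plugin. Here is a minimal sketch, assuming the tts plugin is enabled in config.yaml on the device that should speak (adapt the plugin and hook names to your setup):
# Hypothetical extra hook, e.g. in ~/.config/platypush/scripts/babymonitor.py
from platypush.context import get_plugin
from platypush.event.hook import hook
from platypush.message.event.custom import CustomEvent

@hook(CustomEvent, subtype='baby-cry', state='positive')
def announce_baby_cry(event, **_):
    # Assumes the tts plugin is configured; it speaks through the default audio output
    tts = get_plugin('tts')
    tts.say(text='The baby is crying')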
Installing the Baby Camera
With a good audio feed and detection method in place, you can also add a video feed to keep an eye on the child. Initially I attached a PiCamera to the Raspberry Pi 3 used for audio detection, but I found this configuration rather impractical. Think about it: a Raspberry Pi 3, an additional battery pack, and a camera combined would be quite bulky; a lightweight camera that can easily be mounted on a stand or a flexible arm and moved around lets you keep a close watch on the child wherever he/she is. In the end I chose the smaller Raspberry Pi Zero, which is compatible with the PiCamera, plus a small battery.

Similarly, first insert an SD card burned with a Raspberry Pi-compatible operating system. Then plug a Raspberry Pi-compatible camera into its slot, make sure the camera module is enabled in raspi-config, and install Platypush with the PiCamera integration:
[sudo] pip3 install 'platypush[http,camera,picamera]'
Then add the camera configuration in ~/.config/platypush/config.yaml:
camera.pi:
    listen_port: 5001
Restart Platypush to pick up the new configuration. You can then grab snapshots from the camera over HTTP:
wget http://raspberry-pi:8008/camera/pi/photo.jpg
Or open the video in the browser:
http://raspberry-pi:8008/camera/pi/video.mjpg
Similarly, you can create a hook that starts the camera feed over TCP/H264 when the application starts:
mkdir -p ~/.config/platypush/scripts
cd ~/.config/platypush/scripts
touch __init__.py
vi camera.py
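A possible camera.py, assuming the camera.pi plugin exposes a start_streaming action (check the Platypush documentation for your version), could look like this:
# Hypothetical ~/.config/platypush/scripts/camera.py
from platypush.context import get_plugin
from platypush.event.hook import hook
from platypush.message.event.application import ApplicationStartedEvent

@hook(ApplicationStartedEvent)
def on_application_started(event, **_):
    # Start the TCP/H264 camera stream on the configured listen_port (5001)
    cam = get_plugin('camera.pi')
    cam.start_streaming()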
You can also watch the video via VLC:
vlc tcp/h264://raspberry-pi:5001
Watch the video on your phone using the VLC app or via the RPi Camera Viewer app.
From the initial idea to the final implementation, the result works quite well; call it a new dad's own small redemption from the chores of caregiving.
Original link:
https://towardsdatascience.com/create-your-own-smart-baby-monitor-with-a-raspberrypi-and-tensorflow-5b25713410ca