Research Paper: Sound Module For The Hybrid Drone Detection System
Abstract
This paper documents the current sound module used in the Hybrid Drone
Detection System (HDDS). The active audio branch is based on a single training
notebook, Notebooks/drone_thesis_audio_training.ipynb, and a single retained
runtime artifact, Project v1/src/audio/drone_sound_model.h5.
The sound module is an offline video-audio inference pipeline. It extracts audio from a video file, converts each window into a normalized log-Mel spectrogram, applies the trained CNN, and produces timestamped drone probabilities for later multimodal fusion.
1. Problem Definition
The purpose of the sound module is to answer a binary question:
- does this time interval contain drone audio or not?
The branch is intentionally offline and file-based. It does not depend on live microphones, audio drivers, or embedded deployment. At the current thesis stage, this is the correct engineering tradeoff because it keeps experiments reproducible and easy to review.
2. Repository Evidence Reviewed
The current sound evidence in the repository is:
- notebook: /home/mo/dev/python/HDDS2/Notebooks/drone_thesis_audio_training.ipynb
- runtime package: /home/mo/dev/python/HDDS2/Project v1/src/audio/
- active model artifact: /home/mo/dev/python/HDDS2/Project v1/src/audio/drone_sound_model.h5
The .h5 model is about 25 MB on disk, which is consistent with a small
Keras CNN rather than a lightweight classical model.
3. Sound Module Architecture
The cleaned runtime package is:
src/audio/
├── __init__.py
├── preprocess.py
├── features.py
├── classifier.py
├── persistence.py
├── schemas.py
├── report.py
├── video_test.py
└── drone_sound_model.h5
This separation is technically strong because it keeps:
- video-audio extraction in preprocess.py,
- spectrogram preparation in features.py,
- model loading and prediction in classifier.py,
- temporal smoothing in persistence.py,
- data contracts in schemas.py,
- reproducible reporting in report.py.
That is substantially more defensible than keeping the whole sound branch as a single notebook.
4. Training Notebook Findings
The notebook explicitly trains a binary drone-versus-non-drone classifier.
Its retained preprocessing assumptions are:
SAMPLE_RATE = 22050
DURATION = 2 # seconds
SAMPLES_PER_TRACK = SAMPLE_RATE * DURATION
Feature extraction is based on Mel spectrograms:
import librosa
import numpy as np

mel_spec = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=128)
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
log_mel_spec = (log_mel_spec - log_mel_spec.min()) / (
    log_mel_spec.max() - log_mel_spec.min()
)
This means the notebook does not learn directly from raw waveform samples. It learns from a normalized time-frequency image representation of the sound.
The notebook also uses overlapping chunk generation:
- clip length: 2.0 s
- overlap: 50%
- hop: 1.0 s
These assumptions are now reflected in the cleaned runtime defaults.
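A minimal sketch of that chunking, assuming a mono waveform y already loaded at 22.05 kHz; chunk_signal and its constants are illustrative names, not code taken from the notebook:
import numpy as np

SAMPLE_RATE = 22050
WINDOW_S = 2.0
HOP_S = 1.0

def chunk_signal(y, sr=SAMPLE_RATE, window_s=WINDOW_S, hop_s=HOP_S):
    """Yield (start_time_s, chunk) pairs of fixed-length overlapping windows."""
    window = int(sr * window_s)
    hop = int(sr * hop_s)
    if len(y) < window:
        # a clip shorter than one window is zero-padded to full length
        y = np.pad(y, (0, window - len(y)))
    for start in range(0, len(y) - window + 1, hop):
        yield start / sr, y[start:start + window]
Using a 1.0 s hop on a 2.0 s window reproduces the 50% overlap described above.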
5. CNN Classifier Design
The notebook defines a compact convolutional neural network:
from tensorflow.keras import layers, models

def create_model(input_shape):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid')
    ])
    return model
This is a reasonable thesis-stage architecture because:
- convolutions can learn local spectral-temporal patterns from rotor noise,
- pooling makes the representation less brittle to small local shifts,
- dropout helps limit overfitting,
- a sigmoid output matches the binary decision target directly.
The important repository fact is that this model is the one saved as:
model.save('drone_sound_model.h5')
That artifact is now the only active sound model described by current runtime code and documentation.
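A minimal sketch of loading that artifact with standard Keras tooling; the relative path is illustrative:
from tensorflow import keras

# load the single active sound model artifact
model = keras.models.load_model("src/audio/drone_sound_model.h5")
model.summary()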
6. Training Procedure
The retained notebook uses:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

history = model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=32,
    validation_split=0.2
)
This training procedure is defensible because:
- class balance is preserved in the split,
- test data is held out from fitting,
- validation is monitored during training,
- the epoch count is moderate for a baseline CNN,
- the batch size is conventional and stable.
The notebook also evaluates the model with classification metrics, but the retained file in this repository does not preserve the final numeric outputs in a way that can be cited confidently here. The training process is preserved; the final reported score table is not.
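If the held-out split is re-run, those metrics could be regenerated roughly as follows, assuming scikit-learn is available; this is a sketch, not a record of the notebook's reported numbers:
from sklearn.metrics import classification_report

# threshold the sigmoid outputs at 0.5 to obtain binary labels
y_prob = model.predict(X_test).ravel()
y_pred = (y_prob >= 0.5).astype(int)

print(classification_report(y_test, y_pred, target_names=["non-drone", "drone"]))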
7. Runtime Inference Pipeline
The cleaned inference entrypoint is:
PYTHONPATH=src python -m audio.video_test /path/to/video.mp4
The runtime path is:
- extract audio from the input video with ffmpeg,
- resample to 22.05 kHz mono,
- slice audio into overlapping 2.0 s windows,
- convert each window into a normalized 128-bin log-Mel spectrogram,
- load drone_sound_model.h5,
- run CNN inference per window,
- write timestamped predictions and a text log.
The extraction command in preprocess.py remains explicit:
command = [
    "ffmpeg",
    "-y",
    "-i",
    str(video_path),
    "-vn",
    "-ac",
    "1",
    "-ar",
    str(target_sr),
    "-c:a",
    "pcm_s16le",
    str(wav_path),
]
That is appropriate for a reproducible offline pipeline.
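A minimal sketch of how such a command list is typically executed from Python; run_extraction is a hypothetical wrapper rather than the exact function in preprocess.py:
import subprocess

def run_extraction(command):
    """Run the ffmpeg command and fail loudly if extraction does not succeed."""
    subprocess.run(command, check=True, capture_output=True)
Using check=True turns a non-zero ffmpeg exit status into an exception, so a failed extraction cannot silently produce an empty prediction run.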
8. Runtime Preprocessing Alignment
The most important sound-module correction made in the current repository state is alignment between training and inference.
The runtime now uses:
- TARGET_SR = 22050
- default window_s = 2.0
- default hop_s = 1.0
- per-window log-Mel normalization
- a single CNN prediction path
This matters because mismatched runtime preprocessing is one of the most common reasons sound classifiers perform poorly after being moved out of notebooks.
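A minimal sketch of per-window feature preparation that mirrors the training-time normalization, assuming librosa is available at runtime; window_to_features is a hypothetical helper, and the small epsilon guard for silent windows is an addition of this sketch:
import librosa
import numpy as np

def window_to_features(chunk, sr=22050, n_mels=128):
    """Convert one audio window into the normalized log-Mel input expected by the CNN."""
    mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # epsilon guards against zero dynamic range in a silent window
    log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-9)
    # add batch and channel axes for the Conv2D input
    return log_mel[np.newaxis, ..., np.newaxis]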
9. Prediction Semantics
Each window produces an AudioSegmentPrediction:
from dataclasses import dataclass

@dataclass(frozen=True)
class AudioSegmentPrediction:
    start_s: float
    end_s: float
    model_probability: float
    audio_score: float
    label: str
    model_notes: tuple[str, ...] = ()
This schema is simple and correct for the current branch:
- start_s and end_s preserve temporal alignment,
- model_probability records the CNN output,
- audio_score remains available for later fusion logic,
- label is the thresholded binary decision.
Because there is only one active model, audio_score is currently identical to
the CNN probability. That is a defensible design because it keeps the interface
stable for later multimodal fusion work.
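A minimal sketch of mapping one window's output onto that schema, assuming a 0.5 decision threshold; the helper name and label strings are illustrative:
def to_prediction(start_s, end_s, probability, threshold=0.5):
    """Wrap one CNN window probability in the shared prediction schema."""
    return AudioSegmentPrediction(
        start_s=start_s,
        end_s=end_s,
        model_probability=probability,
        audio_score=probability,  # single active model: score mirrors the CNN output
        label="drone" if probability >= threshold else "no_drone",
    )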
10. Temporal Confirmation
The sound module supports optional M/N temporal confirmation:
PYTHONPATH=src python -m audio.video_test /path/to/video.mp4 --confirm-m 3 --confirm-n 5
This is useful because isolated high-confidence windows should not always become confirmed alerts in a multimodal system.
The implementation preserves the raw per-window probability while allowing the
final label to be stabilized over a sliding decision window.
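A minimal sketch of M-of-N confirmation over a sliding decision window, shown here to illustrate the idea rather than to reproduce persistence.py:
from collections import deque

def confirm_m_of_n(window_decisions, m=3, n=5):
    """Mark a window confirmed when at least m of the last n raw decisions are positive."""
    recent = deque(maxlen=n)
    confirmed = []
    for decision in window_decisions:
        recent.append(decision)
        confirmed.append(sum(recent) >= m)
    return confirmed
For example, confirm_m_of_n([0, 1, 1, 0, 1, 1], m=3, n=5) only starts returning True once three of the most recent five windows are positive, so a single spurious high-confidence window is not promoted to a confirmed alert.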
11. Reporting And Logging
The module produces a plain-text report and writes a timestamped run log under:
Project v1/results/logs/
Each run records:
- video path,
- segment count,
- window and hop settings,
- threshold,
- active model artifact path,
- per-segment probabilities and labels.
This is good engineering practice because it preserves experiment evidence in a reviewable format.
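A minimal sketch of appending one run's evidence as a plain-text log, assuming the prediction objects described in Section 9; file naming and field layout are illustrative:
from datetime import datetime
from pathlib import Path

def write_run_log(log_dir, video_path, settings, predictions):
    """Persist one run's settings and per-segment results as a plain-text log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"audio_run_{stamp}.log"
    with log_path.open("w") as log:
        log.write(f"video: {video_path}\n")
        log.write(f"settings: {settings}\n")
        for p in predictions:
            log.write(f"{p.start_s:.1f}-{p.end_s:.1f}s p={p.model_probability:.3f} {p.label}\n")
    return log_path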
12. Strengths And Limitations
Strengths:
- the branch is now centered on one clearly identified trained artifact,
- training and runtime preprocessing are aligned,
- the package layout is clean and teammate-readable,
- video-based inference is reproducible,
- logs and temporal confirmation make the branch more practical.
Limitations:
- the repository does not preserve final notebook evaluation numbers cleanly,
- the audio branch is still offline rather than live-streaming,
- TensorFlow is required to load the .h5 model,
- a dedicated training script has not yet replaced the notebook workflow.
13. Final Defense Of The Sound Module
The sound module is technically justified at this stage of HDDS because it adds an independent sensing branch that can later confirm or reject ambiguous radar or vision events.
The design choices are defensible:
- log-Mel spectrograms are standard for sound classification,
- a compact CNN is appropriate for limited-resource thesis work,
- 2.0 s overlapping windows give meaningful temporal resolution,
- explicit logging and post-processing improve reproducibility,
- the cleaned package makes the branch maintainable by teammates.
The most important project correction is conceptual consistency. The current
sound paper, runtime code, and configuration now point to the same model and
the same preprocessing assumptions: drone_sound_model.h5 trained from
drone_thesis_audio_training.ipynb.