Research Paper: Sound Module For The Hybrid Drone Detection System
Abstract
This paper documents the current sound module used in the Hybrid Drone
Detection System (HDDS). The active audio branch is based on a single training
notebook, Notebooks/drone_thesis_audio_training.ipynb, and a single retained
runtime artifact, Project v1/src/audio/drone_sound_model.h5.
The sound module is an offline video-audio inference pipeline. It extracts audio from a video file, converts each window into a normalized log-Mel spectrogram, applies the trained CNN, and produces timestamped drone probabilities for later multimodal fusion.
1. Problem Definition
The purpose of the sound module is to answer a binary question:
- does this time interval contain drone audio or not?
The branch is intentionally offline and file-based. It does not depend on live microphones, audio drivers, or embedded deployment. At the current thesis stage, this is the correct engineering tradeoff because it keeps experiments reproducible and easy to review.
2. Repository Evidence Reviewed
The current sound evidence in the repository is:
- notebook: /home/mo/dev/python/HDDS2/Notebooks/drone_thesis_audio_training.ipynb
- runtime package: /home/mo/dev/python/HDDS2/Project v1/src/audio/
- active model artifact: /home/mo/dev/python/HDDS2/Project v1/src/audio/drone_sound_model.h5
The .h5 model is about 25 MB on disk, which is consistent with a small
Keras CNN rather than a lightweight classical model.
3. Sound Module Architecture
The cleaned runtime package is:
src/audio/
├── __init__.py
├── preprocess.py
├── features.py
├── classifier.py
├── persistence.py
├── schemas.py
├── report.py
├── video_test.py
└── drone_sound_model.h5
This separation is technically strong because it keeps:
- video-audio extraction in preprocess.py,
- spectrogram preparation in features.py,
- model loading and prediction in classifier.py,
- temporal smoothing in persistence.py,
- data contracts in schemas.py,
- reproducible reporting in report.py.
That is substantially more defensible than keeping the whole sound branch as a single notebook.
4. Training Notebook Findings
The notebook explicitly trains a binary drone-versus-non-drone classifier.
Its retained preprocessing assumptions are:
SAMPLE_RATE = 22050
DURATION = 2 # seconds
SAMPLES_PER_TRACK = SAMPLE_RATE * DURATION
Feature extraction is based on Mel spectrograms:
import librosa
import numpy as np

mel_spec = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=128)
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
log_mel_spec = (log_mel_spec - log_mel_spec.min()) / (
    log_mel_spec.max() - log_mel_spec.min()
)
This means the notebook does not learn directly from raw waveform samples. It learns from a normalized time-frequency image representation of the sound.
The notebook also uses overlapping chunk generation:
- clip length: 2.0 s
- overlap: 50%
- hop: 1.0 s
These assumptions are now reflected in the cleaned runtime defaults.
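A minimal sketch of that chunking, assuming a mono waveform y already loaded at 22.05 kHz; chunk_signal and its constants are illustrative names, not code taken from the notebook:
import numpy as np

SAMPLE_RATE = 22050
WINDOW_S = 2.0
HOP_S = 1.0

def chunk_signal(y, sr=SAMPLE_RATE, window_s=WINDOW_S, hop_s=HOP_S):
    """Yield (start_time_s, chunk) pairs of fixed-length overlapping windows."""
    window = int(sr * window_s)
    hop = int(sr * hop_s)
    if len(y) < window:
        # a clip shorter than one window is zero-padded to full length
        y = np.pad(y, (0, window - len(y)))
    for start in range(0, len(y) - window + 1, hop):
        yield start / sr, y[start:start + window]
Using a 1.0 s hop on a 2.0 s window reproduces the 50% overlap described above.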
5. CNN Classifier Design
The notebook defines a compact convolutional neural network:
from tensorflow.keras import layers, models

def create_model(input_shape):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid')
    ])
    return model
This is a reasonable thesis-stage architecture because:
- convolutions can learn local spectral-temporal patterns from rotor noise,
- pooling makes the representation less brittle to small local shifts,
- dropout helps limit overfitting,
- a sigmoid output matches the binary decision target directly.
The important repository fact is that this model is the one saved as:
model.save('drone_sound_model.h5')
That artifact is now the only active sound model described by current runtime code and documentation.
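A minimal sketch of loading that artifact with standard Keras tooling; the relative path is illustrative:
from tensorflow import keras

# load the single active sound model artifact
model = keras.models.load_model("src/audio/drone_sound_model.h5")
model.summary()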
6. Training Procedure
The retained notebook uses:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

history = model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=32,
    validation_split=0.2
)
This training procedure is defensible because:
- class balance is preserved in the split,
- test data is held out from fitting,
- validation is monitored during training,
- the epoch count is moderate for a baseline CNN,
- the batch size is conventional and stable.
The notebook also evaluates the model with classification metrics, but the retained file in this repository does not preserve the final numeric outputs in a way that can be cited confidently here. The training process is preserved; the final reported score table is not.
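If the held-out split is re-run, those metrics could be regenerated roughly as follows, assuming scikit-learn is available; this is a sketch, not a record of the notebook's reported numbers:
from sklearn.metrics import classification_report

# threshold the sigmoid outputs at 0.5 to obtain binary labels
y_prob = model.predict(X_test).ravel()
y_pred = (y_prob >= 0.5).astype(int)

print(classification_report(y_test, y_pred, target_names=["non-drone", "drone"]))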
7. Runtime Inference Pipeline
The cleaned inference entrypoint is:
PYTHONPATH=src python -m audio.video_test /path/to/video.mp4
The runtime path is:
- extract audio from the input video with ffmpeg,
- resample to 22.05 kHz mono,
- slice audio into overlapping 2.0 s windows,
- convert each window into a normalized 128-bin log-Mel spectrogram,
- load drone_sound_model.h5,
- run CNN inference per window,
- write timestamped predictions and a text log.
The extraction command in preprocess.py remains explicit:
command = [
    "ffmpeg",
    "-y",
    "-i",
    str(video_path),
    "-vn",
    "-ac",
    "1",
    "-ar",
    str(target_sr),
    "-c:a",
    "pcm_s16le",
    str(wav_path),
]
That is appropriate for a reproducible offline pipeline.
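A minimal sketch of how such a command list is typically executed from Python; run_extraction is a hypothetical wrapper rather than the exact function in preprocess.py:
import subprocess

def run_extraction(command):
    """Run the ffmpeg command and fail loudly if extraction does not succeed."""
    subprocess.run(command, check=True, capture_output=True)
Using check=True turns a non-zero ffmpeg exit status into an exception, so a failed extraction cannot silently produce an empty prediction run.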
8. Runtime Preprocessing Alignment
The most important sound-module correction made in the current repository state is alignment between training and inference.
The runtime now uses:
- TARGET_SR = 22050
- default window_s = 2.0
- default hop_s = 1.0
- per-window log-Mel normalization
- a single CNN prediction path
This matters because mismatched runtime preprocessing is one of the most common reasons sound classifiers perform poorly after being moved out of notebooks.
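A minimal sketch of per-window feature preparation that mirrors the training-time normalization, assuming librosa is available at runtime; window_to_features is a hypothetical helper, and the small epsilon guard for silent windows is an addition of this sketch:
import librosa
import numpy as np

def window_to_features(chunk, sr=22050, n_mels=128):
    """Convert one audio window into the normalized log-Mel input expected by the CNN."""
    mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # epsilon guards against zero dynamic range in a silent window
    log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-9)
    # add batch and channel axes for the Conv2D input
    return log_mel[np.newaxis, ..., np.newaxis]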
9. Prediction Semantics
Each window produces an AudioSegmentPrediction:
from dataclasses import dataclass

@dataclass(frozen=True)
class AudioSegmentPrediction:
    start_s: float
    end_s: float
    model_probability: float
    audio_score: float
    label: str
    model_notes: tuple[str, ...] = ()
This schema is simple and correct for the current branch:
- start_s and end_s preserve temporal alignment,
- model_probability records the CNN output,
- audio_score remains available for later fusion logic,
- label is the thresholded binary decision.
Because there is only one active model, audio_score is currently identical to
the CNN probability. That is a defensible design because it keeps the interface
stable for later multimodal fusion work.
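A minimal sketch of mapping one window's output onto that schema, assuming a 0.5 decision threshold; the helper name and label strings are illustrative:
def to_prediction(start_s, end_s, probability, threshold=0.5):
    """Wrap one CNN window probability in the shared prediction schema."""
    return AudioSegmentPrediction(
        start_s=start_s,
        end_s=end_s,
        model_probability=probability,
        audio_score=probability,  # single active model: score mirrors the CNN output
        label="drone" if probability >= threshold else "no_drone",
    )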
10. Temporal Confirmation
The sound module supports optional M/N temporal confirmation:
PYTHONPATH=src python -m audio.video_test /path/to/video.mp4 --confirm-m 3 --confirm-n 5
This is useful because isolated high-confidence windows should not always become confirmed alerts in a multimodal system.
The implementation preserves the raw per-window probability while allowing the
final label to be stabilized over a sliding decision window.
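A minimal sketch of M-of-N confirmation over a sliding decision window, shown here to illustrate the idea rather than to reproduce persistence.py:
from collections import deque

def confirm_m_of_n(window_decisions, m=3, n=5):
    """Mark a window confirmed when at least m of the last n raw decisions are positive."""
    recent = deque(maxlen=n)
    confirmed = []
    for decision in window_decisions:
        recent.append(decision)
        confirmed.append(sum(recent) >= m)
    return confirmed
For example, confirm_m_of_n([0, 1, 1, 0, 1, 1], m=3, n=5) only starts returning True once three of the most recent five windows are positive, so a single spurious high-confidence window is not promoted to a confirmed alert.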
11. Reporting And Logging
The module produces a plain-text report and writes a timestamped run log under:
Project v1/results/logs/
Each run records:
- video path,
- segment count,
- window and hop settings,
- threshold,
- active model artifact path,
- per-segment probabilities and labels.
This is good engineering practice because it preserves experiment evidence in a reviewable format.
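A minimal sketch of appending one run's evidence as a plain-text log, assuming the prediction objects described in Section 9; file naming and field layout are illustrative:
from datetime import datetime
from pathlib import Path

def write_run_log(log_dir, video_path, settings, predictions):
    """Persist one run's settings and per-segment results as a plain-text log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"audio_run_{stamp}.log"
    with log_path.open("w") as log:
        log.write(f"video: {video_path}\n")
        log.write(f"settings: {settings}\n")
        for p in predictions:
            log.write(f"{p.start_s:.1f}-{p.end_s:.1f}s p={p.model_probability:.3f} {p.label}\n")
    return log_path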
12. Strengths And Limitations
Strengths:
- the branch is now centered on one clearly identified trained artifact,
- training and runtime preprocessing are aligned,
- the package layout is clean and teammate-readable,
- video-based inference is reproducible,
- logs and temporal confirmation make the branch more practical.
Limitations:
- the repository does not preserve final notebook evaluation numbers cleanly,
- the audio branch is still offline rather than live-streaming,
- TensorFlow is required to load the .h5 model,
- a dedicated training script has not yet replaced the notebook workflow.
13. Final Defense Of The Sound Module
The sound module is technically justified at this stage of HDDS because it adds an independent sensing branch that can later confirm or reject ambiguous radar or vision events.
The design choices are defensible:
- log-Mel spectrograms are standard for sound classification,
- a compact CNN is appropriate for limited-resource thesis work,
- 2.0 s overlapping windows give meaningful temporal resolution,
- explicit logging and post-processing improve reproducibility,
- the cleaned package makes the branch maintainable by teammates.
The most important project correction is conceptual consistency. The current
sound paper, runtime code, and configuration now point to the same model and
the same preprocessing assumptions: drone_sound_model.h5 trained from
drone_thesis_audio_training.ipynb.