Detailed Research Report For The Radar, Vision, And Sound Modules
Abstract
This document records, in research-paper style, what was implemented for the Hybrid Drone Detection System (HDDS) repository at the current stage of the project. It is intentionally detailed and written for teammates who need to understand not only what each module does, but also where the code lives, how the data flows, which assumptions were encoded, and which implementation details may affect future work.
The project currently has two distinct layers:
- A cleaned academic package under Project v1/, where the radar module is the technical core and the sound module is the first clean multimodal branch.
- An integrated dashboard under App/, where radar, sound, and vision are synchronized for scenario playback and simple decision fusion.
This distinction is important. The cleaned Project v1 package is the best
starting point for reproducible thesis work. The App/ layer is the practical
integration surface that still contains some legacy and compatibility behavior,
especially for vision and audio.
Repository Scope And What Was Actually Built
At the time of writing, the main relevant paths are:
Project v1/
├── configs/
├── docs/
├── src/radar_sim/
└── src/audio/
App/
├── main.py
└── scenario_loader.py
The implemented work can be summarized as follows:
- The radar module is a software-only passive-radar-inspired simulation chain.
- The sound module is an offline inference pipeline that extracts audio from video, slices it into windows, computes learned scores, and emits timestamped predictions.
- The vision module is currently implemented inside the integrated app, where a YOLOv5 detector is applied to video frames and the results are synchronized with audio and radar status.
- The app performs a simple fusion rule over the three modalities.
The remainder of this document is organized as three detailed reports:
- Radar Module Report
- Vision Module Report
- Sound Module Report
Report I: Radar Module
1. Research Goal
The radar module was developed as the most physically grounded part of the project. Its purpose is not to drive a real radar sensor, but to reproduce the core detection physics in software so that the project can be defended as a signal-processing system before hardware integration.
The radar branch studies:
- reference waveform generation,
- surveillance-channel synthesis,
- delay and Doppler target injection,
- direct-path and clutter contamination,
- range-Doppler formation,
- CA-CFAR detection,
- scenario-based validation and reporting.
This module lives in Project v1/src/radar_sim/.
2. Files And Responsibilities
The radar implementation is split cleanly by responsibility:
- constants.py: radar, CFAR, scenario, and application configuration objects.
- waveform.py: OFDM-like reference waveform generation.
- channel.py: surveillance-channel simulation, including direct path, clutter, target echoes, and noise.
- processing.py: cyclic-prefix removal, range processing, clutter suppression, and range-Doppler map formation.
- geometry.py: conversions between delay, range, Doppler, and velocity.
- detection.py: 2D CA-CFAR thresholding and compact detection extraction.
- metrics.py: truth generation and truth-matching.
- runner.py: scenario execution, report printing, and logging.
- scenarios.py: scenario lookup from configuration.
- realtime.py: dynamic moving-target simulation mode.
- tui.py: interactive scenario configuration.
This is a strong design choice because it separates physics, configuration, processing, and presentation instead of mixing them into one script.
3. Radar Configuration And Operating Assumptions
The default operating parameters are stored in
Project v1/configs/default.yaml. The important defaults are:
radar:
carrier_frequency_hz: 2400000000.0
sample_rate_hz: 20000000.0
pri_s: 0.00025
num_pulses: 128
num_subcarriers: 256
cyclic_prefix: 64
cfar:
guard_cells_range: 2
guard_cells_doppler: 2
train_cells_range: 8
train_cells_doppler: 8
pfa: 1.0e-5
min_abs_velocity_mps: 1.0
These parameters define the simulated sensing region:
- the carrier frequency is assumed to be 2.4 GHz,
- the wavelength is derived from the speed of light and the carrier frequency,
- slow-time sampling is controlled by the PRI,
- fast-time resolution is controlled by sample_rate_hz,
- the OFDM-like symbol uses 256 subcarriers plus a 64-sample cyclic prefix,
- the CFAR detector suppresses near-zero-velocity bins below 1.0 m/s.
That last point matters. The detector is intentionally biased against stationary or nearly stationary returns in order to reduce clutter-like alarms.
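To make the derived quantities concrete, the short arithmetic sketch below recomputes them from the defaults above. The variable names are illustrative and not taken from constants.py; the velocity conversion follows the same wavelength/2 convention used later in geometry.py.

SPEED_OF_LIGHT_MPS = 299_792_458.0
carrier_frequency_hz = 2.4e9
sample_rate_hz = 20e6
pri_s = 0.25e-3
num_pulses = 128

wavelength_m = SPEED_OF_LIGHT_MPS / carrier_frequency_hz              # ~0.125 m
range_bin_m = SPEED_OF_LIGHT_MPS / sample_rate_hz                     # ~15 m of bistatic range per fast-time sample
doppler_resolution_hz = 1.0 / (num_pulses * pri_s)                    # 31.25 Hz
velocity_resolution_mps = doppler_resolution_hz * wavelength_m / 2.0  # ~1.95 m/s
max_unambiguous_doppler_hz = 1.0 / (2.0 * pri_s)                      # +/- 2 kHz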
4. Waveform Generation
The reference waveform is generated in waveform.py. The core implementation
is short and clear:
def generate_qpsk_symbols(num_subcarriers: int, rng: np.random.Generator) -> np.ndarray:
bits_i = rng.choice((-1.0, 1.0), size=num_subcarriers)
bits_q = rng.choice((-1.0, 1.0), size=num_subcarriers)
return (bits_i + 1j * bits_q) / np.sqrt(2.0)
def generate_ofdm_symbol(config: RadarConfig, rng: np.random.Generator) -> np.ndarray:
freq_symbols = generate_qpsk_symbols(config.num_subcarriers, rng)
time_symbol = np.fft.ifft(freq_symbols)
time_symbol /= np.sqrt(np.mean(np.abs(time_symbol) ** 2))
cyclic_prefix = time_symbol[-config.cyclic_prefix :]
return np.concatenate((cyclic_prefix, time_symbol))
This code shows four important implementation decisions:
- QPSK is used for subcarrier population.
- The symbol is generated in the frequency domain and transformed by IFFT.
- The time-domain symbol is normalized to unit average power.
- A cyclic prefix is copied from the end of the useful symbol.
The pulse matrix is then created by repeating that process across
config.num_pulses. In practice, this produces a reference matrix with shape:
(num_pulses, samples_per_pulse)
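A minimal sketch of that repetition step, assuming the real generate_reference_matrix in waveform.py simply repeats the symbol generator excerpted above:

import numpy as np

def generate_reference_matrix(config, rng: np.random.Generator) -> np.ndarray:
    # One independently drawn OFDM-like symbol (with cyclic prefix) per pulse.
    pulses = [generate_ofdm_symbol(config, rng) for _ in range(config.num_pulses)]
    return np.stack(pulses, axis=0)  # shape: (num_pulses, samples_per_pulse)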
5. Surveillance-Channel Model
The radar channel model is implemented in channel.py. The code below is one
of the most important excerpts in the entire radar branch:
def synthesize_target_echo(
reference_pulse: np.ndarray,
target: Target,
pulse_index: int,
config: RadarConfig,
) -> np.ndarray:
delayed = apply_fractional_delay(
signal=reference_pulse,
delay_s=target.delay_s,
fs_hz=config.sample_rate_hz,
)
time_axis = np.arange(reference_pulse.size) / config.sample_rate_hz
slow_time = pulse_index * config.pri_s
phase = np.exp(1j * 2.0 * np.pi * target.doppler_hz * (time_axis + slow_time))
return target.amplitude_linear * delayed * phase
This is the target model in concrete form:
- apply_fractional_delay(...) produces a delayed version of the reference pulse, which means targets are not limited to integer-sample delays (see the sketch after this list),
- the Doppler effect is modeled as a complex exponential phase progression over both fast time and slow time,
- the target amplitude is applied in linear scale through target.amplitude_linear.
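The helper apply_fractional_delay(...) is not excerpted in this report. One common way to realize a non-integer-sample delay, and a plausible reading of what the helper does, is a frequency-domain phase ramp; the sketch below is illustrative and may differ from the repository implementation:

import numpy as np

def apply_fractional_delay(signal: np.ndarray, delay_s: float, fs_hz: float) -> np.ndarray:
    # A linear phase ramp in the frequency domain shifts the signal by a
    # (possibly fractional) number of samples in the time domain.
    freqs_hz = np.fft.fftfreq(signal.size, d=1.0 / fs_hz)
    phase_ramp = np.exp(-1j * 2.0 * np.pi * freqs_hz * delay_s)
    return np.fft.ifft(np.fft.fft(signal) * phase_ramp)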
The full surveillance channel is then assembled as:
direct_path_amplitude = db_to_linear_amplitude(config.direct_path_amplitude_db)
clutter_amplitude = db_to_linear_amplitude(scenario.clutter_amplitude_db)
noise_sigma = np.sqrt(db_to_linear_power(scenario.noise_power_db) / 2.0)
and later:
for pulse_idx in range(num_pulses):
pulse = direct_path_amplitude * reference[pulse_idx]
pulse = pulse + stationary_clutter
for target in scenario.targets:
pulse = pulse + synthesize_target_echo(...)
noise = noise_sigma * (
rng.standard_normal(pulse_len) + 1j * rng.standard_normal(pulse_len)
)
surveillance[pulse_idx] = pulse + noise
This means each pulse contains:
- direct-path leakage,
- a fixed stationary clutter realization,
- one or more delayed/Doppler-shifted target echoes,
- additive complex Gaussian noise.
From a thesis perspective, this is a well-structured compromise between full RF hardware realism and mathematically controlled simulation.
6. Range And Doppler Processing
Signal processing is handled by processing.py. The most important processing
step is:
def form_range_profiles(
reference: np.ndarray, surveillance: np.ndarray, config: RadarConfig
) -> np.ndarray:
ref_symbols = remove_cyclic_prefix(reference, config)
surv_symbols = remove_cyclic_prefix(surveillance, config)
ref_freq = np.fft.fft(ref_symbols, axis=1)
surv_freq = np.fft.fft(surv_symbols, axis=1)
matched = surv_freq / (ref_freq + config.epsilon)
return np.fft.ifft(matched, axis=1)
This function is central because it reveals the passive-radar-inspired philosophy used in the project:
- the cyclic prefix is removed first,
- both channels are transformed to frequency domain,
- the surveillance channel is divided by the reference channel,
- then the result is transformed back by IFFT to produce range profiles.
That division is not a generic DSP step added by habit. It specifically attempts to suppress random subcarrier modulation so that the useful structure left in the matched result is more directly related to delay.
After range processing, stationary content is suppressed:
def suppress_stationary_components(range_profiles: np.ndarray) -> np.ndarray:
stationary_estimate = np.mean(range_profiles, axis=0, keepdims=True)
return range_profiles - stationary_estimate
This is a simple but effective clutter-cancellation step. It assumes that stationary components remain similar across pulses and can therefore be removed through pulse-mean subtraction.
Finally, the range-Doppler map is formed:
def form_range_doppler_map(
filtered_profiles: np.ndarray, config: RadarConfig
) -> np.ndarray:
slow_time_window = np.hamming(config.num_pulses)[:, np.newaxis]
fast_time_window = np.hamming(filtered_profiles.shape[1])[np.newaxis, :]
windowed = filtered_profiles * slow_time_window * fast_time_window
return np.fft.fftshift(np.fft.fft(windowed, axis=0), axes=0)
Two window functions are applied before the slow-time FFT:
- a Hamming window across pulses,
- another Hamming window across fast-time bins.
This reduces sidelobes and produces a cleaner map for CFAR thresholding.
7. Geometry And Physical Interpretation
The radar branch does not keep everything in abstract bins. geometry.py
converts them into physical quantities:
def delay_to_bistatic_range_m(delay_s: float, config: RadarConfig) -> float:
return config.speed_of_light * delay_s
def doppler_hz_to_velocity_mps(doppler_hz: float, config: RadarConfig) -> float:
return doppler_hz * config.wavelength_m / 2.0
This is important for two reasons:
- detections can be discussed in meters and meters per second rather than only in bin indices,
- the system stays explainable to supervisors and teammates who care about the physical meaning of the outputs.
The axis builders build_range_axis_m(...), build_doppler_axis_hz(...), and build_velocity_axis_mps(...) ensure that plots and reports use interpretable axes.
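As an illustration, the hedged sketch below shows bodies that are consistent with the conversion formulas above; the actual implementations in geometry.py may differ in detail:

import numpy as np

def build_range_axis_m(num_range_bins: int, config) -> np.ndarray:
    # Each fast-time bin corresponds to one sample of bistatic delay.
    delays_s = np.arange(num_range_bins) / config.sample_rate_hz
    return config.speed_of_light * delays_s

def build_doppler_axis_hz(config) -> np.ndarray:
    # fftshift matches the fftshifted slow-time FFT used for the range-Doppler map.
    return np.fft.fftshift(np.fft.fftfreq(config.num_pulses, d=config.pri_s))

def build_velocity_axis_mps(config) -> np.ndarray:
    return build_doppler_axis_hz(config) * config.wavelength_m / 2.0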
8. CFAR Detection
Detection is implemented in detection.py using 2D cell-averaging CFAR. The
core logic is:
noise_estimate = signal.convolve2d(
power_map,
kernel / num_train,
mode="same",
boundary="symm",
)
alpha = num_train * ((config.pfa ** (-1.0 / num_train)) - 1.0)
threshold_map = alpha * noise_estimate
detection_mask = power_map > threshold_map
This produces an adaptive threshold based on local neighborhood energy rather than a fixed global threshold. That is the right choice when clutter and noise conditions are allowed to vary.
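The excerpt relies on a kernel and a num_train value that are built elsewhere in detection.py. A plausible construction, assuming the usual CA-CFAR layout of a training band surrounding a guard band and the cell under test, is sketched below; the parameter names mirror the cfar block in configs/default.yaml, and the sketch is not the repository code:

import numpy as np

def build_cfar_kernel(guard_r: int, guard_d: int, train_r: int, train_d: int):
    # Training-plus-guard half extents in Doppler (rows) and range (columns).
    half_d = guard_d + train_d
    half_r = guard_r + train_r
    kernel = np.ones((2 * half_d + 1, 2 * half_r + 1))
    # Exclude the guard band and the cell under test from the noise average.
    kernel[half_d - guard_d : half_d + guard_d + 1,
           half_r - guard_r : half_r + guard_r + 1] = 0.0
    num_train = int(kernel.sum())
    return kernel, num_train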
The detector then explicitly suppresses near-zero-velocity bins:
stationary_bins = np.abs(velocity_axis_mps) < config.min_abs_velocity_mps
if np.any(stationary_bins):
detection_mask = detection_mask.copy()
detection_mask[stationary_bins, :] = False
This is a deliberate design decision. The project is not just looking for any bright return. It is biased toward moving targets that fit the drone-detection problem.
After thresholding, connected regions are labeled and compacted into one peak per region:
labels, num_regions = ndimage.label(detection_mask)
...
doppler_bin, range_bin = np.unravel_index(
np.argmax(region_power), region_power.shape
)
This reduces fragmented masks into practical output detections with:
- range bin,
- Doppler bin,
- range in meters,
- Doppler in Hz,
- velocity in m/s,
- peak power in dB.
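A hedged sketch of the compact detection record implied by that list; the field names are illustrative, and the real structure in detection.py may differ:

from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    range_bin: int
    doppler_bin: int
    range_m: float
    doppler_hz: float
    velocity_mps: float
    peak_power_db: float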
9. Scenario Execution And Evaluation
runner.py provides the clean non-GUI experiment path. The main execution
sequence is:
reference = generate_reference_matrix(config=config.radar, rng=rng)
surveillance = simulate_surveillance_matrix(...)
processing = process_reference_and_surveillance(...)
detections = ca_cfar_2d(...)
truths = scenario_truth(scenario, config.radar)
This design is valuable because it gives a single, readable experiment chain for teammates to reuse in tests or papers.
Truth matching is handled in metrics.py:
range_ok = abs(detection.range_m - truth.range_m) <= range_tolerance_m
doppler_ok = abs(detection.doppler_hz - truth.doppler_hz) <= doppler_tolerance_hz
if range_ok and doppler_ok:
return True
The current comparison tolerances are not embedded in the detector itself. They are evaluation-layer choices, which is the correct separation of concerns.
10. Scenario Definitions
The default scenarios in configs/default.yaml include:
- clear_sky
- single_slow
- single_fast
- two_targets
- low_snr
This gives the radar branch both sanity cases and stress cases:
- no targets,
- single target,
- multi-target,
- weak target at lower SNR.
The integrated app also supports richer JSON-driven scenarios through
App/scenario_loader.py, including timeline-based motion.
11. Strengths, Limitations, And Teammate Notes
Strengths:
- physically interpretable delay/Doppler model,
- clear module separation,
- reproducible seeded simulation,
- adaptive CFAR detection,
- scenario and truth evaluation support,
- multiple front ends: runner, TUI, and realtime.
Limitations:
- simulation-only, not hardware-connected,
- no antenna model, no calibration chain, no RF frontend impairments,
- clutter model is simplified and stationary,
- no track-before-detect or multi-frame tracker,
- zero-velocity suppression may hide slow or hovering targets.
Teammate note:
If the thesis requires the cleanest narrative, build from Project v1/src/radar_sim.
If the thesis requires a multimodal demo, use the same radar package through
App/main.py.
Report II: Vision Module
1. Research Goal
The vision module provides the visual detection branch of the multimodal system. Its present role is not to be a polished standalone package; instead, it serves as the real-time frame-level detector used by the integrated hybrid dashboard.
This is an important architectural fact: the current vision branch is mostly an
integration-layer implementation rather than a clean Project v1/src/vision
package.
2. Where The Vision Code Actually Lives
The primary vision logic is in:
- App/main.py
- App/scenario_loader.py
The detector loads YOLOv5 weights from either the active repository root or the archive fallback:
YOLO_DIR = _first_existing_path(
REPO_ROOT / "yolov5",
REPO_ROOT / "archive" / "repo_cleanup_20260505" / "yolov5",
)
YOLO_WEIGHTS = _first_existing_path(
YOLO_DIR / "best.pt",
REPO_ROOT / "archive" / "repo_cleanup_20260505" / "yolov5" / "best.pt",
)
This tells teammates two things immediately:
- the detector depends on a local YOLOv5 codebase and a trained best.pt artifact,
- the app is resilient to repo cleanup because it can fall back to archived assets.
3. Practical Dependency Note
App/main.py imports:
cv2, torch, tkinter, tensorflow, and librosa.
The cleaned Project v1/requirements.txt includes opencv-python-headless and
ultralytics, but the integrated app still uses torch.hub.load(...) with a
local YOLOv5 repo. Teammates should therefore treat the app as having stricter
runtime dependencies than the cleaned package alone.
4. Scenario Pairing And Input Format
The integrated app expects scenario bundles with matching stems:
- video file,
- audio file,
- radar JSON file.
That logic is implemented in scenario_loader.py:
for stem, parts in by_stem.items():
if {"video", "audio", "radar"} <= parts.keys():
bundles.append(
ScenarioBundle(
name=stem,
video_path=parts["video"],
audio_path=parts["audio"],
radar_path=parts["radar"],
)
)
This means the app is not designed to run vision alone. It is designed to consume synchronized multimodal scenario sets.
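The by_stem grouping itself is built from file suffixes. A minimal sketch of that pairing step, with an assumed suffix-to-modality map, could look like this (the real scenario_loader.py may accept other extensions):

from pathlib import Path

# Assumed suffix-to-modality map; illustrative only.
SUFFIX_KIND = {".mp4": "video", ".avi": "video", ".wav": "audio", ".mp3": "audio", ".json": "radar"}

def group_by_stem(scenario_dir: Path) -> dict[str, dict[str, Path]]:
    by_stem: dict[str, dict[str, Path]] = {}
    for path in sorted(scenario_dir.iterdir()):
        kind = SUFFIX_KIND.get(path.suffix.lower())
        if kind is not None:
            by_stem.setdefault(path.stem, {})[kind] = path
    return by_stem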
5. Model Loading Strategy
At runtime, the app prepares the YOLO detector as follows:
if self.yolo_model is None:
if not YOLO_WEIGHTS.exists():
raise FileNotFoundError(f"YOLOv5 weights not found: {YOLO_WEIGHTS}")
self.yolo_model = torch.hub.load(
str(YOLO_DIR),
"custom",
path=str(YOLO_WEIGHTS),
source="local",
device="cpu",
)
This shows several important implementation choices:
- a custom-trained model is expected, not only a stock pretrained one,
- the model is loaded from local files, not from the network,
- inference is forced onto CPU,
- the app fails early if the trained weights are missing.
From a deployment perspective, CPU execution is convenient for demos but can be slower on high-frame-rate or high-resolution video.
6. Video Processing Loop
The app opens the selected video with OpenCV:
self.cap = cv2.VideoCapture(str(self.bundle.video_path))
It then computes effective timing from the actual file metadata:
fps = self.cap.get(cv2.CAP_PROP_FPS)
self.video_fps = fps if fps and fps > 1 else 25.0
frame_count = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
This is a useful robustness measure because some video files expose incomplete
metadata. The code falls back to 25.0 FPS when needed.
The per-frame processing logic is not naive. It tries to keep visual playback aligned to wall-clock time:
elapsed_s = max(0.0, time.monotonic() - self.playback_started_at)
target_frame = int(elapsed_s * self.video_fps)
if target_frame > self.frame_index + 1:
self.cap.set(cv2.CAP_PROP_POS_FRAMES, target_frame)
self.frame_index = target_frame
This means the system can skip ahead if processing falls behind real-time playback. That is a practical synchronization mechanism for the multimodal dashboard.
7. Vision Inference Logic
The actual visual detection routine is:
def _run_vision(self, frame: np.ndarray) -> None:
self.last_vision_boxes = []
self.last_vision_detected = False
if self.yolo_model is None:
return
results = self.yolo_model(frame)
for *xyxy, conf, _cls in results.xyxy[0]:
confidence = float(conf)
if confidence < VISION_CONFIDENCE:
continue
x1, y1, x2, y2 = [int(value) for value in xyxy]
self.last_vision_boxes.append((x1, y1, x2, y2, confidence))
self.last_vision_detected = bool(self.last_vision_boxes)
Here the vision module performs:
- full-frame YOLO inference,
- confidence thresholding,
- bounding-box extraction,
- a binary detection decision based on whether any valid boxes remain.
Two constants define the default behavior near the top of App/main.py:
VISION_CONFIDENCE = 0.5
VISION_EVERY_N_FRAMES = 3
This means:
- only predictions with confidence at least 0.5 are accepted,
- vision inference is run every third frame unless a sync jump forces a refresh.
That second choice reduces CPU cost. The module is therefore designed for stable multimodal playback rather than maximum per-frame detection density.
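A minimal sketch of that cadence rule, with assumed parameter and function names:

def should_run_vision(frame_index: int, forced_refresh: bool, every_n: int = 3) -> bool:
    # Inference on every Nth frame keeps CPU cost bounded; a sync jump forces a refresh.
    return forced_refresh or (frame_index % every_n == 0)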
8. Output Rendering
Accepted detections are rendered back onto the frame:
for x1, y1, x2, y2, confidence in self.last_vision_boxes:
cv2.rectangle(frame, (x1, y1), (x2, y2), (34, 197, 94), 2)
cv2.putText(
frame,
f"drone {confidence:.2f}",
(x1, max(20, y1 - 8)),
cv2.FONT_HERSHEY_SIMPLEX,
0.6,
(34, 197, 94),
2,
)
This part is simple, but it has real value for teammate understanding and demo communication. It turns model outputs into immediately visible evidence.
9. Relation To The Radar And Audio Branches
The vision branch does not operate in isolation. It is a synchronized component inside a common playback loop:
- video frames come from OpenCV,
- audio decisions come from the audio predictor,
- radar status is refreshed according to scenario time,
- all three are combined in the dashboard state.
This is the correct way to think about the current vision module: it is a multimodal decision source, not just a detector script.
10. Current Strengths And Weaknesses
Strengths:
- practical frame-level drone detection,
- uses trained YOLO weights rather than placeholder logic,
- synchronized with audio and radar,
- skips frames intelligently to maintain timing,
- can visualize detections directly.
Weaknesses:
- not yet cleaned into its own Project v1/src/vision package,
- depends on external trained weights,
- depends on local YOLOv5 repository layout,
- CPU-only in current app code,
- current binary decision rule is simply “any valid box means detected.”
Teammate note:
If the team wants a publishable software architecture, the next refactor should
extract the vision branch from App/main.py into a standalone module with its
own configuration, schema objects, and batch/offline test entrypoint.
Report III: Sound Module
1. Research Goal
The sound module is the first clean multimodal branch added to Project v1.
Its job is to turn audio extracted from a video into timestamped drone
probabilities that can later be fused with radar and vision signals.
The module is explicitly offline and file-based. It does not stream from a live microphone. This is consistent with the repository’s thesis direction: first achieve a reproducible multimodal pipeline, then consider live hardware later.
2. Files And Responsibilities
The sound implementation lives in Project v1/src/audio/:
- preprocess.py: extraction, loading, slicing, and parameter validation.
- features.py: log-Mel spectrogram preparation for the CNN.
- classifier.py: .h5 model loading and probability prediction.
- persistence.py: M/N temporal confirmation logic.
- schemas.py: data structures for windows, predictions, and fusion events.
- report.py: textual reporting and log generation.
- video_test.py: main CLI entrypoint for inference on video files.
- drone_sound_model.h5: the active Keras CNN artifact.
This structure is clean and teammate-friendly. It keeps the sound branch much more maintainable than an all-in-one notebook workflow.
3. Runtime Entry Point
The clean CLI entrypoint is:
PYTHONPATH=src python -m audio.video_test /path/to/video.mp4
The run_video_inference(...) function in video_test.py coordinates the full
pipeline:
loaded_models = load_audio_models()
y, sr = load_audio_from_video(video_path, target_sr=TARGET_SR)
windows = slice_audio(y=y, sr=sr, window_s=window_s, hop_s=hop_s)
predictions = [
predict_window(
window=window,
sr=sr,
loaded_models=loaded_models,
threshold=threshold,
)
for window in windows
]
That block is the sound-module equivalent of the radar runner. It clearly shows the full inference path:
- load compatible models,
- extract audio from video,
- resample and slice into windows,
- run prediction for each window,
- optionally post-process with temporal confirmation,
- optionally plot and log the results.
4. Audio Extraction And Preprocessing
Audio extraction is implemented in preprocess.py and relies on ffmpeg:
command = [
"ffmpeg",
"-y",
"-i",
str(video_path),
"-vn",
"-ac",
"1",
"-ar",
str(target_sr),
"-c:a",
"pcm_s16le",
str(wav_path),
]
This command makes the preprocessing assumptions explicit:
- the video stream is ignored,
- audio is converted to mono,
- audio is resampled to the target rate,
- output is written as PCM WAV for stable downstream loading.
The default target sampling rate is:
TARGET_SR = 22050
This matches the retained notebook used to train the active model.
5. Windowing Strategy
Audio is segmented into overlapping windows by slice_audio(...):
window_samples = max(1, int(round(window_s * sr)))
hop_samples = max(1, int(round(hop_s * sr)))
The function also includes an important practical behavior:
if total_duration - last_end > 0.25 * window_s:
windows.append(
AudioWindow(
start_s=max(0.0, total_duration - window_s),
end_s=total_duration,
samples=samples[-window_samples:],
)
)
This avoids silently discarding a meaningful tail segment near the end of the audio file.
6. Feature Representation
The current sound module uses the notebook-aligned CNN preprocessing path.
features.py converts each audio window into a normalized log-Mel spectrogram:
mel_spec = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=128)
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
The runtime also matches the notebook window assumptions:
- sample rate: 22.05 kHz,
- window length: 2.0 s,
- hop length: 1.0 s,
- normalization: per-window min-max scaling on the log-Mel image.
This is the correct representation to describe because the retained notebook
saved only one active model artifact: drone_sound_model.h5.
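A hedged sketch of the per-window min-max step listed above; the epsilon guard and the final (1, n_mels, frames, 1) layout are assumptions about the CNN input, not excerpts from features.py:

import numpy as np

def normalise_log_mel(log_mel_spec: np.ndarray) -> np.ndarray:
    # Scale the window's log-Mel image into [0, 1]; the epsilon guards silent windows.
    lo, hi = float(log_mel_spec.min()), float(log_mel_spec.max())
    scaled = (log_mel_spec - lo) / (hi - lo + 1e-8)
    # Add batch and channel axes so the Keras CNN sees an image-like tensor.
    return scaled[np.newaxis, :, :, np.newaxis]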
7. Model Loading
The loader in classifier.py now has one active responsibility:
cnn_model = tf.keras.models.load_model(model_path, compile=False)
This makes the sound branch clearer and easier to defend. The current package, the notebook, and the saved artifact all point to the same CNN path.
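The probability call used later in predict_window(...) is not excerpted. A plausible minimal form, assuming a single sigmoid output unit, is:

def _predict_probability(cnn_model, cnn_input) -> float:
    # A single sigmoid output unit yields one value in [0, 1] per input window.
    return float(cnn_model.predict(cnn_input, verbose=0).ravel()[0])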
8. Per-Window Prediction Semantics
The final prediction object is built in predict_window(...):
model_probability = _predict_probability(loaded_models.cnn_model, cnn_input)
audio_score = model_probability
label = "drone" if audio_score >= threshold else "no_drone"
The corresponding schema in schemas.py is:
@dataclass(frozen=True)
class AudioSegmentPrediction:
start_s: float
end_s: float
model_probability: float
audio_score: float
label: str
model_notes: tuple[str, ...] = ()
This is one of the best interfaces in the repository because it preserves:
- the time interval,
- the model probability,
- the audio score,
- the final label,
- extra notes about model loading or post-processing.
With only one active audio model, the audio_score is currently the same value
as model_probability. That keeps the schema stable for later multimodal
fusion work.
9. Temporal Confirmation
The sound branch also supports M/N confirmation to suppress isolated spikes:
def apply_m_of_n_confirmation(
predictions: list[AudioSegmentPrediction],
confirm_m: int,
confirm_n: int,
) -> list[AudioSegmentPrediction]:
The core rule is:
window = predictions[max(0, idx - confirm_n + 1) : idx + 1]
positive_count = sum(1 for candidate in window if candidate.label == "drone")
label = "drone" if positive_count >= confirm_m else "no_drone"
This is a useful practical addition because raw audio classifiers often produce isolated high-confidence spikes that should not become confirmed alerts.
10. Reporting And Logging
The sound branch records both summary and per-window results. report.py
builds readable textual output such as:
lines = [
"Audio inference report",
f"Video: {video_path}",
f"Segments: {len(predictions)}",
f"Window (s): {window_s:.2f}",
f"Hop (s): {hop_s:.2f}",
f"Threshold: {threshold:.2f}",
]
Each segment is also printed in detail:
f"{idx:03d}. "
f"{item.start_s:7.2f}-{item.end_s:7.2f}s | "
f"model={item.model_probability:.4f} | "
f"audio={item.audio_score:.4f} | "
f"label={item.label}"
Logs are written through audio.logging_utils.write_run_log(...) into:
Project v1/results/logs/
This is good engineering practice because it preserves evidence from experiments instead of forcing teammates to rely on screenshots or terminal history.
11. Important Difference Between The Clean Audio Module And The App Audio Path
The important repository note is now simpler: the clean audio module and the
active documentation are aligned to the same CNN artifact,
src/audio/drone_sound_model.h5, and the same notebook preprocessing
assumptions.
Teammates should still verify that any integrated app path uses the same sample rate, window size, and threshold before comparing results.
12. Strengths, Limitations, And Teammate Notes
Strengths:
- clean package structure,
- reproducible offline inference pipeline,
- notebook-aligned preprocessing,
- explicit schemas,
- optional temporal confirmation,
- detailed logs and plots.
Limitations:
- file-based only, not live-streaming,
- TensorFlow is required to load the .h5 model,
- final retained notebook metrics are not clearly preserved in the repository,
- app integration should still be checked for preprocessing consistency.
Teammate note:
For experiments, documentation, and reproducibility, prefer the clean
audio.video_test path first. Use the app-specific audio backend only when the
integrated dashboard behavior itself is the target of evaluation.
Cross-Module Integration Summary
Although the reports above describe the branches separately, the integrated app
connects them in one synchronized playback loop. The fusion rule in
App/main.py is intentionally simple:
positives = sum([self.last_vision_detected, audio_detected, radar_detected])
if positives >= 2:
fusion_text, fusion_color = "Confirmed", RED
elif positives == 1:
fusion_text, fusion_color = "Watch", AMBER
else:
fusion_text, fusion_color = "Clear", GREEN
This means the current fusion layer is a decision-level voting rule:
- Confirmed if at least two modalities are positive,
- Watch if exactly one modality is positive,
- Clear if none are positive.
This is not the final scientific fusion solution, but it is a correct and useful integration baseline because:
- it is transparent,
- teammates can inspect it quickly,
- it avoids hidden learned fusion behavior,
- it gives an understandable demo output.
The app also stores time-stamped snapshots for a simple detection history plot, which helps visualize how modality agreement evolves over time.
Reproduction Guide For Teammates
Clean Radar Experiments
cd "/home/mo/dev/python/HDDS2/Project v1"
PYTHONPATH=src python -m radar_sim.runner --scenario single_slow --no-plots
PYTHONPATH=src python -m radar_sim.runner --scenario two_targets
PYTHONPATH=src python -m radar_sim.realtime
Clean Audio Experiments
cd "/home/mo/dev/python/HDDS2/Project v1"
PYTHONPATH=src python -m audio.video_test /path/to/video.mp4
PYTHONPATH=src python -m audio.video_test /path/to/video.mp4 --window-s 2.0 --hop-s 1.0
PYTHONPATH=src python -m audio.video_test /path/to/video.mp4 --confirm-m 3 --confirm-n 5
Integrated Multimodal Demo
cd "/home/mo/dev/python/HDDS2"
python -m App.main
For the integrated app, teammates should verify in advance that:
- matching video/audio/json scenario files exist,
- the YOLOv5 repository and best.pt weights are available,
- the audio model artifacts exist,
- the necessary GUI and ML dependencies are installed.
Final Conclusion
The project already contains a meaningful multimodal foundation.
The radar module is the cleanest and most academically mature part of the system. It has a strong separation of concerns, a physically interpretable processing chain, scenario-based evaluation, and multiple execution modes.
The sound module is the cleanest multimodal branch. It already behaves like a reusable package with explicit schemas, notebook-aligned preprocessing, a single clear model artifact, and practical logging.
The vision module works operationally inside the integrated dashboard and is already useful for multimodal demonstrations, but it is the branch that most needs packaging cleanup and separation from the GUI layer.
Taken together, the codebase shows that the team has already implemented:
- a radar-centered detection backbone,
- a learned audio inference branch,
- a YOLO-based visual branch,
- synchronized multimodal playback,
- and a simple but understandable decision-level fusion layer.
That is the real technical achievement of the current stage of the project.