How It Works

Pipeline Architecture

This page shows how birdbird processes raw video clips into highlights with species identification. The pipeline has two parallel branches: video processing (visual bird detection) and audio processing (song/call identification).

Use the toggles below to switch between Summary and Detail descriptions, and to view either the Video or Audio pipeline.

📹

Raw Video Clips

A motion-capture camera records bird activity, producing many short video clips.

A typical batch can contain several hundred 10-second clips. Downloading or moving clips off the capture device is not handled by the app; its entry point is a folder named in YYYYMMDD format for the date the clips were taken off the device. The clips in that folder may cover several days, and filename timestamps can be unreliable, so the folder name is the source of truth for the batch date. If files are named like DDHHmmss00.avi, that pattern is used to capture the start date of the batch.
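The date handling above can be sketched as two small helpers. Function names and the fallback behaviour are hypothetical, not birdbird's actual code:

```python
from datetime import datetime

def batch_date(folder_name: str) -> datetime:
    """The YYYYMMDD folder name is the source of truth for the batch date."""
    return datetime.strptime(folder_name, "%Y%m%d")

def clip_start(filename: str, year: int, month: int):
    """Parse a DDHHmmss00.avi filename into a start timestamp.

    Year and month come from the batch folder, since the filename only
    carries day, hour, minute and second (plus a trailing '00').
    """
    stem = filename.removesuffix(".avi")
    if len(stem) != 10 or not stem.isdigit():
        return None  # unrecognised naming scheme; fall back to the folder date
    day, hour, minute, second = (int(stem[i:i + 2]) for i in (0, 2, 4, 6))
    try:
        return datetime(year, month, day, hour, minute, second)
    except ValueError:
        return None  # filename digits do not form a valid timestamp
```

A clip named 1509453000.avi in a 20240101 folder would then resolve to 15 Jan, 09:45:30.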

🎬 Hundreds of 10-second clips Tested with AVI files (MJPEG 1440x1080@30fps, ~10s each)
⬇️
⚙️

Discard Clips With No Birds

Some clips contain no birds at all, for example when the motion capture was triggered by wind movement. Each video is automatically scanned for bird presence, and only clips containing birds are kept.

We capture a few frames per clip (four in the first second, then one per second) and run them through YOLOv8-nano inference, keeping detections for COCO class 14 (bird). The step outputs detections.json with confidence scores, bounding-box coordinates, and frame timestamps. Clips with no bird detections are moved to a separate directory. In our original testing, typically 20-30% of input clips contained birds.
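The sampling schedule can be sketched as a small helper; this is an illustration only (in the real pipeline each grabbed frame is then fed to YOLOv8-nano and class-14 detections are kept):

```python
def sample_times(duration_s: float, first_second_samples: int = 4) -> list:
    """Timestamps (seconds) at which frames are grabbed for bird detection:
    a burst of samples in the first second, then one frame per second."""
    burst = [i / first_second_samples for i in range(first_second_samples)]
    rest = [float(t) for t in range(1, int(duration_s))]
    return burst + rest
```

For a 10-second clip this yields 13 candidate frames: 0.0, 0.25, 0.5, 0.75, then 1.0 through 9.0.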

Smaller set of clips Filtered clips, and data file detections.json with confidence scores and timestamps per frame
⬇️
✂️

Make Highlights Reel

Trim each clip to show only the bird activity, then combine all segments into one highlights video.

A binary-search algorithm identifies optimal segment boundaries using the detection timestamps cached by the filter step, and each clip is trimmed to the portions with bird activity. FFmpeg then concatenates all segments using stream copy (no re-encoding) for fast processing. Output: highlights.mp4, typically 5-10 minutes long in our tests.
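Birdbird derives the trim boundaries with a binary search over the cached timestamps; as a simpler illustration of the same data flow, this sketch merges nearby detection timestamps into padded segments and builds the list file consumed by FFmpeg's concat demuxer (`ffmpeg -f concat -safe 0 -i list.txt -c copy highlights.mp4`). The function names and the gap/pad values are hypothetical:

```python
def detection_segments(times: list, gap: float = 2.0, pad: float = 0.5) -> list:
    """Merge detection timestamps into (start, end) segments, splitting
    wherever consecutive detections are more than `gap` seconds apart."""
    segments = []
    for t in sorted(times):
        if segments and t - segments[-1][1] <= gap:
            segments[-1][1] = t  # extend the current segment
        else:
            segments.append([t, t])  # start a new segment
    # Pad each segment so the trim does not start exactly on a detection frame.
    return [(max(0.0, s - pad), e + pad) for s, e in segments]

def concat_list(trimmed_paths: list) -> str:
    """Contents of the list file for `ffmpeg -f concat -c copy`
    (stream copy, so no re-encoding)."""
    return "".join(f"file '{p}'\n" for p in trimmed_paths)
```

Detections at 0.5 s, 1.0 s, 1.5 s and 6.0 s would merge into two segments, (0.0, 2.0) and (5.5, 6.5), with the defaults above.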

🎞️ One highlights file highlights.mp4 (H.264, variable duration)
⬇️
🔍

Identify Species

Analyse the highlights to determine which bird species appear, with confidence scores and timing.

We take frame captures from highlights.mp4 at short intervals (every 2 seconds in our testing). This can miss birds that appear only briefly; it is a trade-off against processing time. The BioCLIP vision-language model performs zero-shot species identification on each frame, and to speed up matching we restrict it to a fixed list of candidate species. Matching requires a GPU for reasonable performance, so there is an option to run it on a remote GPU server. The model scores each match with a percentage confidence.
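Setting the BioCLIP model itself aside (loading it is what needs the GPU), the zero-shot scoring step reduces to a softmax over similarities between one frame embedding and pre-computed text embeddings for the fixed species list, which is why restricting the label set speeds up matching. A minimal sketch with toy vectors, assuming the embeddings are already computed (names and the scale factor are hypothetical):

```python
import math

def zero_shot_scores(frame_emb: list, species_embs: dict, scale: float = 100.0) -> dict:
    """Score one frame against each candidate species: cosine similarity
    between embeddings, scaled into logits, then softmax-normalised so the
    scores sum to 1 (reported as percentage confidences)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    logits = {name: scale * cos(frame_emb, emb) for name, emb in species_embs.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(v - m) for name, v in logits.items()}
    total = sum(exps.values())
    return {name: v / total for name, v in exps.items()}
```

With only a handful of candidate species, this is a single small similarity computation per frame rather than a search over every label the model knows.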

📊 Timeline, counts, confidence scores Output metadata in species.json includes timestamps, common/scientific names, and confidence scores.
⬇️

Find Best Sections

Rank video segments by species diversity and quality to suggest the most interesting timestamps to view.

Using the frame timeline from the previous step, we identify the "best" short segment for each species. For example, the "best" point in the highlights for robins is the run of 8 consecutive frames (spanning 14 seconds) with the highest cumulative robin confidence. Later, when the robins button is selected, the viewer seeks to the start of that segment.
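The per-species scoring can be sketched as a sliding-window maximum over the frame timeline (function name and tie-breaking are hypothetical):

```python
def best_segment(frame_scores: list, window: int = 8, interval_s: float = 2.0):
    """Return (start_time_s, cumulative_confidence) for the `window`
    consecutive frames with the highest summed confidence for one species.
    With frames sampled every 2 seconds, 8 frames span 14 seconds of video."""
    if len(frame_scores) < window:
        return None  # timeline too short for a full window
    best_start = 0
    best_sum = current = sum(frame_scores[:window])
    for i in range(1, len(frame_scores) - window + 1):
        # Slide the window: add the entering frame, drop the leaving one.
        current += frame_scores[i + window - 1] - frame_scores[i - 1]
        if current > best_sum:
            best_start, best_sum = i, current
    return best_start * interval_s, best_sum
```

A timeline where robin confidence is high only for frames 5-12 would place the best robin segment at 10 seconds into the highlights.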

🎯 Suggested best timestamps and counts per species For each species: the number of frames in which it was found and its best-segment timestamp. File species.json.
🔊

Extract Audio Track

Extract the audio from each video clip for acoustic analysis.

FFmpeg extracts the audio track from each AVI file to a temporary WAV file (16kHz mono PCM), the input format required by BirdNET-Analyzer. Files are processed in batches and the temporary WAVs are cleaned up after analysis.
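A sketch of the FFmpeg invocation this step could use (the exact flags birdbird passes may differ); the command would be run per clip via `subprocess.run`, and the WAV deleted once BirdNET has processed it:

```python
def extract_audio_cmd(avi_path: str, wav_path: str) -> list:
    """FFmpeg arguments to pull the audio track out of one clip as
    16 kHz mono 16-bit PCM WAV, the format BirdNET expects."""
    return [
        "ffmpeg", "-y",       # overwrite the temp file if it exists
        "-i", avi_path,
        "-vn",                # drop the video stream
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # 16-bit PCM encoding
        wav_path,
    ]
```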

🎵 Audio files for analysis Temporary WAV files (16kHz mono) extracted from each clip
⬇️
🐦

BirdNET Analysis

Identify bird species by their songs and calls using acoustic analysis.

BirdNET-Analyzer processes audio in 3-second segments, detecting bird vocalisations and identifying species. Outputs include common and scientific names, confidence scores, and timestamps. Optional location filtering (latitude/longitude) restricts results to regionally appropriate species. The web viewer selects the highest-scoring clip per species.
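Selecting the highest-scoring clip per species is a simple reduction over the detection list. The field names below are hypothetical, not necessarily the schema of songs.json:

```python
def top_clip_per_species(detections: list) -> dict:
    """From BirdNET detections (one dict per 3-second segment), keep only
    the highest-confidence detection for each species, as the web viewer
    does when choosing which clip to present."""
    best = {}
    for d in detections:
        name = d["common_name"]
        if name not in best or d["confidence"] > best[name]["confidence"]:
            best[name] = d
    return best
```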

📊 Species detections with confidence scores songs.json with species names, confidence scores, source clips, and timestamps
⬇️
🌐

Publish to Web

Upload highlights and data to the cloud. View video with playback controls and visual/audio statistics.

The highlights video and JSON metadata are uploaded to Amazon S3-compatible object storage (we used Cloudflare R2 in our testing), organised per batch. A static HTML viewer fetches content directly from S3/R2 via client-side JavaScript, with the latest.json index driving the navigation controls for batch selection. The viewer is deployed to Cloudflare Workers for global CDN distribution.
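The per-batch key layout and the latest.json index might be sketched like this; the prefixes and field names are hypothetical, and the real upload would go through an S3-compatible client such as boto3:

```python
import json

def batch_keys(batch: str) -> dict:
    """Object keys for one batch, grouped under a per-batch prefix
    (layout hypothetical)."""
    prefix = f"batches/{batch}"
    return {
        "video": f"{prefix}/highlights.mp4",
        "species": f"{prefix}/species.json",
        "songs": f"{prefix}/songs.json",
    }

def latest_index(batches: list) -> str:
    """Build latest.json, the index the static viewer fetches to drive
    its batch-selection controls."""
    ordered = sorted(batches, reverse=True)  # YYYYMMDD names sort newest-first
    return json.dumps({"latest": ordered[0], "batches": ordered})
```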

📱 Interactive web app with player and stats Static HTML viewer + MP4 hosted on S3-compatible object store with latest.json index and timeline data