feat: speaker identification (#672)
* add pyannote

* add pyannote

* local pyannote

* fix segment.rs

* working

* Diarize before VAD

* add logging for embedding length

* rebase

* Do segmentation stuff

* update imports

* add speaker embedding to transcription result

* add sqlite-vec and touch migration script

* update migration to include table

* speaker operations in DB

* update tests

* update .gitignore

* update speaker id process

* update speaker id process

* update models

* use channels for segments

* update transcription tests

* delete pyannote example

* clear warnings

* vad then segment

* add test for identification

* cleanup core

* update segment handling to prevent going out of bounds

* update logging

* update version

* fix sample rate in stt

* initialize segmentation models once
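The commits above outline the pipeline this PR lands: pyannote-based segmentation runs alongside VAD, each speech segment gets a speaker embedding, and embeddings are persisted with sqlite-vec so new segments can be matched to known speakers. As a rough sketch of the matching step (illustrative only, not the code in this commit; the function names and the threshold value are hypothetical), comparing a fresh embedding against stored ones by cosine similarity could look like:

// Hypothetical sketch of embedding-based speaker matching; not taken
// from this commit.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Returns the id of the best-matching known speaker, or None if no
/// candidate clears the (assumed) similarity threshold.
fn identify_speaker(embedding: &[f32], known: &[(i64, Vec<f32>)]) -> Option<i64> {
    const THRESHOLD: f32 = 0.85; // assumed value; tuned per embedding model
    known
        .iter()
        .map(|(id, e)| (*id, cosine_similarity(embedding, e)))
        .filter(|(_, sim)| *sim >= THRESHOLD)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(id, _)| id)
}

A threshold keeps unfamiliar voices from being forced onto an existing speaker; below it, the caller would insert a new speaker row instead of reusing one.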
EzraEllette authored Nov 18, 2024
1 parent d96d28d commit 478fb05
Showing 27 changed files with 114,327 additions and 341 deletions.
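Most of the 27 files are not rendered below. Per the commit messages, the migration script gains a sqlite-vec table for speaker embeddings. A hedged sketch of such a migration, assuming sqlite-vec's vec0 virtual-table syntax and a 512-dimensional embedding (both assumptions; the commit's real schema is not visible in this excerpt):

// Assumed schema, not the migration shipped in this commit.
const CREATE_SPEAKER_EMBEDDINGS: &str =
    "CREATE VIRTUAL TABLE IF NOT EXISTS speaker_embeddings USING vec0(
        embedding float[512]
    );";

// sqlite-vec answers nearest-neighbour lookups with MATCH plus
// ORDER BY distance.
const MATCH_SPEAKER: &str =
    "SELECT rowid, distance
     FROM speaker_embeddings
     WHERE embedding MATCH ?
     ORDER BY distance
     LIMIT 1;";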
2 changes: 2 additions & 0 deletions .gitignore
@@ -24,6 +24,8 @@
 data/
 text_json/

+src-tauri/deno-*
+
 # Ignore package-lock.json in the vercel-ai-chatbot example
 examples/typescript/vercel-ai-chatbot/package-lock.json

2 changes: 1 addition & 1 deletion screenpipe-app-tauri/src-tauri/Cargo.lock

3 changes: 3 additions & 0 deletions screenpipe-audio/Cargo.toml
@@ -73,6 +73,9 @@
 dirs = "5.0.1"
 lazy_static = { version = "1.4.0" }
 realfft = "3.4.0"
 regex = "1.11.0"
+ndarray = "0.16"
+ort = "2.0.0-rc.5"
+knf-rs = { git = "https://github.com/thewh1teagle/knf-rs.git" }


[target.'cfg(target_os = "windows")'.dependencies]
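The three new dependencies carry the speaker-identification work: ort runs the pyannote ONNX models, ndarray holds the tensors fed to them, and knf-rs computes the filterbank features used for speaker embeddings. A minimal sketch of the ort side (hedged: written against the ort 2.0 release-candidate API, whose names like Session::builder, commit_from_file, and the inputs! macro may differ between release candidates; the model path and tensor names are placeholders, not taken from this commit):

use ndarray::Array3;
use ort::session::Session;

// Sketch only: the model path and the "input"/"output" tensor names
// are assumptions; check the actual model's signatures.
fn run_segmentation(samples: &[f32]) -> ort::Result<()> {
    // Load the segmentation model once and reuse it (see the commit
    // "initialize segmentation models once").
    let session = Session::builder()?
        .commit_from_file("models/segmentation-3.0.onnx")?;

    // pyannote's segmentation model consumes (batch, channel, samples).
    let input = Array3::from_shape_vec((1, 1, samples.len()), samples.to_vec())
        .expect("vec length matches the declared shape");

    let outputs = session.run(ort::inputs!["input" => input.view()]?)?;
    let frames = outputs["output"].try_extract_tensor::<f32>()?;
    println!("frame activations with shape {:?}", frames.shape());
    Ok(())
}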
67 changes: 30 additions & 37 deletions screenpipe-audio/examples/stt.rs
@@ -1,5 +1,5 @@
 use futures::future::join_all;
-use screenpipe_audio::stt::stt;
+use screenpipe_audio::stt::{prepare_segments, stt};
 use screenpipe_audio::vad_engine::{SileroVad, VadEngine};
 use screenpipe_audio::whisper::WhisperModel;
 use screenpipe_audio::{AudioInput, AudioTranscriptionEngine};
@@ -23,38 +23,23 @@ async fn main() {
 let test_cases = vec![
 (
 "test_data/accuracy1.wav",
-r#"yo louis, here's the tldr of that mind-blowing meeting:
-- bob's cat walked across his keyboard 3 times. productivity increased by 200%.
-- sarah's virtual background glitched, revealing she was actually on a beach. no one noticed.
-- you successfully pretended to be engaged while scrolling twitter. achievement unlocked!
-- 7 people said "you're on mute" in perfect synchronization. new world record.
-- meeting could've been an email. shocking.
-key takeaway: we're all living in a simulation, and the devs are laughing.
-peace out, llama3.2:3b-instruct-q4_K_M"#,
+r#"yo louis, here's the tldr of that mind-blowing meeting. bob's cat walked across his keyboard 3 times. productivity increased by 200%. sarah's virtual background glitched, revealing she was actually on a beach. no one noticed. you successfully pretended to be engaged while scrolling twitter. achievement unlocked! 7 people said "you're on mute" in perfect synchronization. new world record. meeting could've been an email. shocking. key takeaway: we're all living in a simulation, and the devs are laughing. peace out, llama3.2:3b-instruct-q4_k_m"#,
 ),
 (
 "test_data/accuracy2.wav",
-r#"bro - got some good stuff from screenpipe here's the lowdown on your day, you productivity ninja:
-- absolutely demolished that 2-hour coding sesh on the new feature. the keyboard is still smoking, bro!
-- crushed 3 client calls like a boss. they're probably writing love letters to you as we speak, make sure to close john tomorrow 8.00 am according to our notes, let the cash flow in!
-- spent 45 mins on slack. 90% memes, 10% actual work. perfectly balanced, as all things should be
-- watched a rust tutorial. way to flex those brain muscles, you nerd!
-overall, you're killing it! 80% of your time on high-value tasks. the other 20%? probably spent admiring your own reflection, you handsome devil.
-PS: seriously, quit tiktok. your FBI agent is getting bored watching you scroll endlessly.
-what's the plan for tomorrow? more coding? more memes? world domination?
-generated by your screenpipe ai assistant (who's definitely not planning to take over the world... yet)"#,
+r#"bro - got some good stuff from screenpipe here's the lowdown on your day, you productivity ninja: absolutely demolished that 2-hour coding sesh on the new feature. the keyboard is still smoking, bro! crushed 3 client calls like a boss. they're probably writing love letters to you as we speak, make sure to close john tomorrow 8.00 am according to our notes, let the cash flow in! spent 45 mins on slack. 90% memes, 10% actual work. perfectly balanced, as all things should bewatched a rust tutorial. way to flex those brain muscles, you nerd! overall, you're killing it! 80% of your time on high-value tasks. the other 20%? probably spent admiring your own reflection, you handsome devil. ps: seriously, quit tiktok. your fbi agent is getting bored watching you scroll endlessly. what's the plan for tomorrow? more coding? more memes? world domination? generated by your screenpipe ai assistant (who's definitely not planning to take over the world... yet)"#,
 ),
 (
 "test_data/accuracy3.wav",
-r#"again, screenpipe allows you to get meeting summaries, locally, without leaking data to OpenAI, with any apps, like WhatsApp, Meet, Zoom, etc. and it's open source at github.com/mediar-ai/screenpipe"#,
+r#"again, screenpipe allows you to get meeting summaries, locally, without leaking data to openai, with any apps, like whatsapp, meet, zoom, etc. and it's open source at github.com/mediar-ai/screenpipe"#,
 ),
 (
 "test_data/accuracy4.wav",
-r#"Eventually but, I mean, I feel like but, I mean, first, I mean, you think your your vision smart will be interesting because, yeah, you install once. You pay us, you install once. That that yours. So, basically, all the time Microsoft explained, you know, MS Office, long time ago, you just buy the the the software that you can using there forever unless you wanna you wanna update upgrade is the better version. Right? So it's a little bit, you know"#,
+r#"eventually but, i mean, i feel like but, i mean, first, i mean, you think your your vision smart will be interesting because, yeah, you install once. you pay us, you install once. that that yours. so, basically, all the time microsoft explained, you know, ms office, long time ago, you just buy the the the software that you can using there forever unless you wanna you wanna update upgrade is the better version. right? so it's a little bit, you know"#,
 ),
 (
 "test_data/accuracy5.wav",
-r#"Thank you. Yeah. So I cannot they they took it, refresh because of my one set top top time. And, also, second thing is, your byte was stolen. By the time?"#,
+r#"thank you. yeah. so i cannot they they took it, refresh because of my one set top top time. and, also, second thing is, your byte was stolen. by the time?"#,
 ),
 // Add more test cases as needed
 ];
@@ -85,24 +70,32 @@
 device: Arc::new(screenpipe_audio::default_input_device().unwrap()),
 };

-let mut vad_engine_guard = vad_engine.lock().await;
+let mut segments = prepare_segments(&audio_input, vad_engine.clone())
[CI annotation (GitHub Actions / test-ubuntu, line 73): this function takes 5 arguments but 2 arguments were supplied]
+.await
+.unwrap();
 let mut whisper_model_guard = whisper_model.lock().await;
-let (transcription, _) = stt(
-&audio_input,
-&mut *whisper_model_guard,
-Arc::new(AudioTranscriptionEngine::WhisperLargeV3Turbo),
-&mut **vad_engine_guard,
-None,
-&output_path,
-true,
-vec![Language::English],
-)
-.await
-.unwrap();
-drop(vad_engine_guard);

+let mut transcription = String::new();
+while let Some(segment) = segments.recv().await {
+let (transcript, _) = stt(
+&segment.samples,
+audio_input.sample_rate,
+&audio_input.device.to_string(),
+&mut whisper_model_guard,
+Arc::new(AudioTranscriptionEngine::WhisperLargeV3Turbo),
+None,
+&output_path,
+true,
+vec![Language::English],
+)
+.await
+.unwrap();
+
+transcription.push_str(&transcript);
+}
+drop(whisper_model_guard);

-let distance = levenshtein(expected_transcription, &transcription);
+let distance = levenshtein(expected_transcription, &transcription.to_lowercase());
 let accuracy = 1.0 - (distance as f64 / expected_transcription.len() as f64);

 (audio_file, expected_transcription, transcription, accuracy)
@@ -123,7 +116,7 @@ async fn main() {
println!("expected: {}", expected_transcription);
println!("actual: {}", transcription);
println!("accuracy: {:.2}%", accuracy * 100.0);
println!();
// println!();

total_accuracy += accuracy;
total_tests += 1;
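The reworked example above now pulls segments off a channel (via prepare_segments) and transcribes them one at a time, instead of handing the whole recording to stt. A self-contained sketch of that producer/consumer shape (the AudioSegment type and its fields are illustrative stand-ins, not screenpipe's actual types):

use tokio::sync::mpsc;

// Illustrative stand-in for screenpipe's audio segment type.
struct AudioSegment {
    samples: Vec<f32>,
    speaker_embedding: Vec<f32>,
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<AudioSegment>(32);

    // Producer: in the real pipeline this would be VAD plus pyannote
    // segmentation emitting one message per speech segment.
    tokio::spawn(async move {
        for _ in 0..3 {
            let segment = AudioSegment {
                samples: vec![0.0; 16_000], // 1 s of silence at 16 kHz
                speaker_embedding: vec![0.0; 512],
            };
            if tx.send(segment).await.is_err() {
                break; // receiver dropped
            }
        }
    });

    // Consumer: mirrors the `while let Some(segment) = segments.recv().await`
    // loop in the updated example.
    let mut transcription = String::new();
    while let Some(segment) = rx.recv().await {
        // stt(&segment.samples, ...) would run here.
        transcription.push_str(&format!("[segment: {} samples] ", segment.samples.len()));
    }
    println!("{transcription}");
}

Bounding the channel (32 here) applies backpressure: segmentation pauses when transcription falls behind rather than buffering unboundedly.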
(binary files and the remaining changed files in this commit are not shown)
