
Speaker Identification #672

Merged · 31 commits · Nov 18, 2024
Conversation

EzraEllette
Contributor

@EzraEllette EzraEllette commented Nov 13, 2024

description

This PR adds speaker identification to screenpipe. Audio is segmented by speaker, then transcribed. Transcriptions now have a speaker_id column. A new speakers table was added with name and metadata columns, and a speaker_embeddings table was created with a one-to-many relationship between speakers and embeddings.
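The schema changes described above can be sketched roughly as follows. Only speakers, speaker_embeddings, the name/metadata columns, and the speaker_id column come from the description; every other column name and type here is an illustrative assumption, not screenpipe's actual schema:

```python
import sqlite3

# In-memory sketch of the described schema; column names beyond those
# mentioned in the PR description are guesses for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE speakers (
    id INTEGER PRIMARY KEY,
    name TEXT,
    metadata TEXT
);

-- one speaker -> many embeddings
CREATE TABLE speaker_embeddings (
    id INTEGER PRIMARY KEY,
    speaker_id INTEGER NOT NULL REFERENCES speakers(id),
    embedding BLOB NOT NULL
);

CREATE TABLE transcriptions (
    id INTEGER PRIMARY KEY,
    text TEXT,
    speaker_id INTEGER REFERENCES speakers(id)
);
""")

# A transcription and an embedding both point at the same speaker row.
speaker_id = conn.execute(
    "INSERT INTO speakers (name, metadata) VALUES ('', '')"
).lastrowid
conn.execute(
    "INSERT INTO speaker_embeddings (speaker_id, embedding) VALUES (?, ?)",
    (speaker_id, b"\x00" * 4),
)
conn.execute(
    "INSERT INTO transcriptions (text, speaker_id) VALUES (?, ?)",
    ("hello", speaker_id),
)
```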

related issue: /claim #306

type of change

  • new feature

how to test

Run the speaker_identification test, then run the screenpipe-server/src/db.rs tests.
Use screenpipe.

vercel bot commented Nov 13, 2024

screenpipe deployment: ✅ Ready, updated Nov 18, 2024 9:02pm (UTC)

@louis030195
Collaborator

louis030195 commented Nov 13, 2024

[screenshot attached: 2024-11-13 at 9:51:05 AM]

might be unrelated to this PR; i'll keep testing

(running two screenpipe instances at once might be related)

@louis030195
Collaborator

1 GB / 37 GB), Total CPU: 17%, NPU: N/A
2024-11-13T19:00:44.580527Z  INFO screenpipe_audio::stt: Preparing segments
2024-11-13T19:00:44.580547Z  INFO screenpipe_audio::stt: device: MacBook Pro Microphone (input), resampling from 48000 Hz to 16000 Hz
2024-11-13T19:00:45.288077Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288101Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288174Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288177Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.330388Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.330408Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embeddi

@louis030195
Collaborator

 INFO screenpipe_server::resource_monitor: Runtime: 4028s, Total Memory: 2% (1 GB / 37 GB), Total CPU: 16%, NPU: N/A
2024-11-13T19:00:14.578918Z  INFO screenpipe_audio::stt: Preparing segments    
2024-11-13T19:00:14.578965Z  INFO screenpipe_audio::stt: device: MacBook Pro Microphone (input), resampling from 48000 Hz to 16000 Hz    
2024-11-13T19:00:14.998152Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:14.998174Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.040711Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.040735Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.128138Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.128158Z ERROR scr

@louis030195
Collaborator

unrelated but fun:


2024-11-13T18:58:16.221108Z  INFO screenpipe_audio::multilingual: detected language: "fr"    
2024-11-13T18:58:16.532145Z  INFO screenpipe_server::resource_monitor: Runtime: 3917s, Total Memory: 2% (1 GB / 37 GB), Total CPU: 31%, NPU: N/A
2024-11-13T18:58:18.758847Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T18:58:18.758879Z  INFO screenpipe_audio::whisper:   0.0s-6.0s:  Parce que c'est tellement abstrait pour les gens dans la tech.    
2024-11-13T18:58:18.758891Z  INFO screenpipe_audio::whisper:   6.0s-8.0s:  Dans la startup.    
2024-11-13T18:58:18.758911Z  INFO screenpipe_audio::whisper:   10.0s-14.0s:  Si on se retrouve dans YC, qu'est-ce qui va passer avec l'évaluation ?    

whisper detected my voice as french (i spoke english)

@louis030195
Collaborator

a/MacBook Pro Microphone (input)_2024-11-13_19-16-47.mp4"    
2024-11-13T19:16:48.483339Z  INFO screenpipe_audio::multilingual: detected language: "en"    
2024-11-13T19:16:49.955738Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T19:16:49.955759Z  INFO screenpipe_audio::whisper:   0.0s-1.8s:  Well, like we were previously    
2024-11-13T19:16:49.977495Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) received transcription Some(" Well, like we were previously\n")    
2024-11-13T19:16:49.978565Z  INFO screenpipe_server::core: Detected speaker: Speaker { id: 90, name: "", metadata: "" }    
2024-11-13T19:16:49.978582Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) inserting audio chunk: "/tmp/spp/data/MacBook Pro Microphone (input)_2024-11-13_19-16-49.mp4"    
2024-11-13T19:16:50.616236Z  INFO screenpipe_audio::multilingual: detected language: "en"    
2024-11-13T19:16:51.325953Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T19:16:51.325974Z  INFO screenpipe_audio::whisper:   0.0s-30.0s:  Thank you.    
2024-11-13T19:16:51.402875Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) received transcription Some(" Thank you.\n")    
2024-11-13T19:16:51.403004Z ERROR screenpipe_server::core: Error processing audio result: error returned from database: (code: 1) zero-length vectors are not supported. 

there are a few "Thank you" hallucinations (something with VAD, i suppose), but maybe not more than on main
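The zero-length-vector database error in the log above suggests an empty embedding (from a segment with no frames) is being passed through to the insert. A minimal guard would skip the write in that case; the helper name and in-memory "db" here are hypothetical, not screenpipe's actual API:

```python
def store_embedding(db, speaker_id, embedding):
    """Skip empty embeddings so the vector store never receives a
    zero-length vector (hypothetical helper for illustration)."""
    if not embedding:  # empty segment -> no features were computed
        return False
    db.append((speaker_id, list(embedding)))
    return True

db = []
store_embedding(db, 90, [])          # skipped: nothing is written
store_embedding(db, 90, [0.1, 0.2])  # stored normally
```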

@EzraEllette
Contributor Author

Okay, there are some bug fixes to make.

@EzraEllette
Contributor Author

@louis030195 I was able to identify the source of the bug and fix it.

@louis030195
Collaborator

looks great! @EzraEllette

i want to merge this ASAP. there might be some changes we don't know about yet, so it makes sense to merge, ask a few people to test it, and see if it roughly works as before

one last thing to fix before merging though:

https://youtu.be/vk711s6h8W4

there is an issue with the audio data encoded to disk; for some reason the speed or something is changed, check the video

@EzraEllette
Contributor Author

Okay, I don't have my computer with me right now, but I have experienced something similar before. If you want to take a look, the sample rate passed to the stt function is probably wrong: we have to use a 16000 Hz rate for segmentation, and I'm probably not reflecting that change when STT is called. (Sent from my phone at a concert, so please forgive the grammar.)
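The suspected bug above is rate bookkeeping: after resampling to 16 kHz for segmentation, the new rate (not the device's original rate) must travel with the audio into STT and encoding. A small sketch of that invariant, with illustrative function names that are not screenpipe's actual code:

```python
SEGMENTATION_RATE = 16_000  # pyannote segmentation expects 16 kHz

def resample(samples, from_rate, to_rate):
    """Naive nearest-sample resampler, just enough to show the bookkeeping."""
    n_out = int(len(samples) * to_rate / from_rate)
    return [samples[int(i * from_rate / to_rate)] for i in range(n_out)]

def prepare_for_stt(samples, device_rate):
    """Resample for segmentation and return the NEW rate alongside the audio.
    Forwarding the original device_rate instead would make downstream STT
    and the encoded file interpret the audio at the wrong speed."""
    resampled = resample(samples, device_rate, SEGMENTATION_RATE)
    return resampled, SEGMENTATION_RATE  # not device_rate

audio = [0.0] * 48_000  # one second of audio at 48 kHz
resampled, rate = prepare_for_stt(audio, 48_000)
# still one second of audio, now described by the correct rate
```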

@louis030195 louis030195 mentioned this pull request Nov 17, 2024
@louis030195
Collaborator

any news?

@EzraEllette
Contributor Author

> any news?

@louis030195 Making the UI today

@EzraEllette
Contributor Author

This should be safe to merge once tested again. UI can come soon.

@EzraEllette
Contributor Author

I fixed the audio storage issue.

@louis030195
Collaborator

amazing

/approve


algora-pbc bot commented Nov 18, 2024

@louis030195: The claim has been successfully added to reward-all. You can visit your dashboard to complete the payment.

@louis030195 louis030195 merged commit 478fb05 into mediar-ai:main Nov 18, 2024
3 of 7 checks passed
@louis030195
Collaborator

@EzraEllette any suggestions for next steps?

@EzraEllette
Contributor Author

> @EzraEllette any suggestions for next steps?

Some ideas:

  • Update UI to include speakers (meetings, search, etc)
  • add a UI to manually identify speakers
    • Provide audio recording in UI and allow updating of name or searching and selecting a previously identified speaker.
    • Since one speaker can have multiple embeddings in the database, when a previously identified speaker is selected we should update the speaker embeddings to point to the selected speaker and remove the old speaker from the database. (This allows lower thresholds for speaker search.)
  • Attempt to use LLM for identification through meeting context
  • utilize metadata
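The re-assignment step in the list above (pointing the old speaker's embeddings at the selected speaker, then deleting the old row) could look roughly like this. Table names follow the PR description; the columns and helper are assumptions for illustration:

```python
import sqlite3

# Minimal stand-in for the speakers / speaker_embeddings tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE speakers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE speaker_embeddings (
    id INTEGER PRIMARY KEY,
    speaker_id INTEGER NOT NULL REFERENCES speakers(id)
);
INSERT INTO speakers (id, name) VALUES (1, 'alice'), (2, '');
INSERT INTO speaker_embeddings (speaker_id) VALUES (1), (2), (2);
""")

def merge_speakers(conn, old_id, target_id):
    """Re-point every embedding of old_id at target_id, then drop old_id.
    Keeping all embeddings is what allows lower speaker-search thresholds."""
    with conn:  # run both statements in one transaction
        conn.execute(
            "UPDATE speaker_embeddings SET speaker_id = ? WHERE speaker_id = ?",
            (target_id, old_id),
        )
        conn.execute("DELETE FROM speakers WHERE id = ?", (old_id,))

merge_speakers(conn, old_id=2, target_id=1)
```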

@louis030195
Collaborator

let's continue here @EzraEllette

#695
