This project provides a starter kit for building applications that interact with the Gemini API in real time. It supports audio and video input and provides a set of function tools for interacting with the user's system.
- Install the required dependencies:
    pip install -r requirements.txt
- Rename the .env.example file to .env.
- Obtain a Gemini API key from Google AI Studio.
- Replace your_api_key_here in .env with your actual API key.
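As a rough illustration of how a script might pick up the key from .env, the sketch below parses the file using only the standard library. The variable name GEMINI_API_KEY and the helper names load_env_file and get_api_key are assumptions made for this example; check .env.example for the name the project actually expects.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env", name="GEMINI_API_KEY"):
    """Return the key from the environment or the .env file.

    `name` is a hypothetical variable name, not confirmed by the project.
    """
    key = os.environ.get(name) or load_env_file(path).get(name)
    if not key or key == "your_api_key_here":
        raise RuntimeError(f"Set {name} in {path} to your actual API key.")
    return key
```

Failing early when the placeholder is still present gives a clearer error than a rejected request later on.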
Important: Use headphones when running the script to prevent audio feedback loops.
To run the script:
    python main.py
The script takes a video-mode flag, --mode, which can be "camera", "screen", or "none". The default is "screen". To share your screen, run:
    python main.py --mode screen
You can also specify the modality with the --modality flag, which can be "AUDIO" or "TEXT". The default is "AUDIO".
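The two flags described above could be declared with argparse roughly as follows. This is a minimal sketch, not the project's actual main.py: the flag names, choices, and defaults come from the description above, while build_parser and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the documented --mode and --modality flags."""
    parser = argparse.ArgumentParser(
        description="Real-time Gemini starter (illustrative parser only)."
    )
    parser.add_argument(
        "--mode",
        choices=["camera", "screen", "none"],
        default="screen",
        help="Video source to stream.",
    )
    parser.add_argument(
        "--modality",
        choices=["AUDIO", "TEXT"],
        default="AUDIO",
        help="Response modality.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--mode", "camera"])
print(args.mode, args.modality)  # camera AUDIO
```

Using choices makes argparse reject unsupported values such as --mode window with a usage error before any audio or video capture starts.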
The function_tools directory contains Python scripts that interact with the user's system. The Gemini model can call these tools to perform actions such as:
- click_mouse.py: Performs a mouse click at the current cursor position.
- copy_and_paste.py: Inputs text to the screen by simulating typing.
- copy_to_clipboard.py: Copies text to the system clipboard.
- execute_js_in_brave.py: Executes JavaScript code in the currently active Chromium-based browser window.
- function_hub.py: Manages and executes the available function tools.
- get_clipboard.py: Retrieves the current text from the system clipboard.
- move_mouse.py: Moves the mouse cursor to specified coordinates.
- output_text_to_screen.py: Displays a message on the screen using an alert box.
- press_keys.py: Simulates pressing a single key or a combination of keys.
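To give a feel for how a hub might manage and execute tools like these, here is a minimal registry-and-dispatch sketch. It is not the contents of function_hub.py: the names TOOLS, register_tool, and execute_tool are hypothetical, and the two tool bodies are stand-ins that only return strings instead of touching the mouse or clipboard.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def register_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that adds a function to the registry under its own name."""
    TOOLS[func.__name__] = func
    return func


@register_tool
def get_clipboard() -> str:
    # Stand-in: a real tool would read the system clipboard.
    return "clipboard text (placeholder)"


@register_tool
def move_mouse(x: int, y: int) -> str:
    # Stand-in: a real tool would move the cursor.
    return f"moved mouse to ({x}, {y})"


def execute_tool(name: str, arguments: Optional[dict] = None) -> dict:
    """Look up a tool by name and call it with keyword arguments."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**(arguments or {}))}
    except TypeError as exc:
        # Typically raised for missing or unexpected arguments.
        return {"error": f"bad arguments for {name}: {exc}"}


print(execute_tool("move_mouse", {"x": 10, "y": 20}))
```

Returning errors as data instead of raising lets the caller pass the failure back to the model, which can then retry with corrected arguments.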