PROJECT
AI Smart Assistant
Project Overview
This project isn’t for beginners; you should already be able to code to some extent before trying it.
There are a lot of variables in this project which can go wrong and require troubleshooting.
The goal of this project is to create my own AI-powered smart assistant in Python, entirely for free.
Project Outline
I intend to bring together and augment various free, self-hosted AI tools with a Python script to create the full package that is my own smart assistant. I’m inspired by the F.R.I.D.A.Y and J.A.R.V.I.S AI systems in the MCU, so whether the voice ends up male or female, that’s where the name and tone will come from.
I want something I can simply talk to: give it commands and it works, without needing to type anything in or boot it up beforehand.
There are a few specifics this product needs to have:
- It needs to be able to perform speech recognition
- It needs to be able to speak back to me in a convincing human voice
- It needs to run on boot
- It needs to trigger on a certain keyword
- It needs to be able to interact with an LLM
- It should be able to open web pages and launch applications
- It should link up to Notion for verbal note-taking
1 - Voice and Ears
A smart assistant is no good unless it can both hear you and talk back, so that’s step one.
TTS is nothing new, but an AI-powered TTS that sounds like a person is huge.
After all of the initial setup, there’s no reason the text can’t come from an LLM.
Here are the modules that we’ll need. num2words is something I’ve added to convert numbers to their written form (1 to one, for example) so that the TTS doesn’t fall over.
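Here’s a sketch of those imports. I’m assuming the TTS is Microsoft’s SpeechT5 served through the Hugging Face transformers pipeline (which matches the “Microsoft AI TTS” objects mentioned later) and sounddevice for audio playback; swap in your own choices if they differ.

```python
import re

import sounddevice as sd           # plays the generated audio
import torch
from datasets import load_dataset  # provides the speaker voice embeddings
from num2words import num2words    # converts 1 -> "one" so the TTS doesn't fall over
from transformers import pipeline  # hosts the SpeechT5 text-to-speech model
```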
I created a function called vocalise. It receives input text and then vocalises it.
This TTS also has an issue with very long strings, so I’ve limited the length to 500. If you want more than that, return multiple strings to vocalise.
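A minimal sketch of vocalise, assuming the synthesiser and speaker_embedding objects created in the next block and sounddevice for playback:

```python
def vocalise(text):
    # Replace every run of digits with its written form (1 -> one)
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    # The TTS struggles with very long strings, so cap the length at 500
    text = text[:500]
    # Generate the waveform and play it through the default output device
    speech = synthesiser(text, forward_params={"speaker_embeddings": speaker_embedding})
    sd.play(speech["audio"], samplerate=speech["sampling_rate"])
    sd.wait()
```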
To tie it all together, this code is required to vocalise the text.
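A sketch of that setup, again assuming SpeechT5 with the CMU Arctic x-vector speaker embeddings from the Hugging Face hub; the speaker index is an arbitrary choice:

```python
# Pull the TTS model and a dataset of speaker voice embeddings
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
embedding_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
# Index 7306 is a US English female voice; change the index to change the voice
speaker_embedding = torch.tensor(embedding_dataset[7306]["xvector"]).unsqueeze(0)

vocalise("Hello, I am online.")
```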
After you run this and pull the model, you’ll notice that it takes a long time to run. This is handled in the larger overall script.
I run this on system boot and keep the synthesiser, embedding_dataset and speaker_embedding variables in memory so they never need to be re-created. This means the long wait only happens once.
Next, it’s time to get speech recognition working and looping, listening for keywords. Then we can reliably execute the script: it will listen to what we say and, depending on what we tell it to do or say, give us a human response back.
We have to import one module for this.
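That module is SpeechRecognition (pip install SpeechRecognition), which also covers the microphone handling and ambient-noise calibration used later:

```python
import speech_recognition as sr
```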
Next we have to loop over and over to see if the trigger word is said. Note: this requires the vocalise command from earlier. If you want to test it in isolation from the TTS, just use a print statement.
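A sketch of that loop, assuming “friday” as the trigger word and Google’s free web speech endpoint for the recognition:

```python
recogniser = sr.Recognizer()

with sr.Microphone() as source:
    # Measure the room's background noise so speech stands out from it
    recogniser.adjust_for_ambient_noise(source)
    while True:
        audio = recogniser.listen(source)
        try:
            heard = recogniser.recognize_google(audio).lower()
        except sr.UnknownValueError:
            continue  # nothing intelligible was said; keep listening
        if heard.startswith("friday"):
            vocalise("Yes?")  # or print("Yes?") if testing without the TTS
```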
The full script should now combine everything above: the imports, the vocalise function, the TTS setup and the keyword-listening loop.
2 - LLM Integration
Now that we can speak to our assistant and get responses back, it’s time to hook it up to a large language model for some very smart assisting!
First, download and install ollama: https://ollama.com
When it’s all working in a command prompt, run the model of your choice (for example, ollama run llama3) to pull it to your PC.
The below code is all you need to interact with ollama in Python. In my llama_call function, prefix describes the required style of response and message is the actual request.
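A sketch of that function, assuming the official ollama pip package and a pulled llama3 model (use whichever model name you ran earlier):

```python
import ollama

def llama_call(prefix, message):
    # prefix sets the style of the response; message is what was actually asked
    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": f"{prefix} {message}"}],
    )
    return response["message"]["content"]
```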
Here’s an example of how this might work. If you need a particular style of response, you can code in as many of these as you like. user_request is a hard-coded string to demonstrate what I would have said to F.R.I.D.A.Y.
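For example (the prefix wording here is just my guess at the sort of thing you’d use):

```python
# Hard-coded stand-in for what I would have said to F.R.I.D.A.Y
user_request = "what is the largest insect alive"
answer = llama_call("Answer the following question in one short paragraph:", user_request)
print(answer)
```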
In case you were wondering: The Goliath beetle (Goliathus goliatus) is considered the largest insect alive, reaching lengths of up to 11.5 cm (4.5 inches). In the actual script, user_request will be based on what we say and not a question about insects.
3 - Configuration
Next, I want my smart assistant to be able to open certain webpages, perform web operations like Google searches, open folders and launch applications.
To do this, I’ll first create a config file system. I want it to be read on startup but also be self-updating. Say your smart assistant is open and you want to add a new webpage to open or a new app to start; you probably want to update the config file without closing the assistant and reloading all of the code.
For this I will be using a very simple text file system that splits a key and a value with a delimiter.
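For illustration, assuming “|” as the delimiter, a websites config file might look like this:

```
google|https://www.google.com
youtube|https://www.youtube.com
```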
Firstly I’ll build in creating the files if they don’t exist, then reading them into memory, then re-reading them when prompted so the script variables update.
Here I set the start up path as the initial directory to host the files, then set their filenames as variables.
To finish this bit off, websites, folders, apps and any CLI arguments are stored as dictionaries.
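A sketch of that setup; the filenames are my own placeholders:

```python
import os

# Host the config files next to the script itself
startup_path = os.path.dirname(os.path.abspath(__file__))
websites_file = os.path.join(startup_path, "websites.txt")
folders_file = os.path.join(startup_path, "folders.txt")
apps_file = os.path.join(startup_path, "apps.txt")

# Key = trigger word, value = what to execute (app values can carry CLI arguments)
websites = {}
folders = {}
apps = {}
```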
This function will simply create the empty config files if they don’t already exist.
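A minimal version of that function:

```python
def create_config_files():
    # First run: make empty files so later reads never fail
    for path in (websites_file, folders_file, apps_file):
        if not os.path.exists(path):
            open(path, "w").close()
```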
Here I created two functions: one to read all lines of a config file, then the function to populate the global dictionaries.
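A sketch of the pair, assuming the “|” delimiter from earlier:

```python
def read_config_lines(path):
    # Return the non-empty lines of one config file
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def populate_config():
    # Re-fill the global dictionaries from the files on disk
    for path, target in ((websites_file, websites),
                         (folders_file, folders),
                         (apps_file, apps)):
        target.clear()
        for line in read_config_lines(path):
            key, value = line.split("|", 1)
            target[key.strip().lower()] = value.strip()
```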
Once a key (the trigger word) is accompanied by a value (what to execute), the sky is more or less the limit on what we can make F.R.I.D.A.Y do.
The open function is what ties it all together. To activate this code, you will have said “Friday, open ….”. The code will loop through all dictionaries for the string after “open” and, if it appears in a dictionary, execute the open command accordingly. With how Python handles CLI calls, you can conceivably execute any CMD command you like here, making this function very powerful.
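A sketch of that function, assuming a Windows host (explorer for folders) and the dictionaries above:

```python
import subprocess
import webbrowser

def open_request(user_request):
    # user_request is everything said after "open"
    key = user_request.lower().strip()
    if key in websites:
        webbrowser.open(websites[key])
    elif key in folders:
        subprocess.Popen(["explorer", folders[key]])  # Windows file explorer
    elif key in apps:
        # The value can be any command line, so conceivably any CMD command
        subprocess.Popen(apps[key], shell=True)
    else:
        vocalise(f"I don't know how to open {user_request}")
```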
This line is added in the process requests function. If I said “Friday, reload”, it will execute the above function and the result is that the dictionary updates.
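In the process request function, that looks something like:

```python
    # inside the process request function
    if command == "reload":
        populate_config()  # the dictionaries update without a restart
```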
For good measure, I added in the trigger words ‘Google’, ‘YouTube’ and ‘Shop’ which will open Google, YouTube and Amazon searches based on the request. The way that sites create their search URLs is predictable, making it codable.
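A sketch of those searches; the URL patterns below are the real ones each site uses:

```python
from urllib.parse import quote_plus

SEARCH_URLS = {
    "google": "https://www.google.com/search?q=",
    "youtube": "https://www.youtube.com/results?search_query=",
    "shop": "https://www.amazon.com/s?k=",
}

def web_search(site, query):
    # Build the site's search URL from the spoken request and open it
    webbrowser.open(SEARCH_URLS[site] + quote_plus(query))
```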
Now the smart assistant can listen to us, execute code upon various trigger words, speak back intelligently using an LLM and execute commands on the computer to open and run certain functions.
4 - Notion Integration
I’m big on Notion. I really like productivity software and this one is particularly nice. It has an API that we can use to read and write Notion pages and databases. The functionality of this can be implemented very simply or become quite in-depth.
For me, I want to be able to record notes in a notion page and add tasks to a database for later.
You will need to install the pip module notion-client (imported as notion_client).
List the core information about your Notion setup: the API key, the ID of the page you want to append to, and the ID of the database you want to write to.
If you don’t know how to find these things, the API documentation is here: https://developers.notion.com.
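For example (these are placeholders; substitute your own values):

```python
import friday_notion_api

# Placeholders: find these via your Notion integration settings and the API docs
NOTION_API_KEY = "secret_xxx"
NOTION_PAGE_ID = "your-page-id"
NOTION_DATABASE_ID = "your-database-id"
```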
You might have noticed that I imported friday_notion_api and not notion_client. That’s because I wrote the API code in a separate .py file.
Create a new file next to the friday script: friday_notion_api.py.
Import the notion module before doing anything. The first function isn’t required; it just makes things a little tidier.
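A sketch of the top of friday_notion_api.py; the rich_text helper is my guess at the tidying function:

```python
from notion_client import Client

notion = Client(auth="secret_xxx")  # your Notion API key

def rich_text(content):
    # Optional helper: wraps plain text in the nested shape the API expects
    return [{"type": "text", "text": {"content": content}}]
```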
These two functions are to firstly read the data in a notion page, and secondly to read the lines in that page.
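Sketches of the two, using the blocks endpoint of the official client:

```python
def get_page_data(page_id):
    # Fetch the raw block children of a page
    return notion.blocks.children.list(block_id=page_id)["results"]

def read_page_lines(page_id):
    # Pull the plain text out of each paragraph block
    lines = []
    for block in get_page_data(page_id):
        if block["type"] == "paragraph":
            text = "".join(rt["plain_text"] for rt in block["paragraph"]["rich_text"])
            if text:
                lines.append(text)
    return lines
```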
The append page function may burn your eyes with how much wasted space there is, but I did it like this to demonstrate how nested the data is. I don’t really get along with JSON data if it’s not structured in a way that helps me read it, so preserving the structure visually helps me read the code.
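A version of it, spread out the same way to keep the nesting visible:

```python
def append_page(page_id, text):
    # Append one paragraph block containing the given text
    notion.blocks.children.append(
        block_id=page_id,
        children=[
            {
                "object": "block",
                "type": "paragraph",
                "paragraph": {
                    "rich_text": [
                        {
                            "type": "text",
                            "text": {
                                "content": text
                            }
                        }
                    ]
                }
            }
        ],
    )
```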
This section of code is to read the data of a Notion database and return the row data.
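A sketch that queries the database and returns each row’s title text, plus a hypothetical write helper for the ‘plan’ command coming up; the ‘Name’ title property is schema-dependent, so adjust it to match your own database:

```python
def read_database_rows(database_id):
    # Query the database and pull out each row's title text
    rows = notion.databases.query(database_id=database_id)["results"]
    titles = []
    for row in rows:
        for prop in row["properties"].values():
            if prop["type"] == "title" and prop["title"]:
                titles.append(prop["title"][0]["plain_text"])
    return titles

def add_database_row(database_id, text):
    # Create a new row whose title is the given text
    notion.pages.create(
        parent={"database_id": database_id},
        properties={"Name": {"title": rich_text(text)}},
    )
```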
Finally, this is how the notion API script is integrated into the main F.R.I.D.A.Y process request function.
Sub commands:
Note -> Append the page with the remaining input
Read -> Read all notes on the Notion page
Paste -> Paste the contents of the clipboard on a new line on the page
Plan -> Write a new entry into the database
Read Plans -> Read the rows in the database
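Back in the main script, the dispatch for those sub-commands might look like this sketch; pyperclip handles the clipboard, and add_database_row is the hypothetical helper from friday_notion_api.py above:

```python
import pyperclip
import friday_notion_api

def handle_notion(sub_request):
    # sub_request is everything said after "notion"
    if sub_request.startswith("note "):
        friday_notion_api.append_page(NOTION_PAGE_ID, sub_request[5:])
    elif sub_request == "read":
        for line in friday_notion_api.read_page_lines(NOTION_PAGE_ID):
            vocalise(line)
    elif sub_request == "paste":
        friday_notion_api.append_page(NOTION_PAGE_ID, pyperclip.paste())
    elif sub_request.startswith("plan "):
        friday_notion_api.add_database_row(NOTION_DATABASE_ID, sub_request[5:])
    elif sub_request == "read plans":
        for row in friday_notion_api.read_database_rows(NOTION_DATABASE_ID):
            vocalise(row)
```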
You can, of course, go much more in-depth, specifying categories or date ranges, but I don’t really need or want that.
5 - PC Control and Process Request
Now that our assistant can hear us, speak back to us, use an LLM for more complex and dynamic conversation, populate its own custom config files and read/write Notion pages and databases, we’re almost there.
Next, I want it to be able to shut down my PC, and I’ll share the script’s main code that executes everything, as well as the process request function.
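A sketch of the shutdown, assuming a Windows host:

```python
import os

def power_down_pc():
    # /s = shut down, /t 0 = do it immediately
    os.system("shutdown /s /t 0")
```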
Nice and easy! This can be executed in the process request function, which is directly below this.
Firstly, we take the full input and separate it by spaces. The first list item will be the ‘command’ and everything else will be the user_request.
We print that so we can check the input is being parsed properly.
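The top of the function might look like this:

```python
import pyperclip  # clipboard access, used by "dictate" and "reformat"

def process_request(full_input):
    # Lower-case enforcement keeps the command matching consistent
    words = full_input.lower().split()
    if not words:
        return
    command = words[0]
    user_request = " ".join(words[1:])
    print(command, "|", user_request)  # check the input is parsed properly
```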
Command: dictate
This will copy the user_request to your computer’s clipboard. You can combine this with “notion paste” to dictate your words into Notion.
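Continuing inside process_request:

```python
    if command == "dictate":
        pyperclip.copy(user_request)
        vocalise("Copied to your clipboard")
```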
Lower-case enforcement will help keep the code consistent.
If I say “ignore me” as the final two words and stop talking, it ignores me. Nice and simple.
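The escape hatch looks like this:

```python
    # If the request ends with "ignore me", drop it entirely
    if full_input.lower().endswith("ignore me"):
        return
```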
Next comes the various open commands. You can add as many custom options as you like, I just added Google, YouTube and Amazon (as buy).
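Continuing the dispatch (note that ‘buy’ maps onto the Amazon search from the configuration section):

```python
    # the open commands
    if command == "open":
        open_request(user_request)
    elif command == "google":
        web_search("google", user_request)
    elif command == "youtube":
        web_search("youtube", user_request)
    elif command == "buy":
        web_search("shop", user_request)  # Amazon search from the config section
```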
The following code covers all of the llama calls.
Quickly and explain are short and long forms of question & answer interactions.
Question is also a long form question & answer interaction; it just allows you to start a request slightly differently.
Write is a request to llama to write something for you.
Reformat will take your clipboard text, pass that into llama and reformat it to the style you ask for.
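Sketches of those calls; the exact prefix wording is my own guess:

```python
    # the llama calls
    elif command == "quickly":
        vocalise(llama_call("Answer briefly, in one or two sentences:", user_request))
    elif command == "explain":
        vocalise(llama_call("Explain the following in detail:", user_request))
    elif command == "question":
        vocalise(llama_call("Answer the following question in detail:", user_request))
    elif command == "write":
        vocalise(llama_call("Write the following:", user_request))
    elif command == "reformat":
        # Reformat the clipboard text in the requested style
        result = llama_call(f"Reformat the following text as {user_request}:",
                            pyperclip.paste())
        pyperclip.copy(result)
        vocalise("Reformatted and copied to your clipboard")
```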
A really cool application for this is:
Dictate your thoughts to F.R.I.D.A.Y with the ‘dictate’ command, which copies them to your clipboard. Then use F.R.I.D.A.Y’s ‘reformat’ to turn that into whatever form you like: a viral Instagram post, a professional email, and so on. Finally, use ‘notion paste’ to dump it into Notion for later.
The power down PC request comes next. Note: ‘shut down’ is what I’ve coded as F.R.I.D.A.Y’s own close command; ‘power down’ is for the actual PC.
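A sketch of both, assuming friday_online is the main loop’s flag:

```python
    elif full_input.lower() == "shut down":
        global friday_online
        friday_online = False   # F.R.I.D.A.Y's own close command
        vocalise("Goodbye")
    elif full_input.lower() == "power down":
        power_down_pc()         # actually shuts the PC down
```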
Reload is to reload the config files so you can update them while the script is live.
Here we have all of the Notion code. It’s also in the Notion section but here it is in order in relation to the whole function it’s within.
Finally, this is the main script that executes all of the other functions. In order:
Log the start time
Set the language tool
Create the required objects for the Microsoft AI TTS
Create the config files if they don’t exist
Populate the config data
Initialise the speech recognition module
Set the room audio baseline levels
Then it loops around while friday_online = True and listens to your microphone input
Additionally, if it hears “and” it will split the request into multiple requests by splitting the string on “and”, then process each part of the list separately as if it were a standalone request. For example, “Friday, open youtube and google spiders” would be treated as “open youtube” and “google spiders”.
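A condensed sketch of that main block, matching the steps above (the language tool setup is omitted here, and the TTS objects are assumed to have been created in section 1):

```python
import datetime

if __name__ == "__main__":
    print("Started:", datetime.datetime.now())   # log the start time
    # language tool setup omitted; TTS objects were created in section 1
    create_config_files()                        # create config files if missing
    populate_config()                            # read them into the dictionaries
    recogniser = sr.Recognizer()                 # initialise speech recognition
    friday_online = True

    with sr.Microphone() as source:
        recogniser.adjust_for_ambient_noise(source)  # set the room audio baseline
        while friday_online:
            audio = recogniser.listen(source)
            try:
                heard = recogniser.recognize_google(audio).lower()
            except sr.UnknownValueError:
                continue
            if not heard.startswith("friday"):
                continue
            request = heard[len("friday"):].strip(" ,")
            # "open youtube and google spiders" -> "open youtube", "google spiders"
            for part in request.split(" and "):
                process_request(part.strip())
```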
That’s the end (or is it?). This is where my script naturally stopped being coded. I struggle to think what else I’d like to integrate into this thing besides calls to Stable Diffusion. If I implement more, this won’t be the end of this page, but until then it is. Happy coding, and make something awesome!