\chapter{Implementation}
\label{ch:implementation}


This chapter is going to focus on the details of the process of creating the playable game
that used voice as a potential input modality. The first step in the process was to identify the type of tools
that were going to be needed in the game. 
First of all there needed to be the video game itself. 

% 2D and 3D
The most fundemental choice for the game prototype is if it should use 2 dimentional or 3D graphics. This 
choice might seem trivial but the consequences are very far fetching. 

% complexity of implementation
Adding the third dimention in game envirornment is adding additional layer of complexity. Instead of dealing with $x$ and $y$ 
there's also $z$ creating a 3 dimentional system. That in itself makes it that the assets in the game have to be also three dimentional.
Tha in itself means usually longer time needed to create projects since the process of creating assets becomes longer and more advanced.
Another thing that makes making 3D games more difficult is physics. The vast majority of games have some sort of physics 
implementation, be it collision detection, particle systems or force vectors. Making those calculations is something that becomes much harder when adding
a third dimension. 
To be able to verify this thesis a quick iteration and implementation are key, being arguments for choosing 2D.



% precision
for this project. When it comes to input in games, one of the most important factors is the precision of the controls - 
since the player actions should align with the expectations of the player.
Given that precision of input is such an important factor the 2D system seems to be making more sense in this scenario,
since it allows to precisely describe things such as movement or rotation using natural language. 
The difference between 2D static camera and 3D first person movement commands for example:

2D: Rotate to the right and move 10 steps forward

3D: Look to the upper left a bit and then move 10 stept forward

This example illustrates that it might be difficult for the player to describe camera motions since 3 dimentional transforms 
are not intuitive and might require repetetive attempts

% why 2D
Of course not all of 3D games rely on a player controllable camera. Many games employ a fixed camera or don't let the player
control the camera at all, but given the time frame of this thesis and the fact that it includes exploring different potential 
prototypes I made the decision that 2D is the preferable option.
After making that choice the next logical step is to think of potential suitable game ideas that would differ from each other
and also would make it possible to utilise the voice input.


% generes a
There's a lot of different video game generes. I set out to find out what type of game would be the most suitable


% Engines
Games are usually built using some sort of game engine, 
allowing for standing on the shoulders of giants so to speak. Engines vary greatly in their complexity, supported platforms,
pricing, licensing, and many other technical factors. For this work I decided to  


\section{WhisperCPP}

How to download custom whispercpp models
\url{https://github.com/ggml-org/whisper.cpp/blob/master/models/README.md}


$large-v3-turbo-q5_0$

The trick to remove silence in ffmpeg

\begin{lstlisting}[style=myStyle, language=bash]
ffmpeg -i fox.mp4 -ar 16000 -ac 1 -c:a pcm_s16le -af silenceremove=1:0:-50dB fox.wav
\end{lstlisting}



whisper cpp flag - can be 32 or 0 too but quality suffers
\begin{lstlisting}[style=myStyle, language=bash]
    --max-context 64 --entropy-thold 2.8
\end{lstlisting}


Running Whispercpp on my CPU proved to be remarkably slow


\url{https://old.reddit.com/r/LocalLLaMA/comments/1fyvc60/how_to_improve_whisper_translation_it_keeps/}


Discussion about whisper cpp 
\url{https://old.reddit.com/r/LocalLLaMA/comments/1hc1qzi/is_whispercpp_still_the_king_of_stt/}

whisper cpp wav file to txt file example
\begin{lstlisting}[style=myStyle, language=bash]
./build/bin/whisper-cli -m models/ggml-large-v3-turbo-q5_0.bin -nt samples/jfk.wav > test.txt
\end{lstlisting}

whisper cpp server run command
\begin{lstlisting}[style=myStyle, language=bash]

./build/bin/whisper-server -m models/ggml-large-v3-turbo-q5_0.bin -nt --host 0.0.0.0 -debug --port 8008 -t 7
\end{lstlisting}

Document the process of building whisper cpp 
There's a screenshot in the notes folder
\section{LLAMACPP}

Running LLAma CPP server
\begin{lstlisting}[style=myStyle, language=bash]
    ./llama-server -m ../../models/ToolACE-2-Llama-3.1-8B-Q4_K_M.ggml --n-gpu-layers 60 --host 0.0.0.0 --jinja
\end{lstlisting}

The docs \url{https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md}

The models themselves need to be quantified \url{https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md}

There's a note called Quanting the model with images


Theoretical navigation
Action + object = navigation
Free flowing conversation = talking to NPC

Though, to be said this would require some more testing into what's actually feasible here, but it could be a good starting point, especially the first system.

I made a smimple bash script to automate the deployment of the AI backend when run.

\section{Tool Calling}

Best model that fits on my machine
\url{https://huggingface.co/Team-ACE/ToolACE-2-Llama-3.1-8B
}

There's a nice benchmark for tool calling
\url{https://gorilla.cs.berkeley.edu/leaderboard.html}

\section{Game prototype}

Multiple engines were considered
\begin{itemize}
    \item Unity3D
    
    This engine while very powerful has a huge moving mass of all of it's complexity. This means that while 
    it allows for creating of very advanced titles, for the ideating stage it doesn't allow for as fast prototype
    creation as other choices

    \item Godot
    
    Godot is a universally praised engine with a cult-following. The main reason for that is it's licensing giving 
    the studios a stable pillar to stand on compared to it's biggest direct competitor being Unity3D. 
    Another benetfit that it has is the fact that it seems to be preferable by some when it comes to 2D game development.
    However, it still suffers from how complex it is, making creating prototypes and ideating more difficult and ultimately
    longer than the engines listed below. 

    \item Pico8
    
    The complete anti-thesis of the above choices. This engine allows for the quickest game creation, offering 
    built in tools, uisng very simple lua language and through creative limitations such as 128x128 resolution forcing
    the game developing process to be simple and streamlined. It's biggest weakness is that it doesn't allow for simple 
    http requests rendering it unusable for this work despite the author's personal preference. 
    
    \item LÖVE
    
    Another simple 2D lua engine, this time with actual support for http requests. It handles higher resolution 
    and makes it possible to quickly create prototypes or even fully fledged games such as Balatro. Hovever due to 
    how niche it is it doesn't have nearly as big of a community making support and learning resources scarce in comparasion
    to bigger engines.

    \item pure CLI
    
    If the game is set to be a text based adventure then maybe an engine isn't needed at all? That would potentially allow
    for huge flexibility when it comes to actual implementation but at a cost of no guidance and best practices. Games are
    rarely written to be exprienced via terminal these days, and even if being text based they still render their text via 
    some kind of graphical front end. Another point speaking against using CLI would be the inablity to use sound or other 
    modalities that are not purely text based. On top of that terminal windows often change their size and are hardly standarised
    which could prove creating UI elements difficult. 
    
    \item PyGame
    
    The choice against this engine comes down mostly due to lacking support and how it seemed to have been replaced by 
    Pico8 and LÖVE in the quick game development community, especialyl looking at their release notes and online activity 
    of users. It offers more capabilities than those engines at the cost of added complexity. It sits somewhere between Pico8
    and Unity3D and for those reasons it wasn't selected for this thesis. 


    \item p5js
    
    The final selection. This engine has the advantages of Pico-8 and LÖVE, allowing for lightning fast development while
    also adding suport for high resolution and http requests. This compbined with big community support and many learning
    resources + with how it runs on Javascript sealed it as the ultimate choice for this thesis.
\end{itemize}

\section{server setup}

backednd whisper fastapi runner
uvicorn main:app --host 0.0.0.0 --port 8000



\section{Game implementation}

\subsection{Prototype ideation}

This phase of the work has an exploratory meaning. In Design Thinking this would be the Ideation part of the Double Diamond approach \todo{cite}
The reason for following that approach instead of going straight into building the finished game is to verify if the 
game idea is worth executing in the first place. More often than not it turns out that games that work in the head
of their creators are not very enjoyable when executed. 
This phenomenon called Expectation confirmation theory \todo{cite} plays out in many fields of life including shopping, 
planning for vacations or in this case game design.

One additional thing that the prototypes can acomplish is to increase the cerative potential for making even better solutions.
Mind often sees connections while experiencing, meaning that after the prototypes are played through there's a chance of a 
new idea, a variation or some profound insight. 

All of these reasons are valid for making some prototypes. Fail quickly and iterate would be the motto here. 
Because of this three games got created:

\subsubsection{Prototype 1 - Maze} \todo{images}

Since the first idea that the author of the paper had about the voice system was to determine if it can control the player
movements on a 2D grid he came to a conclusion that a simple 2D maze game would prove to be a good testing ground for this
hypothesis. The reason being that it strips away all of the superfluos things and focuses on movement as a primary game
mechanic. 

The process of building the prototype itself has proven to be challenging mostly due to the backend infrastructure. 
During the devlopment process it turned out that sending audio recordings to the backend was poorly documented in the P6JS
framework that was selected. On top of that it has proven to be buggy and unreliable at first. 
Thankfully it has proven to be possible in the end. 

The process of building the prototype uncovered a problem in the WhisperCPP server itself. Due to undocumented bug
the audio processing server would start hallucinating after processing the first 3 requests while using GPU for processing.
This technical problem would persist until full restart of the WhisperCPP server.
The solution to this problem that was in scope of this thesis turned out to be turning the whisper server into a systemd 
deamon and then writing a separate API that would listen for requests to process audio and automatically restart the WhisperCPP
instance based on those requests. The following resulted in proper processing at a cost of ~2 seconds for the WhisperCPP server
to restart itself. This tradeoff is still very much worth the effort since the processing time for WhisperCPP on GPU has proven to be about 20 times
faster than on the CPU.

After fixing the backend itself another issue became apparent making the prototype not suitable for further development.
The movement mechanic in itself worked, but only up to the point. When the user would say "I want to move two units up"
the WhisperCPP would very often interpret it as "I want to move to units up" resulting in no action at all. 
The same edge case appeared with interpreting "four" as "for", making the sentence "I would like to move four units up"
also appear as an incorrect instruction. Of course these edge cases could be handled and hardcoded into the tools definitions 
for the Funtion calling LLM, but as the name implies edge cases are often unpredictable and impossible to fully work around 
especially since the input being the natural language. One potential solution here would be to try to make the LLM to guess 
the correct command based on what sounds close to the given instruction but that solution also seems like something liable to 
make errors, given how LLMs tend to hallucinate when asked to generate "creative" output. The validity of this approach didn't 
sound good enough in theory to warrant further testing in the author's personal opinion, especially given the next problem described below.

Next problem in the prototype is the broad scope in which navigation can be approached. 
The player may ask to move by using the directions such as up, down, left etc, by using positional system in X Y coordinates, by 
sides of the world, by describing the desired relationship to other elements of the envirornment (such as 2 spaces from the north wall),
or by any other abstract way of describing location. The usage of natural language incentivises people to use their natural
way of speech and there's bound to be a lot of ways that people demonstrate navigation. The prototype can be of course limited
to just a few of those ways but then what would be the difference between this voice input system and pre-existing solutions
such as Talos? \todo{Cite talos} Not to mention the dent in the immersion that would follow if the natural way of player
expression would not be supported? It would feel closer to piloting a machine with a set of commands instead of guiding
your own body.

To take this argument to extreme in this prototype the player has to reach the exit door to get out of the maze,
there is nothing stopping the player from just
commanding the game to move to the finish location. 


\subsubsection{Prototype 2 - Text Adventure}

% intro
The second idea for testing was the protoytype of a classical text adventure game, akin to games
in the MUD genre \todo{cite}
The game itself would allow the player to travel through different interconnected locations and preform different actions 
in those places. This concept in itself has a potential for higher immersion due to higher engagement of the player's 
imagination, since they'd have to imagine the game world as it's described in text to them. This effect can be compared
to reading a book and imagining all of it's action vs watching a movie and seeing all of the action presented in it.

On top of that the prototype emplyed the first person control, meaning that the player can speak in the first person 
while giving out actions. This in itself can lead to higher immersion, similar to how ttrpg players command their
characters in first person report higher immersion compared to those that speak in third person about
 their character \todo{source is just a guess}.


% implementation
This prototype emplyed basic OOP principles, creating a class for a Place and Action, and then defining 2 locations with a 
few actions that can be taken in them, including travelling from one location to another. 
The technical implementation turned out not to be too time consuming due to the fact that the most difficult part of audio 
sending could be reused from the previous prototype, as all of the prototypes were built in p6js. The possibility of 
pure python console output was briefly considered but the workflow of recording and sending audio would have proven to be 
difficult from a pure console environment. This combined with code reusing potential had led the author to decide to 
stick to p6js. On top of that P6js could potentially provide sound effects or background music in the way of further 
deepening immersion in the player.  

% outcome
Compared to the first prototype the audio transcription proved almost no issues or edge cases. That can be accredited to the fact that 
the actions in the prototype eg. "Pick up the key" are rather specific making the tool calling LLM interpret them without
much issue. The only issue was the fact that the Travel function expected capitalised location names as parameters,
but the WhisperCPP would transcribe the location names as lowercase. The fix turned out to be a very simple instruction 
in the tools.json file that made the LLM aware of this capitalisation.




\subsubsection{Prototype 3 - Flag Guessing}

% intro
The last idea for the prototypes was a guessing game in which the user would be presented with a picture of a flag
and they would have to try to guess multiple facts about the presented country including:
\begin{enumerate}
    \item Country name
    \item Capitol name
    \item Country's continent
    \item The population
\end{enumerate}

Each good answer would grant the player with extra points and if they couldn't answer no more questions about the country
the user would be able to skip to the next country. This prototype idea was to create the experience that's closer to 
TV trivia shows in which people answer difficult questions in hopes of reaching prizes. This feeling of being on stage
could potentially be enchanced by the usage of voice recognition compared to traditional keyboard input. 
% implementation

The implemention of this prototype didn't go as far as the other ones since while building it the author tested the 
voice processing backend to account for the typical natural language responses such as guessing the country or the population
and numerous edge cases similar to those in the Prototype 1 emerged. The prototype ended up in a half finished state,
at the same time further testing was not needed, since it served it's purpouse as it demonstrated that this should not
be the direction of further works.


After playtesting all of the above mentioned games and discussions with the thesis supervisor
the second prototype  stood out amongst all of the 
prototypes and the author decided to choose it for further work. The reasons for that would come down
to the second prototype having the combined strength of a potential for a deeper immersion of the player (due
to the first person narration) with the lack of before mentioned issues with voice processing on edge cases.
The prototype also exibited the highest ammount of personal satisfaction and interest on the few people that
the author showed all of the prototypes in an casual manner (not to hold and scientific weight). 


\subsection{Game Development}