Introduction to mergen

mergen is a package which employs artificial intelligence to convert data analysis questions into executable code, explanations, and algorithms. The self-correction features allow the generated code to be optimized for performance and accuracy. mergen features a user-friendly chat interface, enabling users to interact with the AI agent and extract valuable insights from their data effortlessly.

This document introduces you to mergens basic set of tools, and shows you how to apply them to answer data analysis related questions and generate relevant R code.

Setting up an AI-agent

To be able to interact with an AI agent and use this agent for subsequent tasks, mergen contains the setupAgent function for setting up a framework for the agent.Mergen allows you to set up an agent for the openai API platform as well as for the replicate API platform.

Setting up an openai agent

For setting up an agent for the openai API platform, you can make use of the setupAgent function by setting the name="openai" argument. Let’s look how to setting up an agent works:

myAgent <- setupAgent(name="openai",type="chat",model="gpt-4",ai_api_key = "your_key")
#> Warning in getModels(ai_api_key): Request for available models failed. Check
#> your API-key.
myAgent
#> $name
#> [1] "userAgent"
#> 
#> $type
#> [1] "chat"
#> 
#> $API
#> [1] "openai"
#> 
#> $url
#> [1] "https://api.openai.com/v1/chat/completions"
#> 
#> $model
#> [1] "gpt-4"
#> 
#> $headers
#>      Authorization       Content-Type 
#>  "Bearer your_key" "application/json" 
#> 
#> $ai_api_key
#> [1] "your_key"

the setupAgent function returns a list containing all the agent information which can be used by subsequent functions. mportant to note is that the ai_api_key should be your OpenAI API key, provided as a string.

Setting up an agent for replicate

setupAgent also contains functionality for setting up an agent for replicate AIs. Let’s look at how this works:

myAgent <- setupAgent(name="replicate",type=NULL,model="llama-2-70b-chat",ai_api_key="my_key")
myAgent
#> $name
#> [1] "userAgent"
#> 
#> $type
#> NULL
#> 
#> $API
#> [1] "replicate"
#> 
#> $url
#> [1] "https://api.replicate.com/v1/predictions"
#> 
#> $model
#> [1] "llama-2-70b-chat"
#> 
#> $headers
#>      Authorization       Content-Type 
#>     "Token my_key" "application/json" 
#> 
#> $ai_api_key
#> [1] "my_key"

Sending a prompt

Once you have set up an agent, it is time to ask some questions to your AI model of choice! For this, you can make use of the sendPrompt function, or the selfcorrect function. The choice of which one to use depends on whether you want possible errors in the answered code to be corrected by sending another request to the model or not.

Using the sendPrompt function

Sending a prompt with the sendPrompt function is very easy. The function takes the arguments agent, prompt, return.type and context. By default the context is set to rbionfoExp. This tells your model of choice to act as a bioinformatics expert, and return any code as R code in triple backticks. Your prompt must be given as a string, but can contain any question and additional information that you want to send. The return value is a string containing the models answer.

answer <- sendPrompt(myAgent,
                     "how do I perform PCA on data in
                     a file called test.txt?",return.type = "text")
answer
#> [1] "\n\nThe following R code will read the file called \"test.txt\", normalize the table and do PCA. First, the code will read the file into an R data frame: \n\n```\ndata <- read.table(\"test.txt\", header = TRUE, sep = \"\\t\")\n```\n\nNext, the data will be normalized to the range of 0 to 1:\n\n```\nnormalized.data <- scale(data, center = TRUE, scale = TRUE)\n```\n\nFinally, the normalized data will be used to do a Principal Component Analysis (PCA):\n\n```\npca <- princomp(normalized.data)\n```"

Using the selfcorrect function

Sending a prompt with the selfcorrect function, will allow the possible generated code to be optimized for performance and accuracy. If the code that is returned by the model is not excecutable, the selfcorrect function will send the prompt back to the agent together with a list of errors and warnings, so that the code can be optimized. The amount of rounds of possible selfcorrect can be set by the user using the attempts = n argument. The return value is a list containing the initial answer of the agent and the final answer after n rounds of selfcorrection.

answer <- selfcorrect(myAgent, prompt="How do I perform PCA?",attempts=3)
print(answer)
#> $init.response
#> [1] "\n\nThe following R code will read the file called \"test.txt\", normalize the table and do PCA. First, the code will read the file into an R data frame: \n\n```R\ndata <- read.table(\"test.txt\", header = TRUE, sep = \"\\t\")\n```\n\nNext, the data will be normalized to the range of 0 to 1:\n\n```{r}\nnormalized.data <- scale(data, center = TRUE, scale = TRUE)\n```\n\nFinally, the normalized data will be used to do a Principal Component Analysis (PCA):\n\n```{R}\npca <- princomp(normalized.data)\n```"
#> 
#> $init.blocks
#> $init.blocks$code
#> [1] "\ndata <- read.table(\"test.txt\", header = TRUE, sep = \"\\t\")\n\n\nnormalized.data <- scale(data, center = TRUE, scale = TRUE)\n\n\npca <- princomp(normalized.data)\n"
#> 
#> $init.blocks$text
#> [1] "\n\nThe following R code will read the file called \"test.txt\", normalize the table and do PCA. First, the code will read the file into an R data frame: \n\n\n\n\nNext, the data will be normalized to the range of 0 to 1:\n\n\n\n\nFinally, the normalized data will be used to do a Principal Component Analysis (PCA):\n\n\n"
#> 
#> 
#> $final.response
#> [1] "\n\nThe third response.The following R code will read the file called \"test.txt\", normalize the table and do PCA. First, the code will read the file into an R data frame: \n\n```{r}\nplot(1:10)```\n\nNext, the data will be normalized to the range of 0 to 1:\n\n"
#> 
#> $final.blocks
#> $final.blocks$code
#> [1] "\nplot(1:10)"
#> 
#> $final.blocks$text
#> [1] "\n\nThe third response.The following R code will read the file called \"test.txt\", normalize the table and do PCA. First, the code will read the file into an R data frame: \n\n\n\n\nNext, the data will be normalized to the range of 0 to 1:\n\n"
#> 
#> 
#> $code.works
#> [1] TRUE
#> 
#> $exec.result
#> [1] "path/to/html/file"
#> 
#> $tried.attempts
#> [1] 3

Running the code

Extracting code blocks

Once you have sent a prompt and recieved the answer, mergen features a function extractCode that allows the user to extract code blocks from the given text. Before using this, however, the code blocks need to be cleaned up, as every agent will return its answer in a slightly different way. This can be done with the help of the clean_code_blocks function. Below is an example of what clean_code_blocks does with the answer returned by our agent above:

code_cleaned <- clean_code_blocks(answer$final.response)
cat(code_cleaned)
#> 
#> 
#> The third response.The following R code will read the file called "test.txt", normalize the table and do PCA. First, the code will read the file into an R data frame: 
#> 
#> ```
#> plot(1:10)```
#> 
#> Next, the data will be normalized to the range of 0 to 1:

As you can see above, clean_code_blocks ensures that all code is stripped from extra symbols such as {r}, R, r and {R}. This ensures that the function extractCode can extract the code blocks properly. The extractCode function takes as input a string, and also allows the user to set a delimiter used to enclose the code blocks (default is three backtics). Now lets have a look at what the extractCode function returns:

final_code <- extractCode(code_cleaned,delimiter = "```")
print (final_code)
#> $code
#> [1] "\nplot(1:10)"
#> 
#> $text
#> [1] "\n\nThe third response.The following R code will read the file called \"test.txt\", normalize the table and do PCA. First, the code will read the file into an R data frame: \n\n\n\n\nNext, the data will be normalized to the range of 0 to 1:\n\n"

As shown above, extractCode returns a list containing the actual code and the associated text. The code block can then be tested for execution using the executeCode function.

Running the code

mergen features functions that make it easy for the user to run the code returned by an AI agent. Once code blocks are cleaned up and extracted, code blocks can be executed using the executeCode function. Before doing that, however, it is advised to run the extractInstallPkg function. This function extracts package names and installs any missing packages needed for the code to run. Finally, the executeCode function can be used. Lets see what the executeCode function does:

executeCode(final_code$code)

#> NULL

As shown above, the code runs as it should! It is important to note that the executeCode function will not change the global environment. Any variables that might be created while executing the code will be deleted as the function completes.