LLM 102

This is LLM 102! LLM 101 is what I did for my blog. It was simply an API call where a text is passed to Google LLM’s model. The text contains an instruction and the data on which that instrcution should be appied. Now, I am going to use LLM in a more complicated case. Let’s say I have a data, and I want to query information from it. Think of it like this: I have a person and I ask them to read a TEXT. Then, I will go ahead and ask a bunch of questions, and my goal is to get the best answers from that TEXT. This is how it looks like:

user: “THIS IS MY QUERY” –> Trained-Model <– DATA

One of the question that came to my mind was: “OK! But, why shouldn’t I just do f"THIS IS MY QUERY. use this data: {DATA}”? What can go wrong? One limitation seems to be the window size that can be passed as single prompt. Another thing is that you do not have to pass data again and again! You just pass it once and ask different questions from it. Let’s take a close look to what is happening:

Provided text data -> a certain set of vectors on n-dim Euclidean space (Embedding data)
User: “THIS IS MY QUERY” -> converted to one (or more?) vectors (Embedding query)
Find the similarities between (1) and (0), and return top-k matches
Based on those matches, passed a new prompt to LLM. For instance: “Based on these points: {similar-to-query items}, answer this: {query of user}”
Response is generated and shown to uesr

Let’s get started! Link to heading

First thing first! We need LLM (or SLM) to help us handle processing and generatign text data. Instead of using that trained model locally, I can use an API to just send data to another server, and then get back the response. Google’s Gemini has a free-tier API. The assumption is that user is able to just read the doc and understand how to send a query to API and get a response (See: ). Now, the question is: “Ok! How can/should I handle the embedding part?” There are two options:

Use the API’s embedding model directly.
Use another package, like llama_index. You get convenience but you lose the flexibility

NOTE: The goal is to use LLM to handle a certain task with high accuracy (??) I know that LLM has lots of parts that need to be considered carefully. Therefore, it is better to go with the first option as it might give me the flexibility I need to improve my LLM-powered tool later. Also, I will still work with API in the first option. So, it is not like developing the whole model! I think that is a rational decision. If I realize I cannot handle Option (1), I will try Option (2)! Also, note that users, who are not familiar with (2), they usually go with (1), and that makes sense. And as users make progress, they may try different models from different providers. I think that should be the moment to try option (2). Anyway, long story short, we just go with option (1) to better understand it.

Google Gemini Doc! Link to heading

Let’s start with find a webpage to help me setup a simple retrieval system. As explained earlier, we need to use embedding. I found the Embeddings Webpage. The following is based on the information provided in the doc:

from google import genai

client = genai.Client()

result = client.models.embed_content(
        model="gemini-embedding-001",
        contents="What is the meaning of life?")

print(result.embeddings)

As you read the lines, my suggestion is to look-up the doc and read about them. Understanding each part might help you later if you encounter any issue! This is also part of learning process. You should understand how each component you uses work. This is not about understanding how each component works under the hood (which, btw, in some cases, can be helpful if you try to do understand that). This is about understanding each component that you, as the user, uses in your script / process. Ok. Let’s check the doc for that. After a quick search, I found this webpage. As stated in the doc, the first step is to create a “client” for that API.

# install google-genai

from google import genai

client = genai.Client(api_key='your-api-key')

Note: You can also save the key as env variable. export GEMINI_API_KEY='your-api-key'. A few related items:

You can save it in .env, and run it with source .env (Question: Do we have to that again? If yes, how to not make that permanent?)
How can I check if it is set up properly? You can do: import os; print(os.environ)
What is env var? Here is a nice article on enviornment variables

As I read the doc, I also notice a couple of things:

Apparently we have synch & asynch clients. What are they? This requires its own blog :) But for now, a short definiton should suffice. In synch client, API is called. The execution is on hold (waiting…) till the response is received.
How to close client. We can do: client.close(), or use context manager.

Let’s try this part in a script:

import os
from google import genai

# constant
MODEL_ID = "gemini-2.5-flash"

# set up environment variable
os.environ['GOOGLE_API_KEY'] = 'MY_GEMINI_API_KEY'

with genai.Client() as client:
    response_1 = client.models.generate_content(
        model=MODEL_ID,
        contents='Hello',
    )
    print(response_1.text)
    
    response_2 = client.models.generate_content(
        model=MODEL_ID,
        contents='Ask a question',
    )
    print(response_2.text)

Cool! It’s time to move to the next part…. where embedding happens! So, let’s try the example but with one change…why not use gemini-2.5-flash for the model?

import os
from google import genai

# constant
MODEL_ID = "gemini-2.5-flash"

# set up environment variable
os.environ['GOOGLE_API_KEY'] = 'MY_GEMINI_API_KEY'

with genai.Client() as client:
    result = client.models.embed_content(
        model=MODEL_ID
        contents="What is the meaning of life?")

print(result.embeddings)

And this gives us error! After reading error and some search, you realize that a model may not support everything! In this case, we can see “gemini-2.5-flash” does NOT support embedding. Let’s take a look at the available models and their description. One approach is doc, but another approach (discovered after reading error) is to list models as follows:

import os
from google import genai

# set up environment variable
os.environ['GOOGLE_API_KEY'] = 'MY_GEMINI_API_KEY'

with genai.Client() as client:
    # Print all models or look for "embedding" in the name/description
    for model in client.models.list():
        print(f"Name: {model.name}, Description: {model.description}")
        print('=' * 50)

And we can find gemini-embedding-001 there. This is the exact model name used in the documentation, and the description says it can do embeddings. Let’s try then:

import os
from google import genai

# set up environment variable
os.environ['GOOGLE_API_KEY'] = 'MY_GEMINI_API_KEY'

EMBEDDING_MODEL_ID = "gemini-embedding-001"

with genai.Client() as client:
    result = client.models.embed_content(
        model= EMBEDDING_MODEL_ID,
        contents="What is the meaning of life?")

    print(result.embeddings)

and it gives:

[ContentEmbedding(
  values=[
    -0.022374554,
    -0.004560777,
    0.013309286,
    -0.0545072,
    -0.02090443,
    <... 3067 more items ...>,
  ]
)]

Nice! print(len(result.embeddings[0].values)) returns 3072, meaning our content is converted into a vector in 3072-dim space. Here are a few questions a keen reader may ask:

Can I convert the vector back to text?
Does the i-th element of vector represent a particular word or text?

The answer is no! Regarding the first question, cannot convert back because the mapping function is not a 1-1 relationship. Regarding the second question, there is no meaning for one element as the embedding is basically a distributed representation. Now, the question is: Ok, if I retrieve similar vectors to my query, how do I know the text? Well, because we embed them in the first place, we should know!

import os
from google import genai

# set up environment variable
os.environ['GOOGLE_API_KEY'] = 'MY_GEMINI_API_KEY'

EMBEDDING_MODEL_ID = "gemini-embedding-001"
contents=[
    "What is the meaning of life?",
    "what do we exist?",
    "my car is toyota",
]

with genai.Client() as client:
    result = client.models.embed_content(
        model= EMBEDDING_MODEL_ID,
        contents=contents)


arrs = []
for item in result.embeddings:
    arrs.append(np.array(item.values))


# calculate cos sim
for i in range(len(arrs)-1):
    for j in range(i+1, len(arrs)):
        a = arrs[i]
        b = arrs[j]
        sim_score = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

        print(contents[i])
        print(contents[j])
        print('sim score: ', sim_score)

And this gives:

What is the meaning of life?
what do we exist?
sim score:  0.682277787947564
What is the meaning of life?
my car is toyota
sim score:  0.5160473638649792
what do we exist?
my car is toyota
sim score:  0.5444432123076686

Note that we tried cosine similarity here. But…two questions come to my mind:

How to know cosine-similarity is the correct distance metric?
What if embeddings are based on semantic meaning…. then does it still make sense to use cosine-similarity?

Maybe they are wrong questions! I am walking away from it for now…but will try to get back to it at later blogs.

Okay…that was a good journey! As you go through the doc, you can see that you can modify the model’s process with config.

import os
from google import genai

# set up environment variable
os.environ['GOOGLE_API_KEY'] = 'MY_GEMINI_API_KEY'

EMBEDDING_MODEL_ID = "gemini-embedding-001"

contents=[
    "What is the meaning of life?",
    "what do we exist?",
    "my car is toyota",
]

genai_config = genai.types.EmbedContentConfig(
    task_type="SEMANTIC_SIMILARITY", 
    output_dimensionality=768
)
with genai.Client() as client:
    result = client.models.embed_content(
        model= EMBEDDING_MODEL_ID,
        contents=contents,
        config=genai_config
        )


arrs = []
for item in result.embeddings:
    arrs.append(np.array(item.values))


# calculate cos sim
for i in range(len(arrs)-1):
    for j in range(i+1, len(arrs)):
        a = arrs[i]
        b = arrs[j]
        sim_score = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

        print(contents[i])
        print(contents[j])
        print('sim score: ', sim_score)

and now we get the following result:

What is the meaning of life?
what do we exist?
sim score:  0.890914164184162
What is the meaning of life?
my car is toyota
sim score:  0.670119826227376
what do we exist?
my car is toyota
sim score:  0.7026134389582424

It is interesting to see that the score between What is the meaning of life? and what do we exist? are higher now!

Now, we have the embedding part. So, let’s put pieces together and see how we can create a simple retreival system. We need a way to track the data we pass for embedding as we need to return text from it. So, using a class is desirable. We need to embed records once. Then, for a given query, we find its emnbedding, then find top-k closest neighbors. And finally, return those.

import numpy as np
import os

from google import genai

os.environ['GOOGLE_API_KEY'] = 'MY_GEMINI_API_KEY'

class retrieval_system:
    def __init__(self, data):
        """
        data: list of values, each of type string.
        """
        self.MODEL_ID = "gemini-2.5-flash"
        self.EMBEDDING_MODEL_ID = "gemini-embedding-001"
        self.GEMAI_TASK_TYPE = "SEMANTIC_SIMILARITY"
        self.OUTPUT_DIM = 768
        self.data = data

        self.genai_config = genai.types.EmbedContentConfig(
            task_type=self.GEMAI_TASK_TYPE, 
            output_dimensionality=self.OUTPUT_DIM
        )
        with genai.Client() as client:
            result = client.models.embed_content(
                model= self.EMBEDDING_MODEL_ID,
                contents=self.data,
                config=self.genai_config
                )

        self.embedded_data = [
            np.array(item.values) for item in result.embeddings
        ]

    
    def search(self, query, top_k=1):
        """
        query: string
        top_k: int, number of top similar items to return
        """
        with genai.Client() as client:
            result = client.models.embed_content(
                model= self.EMBEDDING_MODEL_ID,
                contents=[query],
                config=self.genai_config
                )

        query_emb = np.array(result.embeddings[0].values)

        sim_scores = []
        for idx, data_emb in enumerate(self.embedded_data):
            sim_score = np.dot(query_emb, data_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(data_emb))
            sim_scores.append((idx, sim_score))

        # sort by sim score
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        top_k_indices = [idx for idx, score in sim_scores[:top_k]]

        return [self.data[idx] for idx in top_k_indices]


    def response(self, query, top_k=1):
        result = self.search(query, top_k=top_k)
        prompt = f"Based on the context {result}, please answer the following question: {query}"
        with genai.Client() as client:
            response = client.models.generate_content(
                model=self.MODEL_ID,
                contents=prompt,
            )
        return response.text



if __name__ == "__main__":
    data = """
    The Collatz conjecture[a] is one of the most famous unsolved problems in mathematics. The conjecture asks whether repeating two simple arithmetic operations will eventually transform every positive integer into 1. It concerns sequences of integers in which each term is obtained from the previous term as follows: if a term is even, the next term is one half of it. If a term is odd, the next term is 3 times the previous term plus 1. The conjecture is that these sequences always reach 1, no matter which positive integer is chosen to start the sequence. The conjecture has been shown to hold for all positive integers up to 2.36×1021, but no general proof has been found.
    It is named after the mathematician Lothar Collatz, who introduced the idea in 1937, two years after receiving his doctorate.[4] The sequence of numbers involved is sometimes referred to as the hailstone sequence, hailstone numbers or hailstone numerals (because the values are usually subject to multiple descents and ascents like hailstones in a cloud),[5] or as wondrous numbers.[6]
    """
    data = [item.strip("\n").strip() for item in data.split(".")] 

    retrieval_sys = retrieval_system(data)
    query = "what is Collatz conjecture?"

    top_k = 5
    response = retrieval_sys.response(query, top_k=top_k)

    print("Query: \n", query)
    print('=' * 50)
    print("Response: \n", response)