Sophie Nguyen

Bioinformatics Specialist & Aspiring Data Engineer

StackOverflow Retrieval with BigQuery and RAG | Sophie Nguyen

StackOverflow Retrieval with BigQuery and RAG

September 27, 2025

StackOverflow Retrieval with BigQuery and RAG is a solution designed to accelerate and automate the process of finding relevant answers from StackOverflow. Traditional methods involve manually browsing multiple questions, comparing content, and verifying answers. This process is tedious, time-consuming and delays producitvity.

The goal of this project is to streamline this experice using retrival-based AI approach with no latency. This project leverages Google BigQuery for public data access, embeddings for semantic encoding of questions, vector search to find similar questions, and Retrieval-Augmented Generation (RAG) to generate accurate, context-aware answers.

BigQuery approach

This project follows a Retrieval-Augmented Generation (RAG) pipeline: rag_architecture“image from google

Instruction to set up

The same as in instruction to set up for Sentiment Analysis of GitHub Comments project go to here

Steps

1. Create remote model

CREATE OR REPLACE MODEL `certain-voyager-470919-p7.mydataset.embedding_model`
REMOTE WITH CONNECTION  `projects/certain-voyager-470919-p7/locations/us/connections/vertexAI_connection`
OPTIONS (ENDPOINT = 'text-embedding-005');

CREATE OR REPLACE MODEL `certain-voyager-470919-p7.mydataset.text_generation_model`
REMOTE WITH CONNECTION  `projects/certain-voyager-470919-p7/locations/us/connections/vertexAI_connection`
OPTIONS (ENDPOINT = 'gemini-2.0-flash-001');

1. Generate Vector Representations of Questions Using Embeddings

CREATE TABLE `certain-voyager-470919-p7.mydataset.embedding_questions` (
  embedding ARRAY<FLOAT64>,
  id INTEGER,
  content STRING,
  title STRING, 
  body STRING, 
  tags STRING, 
  accepted_answer_id INTEGER
);
INSERT INTO `certain-voyager-470919-p7.mydataset.embedding_questions` 
(id,embedding, content, title, body, tags, accepted_answer_id )
SELECT id,ml_generate_embedding_result AS embedding, content, title, body, tags, accepted_answer_id
  FROM ML.GENERATE_EMBEDDING(
    MODEL  `certain-voyager-470919-p7.mydataset.embedding_model`,
    (SELECT id,  title, body, tags, accepted_answer_id , CONCAT(title, '\n', body, ' ', tags) AS content
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE REGEXP_CONTAINS(tags, 'python') AND accepted_answer_id IS NOT NULL 
    ),
    STRUCT(TRUE AS flatten_json_output,
          'RETRIEVAL_DOCUMENT' as task_type)
  ) AS e
WHERE e.id NOT IN (SELECT id FROM `certain-voyager-470919-p7.mydataset.embedding_questions` );

3. Use a vector index to enable efficient similarity search on the embedding column

CREATE OR REPLACE VECTOR INDEX `question_index`
ON `certain-voyager-470919-p7.mydataset.embedding_questions`(embedding)
OPTIONS(distance_type='COSINE', index_type='IVF');

4. Search for Similar Questions Using Vector Search + RAG

DECLARE prompt, input_question STRING;
SET input_question = "what is different between list and tuple";
SET prompt = CONCAT("Propose a solution to the question '", input_question, "' using the question and answers from the database");

SELECT input_question, ml_generate_text_llm_result AS solution, 
FROM ML.GENERATE_TEXT(
  MODEL `certain-voyager-470919-p7.mydataset.text_generation_model`,
  (
  SELECT 
      CONCAT(prompt, STRING_AGG(FORMAT("question: %s, %s, answer: %s", base.title, base.body, a.body ))) AS prompt
  FROM VECTOR_SEARCH(
    TABLE `certain-voyager-470919-p7.mydataset.embedding_questions`,
    'embedding',
    (SELECT ml_generate_embedding_result,content
    FROM ML.GENERATE_EMBEDDING(
      MODEL  `certain-voyager-470919-p7.mydataset.embedding_model`,
      (SELECT input_question AS content ),
      STRUCT()
    )),
    top_k => 5,
    distance_type => 'COSINE',
    options => '{"fraction_lists_to_search": 0.005}'
  ) AS q
  JOIN  `bigquery-public-data.stackoverflow.posts_answers` a 
  ON a.parent_id = q.base.id
  ),
  STRUCT(600 AS max_output_tokens, TRUE AS flatten_json_output)
);

Example output

The solution I got from the question how to convert tuple to list is:

def tuple_to_list(input_tuple):
  """
  Converts a tuple to a list.

  Args:
    input_tuple: The tuple to convert.

  Returns:
    A list containing the elements of the tuple.
  """
  return list(input_tuple)