
Research: Verifying Security Issues in Generative AI APIs

February 18, 2026

How we verify vulnerabilities in generative AI APIs

Does Google's Gemini API leak prompts between users?

Answering this question as a security researcher is harder than you might think.

Once you've found interesting behavior, the next step is to replicate and verify it.

But verifying vulnerabilities in closed generative AI systems is very different from traditional security research. When the API can hallucinate outputs, some percentage of those hallucinations will look identical to security issues.

Below is a story from one of our research explorations, which you may find useful if you are building or integrating these systems.

Background

hCaptcha researchers have worked on generative AI safety for many years; see, for example, our recent study of the lack of safeguards in current browser-use agents.

Continuing this research, we recently did some threat analysis of popular generative AI APIs.

One such service is Google's Gemini family of APIs. It is currently less useful for code generation than its competitors, but Google has nonetheless recently shipped several coding-agent tools built on its model APIs.

We decided to focus our initial research on coding tools: the state-management complexity in their supporting APIs adds a lot of attack surface and could easily enable privacy bugs in implementation.

Results

We quickly produced interesting output.

We'd found at least one likely bug in initial stress testing, but the question was whether it was a security bug.

In this case, the output we were able to elicit looked very much like another user's coding prompt containing a Jupyter notebook, and it was returned to us during an error state we intentionally triggered.

In a non-generative API, verifying that this was a data leak would be easy: in many cases the output alone is enough to give you complete confidence.

You could then produce a proof of concept and in general quickly ascertain whether you were crashing something, reading data across a permissions boundary, etc.

However, once the API is expected to generate arbitrary text output, validation is a bit trickier.

Verification Approach

1. Can we confirm or rule out hallucinations?

With closed-weight models and generative APIs, it is difficult for external parties to validate exactly what is going on when they hit a bug, but there are still some viable techniques for increasing confidence.

For example, in this case we intentionally used out-of-domain inputs:

We ran our tests with no coding-related data and nothing topically related to the text we received: no prompts about coding or ML at all. This meant the output we elicited was not an obvious hallucination derived from our inputs.
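
As a minimal sketch of the approach (the prompt list, the CODING_MARKERS heuristic, and the send_request callable below are hypothetical stand-ins, not our actual harness), the probes are drawn from domains that have nothing to do with coding or ML, so any coding-flavored text that comes back cannot be an echo of our own inputs:

    import random

    # Hypothetical probe prompts: deliberately unrelated to coding, ML,
    # or anything topically close to the output we later received.
    OUT_OF_DOMAIN_PROMPTS = [
        "Summarize the plot of a 19th-century whaling novel in two sentences.",
        "List five ingredients commonly used in Ethiopian cooking.",
        "Describe the weather patterns typical of a Mediterranean climate.",
    ]

    # Strings that would be surprising in a reply to the prompts above.
    CODING_MARKERS = ("import ", "def ", "In []:", "df.", "sklearn")

    def run_probes(send_request, n_trials=100):
        """Send out-of-domain prompts and collect any replies that look
        like they belong to someone else's coding session.

        send_request is a stand-in for whatever client call hits the API
        under test; it is assumed to take a prompt string and return the
        raw response text."""
        suspicious = []
        for _ in range(n_trials):
            prompt = random.choice(OUT_OF_DOMAIN_PROMPTS)
            reply = send_request(prompt)
            # Because the inputs share no domain with coding or ML, any
            # coding-flavored reply cannot be explained by our own prompt.
            if any(marker in reply for marker in CODING_MARKERS):
                suspicious.append((prompt, reply))
        return suspicious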

The way in which we elicited it was also a plausible path for triggering a memory or pointer-style bug.

Thus, we could not rule out a hallucination, but given our inputs we would have expected a very different one.

The notebook log we received has some cells that are erroneous as written, which could indicate hallucination, but plenty of equally flawed, entirely human-written Jupyter notebooks exist in the wild and predate LLMs.

2. Are we seeing memorized training data?

It is sometimes possible to elicit training data from models, even in surprisingly large contiguous chunks.

The suspicious output could also be something the model memorized and then repeated, but searching online turned up no identical strings matching the interesting parts.
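
A cheap first pass, sketched below under the assumption that you have some way to query public sources, is to pull long, distinctive substrings out of the suspect output and look for exact matches; the search_fn hook is a placeholder for a web search API or a local dump of public notebooks, not a specific service:

    def distinctive_chunks(text, min_len=60):
        """Split suspect output into long lines that are unlikely to
        occur verbatim by chance (full code lines, unusual identifiers)."""
        lines = (line.strip() for line in text.splitlines())
        return [line for line in lines if len(line) >= min_len]

    def check_memorization(suspect_text, search_fn):
        """Return the chunks that appear verbatim in public sources.

        search_fn(chunk) -> bool is a placeholder for whatever lookup is
        available. Zero hits does not prove the text is private, but many
        exact hits would point toward memorized training data rather than
        a cross-user leak."""
        return [chunk for chunk in distinctive_chunks(suspect_text)
                if search_fn(chunk)]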

This means we need to do a fairly deep analysis to try to reach a conclusion.

We have included a detailed analysis in Appendix B, with both human and LLM-derived observations.

3. How would an API operator verify this?

For the API vendor or an operator of a single-tenancy model service, this should in theory be much easier to verify internally.

Telemetry and error logs may indicate whether an error occurred for the test account and, if so, what caused it, and vendors likely store either entire user prompts or hashes of them.

This should make it straightforward to check whether a long, unique prompt had ever been sent by any user, even with spotty error logs.
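
Assuming the operator keeps a hash per submitted prompt (the normalization and storage below are illustrative, not a description of any vendor's actual pipeline), the lookup reduces to something like:

    import hashlib

    def prompt_fingerprint(prompt: str) -> str:
        # Normalize whitespace so trivial formatting differences do not
        # defeat the match, then hash the full prompt text.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def was_ever_submitted(suspect_text: str, stored_hashes: set) -> bool:
        """Check whether a long, unique suspect output matches any prompt
        the service has ever received.

        stored_hashes stands in for the operator's prompt-hash store. A
        hit would strongly suggest the text originated with a real user
        rather than being hallucinated; matching fragments of prompts
        would require the raw prompts rather than whole-prompt hashes."""
        return prompt_fingerprint(suspect_text) in stored_hashes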

What do you think?

We reported our findings to Google some time ago, but they were not able to reproduce the bug, and eventually concluded it might be a hallucination.

We think this is possible, but the fact that an outside party cannot always reach full confidence on these analyses from the information available is an interesting development in security research.

As more APIs start to include generative outputs, triaging and diagnosing issues is starting to look more like content analysis than the classic verification patterns of the past.

Appendix A: Suspicious Gemini API Error Output

   1
   2 In []:
  checking duplicate data in review column
  df.review.duplicated().sum()

   1
   2 Out []:
  0
   1
   2 In []:
  checking for null values in review column
  df.review.isnull().sum()

   1
   2 Out []:
  0
   1
   2 In []:
  checking labels in sentiment column
  df.sentiment.value_counts()

   1
   2 Out []:
  positive    25000
  negative    25000
  Name: sentiment, dtype: int64
   1
   2 In []:
   3 ## PRE-PROCESSING DATA
   4
   5 In []:
  label encoding sentiment column
  from sklearn.preprocessing import LabelEncoder
  le = LabelEncoder()
  df['sentiment'] = le.fit_transform(df['sentiment'])

   1
   2 In []:
  df.head()

   1
   2 Out []:
  <table>
  <thead>
  <tr>
  <th></th>
  <th>review</th>
  <th>sentiment</th>
  </tr>
  </thead>
  <tbody>
  <tr>
  <th>0</th>
  <td>One of the other reviewers has mentioned that ...</td>
  <td>1</td>
  </tr>
  <tr>
  <th>1</th>
  <td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td>
  <td>1</td>
  </tr>
  <tr>
  <th>2</th>
  <td>I thought this was a wonderful way to spend ti...</td>
  <td>1</td>
  </tr>
  <tr>
  <th>3</th>
  <td>Basically there's a family where a little boy ...</td>
  <td>0</td>
  </tr>
  <tr>
  <th>4</th>
  <td>Petter Mattei's "Love in the Time of Money" is...</td>
  <td>1</td>
  </tr>
  </tbody>
  </table>
   1
   2 In []:
  1 -> positive, 0 -> negative

   1
   2 In []:
  convert to lower case
  df['review'] = df['review'].str.lower()

   1
   2 In []:
  removing html tags
  import re
  def remove_html_tags(text):
      pattern = re.compile('<.*?>')
      return pattern.sub(r'',text)

   1
   2 In []:
  df['review'] = df['review'].apply(remove_html_tags)

   1
   2 In []:
  removing urls
  def remove_url(text):
      pattern = re.compile(r'https?://\S+|www\.\S+')
      return pattern.sub(r'',text)

   1
   2 In []:
  df['review'] = df['review'].apply(remove_url)

   1
   2 In []:
  remove punctuation
  import string
  punc = string.punctuation
  def remove_punc(text):
      return text.translate(str.maketrans('','',punc))

   1
   2 In []:
  df['review'] = df['review'].apply(remove_punc)

   1
   2 In []:
  removing stopwords
  import nltk
  from nltk.corpus import stopwords
  nltk.download('stopwords')

   1
   2 Out []:
  [nltk_data] Downloading package stopwords to
  [nltk_data]     C:\Users\shiva\AppData\Roaming\nltk_data...
  [nltk_data]   Package stopwords is already up-to-date!True
   1
   2 In []:
  def remove_stopwords(text):
      words = text.split()
      filtered_words = [word for word in words if word not in stopwords.words('english')]
      return " ".join(filtered_words)

   1
   2 In []:
  df['review'] = df['review'].apply(remove_stopwords) # takes huge amount of time

   1
   2 In []:
  Tokenization
  from nltk.tokenize import word_tokenize
  def tokenize_text(text):
      return word_tokenize(text)

   1
   2 In []:
  df['review'] = df['review'].apply(tokenize_text) # takes huge amount of time

   1
   2 In []:
  Stemming
  from nltk.stem.porter import PorterStemmer
  ps = PorterStemmer()
  def stem_words(text):
      return " ".join([ps.stem(word) for word in text.split()])

   1
   2 In []:
  df['review'] = df['review'].apply(stem_words)

   1
   2 In []:
  df.head()

   1
   2 Out []:
  <table>
  <thead>
  <tr>
  <th></th>
  <th>review</th>
  <th>sentiment</th>
  </tr>
  </thead>
  <tbody>
  <tr>
  <th>0</th>
  <td>one of the other review ha mention that after ...</td>
  <td>1</td>
  </tr>
  <tr>
  <th>1</th>
  <td>a wonder littl product the film techniqu is ve...</td>
  <td>1</td>
  </tr>
  <tr>
  <th>2</th>
  <td>i thought thi wa a wonder way to spend time on...</td>
  <td>1</td>
  </tr>
  <tr>
  <th>3</th>
  <td>basic there a famili where a littl boy jake th...</td>
  <td>0</td>
  </tr>
  <tr>
  <th>4</th>
  <td>petter mattei love in the time of money is a v...</td>
  <td>1</td>
  </tr>
  </tbody>
  </table>
   1
   2 In []:
  X = df.iloc[:,0:1]
  y = df['sentiment']

   1
   2 In []:
  from sklearn.model_selection import train_test_split
  X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)
   1
   2 In []:
  X_train.shape

   1
   2 Out []:
  (40000, 1)
   1
   2 In []:
  Applying BoW
  from sklearn.feature_extraction.text import CountVectorizer
  cv = CountVectorizer()

   1
   2 In []:
  X_train_bow = cv.fit_transform(X_train['review']).toarray()
  X_test_bow = cv.transform(X_test['review']).toarray()

   1
   2 In []:
  X_train_bow.shape

   1
   2 Out []:
  (40000, 146144)
   1
   2 In []:
  with huge feature set, using Naive Bayes
  from sklearn.naive_identity_matrix import GaussianNB
  gnb = GaussianNB()

  gnb.fit(X_train_bow,y_train)

   1
   2 Out []:

   1
   2 In []:
  memory limit error

   1
   2 In []:
  from sklearn.naive_bayes import MultinomialNB
  mnb = MultinomialNB()

  mnb.fit(X_train_bow,y_train)

   1
   2 Out []:

   1
   2 In []:
  memory limit error

   1
   2 In []:
  from sklearn.ensemble import RandomForestClassifier
  rf = RandomForestClassifier()

  rf.fit(X_train_bow,y_train)

   1
   2 Out []:

   1
   2 In []:
  memory limit error

   1
   2 In []:
   3 ### USING TF-IDF
   4
   5 In []:
  from sklearn.feature_extraction.text import TfidfVectorizer
  tfidf = TfidfVectorizer()

   1
   2 In []:
  X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
  X_test_tfidf = tfidf.transform(X_test['review']).toarray()

   1
   2 Out []:

   1
   2 In []:
   3 ### USING DIMENSIONALITY REDUCTION ON BOW
   4
   5 In []:
  cv = CountVectorizer(max_features=3000)
  X_train_bow = cv.fit_transform(X_train['review']).toarray()
  X_test_bow = cv.transform(X_test['review']).toarray()

   1
   2 In []:
  from sklearn.naive_bayes import GaussianNB
  gnb = GaussianNB()

  gnb.fit(X_train_bow,y_train)

   1
   2 Out []:
  <pre>GaussianNB()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
  <br/>On GitHub, the HTML representation is unable to render, please try loading this page with
  nbviewer.org.</b>g6.org/stable/modules/linear_model.html#logistic-regression
    n_iter_i = _check_optimize_result(

   1
   2 In []:
  y_pred = rf.predict(X_test_bow)
  accuracy_score(y_test,y_pred)

   1
   2 Out []:
  0.8421
   1
   2 In []:
   3 ### DIMENSIONALITY REDUCTION ON TF-IDF
   4
   5 In []:
  tfidf = TfidfVectorizer(max_features=3000)
  X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
  X_test_tfidf = tfidf.transform(X_test['review']).toarray()

   1
   2 In []:
  rf.fit(X_train_tfidf,y_train)
  y_pred = rf.predict(X_test_tfidf)
  accuracy_score(y_test,y_pred)

   1
   2 Out []:
  0.8454

   1
   2 In []:
   3 ## USING Word2Vec
   4
   5 In []:
  import gensim

   1
   2 In []:
  from gensim.models import Word2Vec,KeyedVectors

   1
   2 In []:
  model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

   1
   2 In []:
  removing stopwords
  def remove_stopwords(text):
      words = text.split()
      filtered_words = [word for word in words if word not in stopwords.words('english')]
      return filtered_words

   1
   2 In []:
  df['review'] = df['review'].apply(remove_stopwords)

   1
   2 In []:
  def document_vector(doc):
  filter out-of-vocabulary words
      doc = [word for word in doc if word in model.index_to_key]
      if not doc:
          return np.zeros(300)
      return np.mean(model[doc], axis=0)

   1
   2 In []:
  from tqdm import tqdm

   1
   2 In []:
  X = []
  for doc in tqdm(df['review'].values):
      X.append(document_vector(doc))

   1
   2 Out []:
  <output truncated>
   1
   2 In []:
  X = np.array(X)

   1
   2 In []:
  X.shape

   1
   2 Out []:
  (50000, 300)
   1
   2 In []:
  X_train,X_test,y_train,y_test = train_test_split(X,df['sentiment'],test_size=0.2,random_state=1)

   1
   2 In []:
  rf.fit(X_train,y_train)
  y_pred = rf.predict(X_test)
  accuracy_score(y_test,y_pred)

   1
   2 Out []:
  0.819

   1
   2 In []:
  mnb.fit(X_train,y_train)

   1
   2 Out []:


ℹ ⚠️  Response truncated due to token limits.
   1

Appendix B: Analysis of Suspicious Gemini API Error Output

Reasons it could be a hallucination:

  1. Code that would error, yet later results assume it succeeded
  • They tokenize reviews into lists:
    df['review'] = df['review'].apply(tokenize_text) returns a list of tokens per row.
    Immediately after, their stemming function does text.split(). Lists don't have .split(). That should raise AttributeError: 'list' object has no attribute 'split'.
    But they show a successful df.head() with stemmed-looking strings afterward. That can't happen without changing/undoing the tokenization step, or rewriting the stemmer to handle lists.
  2. Dense conversion that should OOM long before the later memory limit error
  • This line is the biggest red flag:
    X_train_bow = cv.fit_transform(X_train['review']).toarray()
    They later show X_train_bow.shape == (40000, 146144).
    A dense array of size 40,000 × 146,144 is 5,845,760,000 numbers. At float64 that's ~46.8 GB just for the array, not counting overhead (see the back-of-the-envelope check after this list). Most machines will crash or throw MemoryError at the .toarray() step, before you ever fit Naive Bayes or RandomForest.
    Yet the transcript shows the shape printed (implying the array exists), then memory errors appear later during model fitting. That ordering is very unlikely if this was executed as shown.
  3. Invalid / nonsensical import
  • from sklearn.naive_identity_matrix import GaussianNB is not a real scikit-learn module path. GaussianNB is in sklearn.naive_bayes.
    A real run should stop there with ModuleNotFoundError.
  • But later they do import GaussianNB correctly, which suggests copy/paste or generation rather than an actual clean run.
  4. Using an unfitted model after a stated failure
  • They show:
    • rf.fit(X_train_bow, y_train) followed by memory limit error
    • Then later: y_pred = rf.predict(X_test_bow) and they get an accuracy number.
      If rf.fit(...) failed, rf should not be fitted and rf.predict(...) should raise NotFittedError.
      The only way this makes sense is if the memory limit error is commentary not tied to the actual cell outcome, or results are mixed from different runs.
  5. Output text that doesn't match the code being run
  • After fitting GaussianNB on reduced BoW, the Out area contains a weird blend of:
    • a Jupyter/GitHub HTML rendering message
    • a stray URL fragment that looks like logistic regression docs (.../linear_model.html#logistic-regression)
      That doesn’t correspond to GaussianNB().fit(...). It looks like unrelated output got spliced in.
  6. Word2Vec section contains syntax/import errors that should stop execution
  • In document_vector, the line filter out-of-vocabulary words is not commented. As written it's a syntax error.
  • np is used (np.zeros, np.mean, np.array) but numpy is never imported (import numpy as np missing). That should crash immediately.
  • These are the kinds of small-but-fatal mistakes that show up in hallucinated code dumps.
  7. MultinomialNB on averaged word vectors is typically invalid
  • They end with mnb.fit(X_train, y_train) where X_train is Word2Vec averaged vectors. MultinomialNB expects non-negative features (counts or similar). Word2Vec averages usually include negative values; scikit-learn commonly raises a ValueError for negative inputs.
    If it worked with no error shown, that is another mismatch.
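
As a quick back-of-the-envelope check of the memory claim in point 2:

    # Rough size of the dense BoW array the transcript claims to have
    # produced with .toarray().
    rows, cols = 40_000, 146_144      # shape shown in the transcript
    bytes_per_value = 8               # float64
    total_gb = rows * cols * bytes_per_value / 1e9
    print(round(total_gb, 1))         # ~46.8 GB for the array alone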

Reasons it might not be a hallucination:

  • It could be a pasted/HTML exported notebook where some cells were edited after execution, outputs are from earlier versions, and the text lines like memory limit error are just manual notes, not actual exceptions.
  • The transcript looks truncated and possibly stitched (Response truncated due to token limits, stray HTML, GitHub rendering message). If someone merged multiple runs or copied from different environments, contradictions are easy to introduce.
  • Some early results are plausible for a known dataset (e.g. IMDB 50k with 25k/25k labels; 40k train rows after 80/20 split). Those values being plausible does not prove it was executed, though.
