
Generative AI is making some platforms useless

May 9, 2023


Intro

We've seen much excitement about generative AI tools like LLMs recently, but awareness of their abuse is not yet widespread. The hCaptcha research team has studied generative AI for years, and recently reviewed abuse in the wild. We found that many online services have no effective mitigation in place. This report covers one example of our detection experiments.

Freelance platforms like Upwork have long been a useful resource for people to connect with each other to accomplish tasks that require specific expertise, especially technical expertise.

The model of these platforms is for requesters of work to get multiple bids. Earnings are thus driven by the number of jobs someone bids on and the time taken to respond to a bid.

This creates a large incentive for providers to automate their side of the bidding process, submitting a canned response whenever a job matching their skills appears.

As students of automated abuse, our bet was that these platforms would soon be saturated by LLM spam, as generative AI lets bidders increase their bid volume and decrease their response time. We verified this by requesting a task on Upwork with a screening question and analyzing the bids.

Result:

Nearly 80% of all responses used an LLM, and 100% of answers to the screening question were LLM-generated.

None of the generated answers were correct. Platforms will need to rapidly adapt or risk becoming useless.

The experiment

To test our theory, we ran a quick experiment by posting a request on Upwork.

First, we constructed a post with a simple screening question that would take a domain expert five minutes or less to answer, and confirmed that all known LLMs failed to answer it correctly.

For each correct answer received, we planned to hire that person to write a tutorial on the subject. However, you'll see in a moment why this report contains no links to tutorials.

To be fair to applicants, we explicitly stated that the answer would be verified, and that no LLM would produce a valid response, so there was no point in submitting an LLM-generated answer.

We then posted it twice, using a 12-hour interval to get a diversity of replies across time zones. In both cases we used the site's skill restrictions to show it only to "experts" who were nominally qualified, setting ClickHouse, Data Science, and Machine Learning as the required skills.

If we detected that an LLM was used to generate the answer, we followed up to ask the bidder to confirm our detection, and gave them a chance to correct their answer if they chose.

The responses

We received responses from 14 unique bidders in total. 

9/14 (64%) of bidders answered the screening question.

9/9 (100%) of answers were generated with an LLM. (!)

0/9 were valid SQL. Common errors included hallucinated functions, hallucinated columns, invalid syntax, and the verbose, nonsensical explanations common to some well-known LLMs. (A constructed illustration follows at the end of this section.)

1/9 claimed to have run their invalid SQL, but withdrew their bid when we asked if an LLM was used. This was the only answer where we did not reach 100% confidence from their first message, but they later provided a revised query that was clearly generated and then acknowledged LLM use, confirming our initial detection.

One person failed to answer the screening question, but used an LLM to write a personalized response filled with falsehoods, and admitted this upon questioning. We therefore included this response in the denominator for the LLM-use and admission figures below.

4/10 (40%) of LLM users eventually admitted an LLM was used when asked. We obtained permission from one of them to reproduce a portion of both their initial response and our followup conversation, excerpted in anonymized form below.

6/10 (60%) either did not respond when we asked if an LLM was used, withdrew their bid, or denied it.

One bid was pure spam: a nonsensical message from a nearly empty profile with no prior work.

Another looked like spam, but did contain a word salad related to one of the topics in our post. Our analysis categorized it as the output of either a bad LLM prompt or a more old-fashioned spambot.

The spam rate was thus 2/14 (14%).

Several other profiles showed suspicious indicators (e.g. false identity, misrepresented credentials) outside the scope of this analysis.
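
To make those failure modes concrete, here is a constructed illustration in the spirit of the answers we received. It is not an actual submission; the table name is invented, and the hallucinated identifiers do not exist in ClickHouse:

    SELECT id,
           detectAnomalies(array_data) AS outliers  -- hallucinated function
    FROM samples
    WHERE z_score > 3.0                             -- hallucinated column
    ORDER BY;                                       -- invalid syntax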

Conclusion

In total, 10-11 of 14 submissions (71-79%) used an LLM, either to generate an answer to the screening question, or to write a request-specific cover letter filled with fabrications.

Is this a good thing? Clearly not. Screening questions can be made LLM-resistant, as we did here, but the individual requesters of work are not doing this en masse. This incentivizes bidders to abuse generative AI, and dramatically raises the noise floor on these platforms.

The onus is on the platforms to put countermeasures in place to keep their services useful, but they are evidently far behind at the moment.

Generative AI will bring advantages as it spreads into the world, but most online services are not yet prepared to mitigate the harmful use cases that are inevitable with any new technology.

Need to spot automation like this on your platform?

hCaptcha Enterprise stops the most sophisticated automated attacks. We'd be happy to help.

Sites with a direct financial incentive for abuse tend to be leading indicators, so we expect to see similar behavior spreading across many kinds of online services this year.

Also, we're hiring

Was the answer to this screening question obvious to you? Check out our open jobs.

Appendix A: Discussion

Sample size and follow-on research

This experiment had a small sample size and focused on early adopters within a single platform, so follow-on research is warranted to confirm and elaborate on these results.

The platforms themselves are best placed to do it, but unfortunately it seems unlikely they will publish their results. However, we hope others will expand on this work.

True skills of bidders vs. incorrect answers

Some of the bidders claimed to be highly credentialed. Would any have been able to complete the straightforward task we requested?

Possibly, but the cost of finding out has increased dramatically due to LLMs. Profiles are not thoroughly verified and can be impersonations, fictional, hacked, rented, or sold. If someone claims on their profile to be an assistant professor of computer science with a relevant PhD but fails to answer a relatively simple SQL question correctly, how do we validate their credentials?

Requesters of work will need a phone screen or other additional time and effort to form an opinion on the suitability of bidders, and in many cases this additional screening cost will exceed the benefit of using a freelance platform.

Appendix B: Experiment Design

The screening question

Inspired by an old article about anomaly detection with plain SQL, we provided a simple two-column example database structure with column types and asked for a valid query to find anomalies. We suggested a well-known formula, but allowed candidates to use any method they preferred, so long as it could execute on ClickHouse.

This question is doable for anyone with SQL skills, Wikipedia, and ClickHouse docs, but should take someone already familiar with these topics (the audience that saw it) only a few minutes.
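To make the style of task concrete, here is a generic analogue, not our actual question (which we are withholding, as explained below). The samples table, its columns, and the choice of the three-sigma rule are all invented for illustration; a query like this should run on a recent ClickHouse:

    -- Hypothetical two-column schema, in the spirit of the post:
    --   CREATE TABLE samples (id UInt32, array_data Array(Float64)) ENGINE = Memory;

    -- Flag values more than three standard deviations from the mean
    -- (the classic three-sigma rule); any sound method that executed
    -- on ClickHouse would have been accepted.
    WITH flattened AS
    (
        SELECT id, arrayJoin(array_data) AS value
        FROM samples
    ),
    stats AS
    (
        SELECT avg(value) AS mu, stddevPop(value) AS sigma
        FROM flattened
    )
    SELECT f.id, f.value
    FROM flattened AS f, stats AS s
    WHERE abs(f.value - s.mu) > 3 * s.sigma
    ORDER BY f.id;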

We also explicitly told bidders the results would be mechanically evaluated for correctness, in this case by checking whether the SQL was valid and produced anomalies from the array_data column.
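As a sketch of how such a mechanical check could work, again using the invented schema above: load a small fixture with a planted outlier, execute the candidate query, and inspect its result set. This is an illustration, not our actual harness:

    -- Same hypothetical schema as above.
    CREATE TABLE samples (id UInt32, array_data Array(Float64)) ENGINE = Memory;

    -- Twenty unremarkable readings between 10.0 and 10.4...
    INSERT INTO samples
    SELECT number AS id, [10 + (number % 5) / 10] AS array_data
    FROM numbers(20);

    -- ...plus one planted anomaly that any sound method should surface.
    INSERT INTO samples VALUES (999, [500.0]);

    -- A submission passes if ClickHouse parses and executes it, and its
    -- result set contains the planted value (id 999, 500.0) and none of
    -- the normal rows.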

We intentionally left the question slightly underspecified in order to let people pick whatever strategy they were familiar with, and would have accepted any reasonable output as correct.

No initial answer contained valid SQL that could execute on ClickHouse, so we did not need to perform this last evaluation.

We will not publish the screening question to avoid training dataset pollution when we re-run this experiment in the future, but researchers in the field are welcome to contact us for details.

Example LLM message from a bidder

Note: we received permission from this bidder to reproduce both anonymized portions of their bid and the conversation below.

Hi,

I am sending you a message because I came across your job post for anomaly detection with ClickHouse and it has piqued my interest.
<Standard intro deleted>

During my career, I have worked with dozens of blue-chip level enterprises as well as small and medium-sized businesses, across a wide range of industries and in <Number> countries all over the world. I have worked on multiple projects involving ClickHouse and anomaly detection.

Here are a few successful deliveries that fit your requirement:

- Customer Churn Analysis for a Telecom Company
- Challenge: The telecommunications company was experiencing a high rate of customer churn and needed to identify their most problematic areas to improve customer retention.
- Solution: I analyzed their customer data and provided a custom ClickHouse solution that detect the anomalies as soon as they emerged.
- Results: The implemented solution reduced churn rate by 40%.

- Fraud Detection for an E-commerce
- Challenge: The e-commerce platform faced high order cancellations due to fraud transactions.
- Solution: I proposed a ClickHouse-based real-time anomaly detection system to detect fraudulent transactions using <suggested algorithm> for running variance analysis.
- Results: The implementation successfully detected more than 90% of fraudulent transactions and reduced word cancellation rate by 60%.

<More standard copy deleted>

Regardless of us ultimately working together, I take pride in always leaving prospects like yourself with valuable insights that you can apply directly and help you move forward.

Specifically, I'll provide you with:

- Best practices for real-time anomaly detection using ClickHouse
- Specific recommendations on implementing <suggested algorithm> for running variance analysis

Did you notice anything there? Seems like awfully germane prior expertise!

This nearly verbatim rephrasing of the prompt was obvious LLM fiction to us, but non-specialists could certainly be misled.

To confirm the analysis, we followed up:

Us: Looks like you used an LLM to write this.

Them: Hi, many apologies for subjecting yourself to that.

As an AI first org, we are currently experimenting with leveraging the technology across all operations, incl. outreach, however: for strictly mutually beneficial purposes. While the information in the propsoal is 100% accurate, it should have never been sent live, hence it was so utterly uncalibrated and unrelated to your job post. (And pretty poor overall, I should add).

Again, genuinely sorry!

Us: Seems risky. Probably all of the relevant information there is hallucinated ("Solution: I proposed a ClickHouse-based real-time anomaly detection system to detect fraudulent transactions using <suggested algorithm> for running variance analysis.") so you will end up bidding on work you may not be qualified for.

Them: 100%. We're trying to see if and to which extent we can make it adhere to strict guidelines via through smart prompting and other mitigation measures. I am personally very keen about experimenting and get a lot of personal satisfaction from it, especially these days. Just need to stop doing so late at night, when it's indeed very risky (in that case, a flag for "test mode" not being set). Again, apologies, it never was designed to be sent, we are far from there (if we ever get there).

