r/MachineLearning · 1 min read

Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library? [d]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

Hello everyone,

Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library?

I am working on a project idea related to library-specific code generation. The concrete case is a specific Python library used in a technical/scientific domain. The goal would be to improve and evaluate how well code-generation models can use this library correctly.

I am trying to understand the legal / Terms of Service boundary around using OpenAI API outputs in two different scenarios:

Scenario 1: Silver dataset for fine-tuning an OSS model

Use the OpenAI API to generate programming tasks, reference solutions, and verification tests for the specific Python library.

The generated examples would then be human-reviewed, filtered, and validated. The resulting silver dataset would be used to fine-tune an open-source code model, with the goal of improving its performance on this specific library.
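To make the "filter and validate" step concrete, here is a minimal sketch of an automated pre-filter that could run before human review. It assumes each generated example is a dict with hypothetical `solution` and `tests` keys holding Python source, and that the tests are plain `assert` statements; neither the field names nor the helper functions come from any existing tool. It also runs the code unsandboxed, so in practice generated code should be executed in an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_own_tests(solution: str, tests: str, timeout: float = 10.0) -> bool:
    """Run a reference solution followed by its tests in a fresh interpreter.

    Returns True only if the script exits with status 0 within the timeout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Treat hanging examples as failures rather than blocking the run.
            return False
        return result.returncode == 0


def filter_candidates(candidates: list[dict]) -> list[dict]:
    """Keep only examples whose reference solution passes its own tests."""
    return [c for c in candidates if passes_own_tests(c["solution"], c["tests"])]
```

Examples that survive this check would still go to human review; the filter only removes candidates whose reference solution fails its own tests.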

My question: would this violate OpenAI’s terms because the API outputs are being used to train/fine-tune another coding model, even if the scope is narrow and library-specific?

Scenario 2: Benchmark only, not training

Use the OpenAI API to generate programming tasks, reference solutions, and verification tests.

The examples would be human-reviewed and validated, and the resulting dataset would be used only as an evaluation benchmark to compare different models. It would not be used to fine-tune or train any model.
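As a rough illustration of what "compare different models" would mean here, the sketch below ranks models by the fraction of benchmark tasks they pass. It assumes per-task pass/fail outcomes have already been collected for each model; the function names and the model labels in the usage note are hypothetical.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of benchmark tasks passed; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def compare_models(results: dict[str, list[bool]]) -> list[tuple[str, float]]:
    """Return (model name, pass rate) pairs, best model first."""
    scored = [(name, pass_rate(outcomes)) for name, outcomes in results.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, `compare_models({"model_a": [True, False, True, True], "model_b": [True, False, False, False]})` ranks `model_a` first. In this scenario the API outputs only define the tasks and tests; no model weights are updated.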

My question: is this generally considered allowed under OpenAI’s terms, assuming the benchmark is properly reviewed and documented as AI-assisted?

I understand that Reddit is not legal advice, and I would still contact OpenAI or legal counsel for a definitive answer. However, I thought people who have already faced similar situations in practice might have useful insights.

submitted by /u/ororo88
