Explaining how and where you got your code vs. Explainable AI
These are the same problem.
Solving it would require a lot of work up front, in the initial setup of the training data.
You will not be able to solve it using training data that has not been completely vetted and cleansed.
The training data needs to be correct, complete, and fully attributed.
Unfortunately, that is not something that gets done effectively in today's mad scramble to create cool toys.
Each training sample would need to be a record along these lines: code sample x, made by author y, does z using language w.
Some piece of code.
What this code does:
a description of what this code does,
a description of the inputs,
a description of the outputs,
a description of the purpose.
What language is this in?
The author of this code, and the available licenses for this code.
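As a rough illustration only, the record described above could be sketched as a small Python data structure. The class name TrainingSample, the field names, and the missing_fields helper are all hypothetical, chosen here to mirror the x, y, z, w items; they do not come from any existing dataset or tool.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class TrainingSample:
    """One vetted training record (hypothetical schema, for illustration)."""
    code: str                  # the piece of code itself (x)
    description: str           # what this code does (z)
    inputs: str                # description of inputs
    outputs: str               # description of outputs
    purpose: str               # description of purpose
    language: str              # what language this is in (w)
    author: str                # author of this code (y)
    licenses: Tuple[str, ...]  # available licenses for this code


def missing_fields(sample: TrainingSample) -> List[str]:
    """Return the names of fields that are empty, in declaration order."""
    return [f.name for f in fields(sample) if not getattr(sample, f.name)]
```

A sample would only count as "fully attributed" when missing_fields returns an empty list. Checking for emptiness says nothing about whether the values are correct, which is the part that needs human vetting.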
To create the code generator, you need to generate a suggested x given z and w:
Description of what this code needs to do
What inputs are available?
What outputs are required?
How should it behave?
What language are we generating in?
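The request side could be sketched the same way. This is a minimal sketch, assuming the five questions above are simply collected and flattened into text; GenerationRequest, its field names, and the to_prompt format are hypothetical and not tied to any particular code generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRequest:
    """What the user asks the generator for (hypothetical schema)."""
    task: str      # description of what the code needs to do (z)
    inputs: str    # what inputs are available
    outputs: str   # what outputs are required
    behavior: str  # how it should behave
    language: str  # what language we are generating in (w)


def to_prompt(req: GenerationRequest) -> str:
    """Flatten the request into plain text, one labeled line per field."""
    return "\n".join([
        f"Task: {req.task}",
        f"Inputs: {req.inputs}",
        f"Outputs: {req.outputs}",
        f"Behavior: {req.behavior}",
        f"Language: {req.language}",
    ])
```

The point of keeping the request fields parallel to the training record fields is that z and w line up on both sides, so a request can be compared directly against the descriptions in the training data.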
This produces generated code.
The questions to ask about this generated code are:
Which code samples from the training data are most likely to have contributed to the output?
Who are the authors, and what are the licensing terms, of the code that most likely contributed to the output?
Generating code samples is something we seem to know a bit about how to do.
Identifying which code samples most likely contributed to the output is a second challenge, one that could be met by further research. This would likely require extensive supervised training: take sample training data, generate sample output from it, and have human evaluators compare that output to the training data and identify the likely contributors. Enough of this could train a model to identify likely contributing code samples given generated code and a training corpus, just as models can be developed for attribution purposes, such as identifying whether some piece of prose might have been written by Shakespeare.
The result can then be linked to the author and license information attached to the training code samples.
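To make the shape of that lookup concrete, here is a crude stand-in. It is not the trained attribution model described above: it ranks training samples by plain token overlap (Jaccard similarity) with the generated code and returns the author and license attached to each. The function names and the dictionary keys "code", "author", and "license" are assumptions made for this sketch.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(code: str) -> Set[str]:
    """Split code into a set of identifier and number tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_contributors(
    generated: str, corpus: List[Dict[str, str]], top_k: int = 3
) -> List[Tuple[float, str, str]]:
    """Rank corpus entries by token overlap with the generated code.

    Returns up to top_k (score, author, license) tuples, highest score first.
    Token overlap is only a placeholder for a learned attribution model.
    """
    gen = tokens(generated)
    scored = [
        (jaccard(gen, tokens(entry["code"])), entry["author"], entry["license"])
        for entry in corpus
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

Surface similarity like this would miss a contributor whose code was paraphrased or renamed, which is exactly why the human-labeled training described above would be needed. The sketch only shows where author and license information would come out once a better scoring function is available.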
This is a vastly greater amount of work than just training a model on any Tom, Dick, and Harry code set scraped from an essentially random source.
There is no easy shortcut for this, unless you are happy with the untoward consequences of not caring about consequences.