Well-established, proven dictionaries and encyclopedias
You see, companies like OpenAI prefer to annotate using random texts mined from who knows where, which in practice means building dictionaries and encyclopedias (for annotation) from scratch. For this they use astronomical volumes of raw text, gigabytes upon gigabytes. I tried this method while preparing for NIST TREC QA and came to the conclusion that standard dictionaries and encyclopedias are more suitable, not least because they are compact. For example, Merriam takes only 25 megabytes, and it is nicely indexed.
In particular, general dictionaries and encyclopedias are very good because they contain an absolute minimum of bias. Thus, if you want to avoid AI racism completely, or at least minimize its manifestations, you must use well-established, proven dictionaries and encyclopedias rather than random texts.