Lexical noise: another reason why you shouldn't trust OpenAI
Take the sentence "Alice and Bob exercise merrily, she trains a lot."
The word "merrily" can be an example of lexical noise: typically superfluous patterns that do not explain the central themes of a text. Accordingly, removing such noise often improves the quality of the structured data.
Suppose somebody doubts whether "merrily" is a name (Merrily, an athlete who trains a lot and is being exercised by both Alice and Bob) or an adverb.
One wrong pattern can radically change everything!
With my patented lexical noise deletion, AI-parsing gets these FOUR patterns from the sentence:
- Alice exercise merrily - 0.25
- Bob exercise merrily - 0.25
- she trains a lot - 0.5
- Alice trains a lot - 0.5
AI sees that the word "merrily" is an adverb from its context and subtext, and from its dictionary-encyclopedia definitions.
Without the purging, not knowing whether "merrily" is a name or an adverb, AI-parsing gets FIVE patterns:
- Alice exercise merrily - 0.1(6)
- Bob exercise merrily - 0.1(6)
- merrily exercise merrily - 0.1(6)
- she trains a lot - 0.5
- Alice trains a lot - 0.5
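One reading of the weights in the two lists is that the first clause carries 0.5 and that this 0.5 is split evenly among the candidate subjects: two candidates give 0.25 each, three give 0.1(6) each. Here is a minimal Python sketch of that reading. The even-split rule is an assumption drawn from the numbers above, and `split_clause` and `CLAUSE_WEIGHT` are hypothetical names, not part of any described system. The sketch covers only the first clause; the "trains a lot" patterns keep 0.5 each in both lists.

```python
from fractions import Fraction

# Assumption: the first clause carries a total weight of 0.5.
CLAUSE_WEIGHT = Fraction(1, 2)


def split_clause(subjects, predicate, clause_weight=CLAUSE_WEIGHT):
    """Return one (pattern, weight) pair per candidate subject.

    The clause weight is divided evenly among the candidate subjects.
    """
    share = clause_weight / len(subjects)
    return [(f"{subject} {predicate}", share) for subject in subjects]


# "merrily" treated as an adverb: two candidate subjects.
with_purging = split_clause(["Alice", "Bob"], "exercise merrily")

# "merrily" possibly a name: three candidate subjects.
without_purging = split_clause(["Alice", "Bob", "merrily"], "exercise merrily")

for pattern, weight in with_purging + without_purging:
    print(f"{pattern}: {float(weight):.4f}")
```

Exact fractions are used so that 1/6 stays 1/6 instead of a rounded decimal; the printed values are rounded only for display.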
With the purging, AI gets TWO synonymous clusters; without it, THREE!
OpenAI cannot delete lexical noise, so its results are not trustworthy: OpenAI sees patterns rather than separate words, does not determine parts of speech, and does not use dictionary-encyclopedia definitions.
My AI database can, therefore, be 100% trusted.