Re: You could duplicate it ...
But how do you (or even can you) embed some kind of Rosetta Stone that bridges the gap between this material and the advanced content?
That's not what the Rosetta Stone provides - the analogy isn't apt. The Rosetta Stone is useful because it presented the same message in multiple encodings (languages), some of which were known to the readers. Obviously it's possible to do that with DNA storage.
The larger problem - making the message recognizable to readers who aren't familiar with the encoding - is underspecified. More-complete versions are unsolvable, moderately difficult, or relatively easy, depending on the constraints.
The unsolvable version: Make a message such that arbitrary readers, with just the assumption that they possess technology sufficient to sequence DNA, can with high probability recover a large portion of the message, and interpret it to a meaning that's close (within some epsilon) to the meaning intended by the author. That's simply not achievable. There's no guarantee that such a reader will recognize the DNA structure; that it will have an understanding of communication such that it perceives a message; that it will be capable of comprehending human psychology; etc. It's even conceivable that such a reader wouldn't recognize Reed-Solomon. Any sufficiently alien intelligence is ineffable.
The simple version: Make a message that can be decoded by a reader with a good understanding of human communication and the culture of the present moment, and access to a broad corpus of cultural products from the present. Reduce the interpretation requirement to recovering at least a substantial portion of the message with high probability and good accuracy. This is simple: write a message using a straightforward encoding such as ASCII,[1] using one of the dominant natural languages (e.g. English). Write it using straightforward prose with a lot of redundancy. You can also employ a Rosetta-Stone mechanism by translating it into other natural languages. Here we can safely assume Reed-Solomon[2] will be recognized. Even if your language(s) of choice is no longer in general use, the corpus of cultural materials should suffice to make the message intelligible.
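To make the "straightforward encoding" concrete, here's a minimal sketch of ASCII text mapped onto bases, prior to any ECC layer. The two-bits-per-base assignment (A=00, C=01, G=10, T=11) and the function names are my own assumptions for illustration; nothing above depends on that particular mapping.

```python
# Hypothetical mapping, chosen only for illustration: A=00, C=01, G=10, T=11.
BASES = "ACGT"


def text_to_dna(text: str) -> str:
    """Encode ASCII text as four bases per byte, most significant bits first."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_text(dna: str) -> str:
    """Invert text_to_dna; the strand length must be a multiple of four."""
    if len(dna) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("ascii")
```

A reader who has guessed "four symbols, so two bits each" has only 24 possible base assignments to try, which is a trivial search given the corpus assumed in this scenario.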
Between those two you have the moderately difficult versions. You can assume an alien intelligence with relatively little access to human cultural materials, etc. So add information and redundancy to the message, by including more data and by making better use of the channel. For example, if we assume the reader has, or can recognize, some form of visual two-dimensional communication, we can DNA-encode monochrome line drawings with a procedure like this:
- Pick two "flag" sequences for delimiting drawings and rows. Make them clearly artificial and low-information-entropy, such as a thousand adenines (let's call this "AK") and a thousand cytosines ("CK"). Those should attract the attention of anyone sequencing the string.
- Within a picture, use the other two bases (T and G) for your two "colors". (We can assume this is prior to ECC without loss of generality, since we've already assumed the reader recognized and decoded the ECC layer.)
- Start and end the image with AK.
- Sequence a row of (one-bit) "pixels" at a time, delimited by CK. All rows are the same length, so that the reader can recognize the rectangular shape.
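The procedure above can be sketched in a few lines. This is an illustration, not a spec: the function names, the '0'/'1' input format, and the choice of T for '0' and G for '1' are my assumptions, and the flag length is a parameter so that short examples stay readable (it defaults to the thousand bases suggested above).

```python
import re


def encode_drawing(rows, flag_len=1000):
    """Encode a monochrome drawing, given as equal-length strings of '0'/'1'.

    The drawing is wrapped in runs of A ("AK") and rows are separated by
    runs of C ("CK"). Pixels use T and G (an arbitrary assignment).
    """
    if not rows or not rows[0] or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("rows must be non-empty and all the same length")
    ak = "A" * flag_len
    ck = "C" * flag_len
    pixel = {"0": "T", "1": "G"}  # KeyError on anything other than '0'/'1'
    encoded_rows = ["".join(pixel[p] for p in row) for row in rows]
    return ak + ck.join(encoded_rows) + ak


def decode_drawing(strand):
    """Recover the rows (as strings of T/G) without knowing the flag length."""
    match = re.fullmatch(r"A+([TGC]+)A+", strand)
    if match is None:
        raise ValueError("no A-delimited drawing found")
    rows = re.split(r"C+", match.group(1))
    if "" in rows or len({len(r) for r in rows}) != 1:
        raise ValueError("rows are not a clean rectangle")
    return rows
```

Note that the decoder never needs to be told the flag length or the row width; it only relies on the runs being homogeneous and the rows being equal in length, which is the point of making the delimiters so obviously artificial.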
Now, this image may be interpreted under various affine transformations. The reader won't know which base is "black" and which is "white" (which is irrelevant), or what the orientation of the "rows" is. So the picture may be reflected in either or both dimensions, and so on.[3]
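For a rectangular grid the ambiguity is small. Here's a rough sketch that lists the candidate orientations a reader would have to consider, assuming only reflections and quarter-turn rotations (my simplification; the function name is made up). Swapping the two "colors" is left out, since it doesn't change the shapes in the drawing.

```python
def candidate_views(grid):
    """Return the eight reflections/rotations of a rectangular grid.

    Views come in pairs: each quarter-turn rotation, then its left-right
    mirror image. Four of the eight are transposed (rows become columns).
    """
    views = []
    current = [list(row) for row in grid]
    for _ in range(4):
        views.append(current)
        views.append([row[::-1] for row in current])  # left-right mirror
        # Rotate 90 degrees clockwise: reverse the row order, then transpose.
        current = [list(col) for col in zip(*current[::-1])]
    return views
```

Eight candidates is few enough that a reader can simply render all of them and look at which ones resemble anything.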
There are all sorts of ways of increasing the total information entropy and decreasing the entropy density in the message, both of which make it easier to derive an interpretation. Some encodings (pure-binary numbering systems, for example) are more "natural" than others because they don't depend on arbitrary language or cultural features and can be derived from basic mathematical forms, so you use those where feasible. And so on.
[1] Even if the requirements for this scenario didn't more or less ensure that ASCII is recognizable, it's easy for a reader with access to print text to decode ASCII by symbol frequency.
[2] Or any other purely mathematical ECC, such as group codes. We wouldn't want to use an ECC mechanism that depends on features of the input, such as a predictive matcher primed for the language in question (which is how humans correct for typographical errors and the like).
[3] A perverse reader might start with the assumption that it's written boustrophedon. But I think such a reader would also try a consistent row ordering, and pick the correct one based on preferential edge appearance.