I wondered what that 'Sailfish gene thingy' was about.
The world has been sequencing DNA genomes for a long time, but applying what we know about genetics to everyday medicine is a tough ask. For example, readers might remember that the business of crafting treatments from genes is so complex that IBM recently entered a partnership to get its Watson megabrain learning to help medicos craft …
Wednesday 23rd April 2014 11:53 GMT Grikath
not bad.... but..
It's only accelerating one step in understanding the entire process, which is, mildly put, complicated. And it wouldn't show the actual level of expression, nor any of the post-translational steps, nor any of the potential interactions with other proteins, nor....
( IT equivalent: the original code concept has to go through about 5 committees to get to Alpha, then another 5-10 or so to get to Beta and/or Gold. Then top-level users can still edit bits of it. In the meantime the committees themselves are suffering from input from both Users and Management. Try and predict the outcome of that....)
It may help to negatively select the data stack though. It's much easier to disqualify candidates than to positively prove the viability of a candidate. And it will make those hour-runs more efficient if you can feed them the Right Stuff.
Wednesday 23rd April 2014 20:35 GMT MondoMan
Perhaps this will help clarify the researchers' work:
DNA serves as the blueprint/instruction manual for the cell. However, the information in the DNA must be converted to other forms in order for it to perform any function, that is, to be "expressed". Thus, "gene expression" is just what function(s) are performed by an individual gene, and *how much* of that function is performed. The term is also used to refer to all the genes in the cell: all the functions being performed, and the level/amount of each function.
The cell uses many ways to regulate gene expression (to adjust the levels of different functions), most of which are difficult to measure and ascribe to a specific gene function when looking at all the functions of a cell. This paper is concerned with only the initial step of gene regulation -- "transcription". Transcription is simply the copying of modest stretches of DNA into sequence-identical working copies made of the similar molecule RNA. Essentially, it is like photocopying specific blueprint pages from the manual.

Just as with a paper manual and photocopying, one simple way to get more use out of a single blueprint page in the manual is to make many photocopies of it. One can get a rough idea of what's being done on a project just by looking at all the photocopies in use there -- many copies of a given page or group of pages suggest much work on the function related to those pages, while a complete absence of a given page suggests no work currently being done on that function.
Similarly, collecting all the RNA transcripts from a group of cells, determining the sequence of each, and matching those up with the entire DNA genomic sequence allows one to infer which gene functions are in heavy use, light use, or unused. (Of course, given the other types of gene expression regulation, this will be an important, but partial, description of the gene expression in the cells.)
The researchers' new technique comes in the "...matching those up with the entire DNA genomic sequence..." step. Current bulk transcript sequencing techniques produce sequence fragments ("reads") on the order of 500 "letters" or "bases" long (DNA and RNA each have 4 possible letters). Since the entire genomic sequence in mammalian cells is about 3G bases long, and each cell typically has two copies, each transcript read must be aligned with 6G of genomic sequence. A simplistic sliding-comparison alignment algorithm would involve on the order of 10 to 1000 x 10^9 comparisons for just *one* transcript read; a proper experiment would involve mapping 10^6 or more reads so the number of comparisons (and compute time) rises rather rapidly!
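The simplistic sliding comparison described above can be sketched in a few lines. This is a toy illustration (the function name is made up), just to show where the cost comes from:

```python
def naive_align(read, genome):
    """Slide the read along the genome one position at a time and
    record every exact match -- O(len(read) * len(genome)) work."""
    hits = []
    for i in range(len(genome) - len(read) + 1):
        if genome[i:i + len(read)] == read:
            hits.append(i)
    return hits

# A 500-base read against a 6e9-base genome means ~6e9 window
# positions, each costing up to 500 base comparisons -- per read.
```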
By contrast, my understanding of the researchers' new technique is that they use a k-tuple approach which in principle requires compute time only linearly proportional to the total length of the transcript reads plus the 6 x 10^9-base genomic sequence. The insight is that by breaking down the transcript reads into overlapping "tuples" of length k, and just counting how many times each k-tuple appears in the entire set of transcript reads, an expression "bar chart" can quickly be constructed on the genomic sequence.
For example, if one chose k=10, one would have 4^10 = 1048576 or approximately 1M possible 10-base-long tuple sequences. One would then go through all the (say, one million) transcript read sequences, producing about 500 tuples from each (500 bases average length, tuples offset by 1), or about 500M total. If the transcript read sequences were completely random, each of the 1M possible 10-tuples would have about 500 "hits". However, we know the sequences are NOT random, and in particular much of the genomic DNA will not be transcribed at all, so we expect many (most?) 10-tuples will show zero/few hits, and some more than 500.
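The counting step can be sketched as follows. This is a minimal illustration of the k-tuple idea as described above, not the actual Sailfish code, and the names are invented:

```python
from collections import Counter

def count_ktuples(reads, k):
    """Count every overlapping k-tuple across all transcript reads.
    Work is linear in the total length of the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts
```

A 500-base read yields 500 - k + 1 overlapping k-tuples (491 for k = 10), which matches the "about 500 tuples from each" figure above.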
To "read out" the results, one would just move along the genomic DNA sequence, shifting over one base at a time, and look up the transcript read hit count for the 10-tuple beginning at that position in the genomic sequence. While there might be short spurious peaks and valleys, in general, an expressed gene fragment would show up as a contiguous stretch of DNA where most positions showed roughly-equal levels of 10-tuple hit counts. The average hit count level would indicate the relative expression of that gene fragment.
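The read-out pass might look like this -- again a toy sketch, assuming the k-tuple counts have already been gathered into a dictionary; the function and variable names are invented:

```python
def expression_profile(genome, ktuple_counts, k):
    """At each genomic position, look up the transcript-read hit count
    for the k-tuple starting there -- one dictionary lookup per base."""
    return [ktuple_counts.get(genome[i:i + k], 0)
            for i in range(len(genome) - k + 1)]

# A contiguous stretch of consistently high counts marks an expressed
# gene fragment; the average level indicates its relative expression.
```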
Thursday 24th April 2014 08:24 GMT John Smith 19
It's a classic string pattern matching problem
The trouble is that most algorithms that do that prefer big alphabets, so that a mismatch lets them skip the pattern string further along the main string.
But DNA is a 4-character alphabet, unless you can assume that the pattern and main string will only align every 3 positions (i.e. on codon boundaries), giving you a 64-symbol alphabet. I'm not sure that assumption is always reliable, as it's been a while since I read up on this.
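The re-encoding described here -- trading the 4-letter base alphabet for a 64-symbol codon alphabet -- could be sketched like this. It's a toy illustration, and only makes sense if the reading-frame assumption above actually holds:

```python
def to_codons(dna, frame=0):
    """Re-encode a DNA string as a list of 3-base codons starting at
    the given reading frame, giving a 4**3 = 64-symbol alphabet."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
```

Shifting the frame by one base produces an entirely different codon sequence, which is why the alignment-every-3-positions assumption matters.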
DNA is fascinating in that some amino acids have several codons. They are therefore "robust" to transcription errors (mutations), while others are fragile and any mistake swaps in a different amino acid.
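That redundancy is easy to illustrate with a few entries from the standard genetic code (the codon assignments below are the actual standard-code values; the variable and function names are made up):

```python
# Selected entries from the standard genetic code: leucine and serine
# each have six synonymous codons, while methionine and tryptophan
# have only one -- so any point mutation in the latter two changes
# the encoded amino acid.
SYNONYMOUS_CODONS = {
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Ser": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "Met": ["ATG"],
    "Trp": ["TGG"],
}

def redundancy(amino_acid):
    """Number of distinct codons encoding the given amino acid."""
    return len(SYNONYMOUS_CODONS[amino_acid])
```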
The question is: did this code assignment develop early and never change, or has it also evolved? Likewise, does the "fragility" of certain amino acids' codons have some systematic effect on the compounds they form?
Sadly that's all above my pay grade.