The genome unveiled

In February 2001, initial results from the human genome sequencing project—the largest and most complex project ever undertaken in the field of biology—were published to an international media fanfare.

Over the years, physicists have spent trillions of dollars on telescopes, underground particle accelerators, fusion reactors and more (all vast undertakings involving many collaborating groups of scientists and hundreds of individuals) in their quest to understand the fabric of the universe. Biologists, probing the more intimate mysteries of life, have probably spent just as much (especially on medical research), but it has been on myriad small projects.

However, the task of analysing the human genome was of such magnitude that it could never be accomplished in reasonable time by even dozens of people. A consortium established in 1990 organised the effort, which has been carried out intensively in 20 centres throughout the northern hemisphere. A large private company, Celera, also started sequencing the genome in competition with the publicly funded project.

Before considering what has been learned, it may be helpful to review what the genome is. In secondary school, although the word itself was probably not used, many of us would have been taught the relevant background information: for example, that the human egg and sperm contain genes (from mother and father respectively) which carry the traits of each parent into the amalgam which becomes a unique new person. We would have learned that genes are made of a chemical called DNA (deoxyribonucleic acid), that each of us has many thousands of genes, and that all these genes, plus other non-genetic DNA, make up our genome.

Non-genetic DNA? That’s something that perhaps we missed in school—and it brings us to the first result from the Human Genome Project (HUGO). More than 98.5 per cent of our DNA is not in the form of genes, and biologists still don’t know what, if anything, it does.

Remember that a gene is a piece of DNA which contains the instructions for making a protein or a fragment of a protein—a peptide. What, then, is DNA when it isn’t a gene? As far as we know (and elbowing aside a few caveats), just a long, stringy, meaningless molecule.

An analogy may help. Letters of the alphabet can be arranged to form real words or gobbledygook, and a line of type can be either all meaningless letters, or actual words but in a meaningless jumble, or coherent sentences, or some mixture of sense and nonsense. What HUGO has found is that most human DNA is the equivalent of meaningless letters and jumbled words—junk DNA—and less than two per cent of it forms sentences, or, in the language of the cell, genes.

DNA is made up of a very simple alphabet of only four letters: A, T, G and C, each letter standing for a chemical known as a nucleotide. A cell’s DNA consists of very long unbranched strands of DNA made up of millions of As, Ts, Gs and Cs. A strand might have the nucleotide sequence AATGCCCTGCTACCT . . . and so on. This is known as the sequence of the DNA. It makes little sense written down, but what does it mean to the cell?

The language of the cell turns out to be very simple, and the basics have been understood since the 1960s. All valid words are only three letters long. No gaps or “commas” mark the ends of words. Each valid word (called a codon) specifies one of the 20 amino acids from which cells make proteins. In effect, a gene is the equivalent of a sentence, one that consists solely of three-letter words, where the order of the words corresponds to and specifies the order of amino acids in a protein.
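To make the three-letter reading concrete, here is a minimal sketch in Python. It assumes the standard genetic code written as DNA codons and uses the usual one-letter amino acid abbreviations; the function name `translate` and the `offset` parameter are illustrative choices, not anything from the genome project itself. It reads the example strand quoted above from each of its three possible starting positions.

```python
from itertools import product

# Standard genetic code as DNA codons, listed in T, C, A, G order.
# "*" marks the three stop codons, which specify no amino acid.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, offset=0):
    """Read dna three letters at a time, starting at offset.

    Returns one-letter amino acid codes; leftover letters at the end
    that do not fill a whole codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(offset, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(protein)


example = "AATGCCCTGCTACCT"  # the strand quoted in the article
for offset in range(3):
    print(offset, translate(example, offset))
```

Because nothing marks where one word ends and the next begins, the same letters spell three different messages depending on where reading starts. Started one letter in, this strand reads ATG CCC TGC TAC, which begins with the codon for methionine.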

Unlike in English, DNA sentences may be very long, because proteins contain anywhere from a couple of dozen amino acids up to several thousand (the average length of a human protein is about 460 amino acids). As in written language, there are synonyms: some amino acids are specified by several different codons. All sentences, and this is important, must start with the codon which specifies the amino acid methionine (equivalent to a capital letter), and there must also be a “stop” codon at the end of each gene. The uninterrupted run of codons between the two is termed an open reading frame.
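The start and stop rules suggest a simple way to hunt for candidate gene sentences in raw sequence. The sketch below is an illustration under stated assumptions, not the method used by the sequencing centres: it scans only the strand it is given, in each of the three reading frames, and reports every stretch that begins with ATG (methionine) and runs to the first stop codon. The function name `find_orfs`, the `min_codons` parameter and the test strand are all hypothetical.

```python
START_CODON = "ATG"                  # specifies methionine
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three stop codons


def find_orfs(dna, min_codons=2):
    """Return (start, end) positions of start-to-stop stretches.

    Positions are 0-based, end is exclusive and includes the stop codon.
    Only the given strand is scanned; a fuller search would also scan
    the complementary strand.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a new candidate sentence
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # sentence closed; look for the next start
    return orfs


strand = "AAATGCCCTGCTACTGAGG"  # made-up strand for illustration
for start, end in find_orfs(strand):
    print(start, end, strand[start:end])
```

In a genome where under two per cent of the letters form genes, a scan like this is only a first filter: short start-to-stop runs turn up by chance in meaningless sequence, which is why a minimum length is usually imposed.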

Why this emphasis on proteins? Because although DNA holds the blueprints, protein is the real stuff of life. There are other biochemical components, certainly (bone, fats, sugars), but these are all organised through the activities of proteins. Proteins are also major structural elements in our bodies in their own right. Our skin and hair are largely protein; our eyes, nerves, hearts, digestive enzymes and hormones likewise. Just about everything that goes wrong in our bodies can be traced to a protein. In congenital conditions, medical researchers often find an error in a gene sequence that changes a codon, producing a wrong amino acid in a protein. That protein may then misbehave, and we may suffer as a result.

This takes us back to HUGO. One of the long-term objectives of HUGO is to uncover mutations in our genes that give rise to medical problems. Before you can do this, it is necessary to determine the normal sequence of Ts, As, Gs and Cs in human DNA. This is what sequencing the genome is all about.

DNA is typically organised within cells into large but manageable blocks termed chromosomes, of which each human cell contains 23 pairs, totalling 3.2 billion nucleotides: an incomprehensibly enormous length of DNA to decipher. Until the 1980s, it was felt that DNA was such a large molecule (each chromosome is one DNA molecule), made from constituents that were chemically so similar, that it would be quite impossible to analyse. A few people tried to determine the sequence of amino acids in some proteins, but this also proved very difficult.

In the late 1970s and early 1980s, however, ingenious new analytical techniques were introduced, along with clever ways to manipulate and multiply DNA, so that suddenly it became easier to sequence DNA than analyse protein. HUGO itself provided the impetus for the development of improved, automated DNA-sequencing methods which were inconceivable when the project was initiated. Even in 1990, people spoke in terms of decades to get the job done.

Although crossing all the Ts (so to speak) will probably take another two years, and analysing all the data much longer than that, the huge sequencing job is 90 per cent complete. Some, probably many, small gaps remain (especially in the junk DNA), but we know the shape of what we are dealing with. In point of fact, most (and probably all) of the genome has now been sequenced four times over, but before the project is considered finished, it will probably be sequenced a further four times.

So what has been learned, apart from the fact that almost 99 per cent of our DNA is apparently worthless? Before departing that point, it is intriguing to note that human DNA does seem to be significantly less meaningful than the DNA of the handful of other organisms sequenced to date. We have something like 7 to 15 genes per million nucleotides, yeast has 483, and species of worm, plant and fruitfly have 197, 221 and 117 respectively. From these meagre data, there seems to be a trend towards more non-coding DNA with increasing biological complexity.
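As a rough check on these densities, the short Python calculation below multiplies the human figure by the 3.2 billion nucleotides quoted earlier, and compares the other organisms against it. It uses only numbers given in this article; the variable and function names are illustrative, and the result is back-of-envelope arithmetic rather than a gene count.

```python
GENOME_NUCLEOTIDES = 3_200_000_000  # human genome size quoted in the article

# Genes per million nucleotides, as quoted in the article.
HUMAN_DENSITY = (7, 15)  # low and high ends of the estimate
OTHER_DENSITY = {"yeast": 483, "worm": 197, "plant": 221, "fruitfly": 117}


def implied_gene_count(genes_per_million, genome_size=GENOME_NUCLEOTIDES):
    """Scale a per-million density up to a whole genome."""
    return genes_per_million * genome_size / 1_000_000


low, high = (implied_gene_count(d) for d in HUMAN_DENSITY)
print(f"implied human genes: {low:,.0f} to {high:,.0f}")

for name, density in OTHER_DENSITY.items():
    # Compare against the dense and sparse ends of the human estimate.
    least = density / HUMAN_DENSITY[1]
    most = density / HUMAN_DENSITY[0]
    print(f"{name}: {least:.0f} to {most:.0f} times the human gene density")
```

The implied range of roughly 22,000 to 48,000 genes brackets the gene totals discussed next, so the density figure and the gene count are at least consistent with each other.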

Yet the human genome seems to be less complicated than conceit might have predicted. For the past 10 years the number of human genes has been estimated at 100,000. But so far only about 26,000 have been discovered, and the final total is likely to be around 30,000. At 26,000, we are on a par with Arabidopsis, a weedy plant, and it is unlikely that we will finally end up with more than twice the number of genes that a fruitfly has, or 50 per cent more than the much-studied tiny nematode worm Caenorhabditis elegans.

There may be some consolation in the fact that humans seem to combine pieces of proteins, like building blocks, to produce more final proteins from our repertoire of genes than lower organisms can. Perhaps we could get as many as 60,000 to 70,000 proteins.

Human genes do not appear to show a great deal of originality. Computers can easily compare gene sequences, uncovering similarities between segments of DNA from different species and phyla. Most genes have been found to belong to “families” that share sequence similarities. Yet a mere 94 gene families of the 1278 recognisable in human DNA are confined to vertebrates. Put differently, the great majority of human genes are recognisably related to the ancient genes of invertebrates, fungi and bacteria.

Even our junk DNA seems to be lacking in quality. More than half of it, and perhaps much more, consists of repeated sequences, many of them short. Most of these repeats derive from what are termed transposable elements, stretches of DNA that act almost parasitically, making multiple copies of part or all of themselves, inserting the new DNA nearby or elsewhere in the genome. Genomes of other organisms sequenced to date contain much less in the way of repeats.

Repeats make the job of sequencing appreciably more difficult. Assembling a sequence is akin to trying to navigate your way through a city where most blocks look identical.
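The difficulty can be seen in a toy example. Sequencing produces short fragments that must be fitted back into place, and the Python sketch below simply asks where a fragment could belong. The miniature "genome", the fragments and the function name `placements` are all made up for illustration, and real assembly software works quite differently; the point is only that a fragment drawn from a repeat fits in more than one place.

```python
def placements(genome, fragment):
    """Return every 0-based position where fragment matches genome exactly."""
    width = len(fragment)
    return [
        i
        for i in range(len(genome) - width + 1)
        if genome[i:i + width] == fragment
    ]


# A made-up stretch: unique flanks around three copies of the repeat ACGT.
genome = "GGATC" + "ACGT" * 3 + "TTCAG"

for fragment in ("GGATCACG", "ACGTACGT", "CGTTTCAG"):
    spots = placements(genome, fragment)
    verdict = "unique" if len(spots) == 1 else "ambiguous"
    print(fragment, spots, verdict)
```

The two fragments that reach into a unique flank can be placed with confidence, while the one lying wholly inside the repeat matches at two positions. With more than half the genome made of repeats, that ambiguity arises constantly.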

Nobel laureate David Baltimore has commented: “In humans, virtually all of the parasitic [junk] DNA repeats seem old and enfeebled, with little evidence of continuing reinsertions. However, there has been very little evolutionary scouring of these repeats from the human genome, making it a rich record of evolutionary history. The mouse genome, by contrast, has many actively reinserting parasitic sequences and is scoured more intensely, making it a much younger and more dynamic genome. This difference might reflect the shorter generation time of mice or something about their physiology, but I find it an intriguingly enigmatic observation.”

Repeated junk DNA is especially prominent near the ends of chromosomes and about the centromeres, those constrictions where pairs of chromosomes are joined just before cell division.

Human genes also contain much more extensive introns than genes from other genomes. Introns are stretches of junk DNA actually within a gene sentence. The cell ignores them in making proteins.
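What "ignoring" an intron amounts to can be sketched in a few lines of Python. This is a simplification: the cell actually removes introns from the RNA copy of a gene rather than from the DNA, and it finds their boundaries itself, whereas here the boundaries are simply supplied. The sequences, the coordinates and the function name `splice` are hypothetical.

```python
def splice(gene, introns):
    """Return gene with the given intron spans removed.

    introns is a list of (start, end) positions, 0-based with end
    exclusive, assumed not to overlap.
    """
    pieces = []
    position = 0
    for start, end in sorted(introns):
        pieces.append(gene[position:start])  # keep the coding stretch
        position = end                       # skip over the intron
    pieces.append(gene[position:])           # keep whatever follows the last intron
    return "".join(pieces)


# Made-up gene: two coding stretches interrupted by one intron.
coding_1 = "ATGCCC"
intron = "GTAAGTTTTCAG"
coding_2 = "TGCTACTGA"
gene = coding_1 + intron + coding_2

intron_span = (len(coding_1), len(coding_1) + len(intron))
print(splice(gene, [intron_span]))
```

Only once the intron is removed does the sentence read cleanly in three-letter words from its start codon to its stop codon, which is one reason genes interrupted by long introns are harder to recognise in raw sequence.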

Although the great bulk of human DNA does not code for proteins, genes are not evenly distributed throughout this sea of nonsense, but seem to be clustered into compact areas of our chromosomes. For example, production of mRNA (a precursor to protein) can be 20- to 200-fold higher in these active regions of chromosomes compared with average regions. Strangely, chromosomes 4, 13, 18 and 21 show few if any of these gene-rich regions (see diagram). All this is rather unexpected, since yeast shows no such clustering.

While many of our genes have evolved from bacterial genes, several hundred seem to have come directly from bacteria. These genes have no equivalents in flies, worms and the like. It is presumed that bacterial DNA must have become admixed with ours during periods of infection, and some genes were then incorporated into our DNA: a kind of natural genetic engineering, making us naturally transgenic!

At first glance, human DNA contains nothing that obviously accounts for human complexity. There are no large, distinctively human, information-rich tracts that might encode consciousness, musical appreciation, spirituality, morality, intelligence, language or any other human traits. Although the paradigm that everything about us derives from our genes has prevailed for the past 40 years, it is ironic that much that makes us human is not immediately apparent from our DNA.
