9 Module 7: Genomic Analyses

9.1 Module 7.1 – Sexual Reproduction in Ascomycota Genomic Analysis

9.1.1 Introduction

Fungal sexual reproduction is complicated. Unlike animals, where usually only two sexes are observed, some fungi can have hundreds to thousands of different mating types (Casselton & Olesnicky, 1998)! Beyond granting the ability to mix genetic material, sexual reproduction has additional importance in fungi since reproductive structures are often hardier and capable of overwintering, meaning their presence can seriously complicate disease management practices. Undergoing sexual reproduction also provides an important cleaning mechanism for some fungi, enabling them to clean up their genomes by inactivating transposons using processes like Repeat-Induced Point mutation, or RIP (Cambareri et al., 1991). In this exercise, we will learn how to determine mating types in Ascomycota using MycoCosm. Compared to most other fungi, Ascomycetes have a relatively simple mating system, where only two types are observed. Here, we will focus on the genus Neurospora where the mating types are simply named “A” and “a” (Metzenberg & Glass, 1990).

What we know for this exercise: Several mating type-related genes have already been identified. These control the initiation of sexual reproduction. They are mostly transcription factors, which you can think of as highly specialized workers: each has a unique ‘skillset’ (genes) that they control (turn on or off) to perform some task (in this case, regulate reproduction-related machinery).

How are mating types determined in fungi? Are the same genes found in both mating types, but regulated differently? Or, are there genes unique to each mating type that are responsible for this? Using comparative genomic approaches, let’s try and find out!

Beginning with known mating type-related genes collected from NCBI, we will use MycoCosm to branch out into related fungi, identify opposite mates, then formulate hypotheses and learn about the genetics underlying fungal sexual regulation. This approach can be directly applied to predict mating types in other fungi, or used to explore presence/absence of other types of genes!

9.1.2 References

Casselton, L.A., Olesnicky, N.S., 1998. Molecular Genetics of Mating Recognition in Basidiomycete Fungi. Microbiol Mol Biol Rev, 62(1): 55–70. doi: 10.1128/mmbr.62.1.55-70.1998
Cambareri, E.B., Singer, M.J., Selker, E.U., 1991. Recurrence of repeat-induced point mutation (RIP) in Neurospora crassa. Genetics, 127(4): 699–710. doi: 10.1093/genetics/127.4.699
Metzenberg, R.L., Glass, N.L., 1990. Mating type and mating strategies in Neurospora. Bioessays, 12(2): 53–59. doi: 10.1002/bies.950120202

9.1.3 Finding mat A

9.1.3.1 Locate mat A locus in the Neurospora crassa OR74A genome

To guide our exploration, let’s create a testable hypothesis. To keep it informative and straightforward, let’s hypothesize that mating type is determined by the presence/absence of specific genes. Given what we know about mating in Neurospora, a good starting point would be the mat A locus. It is composed of 3 genes: mat A-1, mat A-2, and mat A-3.

Start by visiting NCBI (https://www.ncbi.nlm.nih.gov/) and collecting a reference sequence. Go to the nucleotide database and search for “Neurospora crassa OR74A mating-type protein A-1 (matA-1), mRNA”.
Find the matching record, “Neurospora crassa OR74A mating-type protein A-1 (matA-1), mRNA” and collect the corresponding nucleotide sequence by clicking the ‘FASTA’ link (see below), then select and copy the sequence.
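If you want to double-check what you copied before pasting it into BLAST, it helps to know that a FASTA record is just a header line beginning with ‘>’ followed by one or more lines of sequence. The Python sketch below shows one way to read such text. The record in it is a made-up placeholder (not the real matA-1 sequence), and the function name parse_fasta is our own choice for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical placeholder record, NOT the real matA-1 mRNA sequence.
example = """>example_record placeholder description
ATGTCGGGTGTCGATCAA
ATCGTCAAGACGTTC
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

A quick length check like this can reveal an incomplete copy (for example, a missing final line) before you run the search.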
Let’s find the location of this gene in the genome! Go to the Neucr2 MycoCosm portal (https://mycocosm.jgi.doe.gov/Neucr2) and search for it using the BLAST tool.
Click the BLAST tab, paste your sequence into the query sequence box. Using blastn, select the filtered model transcripts set and click the ‘Run blastn’ button.
You should see two hits, each shown in a separate row.
- The first column shows the portal ID and transcript ID (since we are searching against nucleotides) of the hit. Portal IDs are unique identifiers assigned to each genome. For example, Neucr2 is the portal ID for Neurospora crassa OR74A, and it can be used to access the portal (e.g. https://mycocosm.jgi.doe.gov/Neucr2). This unique identifier is also displayed across various portal pages to label data from a particular organism.
- The second column shows an alignment of that sequence against your query, which is helpful for seeing how well hits cover your query sequence.
- The score and EValue columns are also useful when assessing hit quality. Higher scores are better, and conversely, lower e-values are better.
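The reasoning behind these columns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit identifiers, coordinates, scores, e-values, and query length below are invented, and the Hit class is our own structure, not something MycoCosm provides.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    hit_id: str        # hypothetical label: portal ID plus a transcript name
    query_start: int   # first aligned query position (1-based)
    query_end: int     # last aligned query position (inclusive)
    score: float       # higher is better
    evalue: float      # lower is better


def query_coverage(hit, query_length):
    """Fraction of the query covered by the aligned region of this hit."""
    return (hit.query_end - hit.query_start + 1) / query_length


def rank_hits(hits):
    """Sort hits best-first: highest score, then lowest e-value."""
    return sorted(hits, key=lambda h: (-h.score, h.evalue))


QUERY_LENGTH = 1000  # hypothetical query length in nucleotides

# Invented example values, for illustration only.
hits = [
    Hit("Neucr2|transcript_B", 400, 650, 180.0, 1e-45),
    Hit("Neucr2|transcript_A", 1, 980, 1750.0, 0.0),
]

best = rank_hits(hits)[0]
print(best.hit_id, query_coverage(best, QUERY_LENGTH))
```

Ranking by score first and e-value second mirrors the rule of thumb above; coverage is reported separately because a high-scoring hit can still cover only part of the query.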
The first hit looks better. You can tell because the hit covers more of our query and the score is higher. Let’s explore the protein page for the best hit and look for evidence that this is actually mat A-1. Click on the identifier in the first column to access this page.
Let’s explore the genome browser by clicking on the genome coordinates at the top of the page, in the ‘Location’ section (Above image, red arrow).
After loading is complete (this might take a little while, lots of data are on this browser!), click on the -10x in the top middle portion of the page to get a bigger view (see image below).
The genome browser will show you information we have available for a particular stretch of DNA sequence in your assembly. It contains details on whether we observed the same (or similar) stretch of DNA in a related organism (Vista tracks), predicted genes in that region/locus, and any other information we have. Other information can include messenger RNA expression (mRNA) levels, locations of blast hits for proteins from other fungi, and more! All of these other tracks can provide valuable information to understand fungal biology as well as evidence for validating gene predictions.
You just found the mat A locus in N. crassa! In the position box in the upper left corner, you can see your location: Supercontig_1:1853928-1868957. Remember, if you are lost, every MycoCosm page has a HELP! tab you can access from the menu bar.
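The position string in the browser follows a scaffold:start-end pattern. As a small sketch, the Python below splits such a string and reports how many bases the window spans. It assumes both coordinates are inclusive, which is our assumption rather than something stated on the page, and the function names are our own.

```python
def parse_location(location):
    """Split 'scaffold:start-end' into (scaffold, start, end)."""
    scaffold, _, span = location.rpartition(":")
    start_text, _, end_text = span.partition("-")
    return scaffold, int(start_text), int(end_text)


def span_length(start, end):
    """Number of bases in the window, assuming inclusive coordinates."""
    return end - start + 1


scaffold, start, end = parse_location("Supercontig_1:1853928-1868957")
print(scaffold, start, end, span_length(start, end))
```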
Now that we are at the mating type locus, let’s look at the genes. In this case, genes were predicted at the Broad Institute instead of the JGI. Click on the small ‘+’ to the left of the ‘BroadModels’ track to view them.
Note that some regions have 2 overlapping genes. These are called ‘isoforms’, which are variants of the same gene where RNA can be spliced in different ways to produce alternative transcripts and proteins (these are rare in fungi).
In this region of the genome, you can see 5 genes (2 with isoforms). To learn more about what these genes might do, we can look at their protein pages. You can get to the protein page by clicking on the gene, then selecting ‘feature’ web page (or center mouse click on the gene to open it in a new tab). For convenience, here are links to the protein pages for all 5 genes in this region: NCU01961T0, NCU01960T0, NCU01959T0, NCU01958T0, NCU01957T0.
Using the information on the protein page, can you find mat A-2 and mat A-3?
At the mating type locus, do you notice anything interesting when exploring the read depth coverage plots (pink tracks) from related Neurospora genomes?
Based on this information, let’s revisit our hypothesis about how fungal mating type is determined. Given that many related Neurospora crassa genomes are completely missing this region, this appears to support our hypothesis that fungal mating type is determined by gene presence/absence, and specifically suggests that this locus might be what we are looking for! This would mean that the fungi without this locus are of the opposite mating type. Let’s continue to test this hypothesis and see if it holds up to further scrutiny!
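The presence/absence reasoning applied to the coverage plots can be written down as a simple rule: if reads from a strain barely map to the locus, call the locus absent in that strain. The Python sketch below illustrates this. The strain names, depth values, and the cutoff of 5 are all invented for illustration and are not taken from the browser tracks.

```python
def mean_depth(depths):
    """Average read depth across the sampled positions of a locus."""
    return sum(depths) / len(depths) if depths else 0.0


def locus_present(depths, min_mean_depth=5.0):
    """Call the locus present when mean read depth reaches the cutoff.

    The cutoff of 5.0 is an arbitrary, hypothetical choice.
    """
    return mean_depth(depths) >= min_mean_depth


# Hypothetical read depths sampled across the mat A locus for three strains.
coverage = {
    "strain_1": [32, 30, 35, 31],
    "strain_2": [0, 0, 1, 0],
    "strain_3": [27, 29, 30, 26],
}

calls = {name: locus_present(depths) for name, depths in coverage.items()}
for name, present in calls.items():
    print(name, "present" if present else "absent")
```

Under our hypothesis, strains called ‘absent’ here would be candidates for the opposite mating type.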

9.1.3.3 Finding mat a

Let’s repeat this exercise with the mat a locus to confirm our suspicions, but instead of using nucleotide sequence, let’s try another approach: using protein sequence.

Start by searching the NCBI protein database for “mating-type protein Mat a-1, partial Neurospora discreta” and collect the protein sequence.
Let’s find it on the mat a genome portal you found earlier for N. tetrasperma strain FGSC 2509 (https://mycocosm.jgi.doe.gov/Neute_mat_a1).
Can you find the corresponding protein using blastp against the ‘Neurospora tetrasperma mat a v1.0 filtered gene models (proteins)’ set?
The best hit is to N. tetrasperma protein ID 124723. Can you explain why?
Let’s revisit the Neucr2 mcl clustering page and search for this, again by typing portalID.proteinID (e.g. Neute_mat_a1.124723) in the search field.
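The portalID.proteinID format combines the two identifiers you have been using throughout this exercise. As a brief sketch, the Python below splits such an identifier and builds the portal address from the portal ID, following the URL pattern used in this module. It assumes protein IDs are purely numeric, as in the examples here, and the function names are our own.

```python
def split_identifier(identifier):
    """Split 'portalID.proteinID' into (portal_id, protein_id).

    Splits on the LAST dot and assumes the protein ID is numeric.
    """
    portal_id, _, protein_text = identifier.rpartition(".")
    if not portal_id or not protein_text.isdigit():
        raise ValueError("expected portalID.proteinID, got: " + identifier)
    return portal_id, int(protein_text)


def portal_url(portal_id):
    """Build the portal address from a portal ID."""
    return "https://mycocosm.jgi.doe.gov/" + portal_id


portal_id, protein_id = split_identifier("Neute_mat_a1.124723")
print(portal_id, protein_id, portal_url(portal_id))
```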
Do you notice anything different about this cluster compared to the mat A cluster? Explore this family to learn more: click on the cluster number (#771) and take a look at the synteny view again.
It looks like our mat a gene is found in a region that is not present in any of the others, which are all the opposite mating type (mat A)!
With this information, we can say that all our current evidence supports our hypothesis that fungal mating types are driven by presence/absence of specific genes, those at the mat a and mat A loci in particular!

9.2 Module 7.2 – Fungal Natural Product Production Genomic Analysis

9.2.1 Introduction

Fungi are famous producers of natural products that become new drugs, including anticancer agents (Kumar et al., 2019), cholesterol-reducing products (Seenivasan et al., 2008), and a wide variety of antibiotics. One of the most famous examples is penicillin, the first antibiotic. Alexander Fleming discovered it in 1928 when a fungus contaminated his bacterial cultures. His observation that the fungus (Penicillium rubrum; reclassified later as P. chrysogenum) could kill bacteria (Fleming, 1929) led to the development of penicillin, saving countless lives during World War II and beyond. Like most of these interesting fungal products, penicillin is a secondary metabolite. Secondary metabolites are typically produced by organisms to provide a competitive advantage in a given environment (e.g., antibacterials for protection, toxins for causing disease, etc.).

Core to the production of any secondary metabolite is a ‘backbone’ gene. These backbone genes can be very strange, such as Non-Ribosomal Peptide Synthetases (NRPSs), which are specialized machines that enable the formation of non-standard products. These products can even include amino acids outside of the standard 20 essential for life! NRPSs are large proteins that contain repeated modules, each including an Adenylation (A), Peptidyl Carrier Protein (PCP), and Condensation (C) domain (Strieker et al., 2010). Combined, these allow the NRPS to add amino acids to a growing chain, producing unique molecules. These molecules are then tweaked by decorating enzymes (Keller, 2019) that modify whatever the backbone gene produces, allowing fungi to produce a multitude of bizarre molecules that can wreak havoc in other cells! Decorating genes are physically located next to backbone genes, allowing us to define Biosynthetic Gene Clusters (BGCs) responsible for the production of specific metabolites.

In this exercise, let’s follow up on Alexander Fleming’s discovery and find the BGC that controls penicillin production. We will then use comparative genomic approaches to discover other fungi that are likely producers of penicillin and variants thereof (which could be candidates for creating exciting new antibiotics), then apply our knowledge to characterize new, unexplored BGCs!

What we know for this exercise: From Fleming’s work, we know penicillin is produced by Penicillium chrysogenum. An NRPS is responsible for penicillin production and in P. chrysogenum the penicillin BGC contains three genes: pcbAB, pcbC, and penDE. Which of these is the backbone gene? Can we find it in related Penicillium genomes? Can you identify any fungi that might produce novel penicillin variants? Using comparative genomics approaches, let’s begin by identifying the penicillin BGC in P. chrysogenum, then formulate hypotheses and learn about its distribution across fungi. The same approach can be used to characterize diversity and distribution of other BGCs, even ones where we don’t know the metabolite they produce yet!

9.2.2 References

Kumar, P., Singh, B., Thakur, V., Thakur, A., Thakur, N., Pandey, D., Chanda, D., 2019. Hyper-production of taxol from Aspergillus fumigatus, an endophytic fungus isolated from Taxus sp. of the Northern Himalayan region. Biotechnol Rep (Amst). doi: 10.1016/j.btre.2019.e00395
Seenivasan, A., Subhagar, S., Aravindan, R., Viruthagiri, T., 2008. Microbial Production and Biomedical Applications of Lovastatin. Indian J Pharm Sci. doi: 10.4103/0250-474X.49087
Fleming, A., 1929. On the Antibacterial Action of Cultures of a Penicillium, with Special Reference to their Use in the Isolation of B. influenzæ. Br J Exp Pathol, 10(3): 226–236
Strieker, M., Tanovic, A., Marahiel, M.A., 2010. Nonribosomal peptide synthetases: structures and dynamics. Current Opinion in Structural Biology, 20: 234–240
Keller, N., 2019. Fungal secondary metabolism: regulation, function and drug discovery. Nature Reviews Microbiology, 17(3): 167–180. doi: 10.1038/s41579-018-0121-1

9.2.3 Finding the Penicillin Biosynthetic Gene Cluster

9.2.3.1 Locate the known cluster in P. chrysogenum

To start, let’s retrieve a sequence from NCBI that can help us locate the Penicillin BGC. We know there are 3 genes in this cluster: pcbAB, pcbC, and penDE. Go to NCBI (https://www.ncbi.nlm.nih.gov/) and search the ‘Nucleotide’ database for pcbAB.
Select the first record, titled “Penicillium chrysogenum PcbC (pcbC) gene, partial cds; pcbC-pcbAB intergenic spacer, complete sequence; and PcbAB (pcbAB) gene, partial cds”. This corresponds mostly to the DNA in between pcbC and pcbAB, which will be a great tool for identifying very similar regions in other genomes!
Click the ‘FASTA’ button to see the nucleotide sequence, then copy the sequence.
Now, let’s find this region in the Penicillium chrysogenum genome in MycoCosm using BLAST. Here, we can browse the genome and conduct comparative analyses across related fungi. Go to the Penicillium chrysogenum genome portal (https://mycocosm.jgi.doe.gov/Pench1) and click on the BLAST tab.
Paste in your nucleotide sequence and conduct a blastn analysis against the assembly. Remember, we are using the sequence that mostly falls between two genes here to locate our region of interest, so we need to search the assembly instead of the available gene model sets.
You should see a single hit! Click on the coordinates to view this region of the genome.
Of course, we are searching for DNA that is mostly found in between these penicillin-related genes. Click -10x to get a bigger view of this locus.
Excellent! You just found the Penicillin Biosynthetic Gene Cluster (scaffold_1:6815459-6827128)! You will notice that there is no nucleotide conservation with the genomes presented in the Vista tracks at the top of the page. Let’s expand the GeneCatalog by clicking on the ‘+’ next to the track and look at a few of these genes.
In this region of the genome, you can see 4 genes. To learn more about what these might do, we can look at their protein pages. You can get to the protein page by clicking on the gene, then selecting ‘feature’ web page (or center mouse click on the gene to open it in a new tab). For convenience, here are links to the protein pages for all 4 genes in this region, named by their protein IDs (which are the main identifiers used within MycoCosm): 67014, 32451, 67324, 22458
As the backbone gene is central to the production of a specific secondary metabolite, let’s use it to discover other fungi that might make similar products! On the 22458 protein page, click on the green bar underneath the domain information section (see figure immediately above). Copy this sequence to your clipboard; we will use it in the next section.

9.2.3.2 Find candidate Penicillin BGCs in related fungi

With our NRPS bait in hand, let’s try to fish out similar genes in related fungi! But before we do, do you have any hypotheses about how often it will be found?
On MycoCosm, you can view individual genomes, or analyze groups of them simultaneously. There are several group page categories, including:
- Project groups, which contain genomes associated with specific research projects. For example, our own MycoEd project group!
- EcoGroups, which contain genomes of organisms with shared ecology, e.g. plant pathogenic fungi.
- PhyloGroups, which contain genomes of taxonomically related organisms.
You can access groups either using the Groups pulldown menu on the MycoCosm homepage, or if you know the group of interest you can type it into the url. For this exercise, let’s visit the Penicillium phylogroup page.
You can see we have over 70 Penicillium genomes available for comparative analysis! Genomes are colored differently based on their status: red = private (only available to you and your colleagues), white = publicly available, green = published.
With so many genomes available, we have a great dataset with which we can test our hypothesis that all Penicillium produce penicillin! There are several routes we can take to explore this, but for this exercise let’s again use BLAST. On group pages, this tool allows us to search all these genomes simultaneously. Click on the BLAST tab to get started.
Paste the protein sequence you copied from the P. chrysogenum Penicillin NRPS into the query sequence box. Select the blastp alignment program and the Gene Catalog protein database. Confirm that your search criteria are correct, then click ‘Run blastp’. Note that this is a large protein and we are searching a lot of genomes, so this might take a few minutes.
After a few minutes, you should see results. We get a lot of hits (1,929 of them!). The top hit should almost always be to the original backbone gene you selected. To assess quality take a look at the alignment, which shows how well the hits cover your query sequence. Also take a look at the ‘Score’. The higher the score, the better.
Can you find where hit quality begins to drop off? You can see by just looking at the top 10 that quality drops off really fast!
Where to define cutoffs is somewhat arbitrary and depends on your research question. For this exercise, let’s keep it simple and set this cutoff by calculating a Blast Score Ratio (BSR; Rasko et al., 2005). By dividing a given score by the score of the query against itself, we can get a feel for how good a match is compared to the best possible match.
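The BSR calculation described above is a single division, so it is easy to check by hand or with a few lines of Python. In the sketch below, the organism labels and scores are invented for illustration, and the function names are our own. The 0.9 cutoff is the threshold this exercise uses later when you characterize your own clusters.

```python
def blast_score_ratio(hit_score, self_score):
    """Divide a hit's score by the score of the query against itself."""
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return hit_score / self_score


def filter_by_bsr(hits, self_score, threshold=0.9):
    """Keep (name, score) pairs whose BSR is at or above the threshold."""
    return [
        (name, score)
        for name, score in hits
        if blast_score_ratio(score, self_score) >= threshold
    ]


# Hypothetical scores; the first row stands in for the query hitting itself.
SELF_SCORE = 7000.0
hits = [
    ("query_itself", 7000.0),
    ("species_B", 6650.0),
    ("species_C", 2100.0),
]

for name, score in hits:
    print(name, round(blast_score_ratio(score, SELF_SCORE), 2))

good_hits = filter_by_bsr(hits, SELF_SCORE)
print([name for name, _ in good_hits])
```

A BSR of 1.0 means the hit scores as well as the query against itself, while values near 0 indicate only a weak or partial match.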
With this new information, let’s revisit our hypothesis. Given that we see only 7 really good matches to the penicillin NRPS out of the ~70 available genomes, it seems that our first hypothesis was incorrect, meaning not all Penicillium species produce penicillin.
Let’s revise it! Maybe the Penicillium genus is too diverse and only a small subset of closely related fungi can produce this product! Perhaps a compelling new hypothesis would be that all penicillin producers are closely related evolutionarily.
Record the names in the organism column from our BLAST output for the top 7 hits. We can use the precomputed phylogenetic tree from the ‘TREE’ tab to continue our exploration and test whether they are closely related or not.
Open the ‘TREE’ tab, either in a separate tab or a new window so you can easily see both your blast results and the phylogeny.
Here, we will see a pre-computed phylogenetic tree showing evolutionary relationships within this PhyloGroup. The node colors on the tree indicate support values, green: 1.0, orange: ≥ 0.7, red: < 0.7, white: no support. This phylogenetic tree is generated based on amino acid sequence alignments from hundreds to thousands of proteins across these fungi. If a particular node is well supported, it means that a lot of the underlying sequence alignment agrees with the relationships shown in that part of the tree. In contrast, if support is low, we should be suspicious of the relationships depicted there.
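The color key for node support can be summarized as a small lookup. The Python sketch below simply restates the legend given above; the function name is our own, and using None to represent a node with no support value is our assumption.

```python
def support_color(support):
    """Map a node support value to the color used in the tree legend.

    None stands for a node with no support value (our own convention).
    """
    if support is None:
        return "white"
    if support >= 1.0:
        return "green"
    if support >= 0.7:
        return "orange"
    return "red"


for value in (1.0, 0.85, 0.7, 0.4, None):
    print(value, support_color(value))
```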
At the top of the page is a search box where we can type in some of our organisms of interest to see where they are on the tree. Try typing in Penicillium chrysogenum. Hit enter (or the ‘Filter’ button) and see what happens!
You will see that the tree compresses and highlights the lineages that match your search, in this case, strains of P. chrysogenum. Everything else will be grayed out. Click ‘Expand All’ and you can see where they are in the larger Penicillium tree. Click ‘Clear’ to restore everything to default colors.
Looking over this tree, can you find the 7 fungi from our BLAST results that contain our intact Penicillin NRPS? Are they found in the same part of the tree, or all over the place?
We can point to 3 clades where this NRPS is found, with most being present in Clade 2 (see above image). However, in all cases there are very close relatives which lack it. Overall, this suggests that the ability to produce penicillin is pretty variable across the Penicillium genus. While phylogeny can provide some clues, it alone cannot be used to predict the ability to produce penicillin.
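The question of whether a set of fungi sits together in one part of the tree can be phrased precisely: find the smallest clade that contains all of them, then check whether that clade also contains anything else. The Python sketch below does this on a toy tree written as nested tuples. The species labels and tree shape are invented, and this is a simplified illustration, not how the MycoCosm tree viewer works.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    names = set()
    for child in tree:
        names |= leaves(child)
    return names


def smallest_clade(tree, targets):
    """Return the smallest subtree whose leaves include every target."""
    if isinstance(tree, str):
        return tree
    for child in tree:
        if targets <= leaves(child):
            return smallest_clade(child, targets)
    return tree


def is_monophyletic(tree, targets):
    """True when the targets make up an entire clade with no other members."""
    targets = set(targets)
    return leaves(smallest_clade(tree, targets)) == targets


# Hypothetical toy tree: ((sp_A, sp_B), sp_C) is sister to (sp_D, sp_E).
toy_tree = ((("sp_A", "sp_B"), "sp_C"), ("sp_D", "sp_E"))

print(is_monophyletic(toy_tree, {"sp_A", "sp_B"}))  # the pair forms its own clade
print(is_monophyletic(toy_tree, {"sp_A", "sp_C"}))  # sp_B sits inside their clade
```

In the second call, sp_B plays the role of a very close relative that lacks the gene, which is the pattern described above for the penicillin NRPS.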
This tells us that our hypothesis that penicillin producers are all closely related evolutionarily is incorrect (again!) - and that is OK! Based on our current results, we can conclude that penicillin production is both dispensable (as clearly not all Penicillium can make it) and not phylogenetically driven.
This knowledge raises a ton of new hypotheses, such as: i) The Penicillin BGC arose in these fungi due to horizontal gene transfer, ii) The Penicillin BGC is associated with specific ecological niches, iii) The Penicillin BGC is restricted to specific geographic locations, iv) etc… Using MycoCosm, NCBI and other online bioinformatics tools, feel free to explore some of these ideas!

9.2.4 Characterize Myco-Ed BGCs

9.2.4.1 Find interesting BGCs using Secondary Metabolite (SM) clusters page

Penicillin is a famous example of a fungal natural product. However, fungi are some of the most prolific producers of secondary metabolites worldwide, and there is a lot to discover here! Let’s do some exploratory work and characterize secondary metabolites produced by MycoEd fungi.

Pick a genome from our MycoEd group portal. For these MycoEd genomes, you can add your own text to describe what you discover about any particular gene. This will be of great value to the scientific community, especially those studying fungal natural product diversity!
Keep in mind, what you are doing here is completely new - you might be the first person ever to look at these fungal natural product genes! If you observe something unexpected or cannot find similar genes in related genomes that is OK, and could reveal something very exciting! Documenting your work, manually curating genes (see ‘Annotate them!’ section, below), and sharing it with your instructor will be a big help to future investigators.
Let’s set a goal to describe the distribution of fungal BGC backbone genes, following the same workflow you just did for penicillin.
Each MycoCosm portal has a secondary metabolite clusters page, which can be used to study these BGCs in greater detail. On the menu bar for any portal, you will see an ‘Annotations’ pulldown. Click it, then select ‘Secondary Metabolism Clusters’
The secondary metabolism clusters page shows a table of BGC counts across a set of closely related genomes. The columns separate different types of backbone genes and the rows are individual genomes. For your MycoEd genome, click on the number in the ‘Total’ column to see the BGCs found only in this genome, or click on the number within the backbone gene column you are interested in (e.g. NRPS).
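The layout of this table (genomes as rows, backbone types as columns, plus a total) can be pictured as a nested dictionary. The Python sketch below is only an illustration: the genome names and counts are invented, and apart from NRPS the column names are examples of our own choosing that may not match the columns shown on the page.

```python
# Hypothetical counts: rows are genomes, columns are backbone gene types.
bgc_counts = {
    "genome_1": {"NRPS": 10, "PKS": 12, "other": 4},
    "genome_2": {"NRPS": 3, "PKS": 7, "other": 0},
    "genome_3": {"NRPS": 0, "PKS": 5, "other": 2},
}


def total_clusters(counts_by_type):
    """Sum across backbone types, like the 'Total' column."""
    return sum(counts_by_type.values())


def genomes_with(table, backbone_type):
    """List genomes that have at least one cluster of the given type."""
    return [
        genome
        for genome, counts in table.items()
        if counts.get(backbone_type, 0) > 0
    ]


for genome, counts in bgc_counts.items():
    print(genome, counts, "Total:", total_clusters(counts))

print(genomes_with(bgc_counts, "NRPS"))
```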
The following page will show details on all BGCs found. This includes the type of cluster, the location on the genome and its architecture, including the backbone gene (in color) and nearby predicted decorating enzymes (gray). Note that some clusters might be missing decorating enzymes - this could be real, or we failed to predict them.
Hovering your mouse over the genes in the cartoon will provide information on those proteins. Pick a cluster and repeat our previous exercise to characterize its distribution. You can start by clicking on the backbone gene in the cartoon, which will bring you to the corresponding protein page. Keep this page open in a separate tab for the rest of the exercise; we will revisit it later.
Collect the protein sequence, then search a related MycoCosm Phylogroup. You can find a related group by going to the portal Info page. At the bottom of this screen is a ‘Links’ section. Select the PhyloGroup containing this organism’s closest relatives, which is at the end of the list. If there are only a few genomes (less than 5), try going one taxonomic level up by returning to the info page and selecting the second to last link.
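The rule for picking a group (start with the narrowest PhyloGroup, and move one level up if it has fewer than 5 genomes) can be sketched as follows. The group names and genome counts are invented, and the function is our own illustration. It also keeps moving up more than one level if needed, which goes slightly beyond the instruction above.

```python
def choose_phylogroup(links, min_genomes=5):
    """Pick the narrowest group with at least min_genomes genomes.

    links is a list of (group_name, genome_count) pairs ordered from
    broadest to narrowest, like the 'Links' section described above.
    Falls back to the broadest group when none is large enough.
    """
    for name, count in reversed(links):
        if count >= min_genomes:
            return name
    return links[0][0] if links else None


# Hypothetical groups, broadest first and narrowest last.
example_links = [
    ("broad_group", 900),
    ("middle_group", 40),
    ("narrow_group", 3),
]

print(choose_phylogroup(example_links))
```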
Start exploring by conducting a BLAST search! If you have a lot of good hits, make sure to increase the number of alignments shown until you see low-quality hits. Click ‘Configure this screen’ and increase the number of hits shown to a larger number (e.g. 100); see below.
Select any hit that passes the BSR threshold of 0.9 we defined earlier (see Step 9 in the ‘Find candidate Penicillin BGCs’ section) and let’s see where those fungi are on the interactive tree. The interactive tree is produced only once or twice a year, so your chosen MycoEd genome might not be there depending on when you conduct this analysis. That is OK since we are mostly interested in where these other fungi are found.

9.2.4.2 Annotate them!

Once you have completed your analysis, return to the protein page for your backbone gene and click ‘view/modify manual annotation’.
Next, add your description. You should see ‘add/edit’ buttons on the next page. For the ‘Description’ attribute, select ‘add’ then enter details on what you discovered about that gene and click ‘save item’.
Your name will be associated with your entry and eventually, it will become searchable across the portal! This means anyone who visits this genome can easily retrieve these details (via the search page), and know who to thank (you!) for this work.