[Aphidgenomics] using NCBI Gnomon models

Murphy, Terence (NIH/NLM/NCBI) [C] murphyte at ncbi.nlm.nih.gov
Tue Jun 17 08:28:02 EDT 2008


I received a question about the best approach for using the NCBI gene
set and Gnomon set to identify Your Favorite Gene(s) (YFGs), so I would
like to recommend two approaches, depending on your interests and level
of expertise:

----------
Simple approach:

1) BLAST the NCBI gene set, which includes those genes classified as
protein-coding and supported by experimental evidence (transcripts or
homology to other proteins)
2) For referring to specific genes, use the gene symbol (LOC#########)
or RefSeq accession (NM_/NP_, XM_/XP_) as indicated in the defline

PROS: simple; only includes genes with at least some experimental
support (transcripts or homology to other proteins); easy to refer to
LOC######### and RefSeq accession in publications
CONS: may miss some genes that are classified as pseudogenes, or that
didn't meet the experimental support criteria
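
As a rough illustration of step 2, the sketch below pulls the gene
symbol and RefSeq accession(s) out of a FASTA defline with regular
expressions. The identifier shapes (LOC plus digits, NM_/NP_/XM_/XP_
prefixes) come from the description above; the exact defline layout,
the function name, and the example defline are assumptions made up for
illustration, so adjust the patterns to what your files actually
contain.

```python
import re

# Identifier shapes as described in the message; the surrounding defline
# layout is an assumption, so the patterns only look for the IDs themselves.
LOC_RE = re.compile(r"\bLOC\d+\b")
REFSEQ_RE = re.compile(r"\b(?:NM|NP|XM|XP)_\d+(?:\.\d+)?")


def ids_from_defline(defline):
    """Return (gene_symbol, [refseq_accessions]) found in one defline.

    gene_symbol is None when no LOC symbol is present.
    """
    loc = LOC_RE.search(defline)
    gene_symbol = loc.group(0) if loc else None
    return gene_symbol, REFSEQ_RE.findall(defline)


# Hypothetical defline with placeholder numbers, for illustration only.
EXAMPLE_DEFLINE = ">gi|000000|ref|XP_000000001.1| similar to protease (LOC100000001)"

if __name__ == "__main__":
    print(ids_from_defline(EXAMPLE_DEFLINE))
```

Running this over the deflines of your BLAST hits gives identifiers
that can be cited directly, without any lookup in a second file.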

----------
Comprehensive approach:

1) BLAST the Gnomon set, which includes all genes in the NCBI gene set
2) Use the hmm_geneID_RefSeq file to determine which hmm models
(column 3) are considered protein-coding genes or pseudogenes
(column 8)
3) For referring to specific genes, use the locus_id (9 digits,
column 5) or RefSeq accession (XM_/XP_, column 6) from the
hmm_geneID_RefSeq file. Please be aware that the hmm model # is not a
permanent identifier, so it should not be used to refer to a specific
model for publication purposes.
4) Any hmm models that are not classified as protein_coding but appear
to be valid protein-coding genes should be submitted for inclusion in
the "Official Gene Set". Ideally, hmm models that are classified as
pseudogene* should be updated with a valid gene structure that does not
include the frameshift or early stop codon. Details on how and where to
submit genes for inclusion in the Official Gene Set, or how to update an
incorrect model, haven't been worked out yet, but hopefully will be
resolved by the Princeton meeting.

PROS: comprehensive
CONS: need an additional step to identify the locus type, locus_id, and
RefSeq accession
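
The additional lookup step could be scripted along the following lines.
This is only a sketch: it assumes the hmm_geneID_RefSeq file is
tab-delimited with no header row, uses the column positions given above
(3 = hmm model, 5 = locus_id, 6 = RefSeq accession, 8 = locus type),
and treats any locus type starting with "pseudogene" as a pseudogene.
The function names and the sample rows are hypothetical. Because the
Gnomon set includes all genes in the NCBI gene set, collapsing BLAST
hits to unique locus_ids is one way to end up with a single
non-redundant list.

```python
import csv
import io


def load_hmm_table(handle):
    """Map each hmm model ID (column 3) to its locus_id, RefSeq accession
    and locus type (columns 5, 6 and 8). Assumes tab-delimited input."""
    models = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 8 or row[0].startswith("#"):
            continue  # skip short lines and comment lines
        models[row[2]] = {
            "locus_id": row[4],
            "refseq": row[5],
            "locus_type": row[7],
        }
    return models


def split_by_type(models):
    """Return (protein_coding, pseudogene) lists of hmm model IDs."""
    coding = sorted(m for m, r in models.items()
                    if r["locus_type"] == "protein_coding")
    pseudo = sorted(m for m, r in models.items()
                    if r["locus_type"].startswith("pseudogene"))
    return coding, pseudo


def unique_locus_ids(hit_models, models):
    """Collapse hmm model IDs from BLAST hits to unique locus_ids,
    keeping first-seen order and skipping models not in the table."""
    seen = []
    for model in hit_models:
        record = models.get(model)
        if record and record["locus_id"] not in seen:
            seen.append(record["locus_id"])
    return seen


# Hypothetical rows with placeholder values; "-" marks columns whose
# content is not described in the message.
SAMPLE = (
    "-\t-\thmm1\t-\t100000001\tXP_000000001.1\t-\tprotein_coding\n"
    "-\t-\thmm2\t-\t100000002\tXP_000000002.1\t-\tpseudogene\n"
    "-\t-\thmm3\t-\t100000001\tXP_000000001.1\t-\tprotein_coding\n"
)

if __name__ == "__main__":
    table = load_hmm_table(io.StringIO(SAMPLE))
    print(split_by_type(table))
    print(unique_locus_ids(["hmm1", "hmm3", "hmm2"], table))
```

For a real file, replace the io.StringIO(SAMPLE) argument with an open
file handle, e.g. open("hmm_geneID_RefSeq", newline="").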

----------

Let me know if you have any questions.
	
Sincerely,

-Terence

-----
Terence Murphy, Ph.D.
RefSeq Scientist
NCBI contractor

> -----Original Message-----
> From: Karl.Gordon at csiro.au [mailto:Karl.Gordon at csiro.au]
> Sent: Tuesday, June 17, 2008 2:43 AM
> To: Murphy, Terence (NIH/NLM/NCBI) [C]
> Subject: aphid gene models - glean?
> 
> Hi Terence,
> Thank you for making the aphid gene models available. After reading
> your notes, I downloaded both the gnomon and NCBI sets; my initial
> target is to compile a unique set of proteases as a minor challenge.
> Cross-blasting with selected sequences confirms a significant
> duplication between these sets.
> Is there something like a Glean set or will there be one, please? (I
> recall the Glean-derived "Official Gene Set" used for the honeybee
> genome.) If not, what is the most straightforward automated path to
> reduce the overlap to a single set, please?
> I fully appreciate there may not be a simple answer to this question -
> just wanted to check before I start working on 2 overlapping sets of ~
> 150 proteins each!
> best wishes
> Karl Gordon
> 



