Identify non-coding regions from a genome annotation












I have this GTF file, and I use the command below on a Linux machine to extract the coding regions of the genome:

    awk '{if($3=="transcript" && $20=="\"protein_coding\";"){print $0}}' gencode.gtf

How could I do the inverse and keep only the non-coding regions?







annotation genome gtf text-processing interval






edited Feb 22 at 19:06 by Daniel Standage
asked Feb 22 at 12:20 by Feresh Teh








  • Do you want all non-coding regions of the genome, or do you want all non-coding transcripts? These are two very different things. – terdon, Feb 22 at 18:24














4 Answers












This isn't a problem that's easily solved with awk. You aren't extracting a feature that's annotated in the GTF file; instead, you want the empty space between annotated features.

A few years ago I wrote a program called LocusPocus for a similar task. It uses a gene annotation to break a genome down into gene loci and intergenic regions, and it handles overlapping annotations and other weirdness pretty robustly. The output will include both coding and non-coding regions, but you can identify the intergenic spaces as those with iLocus_type equal to iiLocus or fiLocus.

Note: the --delta parameter will extend each gene/transcript by 500 bp by default.

Caveat: the program only accepts GFF3 input by default. Hopefully it won't be too hard to convert your GTF to GFF3.

Another caveat: the eventual interpretation of these data will depend on what features are annotated in the genome and which annotations you include vs. ignore. Do you want your non-coding regions to include non-coding genes, or should these be treated separately? Some non-coding regions will be full of transposable elements and other repetitive DNA, while others will have enhancers, promoters, or other regulatory elements. It's important to tread carefully before you jump to any conclusions.









  • Absolutely brilliant name! :) – terdon, Feb 22 at 19:32
  • Sorry, and thank you for the explanation. After googling, I see that I actually need the enhancers, promoters, or other regulatory elements of the human genome, to find cancer driver genes located in these regions. – Feresh Teh, Feb 24 at 21:24



















Likely non-coding regions of the genome are listed here:

https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMjE3NzgvZWxpZmUtMjE3Nzgtc3VwcDQtdjMueGxzeA==/elife-21778-supp4-v3.xlsx?_hash=KQi5jfO3kT2c4Qw44j4Rg6YAyCBQilYuWHVYXcRDuuo%3D






































If you want all transcripts from that GTF file whose type isn't "protein_coding", you can use almost the same command; just change the == ("is") to != ("isn't"):

    awk '{if($3=="transcript" && $20!="\"protein_coding\";"){print $0}}' gencode.gtf

Or, a simpler version:

    awk '$3=="transcript" && $20!="\"protein_coding\";"' gencode.gtf

Note that this will not include any of the HAVANA transcripts in the file, but I am assuming that's what you want since that's what your original command did.

Specifically, the command will return the following types of transcript (the numbers on the left are the number of such transcripts in the file):

    $ awk '$3=="transcript" && $20!="\"protein_coding\";"{print $20}' gencode.gtf | sort | uniq -c | sort -nk1
    1 "translated_processed_pseudogene";
    2 "Mt_rRNA";
    3 "IG_J_pseudogene";
    3 "TR_D_gene";
    4 "TR_J_pseudogene";
    5 "TR_C_gene";
    10 "IG_C_pseudogene";
    18 "IG_C_gene";
    18 "IG_J_gene";
    22 "Mt_tRNA";
    25 "3prime_overlapping_ncrna";
    27 "TR_V_pseudogene";
    37 "IG_D_gene";
    58 "non_stop_decay";
    59 "polymorphic_pseudogene";
    74 "TR_J_gene";
    97 "TR_V_gene";
    144 "IG_V_gene";
    182 "unitary_pseudogene";
    196 "IG_V_pseudogene";
    330 "sense_overlapping";
    387 "pseudogene";
    442 "transcribed_processed_pseudogene";
    531 "rRNA";
    802 "sense_intronic";
    860 "transcribed_unprocessed_pseudogene";
    1529 "snoRNA";
    1923 "snRNA";
    2050 "misc_RNA";
    2549 "unprocessed_pseudogene";
    3116 "miRNA";
    9710 "antisense";
    10623 "processed_pseudogene";
    11780 "lincRNA";
    13052 "nonsense_mediated_decay";
    25955 "retained_intron";
    28082 "processed_transcript";


You might also want to remove that "translated_processed_pseudogene", since that is actually translated into protein and is therefore technically coding:

    awk '$3=="transcript" &&
         $20!="\"protein_coding\";" &&
         $20!="\"translated_processed_pseudogene\";"' gencode.gtf
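
The position $20 only holds the transcript type when the attributes in column 9 appear in one particular order. A minimal sketch, assuming the attribute is named transcript_type as in GENCODE GTF files, that looks the value up by name instead (the function name noncoding_transcripts is made up for illustration):

```shell
# Print transcript lines whose transcript_type attribute is not protein_coding.
# Assumes a tab-separated GTF whose column 9 holds attributes such as
#   transcript_type "lincRNA";
noncoding_transcripts() {
    awk -F '\t' '
        $3 == "transcript" {
            # Locate the attribute by name rather than by field position.
            if (match($9, /transcript_type "[^"]*"/)) {
                # Skip the 17 characters of the name, the space and the
                # opening quote; drop the closing quote as well.
                type = substr($9, RSTART + 17, RLENGTH - 18)
                if (type != "protein_coding") print
            }
        }
    ' "$@"
}
```

It could be called as noncoding_transcripts gencode.gtf > noncoding.gtf; with no file argument it reads standard input. Lines without a transcript_type attribute are skipped rather than reported, which is a choice of this sketch.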






edited Feb 22 at 20:22
answered Feb 22 at 18:34 by terdon












  • Thanks a lot, really; thank you for saving me, I could not solve that myself. How can I extract the information below from each line of the resulting non-coding file? chr1 29553 30039 ENSG00000243485.2 + gene_name "MIR1302-11" – Feresh Teh, Feb 22 at 20:29
  • @FereshTeh you're welcome. I think you want awk '$3=="transcript" && $20!="\"protein_coding\";" && $20!="\"translated_processed_pseudogene\";"{print $1,$4,$5,$10,$7}' gencode.gtf but, if not, please ask a new question about that. – terdon, Feb 22 at 20:32
  • Thanks a lot. That returns everything except the gene name; this is the output: chr1 29554 31097 "ENSG00000243485.2"; + – Feresh Teh, Feb 22 at 21:15
  • @FereshTeh please ask a new question so you can show exactly what output you need. – terdon, Feb 22 at 21:17
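
For the follow-up in these comments (coordinates plus an unquoted gene ID and the gene name), one possible approach is to pull attributes out of column 9 by name. This is only a sketch: the attribute names gene_id, gene_name and transcript_type are assumed to follow the GENCODE layout, and the names noncoding_summary and attr are hypothetical.

```shell
# Print chromosome, start, end, gene ID, strand and gene name for each
# transcript that is neither protein_coding nor translated_processed_pseudogene.
noncoding_summary() {
    awk -F '\t' '
        # Return the quoted value of attribute "key" in "attrs", or "" if absent.
        function attr(attrs, key,    pat) {
            pat = key " \"[^\"]*\""
            if (!match(attrs, pat)) return ""
            # Skip the key, the space and the opening quote; drop the closing quote.
            return substr(attrs, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
        }
        $3 == "transcript" {
            type = attr($9, "transcript_type")
            if (type != "" && type != "protein_coding" && type != "translated_processed_pseudogene")
                print $1, $4, $5, attr($9, "gene_id"), $7, attr($9, "gene_name")
        }
    ' "$@"
}
```

The output fields are separated by single spaces; adding -v OFS='\t' to the awk call would give tab-separated output instead.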


















Getting the non-coding regions of a protein-coding transcript sounds like you are looking for UTRs.

UTR has its own feature type in the GTF file, so you can do this:

    $ awk -v FS="\t" '$3=="UTR"' gencode.gtf

If the GTF file is compressed, use this instead:

    $ zcat gencode.gtf.gz | awk -v FS="\t" '$3=="UTR"'

BTW: Why are you using such an old release of GENCODE? The current version is v29.
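
If a filter on column 3 prints nothing, one quick check is to tally which feature types the file actually contains. A small sketch (the function name feature_counts is made up for illustration):

```shell
# Count how often each feature type (column 3) occurs, ignoring "#" header lines.
feature_counts() {
    awk -F '\t' '!/^#/ { n[$3]++ } END { for (f in n) print n[f], f }' "$@" |
    sort -k1,1n
}
```

For example, feature_counts gencode.gtf lists one line per feature type with its count, least frequent first, so it is easy to see whether UTR or CDS rows are present and how they are spelled.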







edited Feb 22 at 13:15
answered Feb 22 at 12:50 by finswimmer












    • Sorry, what I actually need are the non-coding regions of the human genome; I only mentioned the coding parts to frame my question.
      – Feresh Teh
      Feb 22 at 12:53

    • Sorry, I tried that, but my output is empty.
      – Feresh Teh
      Feb 22 at 12:59

    • 1
      As @Wouter tells you, the non-coding regions of a genome are the complement of the coding regions. Coding regions have their own feature type in the GTF file. You can get them with $ awk -v FS="\t" '$3=="CDS"' gencode.gtf. Reading the manual for bedtools complement is your task.
      – finswimmer
      Feb 22 at 13:14

    • Sorry, but your commands return nothing; the output file is empty.
      – Feresh Teh
      Feb 22 at 18:24

    • The GTF file the OP has linked to includes non-coding transcripts (lincRNAs, pseudogenes, tRNAs, etc.). I am guessing this is what they're after.
      – terdon
      Feb 22 at 18:37
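If non-coding transcripts are what is wanted, one option is to invert the attribute test from the question. This is a rough sketch, assuming the ninth column carries a transcript_type attribute (other sources may name it differently, for example transcript_biotype in Ensembl GTFs); the toy records below are invented for illustration.

```shell
# Toy GTF with one coding and two non-coding transcripts (hypothetical records)
gtf=$(mktemp)
printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; transcript_type "protein_coding";\n' > "$gtf"
printf 'chr1\tHAVANA\ttranscript\t2000\t2600\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; transcript_type "lincRNA";\n' >> "$gtf"
printf 'chr1\tHAVANA\texon\t2000\t2600\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; transcript_type "lincRNA";\n' >> "$gtf"
printf 'chr1\tHAVANA\ttranscript\t5000\t5400\t.\t+\t.\tgene_id "G3"; transcript_id "T3"; transcript_type "processed_pseudogene";\n' >> "$gtf"

# Keep transcript records whose attribute column does NOT say transcript_type "protein_coding"
noncoding=$(awk -F'\t' '$3 == "transcript" && $9 !~ /transcript_type "protein_coding";/' "$gtf")
printf '%s\n' "$noncoding"
rm -f "$gtf"
```

Splitting on tabs and matching the attribute by name avoids depending on the attribute sitting at a fixed position such as $20. Note that "not protein_coding" is a broad bucket: it can include transcripts of coding genes that carry other labels, such as retained_intron.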





























    2












    This isn't a problem that's easily solved with awk. It's not like you're extracting a feature that's annotated in the GTF file. Instead, you want the empty space between annotated features.

    A few years ago I wrote a program called LocusPocus for a similar task. It uses a gene annotation to break down a genome into gene loci and intergenic regions. It handles overlapping annotations and other weirdness pretty robustly. The output will include both coding regions and non-coding regions, but you can identify the intergenic spaces as those with iLocus_type equal to iiLocus or fiLocus.

    Note: the --delta parameter will extend each gene/transcript by 500 bp by default.

    Caveat: the program only accepts GFF3 input by default. Hopefully it won't be too hard to convert your GTF to GFF3.

    Another caveat: eventual interpretation of these data will depend on what features are annotated in the genome and which annotations you include vs. ignore. Do you want your non-coding regions to include non-coding genes, or should these be treated separately? Some non-coding regions will be full of transposable elements and other repetitive DNA, while others will have enhancers, promoters, or other regulatory elements. It's important to tread carefully before you jump to any conclusions.
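To make the idea of "empty space between annotated features" concrete, here is a bare-bones sketch that converts CDS records to 0-based intervals, sorts them, and prints the gaps up to each chromosome's length. It uses plain awk rather than bedtools or LocusPocus, ignores strand, and applies no padding around features; the toy records and chromosome sizes are invented for illustration.

```shell
# Hypothetical chromosome lengths and toy CDS records (GTF is 1-based, inclusive)
sizes=$(mktemp); gtf=$(mktemp)
printf 'chr1\t1000\nchr2\t500\n' > "$sizes"
printf 'chr1\tHAVANA\tCDS\t101\t200\t.\t+\t0\tgene_id "G1";\n' > "$gtf"
printf 'chr1\tHAVANA\tCDS\t151\t300\t.\t+\t0\tgene_id "G1";\n' >> "$gtf"
printf 'chr1\tHAVANA\tCDS\t601\t700\t.\t-\t0\tgene_id "G2";\n' >> "$gtf"
printf 'chr1\tHAVANA\tUTR\t701\t800\t.\t-\t.\tgene_id "G2";\n' >> "$gtf"

noncoding=$(
  # Step 1: CDS records to 0-based, half-open intervals (chrom, start, end)
  awk -F'\t' -v OFS='\t' '$3 == "CDS" { print $1, $4 - 1, $5 }' "$gtf" |
  # Step 2: sort by chromosome, then by start coordinate
  sort -k1,1 -k2,2n |
  # Step 3: walk the sorted intervals and print every uncovered stretch
  awk -F'\t' -v OFS='\t' '
    NR == FNR { size[$1] = $2 + 0; next }      # first input: chromosome lengths
    $1 != chrom {                               # first interval on a new chromosome
        if (chrom != "" && pos < size[chrom]) print chrom, pos, size[chrom]
        chrom = $1; pos = 0; seen[chrom] = 1
    }
    $2 + 0 > pos { print chrom, pos, $2 }       # gap before this interval
    $3 + 0 > pos { pos = $3 + 0 }               # move past overlapping intervals
    END {
        if (chrom != "" && pos < size[chrom]) print chrom, pos, size[chrom]
        for (c in size) if (!(c in seen)) print c, 0, size[c]   # chromosomes without CDS
    }
  ' "$sizes" -
)
printf '%s\n' "$noncoding"
rm -f "$sizes" "$gtf"
```

The output is in BED-style coordinates. Whether UTRs, introns, and non-coding genes should count as "non-coding" here depends on which feature types are fed into the first step, which is exactly the interpretation question raised in the caveat above.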

















    edited Feb 22 at 19:14
    answered Feb 22 at 19:03
    Daniel Standage








    • 1
      Absolutely brilliant name! :)
      – terdon
      Feb 22 at 19:32

    • Sorry, and thank you for the explanation. After some googling, I see that what I actually need are enhancers, promoters, and other regulatory elements of the human genome, in order to find cancer driver genes located in these regions.
      – Feresh Teh
      Feb 24 at 21:24

























    0












    The likely non-coding regions of the genome are listed here:

    https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMjE3NzgvZWxpZmUtMjE3Nzgtc3VwcDQtdjMueGxzeA==/elife-21778-supp4-v3.xlsx?_hash=KQi5jfO3kT2c4Qw44j4Rg6YAyCBQilYuWHVYXcRDuuo%3D




























        answered Feb 24 at 22:47
        Feresh Teh





























