• Uncategorized

About linux : how-to-rewrite-the-last-column-of-a-file-using-sed-or-python

Question Detail

I have this gff file that doesn’t seem to conform to the format I expect. The problem is with the last column (although it does seem some row have more tabs if I split by tab). here is what I have:

scaffold10x_1000_pilon  AUGUSTUS        gene    12711   22079   0.67    -       .       g1
scaffold10x_1000_pilon  AUGUSTUS        transcript      12711   22079   0.47    -       .       g1.t1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      12711   12713   .       -       0       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        intron  13044   13486   0.89    -       .       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        intron  13936   21904   0.5     -       .       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     12711   13043   0.99    -       0       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     13487   13935   0.64    -       2       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     21905   22079   0.67    -       0       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        start_codon     22077   22079   .       -       0       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        transcript      12711   14150   0.2     -       .       g1.t2
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      12711   12713   .       -       0       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        intron  13044   13486   0.91    -       .       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        intron  13936   14128   0.2     -       .       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     12711   13043   0.96    -       0       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     13487   13935   0.45    -       2       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     14129   14150   0.21    -       0       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        start_codon     14148   14150   .       -       0       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        gene    41722   42102   0.32    +       .       g2
scaffold10x_1000_pilon  AUGUSTUS        transcript      41722   42102   0.32    +       .       g2.t1
scaffold10x_1000_pilon  AUGUSTUS        start_codon     41722   41724   .       +       0       transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon  AUGUSTUS        CDS     41722   42102   0.32    +       0       transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      42100   42102   .       +       0       transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon  AUGUSTUS        gene    106074  106640  1       +       .       g3
scaffold10x_1000_pilon  AUGUSTUS        transcript      106074  106640  1       +       .       g3.t1
scaffold10x_1000_pilon  AUGUSTUS        start_codon     106074  106076  .       +       0       transcript_id "g3.t1"; gene_id "g3";
scaffold10x_1000_pilon  AUGUSTUS        CDS     106074  106640  1       +       0       transcript_id "g3.t1"; gene_id "g3";
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      106638  106640  .       +       0       transcript_id "g3.t1"; gene_id "g3";

and here is what I would like to have:

scaffold10x_1000_pilon  AUGUSTUS        gene    12711   22079   0.67    -       .       ID=g1
scaffold10x_1000_pilon  AUGUSTUS        transcript      12711   22079   0.47    -       .       ID=g1.t1;Parent=g1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      12711   12713   .       -       0       Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        intron  13044   13486   0.89    -       .       Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        intron  13936   21904   0.5     -       .       Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     12711   13043   0.99    -       0       ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     13487   13935   0.64    -       2       ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     21905   22079   0.67    -       0       ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        start_codon     22077   22079   .       -       0       Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        transcript      12711   14150   0.2     -       .       ID=g1.t2;Parent=g1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      12711   12713   .       -       0       Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        intron  13044   13486   0.91    -       .       Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        intron  13936   14128   0.2     -       .       Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        CDS     12711   13043   0.96    -       0       ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        CDS     13487   13935   0.45    -       2       ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        CDS     14129   14150   0.21    -       0       ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        start_codon     14148   14150   .       -       0       Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        gene    41722   42102   0.32    +       .       ID=g2
scaffold10x_1000_pilon  AUGUSTUS        transcript      41722   42102   0.32    +       .       ID=g2.t1;Parent=g2
scaffold10x_1000_pilon  AUGUSTUS        start_codon     41722   41724   .       +       0       Parent=g2.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     41722   42102   0.32    +       0       ID=g2.t1.cds;Parent=g6.t1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      42100   42102   .       +       0       Parent=g2.t1
scaffold10x_1000_pilon  AUGUSTUS        gene    106074  106640  1       +       .       ID=g3
scaffold10x_1000_pilon  AUGUSTUS        transcript      106074  106640  1       +       .       ID=g3.t1;Parent=g3
scaffold10x_1000_pilon  AUGUSTUS        start_codon     106074  106076  .       +       0       Parent=g3.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     106074  106640  1       +       0       ID=g3.t1.cds;Parent=g3.t1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      106638  106640  .       +       0       Parent=g3.t1

I have tried to use grep and sed commands in linux, could seem to get it done.then resolved to python.

speaking of python, I’m trying to read the file in tabs, then kind of use index base on columns, work with column 3 and 9, which I could index as data[2] and data[8].

Here is what I wrote, I know it may not work like that tough, just my idea, I’m a bit new to python too.

data = open("my my_bad_gff", 'r')
new_file = ''
for line in data:

    columns = line.rstrip("\n").split("\t")

    scaffold = columns[0]
    source = columns[1]
    feature = columns[2]
    start = columns[3]
    end = columns[4]
    score = columns[5]
    strand = columns[6]
    frame = columns[7]
    attribute = columns[8]

    if feature == 'gene':  #Im trying to take the row called gene, and assign its content as x, which is g1 in this case
        a = str(columns[8])
        b = 'ID='+ a #which I think should give me ID=g1
    if feature == 'transcript':
        columns[8] = a + '.t1' + ';Parent=' + a  # hopint it gives me ID=g1.t1;Parent=g1, but how can i make sure '.t1' is not fixes, since transcript number can ncrease for each gene

    if feature == 'intron' and 'start_codon' and 'stop_codon':
        columns [8] = 'Parent=' + a + '.t1'# should give me Parent=g1.t1
        d = columns [8]
    if feature == 'CDS':
        columns[8] = a + '.t1' + 'cds;' + d #hoping this gives me ID=g1.t1.cds;Parent=g1.t1
        new_file.append(data)

Is there a command line that can just do this for me, or I have to just use python??
thanks

Question Answer

perhaps this can get you started with awk

$ awk -F'\t' '{match($9,"(g[0-9])",m)} 
    $3=="gene"{$9="ID="m[1]}

    # add other conditions 
    # using the same template
    # ...

    1' file

note that here you actually meant OR not AND feature == 'intron' and 'start_codon' and 'stop_codon'

match($9,"(g[0-9])",m) is to extract the g1,g2 etc values from the last column into m[1], the rest should be easy to read.

You may also like...

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.