@jaclynsaunders
Created April 22, 2021 04:00
import re

import pandas as pd

INPUT_FILE = "CAT-taxa-out.txt"
OUTPUT_FILE = "formatted_CAT-taxa-out.txt"

ORF_list = []
taxid_list = []
no_hits = []

with open(INPUT_FILE, "r") as f:
    next(f)  # skip the header line
    for line in f:
        # Drop the trailing newline so it cannot leak into the last field.
        data = line.rstrip("\n").split("\t")
        if "ORF has no hit to database" in line:
            no_hits.append(data[0])
        else:
            ORF_list.append(data[0])
            taxaStr = data[2]
            # Keep only the last entry of the semicolon-separated lineage;
            # if there is no semicolon, use the whole field.
            z = re.match("(.*);(.+)", taxaStr)
            lastTax = z.group(2) if z else taxaStr
            taxid_list.append(lastTax.replace("*", ""))  # strip '*' markers

# ORFs without hits go at the end of the table with taxid -1.
ORF_list = ORF_list + no_hits
taxid_list = taxid_list + len(no_hits) * ["-1"]

df = pd.DataFrame(list(zip(ORF_list, taxid_list)), columns=["Taxon_name", "NCBI_taxon_id"])
# Insert an empty first column named Taxon_fasta_file_name.
df.insert(loc=0, column="Taxon_fasta_file_name", value=[""] * len(df))
df["NCBI_taxon_id"] = df["NCBI_taxon_id"].astype(str)
df.to_csv(OUTPUT_FILE, index=False)