FASTA is a text-based format used in bioinformatics for representing nucleotide and peptide sequences, in which each nucleotide or amino acid is represented by a single letter. A FASTA record consists of a single-line description that begins with a “greater than” symbol (>), followed by one or more lines of sequence data. A single file can hold many such records. You can extract multiple sequences from a FASTA file using BioPerl, a set of add-on modules for the Perl programming language that includes support for the FASTA format. You can also write your own Perl script that matches patterns in the file, or use other available tools to extract FASTA sequences.
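To make the record layout concrete, here is a minimal sketch in plain Perl that splits FASTA text into its description lines and sequence data. The helper name parse_fasta and the sample data are made up for illustration; neither comes from BioPerl. The sketch assumes the first word after the “greater than” symbol is the sequence ID.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_fasta is a hypothetical helper name, not part of BioPerl.
# It takes FASTA text and returns a list of [id, description, sequence].
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>(\S*)\s*(.*)$/) {
            # Description line: first word is the ID, the rest is free text
            push @records, [$1, $2, ''];
        }
        elsif (@records && $line =~ /\S/) {
            # Sequence line: strip whitespace and append to the current record
            (my $chunk = $line) =~ s/\s+//g;
            $records[-1][2] .= $chunk;
        }
    }
    return @records;
}

# Made-up example data: one nucleotide record and one peptide record
my $example = ">seq1 first example\nACGT\nTTGA\n>seq2\nMKV\n";
for my $rec (parse_fasta($example)) {
    my ($id, $desc, $seq) = @$rec;
    printf "%s (%d residues): %s\n", $id, length($seq), $seq;
}
```

The multi-line sequence of each record is joined into one string, which is usually the form you want before searching or measuring it.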
Instructions
Things You'll Need
- FASTA file
- Perl editor
- BioPerl
- ActiveState Perl
- Biopieces
- 1. Launch your Perl editor application. You may use a simple text editor, such as Notepad. Save the file with a “.pl” extension to indicate that it is a Perl program.
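- 2. Extract a single sequence from a multiple-FASTA file by pattern matching in Perl. Type the following code into the editor, then run it with the FASTA file path and the sequence name as its two arguments:
#!/usr/bin/perl
use strict;
use warnings;
my $fasta_file = shift;   # path to the FASTA file
my $sequence = shift;     # name of the sequence to extract
open(my $fh, '<', $fasta_file) or die "Cannot open $fasta_file: $!";
my $workfile = do { local $/; <$fh> };   # read the whole file into one string
# Capture the matching description line and everything up to the next ">"
my ($fasta_seq) = $workfile =~ /(>\Q$sequence\E[^>]+)/s;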
print $fasta_seq if defined $fasta_seq;
- 3. Extract the sequences from the FASTA file using BioPerl. You can extract multiple sequences by typing the following code into the editor:
#!/usr/bin/perl -w
use Bio::SeqIO;
$sequenceobject = Bio::SeqIO->new(-file => "fasta_file_path", -format => "fasta");
The Bio::SeqIO module reads the file as a stream of sequence objects. You can retrieve a single sequence using the following statement:
$retrievedsequence = $sequenceobject->next_seq;
You can loop through the object and retrieve multiple sequences, as follows:
while ($retrievedsequence = $sequenceobject->next_seq)
{
    print $retrievedsequence->seq, "\n";
}
- 4. Extract the sequences from the FASTA file using the “Biopieces” application, which is a framework containing a set of modular tools for manipulating bioinformatics data. You run your Biopieces command at the command line:
read_fasta -i fasta_file | grab -p sequence | write_fasta -o sequence_file -x
This is a good option if you are not very technically inclined, as the framework encapsulates much of the programming effort required to process the FASTA file and output the matched sequences.
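If you want several named sequences in one pass without installing BioPerl, the pattern-matching approach can be extended. The following is a minimal sketch, not a definitive implementation: the helper name extract_records and the script name extract_multi.pl are hypothetical, and the sketch assumes each sequence ID is the first word after the “greater than” symbol.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_records is a hypothetical helper, not a BioPerl function.
# It reads FASTA lines from an open filehandle and returns the full text
# of every record whose ID (first word after ">") is in the wanted list.
sub extract_records {
    my ($fh, @wanted_ids) = @_;
    my %wanted = map { $_ => 1 } @wanted_ids;
    my $keep = 0;
    my $out  = '';
    while (my $line = <$fh>) {
        if ($line =~ /^>(\S*)/) {
            # New record: keep it only if its ID was requested
            $keep = exists $wanted{$1};
        }
        $out .= $line if $keep;
    }
    return $out;
}

# Assumed usage: perl extract_multi.pl fasta_file id1 id2 ...
if (@ARGV >= 2) {
    my ($fasta_file, @ids) = @ARGV;
    open(my $fh, '<', $fasta_file) or die "Cannot open $fasta_file: $!";
    print extract_records($fh, @ids);
    close($fh);
}
```

This version reads the file line by line instead of loading it all into memory, and it compares whole IDs, so asking for “seq1” does not also return “seq10”.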