SeqPHASE is a web tool that converts FASTA sequence alignments into the input file format required for haplotyping with PHASE (http://stephenslab.uchicago.edu/software.html) and converts PHASE output files back into FASTA. SeqPHASE is geared primarily towards molecular ecologists who use PHASE to infer the haplotypes of nuclear sequence markers from direct sequencing, but it may be useful to others as well.
Symbols accepted in the alignments are A (adenine), T (thymine), C (cytosine), G (guanine), W (adenine or thymine), Y (cytosine or thymine), K (guanine or thymine), M (adenine or cytosine), S (cytosine or guanine), R (adenine or guanine), N and ? (missing information), and - (indel). After the user hits the Submit button on the web form, SeqPHASE starts by verifying that all sequences have the same length, that they contain only authorized symbols and that all sequences have different names. It then removes constant positions, lists the variable positions at which more than two different nucleotides are found and creates up to three files: one .inp file necessary to run PHASE, one .known file detailing which phases are known (if any) and one .const file recording the constant positions that were removed from the alignment (if any). Since PHASE does not accept letters for multistate characters, nucleotides are written into the .inp file as numbers that follow alphabetical order as a mnemonic: -1 or ? for missing information (depending on whether the position displays two or more than two different nucleotides), 0 for indel, 1 for A, 2 for C, 3 for G and 4 for T. Depending on the properties of the submitted dataset, a suitable command syntax for running PHASE is suggested on the output webpage; for more information and other possible options, please refer to the PHASE 2.1 documentation (http://stephenslab.uchicago.edu/instruct2.1.pdf).
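The checks and the numeric coding described above can be sketched in a few lines of Python. This is only an illustration and not SeqPHASE's actual code: the function names check_alignment and encode_symbol are hypothetical, and the code used for missing data is left to the caller because it depends on the position.

```python
# Numeric codes described above: alphabetical order, with 0 for an indel.
BASE_CODES = {"-": "0", "A": "1", "C": "2", "G": "3", "T": "4"}

# Two-base ambiguity codes accepted in the alignments.
IUPAC_PAIRS = {
    "W": ("A", "T"), "Y": ("C", "T"), "K": ("G", "T"),
    "M": ("A", "C"), "S": ("C", "G"), "R": ("A", "G"),
}

ALLOWED = set(BASE_CODES) | set(IUPAC_PAIRS) | {"N", "?"}


def check_alignment(records):
    """Check a list of (name, sequence) tuples; raise ValueError on a problem."""
    names = [name for name, _ in records]
    if len(set(names)) != len(names):
        raise ValueError("all sequences must have different names")
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("all sequences must have the same length")
    for name, seq in records:
        bad = set(seq.upper()) - ALLOWED
        if bad:
            raise ValueError(f"{name}: unauthorized symbols {sorted(bad)}")


def encode_symbol(symbol, missing_code):
    """Return the two allele codes for one alignment symbol.

    missing_code is "-1" or "?" and is chosen by the caller (assumption:
    this sketch does not decide which one applies to a given position).
    """
    symbol = symbol.upper()
    if symbol in ("N", "?"):
        return (missing_code, missing_code)
    if symbol in IUPAC_PAIRS:
        first, second = IUPAC_PAIRS[symbol]
        return (BASE_CODES[first], BASE_CODES[second])
    return (BASE_CODES[symbol], BASE_CODES[symbol])
```

For instance, encode_symbol("Y", "?") gives the codes for C and T, one per haplotype, while a homozygous A gives the code for A twice.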
Instead of generating PHASE input files from a single FASTA alignment, which would require users to specify manually which phases are already known (for instance from cloning) and which individuals need to be phased, SeqPHASE takes as input two separate FASTA files: one for homozygous individuals and heterozygotes to be phased (with one sequence per individual), and a second one for heterozygotes whose phases are already known (with two sequences per individual). In the alignment of phased heterozygotes, the names of the two sequences of each individual should differ only by their last character (e.g., 'indiv3a' and 'indiv3b'). Heterozygous individuals whose two haplotypes differ by only one substitution or insertion/deletion can be entered in either the first or the second alignment, since haplotyping is trivial in such cases.
In the PHASE output files, nucleotides are represented by numbers and constant positions are omitted; using these files to recover the actual haplotype sequences can therefore be tedious and error-prone. For this reason a second script was written, which takes as input the .const file generated during Step 1 and the .out or .out_pairs PHASE output file, and returns a FASTA alignment of haplotype sequences (if the .const file box is left empty, a FASTA alignment containing only the variable positions is generated). If a .out file is supplied, a list of phased haplotypes is returned as FASTA with one-letter IUPAC ambiguity codes (R, W, M, Y, S or K, see above) at positions where phase certainty is below a certain threshold (90% using PHASE default running options; this probability threshold can be modified by running PHASE with the -p and -q options, see the PHASE documentation). If a .out_pairs file is supplied, a list of all possible haplotype pairs for each individual is returned as FASTA with their respective probabilities indicated in parentheses. Since FASTA alignments normally cannot accommodate two sequences bearing exactly the same name, the two haplotypes of each newly phased individual receive this individual's name with "a" or "b" appended.
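The back-conversion for one individual can be illustrated as follows. This is a minimal sketch under stated assumptions: it does not parse real .out files or reinsert constant positions, the function name decode_individual and its arguments are hypothetical, and writing N where no two-base code exists (for example a base paired with an indel) is a choice made for this example only.

```python
# Reverse of the numeric coding used in the .inp file.
CODE_TO_BASE = {"0": "-", "1": "A", "2": "C", "3": "G", "4": "T"}

# One-letter ambiguity code for each pair of distinct bases.
PAIR_TO_IUPAC = {
    frozenset("AT"): "W", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AG"): "R",
}


def decode_individual(name, hap1, hap2, uncertain=()):
    """Return FASTA text for the two haplotypes of one individual.

    hap1, hap2: sequences of numeric codes given as strings.
    uncertain: 0-based indices where the phase is below the threshold.
    """
    seq_a, seq_b = [], []
    for i, (code1, code2) in enumerate(zip(hap1, hap2)):
        base1, base2 = CODE_TO_BASE[code1], CODE_TO_BASE[code2]
        if i in uncertain and base1 != base2:
            # Assumption of this sketch: fall back to N when the pair
            # has no two-base ambiguity code (e.g. a base and an indel).
            base1 = base2 = PAIR_TO_IUPAC.get(frozenset((base1, base2)), "N")
        seq_a.append(base1)
        seq_b.append(base2)
    # The two haplotypes receive the individual's name plus "a" or "b".
    return f">{name}a\n{''.join(seq_a)}\n>{name}b\n{''.join(seq_b)}\n"


print(decode_individual("ind1", ["1", "2", "4"], ["1", "4", "4"], uncertain={1}))
```

In this example the second site is heterozygous C/T with an uncertain phase, so both haplotypes carry Y at that position.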
Any sequence label is acceptable (as long as it conforms to the general FASTA format), except in the alignment of phased haplotypes: for SeqPHASE to determine which sequences describe the two haplotypes of a given individual, the labels of these two sequences should differ only by their last character (e.g., individual23a and individual23b).
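The naming rule amounts to grouping labels by everything except their last character. A minimal sketch, assuming a hypothetical helper named pair_haplotypes (this is not SeqPHASE's own code):

```python
from collections import defaultdict


def pair_haplotypes(names):
    """Group sequence labels that differ only by their last character."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:-1]].append(name)  # key: label without its last character
    unpaired = {key: members for key, members in groups.items() if len(members) != 2}
    if unpaired:
        raise ValueError(f"expected exactly two sequences per individual: {unpaired}")
    return {individual: tuple(members) for individual, members in groups.items()}


pairs = pair_haplotypes(["individual23a", "individual23b", "indiv3a", "indiv3b"])
```

A label without a partner, or three labels sharing the same prefix, raises an error instead of being paired silently.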
SeqPHASE was designed with the phasing of sequences obtained from direct sequencing of nuclear markers in mind. As a result, it assumes that all nucleotides in the input alignment are contiguous and takes the locus position of each variable site to be its actual position in the alignment. However, it is easy to specify different locus positions by manually editing the PHASE input file produced by SeqPHASE (the positions of the variable sites are listed in the third line of the input file, following the letter P).
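This manual edit can also be scripted. The sketch below relies only on what is stated above, namely that the third line of the .inp file starts with the letter P followed by the positions; the function name set_locus_positions and the sample file content are hypothetical.

```python
def set_locus_positions(inp_text, positions):
    """Return the .inp text with the positions on the P line replaced."""
    lines = inp_text.splitlines()
    if len(lines) < 3 or not lines[2].startswith("P"):
        raise ValueError("third line does not start with 'P'")
    old_positions = lines[2].split()[1:]
    if len(old_positions) != len(positions):
        raise ValueError("the number of positions must stay the same")
    lines[2] = "P " + " ".join(str(p) for p in positions)
    return "\n".join(lines) + "\n"


# Hypothetical file content, used only to exercise the function.
sample = "1\n2\nP 12 57\nSS\n#ind1\n1 2\n1 4\n"
edited = set_locus_positions(sample, [100, 2500])
```

Only the third line changes; every other line of the file is passed through untouched.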
Length-variant heterozygotes (LVHs) are individuals whose haplotypes are of different lengths due to the presence of one or several indels. As a result, chromatograms look fine until the first indel (sometimes with a few double peaks if there are SNPs), then display numerous double peaks due to the superposition of non-homologous bases from the two sequences (see Flot et al. (2006) for an explanatory figure). It is possible to identify the SNPs and the position of the indel, then to use SeqPHASE and PHASE to phase these polymorphisms; however, there are no IUPAC codes available to represent "A or indel", "C or indel", "G or indel" and "T or indel", which makes it difficult to code the genotypes of length-variant heterozygotes in SeqPHASE. A possible workaround is to enter the data for length-variant heterozygotes at the end of the second alignment (the one normally intended for individuals whose phase is already known), then to manually edit the .known file and replace 0 (meaning "known phase") in the corresponding lines by * (meaning "unknown phase"). A better way, which does not require any tampering with the data files, is to phase all length-variant heterozygotes in the dataset prior to running PHASE, using programs such as Champuru, TraceHaplotyper or Indelligent that analyze the patterns of double peaks in the chromatograms obtained from direct sequencing.
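For the first workaround, the edit of the .known file can be done with a short script. This sketch assumes that the lines to change are identified by their line number in the file; the function name mark_unknown and the sample content are hypothetical, and the real file should be inspected before editing.

```python
def mark_unknown(known_text, line_numbers):
    """Replace 0 (known phase) by * (unknown phase) on the given 1-based lines."""
    lines = known_text.splitlines()
    for number in line_numbers:
        lines[number - 1] = lines[number - 1].replace("0", "*")
    return "\n".join(lines) + "\n"


# Hypothetical content: the third line stands for a length-variant heterozygote.
sample = "***\n000\n000\n"
edited = mark_unknown(sample, [3])
```

Lines that are not listed keep their 0 entries, so individuals whose phase is genuinely known are left as they were.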
Flot (2010) SeqPHASE: a web tool for interconverting PHASE input/output files and FASTA sequence alignments. Molecular Ecology Resources 10 (1): 162-166.