NAME
run_PRIMUS.pl - Run an IBD file through PRIMUS
SYNOPSIS
run_PRIMUS.pl [options] -p file | -i FILE=file [options] | -h
DESCRIPTION
run_PRIMUS.pl will read genome-wide IBD estimates and will identify a maximum unrelated set and/or all possible pedigrees for each family within the dataset. You may weight the selection of your unrelated set by loading in files containing values on which to weight. For pedigree reconstruction, you may also load information on sex, age, and affection status for each sample. PRIMUS is also able to generate IBD estimates for SNP data in PLINK input file formats (ped/map or bed/bim/fam) as long as the system also has the following software installed: R, PLINK, and EIGENSTRAT's smartpca executable.
OPTIONS
For usage and documentation:
-h, --help Brief help message
--man Full documentation
Required for prePRIMUS, IMUS, and PR (one of the following):
-p, --plink_ibd Specify path to a .genome IBD estimates file produced by PLINK
-i, --input Specify path to an IBD estimates file and additional column information
(or both):
--file Path to PLINK formatted data without the file extensions; behaves the same as in PLINK (requires --genome)
--genome Read in --file and calculate IBD estimates using PLINK
General options:
-t, --rel_threshold Set the minimum level of relatedness for two people to be considered related (default=0.1)
--degree_rel_cutoff Set the maximum degree of relatedness for two people to be considered related (default=3; i.e. 3rd degree relatives)
-o, --output_dir Specify path to the output directory for all results (default=[PATH_TO_IBD_FILE]_PRIMUS/)
-v, --verbose Set verbosity level (0=none; 1=default; 2=more; 3=max)
prePRIMUS IBD estimation using PLINK (not included in lite version)
--file Path to PLINK formatted data without the file extensions; behaves the same as in PLINK (requires --genome)
--genome Read in the specified PLINK formatted data file and calculate IBD estimates using PLINK
--plink_ex Path to the plink executable file (searches environment variables by default)
--smartpca_ex Path to the EIGENSTRAT's smartpca executable file (searches environment variables by default)
--ref_pops Comma separated list of HapMap3 populations used for reference allele freqs (overrides default method)
--no_automatic Turn off automatic selection of the HapMap3 populations for reference allele freqs (On by default)
--remove_AIMs Automatically remove ancestry informative markers (off by default)
--keep_AIMs Do not remove ancestry informative markers (off by default)
--internal_ref Use the dataset provided in --file to get reference allele frequencies
--alt_ref_stem Path to PLINK formatted data (no file extensions) used for allele frequencies
--no_PCA_plot Will not run smartpca or generate the PCA plot
--keep_inter_files Keep intermediate files used to create the IBD estimates with prePRIMUS
Identification of maximum unrelated set options:
--no_IMUS Don't identify a maximum unrelated set (runs IMUS by default)
--missing_val Set value that denotes missing data in IBD file
-s, --size Specify to weight on set size (Default unless a binary trait is specified first)
--high_btrait File with FID, IID, and binary trait to weight for the higher value
--low_btrait File with FID, IID, and binary trait to weight for the lower value
--high_qtrait File with FID, IID, and quantitative trait to weight for higher values
--low_qtrait File with FID, IID, and quantitative trait to weight for lower values
--mean_qtrait File with FID, IID, and quantitative trait to weight towards the mean value
--tails_qtrait File with FID, IID, and quantitative trait to weight against the middle values
Pedigree reconstruction options:
--no_PR Don't reconstruct pedigrees (runs pedigree reconstruction by default)
--max_gens Max number of generations sampled in reconstructed pedigree (default = no limit)
--max_gen_gap Max number of generations between two people that have a child (default = 0)
--age_file Specify path to the file containing the age of each sample
--ages Like --age_file but requires FILE=[file], optional specification of file columns
--sex_file Specify path to the file containing the sex of each sample
--sexes Like --sex_file but requires FILE=[file], optional specification of file columns
--mito_matches Path to mito matching status for each pair of samples (requires FILE=[file])
--y_matches Path to y matching status for each pair of samples (requires FILE=[file])
--MT_error_rate Proportion of the MT sequence that must not match to be called a non-match
--Y_error_rate Proportion of the Y sequence that must not match to be called a non-match
--affection_file Specify path to the file containing the affection status of each sample
--affections Like --affection_file; need FILE=[file], optional specification of file columns
--int_likelihood_cutoff Initial minimum likelihood for a relationship to be considered during reconstruction (default = 0.1)
PRIMUS+ERSA options:
--ersa_model_output Path to the model_output_file generated by ERSA
--ersa_results Path to the standard ERSA results file
--project_summary Path to the PRIMUS generated project level summary file (usually in *_PRIMUS/ dir)
--degree_rel_cutoff Specify the minimum degree of relatedness used to generate pedigrees (default = 3)
DOCUMENTATION
For usage and documentation:
- -h, --help
-
Prints a brief help message and exits.
- --man
-
Prints the manual page and exits.
Required (one of the following):
- -p file, --plink_ibd file
-
Specify path to a .genome IBD estimates file that was generated by PLINK using the --genome command. The .genome files are white-space separated files with a header. The columns are in the following order: FID1(1) IID1(2) FID2(3) IID2(4) RT(5) EZ(6) IBD0(7) IBD1(8) IBD2(9) PI_HAT(10), and these are the default column settings for this option.
- -i FILE=FILE [options], --input FILE=FILE [options]
-
Specify path to a white-space separated file in which each row shows a pairwise comparison of two samples (e.g. genome-wide IBD estimates). This file must contain a family ID (FID) and an individual ID (IID) for each sample, and the combination of the FID and IID must be unique. The order of columns doesn't matter, but if they differ from the default settings (see option --plink_ibd for defaults and expected column names) then the user must specify the column number for each necessary column. For identification of a maximum unrelated set, the input file must have FID1, IID1, FID2, IID2, and PI_HAT (or another numerical way of representing a relationship). For pedigree reconstruction, the input file must have columns 1-4 and 7-10 described in option --plink_ibd, except that the PI_HAT/RELATEDNESS column can be a different measure of relatedness. The PI_HAT column is used to identify the family networks. The input file MUST have a header line, but the column headings do not need to match those described in this document. However, you must specify each column number that differs from the defaults, and this is what the [options] section allows you to do. For example, if your input file is named input.txt and has the following columns: FID1(1) IID1(2) FID2(3) IID2(4) IBD0(5) IBD1(6) IBD2(7) PI_HAT(8), you will need to use the following command:
run_PRIMUS.pl -i FILE=path/input.txt IBD0=5 IBD1=6 IBD2=7 PI_HAT=8
Notice that I didn't have to specify the columns for the FIDs or the IIDs because they still match the defaults. I did have to specify the column numbers for the other 4 columns.
Note: All column numbers use one-based numbering, i.e. the first column in a file is column 1, etc.
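The FILE=... KEY=N syntax can be read as a set of overrides applied on top of the default column layout. The Python sketch below only illustrates that idea and is not PRIMUS's actual parser: the function name parse_input_spec and the DEFAULT_COLUMNS table are hypothetical, and the default column numbers are the ones listed under --plink_ibd.

```python
# Illustrative only: mimics how "-i FILE=... KEY=N" overrides default columns.
# DEFAULT_COLUMNS follows the .genome layout described under --plink_ibd (1-based).
DEFAULT_COLUMNS = {
    "FID1": 1, "IID1": 2, "FID2": 3, "IID2": 4,
    "IBD0": 7, "IBD1": 8, "IBD2": 9, "PI_HAT": 10,
}


def parse_input_spec(tokens):
    """Return (path, columns) from tokens such as ["FILE=in.txt", "IBD0=5"]."""
    path = None
    columns = dict(DEFAULT_COLUMNS)
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got {token!r}")
        if key == "FILE":
            path = value
        elif key in columns:
            columns[key] = int(value)  # column numbers are 1-based
        else:
            # Hypothetical behaviour: this sketch rejects names it does not know.
            raise ValueError(f"unknown column name {key!r}")
    if path is None:
        raise ValueError("FILE=<path> is required")
    return path, columns


path, columns = parse_input_spec(
    ["FILE=path/input.txt", "IBD0=5", "IBD1=6", "IBD2=7", "PI_HAT=8"]
)
print(path, columns)
```

Only the four columns that moved are overridden; FID1, IID1, FID2, and IID2 keep their default positions, which is why they can be left off the command line.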
(or both of the following):
- --file file
-
Path to PLINK formatted ped/map or bed/bim/fam files, but without the file extensions. This option behaves just like PLINK's --file and --bfile options, but here it must be used with the --genome option. For example, if your ped and map files are /usr/data/foo.ped and /usr/data/foo.map, then you could use the following command:
run_PRIMUS.pl --file /usr/data/foo --genome
Check the PLINK documentation (http://pngu.mgh.harvard.edu/~purcell/plink/) for details on the formatting for this file.
- --genome
-
This option tells PRIMUS to read in the specified ped/map or bed/bim/fam files and calculate IBD estimates using the prePRIMUS IBD pipeline.
General Options:
- -t NUM, --rel_threshold NUM
-
Set the minimum level of relatedness for two people to be considered related (Default = 0.1). PRIMUS considers any pair of individuals to be related if the value in the PI_HAT/RELATEDNESS column of the IBD estimates file is above this threshold. For example, if your measure of relatedness is PI_HAT in PLINK's .genome file and you want to identify all relationships up to third degree relatives, use:
--rel_threshold 0.1
Recommended PI_HAT thresholds for first and second degree relatives are 0.375 and 0.1875, respectively. You will need to determine the threshold cutoffs for different measures of relatedness (e.g. kinship coefficients are half of the PI_HAT values).
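As a rough illustration of what the threshold does, the Python sketch below keeps only the pairs whose relatedness value is above the cutoff and shows the kinship-to-PI_HAT conversion mentioned above. The function names and the example values are made up for this sketch; it is not how PRIMUS is implemented.

```python
# Illustrative only: select pairs whose relatedness exceeds --rel_threshold.
def related_pairs(pairs, threshold=0.1):
    """pairs: iterable of (id1, id2, pi_hat). Keep pairs strictly above threshold."""
    return [(a, b, pi_hat) for a, b, pi_hat in pairs if pi_hat > threshold]


def kinship_to_pi_hat(kinship):
    """Kinship coefficients are half of PI_HAT, so doubling gives the PI_HAT scale."""
    return 2.0 * kinship


# Made-up example values, not real data.
example = [("A", "B", 0.50), ("A", "C", 0.24), ("B", "C", 0.03)]
print(related_pairs(example))          # default threshold of 0.1
print(related_pairs(example, 0.375))   # first-degree threshold
```

If your relatedness column holds kinship coefficients, either convert the values first or halve the threshold you pass with --rel_threshold.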
- -d NUM, --degree_rel_cutoff NUM
-
Set the maximum degree of relatedness for two people to be considered related (Default = 3; i.e., 3rd degree relatives). PRIMUS considers any pair of individuals to be related if they are classified as 3rd degree relatives or closer by the value in the PI_HAT column of the IBD estimates file. This only works properly if PI_HAT genome-wide IBD estimates are provided. The default is 3rd degree, which sets --rel_threshold to 0.09375. If 2nd degree is specified, --rel_threshold is set to 0.1875. If 1st degree is specified, --rel_threshold is set to 0.375. For example, if your measure of relatedness is PI_HAT in PLINK's .genome file and you want to identify all relationships up to 2nd degree relatives, use:
--degree_rel_cutoff 2
Recommended PI_HAT thresholds for first and second degree relatives are 0.375 and 0.1875, respectively. PRIMUS will not accept any degree other than 1, 2, or 3. If you specify both --degree_rel_cutoff and --rel_threshold, then PRIMUS will use the threshold cutoffs described above based on the --degree_rel_cutoff specified and will ignore the value provided with --rel_threshold.
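The three documented cutoffs happen to equal the midpoint between the expected PI_HAT of one degree (0.5 for 1st, 0.25 for 2nd, 0.125 for 3rd) and that of the next more distant degree. The Python sketch below checks that arithmetic and mirrors the documented precedence of --degree_rel_cutoff over --rel_threshold. Whether PRIMUS derives the values this way is an assumption, and the function names are hypothetical.

```python
# Illustrative only: reproduce the documented degree -> PI_HAT cutoffs.
DOCUMENTED_CUTOFFS = {1: 0.375, 2: 0.1875, 3: 0.09375}


def midpoint_cutoff(degree):
    """Midpoint between the expected PI_HAT of `degree` and of `degree + 1`."""
    expected = 0.5 ** degree
    expected_next = 0.5 ** (degree + 1)
    return (expected + expected_next) / 2


def effective_threshold(rel_threshold=None, degree_rel_cutoff=None):
    """Documented precedence: --degree_rel_cutoff overrides --rel_threshold."""
    if degree_rel_cutoff is not None:
        if degree_rel_cutoff not in DOCUMENTED_CUTOFFS:
            raise ValueError("degree must be 1, 2, or 3")
        return DOCUMENTED_CUTOFFS[degree_rel_cutoff]
    # Assumption: with neither option given, fall back to the documented
    # --rel_threshold default of 0.1; the real default behaviour may differ.
    return 0.1 if rel_threshold is None else rel_threshold


for d in (1, 2, 3):
    print(d, midpoint_cutoff(d), DOCUMENTED_CUTOFFS[d])
```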
- -o DIR, --output_dir DIR
-
Specify path to the output directory for the dataset results. Default is [input_file]_PRIMUS. For example, if your input file is path/input.txt, the default results directory would be
path/input.txt_PRIMUS/
NOTE: Both results from IMUS and Pedigree Reconstruction will be written to this directory. See section on OUTPUT FILES for a description of the directory structures and files.
- --verbose [0|1|2]
-
Set verbosity level (Default = 1). 0: nearly no output except for errors; 1: shows input files, settings, and progress of run; 2: same as "1" plus details about pedigree reconstruction.
- --degree_rel_cutoff [1|2|3]
-
Rather than specifying the PI_HAT cutoff you want, you can specify the maximum degree of relatedness you wish to use to connect a family network. PRIMUS will then convert this into the appropriate PI_HAT cutoff value: 1st degree = 0.375; 2nd degree = 0.1875; and 3rd degree = 0.09375. This option is also used for PRIMUS+ERSA (see below for those details).
PLINK IBD estimation (not included in lite version)
- --file file_stem
-
Path to PLINK formatted ped/map or bed/bim/fam files, but without the file extensions. This option behaves just like PLINK's --file and --bfile options, but here it must be used with the --genome option. For example, if your ped and map files are /usr/data/foo.ped and /usr/data/foo.map, then you could use the following command:
run_PRIMUS.pl --file /usr/data/foo --genome
Check the PLINK documentation (http://pngu.mgh.harvard.edu/~purcell/plink/) for details on the formatting for this file.
- --genome
-
This option tells PRIMUS to read in the specified ped/map or bed/bim/fam files (given with --file) and calculate IBD estimates using the prePRIMUS IBD pipeline.
- --plink_ex
-
Path to the plink executable file (use if plink is not in your PATH environment variable)
- --smartpca_ex
-
Path to EIGENSTRAT's smartpca executable file (use if the smartpca executable is not in your PATH environment variable). Note: EIGENSTRAT is currently (July 2014) only available for Linux machines.
- --ref_pops
-
Comma separated list of HapMap3 populations (ASW,CEU,CHB,CHD,GIH,JPT,LWK,MEX,MKK,TSI,YRI) used for reference allele freqs.
- --no_automatic_IBD
-
PRIMUS automatically selects the HapMap3 population(s) for reference allele freqs. We recommend inspecting the PCA plot to confirm the selection of the reference populations. Add this option to the command line if you would like to have PRIMUS stop after generating the PCA plots so you can manually inspect the PCA plot and then rerun PRIMUS specifying the reference populations with the --ref_pops option.
- --remove_AIMs
-
Force the removal of SNPs associated with principal component vectors 1 and 2 (these are usually associated with ancestry and are ancestry informative markers (AIMs)). The default is to NOT remove AIMs unless the HapMap3 reference populations selected indicate that your data is admixed. By default AIMs are not removed with the options --internal_ref or --alt_ref_stem. Warning: if ancestry is not captured by the most informative principal component vectors, then you will likely remove informative SNPs unnecessarily.
- --keep_AIMs
-
Do not remove ancestry informative markers. The default is to remove AIMs if the HapMap3 reference populations selected indicate that your data is admixed. Use this option to prevent AIM removal in these cases.
- --internal_ref
-
This option tells PRIMUS to use only the dataset provided in --file to get reference allele frequencies rather than trying to find a good reference HapMap3 population. Warning: only use this option if your dataset has a lot of unrelated samples with similar genetic backgrounds.
- --no_PCA_plot
-
PRIMUS will automatically run smartpca (a program from EIGENSTRAT) and make a PCA plot of your data with any reference data provided or with the entire HapMap3 dataset. This is a very useful tool for doing QC on your data. However, smartpca can take a while to run on large datasets. You can use this option to skip running smartpca and generating PCA plots in order to get a faster runtime. Note: EIGENSTRAT is currently (July 2014) only available for Linux machines, so this option is necessary if you are running prePRIMUS on Windows or on a Mac.
- --alt_ref_stem file_stem
-
Path to PLINK formatted ped/map or bed/bim/fam files, but without the file extensions. These files will be merged with an unrelated version of the data provided in --file to obtain reference allele frequencies that will be used to calculate IBD estimates and other QC measures. Caution: prePRIMUS does not run QC on the alternate reference data. Make sure the reference data is cleaned and only has unrelated samples. You can do this using the prePRIMUS.pm subroutines or with other methods.
- --keep_inter_files
-
By default, prePRIMUS will clean up the intermediate files used to generate the IBD estimates and other provided results. If you wish to keep these files, use this option.
Identification of maximum unrelated set options:
- --no_IMUS
-
Add this option to stop PRIMUS from identifying a maximum unrelated set. In the case of large, sparse networks, identifying a maximum unrelated set can take a while. If you are only interested in reconstructing pedigrees, you may prefer to use this option.
- --missing_val NUM
-
Set value that denotes missing data in your trait files (Default = 0).
- -s, --size
-
Specify to weight on size. The order that traits are entered on the command line determines the order that they are applied to weight the unrelated sets. By default, size is always the first weighting criterion unless a binary trait is specified first on the command line.
- --high_btrait FILE
-
Specify path to a file containing white space separated columns for FID, IID, and binary trait for which you wish to weight for the higher of the two values. For automated processing, this file should only have three columns in this order: FID, IID, TRAIT. If there are more columns then the program will ask you to specify which column in the file corresponds to FID, IID, and TRAIT.
In the case that you wish to weight on more than one trait at a time, the order that traits are entered on the command line determines the order that they are applied to weight the unrelated sets. If a binary trait is specified before --size then PRIMUS identifies an unrelated set that contains the largest number of individuals with the desired binary value, which will not necessarily be a maximum unrelated set.
- --low_btrait
-
Same as --high_btrait, but selects for the lower of the two values.
- --high_qtrait
-
Similar to --high_btrait, but the trait value can be a continuous quantitative trait. PRIMUS cannot select first for a qtrait. If a qtrait is specified first on the command line, PRIMUS will first select on size and then on the desired qtrait. As a result, PRIMUS will identify the maximum unrelated set that also has the highest average value for the specified qtrait.
- --low_qtrait
-
Same as --high_qtrait, but selects for the lowest average value for the qtrait.
- --mean_qtrait
-
Same as --high_qtrait, but selects for the maximum unrelated set that has an average value for the qtrait closest to the mean qtrait value of the entire dataset.
- --tails_qtrait
-
Same as --high_qtrait, but selects for the maximum unrelated set that has individual qtrait values furthest from the mean qtrait value of the entire dataset.
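To make the four qtrait modes concrete, the Python sketch below picks one set out of several equally sized candidate unrelated sets. It is a simplified illustration, not PRIMUS's algorithm: the function pick_set and the example data are hypothetical, and scoring "tails" as the mean absolute distance from the dataset mean is an assumption about what "furthest from the mean" means.

```python
# Illustrative only: rank equally sized unrelated sets by a quantitative trait.
from statistics import mean


def pick_set(candidate_sets, trait, mode):
    """candidate_sets: list of lists of sample IDs; trait: dict ID -> value.

    mode is one of "high", "low", "mean", "tails", mirroring the
    --high_qtrait, --low_qtrait, --mean_qtrait, and --tails_qtrait options.
    """
    dataset_mean = mean(trait.values())

    def score(sample_ids):
        values = [trait[s] for s in sample_ids]
        if mode == "high":
            return mean(values)
        if mode == "low":
            return -mean(values)
        if mode == "mean":
            return -abs(mean(values) - dataset_mean)
        if mode == "tails":
            # Assumption: "furthest from the mean" is scored as the
            # mean absolute deviation from the dataset mean.
            return mean([abs(v - dataset_mean) for v in values])
        raise ValueError(f"unknown mode {mode!r}")

    return max(candidate_sets, key=score)


# Made-up trait values for four samples; the dataset mean is 5.0.
trait = {"s1": 1.0, "s2": 5.0, "s3": 9.0, "s4": 5.0}
sets = [["s1", "s3"], ["s2", "s4"], ["s3", "s4"]]
print(pick_set(sets, trait, "high"))
print(pick_set(sets, trait, "tails"))
```

All candidate sets have the same size here because, as described above, PRIMUS selects on size before it applies a qtrait.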
Pedigree reconstruction options:
- --no_PR
-
Adding this option will turn off pedigree reconstruction. For some family networks, it takes a while to reconstruct all possible pedigrees. If all you desire is to identify a maximum unrelated set, then you may wish to use this option.
- --max_gens NUM
-
Set max number of sampled generations allowed in reconstructed pedigrees (default = no limit). The number of sampled generations in a pedigree is the number of generations from which the sample data had to be collected. For example, if the pedigree consists of two parents and two children, then there are two generations. Alternatively, if the pedigree is just the two siblings, then there is only one generation because the data had to be collected from a single generation. Finally, if you have a great-grandparent and a great-grandchild but are missing the samples in-between, the pedigree still has 4 generations because sample data must have been collected from individuals spanning 4 generations.
In the case that you know your data was collected from a single time point, you can be confident that you will not have sampled individuals from more than 3-4 generations. So you may want to use the option --max_gens 4 or --max_gens 5.
This option can be beneficial for two reasons: (1) it reduces the number of possible pedigrees reconstructed, and (2) it speeds up the program runtime because a smaller pedigree search space is explored.
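The three examples above can be checked with a small Python sketch that counts sampled generations as the span between the oldest and youngest generation that has a sample. The integer generation labels and the function name are assumptions made for this illustration.

```python
# Illustrative only: count "sampled generations" as the span of generations covered.
def sampled_generations(generation_of):
    """generation_of: dict sample ID -> generation index (0 = oldest)."""
    gens = generation_of.values()
    return max(gens) - min(gens) + 1


# Two parents and two children: two generations.
print(sampled_generations({"father": 0, "mother": 0, "child1": 1, "child2": 1}))
# Two siblings only: one generation.
print(sampled_generations({"sib1": 0, "sib2": 0}))
# Great-grandparent and great-grandchild with nobody sampled in between: four.
print(sampled_generations({"great_grandparent": 0, "great_grandchild": 3}))
```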
- --max_gen_gap NUM
-
Set max number of generations between two people that have a child (default = 0). This option only matters if the pedigree has cycles due to inbreeding or other complex relationships. For example, a widowed mother had a child with a man named Tom, who died tragically. She then marries Tom's cousin, Jerry and has another child. Although the two children are not inbred, they do have a complex relationship: half-sibling + second-cousin. By having children with two cousins, the mother created a cycle in the pedigree that is non-trivial to resolve. PRIMUS can resolve these pedigrees but to test and reconstruct all possible cycles is computationally intensive. To limit the number of tests for cycles, PRIMUS uses the --max_gen_gap option. In the example provided above, the two cousins are in the same generation, so the generation gap between them is 0. However, let's say that the mother had children with an uncle/nephew pair, then the generation gap between those two would be 1. To reconstruct this second example, --max_gen_gap would need to be set to 1.
For most pedigrees the option default of zero will be perfect. However, it may need to be increased for pedigrees that are expected to have cycles due to inbreeding or other complex relationships where individuals from different generations mate.
- --int_likelihood_cutoff NUM
-
Set the initial minimum likelihood to consider a relationship during reconstruction, must be between 0.001 and 1 (default = 0.1). PRIMUS uses a Kernel Density Estimation function to calculate the likelihood that a given set of IBD estimates corresponds to each relationship category. PRIMUS uses the minimum likelihood cutoff to determine which relationship categories it considers during reconstruction. This is the "initial" likelihood cutoff because it is used for the first attempt at reconstructing the pedigree. If no pedigree is reconstructed, PRIMUS will automatically drop the minimum likelihood cutoff by an order of magnitude and attempt to reconstruct again. Once the minimum likelihood cutoff drops below 0.001, PRIMUS will stop attempting to reconstruct the pedigree.
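The retry behaviour can be pictured as a schedule of cutoffs, each ten times smaller than the last. The Python sketch below lists that schedule; it assumes that 0.001 itself is still attempted and that only values below it are skipped, which is one reading of the description above. The function name is hypothetical.

```python
# Illustrative only: the documented cutoff schedule, using Decimal to avoid
# floating-point drift when dividing by 10 repeatedly.
from decimal import Decimal


def cutoff_schedule(initial="0.1", floor="0.001"):
    """Yield the likelihood cutoffs tried, from `initial` down to `floor`."""
    cutoff, floor = Decimal(initial), Decimal(floor)
    if not (floor <= cutoff <= 1):
        raise ValueError("initial cutoff must be between 0.001 and 1")
    while cutoff >= floor:
        yield cutoff
        cutoff /= 10  # drop by an order of magnitude after a failed attempt


print([str(c) for c in cutoff_schedule()])
```

In practice the loop stops as soon as a pedigree is reconstructed, so the later cutoffs are only reached when the earlier attempts fail.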
- --age_file FILE
-
Specify the path to the file containing the age of each sample. This information is used to flag possible pedigrees that are inconsistent with the input ages. It is also included after the individual's name in the pedigree image. The file must be separated into white-space delimited columns. It must have at least three columns and the first three columns must be in this order: FID, IID, and AGE. The FIDs and IIDs must be consistent in all input files. If your columns are not in this order, then use the --ages option. You may have a header in this file, but it is not necessary. A missing value for age needs to be a negative value or "NA". The program will treat the age of anyone not in the age file as missing.
- --ages FILE=FILE [options]
-
Specify the path and columns to the file containing the age of each sample. This information is used to flag possible pedigrees that are inconsistent with the input ages. The age is also included after the individual's name in the pedigree image. A missing value for age needs to be a negative value or "NA". The program will treat the age of anyone not in the age file as missing. The file must contain a column for FID, IID, and AGE, but the order does not matter. If the column number differs from the default (FID(1), IID(2), AGE(3)) then you must specify the column number. For example, let's say your file path_to_file/age.txt has 3 columns: AGE(1), IID(2), and FID(3); then the --ages option would need to be:
--ages FILE=path_to_file/age.txt FID=3 AGE=1
Notice I did not have to specify the column number for IID because its column still matches the default. Note: The following commands are equivalent:
--age_file path/file.txt
--ages FILE=path/file.txt FID=1 IID=2 AGE=3
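The column overrides and missing-value rules for ages can be sketched in Python as follows. This is an illustration rather than PRIMUS's reader: the function names are hypothetical, and treating a row with a non-numeric AGE as an optional header row is an assumption.

```python
# Illustrative only: read FID, IID, AGE columns and normalise missing ages.
def parse_age(value):
    """Return the age as a float, or None when it is missing ("NA" or negative)."""
    if value == "NA":
        return None
    age = float(value)
    return None if age < 0 else age


def read_ages(lines, fid=1, iid=2, age=3):
    """lines: iterable of whitespace-delimited rows; column numbers are 1-based."""
    ages = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            ages[(fields[fid - 1], fields[iid - 1])] = parse_age(fields[age - 1])
        except ValueError:
            continue  # assumption: a non-numeric AGE marks an optional header row
    return ages


# Made-up rows laid out as AGE(1), IID(2), FID(3), as in the --ages example.
rows = ["AGE IID FID", "42 ind1 fam1", "NA ind2 fam1", "-9 ind3 fam1"]
print(read_ages(rows, fid=3, age=1))
```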
- --sex_file FILE
-
Specify path to the file containing the sex of each sample. This information is used to test that a pedigree is compatible with known sex information. It is also used to correctly draw the individual's symbol on the pedigree image. The file must be separated into white-space delimited columns. It must have at least three columns and the first three columns must be in this order: FID, IID, and SEX. The FIDs and IIDs must be consistent in all input files. If your columns are not in this order, then use the --sexes option. You may have a header in this file, but it is not necessary. The default value for male is 1 and female is 2. Use 0 for unknown sex.
- --sexes FILE=FILE [options]
-
Specify the path and columns to the file containing the sex of each sample. This information is used to test that a pedigree is compatible with genetically derived sex information. It is also used to draw the correct symbol on the pedigree image. The file must contain a column for FID, IID, and SEX, but the order does not matter. If the column number differs from the default (FID(1), IID(2), SEX(3)) then you must specify the column number. For example, let's say your file path_to_file/sex.txt has 3 columns: FID(1), SEX(2), and IID(3); then your --sexes option would need to be:
--sexes FILE=path_to_file/sex.txt IID=3 SEX=2
Notice I did not have to specify the column number for FID because its column still matches the default. Note: The following commands are equivalent:
--sex_file path/file.txt
--sexes FILE=path/file.txt FID=1 IID=2 SEX=3
The default value for male is "1" and female is "2". Use 0 for unknown sex. To specify different values for male and female, use the MALE=<val> and FEMALE=<val> settings. For example, if file.txt has the sexes of the individuals in the 4th column but they are labeled "male" and "female", you should use the command
--sexes FILE=path/file.txt SEX=4 MALE=male FEMALE=female
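The effect of MALE=<val> and FEMALE=<val> can be pictured as a relabeling onto the default codes. The Python sketch below is a hypothetical illustration; in particular, mapping every unrecognized label to 0 (unknown) is an assumption of this sketch.

```python
# Illustrative only: map user-specified sex labels onto the default codes.
def normalise_sex(value, male="1", female="2"):
    """Return 1 for male, 2 for female, and 0 for anything else (unknown)."""
    if value == male:
        return 1
    if value == female:
        return 2
    return 0


print(normalise_sex("2"))                                      # default coding
print(normalise_sex("female", male="male", female="female"))   # custom labels
print(normalise_sex("0"))                                      # unknown sex
```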
- --affection_file FILE
-
Specify the path to the file containing the affection status of each sample. This information is only used to label affected individuals on the pedigree images. The default value for affection status is affected=2. The file must also be separated into white space delimited columns. It must have at least three columns and the first three columns must be in this order: FID, IID, and AFFECTION_STATUS. The FIDs and IIDs must be consistent in all input files. If your columns are not in this order, then use the --affections option. You may have a header in this file, but it is not necessary.
- --affections FILE=FILE [options]
-
Specify the path and columns to the file containing the affection status of each sample. This information is only used to label affected individuals on the pedigree images. The file must contain a column for FID, IID, and AFFECTION_STATUS, but the order does not matter. If the column number differs from the default order (FID(1), IID(2), AFFECTION_STATUS(3)) then you must specify the column number. For example, let's say your file path_to_file/a_status.txt has 4 columns: FID(1), IID(2), SEX(3), AFFECTION_STATUS(4), and the values are 0=unaffected and 1=affected. Then your --affections option would need to be:
--affections FILE=path_to_file/a_status.txt AFFECTION=4 AFFECTION_VALUE=1
Notice I did not have to specify the column number for FID or IID because their columns still match the defaults. Note: The following commands are equivalent:
--affection_file path/file.txt
--affections FILE=path/file.txt FID=1 IID=2 AFFECTION=3 AFFECTION_VALUE=2
- --mito_matches | --mito FILE=FILE [options]
-
Specify the path and columns to the file containing the mitochondrial matching status for each pair of samples. This information is used to eliminate the possible pedigrees that are not consistent with the mitochondrial data. By default, PRIMUS only uses non-matches because matches can often occur by chance in a population. The file must contain a column for FID1, IID1, FID2, IID2, and MATCHING_STATUS, but the order does not matter. If the column number differs from the default order (FID1(1), IID1(2), FID2(3), IID2(4), MATCHING_STATUS(5)) then you must specify the column number. For example, let's say your file path_to_file/matching_status.txt has 7 columns: FID1(1), IID1(2), FID2(3), IID2(4), SEX(5), AFFECTION_STATUS(6), MITO_MATCHING_STATUS(7), and the values are 0=non-match and 1=match. Then your --mito_matches option would need to be:
--mito_matches FILE=path_to_file/matching_status.txt MATCH=7 MATCH_VALUE=1
Notice I did not have to specify the column number for FID1, IID1, FID2 and IID2 because their columns still match the defaults. An unknown match can be specified with a -1.
- --MT_error_rate NUM
-
Specify a number between 0 and 1 as the proportion of the MT sequence that needs to be different between two individuals in order to be called a non-match. Default is 0.01; however, for more error-prone data, this will be too low. If pedigree reconstructions are not outputting a possible pedigree due to FAILED MITO CHECK, then you could be rejecting the true pedigree because of poor calling of MT SNPs. It is recommended to provide a higher cutoff if you are having this problem, such as 0.02 or 0.05.
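The idea behind the error rate can be illustrated with a small Python sketch that calls a pair a non-match only when the fraction of differing sites exceeds the cutoff. How PRIMUS applies the rate internally is not described here, so everything in the sketch is an assumption: the function names, the use of "N" for a missing call, and the strict "greater than" comparison.

```python
# Illustrative only: call a match or non-match from the fraction of differing sites.
def mismatch_fraction(seq_a, seq_b):
    """Fraction of positions that differ, ignoring positions missing ("N") in either."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "N" and b != "N"]
    if not compared:
        return None  # nothing to compare: status unknown
    return sum(a != b for a, b in compared) / len(compared)


def matching_status(seq_a, seq_b, error_rate=0.01):
    """Return 1 (match), 0 (non-match), or -1 (unknown), as in the matches file."""
    fraction = mismatch_fraction(seq_a, seq_b)
    if fraction is None:
        return -1
    # Assumption: a pair is a non-match only when the fraction exceeds the rate.
    return 0 if fraction > error_rate else 1


# Made-up sequences: 3 differing sites out of 100.
seq_a = "A" * 100
seq_b = "C" * 3 + "A" * 97
print(matching_status(seq_a, seq_b))                    # default rate of 0.01
print(matching_status(seq_a, seq_b, error_rate=0.05))   # more tolerant rate
```

With noisy calls, a pair that shares a maternal lineage can exceed the default rate by chance, which is the situation a higher cutoff is meant to address.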
- --y_matches | --y FILE=FILE [options]
-
Specify the path and columns to the file containing the Y chromosome matching status for each pair of samples. This information is used to eliminate the possible pedigrees that are not consistent with the Y chromosome data. By default, PRIMUS only uses non-matches because matches can often occur by chance in a population. The file must contain a column for FID1, IID1, FID2, IID2, and MATCHING_STATUS, but the order does not matter. If the column number differs from the default order (FID1(1), IID1(2), FID2(3), IID2(4), MATCHING_STATUS(5)) then you must specify the column number. For example, let's say your file path_to_file/matching_status.txt has 7 columns: FID1(1), IID1(2), FID2(3), IID2(4), SEX(5), AFFECTION_STATUS(6), Y_MATCHING_STATUS(7), and the values are 0=non-match and 1=match. Then your --y_matches option would need to be:
--y_matches FILE=path_to_file/matching_status.txt MATCH=7 MATCH_VALUE=1
Notice I did not have to specify the column number for FID1, IID1, FID2 and IID2 because their columns still match the defaults. An unknown match can be specified with a -1.
- --Y_error_rate NUM
-
Specify a number between 0 and 1 as the proportion of the Y sequence that needs to be different between two individuals in order to be called a non-match. Default is 0.01; however, for more error-prone data, this will be too low. If pedigree reconstructions are not outputting a possible pedigree due to FAILED Y CHECK, then you could be rejecting the true pedigree because of poor calling of Y SNPs. It is recommended to provide a higher cutoff if you are having this problem, such as 0.02 or 0.05.
PRIMUS+ERSA options:
- OVERVIEW
-
PRIMUS+ERSA is an algorithm that uses the pairwise relationship estimate likelihoods and results generated by ERSA to connect the pedigrees reconstructed by PRIMUS. This tool will generate two output files. One will contain the network numbers, pedigree numbers, and names of the founders in each pedigree that are most likely to be connected to the other pedigree at the degree of relationship specified in the final column. There will be one line per pair of networks. Individuals who are unrelated at the relatedness threshold specified are treated as their own network. The other results file provides the same information but for every related pair of genotyped individuals in all the networks.
- --ersa_model_output path/file
-
ERSA will generate this file when the --model_output_file path/file option is specified on the command line when running ERSA. For each pair of individuals in the dataset, it will contain the likelihood that they are related as 1st through 40th degree relatives sharing 0, 1, or 2 parents in common at each degree. PRIMUS uses these likelihoods and the likelihood that they are unrelated (obtained from the file specified with the --ersa_results option) to find the most likely way each family network is connected.
- --ersa_results path/file
-
ERSA's main output file (specified with the --output_file option) needs to be provided. PRIMUS + ERSA uses the likelihood that a pair of individuals are unrelated in its algorithm. Provide the path to the file here.
- --project_summary path/Summary_*.genome.txt
-
The project level summary file is in the main output directory, which we call the project level directory (default is *_PRIMUS). In the main output directory, there are two summary files. One is a summary containing the pairwise relationship table (*_pairwise_table.txt), and this is NOT the file you want to provide with this option. You want to provide the other summary file.
Note: you can run PRIMUS+ERSA in the same run as the pedigree reconstructions, in which case you do not need to include this option.
- --degree_rel_cutoff [1|2|3]
-
If you do not reconstruct the pedigrees in the same run as PRIMUS+ERSA, then you need to specify what degree of relatedness you used as your cutoff during reconstruction. Currently, there is no way for PRIMUS+ERSA to know if you used something other than the default relationship cutoff (default is 3), so you must specify if you used a different degree.
Alternatively, if you reconstructed using the --threshold or -t option, then you can use the same option and value you used during reconstruction to correctly run PRIMUS+ERSA.
DEFINITIONS
- DIR
-
A full path or a relative path to a directory.
- FILE
-
A full path or a relative path to an input file.
- file_stem
-
A full path or a relative path to an input file, but without including the file extensions (e.g. the file stem for "foo_bar.ped" is "foo_bar").
- NUM
-
A floating point number.
OUTPUT FILES
Identification of a Maximum Unrelated Set
- Output .dot file
-
This file is used to visualize the family network as an undirected graph. The .dot files can be read into graph visualization software like Graphviz (http://www.graphviz.org). Each node is an individual from the dataset and each edge is a pairwise relationship above the specified relatedness threshold.
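If you have not worked with DOT before, the sketch below shows the general shape of an undirected graph built from pairwise relatedness values. It illustrates the format only and is not the exact text PRIMUS writes: the function name, the graph name, the sample pairs, and treating a value equal to the threshold as related are assumptions.

```python
def pairs_to_dot(pairs, threshold=0.1, name="network"):
    """Build undirected DOT text with one edge per pair at or above the threshold."""
    lines = [f"graph {name} {{"]
    for sample_a, sample_b, relatedness in pairs:
        # Assumption: a value equal to the threshold still counts as related.
        if relatedness >= threshold:
            lines.append(f'    "{sample_a}" -- "{sample_b}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical pairwise relatedness estimates.
pairs = [
    ("id1", "id2", 0.50),
    ("id1", "id3", 0.25),
    ("id2", "id4", 0.02),  # below the threshold, so no edge is drawn
]
dot_text = pairs_to_dot(pairs)
print(dot_text)
```

Text in this form can be saved to a file and rendered with Graphviz, for example with `dot -Tpng network.dot -o network.png`.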
- Networks file
-
The (your_IBD_input_file)_networks file in the results directory lists all pairwise relationships above the relatedness threshold, grouped by family network. The data in this file are exactly the same as in your IBD input file, except that PRIMUS adds a first column containing the network number assigned to each family network. These numbers are important because they match the network numbers in all the other result files, making this file useful as a look-up reference.
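Because the network number is the first column, the networks file is easy to regroup in a script. The following is a rough sketch that assumes a whitespace-separated file whose data rows start with an integer network number. The header names and values in the sample are made up, and the real file carries whatever columns your IBD input had.

```python
import io
from collections import defaultdict


def group_by_network(handle):
    """Map each network number to the list of remaining columns for its pairs."""
    networks = defaultdict(list)
    for line in handle:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip blank lines and any header row
        networks[int(fields[0])].append(fields[1:])
    return dict(networks)


# Hypothetical file contents; a real file would be opened with open(path).
sample = io.StringIO(
    "Network FID1 IID1 FID2 IID2 PI_HAT\n"
    "1 fam1 id1 fam1 id2 0.50\n"
    "1 fam1 id1 fam1 id3 0.25\n"
    "2 fam2 id7 fam2 id8 0.51\n"
)
networks = group_by_network(sample)
for number, pairs in sorted(networks.items()):
    print(f"network {number}: {len(pairs)} related pair(s)")
```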
-
The maximum unrelated set file is in the main output directory as (your_IBD_input_file)_maximum_independent_set, and this is the largest unrelated set. There are 3 other related files: (your_IBD_input_file)_maximum_independent_set_PRIMUS/_KING/_PLINK. These are the results for all three algorithms used in PRIMUS for maximum unrelated set identification.
- network IBD files
-
These IBD files exist for each family network and match the data from the IBD input file. If you also ran pedigree reconstruction, then these files will be in the respective network directories.
-
This file contains the FID and IID of every sample in the dataset whose relatedness to all other samples is below the user-defined relatedness threshold.
Pedigree Reconstruction
- Directory Structure
-
In this documentation, the main results directory provided or the default *_PRIMUS/ directory is referred to as the dataset directory. This will contain the _networks file, the .dot files, and the network directory for each family network. The network directory contains all the results for the pedigree reconstruction of each family network. The network directory contains:
network_IBD The IBD estimates among only the samples in this family network
*.config The settings used to run Cranefoot
*.cranefoot.fam Cranefoot's required input file format
*.fam The six column file: FID, IID, FATHER, MOTHER, SEX, and AFFECTION_STATUS
*.ps A PostScript image of the pedigree drawn with Cranefoot
Summary_*_pairwise_table.txt Summary file describing all possible relationships between each pair of individuals based on all possible reconstructed pedigrees
Summary_*.txt Summary file of the pedigree reconstruction of the family network
- .fam files ([IBD_FILE]_network#_#.fam)
-
This file format is commonly used with the program PLINK. It contains the information necessary to reconstruct a pedigree. PRIMUS fills gaps in the pedigree with missing individuals, and those are also represented in this file. There are six tab-separated columns: FID, IID, FATHER, MOTHER, SEX, and AFFECTION_STATUS. The FID matches the network number and the IID is the merge of the original FID and IID read in from the IBD estimates file.
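The six-column layout can be read with a few lines of code. This is a minimal sketch, not part of PRIMUS. The sample rows are invented, the "fam1_id1" style of merged IID is a guess at what the merge looks like, "0" as the unknown-parent code follows PLINK's convention, and the "Missing" prefix for added individuals is assumed from the "Missing#" placeholder described in the QUICK START.

```python
from typing import NamedTuple


class FamRow(NamedTuple):
    fid: str
    iid: str
    father: str
    mother: str
    sex: str
    affection_status: str


def parse_fam(text):
    """Parse tab-separated six-column .fam text into FamRow records."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"line {line_number}: expected 6 columns, got {len(fields)}")
        rows.append(FamRow(*fields))
    return rows


# Hypothetical reconstructed pedigree: one genotyped father, one added mother, one child.
sample = (
    "1\tfam1_id1\t0\t0\t1\t1\n"
    "1\tMissing1\t0\t0\t2\t0\n"
    "1\tfam1_id3\tfam1_id1\tMissing1\t1\t2\n"
)
rows = parse_fam(sample)
# Assumption: "0" in both parent columns marks a founder, as in PLINK.
founders = [row.iid for row in rows if row.father == "0" and row.mother == "0"]
# Assumption: individuals added by PRIMUS have IIDs starting with "Missing".
added = [row.iid for row in rows if row.iid.startswith("Missing")]
print(founders, added)
```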
- Pedigree image files ([IBD_FILE]_network#_#.ps)
-
This is a PostScript image file of the pedigree generated by Cranefoot. It can be opened by any program capable of reading PostScript files. The drawn pedigrees follow general pedigree drawing practices: male = square, female = circle, 3/4 shading = affected, and diagonal lines mean that the individual is missing from the input dataset but needed to draw the complete pedigree. If ages are provided, they are placed after the individual's name. Cranefoot is unique in that it represents complex or difficult-to-draw pedigrees by separating out branches of the pedigree and drawing the same individual more than once. You can identify the individuals drawn more than once by the colored, arching lines connecting them. Feel free to read the data into your favorite pedigree drawing software if Cranefoot does not fit your taste.
- Network summary files (Summary_[IBD_FILE]_network#.txt)
-
The network summary file contains basic summary statistics for the entire network at the top of the file, and each line is preceded with "## ". These statistics include the network name, number of pedigrees that fit the data, the number of pedigrees that are not flagged due to incompatibilities with provided ages, the number of samples in the network, the score range statistics of all the possible pedigrees, the number of pedigrees with the highest score, and the sample names in the network.
The file then has information for each reconstructed pedigree: PRIMUS-assigned pedigree id number, number of missing (dummy) individuals added to the pedigree to complete it, the number of generations that had to be sampled given the pedigree structure, the relative scoring of the pedigree based on how well the relationships in the pedigree fit the input data, and the pairwise relationships within this pedigree that contradict the provided ages. If no ages are provided or if there are no incompatibilities, the last column will be empty. The score column will be between 0 and 1 and is the average likelihood of each relationship used in the formation of the pedigree calculated from PRIMUS's KDE.
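If you want to pull the best-scoring pedigree out of a network summary file in a script, something like the sketch below may help. It relies only on the "## " prefix described above. Everything else is an assumption you should check against your own files: that the pedigree rows are tab-separated, that they have no column header line, that the columns appear in the order listed above (so the score is the fourth column), and the statistic labels and age-flag text in the sample.

```python
def split_summary(text):
    """Separate '## ' statistic lines from the per-pedigree rows."""
    stats = []
    pedigrees = []
    for line in text.splitlines():
        if line.startswith("## "):
            stats.append(line[3:].strip())
        elif line.strip():
            pedigrees.append(line.split("\t"))
    return stats, pedigrees


# Hypothetical summary contents; labels and values are placeholders.
sample = (
    "## Network_name: example_network1\n"
    "## Num_pedigrees: 2\n"
    "1\t0\t3\t0.91\t\n"
    "2\t1\t3\t0.74\tid3_vs_id9\n"
)
stats, pedigrees = split_summary(sample)

SCORE_COLUMN = 3  # assumption: zero-based index of the score column
best = max(pedigrees, key=lambda row: float(row[SCORE_COLUMN]))
print("statistics:", stats)
print("best pedigree id:", best[0], "score:", best[SCORE_COLUMN])
```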
- Dataset summary files (Summary_[IBD_FILE].txt)
-
The Dataset summary file first provides the dataset name and the number of family networks in the dataset. Next, it contains the network summary statistics (those preceded by "## " in the network summary file) from each family network in the dataset.
- Pairwise relationship summary file (Summary_[IBD_FILE]_pairwise_table.txt)
-
This whitespace-separated file summarizes the possible pairwise relationships between all the samples in the dataset based on the possible reconstructed pedigrees.
QUICK START
There are several example datasets in the PRIMUS_v*/example_data/ directory. Here I will show you how to use several of the more common functionalities and options of PRIMUS. Run each example from the bin directory or adjust the file paths accordingly. Example 5 only works with the full version of PRIMUS:
Example 1. Read in PLINK's .genome file
./run_PRIMUS.pl --plink ../example_data/complete.genome
This command will run both parts of PRIMUS on the IBD estimates in complete.genome, and the results should be in ../example_data/complete.genome_PRIMUS/. These data make up a single family network of 12 individuals. The resulting pedigree should not have any missing individuals. The maximum unrelated set should contain 6 individuals.
Example 2. Read in sex and affection status information
./run_PRIMUS.pl --plink ../example_data/complete.genome --sex_file ../example_data/complete.sex --affections FILE=../example_data/complete.fam AFFECTION=6 AFFECTION_VALUE=1
The --sex_file option is straightforward because we just pass the complete.sex file, which matches the default 3-column format (FID = column 1; IID = column 2; SEX = column 3). Maybe you noticed that this complete.sex file does not have a file header; don't worry, it doesn't matter if these files have headers or not. Alternatively, you could have used --sexes FILE=../example_data/complete.sex SEX=3, which is equivalent to the command used above.
The affection option is a little more complicated, because the complete.fam file does not have the affection status in the default third column, nor is the affection status value the default 2. We cannot use the --affection_file option; rather, we have to use the --affections option and specify the columns. To do so, you must first specify FILE=../example_data/complete.fam. Next, you must specify the affection status column (the 6th column of the complete.fam file) using AFFECTION=6. Finally, affection status in this file is specified with the number 1, so AFFECTION_VALUE=1 does the trick. The results from this run likely overwrote the results from Example 1. The difference will be that the resulting pedigree images will include the correct sex symbol for all individuals as well as shading for the affected individuals, and the reconstructed .fam files will also include sex and affection status in the 5th and 6th columns, respectively.
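To see what the KEY=value arguments select, here is a small illustration of the same idea: pick a 1-based column out of a whitespace-separated file, keyed by FID and IID. It is a sketch of the concept rather than how PRIMUS parses its options. The function names and the sample rows are hypothetical, and a header row, if present, would simply become an entry that matches no real sample.

```python
import io


def parse_spec(tokens):
    """Turn ['FILE=x', 'AFFECTION=6'] into {'FILE': 'x', 'AFFECTION': '6'}."""
    return dict(token.split("=", 1) for token in tokens)


def read_column(handle, column, fid_col=1, iid_col=2):
    """Return {(FID, IID): value} using 1-based column numbers."""
    values = {}
    for line in handle:
        fields = line.split()
        if len(fields) < max(column, fid_col, iid_col):
            continue  # skip blank or short lines
        values[(fields[fid_col - 1], fields[iid_col - 1])] = fields[column - 1]
    return values


spec = parse_spec(["FILE=../example_data/complete.fam", "AFFECTION=6", "AFFECTION_VALUE=1"])

# Hypothetical stand-in for the file named in spec["FILE"].
fam_file = io.StringIO(
    "fam1 id1 0 0 1 1\n"
    "fam1 id2 0 0 2 2\n"
    "fam1 id3 id1 id2 1 1\n"
)
status = read_column(fam_file, column=int(spec["AFFECTION"]))
affected = sorted(key for key, value in status.items() if value == spec["AFFECTION_VALUE"])
print(affected)
```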
Example 3. Only reconstruct pedigrees for an incomplete dataset
./run_PRIMUS.pl --plink ../example_data/incomplete.genome --sexes FILE=../example_data/complete.fam SEX=5 --affections FILE=../example_data/complete.fam AFFECTION=6 AFFECTION_VALUE=1 --no_IMUS
This run is very similar to Example 2, except that it now calls incomplete.genome and uses the sex column from the complete.fam file instead of the complete.sex file. The incomplete.genome file is the same family as complete.genome except with 5 individuals removed (id2, id5, id7, id8, and id9). It is OK that the sex and affection status files contain all individuals; the ones not included in the incomplete.genome file will be ignored. The only addition to the command line is the --no_IMUS option, which will only run pedigree reconstruction and not identify a maximum unrelated set.
This particular run of PRIMUS produces a single pedigree that looks identical to the complete pedigree in Examples 1 and 2, except that it fills in the missing individuals with a "Missing#" placeholder. However, if you use --int_likelihood_cutoff 0.08 instead of the default of 0.1, PRIMUS will reconstruct three possible pedigrees. The main difference between the three possible pedigrees is that ID11 and ID12 are unrelated in pedigree0 and are 3rd degree relatives in the other two pedigrees. The likelihood vector results show that ID11 and ID12 are more likely to be more distantly related than 3rd degree relatives, so that pedigree has a higher score in the summary file than the other two pedigrees. Because we dropped --int_likelihood_cutoff below the likelihood that ID11 and ID12 are cousins, PRIMUS attempts to reconstruct the pedigree with ID11 and ID12 set both as distantly related and as 3rd degree relatives.
Example 4. Reconstruct a HapMap3/1K genomes MEX family using fake ages that produced several possible pedigrees
./run_PRIMUS.pl --plink ../example_data/1K_genomes_MEX_family.genome --sex_file ../example_data/1K_genomes_MEX_family.features --ages FILE=../example_data/1K_genomes_MEX_family.features AGE=4
This dataset includes 3 trios and another individual not reported to be related within the original HapMap3 and 1K genomes releases. It reconstructs to 4 possible pedigrees. I fabricated ages for the samples to illustrate the effectiveness of PRIMUS at flagging pedigrees that do not match the supplied ages. The file PRIMUS_v*/example_data/1K_genomes_MEX_family.genome_PRIMUS/1K_genomes_MEX_family.genome_network1/Summary_1K_genomes_MEX_family.genome_network1.txt summarizes the pedigree results.
Example 5. Calculate IBD estimates using the PLINK IBD pipeline, and then reconstruct the pedigrees (Does not run with lite version of PRIMUS)
./run_PRIMUS.pl --file ../example_data/MEX_pop --genome --sexes FILE=../example_data/MEX_pop.fam SEX=5
NOTE: This run requires that you have PLINK, R, and EIGENSTRAT (specifically smartpca) installed on your machine and in your PATH environment variable. If PLINK and smartpca are installed but are not in your path, then you will need to use the --plink_ex PATH/TO/EXECUTABLE/plink and --smartpca_ex PATH/TO/EXECUTABLE/smartpca options.
If you do not specify a reference population, then PRIMUS will automatically select the best HapMap3 reference population for your data. If you would like PRIMUS not to select these populations automatically, then use the --no_automatic_IBD, --internal_ref, or --alt_ref options.
This dataset is the same as in Example 4 above. The file PRIMUS_v*/example_data/1K_genomes_MEX_family.genome_PRIMUS/1K_genomes_MEX_family.genome_network1/Summary_1K_genomes_MEX_family.genome_network1.txt contains the summary of the pedigree results.