hgdp2sweep.pl

Command line: 
perl hgdp2sweep.pl -infile_name InputTest.txt -outfile_name InputTest_out

This program converts a genotype file to the input format for PHASE, runs PHASE, and then creates the input files needed to run Sweep.
!!! The current version of Sweep accepts coordinates up to Build 35 (hg17) only, not Build 36 !!!
The script removes the children from the HapMap samples and therefore needs the children_sid file.

To run the program you need: 
1) Input file with genotype data
2) PHASE program installed in your directory (e.g. /nfs/team19/by1/phd/phase.2.1.1.linux)
3) A Linux machine to run the program on (e.g. ssh farm-login)


1) The input file (e.g. InputTest.txt) should be space-delimited and look like this
(columns: SNP id, chromosome, position, then one genotype column per sample; use rs#, chrom and pos as the first 3 column names)
InputTest.txt:
rs# chrom pos HGDP00448 HGDP00479 HGDP00985 HGDP01094 HGDP00982 HGDP00911 HGDP01202 HGDP00927 HGDP00461 HGDP00451
rs8062140 chr16 22316964 TT TT TT TT TC TT TT TT TT TT
rs12446433 chr16 22331199 GG GG GG GG GG GG GG GG GG GG
rs2887622 chr16 22581356 TC TC TT TC TT TC TT TC TT TC
rs154537 chr16 22620480 GG GG GG GG GG GG GG GG GG TG
rs31966 chr16 22627867 GG GG GG GG GG GG GG GG GG GG
rs12920317 chr16 22629809 TC TC CC CC CC TC CC CC TC CC
rs7193569 chr16 22635346 GG GG GG GG AG GG GG GG GG AG
rs8045314 chr16 22638277 AA AA AG AG AA AA GG AA AA AA
rs12920706 chr16 22638484 TC TC CC CC TC TC CC TC TC CC
rs12599831 chr16 22640779 AG AG AG AG GG AG AA GG AG AG
rs9927141 chr16 22643850 CC TC CC CC TC TC CC TT CC CC
rs428570 chr16 22644799 TT TT TG TT TG TT TG TT TT TT
rs687852 chr16 22646184 AA AG AA AG AA AG AG AG AA AG
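
A quick way to check a file against the format above before starting a long run is sketched here in Python. This is not part of hgdp2sweep.pl; the function name check_genotype_table is made up for illustration, and the rule that every genotype is exactly two characters is an assumption taken from the example rows (how missing genotypes should be coded is not covered here).

```python
import io

REQUIRED_HEADER = ["rs#", "chrom", "pos"]

def check_genotype_table(handle):
    """Return a list of problems found in a space-delimited genotype table."""
    problems = []
    header = handle.readline().split()
    if header[:3] != REQUIRED_HEADER:
        problems.append("header must start with: rs# chrom pos")
    n_cols = len(header)
    for line_no, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != n_cols:
            problems.append(
                f"line {line_no}: expected {n_cols} columns, found {len(fields)}"
            )
            continue
        if not fields[2].isdigit():
            problems.append(f"line {line_no}: position is not an integer")
        # Assumption: genotypes are two allele characters, as in the example.
        for sample, genotype in zip(header[3:], fields[3:]):
            if len(genotype) != 2:
                problems.append(
                    f"line {line_no}: {sample} genotype '{genotype}' "
                    "is not two characters"
                )
    return problems

# Two samples and two SNPs taken from InputTest.txt above.
example = io.StringIO(
    "rs# chrom pos HGDP00448 HGDP00479\n"
    "rs8062140 chr16 22316964 TT TT\n"
    "rs2887622 chr16 22581356 TC TC\n"
)
print(check_genotype_table(example))  # an empty list means no problems found
```

For a real file, pass an open file handle instead of the StringIO example, e.g. check_genotype_table(open("InputTest.txt")).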

2) Put the input file in the same directory as hgdp2sweep.pl
   Try the test file to see if it runs properly!
   --> should get 10 output files
	1 PHASE input file created by hgdp2sweep.pl
	#InputTest.txt.phase_inp
	7 files created by the PHASE program
	#InputTest.txt.phase_out 
	#InputTest.txt.phase_out_freqs 
	#InputTest.txt.phase_out_hbg 
	#InputTest.txt.phase_out_monitor 
	#InputTest.txt.phase_out_pairs 
	#InputTest.txt.phase_out_probs
	#InputTest.txt.phase_out_recom
	2 files created by hgdp2sweep.pl to run Sweep
	#InputTest_out.emphase
	#InputTest_out.snp
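
To confirm that a test run produced all 10 files, the names listed above can be checked with a short Python sketch. The helper names expected_outputs and missing_outputs are made up for illustration, and the naming pattern is inferred only from the InputTest example (PHASE files named after the input file, Sweep files named after -outfile_name).

```python
from pathlib import Path

# Suffixes of the PHASE output files listed above ("" is the main .phase_out file).
PHASE_SUFFIXES = ["", "_freqs", "_hbg", "_monitor", "_pairs", "_probs", "_recom"]

def expected_outputs(infile_name, outfile_name):
    """List the output file names expected for one run."""
    files = [f"{infile_name}.phase_inp"]
    files += [f"{infile_name}.phase_out{suffix}" for suffix in PHASE_SUFFIXES]
    files += [f"{outfile_name}.emphase", f"{outfile_name}.snp"]
    return files

def missing_outputs(infile_name, outfile_name, directory="."):
    """Return the expected files that do not exist in the given directory."""
    return [
        name
        for name in expected_outputs(infile_name, outfile_name)
        if not (Path(directory) / name).is_file()
    ]

# Prints one line per expected file that is not in the current directory.
for name in missing_outputs("InputTest.txt", "InputTest_out"):
    print("missing:", name)
```

If nothing is printed, all expected files are present in the current directory.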


3) Make your input files with the same format as above.
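
To see why the column layout matters, here is a rough Python sketch of how a table in this format maps onto a PHASE input file. It is an illustration only, not the code in hgdp2sweep.pl. It assumes the standard PHASE layout (number of individuals, number of loci, a "P" line of positions, one locus-type character per SNP, then an ID line and two allele rows per individual). The function name table_to_phase is made up, and the exact ID lines and any removal of children done by the real script are not reproduced.

```python
def table_to_phase(lines):
    """Transpose a SNP-by-sample genotype table into PHASE-style input text."""
    rows = [line.split() for line in lines if line.strip()]
    header, snps = rows[0], rows[1:]
    samples = header[3:]                  # columns after rs#, chrom, pos
    positions = [snp[2] for snp in snps]
    out = [
        str(len(samples)),                # number of individuals
        str(len(snps)),                   # number of loci
        "P " + " ".join(positions),       # locus positions
        "S" * len(snps),                  # "S" marks a biallelic SNP locus
    ]
    for col, sample in enumerate(samples, start=3):
        out.append("#" + sample)
        # The two allele rows carry no phase information; PHASE infers it.
        out.append(" ".join(snp[col][0] for snp in snps))
        out.append(" ".join(snp[col][1] for snp in snps))
    return "\n".join(out) + "\n"

table = [
    "rs# chrom pos HGDP00448 HGDP00479",
    "rs8062140 chr16 22316964 TT TT",
    "rs2887622 chr16 22581356 TC TT",
]
print(table_to_phase(table), end="")
```

Each sample column becomes one individual with two rows of alleles, which is why every row must have the same number of columns and every genotype must have two characters.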

4) Submit your job to the farm and choose the queue according to the size of your input.
	My files with 108 samples and 315 SNPs took ~13.5 hours to run --> submit to "long"
	My files with 207 samples and 315 SNPs took more than 24 hours --> submit to "basement"

5) Example of farm submission command line: 
bsub -J Africa -o Africa_out  -q basement -P team19 perl hgdp2sweep.pl -infile_name CNV784_Africa.txt -outfile_name CNV784_Africa_out.txt


6) Expected terminal (xterm) output from hgdp2sweep.pl:

bc-9-1-03[by1]50: perl hgdp2sweep.pl -infile_name small_input.txt -outfile_name small_out
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Use of uninitialized value in print at hgdp2sweep.pl line 95, <IN> line 17.
Reading in data
Reading Positions of loci
Reading individual     45
Finished reading
Computing matrix Q, please wait
Done computing Q
45
14
SSSSSSSSSSSSSS
0 #NA18570
C A G T T A A T T T C C C G 
C A G T T A A T T T C C C C 
0 #NA18593
C A G T A A A T T T C C T C 
C A G T T A A T T T C C C C 
0 #NA18561
C A G T T A A T T T C C C G 
C A G T T A A T T T C C T C 
0 #NA18594
C A G T A A A T T T C C C G 
C A G T T A A T T T C C C C 
0 #NA18577
C A G T A A A T T T C C C C 
C A G T T A A T T T C C T G 
0 #NA18611
C A G T A A A T T T C C T C 
C A G T T A A T T T C C C C 
0 #NA18558
C A G T A A A T T T C C C G 
C A G T T A A T T T C C T G 
0 #NA18582
C A G T A A A T T T C C T C 
C A G T T A A T T T C C C C 
0 #NA18603
C A G T A A A T T T C C T G 
C A G T T A A T T T C C T G 
0 #NA18608
C A G T A A A T T T C C T C 
C A G T A A A T T T C C T C 
0 #NA18550
C A G T T A A T T T C C T G 
C A G T A A A T T T C C C C 
0 #NA18620
C A G T T A A T T T C C T C 
C A G T A A A T T T C C C C 
0 #NA18529
C A G T A A A T T T C C C G 
C A G T T A A T T T C C T C 
0 #NA18579
C A G T T A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18562
C A G T A A A T T T C C C G 
C A G T A A A T T T C C T C 
0 #NA18609
C A G T A A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18573
C A G T T A A T T T C C T G 
C A G T T A A T T T C C C G 
0 #NA18532
C A G T A A A T T T C C T C 
C A G T A A A T T T C C T G 
0 #NA18636
C A G T A A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18623
C A G T A A A T T T C C C C 
C A G T A A A T T T C C T C 
0 #NA18571
C A G T T A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18566
C A G T T A A T T T C C T C 
C A G T A A A T T T C C C C 
0 #NA18635
C A G T T A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18545
C A G T A A A T T T C C C C 
C A G T T A A T T T C C T G 
0 #NA18563
C A G T T A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18605
C A G T A A A T T T C C T G 
C A G T A A A T T T C C T C 
0 #NA18632
C A G T A A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18537
C A G T T A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18637
C A G T A A A T T T C C T C 
C A G T A A A T T T C C T G 
0 #NA18622
C A G T A A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18624
C A G T A A A T T T C C T C 
C A G T T A A T T T C C C C 
0 #NA18540
C A G T A A A T T T C C C C 
C A G T T A A T T T C C T C 
0 #NA18524
C A G T A A A T T T C C T C 
C A G T A A A T T T C C C G 
0 #NA18547
C A G T T A A T T T C C C C 
C A G T A A A T T T C C T G 
0 #NA18555
C A G T T A A T T T C C C C 
C A G T A A A T T T C C C C 
0 #NA18572
C A G T A A A T T T C C T C 
C A G T A A A T T T C C C C 
0 #NA18576
C A G T T A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18621
C A G T T A A T T T C C T C 
C A G T T A A T T T C C C C 
0 #NA18526
C A G T A A A T T T C C C C 
C A G T A A A T T T C C C C 
0 #NA18612
C A G T T A A T T T C C C C 
C A G T T A A T T T C C C C 
0 #NA18564
C A G T A A A T T T C C T C 
C A G T T A A T T T C C C C 
0 #NA18552
C A G T T A A T T T C C C C 
C A G T A A A T T T C C T G 
0 #NA18633
C A G T T A A T T T C C T C 
C A G T A A A T T T C C C G 
0 #NA18542
C A G T T A A T T T C C T C 
C A G T A A A T T T C C T C 
0 #NA18592
C A G T A A A T T T C C T C 
C A G T A A A T T T C C T C 
Resolving with method R
Making List of all possible haplotypes
Method = R
Performing Burn-in iterations
  50% done
Estimating recom rates
Continuing Burn-in
Performing Main iterations
0 segment operations done
Making List of all possible haplotypes
Method = R
Performing Burn-in iterations
  50% done
Estimating recom rates
Continuing Burn-in
Performing Main iterations
1 segment operations done
Method = R
Performing Final Set of Iterations... nearly there!
Performing Burn-in iterations
  50% done
Estimating recom rates
Continuing Burn-in
Performing Main iterations
Writing output to files 
Producing Summary, please wait 
bc-9-1-03[by1]51: 


