Usage
After installing the bgparsers package, the bgvariants command-line tool
should be available. It provides four subcommands for processing files
that contain genomic variants.
Command cat
The cat (concatenate) command parses a variant file and prints its variants.
$ bgvariants cat 100k.maf.gz | head -n10
SAMPLE DONOR CHROMOSOME POSITION REF ALT STRAND ALT_TYPE
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 114372254 T C - snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 145534100 C G + snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 153362976 G A - snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 156281973 C G - snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 156647053 C G - snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 161047356 G A - snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 171154946 G C + snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 172558229 A T + snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 183209495 C T + snp
The --where (or -w) parameter filters the output by the value of a column.
$ bgvariants cat --where ALT_TYPE==indel 100k.maf.gz | head -n10
SAMPLE DONOR CHROMOSOME POSITION REF ALT STRAND ALT_TYPE
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 27099309 C - + indel
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 16 67184256 C - - indel
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 19 36431511 CCAGCTG - + indel
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 2 164467360 - A - indel
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 20 56099187 T - - indel
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 3 121409850 - TC - indel
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 3 46306948 T - + indel
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 3 57893702 A - + indel
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 3 73433413 - G - indel
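To make the effect of the filter concrete, the sketch below reproduces it outside the tool with awk. It is an illustration, not how bgvariants implements --where, and it assumes that the cat output is whitespace-separated with ALT_TYPE as the eighth column, as in the listings above. The function name keep_alt_type is hypothetical.

```shell
# Hypothetical helper, not part of bgvariants: keep the header line plus
# every row whose eighth column (ALT_TYPE in the listings above) equals $1.
keep_alt_type() {
    awk -v want="$1" 'NR == 1 || $8 == want'
}

# Dry run on the header and two rows copied from the listings above.
# Only the header and the indel row are printed.
printf '%s\n' \
    'SAMPLE DONOR CHROMOSOME POSITION REF ALT STRAND ALT_TYPE' \
    'TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 114372254 T C - snp' \
    'TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 27099309 C - + indel' |
    keep_alt_type indel
```

Under the same assumptions, piping the output of bgvariants cat through keep_alt_type indel should select the same rows as --where ALT_TYPE==indel, although the built-in filter is the simpler choice.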
Command count
The count command counts the total number of variants in a file.
$ bgvariants count 100k.maf.gz
TOTAL 100000
The --groupby (or -g) parameter counts the variants grouped by the values of one column.
$ bgvariants count --groupby ALT_TYPE 100k.maf.gz
mnp 469
indel 3056
snp 96475
TOTAL 100000
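As a cross-check of what the grouped count reports, the same tally can be sketched with awk on the output of the cat command. This is an illustration under assumptions, not the tool's implementation: it assumes whitespace-separated columns with a single header line, and the function name count_by_column is hypothetical. The order of the per-group lines is unspecified in awk, so only the TOTAL line is guaranteed to come last.

```shell
# Hypothetical helper, not part of bgvariants: count rows per value of the
# column whose 1-based index is $1, skipping the header line, then print
# the overall total.
count_by_column() {
    awk -v col="$1" '
        NR > 1 { n[$col]++; total++ }
        END {
            for (k in n) print k, n[k]
            print "TOTAL", total + 0
        }'
}

# Dry run on the header and three rows copied from the listings above,
# counting by the eighth column (ALT_TYPE).
printf '%s\n' \
    'SAMPLE DONOR CHROMOSOME POSITION REF ALT STRAND ALT_TYPE' \
    'TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 114372254 T C - snp' \
    'TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 145534100 C G + snp' \
    'TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 1 27099309 C - + indel' |
    count_by_column 8
```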
Command groupby
The groupby command groups the variants by one column and processes all the groups (in parallel) with a single script.
The script is called once per group, and only the variants of that group are sent to it on standard input. The script
can identify the key of the group it is processing through the environment variable GROUP_KEY.
The following example simply splits one file into several files, one per chromosome:
$ bgvariants groupby -g CHROMOSOME -s 'gzip > chr_${GROUP_KEY}.tab.gz' 100k.maf.gz
Computing groups: 100%|██████████████████████████████████| 26/26 [00:42<00:00, 1.64s/it]
$ ls chr*.tab.gz
chr_10.tab.gz chr_16.tab.gz chr_21.tab.gz chr_6.tab.gz chr_X.tab.gz
chr_11.tab.gz chr_17.tab.gz chr_22.tab.gz chr_7.tab.gz chr_Y.tab.gz
chr_12.tab.gz chr_18.tab.gz chr_2.tab.gz chr_8.tab.gz
chr_13.tab.gz chr_19.tab.gz chr_3.tab.gz chr_9.tab.gz
chr_14.tab.gz chr_1.tab.gz chr_4.tab.gz chr_MT.tab.gz
chr_15.tab.gz chr_20.tab.gz chr_5.tab.gz chr_None.tab.gz
$ zcat chr_12.tab.gz | head -n 3
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 12 106712243 G A + snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 12 110906002 G A - snp
TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 12 122825944 C T - snp
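A per-group script does not have to be a one-liner. The sketch below shows a slightly longer script body and a way to dry-run it locally without bgvariants. It relies only on the contract described above (variants on standard input, group key in GROUP_KEY) and on the listing above, where each group arrives without a header line. The function name summarize_group is hypothetical.

```shell
# Hypothetical per-group script body: count the variants received on
# standard input and print the group key and the count, tab-separated.
summarize_group() {
    n=$(wc -l | tr -d ' ')   # tr strips the padding some wc versions add
    printf '%s\t%s\n' "${GROUP_KEY}" "${n}"
}

# Local dry run: fake one group (chromosome 12) with two variants copied
# from the listing above, setting GROUP_KEY by hand.
printf '%s\n' \
    'TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 12 106712243 G A + snp' \
    'TCGA-BL-A0C8-01A-11D-A10S-08 TCGA-BL-A0C8 12 110906002 G A - snp' |
    GROUP_KEY=12 summarize_group
```

Testing the script body this way before handing it to -s avoids waiting for a full grouping run. Because the groups are processed in parallel, the lines printed by different groups should not be expected in any particular order.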
With the --where (or -w) parameter, it is also possible to filter the variants before the groups are sent to the script.
$ bgvariants groupby -w ALT_TYPE==indel -g CHROMOSOME -s 'gzip > indels_chr_${GROUP_KEY}.tab.gz' 100k.maf.gz
Computing groups: 100%|██████████████████████████████| 26/26 [00:22<00:00, 1.16it/s]
$ zcat indels_chr_7.tab.gz | head -n3
TCGA-BL-A3JM-01A-12D-A21A-08 TCGA-BL-A3JM 7 97842081 CAG - + indel
TCGA-BT-A0S7-01A-11D-A10S-08 TCGA-BT-A0S7 7 128434455 GAA - + indel
TCGA-BT-A20J-01A-11D-A14W-08 TCGA-BT-A20J 7 158704352 - A + indel