Usage

After installing the bgparsers package, the bgvariants command-line tool should be available. It provides four subcommands for processing files that contain genomic variants.

Command cat

The cat (concatenate) command parses a variant file and prints the variants it contains.

$ bgvariants cat 100k.maf.gz | head -n10
SAMPLE      DONOR   CHROMOSOME      POSITION        REF     ALT     STRAND  ALT_TYPE
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       114372254       T       C       -       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       145534100       C       G       +       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       153362976       G       A       -       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       156281973       C       G       -       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       156647053       C       G       -       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       161047356       G       A       -       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       171154946       G       C       +       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       172558229       A       T       +       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       183209495       C       T       +       snp

The --where (or -w) parameter filters the output by the value of a column.

$ bgvariants cat --where ALT_TYPE==indel 100k.maf.gz | head -n10
SAMPLE      DONOR   CHROMOSOME      POSITION        REF     ALT     STRAND  ALT_TYPE
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    1       27099309        C       -       +       indel
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    16      67184256        C       -       -       indel
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    19      36431511        CCAGCTG -       +       indel
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    2       164467360       -       A       -       indel
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    20      56099187        T       -       -       indel
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    3       121409850       -       TC      -       indel
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    3       46306948        T       -       +       indel
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    3       57893702        A       -       +       indel
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    3       73433413        -       G       -       indel
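
To make the filter semantics concrete, the sketch below reproduces the same kind of selection with awk on a small hand-made sample. It is not how bgvariants implements --where. It assumes the cat output is tab-separated with a header on the first line, and the sample rows are hypothetical.

```shell
# Hypothetical sample in the layout printed by `bgvariants cat`
# (assumed: tab-separated, header on the first line).
rows=$(printf 'SAMPLE\tDONOR\tCHROMOSOME\tPOSITION\tREF\tALT\tSTRAND\tALT_TYPE\nS1\tD1\t1\t114372254\tT\tC\t-\tsnp\nS1\tD1\t1\t27099309\tC\t-\t+\tindel\nS1\tD1\t2\t164467360\t-\tA\t-\tindel\n')

# Find the ALT_TYPE column by name in the header, print the header,
# then print only the rows whose ALT_TYPE equals "indel".
indels=$(printf '%s\n' "$rows" | awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ALT_TYPE") col = i; print; next }
    $col == "indel"')

printf '%s\n' "$indels"
```

Looking the column up by name rather than by position keeps the filter working if the column order ever differs from the one shown above.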

Command count

The count command counts the total number of variants in a file.

$ bgvariants count 100k.maf.gz
TOTAL       100000

The --groupby (or -g) parameter counts the variants grouped by the values of one column.

$ bgvariants count --groupby ALT_TYPE 100k.maf.gz
mnp     469
indel       3056
snp     96475
TOTAL       100000
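
As a rough cross-check of what a grouped count means, the same tally can be sketched with awk on a small sample. This is an illustration only, not the bgvariants implementation. It assumes tab-separated input with a header line, and the sample rows are hypothetical.

```shell
# Hypothetical sample in the layout printed by `bgvariants cat`
# (assumed: tab-separated, header on the first line).
rows=$(printf 'SAMPLE\tDONOR\tCHROMOSOME\tPOSITION\tREF\tALT\tSTRAND\tALT_TYPE\nS1\tD1\t1\t114372254\tT\tC\t-\tsnp\nS1\tD1\t1\t27099309\tC\t-\t+\tindel\nS2\tD2\t3\t46306948\tG\tA\t+\tsnp\n')

# Locate the ALT_TYPE column from the header, tally each value,
# then print one line per value followed by the overall total.
# The order of the per-value lines is not guaranteed by awk.
counts=$(printf '%s\n' "$rows" | awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ALT_TYPE") col = i; next }
    { n[$col]++; total++ }
    END {
        for (t in n) printf "%s\t%d\n", t, n[t]
        printf "TOTAL\t%d\n", total
    }')

printf '%s\n' "$counts"
```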

Command groupby

The groupby command groups the variants by one column and processes the groups (in parallel) with a script. The script is called once per group, and only the variants of that group are sent to its standard input. The script can identify the key of the group it is processing through the environment variable GROUP_KEY.

The following example just splits one file into several files, one per chromosome:

$ bgvariants groupby -g CHROMOSOME -s 'gzip > chr_${GROUP_KEY}.tab.gz' 100k.maf.gz
    Computing groups: 100%|██████████████████████████████████| 26/26 [00:42<00:00,  1.64s/it]

$ ls chr*.tab.gz
chr_10.tab.gz  chr_16.tab.gz  chr_21.tab.gz  chr_6.tab.gz     chr_X.tab.gz
chr_11.tab.gz  chr_17.tab.gz  chr_22.tab.gz  chr_7.tab.gz     chr_Y.tab.gz
chr_12.tab.gz  chr_18.tab.gz  chr_2.tab.gz   chr_8.tab.gz
chr_13.tab.gz  chr_19.tab.gz  chr_3.tab.gz   chr_9.tab.gz
chr_14.tab.gz  chr_1.tab.gz   chr_4.tab.gz   chr_MT.tab.gz
chr_15.tab.gz  chr_20.tab.gz  chr_5.tab.gz   chr_None.tab.gz

$ zcat chr_12.tab.gz | head -n 3
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    12      106712243       G       A       +       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    12      110906002       G       A       -       snp
TCGA-BL-A0C8-01A-11D-A10S-08        TCGA-BL-A0C8    12      122825944       C       T       -       snp
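
The contract described for groupby scripts (rows of one group on standard input, group key in GROUP_KEY) can be tried without bgvariants by simulating a single call. In this sketch the function name summarize_group and the sample rows are hypothetical, and the input is assumed to be tab-separated with no header line, as the zcat output above suggests.

```shell
# Hypothetical per-group script body: read one group's variants from stdin
# and tally the ALT_TYPE column (assumed to be the 8th tab-separated field),
# prefixing every output line with the group key taken from GROUP_KEY.
summarize_group() {
    awk -F'\t' -v key="$GROUP_KEY" '
        { n[$8]++ }
        END { for (t in n) printf "%s\t%s\t%d\n", key, t, n[t] }' | sort
}

# Hand-made rows standing in for the variants of chromosome 12.
sample_rows=$(printf 'S1\tD1\t12\t106712243\tG\tA\t+\tsnp\nS1\tD1\t12\t110906002\tC\t-\t-\tindel\nS2\tD2\t12\t122825944\tC\tT\t-\tsnp\n')

# Simulate one invocation: set GROUP_KEY for the call and pipe the rows in.
summary=$(printf '%s\n' "$sample_rows" | GROUP_KEY=12 summarize_group)

printf '%s\n' "$summary"
```

Because the script only depends on standard input and one environment variable, it can be developed and checked on a small file like this before being handed to groupby with -s.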

With the --where (or -w) parameter, the variants can also be filtered before they are sent to the script.

$ bgvariants groupby -w ALT_TYPE==indel -g CHROMOSOME -s 'gzip > indels_chr_${GROUP_KEY}.tab.gz' 100k.maf.gz
    Computing groups: 100%|██████████████████████████████| 26/26 [00:22<00:00,  1.16it/s]

$ zcat indels_chr_7.tab.gz | head -n3
TCGA-BL-A3JM-01A-12D-A21A-08        TCGA-BL-A3JM    7       97842081        CAG     -       +       indel
TCGA-BT-A0S7-01A-11D-A10S-08        TCGA-BT-A0S7    7       128434455       GAA     -       +       indel
TCGA-BT-A20J-01A-11D-A14W-08        TCGA-BT-A20J    7       158704352       -       A       +       indel