Usage
=====

After installing the ``bgparsers`` package, the ``bgvariants`` command-line tool should be available. It has four subcommands for processing files that contain genomic variants.

Command cat
-----------

The ``cat`` (concatenate) command parses a variant file.

.. code-block:: bash

    $ bgvariants cat 100k.maf.gz | head -n10
    SAMPLE                        DONOR         CHROMOSOME  POSITION   REF  ALT  STRAND  ALT_TYPE
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           114372254  T    C    -       snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           145534100  C    G    +       snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           153362976  G    A    -       snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           156281973  C    G    -       snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           156647053  C    G    -       snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           161047356  G    A    -       snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           171154946  G    C    +       snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           172558229  A    T    +       snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           183209495  C    T    +       snp

The ``--where`` (or ``-w``) parameter filters the output by the value of a column.

.. code-block:: bash

    $ bgvariants cat --where ALT_TYPE==indel 100k.maf.gz | head -n10
    SAMPLE                        DONOR         CHROMOSOME  POSITION   REF      ALT  STRAND  ALT_TYPE
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  1           27099309   C        -    +       indel
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  16          67184256   C        -    -       indel
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  19          36431511   CCAGCTG  -    +       indel
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  2           164467360  -        A    -       indel
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  20          56099187   T        -    -       indel
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  3           121409850  -        TC   -       indel
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  3           46306948   T        -    +       indel
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  3           57893702   A        -    +       indel
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  3           73433413   -        G    -       indel

Command count
-------------

The ``count`` command counts the total number of variants in a file.

.. code-block:: bash

    $ bgvariants count 100k.maf.gz
    TOTAL 100000

The ``--groupby`` (or ``-g``) parameter counts the variants grouped by a column.

.. code-block:: bash

    $ bgvariants count --groupby ALT_TYPE 100k.maf.gz
    mnp 469
    indel 3056
    snp 96475
    TOTAL 100000

Command groupby
---------------

The ``groupby`` command groups the variants by a column and processes all the groups (in parallel) with a script. The script is called once per group, and only the variants of that group are sent to it on standard input. The script can identify the key of the group it is processing through the environment variable ``GROUP_KEY``.

The following example simply splits one file into several files, one per chromosome:

.. code-block:: bash

    $ bgvariants groupby -g CHROMOSOME -s 'gzip > chr_${GROUP_KEY}.tab.gz' 100k.maf.gz
    Computing groups: 100%|██████████████████████████████████| 26/26 [00:42<00:00, 1.64s/it]

    $ ls chr*.tab.gz
    chr_10.tab.gz  chr_16.tab.gz  chr_21.tab.gz  chr_6.tab.gz     chr_X.tab.gz
    chr_11.tab.gz  chr_17.tab.gz  chr_22.tab.gz  chr_7.tab.gz     chr_Y.tab.gz
    chr_12.tab.gz  chr_18.tab.gz  chr_2.tab.gz   chr_8.tab.gz
    chr_13.tab.gz  chr_19.tab.gz  chr_3.tab.gz   chr_9.tab.gz
    chr_14.tab.gz  chr_1.tab.gz   chr_4.tab.gz   chr_MT.tab.gz
    chr_15.tab.gz  chr_20.tab.gz  chr_5.tab.gz   chr_None.tab.gz

    $ zcat chr_12.tab.gz | head -n 3
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  12  106712243  G  A  +  snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  12  110906002  G  A  -  snp
    TCGA-BL-A0C8-01A-11D-A10S-08  TCGA-BL-A0C8  12  122825944  C  T  -  snp

With the ``--where`` (or ``-w``) parameter it is also possible to filter the variants before they are grouped and sent to the script.

.. code-block:: bash

    $ bgvariants groupby -w ALT_TYPE==indel -g CHROMOSOME -s 'gzip > indels_chr_${GROUP_KEY}.tab.gz' 100k.maf.gz
    Computing groups: 100%|██████████████████████████████| 26/26 [00:22<00:00, 1.16it/s]

    $ zcat indels_chr_7.tab.gz | head -n3
    TCGA-BL-A3JM-01A-12D-A21A-08  TCGA-BL-A3JM  7  97842081   CAG  -  +  indel
    TCGA-BT-A0S7-01A-11D-A10S-08  TCGA-BT-A0S7  7  128434455  GAA  -  +  indel
    TCGA-BT-A20J-01A-11D-A14W-08  TCGA-BT-A20J  7  158704352  -    A  +  indel
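
A per-group script does not have to be a one-liner. The following is a minimal sketch, not part of ``bgparsers``, of what a script body might look like: it reads the variants of one group from standard input and prints one line per ``ALT_TYPE`` with the group key and a count. It assumes the rows arrive tab-separated and without a header, with ``ALT_TYPE`` in the eighth column, as the ``zcat`` output above suggests. The function name ``summarize_group`` is hypothetical, and the last command only simulates a single group by hand so the sketch can be tried without ``bgvariants``.

```shell
#!/bin/sh
# Hypothetical helper, not part of bgparsers.
# Assumption: rows are tab-separated, have no header, and ALT_TYPE is column 8.
summarize_group() {
    # GROUP_KEY is provided by "bgvariants groupby"; fall back to "unknown"
    # when the function is run by hand without it.
    key="${GROUP_KEY:-unknown}"
    awk -F '\t' -v key="$key" '
        { counts[$8]++ }                 # tally rows per ALT_TYPE
        END {
            for (t in counts)
                printf "%s\t%s\t%d\n", key, t, counts[t]
        }
    ' | sort                             # awk array order is unspecified
}

# Simulate one group (chromosome 12) with three made-up rows.
printf 'S1\tD1\t12\t100\tG\tA\t+\tsnp\nS1\tD1\t12\t200\tC\t-\t-\tindel\nS2\tD2\t12\t300\tC\tT\t-\tsnp\n' \
    | GROUP_KEY=12 summarize_group
```

The examples above pass a shell command string to ``-s``, so a file containing this function followed by a call to it could presumably be invoked from that string; that usage is an assumption and has not been checked against ``bgparsers``.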