Quality control -- Fastp

前言

Fastp是一款旨在提供FastQ文件快速的一体化预处理工具。由C语言开发且支持多线程的高性能表现。

官方文档:Fastp in GitHub
Fastp文章链接:https://www.biorxiv.org/content/early/2018/03/01/274100


功能特点

  • 对数据自动进行全方位质控,生成人性化的报告
  • 过滤功能(低质量,太短,太多N……)
  • 对每一个序列的头部或尾部,计算滑动窗内的质量均值,并将均值较低的子序列进行切除(类似Trimmomatic的做法,但是快非常多)
  • 全局剪裁 (在头/尾部,不影响去重),对于Illumina下机数据往往最后一到两个cycle需要这样处理
  • 去除接头污染。厉害的是,你不用输入接头序列,因为算法会自动识别接头序列并进行剪裁
  • 对于双端测序(PE)的数据,软件会自动查找每一对read的重叠区域,并对该重叠区域中不匹配的碱基对进行校正
  • 去除尾部的polyG。对于Illumina NextSeq/NovaSeq的测序数据,因为是两色法发光,polyG是常有的事,所以该特性对该两类测序平台默认打开
  • 对于PE数据中的overlap区间中不一致的碱基对,依据质量值进行校正
  • 可以对带分子标签(UMI)的数据进行预处理,不管UMI在插入片段还是在index上,都可以轻松处理
  • 可以将输出进行分拆,而且支持两种模式,分别是指定分拆的个数,或者分拆后每个文件的行数
  • 更多查看官方文档

注意

  • 默认已经开启一些功能,也可关闭或者改变mo某个功能的默认参数
  • 支持gzip输入与输出
  • 支持SE(single end)和PE(paired end)数据
  • 完美支持short reads和一定程度的long reads
  • 生成html和json报告,且可动态交互html预览 json预览

下载安装

官方文档介绍了三种获得Fastp的方式:

  1. Install with Bioconda (may be not the lastest version)

    1
    conda install -c bioconda fastp
  2. Download binary (only for Linux system, http://opengene.org/fastp/fastp) This binary was compiled on CentOS, and tested on CentOS/Ubuntu (测试于Ubuntu 18.04 & 16.04可用)

    1
    2
    wget http://opengene.org/fastp/fastp
    chmod a+x ./fastp
  3. Compile from source
    Get source (you can also use browser to download from master or releases)

    1
    2
    3
    4
    5
    6
    git clone https://github.com/OpenGene/fastp.git
    # build
    cd fastp
    make
    # Install
    sudo make install

我选择二进制文件,可直接按如下代码使用,无需安装

1
./fastp [option]

简单使用

SE数据:

1
fastp -i in.fq -o out.fq

输入需要过滤、质控和预处理的fastq文件(in.fq)
输出clean data(out.fq)、报告文件(fast.html和fastp.json)
报告文件名称可通过 -h xx.html 和 -j xx.json 更改

PE数据:

1
fastp -i in.R1.fq -o out.R1.fq -I in.R2.fq -O out.R2.fq

注意

  • o与O是小写字母o与大写字母O
  • gzip数据的输入无需参数,自动按后缀识别(xx.gz)
    例:
    1
    fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz

精确使用

具体参数的使用、调节都需要根据具体的质控要求及项目或者文章结果要求进行更改。
附上全部参数:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
usage: fastp -i <in1> -o <out1> [-I <in1> -O <out2>] [options...]
options:
# I/O options
-i, --in1 read1 input file name (string)
-o, --out1 read1 output file name (string [=])
-I, --in2 read2 input file name (string [=])
-O, --out2 read2 output file name (string [=])
-6, --phred64 indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
-z, --compression compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])
--stdin input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.
--stdout output passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end input. Disabled by default.
--interleaved_in indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
--dont_overwrite don't overwrite existing files. Overwritting is allowed by default.

# adapter trimming options
-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=])
--detect_adapter_for_pe by default, the adapter sequence auto-detection is enabled for SE data only, turn on this option to enable it for PE data.

# global trimming options
-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])
-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])
-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])
-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])

# polyG tail trimming, useful for NextSeq/NovaSeq data
-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

# polyX tail trimming
-x, --trim_poly_x enable polyX trimming in 3' ends.
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

# per read cutting by quality options
-5, --cut_by_quality5 enable per read cutting by quality in front (5'), default is disabled (WARNING: this will interfere deduplication for both PE/SE data)
-3, --cut_by_quality3 enable per read cutting by quality in tail (3'), default is disabled (WARNING: this will interfere deduplication for SE data)
-W, --cut_window_size the size of the sliding window for sliding window trimming, default is 4 (int [=4])
-M, --cut_mean_quality the bases in the sliding window with mean quality below cutting_quality will be cut, default is Q20 (int [=20])

# quality filtering options
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])

# length filtering options
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
-l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
--length_limit reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])

# low complexity filtering
-y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])

# filter reads with unwanted indexes (to remove possible contamination)
--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])

# base correction by overlap analysis options
-c, --correction enable base correction in overlapped regions (only for PE data), default is disabled
--overlap_len_require the minimum length of the overlapped region for overlap analysis based adapter trimming and correction. 30 by default. (int [=30])
--overlap_diff_limit the maximum difference of the overlapped region for overlap analysis based adapter trimming and correction. 5 by default. (int [=5])

# UMI processing
-U, --umi enable unique molecular identifier (UMI) preprocessing
--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])
--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
--umi_skip if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])

# overrepresented sequence analysis
-p, --overrepresentation_analysis enable overrepresented sequence analysis.
-P, --overrepresentation_sampling One in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])

# reporting options
-j, --json the json format report file name (string [=fastp.json])
-h, --html the html format report file name (string [=fastp.html])
-R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])

# threading options
-w, --thread worker thread number, default is 2 (int [=2])

# output splitting options
-s, --split split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
-S, --split_by_lines split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
-d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])

# help
-?, --help print this message

我的项目的质控要求:

  1. 去除接头(对于双端已自动去除,无需参数)
  2. 去除含N比例大于0.1的 reads(参数 -n 的默认值为5,我的测序reads 长125bp,则 -n 应为12.5≈13)
  3. 去除低质量 reads(Q phred <= 20 的碱基数占整个 read 长度的 50%以上的 reads)(-q 20 -u 50)
  4. 速度尽可能快:线程数:-w 4(CPU:双核四线程)

注意

  • 线程数根据CPU能力调节
  • 质控要求所对应的参数调节具体查看官方文档

批量化处理数据时,可写出相应 bash 脚本:

1
2
3
4
5
6
7
8
for i in 10 11; do
{
./fastp -i raw/E${i}_1.fq.gz -I raw/E${i}_2.fq.gz\
-o out/out.E${i}_1.fq.gz -O out/out.E${i}_2.fq.gz\
-q 20 -u 50 -n 13 -w 4 -h out/E${i}.html -j out/E${i}.json
};
done
wait

注意

  • 这里的 raw 和 out 是我创建的文件夹,便于管理
  • bash 语法自行学习使用

参考

  1. 官方文档
  2. 生信工具Fastp的安装及其在二代测序数据过滤质控中的使用说明
  3. 使用fastp进行数据质控
  4. fastp: 一款超快速全功能的FASTQ文件自动化质控+过滤+校正+预处理软件
-------------本文结束 感谢您的阅读-------------
暖一下
ZJohnson wechat
扫一扫,领红包!
0%