Title: | Sequence Error Filter for Formalin-Fixed and Paraffin-Embedded Samples |
---|---|
Description: | Clinical sequencing of tumors is usually performed on formalin-fixed and paraffin-embedded samples, which contain many sequencing errors. We found that the majority of these errors are detected in chimeric reads caused by single-stranded DNA with micro-homology. Our filtering pipeline focuses on the uneven distribution of the artifacts in each read and removes such errors in formalin-fixed and paraffin-embedded samples without over-eliminating the true mutations detected in fresh frozen samples. |
Authors: | Masachika Ikegami [aut, cre] |
Maintainer: | Masachika Ikegami <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.1.6 |
Built: | 2024-11-26 05:59:16 UTC |
Source: | https://github.com/mano-b/microsec |
A BAM file containing the information of eight mutations.
exampleBam
A list with 8 factors, each containing 46527 variables:
chromosome of the read
read ID list
sequence of the read, in DNAString
strand of the read
CIGAR sequence of the read
Phred quality of the read
starting position of the read
insert size of the read
...
A dataset containing the information of eight mutations.
exampleMutation
A list with 8 factors, each containing 29 variables
sample name
mutation type
altered chromosome
altered position
reference base
altered base
mutation locating repeat sequence
neighborhood sequence
...
This function analyzes the filtering results.
fun_analysis( msec, mut_depth, short_homology_search_length, min_homology_search, threshold_p, threshold_hairpin_ratio, threshold_short_length, threshold_distant_homology, threshold_soft_clip_ratio, threshold_low_quality_rate, homopolymer_length )
msec |
Mutation filtering information. |
mut_depth |
Mutation coverage data. |
short_homology_search_length |
Small sequence for homology search. |
min_homology_search |
The sequence length for homology search. |
threshold_p |
The largest p value of significant errors. |
threshold_hairpin_ratio |
The smallest hairpin read ratio. |
threshold_short_length |
Reads shorter than this threshold are analyzed. |
threshold_distant_homology |
The smallest rate of reads from other regions. |
threshold_soft_clip_ratio |
The rate of soft-clipped reads. |
threshold_low_quality_rate |
The smallest rate of low quality bases. |
homopolymer_length |
The smallest length of homopolymers. |
msec
data(msec_summarized)
data(mut_depth_checked)
fun_analysis(msec = msec_summarized,
             mut_depth = mut_depth_checked,
             short_homology_search_length = 4,
             min_homology_search = 40,
             threshold_p = 10 ^ (-6),
             threshold_hairpin_ratio = 0.50,
             threshold_short_length = 0.75,
             threshold_distant_homology = 0.15,
             threshold_soft_clip_ratio = 0.50,
             threshold_low_quality_rate = 0.1,
             homopolymer_length = 15
)
This function attempts to find hairpin structure sequences.
fun_hairpin_check(hairpin_seq_tmp, ref_seq, hairpin_length, hair)
hairpin_seq_tmp |
The sequence to be checked. |
ref_seq |
Reference sequence around the mutation. |
hairpin_length |
The provisional length of hairpin sequences. |
hair |
The length of sequences to be checked. |
list(hairpin_length, whether hairpin sequences exist or not)
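As a rough illustration of what a hairpin check involves, the sketch below tests whether the reverse complement of a candidate sequence occurs in a reference sequence. The helpers `reverse_complement` and `has_hairpin_candidate` are hypothetical names introduced here; this is not the implementation of fun_hairpin_check, and it ignores the hairpin_length and hair arguments.

```r
# Hypothetical helpers for illustration only; not part of MicroSEC.

# Reverse-complement a DNA string made of A, C, G and T.
reverse_complement <- function(seq) {
  bases <- rev(strsplit(seq, "")[[1]])
  chartr("ACGT", "TGCA", paste(bases, collapse = ""))
}

# Assumption: a hairpin candidate is a sequence whose reverse complement
# appears somewhere in the reference sequence around the mutation.
has_hairpin_candidate <- function(candidate_seq, ref_seq) {
  grepl(reverse_complement(candidate_seq), ref_seq, fixed = TRUE)
}

has_hairpin_candidate("AACG", "TTTCGTTAAA")
has_hairpin_candidate("AACG", "TTTTTTTT")
```

Plain string matching keeps the sketch free of dependencies; real reads would typically be handled as DNAString objects.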
This function attempts to search the homologous regions.
fun_homology( msec, df_distant, min_homology_search, ref_genome, chr_no, progress_bar )
msec |
Mutation filtering information. |
df_distant |
Sequences to be checked. |
min_homology_search |
Minimum length to define "homologous". |
ref_genome |
Reference genome for the data. |
chr_no |
Reference genome chromosome number (human=24, mouse=22). |
progress_bar |
Set to "Y" to display a progress bar. |
msec
## Not run:
data(msec_read_checked)
data(homology_searched)
fun_homology(msec = msec_read_checked,
             df_distant = homology_searched,
             min_homology_search = 40,
             ref_genome = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
             chr_no = 24,
             progress_bar = "N"
)
## End(Not run)
This function attempts to load the BAM file.
fun_load_bam(bam_file)
bam_file |
Path of the BAM file. |
df_bam
fun_load_bam( system.file("extdata", "sample.bam", package = "MicroSEC") )
This function attempts to load the chromosome number.
fun_load_chr_no(organism)
organism |
Human or Mouse genome. |
chr_no
fun_load_chr_no("Human")
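The mapping from organism to chromosome number can be pictured as a simple lookup. The sketch below uses a hypothetical function `chr_no_for` and only the two values quoted in this manual (human=24, mouse=22); the labels the real fun_load_chr_no accepts, beyond "Human", are not confirmed here.

```r
# Hypothetical lookup for illustration; not the MicroSEC implementation.
# The counts are the ones stated in this manual: human = 24, mouse = 22.
chr_no_for <- function(organism) {
  chr_counts <- c(Human = 24, Mouse = 22)
  if (!organism %in% names(chr_counts)) {
    stop("Unsupported organism: ", organism)
  }
  unname(chr_counts[organism])
}

chr_no_for("Human")
chr_no_for("Mouse")
```

Failing loudly on an unknown label is a design choice of this sketch, so that a typo does not silently produce a wrong chromosome count downstream.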
This function attempts to load the appropriate genome.
fun_load_genome(organism)
organism |
Human or Mouse genome. |
ref_genome
fun_load_genome("Human")
This function attempts to load the mutation information file.
fun_load_mutation( mutation_file, sample_name, ref_genome, chr_no, simple_repeat_list = "" )
mutation_file |
Path of the mutation information file. |
sample_name |
Sample name. |
ref_genome |
Reference genome for the data. |
chr_no |
Reference genome chromosome number (human=24, mouse=22). |
simple_repeat_list |
Optional, set simple repeat bed file path. |
df_mutation
fun_load_mutation( system.file("extdata", "mutation_list.tsv", package = "MicroSEC"), "sample", BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, 24 )
This function attempts to check the mutation profile in each read.
fun_read_check( df_mutation, df_bam, ref_genome, sample_name, read_length, adapter_1, adapter_2, short_homology_search_length, min_homology_search, progress_bar )
df_mutation |
Mutation information. |
df_bam |
Data from the BAM file. |
ref_genome |
Reference genome for the data. |
sample_name |
Sample name (character) |
read_length |
The read length in the sequence. |
adapter_1 |
The Read 1 adapter sequence of the library. |
adapter_2 |
The Read 2 adapter sequence of the library. |
short_homology_search_length |
Small sequence for homology search. |
min_homology_search |
Minimum length to define "homologous". |
progress_bar |
Set to "Y" to display a progress bar. |
list(msec, homology_search, mut_depth)
## Not run:
data(exampleMutation)
data(exampleBam)
fun_read_check(df_mutation = exampleMutation,
               df_bam = exampleBam,
               ref_genome = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
               sample_name = "sample",
               read_length = 150,
               adapter_1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
               adapter_2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
               short_homology_search_length = 4,
               min_homology_search = 40,
               progress_bar = "N"
)
## End(Not run)
This function attempts to check the repetitive sequence around the mutation.
fun_repeat_check(rep_a, rep_b, ref_seq, ref_width, del)
rep_a |
The shorter sequence of Ref and Alt. |
rep_b |
The longer sequence of Ref and Alt. |
ref_seq |
Reference sequence around the mutation. |
ref_width |
Search length for ref_seq. |
del |
Insertion: 0, Deletion: 1 |
list(pre_rep_status, post_rep_status, pre_rep_short, post_rep_short, homopolymer_status)
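One ingredient of a repeat check is measuring homopolymer runs in the reference sequence. The sketch below uses a hypothetical helper `longest_homopolymer` built on base R's `rle`; it is not the implementation of fun_repeat_check and does not reproduce its pre/post repeat statuses.

```r
# Hypothetical helper for illustration only; not part of MicroSEC.
# Returns the length of the longest run of one repeated base in a DNA string.
longest_homopolymer <- function(seq) {
  bases <- strsplit(seq, "")[[1]]
  if (length(bases) == 0) {
    return(0L)  # nothing to measure for an empty input
  }
  max(rle(bases)$lengths)
}

# Compare a run length with a homopolymer_length cutoff such as 15,
# the value used in the fun_analysis example of this manual.
longest_homopolymer("ACGTTTTTGA")
longest_homopolymer("ACGTTTTTGA") >= 15
```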
This function attempts to save the filtering results.
fun_save(msec, output)
msec |
Mutation filtering information. |
output |
output file name (full path). |
## Not run:
data(msec_analyzed)
fun_save(msec = msec_analyzed,
         output = "./MicroSEC_test.tsv.gz"
)
## End(Not run)
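Judging from the arguments and return values documented in this manual, the functions appear to chain into a pipeline: load, read check, homology search, summary, analysis, save. The wrapper below is a sketch of that chain under stated assumptions. `run_microsec_sketch` is a hypothetical name, the thresholds are copied from the fun_analysis example, and the positional unpacking of the fun_read_check result assumes the documented order list(msec, homology_search, mut_depth).

```r
# Hypothetical wrapper for illustration; requires the MicroSEC package and
# a matching BSgenome package when it is actually called.
run_microsec_sketch <- function(bam_file, mutation_file, sample_name,
                                organism, read_length,
                                adapter_1, adapter_2, output) {
  ref_genome <- MicroSEC::fun_load_genome(organism)
  chr_no <- MicroSEC::fun_load_chr_no(organism)
  df_bam <- MicroSEC::fun_load_bam(bam_file)
  df_mutation <- MicroSEC::fun_load_mutation(
    mutation_file, sample_name, ref_genome, chr_no
  )

  checked <- MicroSEC::fun_read_check(
    df_mutation = df_mutation, df_bam = df_bam, ref_genome = ref_genome,
    sample_name = sample_name, read_length = read_length,
    adapter_1 = adapter_1, adapter_2 = adapter_2,
    short_homology_search_length = 4, min_homology_search = 40,
    progress_bar = "N"
  )
  # Assumption: elements follow the documented order
  # list(msec, homology_search, mut_depth).
  msec <- checked[[1]]
  df_distant <- checked[[2]]
  mut_depth <- checked[[3]]

  msec <- MicroSEC::fun_homology(
    msec = msec, df_distant = df_distant, min_homology_search = 40,
    ref_genome = ref_genome, chr_no = chr_no, progress_bar = "N"
  )
  msec <- MicroSEC::fun_summary(msec)
  # Thresholds copied from the fun_analysis example in this manual.
  msec <- MicroSEC::fun_analysis(
    msec = msec, mut_depth = mut_depth,
    short_homology_search_length = 4, min_homology_search = 40,
    threshold_p = 10 ^ (-6), threshold_hairpin_ratio = 0.50,
    threshold_short_length = 0.75, threshold_distant_homology = 0.15,
    threshold_soft_clip_ratio = 0.50, threshold_low_quality_rate = 0.1,
    homopolymer_length = 15
  )
  MicroSEC::fun_save(msec = msec, output = output)
  invisible(msec)
}
```

Defining the wrapper does not load MicroSEC; the package is only needed once the function is called with real file paths.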
This function attempts to find the mutated bases in each read.
fun_setting(pre, post, neighbor_seq, neighbor_length, alt_length)
pre |
The 5' side bases of the sequence for searching. |
post |
The 3' side bases of the sequence for searching. |
neighbor_seq |
Short reference sequence around the mutation. |
neighbor_length |
The length from the mutation to the ends of the short reference sequence. |
alt_length |
The length of altered bases. |
list(pre_search_length, post_search_length, peri_seq_1, peri_seq_2)
This function summarizes the filtering results.
fun_summary(msec)
msec |
Mutation filtering information. |
msec
data(msec_homology_searched)
fun_summary(msec_homology_searched)
This function attempts to divide without 0/0 errors.
fun_zero(a, b)
a, b |
Integers |
a divided by b
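A division guarded against a zero denominator could look like the sketch below. `safe_divide` is a hypothetical name, and returning 0 when b is 0 is an assumption about the intended behavior, not something this manual states.

```r
# Hypothetical helper for illustration; not the MicroSEC implementation.
# Assumption: a zero denominator yields 0 instead of NaN or Inf.
safe_divide <- function(a, b) {
  ifelse(b == 0, 0, a / b)
}

safe_divide(6, 3)
safe_divide(0, 0)
safe_divide(c(1, 0), c(4, 0))
```

Using `ifelse` keeps the helper vectorized, so whole columns of counts can be divided in one call.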
A dataset containing the information of reads for homology search.
homology_searched
A list with 7 factors, each containing 1508 variables:
sample name
altered chromosome
altered position
reference base
altered base
5' (pre) or 3' (post) sequence of the mutated base
sequence for homology search
...
A dataset containing the information of eight mutations processed by the fun_analysis function.
msec_analyzed
A list with 37 factors, each containing 29 variables
sample name
mutation type
altered chromosome
altered position
reference base
altered base
mutation locating repeat sequence
neighborhood sequence
read length
number of mutation supporting reads
number of soft-clipped reads
number of reads produced by hairpin structure
maximum 5'-supporting length
maximum 3'-supporting length
minimum supporting length
5'-farthest supported base from the mutated base
3'-farthest supported base from the mutated base
low quality base rate
low quality base rate of 5'- side
low quality base rate of 3'- side
rate of reads derived from homologous regions
rate of soft clipped reads
possibility of short-supporting length
possibility of 5'-supporting length
possibility of 3'-supporting length
filter 1
filter 2
filter 3
filter 4
filter 5
filter 6
filter 7
filter 8
any of filter 1-3
any of filter 1-4
any of filter 1-8
comment
...
A dataset containing the information of eight mutations processed by the fun_homology function.
msec_homology_searched
A list with 34 factors, each containing 29 variables
sample name
mutation type
altered chromosome
altered position
reference base
altered base
mutation locating repeat sequence
neighborhood sequence
read length
mutation type
length of the mutated bases
number of mutation supporting reads
number of soft-clipped reads
number of reads produced by hairpin structure
maximum length of palindromic sequences
maximum 5'-supporting length
maximum 3'-supporting length
minimum supporting length
minimum 5'-supporting length
minimum 3'-supporting length
5'-farthest supported base from the mutated base
3'-farthest supported base from the mutated base
low quality base rate
low quality base rate of 5'- side
low quality base rate of 3'- side
5'-repeat sequence length
3'-repeat sequence length
homopolymer sequence length
whether the mutation is indel or not
length of indel mutation
number of reads derived from homologous regions
5'-penalty score by the mapper
3'-penalty score by the mapper
comment
...
A dataset containing the information of eight mutations processed by the fun_read_check function.
msec_read_checked
A list with 34 factors, each containing 46527 variables
sample name
mutation type
altered chromosome
altered position
reference base
altered base
mutation locating repeat sequence
neighborhood sequence
read length
mutation type
length of the mutated bases
number of mutation supporting reads
number of soft-clipped reads
number of reads produced by hairpin structure
maximum length of palindromic sequences
maximum 5'-supporting length
maximum 3'-supporting length
minimum supporting length
minimum 5'-supporting length
minimum 3'-supporting length
minimum 5'-supporting length
low quality base rate
low quality base rate of 5'- side
low quality base rate of 3'- side
5'-farthest supported base from the mutated base
3'-farthest supported base from the mutated base
3'-repeat sequence length
homopolymer sequence length
whether the mutation is indel or not
length of indel mutation
number of reads derived from homologous regions
5'-penalty score by the mapper
3'-penalty score by the mapper
comment
...
A dataset containing the information of eight mutations processed by the fun_summary function.
msec_summarized
A list with 52 factors, each containing 29 variables
sample name
mutation type
altered chromosome
altered position
reference base
altered base
mutation locating repeat sequence
neighborhood sequence
read length
mutation type
length of the mutated bases
number of mutation supporting reads
number of soft-clipped reads
number of reads produced by hairpin structure
maximum length of palindromic sequences
maximum 5'-supporting length
maximum 3'-supporting length
minimum supporting length
minimum 5'-supporting length
minimum 3'-supporting length
5'-farthest supported base from the mutated base
3'-farthest supported base from the mutated base
low quality base rate
low quality base rate of 5'- side
low quality base rate of 3'- side
5'-repeat sequence length
3'-repeat sequence length
homopolymer sequence length
whether the mutation is indel or not
length of indel mutation
number of reads derived from homologous regions
5'-penalty score by the mapper
3'-penalty score by the mapper
comment
rate of reads derived from homologous regions
adjusted pre_minimum_length
adjusted pre_minimum_length
adjusted pre_minimum_length
adjusted pre_minimum_length
the shortest short_support_length
theoretically minimum 5'-supporting length
theoretically minimum 3'-supporting length
theoretically minimum supporting length
adjusted short_support_length
substituted/inserted length
half of the read length
range of short_support_length
range of pre_support_length
range of post_support_length
range of possible short_support_length
range of possible supporting length
rate of soft clipped reads
...
A dataset containing the mutation coverage data.
mut_depth_checked
Three lists with 201 factors, each containing 29 variables.