TY - GEN
T1 - Safety and Completeness in Flow Decompositions for RNA Assembly
AU - Khan, Shahbaz
AU - Kortelainen, Milla
AU - Cáceres, Manuel
AU - Williams, Lucia
AU - Tomescu, Alexandru I.
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Flow decomposition has numerous applications, ranging from networking to bioinformatics. Some applications require any valid decomposition that optimizes some property as number of paths, robustness, or path lengths. Many bioinformatic applications require the specific decomposition which relates to the underlying data that generated the flow. Thus, no optimization criteria guarantees to identify the correct decomposition for real inputs. We propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. Ma et al. [WABI 2020] addressed the existence of multiple optimal solutions in a probabilistic framework, which is referred to as non-identifiability. Later, they gave a quadratic-time algorithm [RECOMB 2021] based on a global criterion for solving a problem called AND-Quant, which generalizes the problem of reporting whether a given path is safe. We present the first local characterization of safe paths for flow decompositions in directed acyclic graphs, giving a practical algorithm for finding the complete set of safe paths. We also evaluated our algorithm against the trivial safe algorithms (unitigs, extended unitigs) and a popular heuristic (greedy-width) for flow decomposition on RNA transcripts datasets. Despite maintaining perfect precision our algorithm reports ≈ 50% higher coverage over trivial safe algorithms. Though greedy-width reports better coverage, it has significantly lower precision on complex graphs. On a unified metric (F-Score) of coverage and precision, our algorithm outperforms greedy-width by ≈ 20%, when the evaluated dataset has significant number of complex graphs. Also, it has superior time (3–5 × ) and space efficiency (1.2–2.2 × ), resulting in a better and more practical approach for bioinformatics applications of flow decomposition.
AB - Flow decomposition has numerous applications, ranging from networking to bioinformatics. Some applications require any valid decomposition that optimizes some property as number of paths, robustness, or path lengths. Many bioinformatic applications require the specific decomposition which relates to the underlying data that generated the flow. Thus, no optimization criteria guarantees to identify the correct decomposition for real inputs. We propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. Ma et al. [WABI 2020] addressed the existence of multiple optimal solutions in a probabilistic framework, which is referred to as non-identifiability. Later, they gave a quadratic-time algorithm [RECOMB 2021] based on a global criterion for solving a problem called AND-Quant, which generalizes the problem of reporting whether a given path is safe. We present the first local characterization of safe paths for flow decompositions in directed acyclic graphs, giving a practical algorithm for finding the complete set of safe paths. We also evaluated our algorithm against the trivial safe algorithms (unitigs, extended unitigs) and a popular heuristic (greedy-width) for flow decomposition on RNA transcripts datasets. Despite maintaining perfect precision our algorithm reports ≈ 50% higher coverage over trivial safe algorithms. Though greedy-width reports better coverage, it has significantly lower precision on complex graphs. On a unified metric (F-Score) of coverage and precision, our algorithm outperforms greedy-width by ≈ 20%, when the evaluated dataset has significant number of complex graphs. Also, it has superior time (3–5 × ) and space efficiency (1.2–2.2 × ), resulting in a better and more practical approach for bioinformatics applications of flow decomposition.
KW - DAGs
KW - Flow decomposition
KW - RNA assembly
KW - Safety
UR - https://www.scopus.com/pages/publications/85131150001
U2 - 10.1007/978-3-031-04749-7_11
DO - 10.1007/978-3-031-04749-7_11
M3 - Conference contribution
AN - SCOPUS:85131150001
SN - 9783031047480
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 177
EP - 192
BT - Research in Computational Molecular Biology - 26th Annual International Conference, RECOMB 2022, Proceedings
A2 - Pe’er, Itsik
PB - Springer Science and Business Media Deutschland GmbH
T2 - 26th International Conference on Research in Computational Molecular Biology, RECOMB 2022
Y2 - 22 May 2022 through 25 May 2022
ER -