Development of the amplification strategy and NGS data processing pipeline for HIV-1 genome assembly and resistance genotyping

Alexey Masharskiy, Andrey Komissarov, Artem Fadeev and Nikita Yolshin*

Smorodintsev Research Institute of Influenza

nikita.yolshin [at] gmail.com

Abstract

Next-generation sequencing (NGS) is increasingly used for HIV-1 genotyping due to its scalability, sensitivity, and lower cost per sample compared to Sanger sequencing. The aim of this study was to develop and validate a workflow for large-scale analysis of HIV-1 NGS data, including consensus genome assembly, subtype assignment, drug resistance mutation detection, and cell tropism prediction.

A total of 1888 HIV-1-positive plasma samples collected in Russia during 2024–2025 were analyzed. Sequencing libraries were generated from nested PCR amplicons covering the HIV-1 pol and env genes and sequenced on Illumina NextSeq 2000 and MGI DNBSEQ-G400 platforms.

A custom analysis pipeline implemented in Python was developed. Raw FASTQ reads were quality-trimmed with fastp and then assembled de novo with MEGAHIT. In parallel, reads were mapped with BWA-MEM to a subtype reference dataset from the Los Alamos HIV database. The best-fitting reference was selected by the number of mapped reads reported by Samtools idxstats. Consensus generation included iterative remapping and refinement using Minimap2, Samtools, BCFtools, and iVar.
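The reference-selection step can be illustrated with a minimal sketch. It assumes that "best-fitting" means the reference with the highest mapped read count, and it parses the standard four-column, tab-separated output of Samtools idxstats (reference name, length, mapped reads, unmapped reads). The function name and the reference names in the usage note are hypothetical and are not taken from the pipeline itself.

```python
def pick_best_reference(idxstats_text: str) -> tuple[str, int]:
    """Return (reference name, mapped read count) for the top-scoring reference.

    Expects the raw text printed by `samtools idxstats`.
    Assumption: the reference with the most mapped reads is the best fit.
    """
    best_name, best_count = None, -1
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        name, mapped = fields[0], int(fields[2])
        if name == "*":
            continue  # the "*" row only summarizes unmapped reads
        if mapped > best_count:  # on ties, the first-listed reference is kept
            best_name, best_count = name, mapped
    if best_name is None:
        raise ValueError("no reference sequences found in idxstats output")
    return best_name, best_count
```

For example, given rows for two hypothetical references `A6_ref` (15230 mapped reads) and `B_HXB2` (4021 mapped reads), the function returns `("A6_ref", 15230)`. Raw counts favor longer references when reference lengths differ; normalizing by the length column would be one possible variation.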

Subtype determination was performed using COMET, Sierra, and Nextclade tools with a consensus-based decision strategy. Drug resistance mutations were identified using a local Docker deployment of Sierra. CXCR4 tropism prediction was performed using a custom Python implementation of empirical V3-loop rules and compared with Geno2pheno [coreceptor].
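The two decision steps above can be sketched as follows. The abstract does not specify the exact consensus rule or which V3-loop rules were implemented, so both functions are illustrative assumptions: the subtype call is taken as a simple majority among the three tools (with labels assumed to be already harmonized to one nomenclature), and tropism is illustrated with the widely cited 11/25 rule, under which a positively charged residue (K or R) at V3 position 11 or 25 suggests CXCR4 use. All function names are hypothetical.

```python
from collections import Counter

POSITIVE_RESIDUES = {"K", "R"}  # lysine and arginine


def consensus_subtype(calls: dict[str, str], min_agree: int = 2) -> str:
    """Majority vote over per-tool subtype calls (assumed decision rule).

    `calls` maps a tool name to its subtype label; empty or None labels
    are ignored. Returns "discordant" if no label reaches `min_agree`.
    """
    counts = Counter(label for label in calls.values() if label)
    if not counts:
        return "unassigned"
    subtype, votes = counts.most_common(1)[0]
    return subtype if votes >= min_agree else "discordant"


def x4_by_11_25_rule(v3: str) -> bool:
    """Apply the 11/25 rule to a 35-residue V3 amino acid sequence.

    Assumption: the input is an ungapped, canonical-length V3 loop, so
    positions 11 and 25 map directly to string indices 10 and 24.
    """
    v3 = v3.strip().upper()
    if len(v3) != 35:
        raise ValueError("expected a 35-residue V3 loop; align the sequence first")
    return v3[10] in POSITIVE_RESIDUES or v3[24] in POSITIVE_RESIDUES
```

With calls of `A6` from two tools and `CRF63_02A6` from the third, `consensus_subtype` returns `"A6"`; three different calls yield `"discordant"`. V3 loops with insertions or deletions would need alignment to a canonical reference before the positional rule applies, which this sketch does not attempt.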

The pipeline generated consensus pol sequences for all 1888 analyzed samples and env sequences for 60% of them. The workflow enabled simultaneous characterization of HIV-1 subtype diversity, resistance mutations, and tropism. The predominant subtype was A6 (72.4%), followed by CRF63_02A6 (22.2%). The most frequent resistance-associated mutations were M184V/I, K65R, K103N, and G190S. Resistance to NNRTIs and NRTIs was most prevalent among ART-experienced patients. CXCR4-associated variants were detected infrequently.

The developed amplification strategy and bioinformatics workflow provide a scalable solution for HIV-1 NGS analysis and molecular surveillance. Iterative consensus refinement and combined reference-guided/de novo assembly improved robustness for genetically diverse HIV-1 variants. The pipeline enables high-throughput resistance and tropism monitoring and can be adapted for routine epidemiological surveillance and clinical genomics applications.

Keywords: HIV, antiretroviral therapy, drug-resistance mutations