Biology & Genomics
Genomic sequence analysis, protein design, single-cell analysis. ESM, RoseTTAFold, scGPT.
Setup walkthrough
- Biology AI spans: protein structure prediction, genomic sequence analysis, single-cell RNA-seq, and literature reasoning.
- For protein structure:
pip install colabfold(local AlphaFold2). Feed a FASTA sequence → predicted PDB structure. A 400-residue protein: 30-60 minutes on 12 GB GPU. - For protein design:
pip install proteinmpnn(ProteinMPNN — inverse folding: given a backbone structure, design an amino acid sequence that folds to it). Runs in seconds on GPU. - For genomic analysis:
pip install biopython+pip install esm(ESM-2, Meta's protein language model). ESM-2 computes embeddings for protein sequences used for variant effect prediction, structure prediction, and functional annotation. - For single-cell RNA-seq:
pip install scvi-tools(scVI — deep generative model for scRNA-seq). Trains on 100K cells in 30-60 minutes on GPU. Handles batch correction, clustering, differential expression. - For biological literature Q&A: general LLMs handle PubMed-level biology questions. For specialized genomics knowledge, fine-tune or RAG over domain corpora.
The cheap setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs ColabFold for proteins up to 500 residues in 30-90 minutes — enough for most single-domain proteins. ESM-2 embeddings for 100K sequences in ~30 minutes. scVI training on 100K cells in 30-60 minutes. For a molecular biology lab automating routine analysis: $400-500 handles protein structure prediction, variant effect scoring, and single-cell analysis. Pair with Ryzen 5 5600 + 32 GB DDR4 + 1TB NVMe (genomic datasets are large). Total: ~$400-480. Biology AI at $400 replaces cloud compute costs for routine tasks.
The serious setup
Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs ColabFold for large multi-domain proteins (800+ residues) in 60-120 minutes. ESM-2 15B (largest variant) embeddings improve downstream prediction accuracy for variant effects. For genomics-scale analysis: training scVI on 1M+ cells in 2-4 hours. For a computational biology lab: RTX 3090 handles 80% of routine workloads. The remaining 20% (genome-wide association studies, metagenomics assembly) need CPU clusters, not GPU. Total: ~$1,800-2,200. For AlphaFold-class structure prediction at scale: dual RTX 3090 for parallel predictions (predict multiple proteins simultaneously).
Common beginner mistake
The mistake: Running ColabFold on a protein sequence, getting a predicted structure, and treating it as experimentally determined ground truth. Why it fails: AlphaFold/ColabFold predictions are statistical — they predict the most likely structure given the training data. For well-studied protein families: accuracy approaches experiment (~1-2 Å RMSD). For novel proteins, disordered regions, or proteins with rare folds: predictions can be wildly wrong (10+ Å RMSD) with high confidence (high pLDDT scores on wrong structures). The confidence metric (pLDDT) correlates with accuracy on average but is unreliable for individual predictions. The fix: Always check the predicted aligned error (PAE) plot — it shows which regions are reliably positioned relative to each other. Low PAE between domains = high confidence in relative orientation. High PAE = the domains could be anywhere. For publication-quality structures, validate with experimental methods (X-ray crystallography, cryo-EM) or multiple orthogonal predictions (AlphaFold2 + ESMFold + RoseTTAFold — agreement between methods increases confidence). Predicted ≠ determined.
Recommended setup for biology & genomics
Browse all tools for runtimes that fit this workload.
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
What breaks first
The errors most operators hit when running biology & genomics locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle biology & genomics before committing money.