Senior Data Scientist · PhD · Computational Biology

Amy Francis

Data science, ML & drug discovery - and making these tools accessible to everyone.

I'm a Senior Data Scientist bridging computational biology and experimental research, with expertise in machine learning, bioinformatics, and tool development across cancer genomics, immunology, and drug discovery. I think a lot about how AI is changing science - and about making sure those changes reach every researcher, not just those with a computer science or machine learning background. My work tries to do that: building tools, leading interdisciplinary teams, and closing the gap between what these models can do and what researchers can access.

Cancer Genomics · Drug Discovery · Protein ML · RNA-seq · Foundation Models · AI in Science · 2× Hackathon Winner
Hackathon · First Place
PhD · Bioinformatics & ML · Bristol
Roche · AI Internship

About me

I try to read where science is heading,
and build the skills to meet it.

I started at the bench, building a CRISPR activation screening platform at the Crick Institute, working alongside wet-lab scientists and bioinformaticians and attending seminars from some of the best scientists in the world. My time at the Crick offered early exposure to the shifts taking place in the field: the scale of the data, the emergence of AI-driven methods, the widening gap between what we could measure and what we could interpret, and the potential impact of evolving technologies. This sparked my curiosity.

I'm a Senior Data Scientist at Nexus BioQuest, a contract research organisation in Portishead, Bristol, where I work across data science, machine learning, and analytical tool development in support of research programmes spanning pharmaceutical and biotech clients.

My PhD at the University of Bristol, funded by a competitive Cancer Research UK studentship, focused on predicting the functional impact of genetic variants in cancer genomes. During my PhD I published peer-reviewed papers on variant effect prediction, and I've since worked at Roche, exploring protein language models and antibody optimisation.

I've also led interdisciplinary teams to back-to-back hackathon wins at Cambridge and the Wellcome Collection, and co-organised Bristol's first AI in Health meeting, catalysing interdisciplinary research. The best science I've been part of has come from people with completely different mental models trying to solve the same problem. It generates questions that nobody in a single-discipline team would ever think to ask, and occasionally, those questions lead somewhere important.

Based in: Bristol, UK
Experience: 5+ years
Current role: Senior Data Scientist, Nexus BioQuest
PhD: Bioinformatics & ML, Bristol
PhD funded by: Cancer Research UK
Interests: AI in science, open collaboration

A perspective on AI in science

AI is going to be part of everyday science.
I think a big challenge now is getting these tools to the scientists who need them, including those who don't yet know they do. I believe that's where the next medicines will come from.

The tools are being developed, and the clinical evidence is now starting to follow. A significant challenge will now be getting them out of research papers and into the hands of the scientists who could use them to find new medicines, and opening up possibilities for researchers who don't even know what they're missing yet.

"Some of the most powerful tools in the history of biology are sitting in research papers and GitHub repos that most bench scientists have never heard of. Building the bridges to get them there, through upskilling and through teams where computational and experimental scientists learn from each other, is how I think we find the next generation of medicines."

AlphaFold transformed protein structure prediction in 2021 - making accurate structural models available for hundreds of millions of proteins, where solving a single structure could previously take years of experimental work. AI-designed drugs are now reaching clinical trials. The industry is adapting, with pharma companies partnering with specialist AI firms and embedding AI infrastructure directly into their R&D pipelines. The pace of change is accelerating.

But using these tools well still requires an unusual combination of factors and skills - enough ML to run and adapt the models, the compute infrastructure to work with them, and enough domain knowledge to ask the right questions. Most biologists have one of those things, maybe two. Closing that gap - whether through upskilling or building teams that complement each other's strengths - is something I think the field needs to prioritise in the coming years. This is what drives most of what I build and write about.

A deeper analysis - covering the partnerships, the clinical evidence, how infrastructure investments could lead to more breakthroughs, and whether any of this produces better drugs - is in the blog.

Foundation models reshaping biology

AlphaFold 2 & 3
DeepMind / Google
Predicted structures for 200M+ proteins by 2022. AlphaFold 3 (2024) extends to DNA, RNA, and small molecules - relevant to structure-based drug design.
ESM-2 & ESMFold
Meta AI
Protein language models trained on 250M sequences. Enable zero-shot fitness and function prediction.
Evo 1 & 2
Arc Institute
Genomic foundation models trained across the tree of life on DNA sequences. Enable generative design of genes, regulatory elements, and CRISPR guides.
RFdiffusion
Baker Lab, UW
Diffusion model for de novo protein design - binders, enzymes, vaccine candidates - designed from scratch and experimentally validated.
BioNeMo
NVIDIA
An open platform wrapping biological foundation models - ESMFold, DiffDock, MolMIM - behind accessible APIs. Used by Amgen, Genentech, AstraZeneca, GSK, and Novo Nordisk. Over 100 firms on the platform as of 2024.
DiffDock & MolMIM
NVIDIA / MIT
DiffDock predicts protein-ligand binding poses using diffusion - faster than traditional docking. MolMIM generates novel small molecules optimised for target properties.
Geneformer
Broad Institute
Transformer pre-trained on 30M single-cell transcriptomes. Supports in silico gene perturbation and network inference - useful for target identification without wet-lab experiments.
scGPT
University of Toronto
Single-cell foundation model pre-trained on 33M cells. Enables cell type annotation, perturbation prediction, and gene regulatory network inference.
TxGNN
Harvard / Broad
Graph neural network trained on biomedical knowledge graphs for drug indication and contraindication prediction - a tool for repurposing existing compounds.
Pharma.AI / INS018_055
Insilico Medicine
End-to-end AI drug discovery platform. Used to design INS018_055, the first fully AI-generated drug to show efficacy in a Phase IIa trial (Nature Biotechnology, 2024). Target to Phase I in under 30 months.

Projects & Hackathons

Things I've built and worked on.

A mix of published tools, hackathon projects, and pipelines built for research problems.

🏆 1st Place · Cambridge

Mapping novel compounds to biological pathways

Led the winning team at GetSeen Ventures' AI × Cancer Bio Hackathon. Used transformer encoders on SMILES strings and high-content image embeddings from the RxRx3-core dataset to predict molecular pathways.

Transformers · SMILES · Image Embeddings · Drug Discovery
🏆 1st Place · Wellcome Collection

Deep learning for protein fitness prediction

Led the winning team at the Roche & HDR Hackathon. Encoded protein sequences with pre-trained language models (ESM, AntiBERT) and explored CNNs to model sequence-function relationships using DMS data from ProteinGym. Secured a Roche AI internship.

ESM · AntiBERT · CNNs · DMS · PyTorch
Read opinion piece
Alan Turing Institute

Toxicity prediction for drug discovery: Data Study Group with Ignota Labs

Participated in the Alan Turing Institute's Data Study Group in collaboration with Ignota Labs, working on machine learning approaches to predict drug toxicity.

Python · ML · Drug Discovery · Toxicity Prediction
Team summary · Full report
Flow Cytometry

Flow cytometry analysis pipeline

Built a post-acquisition flow cytometry analysis pipeline with an intuitive Streamlit interface for processing and visualising high-dimensional cytometry data, enabling efficient downstream reporting across projects.

Python · Streamlit · Scikit-learn
Cancer Genomics

DrivR-Base - variant annotation toolkit

Published a data mining toolkit integrating molecular annotations for SNVs, creating a centralised resource that reduces redundancy and accelerates machine learning model development for variant effect prediction.

Python · R · Docker · COSMIC · GnomAD
Read paper
Antibody Optimisation

Predicting mutation impact on antibody-antigen binding

At Roche pRED, used TensorFlow models grounded in global epistasis and pre-trained protein language models to predict binding affinity from deep mutational scanning data.

TensorFlow · ESM · HPC · DMS
Global Epistasis repo
Mentoring & Leadership

Automating ELISA analysis at Nexus BioQuest

Currently leading a project to automate ELISA data processing and reporting. Mentoring a team member through this project as part of their coding development - with weekly stand-ups, fortnightly code reviews, and task breakdowns.

Python · Pandas · Streamlit · Mentoring · In Progress
Community

Bristol AI in Health Meeting

Co-organised Bristol's first interdisciplinary AI in Health Meeting in collaboration with the Elizabeth Blackwell Institute. Facilitated cross-disciplinary collaboration that resulted in two interdisciplinary grants for applied AI projects.

Leadership · Event Organisation · Grant Facilitation

Skills & Tools

The tools I work with.

Picked up across academia, industry, and a few hackathons.

Languages & Systems

  • Python
  • R
  • Linux / HPC
  • Docker

Machine Learning

  • Scikit-learn
  • XGBoost / SVMs
  • Neural Networks
  • Foundation & Language Models
  • TensorFlow / PyTorch

RNA-seq & Transcriptomics

  • Count matrices: Cell Ranger
  • Quantification: featureCounts
  • Differential expression: DESeq2, edgeR, limma
  • Unsupervised ML for cell population definition
  • Immunology cell type characterisation
  • UMAP / t-SNE dimensionality reduction
  • Pathway & gene set enrichment analysis

Bioinformatics & Data

  • Flow Cytometry
  • Deep Mutational Scanning
  • CRISPR / Image Analysis
  • COSMIC, GnomAD, TCGA
  • Protein / DNA Sequences
  • Proteomics (Olink)
  • scRNA-seq

Visualisation & Comms

  • Streamlit
  • Matplotlib / Seaborn
  • Scientific writing
  • Client consultation
  • Conference presenting

Leadership & Collaboration

  • Hackathon interdisciplinary team lead (2× winner)
  • User-centred design
  • Technical mentoring & coaching
  • Breaking down technical tasks for non-programmers
  • Code review & structured feedback

Publications

Published research.

Three peer-reviewed papers and one technical report, spanning cancer genomics, variant effect prediction, and drug discovery.

Journal Article · 2026
CanDrivR-CS: A Cancer-Specific Machine Learning Framework for Distinguishing Recurrent and Rare Variants
Bioinformatics Advances - Accepted & Published
doi.org/10.1093/bioadv/vbag008
Application Note · 2024
DrivR-Base: A Feature Extraction Toolkit For Variant Effect Prediction Model Construction
Bioinformatics - Accepted & Published
doi.org/10.1093/bioinformatics/btae197
Review Article · 2023
Predicting Pathogenicity from Non-Coding Mutations
Nature Biomedical Engineering - Accepted & Published
doi.org/10.1038/s41551-022-00996-x
Technical Report · 2024
Toxicity Prediction for Drug Discovery
The Alan Turing Institute, Data Study Group · Not peer-reviewed
doi.org/10.5281/zenodo.13882192

Contact