Project 1: Automatic GTV Segmentation from IGTV in planning CT
Aim
Automatically generate clinically usable GTVs on the planning CT or a selected respiratory-phase CT, given an existing IGTV. Combine conventional 3D segmentation methods with segmentation foundation models (e.g., SAM2 / MedSAM2) and explore weakly/semi-supervised strategies to reduce annotation needs. Produce end-to-end results and provide basic uncertainty indicators for human review.
Meaning
IGTV is commonly available in clinical practice, while per-phase GTV annotations are scarce. Using the IGTV as a constraint to infer single-phase GTVs is therefore reasonable and can improve workflow efficiency and consistency. The prompting and fine-tuning capabilities of foundation models can substantially lower labeling cost; combined with weakly/semi-supervised methods, practical accuracy can be achieved with few annotations. The outcome can support routine planning and adaptive workflows and reduce repetitive manual contouring.
Resources
- Data: An existing Johns Hopkins set (60 cases: planning CT / 4D CT + IGTV), supplemented with public datasets to improve generalization.
- Concept: Model the task as "predict the GTV on a CT within the IGTV region." Use a hybrid strategy: a 3D segmentation baseline, prompt-based or lightly fine-tuned foundation models, and weakly/semi-supervised training to expand supervision.
- Baseline model: 3D U-Net / nnU-Net variant taking CT + IGTV mask as input; incorporate IGTV-based constraints during training/postprocessing.
- Foundation models: Apply SAM2 / MedSAM2 (and similar models) in prompt-as-mask mode and/or with light fine-tuning/adapters; include a simple 3D fusion step to preserve inter-slice consistency.
- Weakly/semi-supervised methods: Pseudo-label self-training, mean-teacher / consistency training, and hybrid supervision that treats the IGTV as an upper-bound spatial constraint (the GTV at any phase lies within the IGTV).
- Postprocessing & QC: Connected-component filtering, volume thresholds, uncertainty flags for clinician review; support export to DICOM-RTSTRUCT.
- Evaluation: Quantify Dice, 95th-percentile Hausdorff distance (HD95), relative volume error, centroid shift, and relevant dosimetric metrics; compare the baseline, foundation-model variants, and semi-supervised approaches.
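The IGTV upper-bound constraint mentioned above can enter the pipeline in two places: as a hard spatial mask at inference and as a soft penalty during training. A minimal NumPy sketch of both (function names are illustrative, not from an existing codebase):

```python
import numpy as np

def igtv_constrained_prediction(prob, igtv_mask, threshold=0.5):
    """Hard constraint at inference: a voxel outside the IGTV
    cannot belong to the predicted GTV."""
    return (prob >= threshold) & igtv_mask.astype(bool)

def outside_igtv_penalty(prob, igtv_mask):
    """Soft training penalty: mean predicted GTV probability
    assigned to voxels outside the IGTV (should be driven to 0)."""
    outside = ~igtv_mask.astype(bool)
    if outside.sum() == 0:
        return 0.0
    return float(prob[outside].mean())
```

The penalty can be added to the segmentation loss with a small weight so the network learns the constraint rather than relying only on postprocessing.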
Interested? Contact [email protected]
Project 2: A vision-language model approach for lung nodule detection in planning CT
Aim
Build an assistance system that takes planning CT + EHR (radiation therapy records, radiology reports, pathology, clinical summaries, etc.) as input, outputs candidate lesions (heatmap / bounding box / center points), and uses the EHR to provide confidence/prioritization cues for rapid review and downstream contouring by radiation therapists and physicists.
Meaning
- Clinical value: Use EHR priors (side, lobe, history) to reduce misses (e.g., a “previous right upper lobe lesion” raises the priority of RUL candidates), shorten review and contouring time, and lower the risk of target misses in radiation planning.
- Scientific value: Evaluate image–text fusion benefits in low-sample / weak-label medical settings and investigate synthetic report generation for cross-modal pretraining. Provide a foundation for downstream lesion characterization (benign vs malignant, surveillance recommendation).
- Engineering & deployment value: Modular design (candidate generator + textual re-ranker) eases PACS/TPS integration and deployment. Produce auditable outputs and evidence snippets for clinical traceability and regulatory review.
Resources
Targets:
- Recall-first behavior: on the FROC curve, noticeably outperforms an image-only baseline at common FP/scan operating points (e.g., recall improvement ≥ 5–10% at ~1–2 FP/scan).
- Localization accuracy: center distance error within clinically acceptable range (e.g., mean center error < 5 mm, or IoU ≥ 0.3 for small nodules).
- Usability: outputs include explainable evidence (candidate heatmap + corresponding EHR snippets) and allow threshold tuning to control the number of candidates.
- Design priorities: maximize recall → interpretability → easy integration (modular interfaces, minimal disruption of existing workflow).
Data sources:
- Local: ~200 planning CTs from Hopkins (or similar), with associated EHR: RT-planning notes, radiology reports, discharge summaries, prior imaging descriptions, pathology/surgery records.
- Public: LUNA16 / LUNA25 / other public CT nodule sets for visual pre-training and candidate generator tuning.
Pre/Annotation:
- Image preprocessing: Resample to a consistent spacing (e.g., 1 mm isotropic or the clinical standard), clip to a HU window (e.g., [-1000, 400] HU), and normalize intensities. (Optional: lung segmentation to reduce search space and noise in thoracic cases.)
- Annotation normalization: Standardize annotations as center+radius or bbox (x,y,z, size) and include confidence/malignancy scores when available.
- De-identification & NLP extraction: Remove names, IDs, and absolute dates; extract key fields with rule-based and model-based NLP: laterality, anatomical location (lobe/segment), timing, prior surgeries/radiation, and pathology results; retain the original sentences as short evidence snippets.
- For public image sets, generate synthetic short reports (templated by location, size, and suspicion) to enable cross-modal pretraining.
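The synthetic-report step above can be as simple as filling a template from the public set's structured annotations (e.g., LUNA-style center + diameter). A hedged sketch, with hypothetical field names:

```python
def synthesize_report(laterality, lobe, diameter_mm, suspicion):
    """Generate a short templated pseudo-report from structured
    nodule annotations, for use as weak text supervision in
    cross-modal pretraining."""
    size_desc = "sub-centimeter" if diameter_mm < 10 else f"{diameter_mm:.0f} mm"
    return (f"A {size_desc} nodule is noted in the {laterality} {lobe} lobe. "
            f"Suspicion level: {suspicion}.")
```

Varying the templates (and randomizing phrasing) helps keep the text encoder from overfitting to a single sentence pattern.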
Model architecture (modular):
- Visual candidate generator (recall-oriented): 3D heatmap/detection network (e.g., nnU-Net, nnDetection, 3D RetinaNet, 3D U-Net heatmap) producing N candidate RoIs per scan with high sensitivity.
- Text encoder: Pretrained clinical transformer (ClinicalBERT, PubMedBERT, Bio+SBERT) to convert EHR snippets into vectors.
- Fusion / re-ranking module (lightweight preferred initially): Extract visual features per RoI (RoI pooling or small CNN/MLP), concatenate with text embedding, and pass through an MLP or shallow cross-attention layer to produce a re-score for each candidate.
- Optional stronger approach if data allows: cross-modal transformer or CLIP-style contrastive alignment for tighter semantic alignment.
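The lightweight fusion option above (per-RoI visual feature concatenated with the text embedding, then an MLP re-score) fits in a few lines. A NumPy sketch with random placeholder weights standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_rescore(roi_feats, text_emb, w1, b1, w2, b2):
    """Concatenate each RoI's visual feature with the (shared) EHR
    text embedding and pass through a 2-layer MLP to produce one
    re-score per candidate."""
    n = roi_feats.shape[0]
    text_tiled = np.tile(text_emb, (n, 1))   # same EHR vector for every RoI
    x = np.concatenate([roi_feats, text_tiled], axis=1)
    h = np.maximum(0.0, x @ w1 + b1)         # ReLU hidden layer
    return (h @ w2 + b2).ravel()             # one scalar score per RoI

# Illustrative shapes: 5 candidate RoIs, 32-d visual features, 16-d text embedding.
roi_feats = rng.normal(size=(5, 32))
text_emb = rng.normal(size=(16,))
w1 = rng.normal(size=(48, 8)); b1 = np.zeros(8)
w2 = rng.normal(size=(8, 1));  b2 = np.zeros(1)
scores = mlp_rescore(roi_feats, text_emb, w1, b1, w2, b2)
```

In practice the same structure would be implemented in the training framework of the detector; only the fusion head is new.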
Outputs:
- Ranked candidate list (coordinates, size, re-score), corresponding EHR evidence snippet(s), and visualization heatmap for review.
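One possible shape for the ranked output above, written as plain Python records (field names are illustrative, not a fixed schema):

```python
def make_candidate(center_mm, size_mm, rescore, evidence):
    """One reviewable output record: location, size, fused score,
    and the EHR snippet(s) that influenced the score."""
    return {
        "center_mm": center_mm,   # (x, y, z) in patient coordinates
        "size_mm": size_mm,
        "rescore": rescore,
        "evidence": evidence,     # list of EHR evidence snippet strings
    }

candidates = sorted(
    [make_candidate((12.0, -40.5, 88.0), 9.5, 0.91,
                    ["... previously noted RUL nodule ..."]),
     make_candidate((-30.2, 10.0, 120.0), 6.0, 0.47, [])],
    key=lambda c: c["rescore"], reverse=True)
```

Keeping the evidence snippets attached to each record is what makes the ranking auditable for clinical review.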
Training:
- Phase A — Visual pretraining: Train the high-recall detector on public datasets (LUNA) with losses such as focal/BCE for heatmaps, L1 for center regression, IoU loss for box refinement. Use augmentations (translation, rotation, intensity perturbation) and hard negative mining for robustness.
- Phase B — Cross-modal pre-warming (optional): Use public CTs + synthetic reports for CLIP-style contrastive pretraining to align RoI visual embeddings with text embeddings.
- Phase C — Fine-tuning with local EHR (critical): Freeze most of the pretrained visual backbone; train the RoI feature head and the fusion/re-ranking layers to avoid overfitting. Use ranking losses (e.g., pairwise ranking loss) or a cross-entropy re-scoring target so that RoIs truly matching the EHR receive higher scores. Include hard negatives (other RoIs in the same scan that do not match the text) to improve discrimination.
- Phase D — End-to-end fine-tuning (if sufficient data): If local aligned image-text pairs become plentiful, consider deeper fine-tuning of cross-modal transformers or end-to-end training.
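The Phase C ranking objective can be written as a standard margin-based pairwise loss over (EHR-matching, hard-negative) RoI pairs from the same scan. A NumPy sketch of one common form:

```python
import numpy as np

def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge-style ranking loss: every EHR-matching RoI score should
    exceed every non-matching RoI score from the same scan by at
    least `margin`; violations are penalized linearly."""
    diffs = pos_scores[:, None] - neg_scores[None, :]   # all pos/neg pairs
    return float(np.maximum(0.0, margin - diffs).mean())
```

This is equivalent in spirit to a margin ranking loss in common deep-learning frameworks; the scan-level pairing is what injects the hard negatives.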
Evaluation & Studies:
- Technical metrics: FROC (recall vs FP/scan), recall@1/2 FP, AP, mean center distance (mm), IoU distribution.
- Clinical/user studies: Measure time saved in contouring, rate of treatment-plan changes prompted by candidates, small reader studies comparing assisted vs unassisted reading for accuracy and efficiency.
- Ablations: Compare image-only, image+synthetic text, image+real EHR; test different fusion methods and sensitivity to candidate-threshold settings.
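The FROC operating points above (recall at ~1–2 FP/scan) can be read off a threshold sweep over per-candidate scores, once candidates have been matched to ground truth as hits or misses. A simplified single-budget sketch (it does not deduplicate multiple hits on the same nodule, which a full FROC implementation would):

```python
import numpy as np

def recall_at_fp_per_scan(scores, is_hit, n_truths, n_scans, fp_budget):
    """Sweep score thresholds from high to low; return the best recall
    whose false-positive count stays within fp_budget * n_scans.
    `is_hit[i]` marks whether candidate i matches a ground-truth nodule."""
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(is_hit, dtype=bool)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(~hits)
    ok = fp <= fp_budget * n_scans
    if not ok.any():
        return 0.0
    return float(tp[ok].max() / n_truths)
```

Comparing this value between the image-only baseline and the EHR re-ranked list is exactly the recall-first target listed earlier.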
Deployment:
- Deployment design: Modular: the candidate generator runs on the image server, and the fusion/re-ranker can be an optional service. Expose configurable thresholds and maximum candidate counts; integrate via PACS/TPS plugins or a web UI.
Interested? Contact [email protected]
Project 3: Exploring Frequency-Guided Diffusion Model for Medical Imaging Translation
Aim
Reproduce and adapt Frequency-Guided Diffusion Models (FGDM) for CBCT→CT / MRI→CT translation; prototype extensions (band decomposition, adaptive weighting, learnable filters, frequency-domain losses) and evaluate on medical datasets.
Meaning
- Frequency priors (edges / FFT bands) help preserve structure and reduce artifacts—crucial for medical image translation where anatomical fidelity matters.
- Dynamic or learnable frequency guidance may better balance global structure vs fine detail across diffusion steps.
Resources
Codebases:
Datasets:
- SynthRAD2023: https://synthrad2023.grand-challenge.org/
- SynthRAD2025: https://synthrad2025.grand-challenge.org/
Methods:
- Frequency Band Decomposition: Split images into low/mid/high bands using filters such as wavelets and contourlets and guide the diffusion model with selected bands.
- Dynamic Frequency Guidance: Adjust frequency emphasis across diffusion steps (early, high-noise steps → low frequencies for global structure; late steps → high frequencies for detail refinement).
- Learnable Frequency Filters: Replace fixed Sobel/FFT with trainable convolutional filters to capture task-specific frequencies.
- Frequency Loss Functions: Add explicit losses in the frequency domain (FFT-MSE, band-wise SSIM) to enforce spectral fidelity.
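Both the band decomposition and the frequency-domain loss above can be prototyped with a plain FFT before moving to wavelets/contourlets. A NumPy sketch (the radial cutoff is a free parameter, and the hard mask is a stand-in for proper wavelet bands):

```python
import numpy as np

def low_high_split(img, cutoff):
    """Split a 2D image into low/high-frequency components using a
    hard radial mask in shifted FFT space."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    low_mask = r <= cutoff
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = img - low          # residual: the complementary high band
    return low, high

def fft_mse(a, b):
    """Frequency-domain MSE between two images (complex spectra)."""
    return float(np.mean(np.abs(np.fft.fft2(a) - np.fft.fft2(b)) ** 2))
```

The low band would guide early diffusion steps and the high band late refinement; `fft_mse` (or a band-wise variant) can be added to the training loss.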
Study Guide:
- Reproduce FGDM → adapt to CBCT→CT / MRI→CT → test extensions on medical datasets; measure structural fidelity and artifact suppression.
References:
- [1] Li, Y., Shao, H.-C., Liang, X., Chen, L., Li, R., Jiang, S., Wang, J., & Zhang, Y. (2024). Zero-Shot Medical Image Translation via Frequency-Guided Diffusion Models. *IEEE Transactions on Medical Imaging*, *43*(3), 980–993. https://doi.org/10.1109/TMI.2023.3325703
- [2] Zhang, Y., Li, L., Wang, J., Yang, X., Zhou, H., He, J., Xie, Y., Jiang, Y., Sun, W., Zhang, X., Zhou, G., & Zhang, Z. (2025). Texture-preserving diffusion model for CBCT-to-CT synthesis. *Medical Image Analysis*, *99*, 103362. https://doi.org/10.1016/j.media.2024.103362
- [3] Chen, J., Ye, Z., Zhang, R., Li, H., Fang, B., Zhang, L., & Wang, W. (2025). Medical image translation with deep learning: Advances, datasets and perspectives. *Medical Image Analysis*, *103*, 103605. https://doi.org/10.1016/j.media.2025.103605
Interested? Contact [email protected]
