Open Projects

Project 1: Automatic GTV Segmentation from IGTV in planning CT

Aim

Automatically generate clinically usable GTVs on the planning CT or a selected respiratory-phase CT, given an existing IGTV. Combine conventional 3D segmentation methods with segmentation foundation models (e.g., SAM2/MedSAM2) and explore weakly/semi-supervised strategies to reduce annotation needs. Produce end-to-end results and provide basic uncertainty indicators for human review.

Meaning

IGTV is commonly available in clinical practice, while per-phase GTV annotations are scarce. Using the IGTV as a constraint to infer single-phase GTVs is therefore reasonable and can improve workflow efficiency and consistency. The prompting and fine-tuning capabilities of foundation models can substantially lower labeling cost; combined with weakly/semi-supervised methods, practical accuracy may be achievable with few annotations. The outcome could support routine planning and adaptive workflows and reduce repetitive manual contouring.

Resources

  • Data: An existing Johns Hopkins set (60 cases: planning CT / 4D CT + IGTV), supplemented with public datasets to improve generalization.
  • Concept: Model the task as “predict the GTV on a CT within the IGTV region.” Use a hybrid strategy: a 3D segmentation baseline, foundation-model prompting/fine-tuning, and weakly/semi-supervised training to expand supervision.
  • Baseline model: 3D U-Net / nnU-Net variant taking CT + IGTV mask as input; incorporate IGTV-based constraints during training/postprocessing.
  • Foundation models: Apply SAM2, MedSAM2, and similar models in prompt-as-mask mode and/or with light fine-tuning/adapters; include a simple 3D fusion step to preserve inter-slice consistency.
  • Weakly/semi-supervised methods: Pseudo-label self-training, mean-teacher / consistency training, and hybrid supervision that treats the IGTV as an upper-bound constraint.
  • Postprocessing & QC: Connected-component filtering, volume thresholds, uncertainty flags for clinician review; support export to DICOM-RTSTRUCT.
  • Evaluation: Quantify Dice, 95% Hausdorff distance, relative volume error, centroid shift, and relevant dosimetric metrics; compare baseline, foundation‑model variants, and semi‑supervised approaches.
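As a concrete starting point for the evaluation above, here is a minimal sketch of three of the listed metrics (Dice, relative volume error, centroid shift) on binary masks, assuming NumPy arrays and voxel spacing in mm; the 95% Hausdorff distance and dosimetric metrics would come from dedicated tools and are omitted:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def relative_volume_error(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """(V_pred - V_gt) / V_gt, with volumes from voxel spacing in mm."""
    voxel = float(np.prod(spacing))
    return (pred.sum() * voxel - gt.sum() * voxel) / (gt.sum() * voxel)

def centroid_shift_mm(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Euclidean distance between the two mask centroids, in mm."""
    c_pred = np.array(np.nonzero(pred)).mean(axis=1) * np.asarray(spacing)
    c_gt = np.array(np.nonzero(gt)).mean(axis=1) * np.asarray(spacing)
    return float(np.linalg.norm(c_pred - c_gt))
```

In practice these would be computed per case and aggregated across the baseline, foundation-model, and semi-supervised arms.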

Interested? Contact [email protected]


Project 2: A vision language model approach for lung nodule detection in planning CT

Aim

Build an assistance system that takes a planning CT plus EHR data (radiation therapy records, radiology reports, pathology, clinical summaries, etc.) as input and outputs candidate lesions (heatmap / bounding boxes / center points), together with EHR-based confidence/prioritization cues, for rapid review and downstream contouring by radiation therapists and physicists.

Meaning

  • Clinical value: Use EHR priors (side, lobe, history) to reduce misses (e.g., “previous right upper lobe lesion” increases priority for RUL candidates), shorten review and contouring time, and lower risk of target-miss in radiation planning.
  • Scientific value: Evaluate image–text fusion benefits in low-sample / weak-label medical settings and investigate synthetic report generation for cross-modal pretraining. Provide a foundation for downstream lesion characterization (benign vs malignant, surveillance recommendation).
  • Engineering & deployment value: Modular design (candidate generator + textual re-ranker) eases PACS/TPS integration and deployment. Produce auditable outputs and evidence snippets for clinical traceability and regulatory review.

Resources

Targets:
  • Recall-first behavior: on the FROC curve, noticeably outperforms an image-only baseline at common FP/scan operating points (e.g., recall improvement ≥ 5–10% at ~1–2 FP/scan).
  • Localization accuracy: center distance error within clinically acceptable range (e.g., mean center error < 5 mm, or IoU ≥ 0.3 for small nodules).
  • Usability: outputs include explainable evidence (candidate heatmap + corresponding EHR snippets) and allow threshold tuning to control the number of candidates.
  • Design priorities: maximize recall → interpretability → easy integration (modular interfaces, minimal disruption of existing workflow).
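The recall-first target can be made concrete with a small helper that reads recall off the FROC curve at a chosen FP/scan budget. This is an illustrative sketch assuming candidates are pooled across scans and at most one candidate is counted per true lesion:

```python
import numpy as np

def recall_at_fp_per_scan(scores, is_tp, n_scans, n_lesions, fp_budget):
    """Recall at the FROC operating point where false positives per scan
    equal fp_budget. `scores` are candidate confidences (all scans pooled);
    `is_tp` marks candidates that hit a true lesion."""
    order = np.argsort(scores)[::-1]        # rank by descending confidence
    hits = np.asarray(is_tp)[order]
    fps = np.cumsum(1 - hits)               # running false-positive count
    tps = np.cumsum(hits)                   # running true-positive count
    keep = fps <= fp_budget * n_scans       # candidates within the FP budget
    return float(tps[keep][-1]) / n_lesions if keep.any() else 0.0
```

Sweeping `fp_budget` over a grid (e.g., 0.125 to 8 FP/scan) traces the full FROC curve for the baseline comparison.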
Data sources:
  • Local: ~200 planning CTs from Hopkins (or similar), with associated EHR: RT-planning notes, radiology reports, discharge summaries, prior imaging descriptions, pathology/surgery records.
  • Public: LUNA16 / LUNA25 / other public CT nodule sets for visual pre-training and candidate generator tuning.
Pre/Annotation:
  • Image preprocessing: Resample to consistent spacing (e.g., 1 mm isotropic or clinical standard), clip to a window (e.g., [-1000, 400] HU), intensity normalization. (Optional: lung segmentation to reduce search space and noise in thoracic cases.)
  • Annotation normalization: Standardize annotations as center+radius or bbox (x,y,z, size) and include confidence/malignancy scores when available.
  • EHR processing: De-identify (remove names, IDs, absolute dates) and extract key fields with rule-based and model-based NLP: laterality, anatomical location (lobe/segment), timing, prior surgeries/radiation, and pathology results; retain the original sentences as short evidence snippets.
  • For public image sets, generate synthetic short reports (templated by location, size, and suspicion) to enable cross-modal pretraining.
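The windowing and normalization steps above can be sketched in a few lines (resampling and lung segmentation are typically done with dedicated tools such as SimpleITK and are omitted here):

```python
import numpy as np

def preprocess_ct(volume_hu: np.ndarray, window=(-1000.0, 400.0)) -> np.ndarray:
    """Clip a CT volume (in HU) to the given window and rescale to [0, 1]."""
    lo, hi = window
    v = np.clip(volume_hu.astype(np.float32), lo, hi)
    return (v - lo) / (hi - lo)
```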
Model architecture (modular):
  • Visual candidate generator (recall-oriented): 3D heatmap/detection network (e.g., nnU-Net, nnDetection, 3D RetinaNet, 3D U-Net heatmap) producing N candidate RoIs per scan with high sensitivity.
  • Text encoder: Pretrained clinical transformer (ClinicalBERT, PubMedBERT, Bio+SBERT) to convert EHR snippets into vectors.
  • Fusion / re-ranking module (lightweight preferred initially): Extract visual features per RoI (RoI pooling or small CNN/MLP), concatenate with text embedding, and pass through an MLP or shallow cross-attention layer to produce a re-score for each candidate.
  • Optional stronger approach if data allows: cross-modal transformer or CLIP-style contrastive alignment for tighter semantic alignment.
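A minimal sketch of the lightweight fusion/re-ranking idea follows, with hypothetical feature dimensions (64-d RoI features, 32-d text embedding, 16 hidden units) and random weights standing in for a trained network:

```python
import numpy as np

def rerank(roi_feats, text_emb, w1, b1, w2, b2):
    """Score each RoI by concatenating its visual feature with the
    scan-level text embedding and passing the result through a
    one-hidden-layer MLP; returns one re-score per candidate."""
    n = roi_feats.shape[0]
    x = np.concatenate([roi_feats, np.tile(text_emb, (n, 1))], axis=1)
    h = np.maximum(x @ w1 + b1, 0.0)   # ReLU hidden layer
    return (h @ w2 + b2).ravel()

# Hypothetical dimensions; in practice these come from the RoI head
# and the clinical text encoder.
rng = np.random.default_rng(0)
roi_feats = rng.normal(size=(5, 64))
text_emb = rng.normal(size=32)
w1, b1 = rng.normal(size=(96, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
scores = rerank(roi_feats, text_emb, w1, b1, w2, b2)
order = np.argsort(scores)[::-1]       # candidates re-ranked by fused score
```

Keeping this module small and separate from the candidate generator is what makes the optional-service deployment below practical.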
Outputs:
  • Ranked candidate list (coordinates, size, re-score), corresponding EHR evidence snippet(s), and visualization heatmap for review.
Training:
  • Phase A — Visual pretraining: Train the high-recall detector on public datasets (LUNA) with losses such as focal/BCE for heatmaps, L1 for center regression, IoU loss for box refinement. Use augmentations (translation, rotation, intensity perturbation) and hard negative mining for robustness.
  • Phase B — Cross-modal pre-warming (optional): Use public CTs + synthetic reports for CLIP-style contrastive pretraining to align RoI visual embeddings with text embeddings.
  • Phase C — Fine-tuning with local EHR (critical): Freeze most of the pretrained visual backbone; train the RoI feature head and the fusion/re-ranking layers to avoid overfitting. Use ranking losses (pairwise ranking loss) or cross-entropy re-scoring target so that true EHR-matching RoIs get higher scores. Include hard negatives (other RoIs in same scan that do not match the text) to improve discrimination.
  • Phase D — End-to-end fine-tuning (if sufficient data): If local aligned image-text pairs become plentiful, consider deeper fine-tuning of cross-modal transformers or end-to-end training.
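The pairwise ranking loss mentioned in Phase C can be sketched as a hinge loss over (matching, non-matching) RoI pairs from the same scan; the margin value here is illustrative:

```python
import numpy as np

def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge ranking loss: every EHR-matching RoI should out-score every
    non-matching RoI from the same scan by at least `margin`."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    diff = margin - (pos[:, None] - neg[None, :])   # all pos/neg pairs
    return float(np.maximum(diff, 0.0).mean())
```

Hard negatives (high-scoring non-matching RoIs) contribute the largest terms, which is what drives the discrimination gain noted above.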
Evaluation & Studies:
  • Technical metrics: FROC (recall vs FP/scan), recall@1/2 FP, AP, mean center distance (mm), IoU distribution.
  • Clinical/user studies: Measure time saved in contouring, rate of treatment-plan changes prompted by candidates, small reader studies comparing assisted vs unassisted reading for accuracy and efficiency.
  • Ablations: Compare image-only, image+synthetic text, image+real EHR; test different fusion methods and sensitivity to candidate-threshold settings.
Deployment:
  • Deployment design: Modular: the candidate generator runs on the image server; the fusion/re-ranker can be an optional service. Expose configurable thresholds and maximum candidate counts; integrate via PACS/TPS plugins or a web UI.

Interested? Contact [email protected]


Project 3: Exploring Frequency-Guided Diffusion Model for Medical Imaging Translation

Aim

Reproduce and adapt Frequency-Guided Diffusion Models (FGDM) for CBCT→CT / MRI→CT translation; prototype extensions (band decomposition, adaptive weighting, learnable filters, frequency-domain losses) and evaluate on medical datasets.

Meaning

  • Frequency priors (edges / FFT bands) help preserve structure and reduce artifacts—crucial for medical image translation where anatomical fidelity matters.
  • Dynamic or learnable frequency guidance may better balance global structure vs fine detail across diffusion steps.

Resources

Codebases:
Datasets:
Methods:
  • Frequency Band Decomposition: Split images into low/mid/high bands using transforms such as wavelets and contourlets, and guide the diffusion model with selected bands.
  • Dynamic Frequency Guidance: Adjust frequency emphasis across diffusion steps (early steps emphasize low frequencies for global structure; later steps emphasize high frequencies for refinement).
  • Learnable Frequency Filters: Replace fixed Sobel/FFT with trainable convolutional filters to capture task-specific frequencies.
  • Frequency Loss Functions: Add explicit loss in frequency domain (FFT-MSE, bandwise SSIM) to enforce fidelity.
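The band-decomposition and frequency-loss ideas can be sketched with a plain FFT (an ideal low/high-pass split rather than wavelets/contourlets, for brevity):

```python
import numpy as np

def split_bands(img: np.ndarray, cutoff: float = 0.1):
    """Split a 2D image into low- and high-frequency bands using an
    ideal low-pass filter; `cutoff` is a fraction of the Nyquist frequency."""
    f = np.fft.fft2(img)
    fy = np.fft.fftfreq(img.shape[0])[:, None]   # cycles/pixel per axis
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2) / 0.5    # fraction of Nyquist
    low = np.real(np.fft.ifft2(f * (radius <= cutoff)))
    return low, img - low

def fft_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """MSE between FFT magnitudes: a simple frequency-domain fidelity loss."""
    return float(np.mean((np.abs(np.fft.fft2(pred))
                          - np.abs(np.fft.fft2(target))) ** 2))
```

A learnable variant would replace the fixed mask with trainable filters, and a bandwise loss would apply `fft_mse`-style terms per band.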
Study Guide:
  • Reproduce FGDM → adapt to CBCT–CT / MRI–CT → test extensions on medical datasets; measure structural fidelity and artifact suppression.
References:
  • [1] Li, Y., Shao, H.-C., Liang, X., Chen, L., Li, R., Jiang, S., Wang, J., & Zhang, Y. (2024). Zero-Shot Medical Image Translation via Frequency-Guided Diffusion Models. *IEEE Transactions on Medical Imaging*, *43*(3), 980–993. https://doi.org/10.1109/TMI.2023.3325703
  • [2] Zhang, Y., Li, L., Wang, J., Yang, X., Zhou, H., He, J., Xie, Y., Jiang, Y., Sun, W., Zhang, X., Zhou, G., & Zhang, Z. (2025). Texture-preserving diffusion model for CBCT-to-CT synthesis. *Medical Image Analysis*, *99*, 103362. https://doi.org/10.1016/j.media.2024.103362
  • [3] Chen, J., Ye, Z., Zhang, R., Li, H., Fang, B., Zhang, L., & Wang, W. (2025). Medical image translation with deep learning: Advances, datasets and perspectives. *Medical Image Analysis*, *103*, 103605. https://doi.org/10.1016/j.media.2025.103605

Interested? Contact [email protected]