Zhe Gan

See: KOSMOS-1 Architecture, Automated Text Generation, Visual Instruction Tuning.

References

(McKinzie et al., 2024) ⇒ Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah et al. (2024). “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.” In: arXiv preprint arXiv:2403.09611.

(Lei et al., 2021) ⇒ J Lei, L Li, L Zhou, Zhe Gan, TL Berg, M Bansal, and J Liu. (2021). “Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling.” In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- NOTE: It proposes an approach to efficient video-and-language learning that uses sparse sampling to reduce computational cost while maintaining performance.

(Chen et al., 2020) ⇒ YC Chen, L Li, L Yu, A El Kholy, F Ahmed, Zhe Gan, Y Cheng, and J Liu. (2020). “UNITER: Universal Image-Text Representation Learning.” In: European Conference on Computer Vision, Pages 104-120.
- NOTE: It introduces a method for learning universal image-text representations, aimed at providing a single joint representation of visual and textual data for use across vision-and-language tasks.

(Sun et al., 2019) ⇒ S Sun, Y Cheng, Zhe Gan, and J Liu. (2019). “Patient Knowledge Distillation for BERT Model Compression.” In: arXiv preprint arXiv:1908.09355.
- NOTE: It introduces a method for compressing BERT models through knowledge distillation, focusing on preserving model performance while reducing model size.

(Xu et al., 2018) ⇒ T Xu, P Zhang, Q Huang, H Zhang, Zhe Gan, X Huang, and X He. (2018). “AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks.” In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- NOTE: It presents an attentional generative adversarial network that generates images from textual descriptions, with an emphasis on rendering fine-grained details.

(Pu et al., 2016) ⇒ Y Pu, Zhe Gan, R Henao, X Yuan, C Li, A Stevens, and L Carin. (2016). “Variational Autoencoder for Deep Learning of Images, Labels, and Captions.” In: NIPS.
- NOTE: It discusses the application of variational autoencoders in learning joint representations of images, labels, and captions, contributing to the field of multi-modal learning.