INTEGRATION OF VISION-LANGUAGE MODELS FOR INTELLIGENT DOCUMENT ANALYSIS IN SALESFORCE

Authors

  • Shalini Polamarasetti, Independent Researcher

DOI:

https://doi.org/10.48047/gr2mzs78

Abstract

Recent momentum in multimodal artificial intelligence (AI) models, including CLIP and Gemini, has made it possible for companies to extract structured information from unstructured documents such as scanned contracts, forms, and PDFs. Integrating these capabilities into Salesforce can streamline workflows by automating document ingestion, classification, and field extraction within CRM pipelines. This paper discusses the architecture design, implementation approach, and performance of vision-language models for intelligent document analysis in Salesforce. We describe how documents are preprocessed, how key entities and clauses are extracted, and how the results are written back as structured records that reference Salesforce objects. Compared with zero-shot extraction using CLIP, a fine-tuned Gemini classifier strikes a practical balance between flexibility and precision. Experimental results on datasets of legal agreements and customer forms report precision, recall, and processing-speed metrics that show a substantial advantage over conventional OCR and keyword-based extraction. Integration requirements, governance concerns, and user experience are also covered, concluding with best practices and architectural considerations for enterprise deployment.
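As a minimal sketch of the zero-shot classification step mentioned in the abstract, assuming the Hugging Face transformers implementation of CLIP; the checkpoint name, label prompts, and page-image path are illustrative choices, not the paper's actual configuration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint (one common choice, assumed here).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate document classes phrased as natural-language prompts.
labels = [
    "a scanned legal contract",
    "a customer intake form",
    "an invoice",
]

# One rendered page of the incoming document (path is illustrative).
image = Image.open("page_1.png")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns
# them into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Because CLIP scores images against arbitrary text prompts, new document classes can be added by editing the label list alone; this is the flexibility the abstract contrasts with the higher precision of a fine-tuned classifier.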
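Once fields have been extracted, the structured record can be written to Salesforce through the standard REST sObject endpoint. The sketch below assumes an OAuth access token obtained beforehand; the custom object Contract__c and its fields are hypothetical placeholders for whatever schema a deployment defines:

```python
import requests

# Assumed: instance URL and OAuth access token from a prior login flow.
INSTANCE_URL = "https://example.my.salesforce.com"
ACCESS_TOKEN = "..."  # never hard-code credentials in production

# Hypothetical output of the extraction step; Contract__c and its
# custom fields (the __c suffix) are placeholders, not a real schema.
extracted = {
    "Name": "Master Services Agreement - Acme Corp",
    "Effective_Date__c": "2025-01-15",
    "Counterparty__c": "Acme Corp",
}

resp = requests.post(
    f"{INSTANCE_URL}/services/data/v59.0/sobjects/Contract__c/",
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
    json=extracted,
    timeout=30,
)
resp.raise_for_status()
print("Created record:", resp.json()["id"])
```

In production, an upsert against an external-ID field is usually preferable to a bare insert, so that re-processing the same document does not create duplicate records.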

Published

2025-03-31

How to Cite

Polamarasetti, S. (2025). Integration of vision-language models for intelligent document analysis in Salesforce. Cuestiones De Fisioterapia, 53(1), 715–723. https://doi.org/10.48047/gr2mzs78