Large Language Model for Bioinformatics and Health Science

We are developing a new graph model to analyze antimicrobial peptides and proteins (AMPs) and building an open-source dataset for AI training. Funded by the NSF Cybersecurity Innovation for Cyberinfrastructure (CICI) program, our project automates security assessments to protect data integrity from both accidental and malicious threats.

AMPs are essential for food safety, livestock health, and agricultural productivity, but traditional discovery methods are expensive and labor-intensive. AI and bioinformatics have accelerated AMP research, yet large, unverified datasets pose cybersecurity risks if the data is altered or flawed. Our automated framework evaluates AMP sequences and their functionality, reducing costly lab validations and promoting security awareness in the research community. It also produces an open-source dataset focused on the security of AMP functionality data and offers an online platform for evaluation and security education.

Our work has two main goals: (1) filtering low-quality data with a model-driven approach and (2) exploring and defending against data poisoning vulnerabilities. By combining a novel graph model and an open-source dataset, we strengthen data integrity for AMP research and other peptide/protein studies. This effort bridges gaps in data security, fosters more reliable scientific collaborations, and provides an automated verification framework to broaden data access and innovation. Finally, the project emphasizes community engagement and cybersecurity education to advance health, prosperity, and welfare through scientific research.
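To make the first goal concrete, the sketch below shows one simple, illustrative way to filter low-quality sequence data: reject entries with invalid residues and flag compositional outliers. The scoring rule and threshold here are assumptions for illustration only; the project's actual model-driven filter is built on our graph model, not this heuristic.

```python
# Minimal, illustrative data-quality filter for AMP sequences (stdlib only).
# The scoring rule and threshold are assumptions, not the project's method.
from collections import Counter
import math

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def composition(seq: str) -> dict:
    """Fractional residue composition of a sequence."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}

def anomaly_score(seq: str, mean_comp: dict) -> float:
    """Euclidean distance from the corpus-average composition."""
    comp = composition(seq)
    return math.sqrt(sum((comp[aa] - mean_comp[aa]) ** 2 for aa in AMINO_ACIDS))

def filter_low_quality(seqs: list, threshold: float = 0.15) -> list:
    # Step 1: reject sequences containing non-standard residues.
    valid = [s for s in seqs if s and set(s) <= AMINO_ACIDS]
    if not valid:
        return []
    # Step 2: drop compositional outliers relative to the corpus mean.
    mean_comp = {aa: sum(composition(s)[aa] for s in valid) / len(valid)
                 for aa in AMINO_ACIDS}
    return [s for s in valid if anomaly_score(s, mean_comp) <= threshold]

print(filter_low_quality([
    "GIGKFLHSAKKFGKAFVGEIMNS",                # magainin-2-like peptide
    "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK",  # cecropin-like peptide
    "XXNOTAPEPTIDEXX",                        # rejected: invalid residues
]))
```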

Large Language Model for Financial Exploitation

We use large language models (LLMs) to help older adults combat fraud and stay socially connected. Funded by NSF’s Smart and Connected Communities program, this project tackles financial exploitation among older adults.

Our smart assistive technology employs data-driven algorithms and generative AI to detect fraud, tracking all interactions from a soliciting party for comprehensive protection. We also account for social and psychological factors, including ergonomics, privacy, loneliness, attitudes toward technology, and fraud susceptibility.
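As a rough illustration of the per-sender tracking idea, here is a minimal Python sketch. The keyword cues and alert threshold are placeholder assumptions standing in for the project's generative-AI detector.

```python
# Illustrative per-sender interaction tracker; the keyword scorer is a
# placeholder assumption for the actual generative-AI fraud detector.
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical cue list; the real system uses a trained model, not keywords.
FRAUD_CUES = ("wire transfer", "gift card", "act now", "verify your account")

@dataclass
class SenderHistory:
    messages: list = field(default_factory=list)

    def risk_score(self) -> float:
        # Fraction of this sender's messages containing a known fraud cue.
        if not self.messages:
            return 0.0
        hits = sum(any(cue in m.lower() for cue in FRAUD_CUES)
                   for m in self.messages)
        return hits / len(self.messages)

class InteractionTracker:
    """Aggregates every channel (email, SMS, calls) by soliciting party."""
    def __init__(self, alert_threshold: float = 0.5):
        self.history = defaultdict(SenderHistory)
        self.alert_threshold = alert_threshold

    def record(self, sender: str, message: str) -> bool:
        """Store a message and return True if this sender now looks risky."""
        self.history[sender].messages.append(message)
        return self.history[sender].risk_score() >= self.alert_threshold

tracker = InteractionTracker()
tracker.record("+1-555-0100", "Hello, this is your grandson.")
print(tracker.record("+1-555-0100", "Please buy a gift card and act now!"))  # True
```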

This work helps product designers create practical solutions to prevent financial exploitation, improving older adults’ quality of life and independence while reducing healthcare costs.

We tailor AI models for different user profiles (e.g., marital status, urban/rural, socio-economic status) and integrate data from email, social media, financial records, and phone calls. The proposed Large Multimodal Model (LMM) processes images and voice on users’ devices to protect privacy.
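The sketch below illustrates only the profile-tailoring and on-device routing idea; the profile fields, model registry, and file names are illustrative assumptions, not the project's actual configuration.

```python
# Sketch of profile-based model selection. All names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class UserProfile:
    marital_status: str   # e.g., "single", "widowed", "married"
    setting: str          # "urban" or "rural"
    ses: str              # socio-economic status: "low", "middle", "high"

# Hypothetical mapping from profile traits to tailored on-device checkpoints.
MODEL_REGISTRY = {
    ("rural", "low"): "lmm-rural-low-ses.onnx",
    ("urban", "low"): "lmm-urban-low-ses.onnx",
}
DEFAULT_MODEL = "lmm-general.onnx"

def select_model(profile: UserProfile) -> str:
    # Inference runs on the user's device, so raw images, voice, and
    # financial records never leave it; only the checkpoint choice varies.
    return MODEL_REGISTRY.get((profile.setting, profile.ses), DEFAULT_MODEL)

print(select_model(UserProfile("widowed", "rural", "low")))  # lmm-rural-low-ses.onnx
```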

By examining fraudulent solicitations, social-emotional factors, and demographics, we aim to develop more targeted interventions to protect this vulnerable population.


Large Language Model for Hardware Security

Hardware Phi-1.5B: The First Hardware Domain-Specific Pretrained LLM in the World

We have conducted pretraining based on the Phi-1.5B model structure, aligning it more closely with the needs of the hardware domain and enhancing the model's performance and stability in hardware design and verification tasks. It is the first pretrained hardware domain-specific LLM. To support training, we created three differently sized datasets, rigorously screening and optimizing them to guarantee content relevance and quality, thus laying a strong foundation for model training. The pretrained model is offered openly to the community, supporting ongoing research, development, and innovation in both academia and industry. The release date will be around the presentation of the accompanying paper at ASP-DAC 2024.
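For readers who want a concrete picture of this kind of workflow, here is a minimal sketch of continued pretraining on a hardware-domain corpus with Hugging Face Transformers. The base checkpoint ("microsoft/phi-1_5"), the corpus file name, and all hyperparameters are illustrative assumptions and do not reflect the paper's exact setup.

```python
# Minimal sketch of continued pretraining on a hardware corpus.
# Checkpoint, file names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "microsoft/phi-1_5"  # assumed public base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical corpus file of screened hardware text (HDL, docs, papers).
corpus = load_dataset("text", data_files={"train": "hardware_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hardware-phi",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    # Causal LM objective (mlm=False): predict the next token.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```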

For testing the fine-tuned model, please reach out to Dr. Xiaolong Guo: guoxiaolong@ksu.edu

For more details, please refer to our recently accepted paper:

Weimin Fu, Shijie Li, Yifang Zhao, Haocheng Ma, Raj Dutta, Xuan Zhang, Kaichen Yang, Yier Jin, and Xiaolong Guo. Hardware Phi-1.5B: A Large Language Model Encodes Hardware Domain Specific Knowledge. In 29th IEEE/ACM Asia and South Pacific Design Automation Conference (ASP-DAC), January 2024, Incheon Songdo Convensia, South Korea. [download]

LLM4SecHW

LLM4SecHW is an LLM-based hardware debugging framework designed to identify bugs and provide debugging suggestions during the hardware design iteration process. Specifically, we developed an innovative data collection and preprocessing method that harnesses version control information from open-source hardware projects. From this information, we construct a hardware debugging-oriented dataset by filtering and processing the version control data, which is subsequently used to fine-tune our models. Leveraging this dataset, we fine-tune a suite of hardware domain-specific language models capable of reading hardware designs and autonomously locating and rectifying bugs.
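As a rough sketch of this kind of version-control mining, the snippet below pulls bug-fix commits touching HDL files from a local clone of an open-source hardware project. The keyword list, file extensions, and overall flow are assumptions for illustration, not the actual LLM4SecHW pipeline described in the paper.

```python
# Rough sketch of mining git history for hardware bug-fix examples.
# Keywords and extensions are illustrative assumptions.
import subprocess

HDL_EXTENSIONS = (".v", ".sv", ".vhd")
BUG_KEYWORDS = ("fix", "bug", "error", "issue")

def git(repo: str, *args: str) -> str:
    """Run a git command in the given repo and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def touches_hdl(repo: str, sha: str) -> bool:
    """True if the commit modifies at least one HDL source file."""
    files = git(repo, "show", "--name-only", "--pretty=format:", sha)
    return any(f.endswith(HDL_EXTENSIONS) for f in files.splitlines() if f)

def bugfix_examples(repo: str) -> list:
    """Collect (commit message, diff) pairs for HDL bug-fix commits.
    Each diff pairs buggy code with its repair, a natural fine-tuning example."""
    examples = []
    for line in git(repo, "log", "--pretty=format:%H|%s").splitlines():
        sha, _, subject = line.partition("|")
        if any(kw in subject.lower() for kw in BUG_KEYWORDS) and touches_hdl(repo, sha):
            examples.append((subject, git(repo, "show", sha)))
    return examples

# Usage: bugfix_examples("/path/to/local/clone/of/an/open-source-hw-project")
```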

Our dataset, LLM4SecHW-OSHD, is now officially available on Huggingface: https://huggingface.co/datasets/KSU-HW-SEC/LLM4SecHW-OSHD
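A quick way to start exploring the released dataset (the repo id comes from the link above; split and column names are assumptions, so check the dataset card):

```python
# Load the LLM4SecHW-OSHD dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("KSU-HW-SEC/LLM4SecHW-OSHD")
print(ds)                   # inspect available splits and columns
first_split = next(iter(ds))
print(ds[first_split][0])   # peek at one debugging example
```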

For testing the fine-tuned model, please reach out to Dr. Xiaolong Guo: guoxiaolong@ksu.edu

For more details, please refer to our recently accepted paper:

Weimin Fu, Kaichen Yang, Raj Gautam Dutta, Xiaolong Guo, and Gang Qu. LLM4SecHW: Leveraging Domain-Specific Large Language Model for Hardware Debugging. In Asian Hardware Oriented Security and Trust Symposium (AsianHOST), 2023. [download]

LM4AsrtHW

The analysis and verification of hardware security require robust security properties and assertions, whose development is a complex and time-consuming process. Crafting hardware security assertions to meet specific requirements is tedious and requires expert knowledge. This work introduces LM4AsrtHW, a novel framework that leverages Language Models (LMs) to generate hardware security assertions. The framework creates a hardware security-centric dataset for fine-tuning LMs; the fine-tuned models then generate hardware security assertions corresponding to specific CWE vulnerabilities and hardware designs. The generated assertions are validated using commercial EDA tools. Our experimental results show that our fine-tuned, hardware security-oriented LMs consistently outperform existing automated assertion generators and commercial LLMs in generating hardware security assertions.
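To give a flavor of the generation step, here is a hypothetical prompting sketch. The prompt template, the stubbed model call, and the design snippet are illustrative assumptions rather than the framework's exact interface; CWE-1234 is a real hardware weakness entry.

```python
# Hypothetical prompting pattern for CWE-targeted assertion generation.
# Template and model call are assumptions; see the paper for the real flow.
PROMPT_TEMPLATE = """You are a hardware security assistant.
Design under test:
{design_snippet}
Target weakness: {cwe_id} ({cwe_title})
Write a SystemVerilog assertion that flags violations of this property."""

def build_prompt(design_snippet: str, cwe_id: str, cwe_title: str) -> str:
    return PROMPT_TEMPLATE.format(design_snippet=design_snippet,
                                  cwe_id=cwe_id, cwe_title=cwe_title)

prompt = build_prompt(
    design_snippet="module lock(input clk, input unlock, output reg open);",
    cwe_id="CWE-1234",
    cwe_title="Hardware Internal or Debug Modes Allow Override of Locks",
)
# generated = finetuned_lm.generate(prompt)   # hypothetical model call
# The generated assertion would then be checked with a commercial EDA tool.
print(prompt)
```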

Some Preliminary Results of LM4AsrtHW:

[Figure: Assertions generated by different LLMs]