多模态特征融合——基于BERT和ResNet152模型

它强调了不再像以往一样采用传统的单向语言模型或者把两个单向语言模型进行浅层拼接的方法进行预训练，而是采用新的masked language model（MLM），以能生成深度的双向语言表征。等四名华人提出，通过使用ResNet Unit成功训练出了152层的神经网络，并在ILSVRC2015比赛中取得冠军，在top5上的错误率为3.57%，同时参数量比VGGNet低，效果非常突出。ResNet的结

小英熊M

2998人浏览 · 2023-11-06 13:19:25

小英熊M · 2023-11-06 13:19:25 发布

一.预训练模型介绍

1.BERT模型

BERT的全称为Bidirectional Encoder Representation from Transformers，是一个预训练的语言表征模型。它强调了不再像以往一样采用传统的单向语言模型或者把两个单向语言模型进行浅层拼接的方法进行预训练，而是采用新的masked language model（MLM），以能生成深度的双向语言表征。BERT论文发表时提及在11个NLP（Natural Language Processing，自然语言处理）任务中获得了新的state-of-the-art的结果，令人目瞪口呆。

2.ResNet152模型

ResNet（Residual Neural Network）由微软研究院的Kaiming He等四名华人提出，通过使用ResNet Unit成功训练出了152层的神经网络，并在ILSVRC2015比赛中取得冠军，在top5上的错误率为3.57%，同时参数量比VGGNet低，效果非常突出。ResNet的结构可以极快加速神经网络的训练，模型的准确率也有比较大的提升。同时ResNet的推广性非常好，甚至可以直接用到InceptionNet网络中。

二.代码实现

1.下载预训练模型

BERT：https://huggingface.co/bert-base-uncased

ResNet152：https://huggingface.co/microsoft/resnet-152

2. 实现代码


import torch
from transformers import BertTokenizer, BertModel, ResNetModel
from PIL import Image
from torchvision import transforms
import csv
def extract_combined_features(image_path, text):
    # 定义用于预处理图像的transforms
    transform = transforms.Compose([
        transforms.Resize((256, 256)),  # 将图像大小调整为 256x256
        transforms.CenterCrop(224),  # 对图像进行中心裁剪
        transforms.ToTensor(),  # 将图像转换为张量
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # 标准化图像
    ])
    resnet = ResNetModel.from_pretrained("resnet-152")
    resnet.eval()
    image = Image.open(image_path)
    image = transform(image).unsqueeze(0)
    image_features = resnet(image)

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    text_tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    bert = BertModel.from_pretrained('bert-base-uncased')
    bert.eval()
    text_features = bert(**text_tokens).last_hidden_state

    image_features = image_features['last_hidden_state']  # 提取图像特征张量
    text_features = text_features  # 提取文本特征张量
    image_features = image_features.view(-1, 2048, 49).permute(0, 2, 1).contiguous()
    image_pooled_output, _ = image_features.max(1)
    text_pooled_output = text_features.mean(dim=1)  # 使用平均值代替文本特征
    combined_features = torch.cat([image_pooled_output, text_pooled_output], dim=1)
    return combined_features
image_path = '图片路径'
text = '文本信息'
ecf = extract_combined_features(image_path, text)
print(ecf)

最后：

如果你想要进一步了解更多的相关知识，可以关注下面公众号联系~会不定期发布相关设计内容包括但不限于如下内容:信号处理、通信仿真、算法设计、matlab appdesigner，gui设计、simulink仿真......希望能帮到你！

NVIDIA AI技术专区

NVIDIA官方入驻，分享最新的官方资源以及活动/会议信息，精选收录AI相关技术内容，欢迎大家加入社区并参与讨论。

更多推荐

16_Vue3动画(一)之动画的基本使用及animate.css库的使用

Vue3动画的基本使用及animate.css库的使用认识动画在开发中，我们想要给一个组件的显示和消失添加某种过渡动画，可以很好的增加用户体验：React框架本身并没有提供任何动画相关的API，所以在React中使用过渡动画我们需要使用一个第三方库react-transition-group；Vue中为我们提供一些内置组件和对应的API来完成动画，利用它们我们可以方便的实现过渡动画效果；我们来看一