Keras Faster R-CNN Explained: RPN Calculation

Published: 2021-06-07 11:55:33 | Source: 奖杯厂家

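Editor's note from AI科技评论: this article was first published in the Zhihu column Learning Machine by 张潇捷, and is republished by 雷锋网 AI科技评论 with the author's permission.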

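A while ago I finished Udacity's machine learning and deep learning courses and felt I had only just reached the threshold of deep learning, so I started on Stanford's cs231n (http://cs231n.stanford.edu/syllabus.html) and promptly fell into the computer vision rabbit hole. It turns out that beyond recognizing objects you can also do localization, object detection, semantic segmentation, instance segmentation, networks that spar with themselves (GAN), style transfer, and more. It was a real eye-opener. I decided to start with detection, so let's get to work!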

For detection, the natural starting point is the series of work by rbg (Ross Girshick), running from R-CNN all the way to YOLO. Here is a YOLO video for your enjoyment: YOLO: Real-Time Object Detection (https://www.youtube.com/watch?v=VOC3huqHrss&feature=youtu.be). You can also go straight to this page, which has more detail: YOLO: Real-Time Object Detection (https://pjreddie.com/darknet/yolo/). The original Faster R-CNN paper is here: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (https://arxiv.org/abs/1506.01497).

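Since I am not very fluent in TensorFlow and do most of my projects in Keras, I found a Keras implementation of Faster R-CNN on GitHub (https://github.com/yhenon/keras-frcnn) to learn from. After cloning it, only a few small code changes were needed to get it running. I trained it on the Oxford pet dataset, and after a bit over an hour on my aging GTX 970 it could already detect objects reasonably well. The result is shown below.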

[Figure: detection result after training on the Oxford pet dataset. Image: https://static.leiphone.com/uploads/new/article/pic/201709/9725b06f37d79f9e3f1d43916336e9bf.png]

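Next comes understanding the code. The core idea of Faster R-CNN is to replace the previously separate region proposal step with an RPN (Region Proposal Network), which makes training fully end-to-end and speeds the algorithm up. Understanding the RPN is therefore the first step toward understanding Faster R-CNN. The code below shows how the ground truth used to train the RPN is produced; once you fully understand it, you also understand how the RPN works.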

The computation is fairly long but involves no complicated math. I drew a rough flowchart, which should make it much easier to follow.

[Figure: flowchart of the RPN ground-truth computation. Image: https://static.leiphone.com/uploads/new/article/pic/201709/eb5aa8a8db9cc7e183394e868073ceff.jpg]

Now for the code:

import random

import numpy as np


def calc_rpn(C, img_data, width, height, resized_width, resized_height, img_length_calc_function):
    downscale = float(C.rpn_stride)
    anchor_sizes = C.anchor_box_scales
    anchor_ratios = C.anchor_box_ratios
    num_anchors = len(anchor_sizes) * len(anchor_ratios)

    # calculate the output map size based on the network architecture
    (output_width, output_height) = img_length_calc_function(resized_width, resized_height)

    n_anchratios = len(anchor_ratios)

    # initialise empty output objectives
    y_rpn_overlap = np.zeros((output_height, output_width, num_anchors))
    y_is_box_valid = np.zeros((output_height, output_width, num_anchors))
    y_rpn_regr = np.zeros((output_height, output_width, num_anchors * 4))

    num_bboxes = len(img_data['bboxes'])

    num_anchors_for_bbox = np.zeros(num_bboxes).astype(int)
    best_anchor_for_bbox = -1*np.ones((num_bboxes, 4)).astype(int)
    best_iou_for_bbox = np.zeros(num_bboxes).astype(np.float32)
    best_x_for_bbox = np.zeros((num_bboxes, 4)).astype(int)
    best_dx_for_bbox = np.zeros((num_bboxes, 4)).astype(np.float32)

    # get the GT box coordinates, and resize to account for image resizing
    gta = np.zeros((num_bboxes, 4))
    for bbox_num, bbox in enumerate(img_data['bboxes']):
        # columns of gta are ordered x1, x2, y1, y2
        gta[bbox_num, 0] = bbox['x1'] * (resized_width / float(width))
        gta[bbox_num, 1] = bbox['x2'] * (resized_width / float(width))
        gta[bbox_num, 2] = bbox['y1'] * (resized_height / float(height))
        gta[bbox_num, 3] = bbox['y2'] * (resized_height / float(height))

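First, the parameters. C holds the configuration. img_data contains one image's path, its bbox coordinates and the corresponding classes (an image may have several entries, meaning it contains several objects). Then come the image's original size and its size after resizing, which are used to map the bbox coordinates onto the resized image. img_length_calc_function is a function that, based on our settings, computes the size of the feature map the network produces from a given image size.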

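Next a few parameters are read. downscale is the scaling factor from the image to the feature map. anchor_sizes and anchor_ratios are the parameters for our initial candidate boxes: for example, 3 sizes and 3 ratios combine into 9 boxes of different shapes and sizes. Then img_length_calc_function computes the size of the feature map.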
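To make the "3 sizes × 3 ratios = 9 boxes" combination concrete, here is a minimal sketch. The scale and ratio values are assumptions chosen for illustration and are not taken from this article; in the real code they come from C.anchor_box_scales and C.anchor_box_ratios. The helper name anchor_shapes is also hypothetical.

```python
# Illustrative values only; the real ones live in the config object C.
anchor_box_scales = [128, 256, 512]            # assumed sizes in pixels
anchor_box_ratios = [[1, 1], [1, 2], [2, 1]]   # assumed (width, height) multipliers


def anchor_shapes(scales, ratios):
    """Return (width, height) for every size/ratio combination.

    The loop order (size outside, ratio inside) mirrors calc_rpn.
    """
    shapes = []
    for size in scales:
        for ratio in ratios:
            shapes.append((size * ratio[0], size * ratio[1]))
    return shapes


shapes = anchor_shapes(anchor_box_scales, anchor_box_ratios)
print(len(shapes))   # 9 anchor shapes per feature-map location
print(shapes[:3])    # [(128, 128), (128, 256), (256, 128)]
```

Each feature-map location therefore carries len(scales) * len(ratios) candidate boxes, which is the num_anchors value used to size the output arrays.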

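The next step initializes several variables; skip them for now and come back when they are used. Because all of our computation is done on the resized image, the x1, x2, y1, y2 of each bbox are then scaled to match the resized image. The result is stored in gta, with shape (num_bboxes, 4).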
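The rescaling step can be tried in isolation. The sketch below assumes a made-up image size and box; scale_bbox is a hypothetical helper, not a function from the repository. It returns the coordinates in the same x1, x2, y1, y2 column order that gta uses.

```python
def scale_bbox(bbox, width, height, resized_width, resized_height):
    """Map a box from original-image pixels to resized-image pixels.

    Returns [x1, x2, y1, y2], the same column order as a row of gta.
    """
    sx = resized_width / float(width)
    sy = resized_height / float(height)
    return [bbox['x1'] * sx, bbox['x2'] * sx, bbox['y1'] * sy, bbox['y2'] * sy]


# Hypothetical example: a 400x300 image resized to 800x600.
bbox = {'x1': 100, 'x2': 300, 'y1': 50, 'y2': 250, 'class': 'cat'}
row = scale_bbox(bbox, 400, 300, 800, 600)
print(row)   # [200.0, 600.0, 100.0, 500.0]
```

Note the column order: x coordinates come first, then y coordinates, which is why the later iou call reorders them into [x1, y1, x2, y2].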

    for anchor_size_idx in range(len(anchor_sizes)):
        for anchor_ratio_idx in range(n_anchratios):
            anchor_x = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][0]
            anchor_y = anchor_sizes[anchor_size_idx] * anchor_ratios[anchor_ratio_idx][1]

            for ix in range(output_width):
                # x-coordinates of the current anchor box
                x1_anc = downscale * (ix + 0.5) - anchor_x / 2
                x2_anc = downscale * (ix + 0.5) + anchor_x / 2

                # ignore boxes that go across image boundaries
                if x1_anc < 0 or x2_anc > resized_width:
                    continue

                for jy in range(output_height):
                    # y-coordinates of the current anchor box
                    y1_anc = downscale * (jy + 0.5) - anchor_y / 2
                    y2_anc = downscale * (jy + 0.5) + anchor_y / 2

                    # ignore boxes that go across image boundaries
                    if y1_anc < 0 or y2_anc > resized_height:
                        continue

                    # bbox_type indicates whether an anchor should be a target
                    bbox_type = 'neg'

                    # this is the best IOU for the (x,y) coord and the current anchor
                    # note that this is different from the best IOU for a GT bbox
                    best_iou_for_loc = 0.0

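The section above computes each anchor's width and height. The important part is that every point of the feature map is treated as an anchor point: multiplying by downscale maps it back to real image coordinates, the anchor size is applied around that center, and anchors that extend beyond the image are ignored. This yields rectangular candidate boxes of varying sizes and aspect ratios. The code iterates over these boxes, and the next listing is what runs for each of them.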
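The anchor placement and the boundary check can be sketched on a toy grid. All numbers below (a 4x3 feature map, stride 16, one 32x32 anchor shape) are assumptions for illustration, and valid_anchors is a hypothetical helper that reuses the same arithmetic as the loop in calc_rpn.

```python
def valid_anchors(output_width, output_height, downscale,
                  anchor_w, anchor_h, resized_width, resized_height):
    """List (ix, jy, x1, y1, x2, y2) for anchors that lie fully inside the image."""
    kept = []
    for ix in range(output_width):
        # the anchor is centred on the middle of feature-map cell ix
        x1 = downscale * (ix + 0.5) - anchor_w / 2
        x2 = downscale * (ix + 0.5) + anchor_w / 2
        if x1 < 0 or x2 > resized_width:
            continue
        for jy in range(output_height):
            y1 = downscale * (jy + 0.5) - anchor_h / 2
            y2 = downscale * (jy + 0.5) + anchor_h / 2
            if y1 < 0 or y2 > resized_height:
                continue
            kept.append((ix, jy, x1, y1, x2, y2))
    return kept


# Hypothetical numbers: a 4x3 feature map with stride 16 (a 64x48 image)
# and a single 32x32 anchor shape.
kept = valid_anchors(4, 3, 16.0, 32, 32, 64, 48)
print(len(kept), "of", 4 * 3, "anchor positions survive")   # 2 of 12
for anchor in kept:
    print(anchor)
```

Even on this tiny grid most positions are discarded, because an anchor centred near the border sticks out of the image. Large anchors on real images are filtered in the same way.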

                    # bbox_type indicates whether an anchor should be a target
                    bbox_type = 'neg'

                    # this is the best IOU for the (x,y) coord and the current anchor
                    # note that this is different from the best IOU for a GT bbox
                    best_iou_for_loc = 0.0

                    for bbox_num in range(num_bboxes):
                        # get IOU of the current GT box and the current anchor box
                        curr_iou = iou([gta[bbox_num, 0], gta[bbox_num, 2], gta[bbox_num, 1], gta[bbox_num, 3]], [x1_anc, y1_anc, x2_anc, y2_anc])

                        # calculate the regression targets if they will be needed
                        if curr_iou > best_iou_for_bbox[bbox_num] or curr_iou > C.rpn_max_overlap:
                            cx = (gta[bbox_num, 0] + gta[bbox_num, 1]) / 2.0
                            cy = (gta[bbox_num, 2] + gta[bbox_num, 3]) / 2.0
                            cxa = (x1_anc + x2_anc)/2.0
                            cya = (y1_anc + y2_anc)/2.0

                            tx = (cx - cxa) / (x2_anc - x1_anc)
                            ty = (cy - cya) / (y2_anc - y1_anc)
                            tw = np.log((gta[bbox_num, 1] - gta[bbox_num, 0]) / (x2_anc - x1_anc))
                            th = np.log((gta[bbox_num, 3] - gta[bbox_num, 2]) / (y2_anc - y1_anc))

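Two variables are defined, bbox_type and best_iou_for_loc; they are used later. The code computes the IoU (intersection over union) between the anchor and each gta box, which is simple enough that I won't expand on it. Then, if the IoU is greater than best_iou_for_bbox[bbox_num] or greater than the threshold we set, it computes the center coordinates of the gta box and of the anchor, and from the centers and the box sizes derives four values tx, ty, tw, th. I first thought of these as "gradients", but they are better described as regression targets: the center offset normalized by the anchor size, and the log of the width and height ratios. Why compute them? Because the regions produced by the RPN are not necessarily accurate, as you can guess from there being only 9 anchor shapes, so at prediction time a regression step refines the box instead of using the anchor's coordinates directly.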
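Since the article does not show the iou helper, here is a minimal sketch of one, together with the regression targets and their inverse. This is an illustration under stated assumptions and not the repository's implementation: boxes are [x1, y1, x2, y2] (the order used in the iou call above), and the names regression_targets and apply_targets are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def regression_targets(gt, anchor):
    """(tx, ty, tw, th) that move `anchor` onto `gt`; both are [x1, y1, x2, y2]."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    cxa, cya = (anchor[0] + anchor[2]) / 2.0, (anchor[1] + anchor[3]) / 2.0
    cx, cy = (gt[0] + gt[2]) / 2.0, (gt[1] + gt[3]) / 2.0
    tx = (cx - cxa) / wa                   # centre shift in units of anchor width
    ty = (cy - cya) / ha                   # centre shift in units of anchor height
    tw = math.log((gt[2] - gt[0]) / wa)    # log width ratio
    th = math.log((gt[3] - gt[1]) / ha)    # log height ratio
    return tx, ty, tw, th


def apply_targets(anchor, t):
    """Inverse of regression_targets: recover the box from anchor and targets."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    cxa, cya = (anchor[0] + anchor[2]) / 2.0, (anchor[1] + anchor[3]) / 2.0
    cx, cy = cxa + t[0] * wa, cya + t[1] * ha
    w, h = wa * math.exp(t[2]), ha * math.exp(t[3])
    return [cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0]


anchor = [0.0, 0.0, 100.0, 100.0]   # made-up anchor
gt = [20.0, 10.0, 120.0, 90.0]      # made-up ground-truth box
t = regression_targets(gt, anchor)
print(iou(gt, anchor), t)
print(apply_targets(anchor, t))     # reproduces gt up to rounding
```

The round trip shows why these values are useful: given an anchor and the four predicted numbers, the refined box can be reconstructed, and normalizing by the anchor size keeps the targets on a similar scale for small and large anchors.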

Next, each anchor is labeled according to how well it matched.

                        if img_data['bboxes'][bbox_num]['class'] != 'bg':
                            # all GT boxes should be mapped to an anchor box, so we keep track of which anchor box was best
                            if curr_iou > best_iou_for_bbox[bbox_num]:
                                best_anchor_for_bbox[bbox_num] = [jy, ix, anchor_ratio_idx, anchor_size_idx]
                                best_iou_for_bbox[bbox_num] = curr_iou
                                best_x_for_bbox[bbox_num,:] = [x1_anc, x2_anc, y1_anc, y2_anc]
                                best_dx_for_bbox[bbox_num,:] = [tx, ty, tw, th]

                            # we set the anchor to positive if the IOU is >0.7 (it does not matter if there was another better box, it just indicates overlap)
                            if curr_iou > C.rpn_max_overlap:
                                bbox_type = 'pos'
                                num_anchors_for_bbox[bbox_num] += 1
                                # we update the regression layer target if this IOU is the best for the current (x,y) and anchor position
                                if curr_iou > best_iou_for_loc:
                                    best_iou_for_loc = curr_iou
                                    best_regr = (tx, ty, tw, th)

                            # if the IOU is >0.3 and <0.7, it is ambiguous and not included in the objective
                            if C.rpn_min_overlap < curr_iou < C.rpn_max_overlap:
                                # gray zone between neg and pos
                                if bbox_type != 'pos':
                                    bbox_type = 'neutral'

                    # turn on or off outputs depending on IOUs
                    if bbox_type == 'neg':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                    elif bbox_type == 'neutral':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 0
                    elif bbox_type == 'pos':
                        y_is_box_valid[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        y_rpn_overlap[jy, ix, anchor_ratio_idx + n_anchratios * anchor_size_idx] = 1
                        start = 4 * (anchor_ratio_idx + n_anchratios * anchor_size_idx)
                        y_rpn_regr[jy, ix, start:start+4] = best_regr

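The precondition is that the bbox's class is not 'bg', i.e. background. If the IoU exceeds the best value seen so far for this bbox, a series of updates is performed. If the IoU exceeds our threshold, the anchor is marked positive, meaning some bbox overlaps it strongly, and that bbox's num_anchors_for_bbox count is incremented. If the IoU is also greater than best_iou_for_loc, best_regr is set to the current regression targets. Here best_iou_for_loc is the best IoU for this particular anchor; my understanding is that when an anchor is positive for more than one bbox, we keep the targets of the bbox with the highest IoU. Keep in mind that at this stage we only need to find the best region and do not care which class it contains. If the IoU falls between the minimum and maximum thresholds, we cannot tell whether the anchor is background or object, so it is marked neutral.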

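The anchor is then labeled according to bbox_type (the last part of the code above): y_is_box_valid and y_rpn_overlap record, respectively, whether the anchor is usable for training and whether it contains an object.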
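The labeling rule can be condensed into a small function. This is a sketch, not the repository's code: label_anchor is a hypothetical name, and the default thresholds 0.3 and 0.7 are taken from the comments in the listing above (the real values come from C.rpn_min_overlap and C.rpn_max_overlap).

```python
def label_anchor(ious, rpn_min_overlap=0.3, rpn_max_overlap=0.7):
    """Classify one anchor from its IoUs with all foreground GT boxes.

    Returns (bbox_type, is_box_valid, rpn_overlap).
    """
    bbox_type = 'neg'
    for curr_iou in ious:
        if curr_iou > rpn_max_overlap:
            bbox_type = 'pos'
        elif rpn_min_overlap < curr_iou < rpn_max_overlap and bbox_type != 'pos':
            bbox_type = 'neutral'
    # (is_box_valid, rpn_overlap) for each label, as in the listing above
    flags = {'neg': (1, 0), 'neutral': (0, 0), 'pos': (1, 1)}
    return (bbox_type,) + flags[bbox_type]


print(label_anchor([0.1, 0.2]))        # ('neg', 1, 0)
print(label_anchor([0.1, 0.5]))        # ('neutral', 0, 0)
print(label_anchor([0.5, 0.8, 0.2]))   # ('pos', 1, 1)
```

Neutral anchors get is_box_valid = 0, so they are left out of training entirely, while negatives stay valid and teach the network what background looks like.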

    for idx in range(num_anchors_for_bbox.shape[0]):
        if num_anchors_for_bbox[idx] == 0:
            # no box with an IOU greater than zero ...
            if best_anchor_for_bbox[idx, 0] == -1:
                continue
            y_is_box_valid[
                best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1], best_anchor_for_bbox[idx,2] + n_anchratios *
                best_anchor_for_bbox[idx,3]] = 1
            y_rpn_overlap[
                best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1], best_anchor_for_bbox[idx,2] + n_anchratios *
                best_anchor_for_bbox[idx,3]] = 1
            start = 4 * (best_anchor_for_bbox[idx,2] + n_anchratios * best_anchor_for_bbox[idx,3])
            y_rpn_regr[
                best_anchor_for_bbox[idx,0], best_anchor_for_bbox[idx,1], start:start+4] = best_dx_for_bbox[idx,:]

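Here another problem shows up: many bboxes may not find an anchor they like (no anchor exceeds the positive threshold), and their training data would then go unused. So a compromise guarantees that every bbox gets at least one anchor. The code above implements it, and it is simple: for a bbox with no matching anchor, take the anchor with the highest IoU among those that did not qualify as positive. The only condition is that the anchor overlaps the bbox at all; zero overlap would be asking too much.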
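The fallback can be sketched on a small table of IoU values. The table and the function name fallback_positives are made up for illustration; the real code tracks the best anchor incrementally inside the anchor loop rather than building a full matrix.

```python
def fallback_positives(iou_matrix, rpn_max_overlap=0.7):
    """For each GT box with no anchor above the threshold, pick its best anchor.

    iou_matrix[g][a] is the IoU between GT box g and anchor a.
    Returns {gt_index: anchor_index} for the boxes that needed a fallback.
    """
    chosen = {}
    for g, row in enumerate(iou_matrix):
        if not row or any(v > rpn_max_overlap for v in row):
            continue                      # nothing to pick from, or already positive
        best = max(range(len(row)), key=lambda a: row[a])
        if row[best] > 0:                 # skip boxes that no anchor touches at all
            chosen[g] = best
    return chosen


ious = [
    [0.10, 0.80, 0.40],   # box 0: anchor 1 is already positive
    [0.20, 0.55, 0.30],   # box 1: no positive anchor, best is anchor 1
    [0.00, 0.00, 0.00],   # box 2: no overlap at all, left unmatched
]
print(fallback_positives(ious))   # {1: 1}
```

The chosen anchor is then treated like any other positive: it is marked valid, marked as containing an object, and given the regression targets stored for that box.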

    y_rpn_overlap = np.transpose(y_rpn_overlap, (2, 0, 1))
    y_rpn_overlap = np.expand_dims(y_rpn_overlap, axis=0)

    y_is_box_valid = np.transpose(y_is_box_valid, (2, 0, 1))
    y_is_box_valid = np.expand_dims(y_is_box_valid, axis=0)

    y_rpn_regr = np.transpose(y_rpn_regr, (2, 0, 1))
    y_rpn_regr = np.expand_dims(y_rpn_regr, axis=0)

    pos_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 1, y_is_box_valid[0, :, :, :] == 1))
    neg_locs = np.where(np.logical_and(y_rpn_overlap[0, :, :, :] == 0, y_is_box_valid[0, :, :, :] == 1))

    num_pos = len(pos_locs[0])

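The code above uses a series of numpy operations to move the anchor axis to the front, add a batch axis, and locate the positive and negative anchors.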
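A toy example makes the axis shuffling easier to see. The array sizes and the two hand-placed anchors below are assumptions for illustration only.

```python
import numpy as np

# Toy label maps: height 2, width 3, 2 anchors per location (channels last).
y_rpn_overlap = np.zeros((2, 3, 2))
y_is_box_valid = np.zeros((2, 3, 2))
y_is_box_valid[0, 1, 0] = 1          # a valid negative anchor
y_is_box_valid[1, 2, 1] = 1          # a valid positive anchor ...
y_rpn_overlap[1, 2, 1] = 1           # ... because overlap is set as well

# (H, W, A) -> (A, H, W) -> (1, A, H, W)
y_rpn_overlap = np.expand_dims(np.transpose(y_rpn_overlap, (2, 0, 1)), axis=0)
y_is_box_valid = np.expand_dims(np.transpose(y_is_box_valid, (2, 0, 1)), axis=0)

# np.where returns one index array per axis: (anchor, row, col)
pos_locs = np.where(np.logical_and(y_rpn_overlap[0] == 1, y_is_box_valid[0] == 1))
neg_locs = np.where(np.logical_and(y_rpn_overlap[0] == 0, y_is_box_valid[0] == 1))

print(y_rpn_overlap.shape)      # (1, 2, 2, 3)
print(list(zip(*pos_locs)))     # one (anchor, row, col) triple per positive
print(list(zip(*neg_locs)))     # one (anchor, row, col) triple per negative
```

Because np.where returns a tuple of index arrays, len(pos_locs[0]) is the number of positive anchors, which is how num_pos is obtained.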

    num_regions = 256

    # integer division: random.sample needs an integer sample size
    if len(pos_locs[0]) > num_regions // 2:
        val_locs = random.sample(range(len(pos_locs[0])), len(pos_locs[0]) - num_regions // 2)
        y_is_box_valid[0, pos_locs[0][val_locs], pos_locs[1][val_locs], pos_locs[2][val_locs]] = 0
        num_pos = num_regions // 2

    if len(neg_locs[0]) + num_pos > num_regions:
        val_locs = random.sample(range(len(neg_locs[0])), len(neg_locs[0]) - num_pos)
        y_is_box_valid[0, neg_locs[0][val_locs], neg_locs[1][val_locs], neg_locs[2][val_locs]] = 0

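Because negative anchors far outnumber positive ones, this step caps the number of regions at 256 and subsamples at random: positives are limited to half of that, and when the total would exceed the cap, negatives are cut down to the same count as the positives that remain.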
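The sampling arithmetic can be checked without any label arrays. In this sketch subsample is a hypothetical helper that only decides how many anchors to switch off, using the same conditions as the listing above; the counts passed in are made up.

```python
import random


def subsample(num_pos_available, num_neg_available, num_regions=256):
    """Pick which positive/negative anchors to switch off.

    Returns (pos_to_disable, neg_to_disable) as lists of indices into the
    pos_locs / neg_locs index arrays.
    """
    num_pos = num_pos_available
    pos_to_disable = []
    if num_pos_available > num_regions // 2:
        pos_to_disable = random.sample(range(num_pos_available),
                                       num_pos_available - num_regions // 2)
        num_pos = num_regions // 2
    neg_to_disable = []
    if num_neg_available + num_pos > num_regions:
        # leaves exactly num_pos negatives switched on
        neg_to_disable = random.sample(range(num_neg_available),
                                       num_neg_available - num_pos)
    return pos_to_disable, neg_to_disable


pos_off, neg_off = subsample(num_pos_available=10, num_neg_available=5000)
print(10 - len(pos_off), 5000 - len(neg_off))   # 10 10
```

With 10 positives and 5000 negatives, 10 of each remain valid, so the classifier sees a balanced batch instead of one dominated by background.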

    y_rpn_cls = np.concatenate([y_is_box_valid, y_rpn_overlap], axis=1)
    y_rpn_regr = np.concatenate([np.repeat(y_rpn_overlap, 4, axis=1), y_rpn_regr], axis=1)

    return np.copy(y_rpn_cls), np.copy(y_rpn_regr)

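Finally, we get the two return values y_rpn_cls and y_rpn_regr. They are used, respectively, to decide whether an anchor contains an object and as the box regression targets.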
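It helps to look at the shapes of the two return values. The sizes below (9 anchors on a 4x5 feature map) and the single hand-placed positive anchor are assumptions for illustration.

```python
import numpy as np

num_anchors, height, width = 9, 4, 5   # hypothetical sizes
y_is_box_valid = np.zeros((1, num_anchors, height, width))
y_rpn_overlap = np.zeros((1, num_anchors, height, width))
y_rpn_regr = np.zeros((1, num_anchors * 4, height, width))

# Mark anchor 2 at row 1, column 3 as a positive, valid anchor.
y_is_box_valid[0, 2, 1, 3] = 1
y_rpn_overlap[0, 2, 1, 3] = 1

y_rpn_cls = np.concatenate([y_is_box_valid, y_rpn_overlap], axis=1)
y_rpn_regr = np.concatenate([np.repeat(y_rpn_overlap, 4, axis=1), y_rpn_regr], axis=1)

print(y_rpn_cls.shape)    # (1, 18, 4, 5): validity mask, then object labels
print(y_rpn_regr.shape)   # (1, 72, 4, 5): positive mask repeated x4, then targets

# np.repeat places the 4 copies of each anchor's flag next to each other,
# so anchor 2 occupies mask channels 8..11, matching start = 4 * index.
print(y_rpn_regr[0, 8:12, 1, 3])   # [1. 1. 1. 1.]
```

So each target array carries its own mask in the first half of the channel axis: which anchors count toward the classification objective, and which anchors have meaningful regression targets.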

Now let's look at the structure of the RPN layers in the network:

def rpn(base_layers, num_anchors):
    x = Convolution2D(512, (3, 3), padding='same', activation='relu', kernel_initializer='normal', name='rpn_conv1')(base_layers)

    x_class = Convolution2D(num_anchors, (1, 1), activation='sigmoid', kernel_initializer='uniform', name='rpn_out_class')(x)
    x_regr = Convolution2D(num_anchors * 4, (1, 1), activation='linear', kernel_initializer='zero', name='rpn_out_regress')(x)

    return [x_class, x_regr, base_layers]

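After a shared 3×3 convolution, a 1×1 window slides over the feature map and produces num_anchors channels. Each channel holds w×h sigmoid activations, one per feature-map location, indicating whether the corresponding anchor contains an object; this output corresponds to the y_rpn_cls we just computed. In the same way, x_regr corresponds to the y_rpn_regr we just computed.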
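A 1×1 convolution is just the same small linear layer applied at every feature-map location, which can be sketched in plain numpy. This is an illustration of the idea, not the Keras layers themselves: the sizes, the random weights, and the helper names conv1x1 and sigmoid are all assumptions.

```python
import numpy as np


def conv1x1(feature_map, weights, bias):
    """1x1 convolution: the same linear map applied at every (row, col).

    feature_map: (H, W, C_in), weights: (C_in, C_out), bias: (C_out,)
    """
    return feature_map @ weights + bias


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
H, W, C, num_anchors = 4, 5, 512, 9          # hypothetical sizes
x = rng.standard_normal((H, W, C))           # stands in for the rpn_conv1 output

# Classification head: one objectness score per anchor and location.
x_class = sigmoid(conv1x1(x, rng.standard_normal((C, num_anchors)) * 0.01,
                          np.zeros(num_anchors)))
# Regression head: four numbers per anchor, zero-initialised weights.
x_regr = conv1x1(x, np.zeros((C, num_anchors * 4)), np.zeros(num_anchors * 4))

print(x_class.shape, x_regr.shape)   # (4, 5, 9) (4, 5, 36)
```

Every location thus gets num_anchors objectness scores and 4 * num_anchors regression outputs, the same per-location counts as the label arrays built in calc_rpn.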

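With the region proposals in hand, the next important idea is RoI pooling, which turns feature maps of different shapes into a fixed shape so they can be fed to the fully connected layers for the final prediction. I will update this once I have studied it. Since I am still learning myself, there are probably mistakes in my understanding; corrections are welcome.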

Copyright 雷锋网. Reproduction without authorization is prohibited; see the reprint notice for details.