SNPE TUTORIAL (pytorch)

An example of running a deep-learning-based super-resolution network on a Qualcomm Snapdragon AP.

(PyTorch version) - in practice it is almost identical to the TensorFlow version.

https://developer.qualcomm.com/docs/snpe/setup.html

We follow the official setup guide linked above.

First, meet the prerequisites:

Currently the SNPE SDK development environment is limited to Ubuntu, specifically version 14.04.
The SDK requires either Caffe, Caffe2, ONNX or TensorFlow.
1. Instructions for Caffe: Caffe and Caffe2 Setup
2. Instructions for TensorFlow: TensorFlow Setup
3. Instructions for ONNX: ONNX Setup
Python 2.7
Android NDK (android-ndk-r11-linux-x86) is optional and only required to build the native CPP example that ships with the SDK

- SDK Android binaries built with gcc require libgnustl_shared.so which can be found in the Android NDK. (See Platform Runtime Libraries below).
- SDK Android binaries built with clang require libc++_shared.so which is shipped with the SDK.

Android SDK (SDK version 23 and build tools version 23.0.2) is optional and only required to build the Android APK that ships with the SDK.

1. Install a Linux OS
2. Install the GPU graphics driver (verify with: nvidia-smi)
3. Install CUDA 9.0
- Download the CUDA 9.0 run file (the build for Ubuntu 16.04) from the NVIDIA website
- sudo sh cuda_9.2.148_396.37_linux.run (this file name comes from a CUDA 9.2 download; substitute the name of the run file you actually downloaded)
- export PATH
  - Edit ~/.bashrc (a hidden file in the home folder; press Ctrl+H in the file manager to show it, or edit it with vi ~/.bashrc)
  - Add the four lines below, close the terminal, and open a new one to check (note: ":~/anaconda3/bin" is added later, after anaconda3 is installed)
    - export PATH=${PATH}:/usr/local/cuda-9.0/bin:~/anaconda3/bin
    - export CUDA_HOME=/usr/local/cuda-9.0
    - export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-9.0/lib64
    - export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
- Verify the installation: nvcc -V
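After opening a new terminal, it can help to confirm that the exports actually took effect. The sketch below checks whether a colon-separated variable contains the CUDA directories used in this step. The helper name `missing_entries` is hypothetical and only illustrates the idea.

```python
import os


def missing_entries(value, expected):
    """Return the expected directories that are absent from a colon-separated list."""
    # Normalize so that a trailing slash does not cause a false mismatch.
    present = {os.path.normpath(p) for p in value.split(":") if p}
    return [d for d in expected if os.path.normpath(d) not in present]


if __name__ == "__main__":
    # Directories taken from the export lines in this step.
    checks = {
        "PATH": ["/usr/local/cuda-9.0/bin"],
        "LD_LIBRARY_PATH": ["/usr/local/cuda-9.0/lib64",
                            "/usr/local/cuda/extras/CUPTI/lib64"],
    }
    for name, expected in checks.items():
        missing = missing_entries(os.environ.get(name, ""), expected)
        print("%s: %s" % (name, "ok" if not missing else "missing %s" % missing))
```

An empty list means every expected directory is present; anything else names the entries that still need to be exported.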
4. Install cuDNN
- Download the cuDNN 7.0 library archive from the NVIDIA website
- Note: TensorFlow 1.12 and later requires cuDNN 7.2 or later
- Extract the archive that matches your CUDA version (run only one of the two tar commands below), then copy the files into the CUDA directory
  - sudo tar -xzvf cudnn-9.0-linux-x64-v7.0.tgz
  - sudo tar -xzvf cudnn-10.2-linux-x64-v8.0.2.39.tgz
  - cd cuda
  - sudo cp include/cudnn.h /usr/local/cuda/include
  - sudo cp lib64/libcudnn* /usr/local/cuda/lib64
  - sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
- Verify the installation: cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
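The grep command prints the three version macros from the header. As a small illustration, the same check could be done in Python as sketched below; `parse_cudnn_version` is a hypothetical helper, and it assumes the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain integers.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of a cuDNN header."""
    version = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % macro, header_text, re.MULTILINE)
        if match is None:
            return None  # macro not found in this header
        version.append(int(match.group(1)))
    return tuple(version)


if __name__ == "__main__":
    path = "/usr/local/cuda/include/cudnn.h"  # same file as the grep command above
    try:
        with open(path) as handle:
            print(parse_cudnn_version(handle.read()))
    except IOError:
        print("could not read %s" % path)
```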
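5. Install anaconda3
- Download Anaconda3-5.2.0-Linux-x86_64.sh from the Anaconda website
- bash Anaconda3-5.2.0-Linux-x86_64.sh
  - Step through the installer by pressing Enter slowly; mashing the key can skip past a prompt and leave the installation in a bad state
  - Do not install the graphics card driver, since it was already installed above
- export PATH
  - Open ~/.bashrc
  - export PATH=${PATH}:/usr/local/cuda-9.0/bin:~/anaconda3/bin
  - (this is the line mentioned in the note in the CUDA step)
- Verify the installation: conda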
6. Set up an anaconda virtual environment
- conda update conda
- conda create -n name python=x.x anaconda
  - e.g. conda create -n pytorch_27 python=2.7 anaconda
- Activate the environment
  - source activate name
- Deactivate
  - source deactivate
- Remove the environment
  - conda remove -n name --all
- From here on, activate the virtual environment first and install everything inside it
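Because every later step must run inside the virtual environment, a quick way to see which interpreter and environment are active is sketched below. It assumes conda exports the CONDA_DEFAULT_ENV variable on activation; `describe_environment` is a hypothetical helper name.

```python
import os
import sys


def describe_environment(environ=None):
    """Report the interpreter version and the active conda environment, if any."""
    environ = os.environ if environ is None else environ
    return {
        "python": "%d.%d" % (sys.version_info[0], sys.version_info[1]),
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),  # None when no environment is active
        "executable": sys.executable,
    }


if __name__ == "__main__":
    info = describe_environment()
    print("python %(python)s, conda env %(conda_env)s, %(executable)s" % info)
```

For the environment created above, the expectation would be Python 2.7 and an environment name such as pytorch_27.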

7. Install PyTorch

- Activate the virtual environment
  - source activate pytorch_27
- conda install pytorch torchvision -c pytorch
- Verify the installation
  - python
  - import torch

8. Build Caffe2

- Activate the virtual environment
- Although PyTorch and Caffe2 have been merged, one installation does not cover both; Caffe2 still has to be built separately.
- PyTorch was installed directly with conda, whereas Caffe2 is installed by cloning the repository from git and building it with anaconda.
- Reference: https://caffe2.ai/docs/getting-started.html?platform=mac&configuration=compile
- Install by building from source with anaconda
  - git clone --recursive https://github.com/pytorch/pytorch
  - cd pytorch
  - ./scripts/build_anaconda.sh --install-locally --cuda 9.0 --cudnn 7
    - After a later update, build_anaconda.sh was removed and build_android.sh appeared; in that case:
    - ./scripts/build_android.sh --install-locally --cuda 9.0 --cudnn 7
- Verify the installation: cd ~ && python -c 'from caffe2.python import core' 2>/dev/null && echo "Success" || echo "Failure"

9. Build ONNX

- ONNX is needed to convert the PyTorch network to Caffe2, but its installation kept conflicting with the Caffe2 build.
- After installing both PyTorch and Caffe2 as above, install ONNX with pip (the --no-binary flag is required)
  - pip install --no-binary onnx onnx
- Verify the installation
  - python
  - import onnx
  - from caffe2.python import core
  - Both imports must run without errors or warnings. Getting to this point took a very long time.
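The import checks above can also be run in one go. The sketch below tries each module and reports the error text instead of stopping at the first failure; `check_imports` is a hypothetical helper, and the module list simply mirrors the packages installed so far.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to None on success or to the error text."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as error:  # a broken build can raise more than ImportError
            results[name] = "%s: %s" % (type(error).__name__, error)
    return results


if __name__ == "__main__":
    for name, error in check_imports(["torch", "onnx", "caffe2.python.core"]).items():
        print("%-20s %s" % (name, "ok" if error is None else error))
```

Note that this only catches exceptions; warnings printed during import still have to be read by eye.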

10. Install SNPE

- Download SNPE
- https://developer.qualcomm.com/software/snapdragon-neural-processing-engine
- Sign up for an account, then download.
- Download snpe-1.17.0 and extract the archive.
- Before installing, install the required packages
  - sudo apt-get install python-dev python-matplotlib python-numpy python-protobuf python-scipy python-skimage python-sphinx wget zip
- Then run
  - source snpe-1.17.0/bin/dependencies.sh
- This appears to be a bash script that installs the packages SNPE needs.
- If running the .sh file fails with a permission error, change the permissions with
  - chmod a+x snpe-1.17.0/bin/dependencies.sh
- Finally, run check_python_depends.sh to check the required Python packages.
- In practice, installing SNPE amounts to adding the download location to PATH.
- Setting the paths is simple with bin/envsetup.sh (the path after -t is the folder where TensorFlow is installed)
  - cd snpe-1.17.0
  - e.g. source bin/envsetup.sh -t /home/kaist/anaconda3/envs/tensorflow_27/lib/python2.7/site-packages/tensorflow
  - Check that it works
    - Type snpe-tensorflow-to-dlc in the shell. If the output is a message asking for input arguments rather than "command not found", SNPE is installed correctly.

11. Install adb (Android SDK/NDK)

- adb acts as the transfer channel between the computer and the Android phone; it requires the Android SDK and NDK to be installed.
- Install from the Android Studio website
  - https://developer.android.com/studio/?hl=ko
- Launch Android Studio and install the latest SDK and NDK
  - cd android-studio
  - ./bin/studio.sh
  - Install or update them under File --> Settings --> Appearance & Behavior --> System Settings --> Android SDK --> SDK Tools
- Set the paths
  - Open ~/.bashrc, append the lines below at the very end, and save.
  - export ANDROID_SDK=/home/kaist/Android/Sdk
  - export ANDROID_NDK=/home/kaist/Android/Sdk/ndk-bundle
  - export PATH=$PATH:$ANDROID_SDK/platform-tools
  - export PATH=$PATH:$ANDROID_SDK/tools
  - export PATH=$PATH:$ANDROID_NDK
  - alias adb='/home/kaist/Android/Sdk/platform-tools/adb'
- Verify the installation: adb
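With all the tools installed, one pass over PATH can confirm that each command-line tool used in this tutorial is reachable from a fresh shell. This is a minimal sketch; `find_on_path` is a hypothetical helper that scans PATH by hand so it also runs on Python 2.7, and the tool list is taken from the steps above.

```python
import os


def find_on_path(tool, path=None):
    """Return the full path of an executable found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(os.path.expanduser(directory), tool)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    for tool in ("nvidia-smi", "nvcc", "conda", "snpe-onnx-to-dlc", "adb"):
        print("%-18s %s" % (tool, find_on_path(tool) or "NOT FOUND"))
```

The SNPE tools only show up after bin/envsetup.sh has been sourced in the same shell, so a NOT FOUND for snpe-onnx-to-dlc may just mean that step was skipped.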

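As in the TensorFlow version, the network must be trained in advance. This time we consider training with PyTorch.

The example code bundles the training code, the test code, and the code that runs on the mobile device.

As noted in the TensorFlow version, SNPE does not yet support the pixel-shuffle layer, so this example performs that step separately on the PC before checking the result.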

https://drive.google.com/open?id=1d8PbWGC2n4rwCuauv0ZD9D3xYbXZ6qNc
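To make the pixel-shuffle step concrete: it rearranges a tensor of shape (N, C*r*r, H, W) into (N, C, H*r, W*r), so the four output channels of the last convolution become one image that is twice as large in each direction. The NumPy sketch below reproduces that rearrangement outside PyTorch. It is an illustration only; the example script itself uses nn.PixelShuffle.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange (N, C*r*r, H, W) into (N, C, H*r, W*r), as nn.PixelShuffle(r) does."""
    n, c_in, h, w = x.shape
    c = c_in // (r * r)
    x = x.reshape(n, c, r, r, h, w)        # split channels into (c, row offset, col offset)
    x = x.transpose(0, 1, 4, 2, 5, 3)      # -> (n, c, h, row offset, w, col offset)
    return x.reshape(n, c, h * r, w * r)   # interleave the offsets into the spatial axes


if __name__ == "__main__":
    features = np.zeros((1, 4, 72, 88), dtype=np.float32)  # shape of the last conv output
    print(pixel_shuffle(features, 2).shape)  # (1, 1, 144, 176)
```

Since this is pure index shuffling with no learned parameters, running it on the PC after the device inference does not change the network's result.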

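You can load the pre-trained network parameters and check the results. Stepping through the code below shows how the model is converted (PyTorch model --> ONNX model --> DLC file) and lets you compare the network's output quality on the PC against its output quality on the mobile device.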
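The host-side part of that conversion (ONNX to DLC, then pushing files to the device) is a short sequence of shell commands. As a sketch, they could be collected in Python as below. The command names, flags and file names are the ones that appear in the script's notes; the function names, the dry-run switch and the single device directory are assumptions made for illustration.

```python
import subprocess

DEVICE_DIR = "/data/local/tmp/profile2_10"  # assumed working directory on the device


def host_commands(onnx_name="CASR_CNN.onnx", dlc_name="CASR_CNN.dlc"):
    """Return the host-side commands, in order, as argument lists."""
    return [
        ["snpe-onnx-to-dlc", "--model_path", onnx_name, "--dlc", dlc_name],
        ["adb", "push", "input.raw", DEVICE_DIR],
        ["adb", "push", "half_profile2.txt", DEVICE_DIR],
        ["adb", "push", dlc_name, DEVICE_DIR],
    ]


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.check_call(command)  # stop at the first failing step


if __name__ == "__main__":
    run_all(host_commands())  # dry run: only prints the commands
```

Keeping the commands as argument lists avoids shell quoting problems if a file name ever contains spaces.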

# Some standard imports
import io
import numpy as np
import os
from math import log10
import rawpy
from torch import nn
from torch.autograd import Variable
import torch.onnx
import torch
import onnx
from caffe2.python import core
from skimage.transform import resize
import caffe2.python.onnx.backend

# Super Resolution model definition in PyTorch
import torch.nn as nn
import torch.nn.init as init

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

def weights_init(m):
    # Conv weights ~ N(mean 0.0, std 0.02); BatchNorm weights ~ N(mean 1.0, std 0.02), bias 0
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

class CASR_CNN_10(nn.Module):
    def __init__(self, upscale_factor, inplace=False):
        super(CASR_CNN_10, self).__init__()
        self.relu = nn.ReLU(inplace=inplace)
        self.leakyrelu = nn.LeakyReLU(negative_slope=0.1, inplace=inplace)
        self.conv1 = nn.Conv2d(1, 64, kernel_size=3, padding=1, stride=1)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)
        self.conv5 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)
        self.conv6 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)
        self.conv7 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)
        self.conv8 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)
        self.conv9 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)
        self.conv10 = nn.Conv2d(64, upscale_factor ** 2, kernel_size=3, padding=1, stride=1)
        self.pixel_shuffle = nn.PixelShuffle(upscale_factor)

    def forward(self, x):
        x = self.conv1(x)

        # residual block 1
        x_tmp = x
        x = self.conv2(self.relu(x))
        x = self.conv3(self.relu(x))
        x = x + x_tmp

        # residual block 2
        x_tmp = x
        x = self.conv4(self.relu(x))
        x = self.conv5(self.relu(x))
        x = x + x_tmp

        # residual block 3
        x_tmp = x
        x = self.conv6(self.relu(x))
        x = self.conv7(self.relu(x))
        x = x + x_tmp

        # residual block 4
        x_tmp = x
        x = self.conv8(self.relu(x))
        x = self.conv9(self.relu(x))
        x = x + x_tmp

        x = self.conv10(self.relu(x))
        out = self.pixel_shuffle(x)

        # return both the shuffled image and the pre-shuffle feature map
        return out, x

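class CASR_CNN_10_non_shuffle(CASR_CNN_10):
    # ASSUMPTION: this variant is not shown in the original listing. It is
    # sketched here as the same network returning only the conv10 output
    # (before pixel shuffle), since SNPE does not support that layer.
    def forward(self, x):
        _, x = super(CASR_CNN_10_non_shuffle, self).forward(x)
        return x

if __name__ == '__main__':
    CASR_CNN = CASR_CNN_10_non_shuffle(2)  # 10 layers, upscale factor 2
    CASR_CNN.apply(weights_init)
    CASR_CNN.load_state_dict(torch.load('param/netG_epoch_399.pth', map_location=lambda storage, loc: storage))
    # print(CASR_CNN)
    CASR_CNN.train(False)

    # Input to the model
    x = Variable(torch.randn(1, 1, 72, 88), requires_grad=True)  # profile 2, (88, 72) --> (176, 144)

    # print("Make .onnx file")
    torch_out = torch.onnx._export(
        CASR_CNN,                         # model being run
        x,                                # model input (or a tuple for multiple inputs)
        "out/profile2_10/CASR_CNN.onnx",  # where to save the model (can be a file or file-like object)
        export_params=True)               # store the trained parameter weights inside the model file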

    # ==================================================================================================================
    # Load the ONNX ModelProto object. The model is a standard Python protobuf object
    model = onnx.load("out/profile2_10/CASR_CNN.onnx")
    prepared_backend = caffe2.python.onnx.backend.prepare(model)

    # Run the model in Caffe2.
    # Construct a map from input names to tensor data.
    # The graph of the model itself contains inputs for all weight parameters, after the input image.
    # Since the weights are already embedded, we just need to pass the input image.
    # Set the first input
    W = {model.graph.input[0].name: x.data.numpy()}

    # Run the Caffe2 net:
    # c2_out, c2_out_non_shuffle = prepared_backend.run(W)[0]
    c2_out = prepared_backend.run(W)[0]

    # Verify the numerical correctness up to 3 decimal places
    np.testing.assert_almost_equal(torch_out.data.cpu().numpy(), c2_out, decimal=3)
    # print("Exported model has been executed on Caffe2 backend, and the result looks good!")

    # ==================================================================================================================
    # Extract the workspace and the graph proto from the internal representation
    c2_workspace = prepared_backend.workspace
    c2_net_def = prepared_backend.predict_net

    # Now import the caffe2 mobile exporter
    from caffe2.python.predictor import mobile_exporter

    # TODO: remove two lines below - my caffe2 is not up-to-date and there was fix pushed for exporter
    from caffe2.python import core
    cnet = core.Net(c2_net_def)

    # Call Export to get the predict_net and init_net. These nets are needed for running things on mobile
    init_net, predict_net = mobile_exporter.Export(c2_workspace, cnet, c2_net_def.external_input)

    # Let's also save the init_net and predict_net to a file that we will later use for running them on mobile
    # with open('out/profile2_10/init_net.pb', "wb") as fopen:
    #     fopen.write(init_net.SerializeToString())
    # with open('out/profile2_10/predict_net.pb', "wb") as fopen:
    #     fopen.write(predict_net.SerializeToString())

    # ==================================================================================================================
    # Some standard imports
    from caffe2.proto import caffe2_pb2
    from caffe2.python import core, net_drawer, net_printer, visualize, workspace, utils
    import numpy as np
    import os
    import subprocess
    from PIL import Image
    from matplotlib import pyplot
    from skimage import io, transform

    # load the image
    # img_in = io.imread("half_profile2.png")
    # resize the image to 72x88 (height x width)
    # img = transform.resize(img_in, [72, 88])
    # save this resized image to be used as input to the model
    # io.imsave("cat_72x88.jpg", img)

    # load the resized image and convert it to YCbCr format
    img = Image.open("half_profile2_1.png")  # test image
    img_ycbcr = img.convert('YCbCr')
    img_y, img_cb, img_cr = img_ycbcr.split()

    # Let's run the mobile nets that we generated above so that caffe2 workspace is properly initialized
    workspace.RunNetOnce(init_net)
    workspace.RunNetOnce(predict_net)

    # Caffe2 has a nice net_printer to be able to inspect what the net looks like and identify
    # what our input and output blob names are.
    # print(net_printer.to_string(predict_net))
    model_input_blob = predict_net.external_input[0]
    # model_output_blob = predict_net.external_output[-1]
    model_output_blob = '43'  # for 10 layer CASR-CNN output
    # print('Input blob: ', model_input_blob)
    # print('Output blob: ', model_output_blob)

    # ==================================================================================================================
    # Now pass the test image to the model.
    # Save the input in NHWC order; this file is pushed to the device later
    img_y_array = np.array(img_y)[np.newaxis, :, :, np.newaxis].astype(np.float32)
    img_y_input = img_y_array / 255.0 - 0.5
    img_y_input.tofile('out/profile2_10/input.raw')  # NHWC order

    # Feed the same data in NCHW order to Caffe2
    img_y_array = np.array(img_y)[np.newaxis, np.newaxis, :, :].astype(np.float32)
    img_y_input = img_y_array / 255.0 - 0.5
    workspace.FeedBlob(model_input_blob, img_y_input)

    # run the predict_net to get the model output
    workspace.RunNetOnce(predict_net)
    # print(img_y.size)

    # Now let's get the model output blob
    img_out = workspace.FetchBlob(model_output_blob)
    # img_out.tofile('tmp.raw')

    # Apply the pixel shuffle on the PC, since SNPE does not support this layer
    img_out = torch.tensor(img_out).float()
    pixel_shuffle = nn.PixelShuffle(2)
    img_out = pixel_shuffle(img_out)
    img_out = img_out.numpy()

    # Bicubic upscaling of the input, used as the baseline
    img_y_up = img_y.resize((176, 144), Image.BICUBIC)
    img_y_up_array = np.array(img_y_up)[np.newaxis, np.newaxis, :, :].astype(np.float32)

    # Undo the input normalization and clip to the valid pixel range
    # img_out = img_y_up_array + (img_out + 0.5) * 255.0
    img_out = (img_out + 0.5) * 255.0
    img_out = (img_out[0, 0]).clip(0, 255)

    # =================================================================================================================
    img_out_y = Image.fromarray(np.uint8(img_out), mode='L')

    # get the output image following the post-processing step from the PyTorch implementation
    final_img = Image.merge(
        "YCbCr", [
            img_out_y,
            img_cb.resize(img_out_y.size, Image.BICUBIC),
            img_cr.resize(img_out_y.size, Image.BICUBIC),
        ]).convert("RGB")

    # build the bicubic-upscaled baseline image for comparison
    bicubic_img = Image.merge(
        "YCbCr", [
            img_y.resize(img_out_y.size, Image.BICUBIC),
            img_cb.resize(img_out_y.size, Image.BICUBIC),
            img_cr.resize(img_out_y.size, Image.BICUBIC),
        ]).convert("RGB")

    ori_img = Image.open("profile2_1.png")  # original image
    ori_img_ycbcr = ori_img.convert('YCbCr')
    ori_img_y, ori_img_cb, ori_img_cr = ori_img_ycbcr.split()
    ori_img_y_array = np.array(ori_img_y).astype(np.float32)

    # PSNR of the bicubic baseline on the Y channel
    mse = ((ori_img_y_array - img_y_up_array) ** 2).mean()
    psnr = 10 * log10(255 * 255 / mse)
    print("bic PSNR: %f dB" % (psnr))

    # PSNR of the network output computed on the PC
    mse = ((ori_img_y_array - img_out) ** 2).mean()
    psnr = 10 * log10(255 * 255 / mse)
    print("RVC PSNR in PC: %f dB" % (psnr))

    # Save the image, we will compare this with the output image from mobile device
    final_img.save("RVC_PC.png")

    bicubic_img.save("Bic.png")

    """
    =================================================================================================================
    On the mobile platform, the network is run with SNPE.

    source activate Snpe
    source /home/kaist/Desktop/snpe-1.17.0/bin/envsetup.sh -t /home/kaist/anaconda3/envs/Snpe/lib/python2.7/site-packages/caffe2
    cd /home/kaist/Desktop/android_pytorch/out/profile2_10

    1) .onnx file --> dlc file
    snpe-onnx-to-dlc --model_path CASR_CNN.onnx --dlc CASR_CNN.dlc

    2) transfer the dlc file, the input image file, and the input list to the Android device
    adb devices
    adb push input.raw /data/local/tmp/profile2_10
    adb push half_profile2.txt /data/local/tmp/profile2_10
    adb push CASR_CNN.dlc /data/local/tmp/profile2_10

    3) Run the network
    adb shell
    export SNPE_TARGET_ARCH=arm-android-gcc4.9
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/local/tmp/snpeexample/$SNPE_TARGET_ARCH/lib
    export PATH=$PATH:/data/local/tmp/snpeexample/$SNPE_TARGET_ARCH/bin
    cd /data/local/tmp/profile2_10
    snpe-net-run --container CASR_CNN.dlc --input_list half_profile2.txt --use_gpu
    exit

    4) Get the results
    adb pull /data/local/tmp/profile2_10/output output_android_gpu
    =================================================================================================================
    """

    # Load the raw output pulled from the device (float32, NHWC) and convert it to NCHW
    out_raw = np.fromfile('out/profile2_10/output_android_gpu/output/Result_0/43.raw', dtype=np.float32)
    out_raw = out_raw.reshape((1, 72, 88, 4))
    out_raw = np.transpose(out_raw, [0, 3, 1, 2])

    # Apply the pixel shuffle on the PC, as for the Caffe2 output above
    out_raw = torch.tensor(out_raw).float()
    # pixel_shuffle = nn.PixelShuffle(2)
    out_raw = pixel_shuffle(out_raw)
    out_raw = out_raw.numpy()
    out_raw = (out_raw + 0.5) * 255.0
    out_raw = (out_raw[0, 0]).clip(0, 255)

    img_out_raw_y = Image.fromarray(np.uint8(out_raw), mode='L')
    final_img_mobile = Image.merge(
        "YCbCr", [
            img_out_raw_y,
            img_cb.resize(img_out_y.size, Image.BICUBIC),
            img_cr.resize(img_out_y.size, Image.BICUBIC),
        ]).convert("RGB")

    # PSNR of the network output computed on the Android device
    mse = ((ori_img_y_array - out_raw) ** 2).mean()
    psnr = 10 * log10(255 * 255 / mse)
    print("RVC PSNR in Android: %f dB" % (psnr))
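Two steps in the script are easy to get wrong: reinterpreting the flat .raw buffer from the device, which is stored in NHWC order, and computing the PSNR. The sketch below isolates both with small synthetic arrays so they can be tried without a device. The function names are hypothetical, and the tiny shapes stand in for the real (1, 72, 88, 4) output.

```python
import math

import numpy as np


def nhwc_raw_to_nchw(buffer, height, width, channels):
    """Interpret a flat float32 buffer as (1, H, W, C) and return it as (1, C, H, W)."""
    array = np.asarray(buffer, dtype=np.float32).reshape(1, height, width, channels)
    return np.transpose(array, (0, 3, 1, 2))


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    diff = np.asarray(reference, dtype=np.float64) - np.asarray(estimate, dtype=np.float64)
    mse = float(np.mean(diff ** 2))
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)


if __name__ == "__main__":
    flat = np.arange(2 * 3 * 4, dtype=np.float32)          # stand-in for np.fromfile(...)
    print(nhwc_raw_to_nchw(flat, 2, 3, 4).shape)            # (1, 4, 2, 3)
    print(psnr(np.zeros((2, 2)), np.full((2, 2), 255.0)))   # 0.0
```

Reshaping straight to (1, 4, 72, 88) without the transpose would still run, but it would scramble the channels, which shows up as a very low PSNR rather than an error.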
