Xiang E. Xiang (项翔)

PhD in Computer Science at Johns Hopkins University (JHU)

Applied Scientist at Amazon.com, Inc. (AWS AI, Amazon Textract & Amazon Rekognition)

Welcome to Xiang's personal homepage. All content on this homepage is the sole responsibility of Xiang and does not reflect the opinions of Xiang's previous or current employer.


[10/19] Serving on the program committee of ACM Multimedia 2019.

[6/19] Attending CVPR at Long Beach.

[5/19] The Textract project that I participated in has launched for General Availability.

[2/19] After about a year, my PhD Dissertation is finally publicly downloadable at this page of the Johns Hopkins Sheridan Libraries' website.

[6/18] Attending CVPR at SLC.

[5/18] Serving on the program committee of the 4th DLMIA at MICCAI'18.

[4/18] Serving on the program committee of the 8th workshop on AMFG at CVPR'18.

[3/18] Attending the PhD Forum of WACV'18 at Lake Tahoe, supported by a travel grant.

[10/17] Attending ACM Multimedia at Bay Area.

[7/17] Serving as a committee member of the 3rd Workshop on Deep Learning in Medical Image Analysis (DLMIA) at MICCAI'17.

[5/17] Giving a talk at MACV'17 held at UPenn.

[3/17] Serving as technical staff for the PASCAL in Detail Challenge at CVPR'17.

[2/17] Our team (JHU) won two 2nd Place awards in the facial expression recognition track and AU detection track in the EmotionNet 2017 Challenge.

[1/17] Finished writing my first grant proposal (as a student!) with faculty mentors during the winter holiday.


I work in the e-commerce and logistics industry, which delivers products and business-oriented technologies to business customers in both online and offline markets. I specialize in computer vision, visual sensing, robot vision, pattern recognition, and machine vision/learning/intelligence.

"It's time to go to work." -Shane Battier


I work in the area of video, facial & medical image analysis and health informatics. I specialize in action quality assessment, tracking and temporal modeling, and restrictive representation learning. See also my research proposal.

Research questions

Representing Image Sequences as Image Sets versus Sequential Models

Representing Image Sequences as Image Sets versus Sequential Models for Restrictive Learning: How to train with limited data?

Albert Einstein said that everything should be made as simple as possible, but not simpler. Since complicating things is straightforward, I am very curious whether a given simple model really does not work well. With such a "less is more" mindset, there is always a line connecting my series of works, which is to address the limited annotated data problem and enable restrictive learning. While it may be favored in industry to engineer more annotated data, it is of great scientific value to examine whether and how we can train a model with limited training data.

Image sequences come in various forms. For example, a video is a sequence of images frame by frame over time; a 3D volumetric image is a sequence of images slice by slice over one dimension of the 3D space; a typical multiresolution image analysis slides over an image and convolves it with kernels of different sizes to induce a sequence of image patches over the scale space.

Now, if we simply model an image sequence as a set of orderless images, we lose the sequential information. Possibly due to this obvious drawback, the area is not heavily investigated. In fact, we can still learn an informative representation (say, the covariance matrix) from an image set. One approach that is relatively popular nowadays is pooling or aggregating the per-image features into a single feature vector. Another approach, from a more intuitive perspective, is exploiting the low-rankness over image intensities and the group sparsity over the linear representation vectors so that all the images collaboratively represent a set of images. The elegance of the model also comes from the fact that a linear model is simple.
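As a minimal sketch of the two orderless representations mentioned above (a covariance matrix and a pooled feature vector), assuming each image has already been turned into a feature vector: the function names `set_covariance` and `set_mean_pool` and the random features are hypothetical and only for illustration, not the implementation used in the papers.

```python
import numpy as np

def set_covariance(features):
    """Covariance of an image set.

    features: (n_images, d) array, one feature vector per image.
    The result does not depend on the order of the rows.
    """
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0, keepdims=True)
    # Unbiased estimate; guard against a set with a single image.
    return centered.T @ centered / max(len(features) - 1, 1)

def set_mean_pool(features):
    """Aggregate per-image features into a single vector by averaging."""
    return np.asarray(features, dtype=float).mean(axis=0)

# Hypothetical features: 8 images, 4 dimensions each.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4))
shuffled = frames[rng.permutation(len(frames))]

cov = set_covariance(frames)
pooled = set_mean_pool(frames)
```

Shuffling the frames leaves both `cov` and `pooled` unchanged, which is exactly the sense in which the sequential information is lost.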

On the other hand, sequential modeling is a well-studied area: global vs. local, long-term vs. short-term, Markov vs. semi-Markov, and so on. However, I tend to revisit simple models such as frame/slice differencing, motion modeling, and correlating features over time.
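Frame/slice differencing, the simplest of the sequential models listed above, can be sketched as follows. This is an illustrative toy, assuming a grayscale video stored as a `(T, H, W)` array; the helper names `frame_differences` and `motion_energy` are hypothetical.

```python
import numpy as np

def frame_differences(video):
    """Absolute difference between consecutive frames.

    video: (T, H, W) array. Returns a (T-1, H, W) array.
    """
    video = np.asarray(video, dtype=float)
    return np.abs(np.diff(video, axis=0))

def motion_energy(video):
    """Mean absolute change per frame transition, a crude motion cue."""
    return frame_differences(video).mean(axis=(1, 2))

# Toy video: one pixel switches on in frame 1 and off again in frame 2.
video = np.zeros((3, 4, 4))
video[1, 0, 0] = 1.0
energy = motion_energy(video)
```

Unlike the image-set view, this representation changes when the frames are reordered, so it keeps (a small part of) the temporal structure.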

With such a "less is more" mindset, I even doubt whether going from an image to an image sequence is always necessary. At least in certain real-world computer vision systems, it is critical to make recognition, prediction and decision based on the visual information available exactly here and now. In that case, it is wise to make full use of a single image and explore it deeply. One empirical work I performed is to transfer image features to a different task simply by regularizing the image-based model. Another example is gaining discriminative power by normalizing the deep features as well as the weights of deep networks.

A sequence of images usually costs more storage and processing time than a single image. While I am not sure whether what Einstein said is always correct, I have already realized the beauty of a simpler model - less computation, less memory occupation, less parameter tuning, or simply better understanding. When computation, memory and tuning effort matter to us, we may want to simply maximize the performance of a simple model.
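The idea of normalizing both the deep features and the weights can be sketched in a few lines. This is only an illustration under stated assumptions, not the NormFace implementation: the toy features and weights are made up, and the scale is a fixed hypothetical value (20.0) rather than a learnt parameter as described in the NormFace entry of the publication list.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def cosine_logits(features, weights, scale=1.0):
    """Logits from unit-norm features and unit-norm class weights.

    features: (n, d); weights: (classes, d). Returns (n, classes).
    Without scaling, every logit lies in [-1, 1].
    """
    return scale * l2_normalize(features) @ l2_normalize(weights).T

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy data: each sample is perfectly aligned with its own class weight.
features = np.array([[3.0, 0.0], [0.0, 0.5]])
weights = np.array([[1.0, 0.0], [0.0, 2.0]])

probs_unscaled = softmax(cosine_logits(features, weights, scale=1.0))
probs_scaled = softmax(cosine_logits(features, weights, scale=20.0))
```

Even for perfectly aligned samples, the unscaled probabilities stay well below 1 because the logits are bounded, so the softmax loss cannot approach zero; multiplying by a scale removes that ceiling.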

[CVML]: Computer Vision & Machine Learning; [CVHC]: Computer Vision in Health Care (Health Informatics).


[CVHC] Haibo Wang, Xiang Xiang, Kees van Zon. Subject identification systems and methods. US Patent, No. 16/014,046, 2018. [google] [System work: proposed a hybrid face recognition solution for patient identification]

Journal Preprint

[CVML]: Z. Wang, S. Lao, X. Xiang, B. Zhang, Y. Zhou, J. Han: Pruning Multi-view Stereo Network for Efficient 3D Reconstruction. Under journal review, 2020.


[CVML] Xiang Xiang: Image-set, Temporal and Spatiotemporal Representations of Videos for Recognizing, Localizing and Quantifying Actions. Johns Hopkins University Ph.D. Dissertation, Baltimore, USA, June 2018. [pdf][talk-a][talk-b][official page at JHU Sheridan libraries][official announcement at JHU CS Dept][semanticscholar]

[CVML] Xiang Xiang: Object Tracking and Segmentation in Dynamic Scenes: University of Chinese Academy of Sciences M.S. Dissertation, Beijing, China, June 2012. [PDF][bib]


[CVML] Junqin Huang, Zhenhuan Huang, Xiang Xiang, Xuan Gong, Baochang Zhang: Long-Short Graph Memory Network for Skeleton-based Action Recognition. To appear at IEEE Winter Conference on Applications of Computer Vision 2020 (WACV 2020), Aspen, USA. [PDF]

[CVHC] Xiang Xiang: Beyond Deep Feature Averaging: Sampling Videos Towards Practical Facial Pain Recognition. Full paper at IEEE Conference on Computer Vision and Pattern Recognition 2019 (CVPR 2019) workshop on Face and Gesture Analysis for Health Informatics, Long Beach, USA. [PDF][thecvf][upmc.fr]

[CVML] Xiang Xiang and Trac D. Tran: Linear Disentangled Representation Learning for Facial Actions. IEEE Transactions on Circuits and Systems for Video Technology (IEEE T-CSVT), Volume: 28, Issue: 12, 2018. [ieee][arxiv][github][Modeling Work (new perspective): video-based sparsity + PCP]

[CVML] Xiang Xiang*, Ye Tian*, Austin Reiter, Gregory D. Hager, Trac D. Tran: S3D: Stacking Segmental P3D for Action Quality Assessment. Full paper at IEEE International Conference on Image Processing 2018 (ICIP 2018), Athens, Greece. (* Equal contribution.) [ieee][sigport][researchgate][github][linkedin][grant proposal][followup works at cvpr19]

[CVHC] Xiang Xiang: Effect of Spatial Alignment in Cataract Surgical Phase Recognition. Abstract paper at IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018) workshop on Fine-Grained Instructional Video Understanding, Salt Lake City, USA. [pdf][umich][poster]

[CVHC] Wentao Zhu, Xiang Xiang, Trac D. Tran, Gregory D. Hager and Xiaohui Xie: Adversarial Deep Structural Networks for Mammographic Mass Segmentation. Full paper at IEEE International Symposium on Biomedical Imaging 2018 (ISBI 2018), Washington DC, USA. [arxiv] [github][linkedin]

[CVHC] Xiang Xiang, Wentao Zhu, Trac D. Tran, Gregory D. Hager: Survey on Multi-Scale CNNs for Lung Nodule Detection. Abstract paper at IEEE International Symposium on Biomedical Imaging 2018 (ISBI 2018), Washington DC, USA.

[CVHC] Xiang Xiang*, Ye Tian*, Gregory D. Hager, Trac D. Tran: Assessing Pain Levels from Videos Using Temporal Convolutional Networks. Abstract paper at IEEE Winter Conf. on Applications of Computer Vision 2018 (WACV 2018) workshop on Computer Vision for Active and Assisted Living, Lake Tahoe, USA. (* Equal contribution.) [youtube]

[CVML] Feng Wang, Xiang Xiang, Jian Cheng, Alan L. Yuille. NormFace: L2 Hypersphere Embedding for Face Verification. Full long paper at ACM MultiMedia (MM) Conference 2017, Mountain View, USA. [arxiv][github][researchgate][comment][linkedin][csdn in Chinese][Theoretic work (3 new propositions): the norms of weights and features matter for nonlinear representation; although demonstrated on faces, this is a generic deep learning paper showing that when L2 normalization is applied prior to softmax, the networks don't converge. The novel design is to add a scale layer that is learnt to help the network converge. Theoretical results then give lower bounds of the loss with respect to the scale parameter]

[CVHC] Feng Wang, Xiang Xiang*, Chang Liu, Trac D. Tran, Austin Reiter, Gregory D. Hager, Jian Cheng and Alan L. Yuille. Regularizing Face Verification Nets for Pain Intensity Regression. Full paper at IEEE International Conference on Image Processing 2017 (ICIP 2017), Beijing, China. (*corresponding author. Oral presentation.) [ieee][researchgate][arxiv] [github][slides][linkedin][Modeling work (new objective): fine-tuning a face net with a regression loss regularized by a classification loss to induce discrete values]

[CVML] Hao Zhu, Feng Wang, Xiang Xiang and Trac D. Tran. Supervised Hashing with Jointly Learning Embedding and Quantization. Full paper at IEEE International Conference on Image Processing 2017 (ICIP 2017), Beijing, China. [ieee][researchgate][linkedin][Modeling work (new objective): relaxed optimization for image retrieval]

[CVML] Xiang Xiang, Dung N. Tran, Trac D. Tran. Sparse Unsupervised Clustering with Mixture Observations for Video Summarization. Abstract paper at IEEE Applied Imagery Pattern Recognition Workshop (AIPR) 2017, Washington DC, USA. [ieee][researchgate]

[CVML&HC] Xiang Xiang and Trac D. Tran: Pose-Selective Max Pooling for Measuring Similarity. Workshop paper at IAPR International Conference on Pattern Recognition (ICPR) 2016 workshops, Cancun, Mexico. LNCS, vol. 10165 (Video Analytics), ISBN: 978-3-319-56686-3. [springer][github][researchgate][arxiv][aau][linkedin][google][Algorithmic work (2 new algorithms): video-based keyframe + deepface]

[CVML] Xiang Xiang and Trac D. Tran: Recursively Measured Action Units. Workshop paper at IAPR International Conference on Pattern Recognition (ICPR) 2016 workshops, Cancun, Mexico. LNAI, vol. 10183 (Pattern Recognition of Social Signals), 978-3-319-59258-9. [researchgate][books.google][springer][Algorithmic work (new algorithm): video-based LSTM + SMP]

[CVML] Xiang Xiang, Minh Dao, Gregory D. Hager, Trac D. Tran: Hierarchical Sparse and Collaborative Low-Rank Representation for Emotion Recognition. Full paper at IEEE International Conference on Acoustics, Speech, and Signal Processing 2015 (ICASSP 2015), Brisbane, Australia. ISBN: 978-1-4673-6997-8. [ieee][arxiv][github] [mathworks] [youtube] [elsevier][Modeling work (2 new objectives): video-based sparsity + PCP]

[CVHC] Xiang Xiang, Daniel Mirota, Austin Reiter, Gregory D. Hager: Is Multi-Model Feature Matching Better for Endoscopic Motion Estimation? Workshop paper at International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2014 workshops, Boston, USA. LNCS, vol. 8899 (Comp.-Ass. & Rob. End.), 2014, ISBN: 978-3-319-13409-3. [springer][nih][researchgate][code][google][Empirical work: video-based 3D vision]

[CVML] Xiang Xiang, Hong Chang, Jiebo Luo: Online Web-Data-Driven Segmentation of Selected Moving Objects in Videos. Full paper at Asian Conference on Computer Vision 2012 (ACCV 2012): 134-146, Daejeon, Korea. LNCS, vol. 7725 (ACCV), 2013, ISBN: 978-3-642-37443-2. [demo: over 10 seconds!!!][springer][acm][pdf][researchgate][github][youtube] [dataset][google][cvpapers][visionbib][slides][Modeling work (new objective for segmentation + new algorithm for tracking): video-based MIL tracking + Graph cuts]

[CVHC] Xiang Xiang: A Localization Framework under Non-rigid Deformation for Robotic Surgery. Full paper at International Symposium on Visual Computing (ISVC) 2011: 11-22, Lake Tahoe, USA. LNCS, vol. 6938 (Advances in Visual Computing), 2011, ISBN: 978-3-642-24027-0. [springer][acm][linkedin][researchgate][google][visionbib][Modeling work (new formulation): geometry based registration]

[CVML] Xiang Xiang: An Attempt to Segment Foreground in Dynamic Scenes. Full paper at International Symposium on Visual Computing (ISVC) 2011: 124-134, Lake Tahoe, USA. LNCS, vol. 6938 (Advances in Visual Computing), 2011, ISBN: 978-3-642-24027-0. [springer][acm][researchgate][google][Empirical work: video-based graph cuts]

[CVML] Xiang Xiang, Wenhui Chen, Du Zeng: Intelligent Target Tracking and Shooting System with Mean Shift. Full paper at IEEE International Symposium on Parallel and Distributed Processing and Applications (ISPA) 2008: 417-421, Sydney, Australia. Parallel and Distributed Processing, IEEE, 2008, ISBN: 978-0-7695-3471-8. [ieee][pdf][github][youtube][whu][sjtu][Embedded vision work: video-based tracking camera] [Interesting cited paper] (a typical paper from this conference; no free lunch, but welcome to the jungle! It turns out that I started working on multi-core computing 10 years ago in college and have observed it move to multiple CPUs/GPUs, which has driven the great success of modern vision and graphics applications.)

Research interest

My research focuses on the intersection between computer vision and matrix theory. In particular, solving visual recognition using representation learning involves discovering the data structure. A data structure, in the simplest terms, is a structured representation of information. For example, video frames are highly correlated, so an underlying low-rank matrix exists if we represent each frame as a vector and arrange the vectors to form a matrix. Low-rank matrix approximation does not necessarily give us a visually meaningful object, so once again we apply other prior knowledge, such as structured sparsity, in the representation learning.
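The low-rank structure of correlated frames can be illustrated with a small sketch. It uses a plain truncated SVD as a stand-in (not the sparse or PCP-based formulations from the papers listed above), and the synthetic video, the noise level and the names `frames_to_matrix` and `low_rank_approx` are assumptions made only for this example.

```python
import numpy as np

def frames_to_matrix(video):
    """Vectorize each frame and stack the vectors as columns.

    video: (T, H, W) array. Returns an (H*W, T) matrix.
    """
    video = np.asarray(video, dtype=float)
    return video.reshape(len(video), -1).T

def low_rank_approx(matrix, rank):
    """Best rank-`rank` approximation in the least-squares sense (truncated SVD)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

# Synthetic video: a fixed background whose brightness varies over time,
# so every frame is a multiple of the same image (rank 1), plus small noise.
rng = np.random.default_rng(0)
background = rng.normal(size=(6, 6))
brightness = np.linspace(0.5, 1.5, 10)
video = brightness[:, None, None] * background
video_noisy = video + 0.01 * rng.normal(size=video.shape)

M = frames_to_matrix(video_noisy)
L = low_rank_approx(M, rank=1)
residual = np.linalg.norm(M - L) / np.linalg.norm(M)
```

Here `residual` is small because a single rank-1 component explains almost all of the frame matrix; what remains is the noise, which in a real video would also contain the moving, visually meaningful content that additional priors are meant to capture.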


Fresh Research, Colorful Life, 2009. [google]

A Brief Review on Visual Tracking Methods, 2011. [google][researchgate]

Xiang Xiang and Daniel Gildea: EM Algorithm Lecture Notes, 2012. [rochester]

Surgical Phase Recognition, 2012. [dropbox][youtube] (eye tracking)

One Way to Implement Trimmed ICP algorithm, 2012. [dropbox]

Learning CRFs for Stereo Matching, 2013. [dropbox]

A Review on Stereo Matching Using Belief Propagation, 2013. [dropbox]

Xiang Xiang, Fabian Prada and Hao Jiang. Joint Sparse and Low-Rank Representation for Emotion Recognition, 2014. [dropbox]

Sample GBO Questions on Computer Vision, Machine Learning and Optimization, 2014. [blogger]

Tutorial on setting up Caffe, 2014. [dropbox]

From Receipts to Charts: Automating the Accounting Workflow. MASC-SLL, JHU, 2015 [researchgate].

A Review on EigenFace, 2015. [dropbox][youku]

A Review on Dimensionality Reduction, 2015. [dropbox]

A Survey on Audio-Visual Speech Recognition, 2015. [researchgate][linkedin]

A C++ Solution for Testing the VGG_Face Deep Model, 2016. [github][linkedin]

Soccer-Field Computer Vision, 2016. [linkedin]

A Future of Fully-Auto Production of Animations, 2016. [linkedin]

Extraction of Object Skeleton, 2016. [linkedin][dropbox]

Multi-Scale Deep 3D CNNs for Automatic Detection and Segmentation of Pulmonary Nodules, 2017. [dropbox]

A Brief Review on Video Representation, 2018. [slideshare]


Xiang Xiang received the Ph.D. degree from Johns Hopkins University (JHU), Baltimore, USA, in Summer 2018, the M.S.E. degree from Johns Hopkins University in Spring 2014, the M.S. degree from the Institute of Computing Technology (ICT) at the Chinese Academy of Sciences (CAS), namely the University of Chinese Academy of Sciences (UCAS), Beijing, China, in Summer 2012, and the B.S. degree from Wuhan University (WHU), Wuhan, China, in 2009, all in Computer Science (CS). His research interests are computer vision and machine learning with a focus on video representation learning. For a glimpse of his research, please refer to his DBLP page. For more details, please see also his Google Scholar page and follow links to papers therein. His academic genealogy can be found here and his Erdős Number is 5.

In detail, he studied at JHU as a PhD student in CS from Fall 2012 and initially worked with Gregory D. Hager on 3D reconstruction from videos. His M.S.E. thesis was advised by Gregory D. Hager and completed in Summer 2014. From then on, having also passed the PhD candidacy exam, he worked quite independently towards his dissertation and more closely with Trac D. Tran on learning sparse representations. He first stayed in CIRL and LCSR, where he was also exposed to medical imaging and robotics, and then stayed in CCVL, where he collaborated on learning deep representations. Along those lines and based on his own ideas, he eventually wrote up a Ph.D. dissertation, which was advised by Gregory D. Hager and Trac D. Tran and read by Alan Yuille, Austin Reiter, and René Vidal. In the meantime, he interned at the Imaging Division of FDA's Center for Devices and Radiological Health, Silver Spring, MD, USA, in Summer 2017 as an Oak Ridge Institute for Science and Education (ORISE) fellow, at Philips Research North America, Cambridge, MA, USA, in Summer 2016, and at Emotient R&D (now part of Apple AI), San Diego, CA, USA, in Summer 2015.

Before moving to the US in 2012, he studied at UCAS as a PhD student in CS from Fall 2009 and worked with Xilin Chen on developing object tracking algorithms. On that topic, he eventually finished an M.S. dissertation, which was advised by Xilin Chen and Hong Chang and read by Jian Cheng and Yonghong Tian. He stayed in VIPL and JDL, where he was also exposed to face recognition and video coding. Even earlier, he studied at WHU from Fall 2005 and got hands-on experience designing real-time object tracking systems in 2008, when he was also recommended to UCAS for PhD studies. He stayed in Yagi Lab as an exchange student at Osaka University, Japan, during 2008-2009. His B.S. thesis was advised by Yulin Wang and defended in Summer 2009. For more info, please see also http://www.cs.jhu.edu/~xxiang/ or drop a line at xxiang@cs.jhu.edu.

His current position is an Applied Scientist at Amazon Web Services (AWS), Seattle, WA, USA. He has been working for Amazon Textract and Amazon Rekognition at the Computer Vision Science Division of Amazon AI since Summer 2018.


Associate Editor for AMIS Journal: Mathematical Foundations of Computing; Reviewer for Pattern Recognition'18-, IEEE series Trans. Pattern Analysis and Machine Intelligence'19-, Trans. Neural Networks and Learning Systems'18-, Trans. Image Processing'18-, Trans. Circuits and Systems for Video Technology'15-, Trans. Affective Computing'18-, Trans. Biometrics, Behavior, and Identity Science'19-, as well as IEEE Access, IEEE Geoscience and Remote Sensing Letters, Multimedia Tools and Applications'18-, SPIE Journal of Electronic Imaging'18-, CVPR'19, ACM MM'19, IROS'19, ICCV'15, MICCAI'17'16'15, ICIP'18'17'12'11, ICASSP'17, ICRA'14, and ICME'12.

Technical staff of CVPR'17 PASCAL in detail challenge. Program committee member of ACM MM'19, DLMIA@MICCAI'17 and AMFG@CVPR'18.

Volunteer for ACM MM'09'17 and ACM ICMI'10, ISCAS'16, NGO Junior Achievement 09-10.

JHU-CS: Faculty Liaison Czar of the Graduate Student Council in Academic Year 17-18 & Graduate Admission Representative 13-18.


I taught EN.600.311 Sparse Representation in Computer Vision as an independent instructor in Winter 2015, and I am a recipient of the JHU Teaching Academy's Preparing Future Faculty Program. [youtube] In addition, I was the Head T.A. for EN.600.320/420/620 Parallel Programming taught by Randal Burns during Spring 2018, the co-T.A. for EN.600.461/661 Computer Vision taught by Austin Reiter during Fall 2017 & 2015 and EN.600.461/661 Computer Vision taught by René Vidal during Fall 2013, the Head T.A. for EN.600.683 Vision as Bayesian Inference taught by Alan L. Yuille during Spring 2017, the Head T.A. for EN.600.477/677 Causal Inference taught by Ilya Shpitser during Fall 2016, and the co-T.A. for EN.520.648 Sparse Recovery and Compressed Sensing taught by Trac D. Tran during Spring 2015.


"We’re gonna get it, get it together and go up, and up, and up. " - <Up & Up> Coldplay, 2015.

Manager (领队)/Player-coach of Hopkins CSSA soccer club in Academic Year 17-18. Winning Ivy League Cup'18 championship, DC Cup'15 championship, JHU intramural'16 fall championship, JHU intramural'15 fall & 17 fall finalist, and Ivy League Cup'17 Best Fighting Award.

Member of JHU-CS soccer team'13-18. Winning JHU intramural'17 spring championship (coed).

My wife is a CMU alumna and we have a daughter. We help with fundraising for CEE, a non-CS department at CMU.


2015: (left two) ICASSP @ Brisbane, Australia; (middle) CIRL @ Cambridge, MD; (right) Emotient @ San Diego, CA

2016: (left) Philips Research @ Cambridge, MA; (middle) Harvard big data conference; (right) JHMI symposium @ Baltimore, MD

2017: CCVL @ Baltimore, MD (above); FDA @ Silver Spring, MD (below left); ACM MM @ Mountain View, CA (below middle & right)

WACV 2018 @ Lake Tahoe, CA & NV.

ISBI 2018 @ Washington, DC.

Dissertation defense @ JHU, Baltimore, MD

CVPR 2018 @ Salt Lake City, UT

Contributors of Deep Sail project of Stanford AI Lab (left), its Boston University team (middle) and Pascal-in-Detail API of JHU-CCVL (right).

Demo for fun: Expression-preserving 3D face modeling from video.

Hopkins CSSA Soccer wins 2018 Ivy Cup @ Brown University, Providence, RI

"High up above or down below, if you never try you'll never know, just what you're worth." - <Fix you> Coldplay, 2005.


If you disagree with any content on this webpage, please first drop a line to xxiang@cs.jhu.edu. Any comment is welcome and will be accommodated with the best sincerity. Thanks for your cooperation!

Info in other languages

The following linked webpages contain information in Chinese regarding my research work at the Chinese Academy of Sciences and Wuhan University, respectively. [intel][ict][vipl][iip][jdl][jdl][jdl][slide][slide][sina][doc88][doc88]