Research Data Life Cycle

Overview

Data is a set of values of qualitative or quantitative variables that scholars draw upon to support their claims and/or produce new knowledge.

This guide goes over the six steps of the Data Life Cycle and recommends tools for each step.

Step 1: Data Creation

Before collecting data, it is best to plan ahead and ask yourself: What types and formats of data will be collected? Is there any copyright issue involved? What are the best approaches to store and back up data?
You may go to the Research Data Management library guide for more information.

Data can be collected…

- Through observation – data are generally collected once and are unique
- By experimenting – data come from experiments, which in general can be repeated
- By simulation – data come from test models and can usually be reproduced
- By researching sources – data are derived from literature, manuscripts, publications, etc.
- By data processing – combining, reprocessing, (re)grouping, etc. of data created before
- By using existing data

Library Services

If you have difficulty filling out the research data management plans (DMPs) requested by publishers or funding agencies, please feel free to contact our Scholarly Communications team at lib-sct@hkbu.edu.hk.

Step 2: Data Processing

This step involves data entry (if the raw data is not collected in a digital format), data conversion (from one system to another, or from one format to another), and data cleaning.
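As an illustration of format conversion, the following minimal sketch turns CSV text into JSON using only the Python standard library. The sample data, the column names, and the function name `csv_to_json` are made up for this example; a real project would read from and write to files instead.

```python
import csv
import io
import json

# Made-up sample data; in practice this would come from a file.
csv_text = "id,country,score\n1,Japan,4.0\n2,Canada,3.5\n"


def csv_to_json(text):
    """Convert CSV text with a header row into a JSON array of objects."""
    # DictReader uses the first row as field names.
    rows = list(csv.DictReader(io.StringIO(text)))
    # Note: every value stays a string; converting types is a separate step.
    return json.dumps(rows, indent=2)


json_text = csv_to_json(csv_text)
print(json_text)
```

Because the CSV format carries no type information, numbers arrive as strings. Converting them explicitly, and checking the result, is usually part of the data cleaning that follows.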

Data cleaning requires tedious and time-consuming manual work, but its importance should not be underestimated. Proper data cleaning saves researchers from having to come back to this step at a later stage of the research and helps them avoid drawing false conclusions.

The following data cleaning tips can serve as a starting point:

- Clear field labeling – make sure you can still understand the labels a year later
- Remove unwanted observations – including duplicate or irrelevant observations
- Filter unwanted outliers – only for the suspicious measurements that are unlikely to be accurate
- Handle missing data – by dropping observations with missing values or imputing missing values based on other observations
- Fix structural errors – including typos, inconsistent capitalization, and inconsistent name formats
- Controlled vocabularies may help – e.g., develop a small dictionary to remind yourself to use "United States" (instead of "USA" or "America") or "computer" (instead of "computers" or "PC") throughout the document
- Beware of strange characters – especially when you copy and paste web content directly into Excel; an invisible strange character is often added at the end of a sentence
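Several of the tips above can be sketched in a few lines of Python. This is only an illustration using the standard library, not a recommended workflow: the field names, the sample rows, the vocabulary table `COUNTRY_VOCAB`, and the choice to drop rows with a missing score are all assumptions made for the example.

```python
import unicodedata

# Hypothetical controlled vocabulary: lower-cased variants -> preferred term.
COUNTRY_VOCAB = {
    "usa": "United States",
    "america": "United States",
    "united states": "United States",
}


def clean_text(value):
    """Remove invisible characters and surrounding whitespace."""
    # Unicode categories starting with "C" are control and format characters,
    # e.g. the zero-width space (U+200B) that often sneaks in from web pages.
    visible = "".join(ch for ch in value if unicodedata.category(ch)[0] != "C")
    # Non-breaking spaces are ordinary-looking but distinct characters.
    return visible.replace("\u00a0", " ").strip()


def clean_records(records):
    """Apply a few of the cleaning tips to a list of dict rows."""
    seen = set()
    cleaned = []
    for row in records:
        country = clean_text(row["country"])
        # Controlled vocabulary: keep the cleaned value if it is not listed.
        country = COUNTRY_VOCAB.get(country.lower(), country)
        score = clean_text(row["score"])
        # Missing data: this sketch drops the row; imputing is the alternative.
        if score == "":
            continue
        # Duplicates: assumes identical (country, score) pairs are repeats.
        key = (country, score)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"country": country, "score": float(score)})
    return cleaned


raw = [
    {"country": "USA", "score": "3.5"},
    {"country": "United States\u200b", "score": "3.5"},  # hidden character
    {"country": "america ", "score": ""},                # missing score
    {"country": "Japan", "score": "4.0"},
]
result = clean_records(raw)
print(result)
```

In real data, duplicates are better detected with a unique record identifier than by comparing values, since two distinct observations can legitimately share the same values.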

Some of these points are also covered by EliteDataScience; see that site for a more comprehensive explanation.

Software Recommendations

OpenRefine FREE

Official Website: http://openrefine.org/#download_openrefine

User Guide: https://libjohn.github.io/openrefine/start.html

Step 3: Data Analysis

This is the most challenging but also most exciting part of the cycle. It can involve quantitative analysis, qualitative analysis, machine learning, etc.

This guide does not intend to cover basic statistics, which can easily be found on the Internet. (If you do not know which websites to use, you may start with Statistics How To.) Instead, we introduce commonly used software tools.

Software Recommendations

Quantitative Analysis Software

SPSS INSTALLED IN MLC

More often used by social scientists

Official Website: https://www.ibm.com/products/spss

User Guide: https://stats.idre.ucla.edu/spss/

Stata INSTALLED IN MLC

More often used by social scientists

Official Website: https://www.stata.com/

User Guide: https://stats.idre.ucla.edu/stata/

OriginPro INSTALLED IN MLC

More often used by scientists and engineers

Official Website: https://www.originlab.com/

User Guide: https://www.originlab.com/doc/User-Guide

Good Calculators: Mathematics Statistics and Analysis Calculators FREE

This website provides a variety of handy online calculators, including math, statistics, engineering, and conversion calculators

Official Website: https://goodcalculators.com/statistics-calculators/

Qualitative Analysis Software

NVivo INSTALLED IN MLC

For text mining and analysis

Official Website: https://lumivero.com/products/nvivo/

MaxQDA INSTALLED IN DAR OF MLC

For text mining and analysis

Official Website: http://www.maxqda.com

Programming Languages for Integrated Support from Data Preparation to Web Applications

The following two programming languages are quite powerful and can support many aspects of the data life cycle, including web crawling, statistics, data manipulation, machine learning, data visualization, web applications, etc.

Python FREE

Official Website: https://www.python.org/

User Guide: https://swcarpentry.github.io/python-novice-inflammation/

R FREE

Official Website: https://www.r-project.org/

User Guide: https://swcarpentry.github.io/r-novice-inflammation/

Comparison between Python and R: https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis
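To give a flavor of what a few lines of Python can do, the following minimal sketch summarizes a small set of numbers with the standard library's `statistics` module. The values are invented for the example, and real analyses typically rely on dedicated packages rather than this module alone.

```python
import statistics

# Made-up example values, e.g. test scores.
scores = [72, 85, 90, 66, 85, 78]

summary = {
    "n": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}

for name, value in summary.items():
    print(name, value)
```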

Library Services

Stay tuned for our semester-based Research Data Tools Series Workshops if you want to learn how to use these software tools. We also offer a limited number of course-embedded basic training sessions each year.

Step 4: Data Storage

This step involves both short-term measures, such as proper file version control during a research project, and long-term data archiving measures, such as migrating data to the best format and storing it on the most suitable medium for future use by you or your company. You may learn more about this through TechTarget.
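One simple safeguard that complements version control and archiving is to record a checksum for every data file, so that accidental changes or corrupted copies can be detected later. The sketch below is one possible approach using only the Python standard library; the function names and the file name `survey_v1.csv` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """List files that are missing or differ from the stored manifest."""
    current = build_manifest(folder)
    return sorted(n for n in manifest if current.get(n) != manifest[n])


# Demonstration in a temporary folder, so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "survey_v1.csv"
    data_file.write_text("id,score\n1,3.5\n")
    manifest = build_manifest(tmp)
    unchanged = changed_files(tmp, manifest)
    data_file.write_text("id,score\n1,9.9\n")  # simulate an unintended edit
    modified = changed_files(tmp, manifest)

print(unchanged)
print(modified)
```

In practice the manifest would be saved alongside the archived data (for example as a text file) and re-checked whenever the data is copied or migrated to a new medium.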

Tool Recommendations

Git FREE

A version control tool

Official Website: https://git-scm.com/

User Guide: https://swcarpentry.github.io/git-novice/

Step 5: Data Sharing

Data storage is mainly concerned with the internal use of data, whereas data sharing refers to open data that can be accessed and re-used by the public for free. Open data is not only a trend but also a responsibility that researchers are encouraged to meet for the benefit of academia and society. Some major publishers, e.g., Nature and Science, also ask authors to share their data.

Data can be shared in its original form (after removing private and sensitive information) through publicly accessible data repositories. Researchers can also choose to share their data through data visualizations or by developing interactive web applications.
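Removing private information before sharing can be partly automated. The following minimal sketch drops columns that identify a person directly and replaces the participant ID with a salted hash. The column names, the sample row, and the salt are hypothetical, and this step alone does not guarantee anonymity: combinations of remaining fields (such as age and location) may still identify someone and need to be reviewed manually.

```python
import hashlib

# Hypothetical column names that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def pseudonymize(value, salt):
    """Replace an identifier with a short salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def prepare_for_sharing(rows, salt):
    """Drop direct identifiers and pseudonymize the participant ID."""
    shared = []
    for row in rows:
        public = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        public["participant_id"] = pseudonymize(row["participant_id"], salt)
        shared.append(public)
    return shared


# Made-up example row.
rows = [
    {"participant_id": "S001", "name": "Chan Tai Man",
     "email": "tm@example.org", "age_group": "18-24", "score": 3.5},
]
# Keep the salt secret and never publish it with the data.
shared_rows = prepare_for_sharing(rows, salt="keep-this-secret")
print(shared_rows)
```

Using the same salt keeps the pseudonymous IDs consistent across files, so shared datasets can still be linked to each other without revealing the original IDs.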

Tool Recommendations

Data Repositories

There are many data repositories available online for sharing data sets; some are subject-based, material-type specific, or region specific. If you are new to this area, you may want to start with the following three platforms:

Figshare FREE

A multi-disciplinary repository for research data, managed by a commercial firm

Official Website: https://figshare.com/

Harvard Dataverse FREE

A multi-disciplinary repository for research data, managed by Harvard University

Official Website: https://dataverse.harvard.edu/

GitHub FREE

Mainly for sharing code

Official Website: https://github.com/

You can also develop your own data management / sharing systems using open source data platforms:

CKAN FREE

Both the US and Hong Kong governments use this open source platform to share government data

Official Website: https://ckan.org/

Data Visualization Software

Tableau INSTALLED IN MLC

Official Website: https://www.tableau.com/

User Guide: https://data-flair.training/blogs/category/tableau/

Gephi FREE

Official Website: https://gephi.org/

User Guide: https://medium.com/@Luca/guide-analyzing-twitter-networks-with-gephi-0-9-1-2e0220d9097d

Flourish FREE (partially)

Official Website: https://flourish.studio/

User Guide: https://flourish.studio/developers/tutorial/

Library Services

The Library's Digital Initiatives and Research Cluster has a team of project managers, programmers and project assistants ready to support faculty members, through our Digital Scholarship Services, in developing interactive web applications for public access. We offer a Digital Scholarship Grant as well as a non-grant application track. Contact us at libms@hkbu.edu.hk to discuss potential ideas and make good use of your data!

Step 6: Re-use of Data

There are many free and subscription-based data resources available for researchers to re-use. We have prepared another library guide for data resources; please visit the guide on Sources for Data-Mining.

Top Analytics Software 2016-18

(developed by KDnuggets)

Useful Online Learning Resources for Data Science

DataCamp
https://www.datacamp.com/
Coursera
https://www.coursera.org/browse/data-science
edX
https://www.edx.org/course?subject=Data%20Analysis%20%26%20Statistics
Codecademy
https://www.codecademy.com/

Data Software Training Videos

The Library has collaborated with the Apps Resource Centre to develop a series of Python and SPSS online training videos. These videos are specifically designed for local students who have no prior knowledge of programming or statistics.

https://digital.lib.hkbu.edu.hk/digital/RDS_training.php

Turn Your Data into Digital Scholarship Projects

Since 2015, HKBU Library has been working closely with many faculty members to present and visualize research data in the form of digital scholarship projects. The Library now boasts a portfolio of 40+ Digital Scholarship Projects from across different disciplines, sharing valuable scholarly sources that benefit and impact academia and beyond.

Watch these videos on why and how HKBU researchers share data.