Data sources

Free statistical analysis applications

Many projects, studies, and analyses require applications that provide statistical analysis tools. This page aims to give students a list of statistical analysis applications that can be accessed free of charge:

PSPP - https://www.gnu.org/software/pspp/
SAS - https://www.sas.com/en_us/software/on-demand-for-academics.html?utm_source=PredictiveAnalyticsToday&utm_medium=Review&utm_campaign=PAT#8802f278-270c-44c8-b9c6-815ead010daf
The Statistical Lab - http://www.statistiklabor.de/en/
Develve - https://develve.net/
Datamelt - https://datamelt.org/
Octave - http://www.gnu.org/software/octave/
SofaStatistics - http://www.sofastatistics.com/home.php
Dataplot - https://www.itl.nist.gov/div898/software/dataplot/homepage.htm
ZeligProject - https://zeligproject.org/
SalStat - https://www.salstat.com/
RStudio - https://www.rstudio.com/
Timi - https://timi.eu/
JASP - https://jasp-stats.org/
SciLab - https://www.scilab.org/software/scilab/statistics
Jamovi - https://www.jamovi.org/

Data

Many projects, studies, and analyses require consistent, correct, and relevant data on a variety of subjects. This page aims to give students the richest and most important data sources currently available, all of which can be accessed free of charge:

Database of the National Institute of Statistics of Romania (INS): http://statistici.insse.ro:8077/tempo-online/#/pages/tables/insse-table
Consumer price index (inflation rate) calculator from INS: http://statistici.insse.ro/shop/?page=ipc1&lang=ro
Google Dataset Search (!!!): https://datasetsearch.research.google.com
Google Public Data Explorer: https://www.google.com/publicdata/directory
Data made publicly available by the government of Romania: http://data.gov.ro/dataset
Database of the statistical office of the European Union (Eurostat): https://ec.europa.eu/eurostat/data/database
European Open Data Portal: https://data.europa.eu/euodp/en/data
Data made publicly available by the government of the USA: https://www.data.gov/
World Health Organization: https://www.who.int/gho/database/en/
World Bank Open Data: https://data.worldbank.org/
Federal Reserve Economic Data (FRED): https://fred.stlouisfed.org/
Database of the Organisation for Economic Co-operation and Development (OECD): https://data.oecd.org/
The Dataverse Project (multiple servers; select the desired server on the map): https://dataverse.org/
Global Open Data Index: https://index.okfn.org/dataset/
US Census Bureau: https://www.census.gov/data.html
UNICEF: https://data.unicef.org/
Data made publicly available by the government of the UK: https://data.gov.uk/
Data made publicly available by the government of France: https://www.data.gouv.fr/en/
Data made publicly available by the government of Australia: https://data.gov.au/
Data made publicly available by the government of India: https://data.gov.in/
Open Food Data Datacentral: https://food.schoolofdata.ch/
Interactive map with more than 2,600 data sources: https://opendatainception.io/
Microsoft Research Open Data: https://msropendata.com/
GitHub, Awesome Public Datasets: https://github.com/awesomedata/awesome-public-datasets
Data Is Plural: https://tinyletter.com/data-is-plural
data.world: https://data.world/datasets/open-data
MakeoverMonday: https://www.makeovermonday.co.uk/data/
Reddit datasets: https://www.reddit.com/r/datasets/
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
Registry of Open Data on AWS: https://registry.opendata.aws/
Quandl.com: https://www.quandl.com/
FiveThirtyEight: https://data.fivethirtyeight.com/
Yelp Dataset: https://www.yelp.com/dataset
Kaggle: https://www.kaggle.com/datasets
Pew Research Center: https://www.pewresearch.org/global/datasets/
PapersWithCode: https://paperswithcode.com/
KDnuggets: https://www.kdnuggets.com/datasets/index.html
Datagious: https://datagious.com/datasets/
Academic Torrents: https://academictorrents.com/browse.php?cat=6
Apert.io: https://apert.io/
HuggingFace: https://huggingface.co/datasets
Open Data for Africa: http://dataportal.opendataforafrica.org/
freeCodeCamp OpenData: https://github.com/freeCodeCamp/open-data
List of geospatial database servers: https://mappingsupport.com/p/surf_gis/list-federal-state-county-city-GIS-servers.pdf
Data from the General Inspectorate for Emergency Situations (IGSU): http://date-igsu.opendata.arcgis.com/
Agency for Financing Rural Investments (AFIR): http://opendata.afir.info/
Călărași City Hall: http://www.primariacalarasi.ro/index.php/home/open-data
Open Data Alba Iulia: http://3.121.0.153/#/
UCI Machine Learning Repository: The Machine Learning Repository at UCI provides an up to date resource for open-source datasets: http://mlr.cs.umass.edu/ml/
VisualData: Discover computer vision datasets by category; it allows searchable queries: https://www.visualdata.io/
CMU Libraries: Discover high-quality datasets thanks to the collection of Huajin Wang, at CMU: https://guides.library.cmu.edu/machine-learning/datasets
Boston Housing Dataset: Contains information collected by the US Census Service concerning housing in the area of Boston Mass. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
Google-Landmarks-v2: An improved dataset for landmark recognition and retrieval. This dataset contains 5M+ images of 200k+ landmarks from across the world, sourced and annotated by the Wiki Commons community: https://www.kaggle.com/xiuchengwang/python-dataset-download
Mall Customers Dataset: The Mall customers dataset contains information about people visiting the mall in a particular city. The dataset consists of various columns like gender, customer id, age, annual income, and spending score. It’s generally used to segment customers based on their age, income, and interest: https://www.kaggle.com/shwetabh123/mall-customers
Iris Dataset: The iris dataset is a simple and beginner-friendly dataset that contains the petal and sepal length and width of iris flowers. The data is divided into three classes, with 50 rows in each class. It’s generally used for classification and regression modeling: https://archive.ics.uci.edu/ml/datasets/Iris
MNIST Dataset: This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is a perfect dataset to start implementing image classification where you can classify a digit from 0 to 9: http://yann.lecun.com/exdb/mnist/
Fake News Detection Dataset: A CSV file with 7796 rows and four columns: news, title, news text, result: https://www.kaggle.com/c/fake-news/data
Wine quality dataset: The dataset contains different chemical information about the wine. The dataset is suitable for classification and regression tasks: https://archive.ics.uci.edu/ml/datasets/wine+quality
SOCR data - Heights and Weights Dataset: This is a basic dataset for beginners. It contains only the heights and weights of 25,000 different humans of 18 years of age. This dataset can be used to build a model that can predict the height or weight of a human: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights
Titanic Dataset: The dataset contains information like name, age, sex, number of siblings aboard, and other information about 891 passengers in the training set and 418 passengers in the testing set: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html
Credit Card Fraud Detection Dataset: The dataset contains transactions made by credit cards; they are labeled as fraudulent or genuine. This is important for companies that have transaction systems to build a model for detecting fraudulent activities: https://www.kaggle.com/mlg-ulb/creditcardfraud
xView: xView is one of the most massive publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes: http://xviewdataset.org/#dataset
ImageNet: The largest image dataset for computer vision. It provides an accessible image database that is organized hierarchically, according to WordNet: http://image-net.org/
Kinetics-700: A large-scale dataset of YouTube video URLs covering human-centered actions. It contains over 700,000 videos: https://deepmind.com/research/open-source/open-source-datasets/kinetics/
Google’s Open Images: A vast dataset from Google AI containing over 10 million images: https://research.googleblog.com/2016/09/introducing-open-images-dataset.html
Cityscapes Dataset: This is an open-source dataset for Computer Vision projects. It contains high-quality pixel-level annotations of video sequences taken in 50 different city streets. The dataset is useful in semantic segmentation and training deep neural networks to understand the urban scene: https://www.cityscapes-dataset.com/
IMDB-Wiki dataset: The IMDB-Wiki dataset is one of the most extensive open-source datasets of face images with labeled gender and age. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
Color Detection Dataset: The dataset contains a CSV file that has 865 color names with their corresponding RGB(red, green, and blue) values of the color. It also has the hexadecimal value of the color: https://github.com/codebrainz/color-names/blob/master/output/colors.csv
Stanford Dogs Dataset: It contains 20,580 images and 120 different dog breed categories: http://vision.stanford.edu/aditya86/ImageNetDogs/
Lexicoder Sentiment Dictionary: This dataset is specific for sentiment analysis. The dataset contains over 3000 negative words and over 2000 positive sentiment words: http://www.lexicoder.com/
IMDB reviews: An interesting dataset with over 50,000 movie reviews from Kaggle: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Stanford Sentiment Treebank: Standard sentiment dataset with sentiment annotations: http://nlp.stanford.edu/sentiment/code.html
Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets: https://www.kaggle.com/crowdflower/twitter-airline-sentiment
HotpotQA Dataset: Question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems: https://hotpotqa.github.io/
Amazon Reviews: A vast dataset from Amazon, containing roughly 35 million Amazon reviews: https://snap.stanford.edu/data/web-Amazon.html
Rotten Tomatoes Reviews: Archive of more than 480,000 critic reviews (fresh or rotten): https://drive.google.com/file/d/1w1TsJB-gmIkZ28d1j7sf1sqcPmHXw352/view
SMS Spam Collection in English: A dataset that consists of 5,574 English SMS messages, each labeled as spam or legitimate (ham): http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Enron Email Dataset: It contains around 0.5 million emails of over 150 users: https://www.cs.cmu.edu/~enron/
Recommender Systems Dataset: It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, and others that are used in building a recommender system: https://cseweb.ucsd.edu/~jmcauley/datasets.html
UCI Spambase Dataset: Classifying emails as spam or non-spam is a prevalent and useful task. The dataset contains 4601 emails, each described by 57 attributes. You can build models to filter out the spam: https://archive.ics.uci.edu/ml/datasets/Spambase
IMDB reviews: The large movie review dataset consists of movie reviews from the IMDB website, with 25,000 reviews for training and 25,000 for testing: http://ai.stanford.edu/~amaas/data/sentiment/
Waymo Open Dataset: This is a fantastic dataset resource from the folks at Waymo. Includes a vast dataset of autonomous driving, enough to train deep nets from zero: https://waymo.com/open/
Berkeley DeepDrive BDD100k: One of the largest datasets for self-driving cars, containing over 2000 hours of driving experiences across New York and California: http://bdd-data.berkeley.edu/
Bosch Small Traffic Light Dataset: Dataset for small traffic lights for deep learning: https://hci.iwr.uni-heidelberg.de/node/6132
LaRa Traffic Light Recognition: Another dataset for traffic lights. This dataset is gathered from Paris: http://www.lara.prd.fr/benchmarks/trafficlightsrecognition
WPI datasets: Datasets for traffic lights, pedestrian, and lane detection: http://computing.wpi.edu/dataset.html
Comma.ai: It contains details such as a car’s speed, acceleration, steering angle, and GPS coordinates: https://archive.org/details/comma-dataset
MIT AGE Lab: A sample of the 1,000+ hours of multi-sensor driving datasets collected at AgeLab: http://lexfridman.com/automated-synchronization-of-driving-data-video-audio-telemetry-accelerometer/
LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets: This dataset includes traffic signs, vehicle detection, traffic lights, and trajectory patterns: http://cvrr.ucsd.edu/LISA/datasets.html
Cityscapes Dataset: This is an extensive dataset that has street scenes in 50 different cities: https://www.cityscapes-dataset.com/
COVID-19 Dataset: The Allen Institute for AI has released a vast research dataset of over 45,000 scholarly articles about COVID-19: https://www.semanticscholar.org/cord19
MIMIC-III: Openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more: https://mimic.physionet.org/
MovieLens: It contains rating data sets from the MovieLens web site: https://grouplens.org/datasets/movielens/
Jester: It contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users. It’s mostly used for collaborative filtering: http://www.ieor.berkeley.edu/~goldberg/jester-data/
Million Song Dataset: It can be used for both collaborative and content-based filtering: https://www.kaggle.com/c/msdchallenge#description
Common Crawl: https://commoncrawl.org/latest-crawl
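
Several entries in the list above state a dataset's expected shape (for example, three classes with 50 rows each for Iris). After downloading a CSV file it can be worth checking it against such a description. The following is only a minimal sketch: the column names and the handful of sample rows are illustrative assumptions that stand in for a real downloaded file, and `summarize` is a hypothetical helper, not part of any of the listed sites.

```python
import csv
import io
from collections import Counter

# Illustrative stand-in for a downloaded CSV file. The header names and the
# six rows are assumptions for this sketch, not an official file layout.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""


def summarize(csv_text, label_column):
    """Return (row count, column names, rows per class) for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Count how many rows belong to each value of the label column.
    class_counts = Counter(row[label_column] for row in rows)
    return len(rows), reader.fieldnames, class_counts


n_rows, columns, class_counts = summarize(SAMPLE_CSV, "species")
print(n_rows, "rows,", len(columns), "columns")
for label, count in sorted(class_counts.items()):
    print(label, count)
```

For a real file, the text passed to `summarize` would come from reading the downloaded CSV, and the label column name would have to match that file's actual header.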

About data

Having access to data is not enough. It is also important to know what data is and how it can be obtained, processed, and used. A few resources on these topics:

European Data Portal e-learning programme: https://www.europeandataportal.eu/elearning/en
Fundamental principles of open government data: https://opengovdata.org/
The Open Data Handbook: http://opendatahandbook.org/
The Open Data Model: https://opendatamodel.com/home/
A rich collection of Open Science courses: https://www.fosteropenscience.eu/

Technology for open data

The Open Data Kit: https://opendatakit.org/

Open data platforms:

CKAN: https://ckan.org/
DKAN: https://getdkan.org/
Junar: http://www.junar.com/
Open Data Soft: https://www.opendatasoft.com/
Semantic MediaWiki: https://www.semantic-mediawiki.org/wiki/Semantic_MediaWiki
Plenar.io: http://plenar.io/

GIS platforms:

ArcGIS: https://hub.arcgis.com/pages/open-data
GeoNode: http://geonode.org/

Resources and tutorials for statistical analysis

This is How Easy It Is to Lie With Statistics - https://www.youtube.com/watch?v=bVG2OQp6jEQ