Where do we draw the line between Small Data and Big Data?
It is easier to start out with small data.
No one who 'gets' databases blinks an eye at 100 GB. But gigabytes are an abstract concept, especially for some non-technical folk. To put it in perspective, a page or screenful of text is on the order of a kilobyte of data. The Bible, a fairly weighty tome, is less than 5 MB of text[1] (without any form of compression). Put 20 of them on a bookshelf and you've got 100 MB. Have ten such shelves in a bookcase and you have a single gigabyte. So a basic 100 GB is a small library full of text. Add another order of magnitude, and a thousand gigabytes is a terabyte. That's a pretty large library[2].
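The bookshelf arithmetic can be sketched in a few lines. This is a rough back-of-envelope calculation using the round figures from the text (a Bible-sized book at roughly 5 MB, 20 books to a shelf, ten shelves to a bookcase):

```python
# Back-of-envelope sketch of the bookshelf arithmetic.
# Assumed round figures: ~5 MB per uncompressed Bible-sized book.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

book = 5 * MB          # one uncompressed Bible-sized text
shelf = 20 * book      # 20 books on a shelf -> ~100 MB
bookcase = 10 * shelf  # ten shelves -> roughly a gigabyte

print(bookcase / GB)            # just under 1 GB per bookcase
print(100 * GB / bookcase)      # ~100 bookcases in 100 GB
```

So 100 GB really is a small library: on the order of a hundred bookcases of plain text.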
Where might all that volume come from?
A few dozen people all tapping away at data entry screens will take a long while to create or record that much data. Generally you might talk about a person generating a page a minute. Data entry staff can work faster, but someone in a call centre or on a service desk is in discussion with the customer, which is a bottleneck. Let's be generous and say 500 pages per person per day. That is on the order of 100 MB a year. With a hundred staff, that is about 10 GB a year. Pretty trivial stuff.
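The same estimate, worked through with the figures stated above (a 1 KB page, 500 pages per person per day, and an assumed ~200 working days a year):

```python
# Rough sketch of the manual data entry estimate.
# Assumptions: ~1 KB per page, 500 pages/person/day,
# and roughly 200 working days in a year.
KB = 1024
page = 1 * KB
pages_per_day = 500
working_days = 200
staff = 100

per_person_year = page * pages_per_day * working_days  # ~100 MB
team_year = per_person_year * staff                    # ~10 GB

print(per_person_year / 1024**2)  # roughly 100 MB per person per year
print(team_year / 1024**3)        # roughly 10 GB for 100 staff
```

Even a generous estimate of manual entry lands in single-digit gigabytes per year for a hundred staff, which is nowhere near Big Data territory.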
So if your data is being entered manually by your own staff, you don't have anything to worry about. If members of the public are creating your data, you get into the "Many hands make light work" scenario. That's what fuels the data volumes of Google, Facebook and Twitter.
A status update or tweet might be 100 bytes, so we'll put ten of them to a page. An individual may generate a page a day (admittedly a lot more for heavy users, but equally a lot less for casual users), or roughly a book a year. 20,000 people (enough for a good crowd at an A-League soccer game[3]) would generate that 100 GB. If you've got the whole population of Australia generating data for you, that's 100 terabytes a year. Facebook's Australian user base is only about a third of the population though[4]. Realistically, a homegrown Australian social site wouldn't hit a terabyte.
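The crowd-sourced arithmetic can be sketched the same way. Note the text rounds a year's worth of updates up to a Bible-sized "book" of roughly 5 MB; the crowd size and 2011 Australian population figure below are the same round numbers used above:

```python
# Sketch of the crowd-sourced volume estimate.
# Assumptions: ~100 bytes per update, ten updates to a page,
# a page per person per day, and (as in the text) rounding a
# year's output up to a Bible-sized ~5 MB "book".
MB = 1024 * 1024
update = 100                  # bytes per tweet or status update
page = 10 * update            # ~1 KB
book_per_year = 5 * MB        # a person's rounded-up annual output

crowd = 20_000                # an A-League-sized crowd
australia = 22_000_000        # rough 2011 population

print(crowd * book_per_year / 1024**3)      # roughly 100 GB a year
print(australia * book_per_year / 1024**4)  # roughly 100 TB a year
```

The difference between a stadium crowd and a whole country is three orders of magnitude, which is exactly the gap between an ordinary database and Big Data.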
Rise of the Machines
Most Big Data is actually fed from measurements taken by machines. You make a phone call and data will be gathered about what time you called, who you called and how long you spoke. But data will also be gathered as you travel between cells, whether you are connected or not. Telecoms was the original big data.
It isn't just human activity being monitored either. Environmental measurements are recorded as well, such as weather readings or the water pressure in a pipe. Utilities are a Big Data industry too.
An interesting feature of Big Data is that it is pretty much insert and query only. It isn't the sort of data you update, and you are only likely to delete it in large chunks, purging outdated information. The queries are also likely to extract summarized information rather than details. This is why Oracle Exadata and other 'divide and conquer' mechanisms work so well on it.
So it is actually pretty easy to determine whether you are in a 'Big Data' position, just by looking at your *automated* data feeds. If you are dealing with manual data entry, it's not something you'll have to worry about unless you have millions of people at their keyboards every day.
Importantly, what this means is that while you may need to cope with Big Data in your databases, it isn't going to be an issue for your applications.
[1] Project Gutenberg King James Bible
[2] How Many Bytes
[3] A-League Attendance 2010/2011
[4] Australians on Facebook
This article was written by Gary Myers, and is filed under "Oracle Database Development - A View from Sydney". It covers big data, scalability and machine generated data.