A School Project
SQL
PostgreSQL 9.6, Diagram.net
For a full view of the documentation, please visit the documentation section, here.
This report discusses aspects from a Database Management Project from the University Of Michigan Dearborn class ‘s CIS 421. In this report, I have created a database from the Flag Dataset from UCI Machine Learning Repository using PostgreSQL 9.6. This dataset contains information the year 1990 about country names, flag characteristics, language, religion, land size and population size.
The focus of this project is to be used as a research tool to find how characteristics in a flag relates to other attributes such as language, region, land size, and population size. Ideally, this tool would be helpful for sociologists, researchers or students who want to try to find correlations between flags and the people living in those countries. It is important to note that these queries are used to help a sociologist or student make decision and support their research or ideas from this data. This tool should be used to support already pre-existing information.
The data within this dataset has some interesting flag facts about 195 countries in the world. These facts pertain to the type of symbol on the flag, the number of colors on the flag, and which colors are on the flag. These attributes can be relevant to social factors on where the country is located and what the country represents. The queries in this project can be used to identify some of these factors.
This section depicts ER-diagrams of the schema of the dataset. These images are used to help visualize the relationships between each table.
This diagram demonstrates how each table are connected to get other. In this diagram, we can see that cid is that candidate key that identifies the information pertaining to a country.
Here is a close-up view on the country facts diagram.
In this section, the relational schema for the project will be discussed.
A simplified relational schema of this database is shown below.
countryID(cid, landmass_id, zone_id, lang_id, religion_id) landmass(landmass_id, landmass) zone(zone_id, zone) language(lang_id, language) religion(religion_id, religion) countryFacts(cid, name, area, population) flagColors(cid, colors, red, green, blue, gold, white, black, orange, mainhue, topleft, botright)
flagSymbols(cid, bars, stripes, circles ,crosses, saltires, quarters, sunstars, crescent, triangle, icon, animate, text)
In this section, each table will be shown and described below. The database schema is constructed in second normal form. This helps ensure that database schema avoids update, insertion, and deletion anomalies. Furthermore, the amount of redundant data has been reduced.
It is important to note that many of the attributes in this table set are numeric values that represents areas, languages, religions, and zones. Additional tables have been created to give these numeric values names.
This table is uniquely identified by the primary key “cid”.
create table countryID( cid serial primary key, landmass_id int not null, zone_id int not null, religion_id int not null, lang_id int not null
);
This table has the primary key “landmass_id” that uniquely identifies the name of the continental region that a country belongs to. The landmass_id key does not reference to the countryID table. The use of this table is to identify the landmass. For example, 1 = N.America, 2 = S.America, 3 = Europe, 4 = Africa, 4 = Asia, and 6 = Oceania.
create table landmass( landmass_id int primary key, landmass varchar(11)
);
The table “Zone” identifies cardinal directions such as Northeast (NE) or Southwest (SW) by the key zone_id. Like the table above, this table to describe what zone_id represents.
create table zone( zone_id int primary key, zone varchar(2)
);
This table has the a unique key for language (English, Spanish, etc).
create table language( lang_id int primary key, language varchar(33)
);
Names of religions are identified by the primary key called “religion_id”.
create table religion( religion_id int primary key, religion varchar(22)
);
This table has a foreign key of “cid”, that references to the tables name, area, and population of a country. The attributes area relates to number of how large or small the country is. Moreover, the attribute population is a numeric value that is rounded in millions.
create table countryFacts( cid serial references countryID(cid), name varchar(30) not null, area int not null, population int not null
);
The colors on each of the country’s flags are identified the foreign key “cid” from the country’s table. create table flagColors( cid serial references countryID(cid), colours int, red boolean, green boolean, blue boolean, gold boolean, white boolean, black boolean, orange boolean, mainhue varchar(11), topleft varchar(11), botright varchar(11)
);
The symbols on a flag is identified by the foreign key “cid” from the table country.
create table flagSymbols( cid serial references countryID(cid), bars int, stripes int, circles int, crosses int, saltires int, quarters int , sunstars int, crescent boolean, triangle boolean , icon boolean, animate boolean, text boolean
);
To insert the data from the original data file into the database, a temporary table was created to do so. The original table consisted of Boolean, nominal, and numeric values. This table was not included in the relational schema due to it only being needed to insert the data directly from the data file.
create table flag_data ( name varchar(30) NOT NULL, landmass int NOT NULL, zone int NOT NULL, area int NOT NULL, population int NOT NULL, language int NOT NULL, religion int NOT NULL,
bars int, stripes int, colours int, red boolean, green boolean, blue boolean, gold boolean, white boolean, black boolean, orange boolean,
mainhue varchar(11),
circles int, crosses int, saltires int, quarters int , sunstars int, crescent boolean, triangle boolean , icon boolean, animate boolean, text boolean, topleft varchar(11), botright varchar(11)
);
The functional dependencies of this relational schema is shown below. This schema is in Second Normal Form. The candidate key is “cid”.
cid -> zone_id,lang_id,religion_id,landmass_id
landmass_id -> landmass lang_id -> language zone_id -> zone religion_id -> religion
cid -> colours,red,green,blue,gold,white,black,orange,mainhue,topleft,botright
cid -> bars,stripes,circles,crosses,saltires,quarters,sunstars,crescent,triangle,icon,animate,text cid -> name,area,population
This section shows and explains the results returned the queries. It is important to note that additional queries will be created to better demonstrate the correlate between country and flag.
The following sample queries were specified as a deliverable in the project proposal.
Description:
Crosses are a common symbol found on flags. Generally, crosses are related to Christianity. In this query, a sociologist wants to find how many Christian countries have crosses on their flags in comparison to non-Christian countries with crosses on their flags. This query helps established if a country represents its religion on a flag.
The query below shows that there are 27 countries with crosses on their flags. Out of the 27 countries, 26 of the countries are Christian. Therefore, the sociologist might note that if a country’s flag has a cross on it, it has a 93% chance of being a Christian country.
Query:
select CountryWithCrosses, ChristianCrosses, notChristianCrosses from (select count(*) from countryID natural join (select * from flagSymbols where crosses > 0 or saltires > 0) as crosses where religion_id = 1 or religion_id = 0) as ChristianCrosses, (select count(*) from countryID natural join (select * from flagSymbols where crosses > 0 or saltires > 0) as crosses where religion_id != 0 and religion_id != 1) as notChristianCrosses, (select count(*) as CountryWithCrosses from countryID natural join flagSymbols where crosses > 0 or saltires > 0) as CountryWithCrosses;
Result:
Description:
As the world’s second most popular religion, many countries might represent Islam on their flag. Therefore, a crescent moon and star might appear on their flags. In this query, a sociologist might want to discover how many countries with a crescent moon and star on their flag practice Islam.
The query below shows that there are 10 countries with both a crescent and a star on their flag. Out of the 10 countries, 7 of the countries are Islamic. Therefore, the sociologist might note there is only a 70 % chance that a country with both a crescent and a star on its flag main faith is Islamic.
Query:
select CountryWithCrescentandStar, IslamicCountry, notIslamicCountry from (select count(*) from countryID natural join (select * from flagSymbols where crescent = true and sunstars > 0) as crescents where religion_id = 2) as IslamicCountry, (select count(*) from countryID natural join (select * from flagSymbols where crescent = true and sunstars > 0) as crescents where religion_id != 2) as notIslamicCountry, (select count(*) from countryID natural join flagSymbols where crescent = true and sunstars > 0) as CountryWithCrescentandStar;
Result:
Description: The Christian flag contains the color white in it to represent purity and light. A sociologist wants to know the names of Christian countries who might have potentially incorporated white in their flags to represent that same meaning.
The query determined that 81 Christian Countries included white in their flags.
Query:
select name, religion, white as whiteOnFlag from countryID natural join countryFacts natural join religion natural join FlagColors where white = true and religion_id = 0 or white = true and religion_id = 1;
Result:
Description: A sociologist might want to know how many English-speaking countries have red, white, and blue om their flags.
Query:
select name, language, red as redOnFlag, white as whiteOnFlag, blue as blueonFlag from countryID natural join countryFacts natural join language natural join FlagColors where white = true and red = true and blue = true and lang_id = 1;
Results:
Description: A sociologist wants to find out if languages and colors are related. This query help discover which language is the most associated with a red or partially red flag.
From the results, the sociologist can determine that English-speaking countries are the most associated with red or partially red flags. This is because the “others” is several minor languages grouped into one.
Query:
select lang_ID, language, red, count(lang_ID) as Amount_Of_Languages_With_Red from countryID natural join (select cid, lang_id, language, red from countryID natural join language natural join FlagColors where red = true) as sub group by lang_ID, language, red order by Amount_Of_Languages_With_Red;
Result:
This section discusses some indexing strategies that helped speedup the execution of queries.
The current/updated structure of the schema is in Second Normal Form and Dense Indexing. Second Normal Form helps ensure data is consistent and reduces redundancy. Dense indexing helps make searching faster.
We know that the schema is 2NF since it has no partial dependency as seen by the functional dependencies below. Moreover, we can note that every search key value in the data file has an index record.
Updated/New Functional Dependencies:
cid -> zone_id,lang_id,religion_id,landmass_id
landmass_id -> landmass lang_id -> language zone_id -> zone religion_id -> religion
cid -> colours,red,green,blue,gold,white,black,orange,mainhue,topleft,botright
cid -> bars,stripes,circles,crosses,saltires,quarters,sunstars,crescent,triangle,icon,animate,text cid -> name,area,population
Updated/New Relational Schema:
countryID(cid, landmass_id, zone_id, lang_id, religion_id) landmass(landmass_id, landmass) zone(zone_id, zone) language(lang_id, language)
religion(religion_id, religion) countryFacts(cid, name, area, population) flagColors(cid, colors, red, green, blue, gold, white, black, orange, mainhue, topleft, botright) flagSymbols(cid, bars, stripes, circles ,crosses, saltires, quarters, sunstars, crescent, triangle, icon, animate, text)
These updates where made to attempt to make searching faster.
To show the speed differences, it is important to show the original relational schema.
The old schema is show below.
Old Schema:
country(cid, name) landmass(landmass_id, landmass) zone(zone_id, zone) language(lang_id, language) religion(religion_id, religion) countrySize(cid, landmass_id, zone_id, area) countrySociety(cid, religion_id, lang_id, population) flagColors(cid, colors, red, green, blue, gold, white, black, orange, mainhue, topleft, botright) flagSymbols(cid, bars, stripes, circles ,crosses, saltires, quarters, sunstars, crescent, triangle, icon, animate, text)
The new structure is only slightly faster than the original structure. Here are a few of the queries both of the new structure and old structure had in common.
The new structure is slightly faster than the old structure for these queries.
New Schema
Query: select name, landmass, population, icon from countryID natural join countryFacts natural join landmass natural join FlagSymbols where landmass = 'S.America' and population <= 10 and icon = true;
Planning Time: 0.698 ms Executing Time: 0.330 ms
Image:
Old Schema
Query: select name, landmass, population, icon from flagSymbols natural join (select * from countrySociety natural join (select * from landmass natural join countrySize natural join country where landmass= 'S.America') as seCon where population <= 10) as Southeast where icon = true;
Planning Time: 0.679 ms Executing Time: 0.261 ms Image:
This report provides information regarding the County and Flags database project. The key elements is to collect and understand data from a data set. Moreover, the purpose is to use ER diagrams, create a relational database, and design queries. Lastly, the objective is to use PostgreSQL 9.6 to create and original a dataset into a database. All these goals have been successful accomplished.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA:
University of California, School of Information and Computer Science.
UCI machine Learning Repository: Flags data set. (n.d.). Retrieved March 26, 2021, from http://archive.ics.uci.edu/ml/datasets/Flags