Chapter 2 Data sources

Our data come from the World Bank's DataBank, which is publicly accessible at https://databank.worldbank.org/source/jobs# and https://databank.worldbank.org/source/world-development-indicators#. We draw on two public databases provided by the World Bank: Jobs and World Development Indicators. Anyone can follow the links and customize the data they want in the interactive selection area. For our final project, we will use both databases, selecting a mix of developing and developed countries, including Japan, China, the United States, Brazil, India, Egypt, the UK, and South Africa, together with all the variables/series the databases provide. We will restrict the time range to the most recent 4-5 years. We will set the aggregation rule to the average of the data available for each time period, because we want to compare overall conditions across countries.
Some basic information about the dataset:
The Jobs database provides 166 series/variables that we can use. So far, we have selected about 8 countries plus world-level data for the most recent 5 years available, which yields over 2500*5 records. Working from this comprehensive dataset, we will select the variables that help answer our research questions and avoid those with many missing values. The data types include categorical variables such as country, time series data, some variables with factor levels, and numerical variables.
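One way to screen series for missing values is to compute, for each series, the share of country-year cells that are empty. The sketch below is only an illustration under stated assumptions: the column names (`Country Name`, `Series Name`, one column per year), the `..` missing-value marker, the 50% cutoff, and the helper names `missing_share_by_series` and `usable_series` are all hypothetical, and the numbers in the sample are made up rather than taken from the World Bank.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for a downloaded CSV export. All values are made up;
# the column layout and the ".." missing marker are assumptions.
SAMPLE_CSV = """Country Name,Series Name,2016,2017,2018
Japan,Literacy rate,..,..,..
Brazil,Literacy rate,..,..,..
Japan,Primary completion rate,..,1.0,..
Brazil,Primary completion rate,2.0,..,3.0
Japan,Unemployment rate,4.0,5.0,6.0
Brazil,Unemployment rate,7.0,8.0,9.0
"""

MISSING_MARKERS = {"", ".."}  # assumed encodings of a missing cell


def missing_share_by_series(csv_text, id_columns=("Country Name", "Series Name")):
    """Return {series name: fraction of country-year cells that are missing}."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        series = row["Series Name"]
        for column, value in row.items():
            if column in id_columns:
                continue  # skip identifier columns, count only year columns
            total[series] += 1
            if value.strip() in MISSING_MARKERS:
                missing[series] += 1
    return {series: missing[series] / total[series] for series in total}


def usable_series(shares, max_missing=0.5):
    """Keep series whose missing share does not exceed the (arbitrary) cutoff."""
    return sorted(name for name, share in shares.items() if share <= max_missing)


shares = missing_share_by_series(SAMPLE_CSV)
usable = usable_series(shares)
```

In this toy sample the literacy rate is missing everywhere and is dropped, while the other two series pass the cutoff. The cutoff itself is a judgment call that would depend on the research question.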
Some variables, such as the literacy rate, are missing for every selected country in every year, while others, such as the primary completion rate, are missing for several years. We will analyze the missing data further to decide which variables are usable. In addition, the dataset is rather raw and will need to be reorganized because the series are broken down in great detail. For instance, many of the series provided are really levels of the same variable. We will need to reorganize the data into a more usable form.
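The idea that several series are really levels of one variable can be sketched as follows. This is a minimal illustration, not the procedure we have settled on: the series names, the regular expression, and the helpers `split_series` and `to_long` are hypothetical, and the record values are placeholders. Each record is rewritten so that the shared variable and its level sit in separate fields.

```python
import re

# Assumed naming pattern: "<Variable> in <level> ...". Real series labels
# may follow other patterns and would need their own rules.
LEVEL_PATTERN = re.compile(r"^(?P<variable>Employment) in (?P<level>\w+)\b")


def split_series(name):
    """Return (variable, level); level is None when no level is encoded."""
    match = LEVEL_PATTERN.match(name)
    if match is None:
        return name, None
    return match.group("variable"), match.group("level")


def to_long(records):
    """Rewrite records so the variable and its level are separate fields."""
    tidy = []
    for record in records:
        variable, level = split_series(record["series"])
        tidy.append({
            "country": record["country"],
            "year": record["year"],
            "variable": variable,
            "level": level,
            "value": record["value"],
        })
    return tidy


# Placeholder records; the values are made up for illustration.
records = [
    {"country": "Japan", "year": 2018, "series": "Employment in agriculture", "value": 1.0},
    {"country": "Japan", "year": 2018, "series": "Employment in services", "value": 2.0},
    {"country": "Japan", "year": 2018, "series": "GDP per capita", "value": 3.0},
]
tidy = to_long(records)
```

After this step, the two employment series share one `variable` value and differ only in `level`, which makes them easy to plot or compare as categories of a single factor.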