The CovidCases analysis class version 6.1
The basic concept of the class is to provide a pandas data frame with a time series of attributes such as the daily infections for a given list of countries. The GeoInfomationWorld class lists ISO-3166-alpha_2 and ISO-3166-alpha_3 codes for 217 countries of the world including their population based on the year 2020. You may take a look at the list here.
Once having the data frame we provide you with functions to plot the data using the PlotterBuilder class or draw a world map as a heatmap using the CovidMap class. Both classes have not been documented yet, but the source code of course includes comments and sample applications are also available on GitHub.
Cache functionality introduced in version 6.1
Version 5.3 introduced a simple cache mechanism that works together with the sub-classes; in this release it is only supported by the WHO sub-class. The cache support requires the sub-class to read the data from a cache file that uses the same filename as the original data file with an added -cache. E.g. YY-DD-MM-WHO-db.csv becomes YY-DD-MM-WHO-db-cache.csv. The content of the cache is defined by a cacheLevel. The higher the value, the more attributes are generated for all countries in the given data frame (which typically holds the data for all countries) and the longer it takes to build the cache. The cacheLevel values are defined as follows:
|0||No additional attributes will be generated and the class behaves as in previous versions|
|1||The cache includes the following pre-calculated attributes: Cases, Deaths, PercentDeaths, CasesPerMillionPopulation, DeathsPerMillionPopulation, DoublingTime|
|2||Includes the attributes of cache level 1 plus DailyCases7 and DailyDeaths7|
|3||Includes the attributes of cache level 2 plus R0|
|4||Includes the attributes of cache level 3 plus R7|
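The cumulative nature of the cache levels can be sketched as a small lookup. This is only an illustration of the table above, assuming the levels run from 0 to 4 (the sample program uses cacheLevel=4); the names `ATTRIBUTES_ADDED_BY_LEVEL` and `attributes_for_cache_level` are hypothetical and not part of the class.

```python
# Attributes added at each cache level, as listed in the table above.
# Both names below are illustrative and not part of the CovidCases class.
ATTRIBUTES_ADDED_BY_LEVEL = {
    0: [],
    1: ['Cases', 'Deaths', 'PercentDeaths', 'CasesPerMillionPopulation',
        'DeathsPerMillionPopulation', 'DoublingTime'],
    2: ['DailyCases7', 'DailyDeaths7'],
    3: ['R0'],
    4: ['R7'],
}


def attributes_for_cache_level(cache_level):
    """Return all attributes contained in a cache of the given level."""
    if cache_level not in ATTRIBUTES_ADDED_BY_LEVEL:
        raise ValueError('unknown cache level: {}'.format(cache_level))
    attributes = []
    # each level includes everything from the levels below it
    for level in range(cache_level + 1):
        attributes.extend(ATTRIBUTES_ADDED_BY_LEVEL[level])
    return attributes


print(attributes_for_cache_level(2))
```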
To generate the cache the sub-class got an additional optional parameter called cacheLevel that defaults to 0. If the cacheLevel is greater than 0, the sub-class will make this base class generate the cache. Additionally it will pass the optional full filename of the cache file to be generated, which defaults to ‘’.
To summarize: The sub-class is obliged to read the cache file if such a file exists and to make the base class build it if it doesn’t exist. The base class is obliged to create the cache file. This split of reading and writing the cache ensures that there is a centralized function to create the cache that might be further optimized in the future and which includes the same attributes for all data frames.
This is a sample of how to build the cache. Once it has been created, all other programs will run faster:
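The filename convention and the read-or-build decision described above might look roughly like the following. This is a minimal sketch, not the class's actual code; `cache_filename_for` and `choose_cache_action` are hypothetical helper names.

```python
import os


def cache_filename_for(data_filename):
    """Derive the cache filename by adding '-cache' before the extension."""
    root, extension = os.path.splitext(data_filename)
    return root + '-cache' + extension


def choose_cache_action(data_filename, cache_level):
    """Decide what a sub-class would do for the given file and cache level."""
    if cache_level <= 0:
        # level 0: behave as in previous versions, no cache involved
        return 'no cache'
    if os.path.exists(cache_filename_for(data_filename)):
        # the sub-class reads an existing cache file
        return 'read cache'
    # otherwise the base class is asked to build the cache
    return 'build cache'


print(cache_filename_for('YY-DD-MM-WHO-db.csv'))
```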
import os
import time

from CovidCases import CovidCases
from CovidCasesWHO import CovidCasesWHO


def main():
    try:
        # get the data directory from the environment
        dataDirectory = os.environ['COVID_DATA']
    except KeyError:
        # fall back to the default data directory
        dataDirectory = '../data'
    # get the latest database file as a CSV
    try:
        pathToCSV_who = CovidCasesWHO.download_CSV_file(dataDirectory)
    except FileNotFoundError:
        # print an error message
        print('Unable to download the database. Try again later.')
        return
    # create an instance to build the cache
    covidCases_who = CovidCasesWHO(pathToCSV_who, cacheLevel=4)
    print('created level 4 cache')
    # afterwards you can get the data for specific countries or build maps. E.g.:
    # countryList = 'DE, GB, FR, ES, IT, CH, AT'
    # df = covidCases_who.get_data_by_geoid_string_list(countryList)


# execute main
if __name__ == "__main__":
    main()
CovidCases class documentation version 6.1
This abstract class acts as a parent class for different sub-classes which we call the CovidCases World sub-classes. While the sub-classes are responsible for getting the data from different data sources and providing it as a Pandas DataFrame, this base class provides functions to process the data.
These are the methods provided by the class:
def __init__(self, dataFrame, filenameCache = '', cacheLevel = 0):
The constructor of the class just takes the Pandas DataFrame created by a sub-class. Each row of the DataFrame contains the data for a specific date; together the rows form a time series with the latest date at the top (row 0). The columns have to include the following mandatory attributes and may include additional private columns if required:
|GeoName||The name of the country, county or city|
|GeoID||The GeoID of the country. Refer to this post to get a list of GeoIDs and country names.|
|Population||The population of the country, county or city based on 2019 data.|
|Continent||The continent of the country. In case of a city it may be the county. In case of a county it may be a federal state or region. In general it’s a grouping in a level above the meaning of the GeoName - GeoID combination.|
|DailyCases||The daily number of confirmed cases.|
|DailyDeaths||The daily number of deaths of confirmed cases|
Based on the given columns the class will generate the following columns:
|Cases||The overall number of confirmed infections (here called cases) since December 31st, 2019 as published by the data source.|
|Deaths||The overall number of deaths of confirmed cases.|
|PercentDeaths||The percentage of deaths of the confirmed cases. This is also called Case-Fatality-Rate (CFR) which is an estimation for the Infection-Fatality-Rate (IFR) which also includes unconfirmed (hidden or dark) infections|
|DoublingTime||The time in days after which the number of Cases doubles|
|CasesPerMillionPopulation||The number of Cases divided by the population in millions|
|DeathsPerMillionPopulation||The number of Deaths divided by the population in millions|
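As an illustration of how such columns can be derived from the mandatory ones, here is a minimal pandas sketch. It assumes a single country, rows ordered newest-first as described above, and made-up sample numbers; it is not the class's actual implementation and it leaves out DoublingTime, whose exact formula is not given here.

```python
import pandas as pd

# Made-up sample data for one country, newest date in row 0.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-03', '2020-03-02', '2020-03-01']),
    'GeoID': ['XX', 'XX', 'XX'],
    'Population': [2000000, 2000000, 2000000],
    'DailyCases': [30, 20, 10],
    'DailyDeaths': [3, 2, 1],
})

# Rows are newest-first, so reverse before accumulating. The assignment
# aligns on the index labels, which restores the newest-first order.
df['Cases'] = df['DailyCases'].iloc[::-1].cumsum()
df['Deaths'] = df['DailyDeaths'].iloc[::-1].cumsum()

# Case fatality rate in percent and the per-million figures.
df['PercentDeaths'] = df['Deaths'] / df['Cases'] * 100
df['CasesPerMillionPopulation'] = df['Cases'] / (df['Population'] / 1000000)
df['DeathsPerMillionPopulation'] = df['Deaths'] / (df['Population'] / 1000000)

print(df[['Date', 'Cases', 'Deaths', 'PercentDeaths']])
```

For a data frame holding several countries the accumulation would have to be done per GeoID, for example with a `groupby`.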
If you use the data from Our World in Data you have additional access to the following attributes:
|DailyVaccineDosesAdministered7DayAverage||New COVID-19 vaccination doses administered (7-day smoothed). For countries that don’t report vaccination data on a daily basis, we assume that vaccination changed equally on a daily basis over any periods in which no data was reported. This produces a complete series of daily figures, which is then averaged over a rolling 7-day window. In OWID words this is the new_vaccinations_smoothed value.|
|VaccineDosesAdministered||Total number of COVID-19 vaccination doses administered. It’s the sum of PeopleReceivedFirstDose and PeopleReceivedAllDoses. In OWID words this is the total_vaccinations value.|
|PeopleReceivedFirstDose||Total number of people who received at least one vaccine dose. In OWID words this is the people_vaccinated value.|
|PercentPeopleReceivedFirstDose||The percentage of people of the population who received at least one vaccine dose.|
|PeopleReceivedAllDoses||Total number of people who received all doses defined by the vaccination protocol. In OWID words this is the people_fully_vaccinated value.|
|PercentPeopleReceivedAllDoses||The percentage of people of the population who received all doses defined by the vaccination protocol.|
After calling functions such as add_r0, add_incidence_7day_per_100Kpopulation or add_lowpass_filter_for_attribute (refer to the function definitions below) you will notice additional attributes such as:
|R||An estimation of the reproduction number R. The attribute should finally be low-pass filtered with a kernel size of 7.|
|Incidence7DayPer100Kpopulation||The accumulated 7-day incidence. That is the sum of the daily cases of the last 7 days divided by the population in 100000 people.|
If cacheLevel is greater than 0 and filenameCache is not '' the class will create a cache file and store it under the filename filenameCache. Refer to the definition of the different cacheLevel values above.
These are additional public member functions that can be accessed:
Returns the name of the cache file after it has been built. The constructor must have been invoked so that the cache was generated. If the cache has not been created in the constructor the function returns an empty string.
def get_country_data_by_geoid_list(self, geoIDs, lastNdays=0, sinceNcases=0):
Returns a Pandas DataFrame by a list of strings containing the geoIDs of countries such as ['DE', 'UK']. Here you will find a list of GeoIDs and countries. Optional parameters are:
lastNdays: returns just the data of the last n days.
sinceNcases: returns just the data since the nth case has been exceeded per country.
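The effect of the two optional parameters can be sketched with plain pandas. This is only an illustration under stated assumptions: the frame already has a cumulative Cases column, rows are newest-first per country, and `filter_rows` is a hypothetical helper rather than a method of the class. How the class combines both parameters internally is not documented here.

```python
import pandas as pd


def filter_rows(df, lastNdays=0, sinceNcases=0):
    """Apply the two optional filters to a newest-first data frame."""
    if sinceNcases > 0:
        # keep only the rows after the cumulative count exceeded the threshold
        df = df[df['Cases'] > sinceNcases]
    if lastNdays > 0:
        # rows are newest-first, so the first n rows per GeoID are the last n days
        df = df.groupby('GeoID', sort=False).head(lastNdays)
    return df.reset_index(drop=True)


# Made-up sample data for two countries, newest date first per country.
sample = pd.DataFrame({
    'GeoID': ['DE', 'DE', 'DE', 'GB', 'GB', 'GB'],
    'Cases': [60, 30, 10, 500, 200, 50],
})

print(filter_rows(sample, lastNdays=2))
print(filter_rows(sample, sinceNcases=25))
```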
def get_data_by_geoid_string_list(self, geoIDstringList, lastNdays=0, sinceNcases=0):
Exactly the same as the function above, but this time the list of GeoIDs is given as a comma separated list such as "DE, UK".
def get_all_data(self, lastNdays=0, sinceNcases=0):
The function works like the two functions above, but this time it returns a DataFrame for all countries in the CSV. Notice that it might take some time before the function returns.
def save_df_to_csv(self, df, filename):
Saves the given DataFrame df to a CSV file. The file will contain all columns of the DataFrame, including those that have been added by the functions below.
def add_r0(self, df):
Adds an attribute to the given df of each country that is an estimation of the reproduction number R0. Here the number is called ‘R’. The returned DataFrame will contain low-pass filtered data with a kernel size of 7. If the attribute already exists in the df the function will return the given df.
def add_incidence_7day_per_100Kpopulation(self, df):
Adds an attribute to the df of each country that is representing the accumulated 7-day incidence. That is the sum of the daily cases of the last 7 days divided by the population in 100000 people. If the attribute already exists the function will return the given df.
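The incidence calculation described above can be sketched as a rolling sum. This is a minimal illustration assuming a single country, newest-first rows and made-up numbers; `add_incidence_sketch` is a hypothetical stand-in, not the method of the class.

```python
import pandas as pd


def add_incidence_sketch(df):
    """Add the accumulated 7-day incidence per 100,000 people."""
    name = 'Incidence7DayPer100Kpopulation'
    if name in df.columns:
        # mirror the documented behaviour: return the given df unchanged
        return df
    df = df.copy()
    # reverse to oldest-first, sum a 7-day window; index labels are kept
    sum7 = df['DailyCases'].iloc[::-1].rolling(window=7).sum()
    # the division and the assignment both align on the index labels
    df[name] = sum7 / (df['Population'] / 100000)
    return df


# Made-up sample data for one country, newest date in row 0.
sample = pd.DataFrame({
    'GeoID': ['XX'] * 8,
    'Population': [1000000] * 8,
    'DailyCases': [70, 60, 50, 40, 30, 20, 10, 5],
})

result = add_incidence_sketch(sample)
print(result['Incidence7DayPer100Kpopulation'])
```

The oldest six rows have fewer than seven days of history, so in this sketch they stay empty (NaN).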
def add_lowpass_filter_for_attribute(self, df, attribute, n):
Adds an attribute to the given df of each country that is the low-pass filtered data of the given attribute (attribute name as a string). The width of the low-pass is given by the number n. The name of the newly created attribute is the given name with a trailing number n. E.g. for the attribute Cases, n = 7 will add a new attribute named Cases7. If the attribute already exists the function will return the given df.
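The naming convention and the filtering can be sketched as follows. The kernel used by the class is not specified here, so this sketch assumes a simple moving average over n rows; `add_lowpass_sketch` is a hypothetical stand-in for the real method and assumes a single country with newest-first rows.

```python
import pandas as pd


def add_lowpass_sketch(df, attribute, n):
    """Add '<attribute><n>' holding an n-wide moving average of attribute."""
    name = attribute + str(n)
    if name in df.columns:
        # mirror the documented behaviour: return the given df unchanged
        return df
    df = df.copy()
    # rows are newest-first: reverse, average, and let the index realign
    df[name] = df[attribute].iloc[::-1].rolling(window=n, min_periods=1).mean()
    return df


# Made-up sample data for one country, newest date in row 0.
sample = pd.DataFrame({'GeoID': ['XX'] * 3, 'DailyCases': [30, 20, 10]})

filtered = add_lowpass_sketch(sample, 'DailyCases', 2)
print(filtered)
```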
@staticmethod
def create_combined_dataframe_by_geoid_string_list(dfList, geoIDs, lastNdays=0, sinceNcases=0):
Creates a combined data frame from a list of individual data frames. To avoid duplicate country names the method adds a ‘-DATASOURCE’ string behind the country name (e.g. ‘Germany-OWID’). The method takes a tuple of DataFrame objects as its first parameter. geoIDs is a list of GeoIDs given as a comma separated list such as "DE, UK". The optional parameter lastNdays allows you to select only the data for the last N days. Alternatively you can align the data based on the day when the first N cases have been reported using the optional parameter sinceNcases. The function finally returns a combined data frame containing the data from all given data frames.
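The idea of tagging country names with their data source before combining can be sketched like this. It is an illustration only: `combine_by_geoid_string` and its `sourceNames` parameter are hypothetical (the real method takes no such parameter and presumably obtains the source name itself), and the lastNdays and sinceNcases handling is left out.

```python
import pandas as pd


def combine_by_geoid_string(dfList, sourceNames, geoIDs):
    """Combine frames from several sources, tagging GeoName with the source."""
    # turn a string such as "DE, UK" into ['DE', 'UK']
    wanted = [geoID.strip() for geoID in geoIDs.split(',')]
    parts = []
    for df, source in zip(dfList, sourceNames):
        part = df[df['GeoID'].isin(wanted)].copy()
        # e.g. 'Germany' becomes 'Germany-OWID' to avoid duplicate names
        part['GeoName'] = part['GeoName'] + '-' + source
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


# Made-up sample frames standing in for two data sources.
who = pd.DataFrame({'GeoID': ['DE', 'FR'], 'GeoName': ['Germany', 'France'],
                    'DailyCases': [10, 20]})
owid = pd.DataFrame({'GeoID': ['DE', 'FR'], 'GeoName': ['Germany', 'France'],
                     'DailyCases': [11, 21]})

combined = combine_by_geoid_string((who, owid), ('WHO', 'OWID'), 'DE')
print(combined)
```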
@abstractmethod
def get_available_GeoID_list(self):
Returns a DataFrame having just two columns: GeoID and GeoName. You may want to store the returned DataFrame as a CSV file. The function has to be implemented by all sub-classes.
@abstractmethod
def get_data_source_info(self):
Returns a DataFrame containing information about the data source. The DataFrame holds 3 columns:
InfoFullName: The full name of the data source
InfoShortName: A shortname for the data source
InfoLink: The link to get the data
The function has to be implemented by all sub-classes.
@abstractmethod
def review_geoid_list(self, geoIDs):
Returns a corrected version of the given geoID list to ensure that mismatches like UK versus GB are corrected by the sub-class. For instance: If the given list contains [‘DE’, ‘UK’] the function will return [‘DE’, ‘GB’], replacing the wrong UK with the ISO-3166-alpha_2 conforming GB.
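To show how a sub-class might fulfil these three abstract methods, here is a minimal self-contained sketch. `SourceBaseSketch` and `ToySource` are hypothetical stand-ins (the real base class is CovidCases and requires more than this), the data and the link are made up, and the sketch assumes that the two columns of the GeoID list are GeoID and GeoName.

```python
from abc import ABC, abstractmethod

import pandas as pd


class SourceBaseSketch(ABC):
    """Stand-in for the abstract part of the base class (illustrative only)."""

    @abstractmethod
    def get_available_GeoID_list(self):
        ...

    @abstractmethod
    def get_data_source_info(self):
        ...

    @abstractmethod
    def review_geoid_list(self, geoIDs):
        ...


class ToySource(SourceBaseSketch):
    """Hypothetical sub-class backed by a tiny in-memory table."""

    def __init__(self):
        self._df = pd.DataFrame({'GeoID': ['DE', 'GB'],
                                 'GeoName': ['Germany', 'United Kingdom']})

    def get_available_GeoID_list(self):
        # just the two identifying columns, one row per GeoID
        return self._df[['GeoID', 'GeoName']].drop_duplicates()

    def get_data_source_info(self):
        # the three documented columns, filled with placeholder values
        return pd.DataFrame({'InfoFullName': ['Toy data source'],
                             'InfoShortName': ['TOY'],
                             'InfoLink': ['https://example.org/toy.csv']})

    def review_geoid_list(self, geoIDs):
        # map known mismatches to the ISO-3166-alpha_2 code
        corrections = {'UK': 'GB'}
        return [corrections.get(geoID, geoID) for geoID in geoIDs]


source = ToySource()
print(source.review_geoid_list(['DE', 'UK']))
```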
CovidCases sub-classes documentation
Refer to the CovidCases sub-classes version 6.1 documentation for details about the different sub-classes as these might contain additional features or attributes.