===================================== KDD Sisyphus I - Data set description ===================================== The KDD Sisyphus I data set is an excerpt from a data warehouse system called MASY which is employed to collect and analyse data from the private life insurance business at Swiss Life. Swiss Life is the market leader for life insurances and pension schemes in Switzerland and one of the leading companies in these areas in whole Europe. The data set was prepared by the Swiss Life Information Systems Research group. Please direct all comments, questions and critics to Dr. Joerg-Uwe Kietz or Dr. Martin Staudt Swiss Life, CH/IFUE Postfach CH-8022 Zurich, Switzerland Email: {Uwe.Kietz|Martin.Staudt}@swisslife.ch 0. Contents =========== 1. Overview about files 2. General Schema Explanation 3. Schema Description 4. KDD Sisyphus I Tasks 1. Overview about files ======================= The KDD Sisyphus I data set consists of the following files Readme.txt this file kddERdia.gif diagram of the schema together with .bmp foreign key relationships eadr.dat.gz data file for relation eadr (gzip'ed) hhold.dat.gz data file for relation hhold (gzip'ed) padr.dat.gz data file for relation padr (gzip'ed) parrol.dat.gz data file for relation parrol (gzip'ed) part.dat.gz data file for relation part (gzip'ed) tfkomp.dat.gz data file for relation tfkomp (gzip'ed) tfrol.dat.gz data file for relation tfrol (gzip'ed) vvert.dat.gz data file for relation vvert (gzip'ed) taska.dat.gz data file for task A (gzip'ed) taskb.dat.gz data file for task B (gzip'ed) We plan to produce some background knowledge about products and tariffs in the future these will be available in the files: lvtarf.dat.gz data file for relation lvtarf (gzip'ed) prod.dat.gz data file for relation prod (gzip'ed) Each data file contains records in Prolog-like style one tuple per line: (comp_1,....,comp_n). 2. General Schema Explanation ============================= The relation schema consists of 10 relations describing the relationship between Swiss Life partners, i.e. in particular customers, insurance contracts and components of insurance tariffs. The relation 'part' contains all partners. The relations 'eadr' resp. 'padr' relates them to their electronical resp. postal addresses. Details about the households they are living in can be found in 'hhold'. Most of the houeshold data is derived from the geographical region the household belongs to. Each partner can play roles in certain insurance policies (relation 'vvert') which is realized by a relation 'parrol'. If a partner is the insured person of the contract then tariff role records (relation 'tfrol') specify certain further properties. An insurance contract can have several components (e.g. the main contract part plus a component for insuring the case that the ensured person becomes invalid) each of which (records in 'tfkomp') is related with a tariff role of the respective partners. Finally, each policy concerns a certain product ('prod') and tariff components are bound to dedicated insurance tariffs (relation 'lvtarf'). A graphical overview about the schema is contained in schema.ps. The number of records in each relation is as follows: vvert: 34986 parrol: 111077 part: 17267 hhold: 12934 eadr: 505 padr: 17970 tfkomp: 73502 prod: 184 tfrol: 73332 lvtarf: 239 ------- 340573 3. Schema Description ===================== Relation: vvert - Life Insurance Policy vvid:NUMBER insurance policy number pdid:NUMBER identifier of the product the insurance contract is bound to vvstacd:NUMBER status code of the contract vvinkzwei:NUMBER modus of payment for premium rates vvbeg:NUMBER year of contract start (earliest start date among all tariff components) see attribute tkbeg of relation tfkomp vvend:NUMBER year of contract end (last date among all tariff components) vvinkprl:NUMBER yearly overall premium to be payed by the policy holder sum over all tariff components; see attribute tkinkprl of relation tfkomp vvinkpre:NUMBER single premium to be payed at contract start sum over all tariff components; see attribute tkinkpre of relation tfkomp vvwae:NUMBER currency for payments vvversart:NUMBER type of insurance contract, e.g. capital insurance, pension insurance vvaendart:NUMBER type of last modification vvaendat:NUMBER year of last modification vvabvb:NUMBER responsible agent vvabga:NUMBER responsible general agency vvstifcd:NUMBER place where policy is left; most (own) employee's contracts are kept in a collective depot vvvorscd:NUMBER type of (official Swiss) precaution type of the contract vvbvgcd:NUMBER slot for storing whether the insured person participates in a companys pension insurance scheme vveucd:NUMBER code for included disablement insurance; depends on respective tariff components Relation:parrol - Partner Roles prid:NUMBER role identifier ptid:NUMBER partner identifier; refers to attribute ptid of relation partner vvid:NUMBER identifier of the insurance contract where the partner ptid plays the role prid prtyp:NUMBER type of role, i.e. 6 - special agent, 9 - premium payer, 10 - pension receiver, 11 - insurance holder, 12 - insured person, 14 - pledgee, 15 - swiss representative, 17 - legal representative, 18 - guardian prtypnr:NUMBER role number wrt. type and contract, e.g. insured person 2 Relation:part - Partners Involved in Insurance Contracts ptid:NUMBER partner identifier hhid:NUMBER household identifier (see attribute hhid of relation hhold) epberuf:NUMBER profession code eperweart:NUMBER professional status (independent or employed) epgebudat:NUMBER year of birth epsta:NUMBER personal state, i.e. nothing special (0), dead (1), missed (2) or under revision as disabled person (3), disabled person (4) eplnamecd:NUMBER name code for married persons epsexcd:NUMBER sex: 1 - male, 2 - female epzivstd:NUMBER marital status: 0 - unknown, 1 - single, 2 - married, 3 - widowed, 4 - divorced, 5 - separated Relation:tfkomp - Tariff Components tkid:NUMBER component identifier vvid:NUMBER insurance contract number the component belongs to (see attribute vvid of relation vvert) tknr:NUMBER (running) component number wrt. contract tfid:NUMBER number of tariff the component is bound to; see attribute tfid of lvftarf tkbeg:NUMBER starting date of tariff component tkend:NUMBER end date of tariff component tkexkzwei:NUMBER payment mode (payments per year in particular for pension insurances) tkstacd:NUMBER status of the component (regular premium (1) or single premium (2) tkleist:NUMBER insured benefits of the component tkinkprl:NUMBER regular premium for the component tkinkpre:NUMBER single premium for the component tktarpra:NUMBER calculated (internal) premium tkuebvwsysp:NUMBER type of profit spending for running contracts tkuebvwsysl:NUMBER type of profit spending during pension payment period tkprfin:NUMBER current component state wrt. premium: required or not tkdyncd:NUMBER states whether tariff component is dynamic (i.e. regular adaption of premium dependent on inflationary rate e.g.) 0 - no, 1 - yes, 2 - not applicable or unknown tkausbcd:NUMBER states whether extension guarantee (higher insured sum without additional risk check) was claimed for tariff component 0 - no, 1 - yes, 2 - not applicable or unknown tkrauv:NUMBER type information for the special case of risk extension and exchange insurances tksmed:NUMBER states special medical (risk) cases tkrizucd:NUMBER extra premium charge for certain risk cases 0 - no, 1 - yes, 2 - not applicable or unknown tklfleist:NUMBER (regular) current payment (dynamically increasing) tktodleista:NUMBER payment for death of insured person tkerlleista:NUMBER (initial) payment for regular contract termination tkrenleista:NUMBER (initial) payment of pension tkeuleista:NUMBER (initial) payment for the case of disablement tkunfleista:NUMBER (initial) payment for the case of accident Relation:tfrol - Tariff Roles trid:NUMBER tariff role identifier prid:NUMBER partner identifier (see relation partner) tkid:NUMBER link to tariff component (see attribute tkid of relation tfrol) trteceinal:NUMBER (technical) age of insured person at contract agreement truntcd:NUMBER states whether medical risk check was necessary 0 - no, 1 - yes, 2 - not applicable or unknown trklauscd:NUMBER states whether special contract clause was negotiated 0 - no, 1 - yes, 2 - not applicable or unknown trstafcd:NUMBER states whether staggered payment was negotiated 0 - no, 1 - yes, 2 - not applicable or unknown trricd:NUMBER states that premium surcharge for higher risks was set 0 - no, 1 - yes, 2 - not applicable or unknown Relation:prod - Product pdid:NUMBER product identifier pdbest:NUMBER source of product information ACHTUNG: eine Klassifizierung von Produkten z.B. Rentenvers., Kapitalvers., Einmaleinlage usw. wird derzeit noch erarbeitet und hier ergaenzt Relation:lvtarf - life insurance tariff tfid:NUMBER tariff identifier tfbest:NUMBER source of tariff information risktype:NUMBER basic type of insured risk or intended achievement 1 - invalidity, 2 - deadth, 3 - pension, 4 - capital (saving), 5 - mixed (capital+death), 6 - health premium:NUMBER way of paying the required contributions for the insurance 1 - single, 2 - regular payment:NUMBER way of getting the insurance benefits 1 - single, 2 - regular, 3 - excemption, 4 - regular limited surrender:NUMBER whether someone else has the right to receive the payments 1 - yes, 2 - no insured:NUMBER single or multiple person insured 1 - single, 2 - multiple delayed:NUMBER start of payment (in particular for the case of pensions) 1 - yes, 2 - no, 3 - not applicable guarantee:NUMBER guaranteed payment (even in the case of death, in particular for pensions) 1 - yes, 2 - no, 3 - not applicable Relation:eadr - electronical address ptid:NUMBER partner identifier (see attribute ptid of relation partner) azart:NUMBER type of address usage (e.g. private, business) eatyp:NUMBER type of address (e.g. Fax, Phone etc.) Relation:padr - postal address ptid:NUMBER partner identifier (see attribute ptid of relation partner) azart:NUMBER type of address usage (e.g. private, business) kanton:STRING district (inside Switzerland) gbeadmgeb:NUMBER region relation: hhold - household data (Sorry: due to legal reasons we are not allowed to publish the original attribute names and their semantics If you have special questions please contact us !) hhid:STRING household identifier attr2:STRING attr3:STRING attr4:STRING attr5:STRING attr6:STRING attr7:STRING attr8:STRING attr9:STRING attr10:STRING attr11:STRING attr12:NUMBER attr13:NUMBER attr14:NUMBER attr15:NUMBER attr16:STRING attr17:STRING attr18:STRING attr19:STRING attr20:NUMBER attr21:NUMBER attr22:STRING attr23:NUMBER attr24:NUMBER attr25:NUMBER attr26:NUMBER attr27:NUMBER attr28:STRING attr29:STRING attr30:STRING attr31:STRING attr32:NUMBER attr33:NUMBER attr34:STRING attr35:STRING attr36:STRING attr37:STRING attr38:STRING attr39:STRING attr40:STRING attr41:STRING attr42:STRING attr43:STRING 4. KDD Sisyphus I Tasks ======================= The following 3 tasks should be tackled with arbitrary Data Mining, Statistics or Machine Learning algorithms applied to the KDD Sisyphus I data set. Please note, that the main focus lies on the preprocessing steps which make the algorithms applicable rather than on achieving high-precision mining results. Tasks A and B are concept learning (resp. classification) tasks and aim at finding concept descriptions for subsets of partners (A) and households (B). The tables 'taskA' and 'taskB' relate the respective identifiers 'ptid' (A) and 'hhid' (B) with a value indicating concept membership. relation:taskA - class membership wrt. concept in task A ptid:NUMBER partner identifier (see attribute 'ptid' of relation 'part') class:NUMBER 0 - No, 1 - Yes relation: taskB - class membership wrt. concept in task B hhid:NUMBER household identifier (see attribute 'hhid' of relation 'hhold') class:NUMBER 0 - No, 1 - Yes, 2 - not applicable Task C is a clustering task intended to relate properties of households with the features of their insurance contracts. Task A ------ Try to find a concept description for those partners (attribute 'ptid' of 'taskA') who have value '1' for attribute 'class'. Task B ------ Try to find a concept description for those households (attribute 'hhid' of 'taskB') who have value '1' for attribute 'class'. Task C ------ Try to find clusters on 'hhold' such that members of a cluster resemble each other not only wrt. household attributes but also wrt. the insurance policies they are involved in.