Monday, January 27, 2020

Data Pre-processing Tool

Chapter 2

Real-life data rarely comply with the requirements of various data mining tools. They are usually inconsistent and noisy, and may contain redundant attributes, unsuitable formats, and so on. Hence data have to be prepared vigilantly before the data mining actually starts. It is a well-known fact that the success of a data mining algorithm is very much dependent on the quality of data preprocessing, which is one of the most important tasks in data mining. Data pre-processing is a complicated task involving large data sets, and it sometimes takes more than 50% of the total time spent in solving a data mining problem. It is therefore crucial for data miners to choose efficient pre-processing techniques for a specific data set; these can not only save processing time but also retain the quality of the data for the mining process.

A data pre-processing tool should help miners with many data mining activities. For example, data may be provided in different formats as discussed in the previous chapter (flat files, database files, etc.). Data files may also require different formats of values, calculation of derived attributes, data filters, joined data sets, and so on. The data mining process generally starts with an understanding of the data; in this stage pre-processing tools may help with data exploration and data discovery tasks. Data pre-processing includes a lot of tedious work and generally consists of:

- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction

In this chapter we will study all these data pre-processing activities.

2.1 Data Understanding

In the data understanding phase the first task is to collect the initial data and then proceed with activities that make you familiar with the data, discover data quality problems, gain first insights into the data, or identify interesting subsets to form hypotheses about hidden information. The data understanding phase according to the CRISP-DM model is usually shown as a diagram (figure not reproduced here).

2.1.1 Collect Initial Data

The initial collection of data includes loading the data if this is required for data understanding. For instance, if a specific tool is used for data understanding, it makes sense to load your data into that tool. This possibly leads to initial data preparation steps. However, if data are obtained from multiple data sources, integration is an additional issue.

2.1.2 Describe Data

Here the gross or surface properties of the gathered data are examined.

2.1.3 Explore Data

This task is required to handle the data mining questions, which may be addressed using querying, visualization, and reporting. These include:

- Distribution of key attributes, for instance the goal attribute of a prediction task
- Relations between pairs or small numbers of attributes
- Results of simple aggregations
- Properties of important sub-populations
- Simple statistical analyses

2.1.4 Verify Data Quality

In this step the quality of the data is examined. It answers questions such as: Is the data complete (does it cover all the cases required)? Is it accurate, or does it contain errors, and if there are errors, how common are they? Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?
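To make the explore and verify tasks above concrete, here is a minimal sketch using Python and pandas; the file name customers.csv and the column name region are purely illustrative assumptions, not part of the original text.

```python
import pandas as pd

df = pd.read_csv("customers.csv")      # collect initial data (load into the tool)

# Describe data: gross or surface properties
print(df.shape)                        # number of records and attributes
print(df.dtypes)                       # attribute types

# Explore data: simple statistics and distributions of key attributes
print(df.describe(include="all"))      # means, spreads, counts
print(df["region"].value_counts())     # distribution of one categorical attribute

# Verify data quality: missing values and exact duplicates
print(df.isna().sum())                 # missing values per attribute
print(df.duplicated().sum())           # number of exact duplicate records
```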
2.2 Data Preprocessing

The data preprocessing phase focuses on the pre-processing steps that produce the data to be mined. Data preparation, or preprocessing, is one of the most important steps in data mining. Industrial practice indicates that once data are well prepared, the mined results are much more accurate, which means this step is also critical for the success of a data mining method. Among other activities, data preparation mainly involves data cleaning, data integration, data transformation, and data reduction.

2.2.1 Data Cleaning

Data cleaning is also known as data cleansing or scrubbing. It deals with detecting and removing inconsistencies and errors from data in order to obtain better quality data. When a single data source such as a flat file or database is used, data quality problems arise from misspellings during data entry, missing information, or other invalid data. When the data come from the integration of multiple data sources, such as data warehouses, federated database systems, or global web-based information systems, the need for data cleaning increases significantly, because the multiple sources may contain redundant data in different formats. Consolidation of different data formats and elimination of redundant information become necessary in order to provide access to accurate and consistent data.

Good quality data must satisfy a set of quality criteria. These criteria include:

- Accuracy: an aggregated value over the criteria of integrity, consistency, and density.
- Integrity: an aggregated value over the criteria of completeness and validity.
- Completeness: achieved by correcting data containing anomalies.
- Validity: approximated by the amount of data satisfying integrity constraints.
- Consistency: concerns contradictions and syntactical anomalies in the data.
- Uniformity: directly related to irregularities in the data.
- Density: the quotient of missing values in the data and the number of total values that ought to be known.
- Uniqueness: related to the number of duplicates present in the data.

2.2.1.1 Terms Related to Data Cleaning

- Data cleaning: the process of detecting, diagnosing, and editing damaged data.
- Data editing: changing the value of data which are incorrect.
- Data flow: the passing of recorded information through succeeding information carriers.
- Inliers: data values falling inside the projected range.
- Outliers: data values falling outside the projected range.
- Robust estimation: estimation of statistical parameters using methods that are less sensitive to the effect of outliers than conventional methods.

2.2.1.2 Definition: Data Cleaning

Data cleaning is a process used to identify imprecise, incomplete, or irrational data and then to improve quality through correction of detected errors and omissions. This process may include:

- Format checks
- Completeness checks
- Reasonableness checks
- Limit checks
- Review of the data to identify outliers or other errors
- Assessment of the data by subject area experts (e.g. taxonomic specialists)

In this process suspected records are flagged, documented, and subsequently checked, and finally these suspected records can be corrected. Sometimes validation checks also involve checking for compliance against applicable standards, rules, and conventions. The general framework for data cleaning is:

1. Define and determine error types;
2. Search and identify error instances;
3. Correct the errors;
4. Document error instances and error types; and
5. Modify data entry procedures to reduce future errors.
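As a rough illustration of how the density, uniqueness, and validity criteria above could be turned into numbers during the "search and identify error instances" step, the following sketch assumes the same hypothetical customers file and an illustrative age column; both are assumptions for demonstration, not part of the original chapter.

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# Density: share of values that are actually present
density = 1.0 - df.isna().sum().sum() / df.size

# Uniqueness: share of records that are not exact duplicates
uniqueness = 1.0 - df.duplicated().sum() / len(df)

# Validity: share of records satisfying a simple integrity constraint
valid_age = df["age"].between(0, 120).mean()

print(f"density={density:.3f}  uniqueness={uniqueness:.3f}  valid_age={valid_age:.3f}")
```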
The data cleaning process is referred to by different people with a number of terms; it is a matter of preference which one is used. These terms include error checking, error detection, data validation, data cleaning, data cleansing, data scrubbing, and error correction. We use "data cleaning" to encompass three sub-processes: data checking and error detection; data validation; and error correction. A fourth sub-process, improvement of the error prevention processes, could perhaps be added.

2.2.1.3 Problems with Data

Here we note some key problems with data.

- Missing data: This problem occurs for two main reasons: data are absent from the source where they are expected to be present, or data are present but not available in an appropriate form. Detecting missing data is usually straightforward.
- Erroneous data: This problem occurs when a wrong value is recorded for a real-world value, for instance the incorrect spelling of a name. Detection of erroneous data can be quite difficult.
- Duplicated data: This problem occurs for two reasons: repeated entry of the same real-world entity with somewhat different values, or a real-world entity having different identifications. Repeated records are common and frequently easy to detect, whereas different identifications of the same real-world entity can be a very hard problem to identify and solve.
- Heterogeneities: When data from different sources are brought together in one analysis, heterogeneity may occur. Structural heterogeneity arises when the data structures reflect different business usage; semantic heterogeneity arises when the meaning of data is different in each system being combined. Heterogeneities are usually very difficult to resolve because they involve a lot of contextual data that is not well defined as metadata.

Information dependencies in the relationships between different sets of attributes are commonly present, and wrong cleaning mechanisms can further damage the information in the data. Various analysis tools handle these problems in different ways; commercial offerings are available that assist the cleaning process, but they are often problem specific. Uncertainty in information systems is a well-recognized hard problem. (A figure with very simple examples of missing and erroneous data is not reproduced here.)

Extensive support for data cleaning must be provided by data warehouses, which have a high probability of "dirty data" since they load and continuously refresh huge amounts of data from a variety of sources. Since data warehouses are used for strategic decision making, the correctness of their data is important to avoid wrong decisions. The ETL (Extraction, Transformation, and Loading) process for building a data warehouse is usually illustrated with a diagram (not reproduced here). Data transformations are concerned with schema or data translation and integration, and with filtering and aggregating the data to be stored in the warehouse. All data cleaning is classically performed in a separate data staging area prior to loading the transformed data into the warehouse. A large number of tools of varying functionality are available to support these tasks, but often a significant portion of the cleaning and transformation work has to be done manually or by low-level programs that are difficult to write and maintain.
A data cleaning approach should satisfy the following requirements. It should identify and eliminate all major errors and inconsistencies, both in individual data sources and when integrating multiple sources. It should be supported by tools, to limit manual examination and programming effort, and it should be extensible so that it can cover additional sources. It should be performed together with schema-related data transformations based on metadata. Data cleaning mapping functions should be specified in a declarative way and be reusable for other data sources.

2.2.1.4 Data Cleaning: Phases

1. Analysis: To identify errors and inconsistencies in the database a detailed analysis is needed, involving both manual inspection and automated analysis programs. This reveals where (most of) the problems are.

2. Defining Transformation and Mapping Rules: After discovering the problems, this phase is concerned with defining how to automate the solutions that clean the data. The analysis phase yields various problems that translate into a list of activities, for example:

- Remove all entries for "J. Smith" because they are duplicates of "John Smith".
- Find entries with 'bule' in the colour field and change these to 'blue'.
- Find all records where the phone number field does not match the pattern (NNNNN NNNNNN), and apply further cleaning steps to them.

(A small code sketch of such rules appears at the end of this subsection.)

3. Verification: In this phase we check and assess the transformation plans made in phase 2. Without this step, we may end up making the data dirtier rather than cleaner. Since data transformation is the step that actually changes the data itself, we need to be sure that the applied transformations will do so correctly; therefore test and examine the transformation plans very carefully. Example: suppose we have a very thick C++ book where it says "strict" in all the places where it should say "struct".

4. Transformation: Once it is certain that cleaning will be done correctly, apply the transformations verified in the last step. For large databases, this task is supported by a variety of tools.

Backflow of Cleaned Data: In data mining the main objective is to convert and move clean data into the target system. This creates a requirement to purify legacy data. Cleansing can be a complicated process depending on the technique chosen and has to be designed carefully to achieve the objective of removing dirty data. Methods to accomplish the task of data cleansing of legacy systems include:

- Automated data cleansing
- Manual data cleansing
- A combined cleansing process
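To make phase 2 concrete, below is a minimal sketch of the example mapping rules given above (removing the J. Smith duplicates, correcting 'bule' to 'blue', and flagging phone numbers that do not match the NNNNN NNNNNN pattern). The tiny DataFrame and its column names are made up purely for illustration.

```python
import pandas as pd

# Illustrative data only; the column names and values are assumptions.
df = pd.DataFrame({
    "name":   ["John Smith", "J. Smith", "Jane Roe"],
    "colour": ["blue", "green", "bule"],
    "phone":  ["12345 678901", "12345 678901", "1234 56789"],
})

# Rule 1: remove entries for "J. Smith" (duplicates of "John Smith")
df = df[df["name"] != "J. Smith"].copy()

# Rule 2: correct the misspelling 'bule' -> 'blue'
df["colour"] = df["colour"].replace("bule", "blue")

# Rule 3: flag phone numbers that do not match the pattern NNNNN NNNNNN
df["phone_ok"] = df["phone"].str.match(r"^\d{5} \d{6}$")

print(df)
```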
2.2.1.5 Missing Values

Data cleaning addresses a variety of data quality problems, including noise and outliers, inconsistent data, duplicate data, and missing values. Missing values are one important problem to be addressed. The missing value problem occurs because many tuples may have no recorded value for several attributes. For example, consider a customer sales database consisting of a whole bunch of records (let's say around 100,000) where some of the records have certain fields missing; say, customer income may be missing. The goal is to find a way to predict what the missing data values should be (so that they can be filled in) based on the existing data.

Missing data may be due to the following reasons:

- Equipment malfunction
- Data were inconsistent with other recorded data and thus deleted
- Data were not entered due to misunderstanding
- Certain data may not have been considered important at the time of entry
- History or changes of the data were not registered

How to Handle Missing Values?

Dealing with missing values is a regular question that has to do with the actual meaning of the data. There are various methods for handling missing entries:

1. Ignore the data row. One solution is to just ignore the entire data row. This is generally done when the class label is missing (assuming the data mining goal is classification), or when many attributes are missing from the row (not just one). However, if the percentage of such rows is high, we will definitely get poor performance.

2. Use a global constant to fill in the missing values. We can fill in a global constant such as "unknown", "N/A", or minus infinity. This is done because at times it just does not make sense to try to predict the missing value. For example, if in the customer sales database the office address is missing for some records, filling it in does not make much sense. This method is simple but not foolproof.

3. Use the attribute mean. If, say, the average income of a family is X, you can use that value to replace missing income values in the customer sales database.

4. Use the attribute mean for all samples belonging to the same class. Let's say you have a car pricing database that, among other things, classifies cars as "Luxury" and "Low budget", and you are dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you would get if you also factored in the low budget cars.

5. Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, clustering algorithms, and so on.
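Here is a small sketch of strategies 2 to 4 above, assuming a hypothetical sales table with income, office_address, and segment (class label) columns; the data and column names are illustrative only.

```python
import pandas as pd

# Illustrative data only; column names and values are assumptions.
df = pd.DataFrame({
    "segment":        ["Luxury", "Luxury", "Low budget", "Low budget"],
    "income":         [90000.0, None, 30000.0, None],
    "office_address": [None, "12 High St", None, "3 Main Rd"],
})

# Strategy 2: fill with a global constant where prediction makes little sense
df["office_address"] = df["office_address"].fillna("unknown")

# Strategy 3: fill with the overall attribute mean (60000 for this toy data)
df["income_overall_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 4: fill with the mean of samples in the same class (segment)
df["income_class_mean"] = df.groupby("segment")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```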
2.2.1.6 Noisy Data

Noise can be defined as a random error or variance in a measured variable. Due to this randomness it is very difficult to follow a fixed strategy for removing noise from data. Real-world data are not always faultless; they can suffer from corruption which may impact the interpretations of the data, the models created from the data, and the decisions made based on the data. Incorrect attribute values may be present because of:

- Faulty data collection instruments
- Data entry problems
- Duplicate records
- Incomplete data
- Inconsistent data
- Incorrect processing
- Data transmission problems
- Technology limitations
- Inconsistency in naming conventions
- Outliers

How to Handle Noisy Data?

The methods for removing noise from data are as follows:

1. Binning: first sort the data and partition it into (equal-frequency) bins, then smooth it by bin means, bin medians, bin boundaries, etc.
2. Regression: smoothing is done by fitting the data to regression functions.
3. Clustering: clustering detects and removes outliers from the data.
4. Combined computer and human inspection: the computer detects suspicious values, which are then checked by human experts (e.g., to deal with possible outliers).

These methods are explained in more detail below.

Binning: a data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For instance, age can be changed to bins such as "20 or under", "21-40", "41-65", and "over 65". Binning methods smooth a sorted data set by consulting the values around each value; this is therefore called local smoothing. Consider the binning methods and the example below.

Binning Methods

- Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / N. This is the most straightforward method, but outliers may dominate the result and skewed data are not handled well.
- Equal-depth (frequency) partitioning: divides the range (values of a given attribute) into N intervals, each containing approximately the same number of samples (elements). It gives good data scaling, but managing categorical attributes can be tricky.
- Smoothing by bin means: each bin value is replaced by the mean of the values in its bin.
- Smoothing by bin medians: each bin value is replaced by the median of the values in its bin.
- Smoothing by bin boundaries: each bin value is replaced by the closest boundary value of its bin.

Example: let the sorted data for price (in dollars) be 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34.

Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 (for example, the mean of 4, 8, 9, 15 is 9)
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
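The equal-frequency binning example above can be reproduced with a short script; note that the exact bin means are 9, 22.75, and 29.25, which the text rounds to 9, 23, and 29.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(np.sort(prices), 3)          # equal-depth (frequency) bins

# Smoothing by bin means
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: each value goes to the nearer of its bin's min/max
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]

print([b.tolist() for b in bins])
print([m.tolist() for m in by_means])
print([s.tolist() for s in by_bounds])
```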
Regression: regression is a data mining technique used to fit an equation to a data set. The simplest form is linear regression, which uses the formula of a straight line (y = b + wx) and determines the suitable values of b and w to predict the value of y based upon a given value of x. More sophisticated techniques, such as multiple regression, permit the use of more than one input variable and allow the fitting of more complex models, such as a quadratic equation. Regression is further described in a subsequent chapter when discussing prediction.

Clustering: clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. These algorithms automatically partition the data space into a set of regions or clusters. The goal of the process is to find all sets of similar examples in the data, in some optimal fashion. (A figure, not reproduced here, would show three clusters; values that fall outside the clusters are outliers.)

Combined computer and human inspection: these methods find suspicious values using computer programs, and the values are then verified by human experts. By this process all outliers are checked.

2.2.1.7 Data Cleaning as a Process

Data cleaning is the process of detecting, diagnosing, and editing data. It is a three-stage method involving a repeated cycle of screening, diagnosing, and editing of suspected data abnormalities. Many data errors are detected incidentally during study activities; however, it is more efficient to discover inconsistencies by actively searching for them in a planned manner. It is not always immediately clear whether a data point is erroneous, and many times it requires careful examination. Likewise, missing values require additional checks. Therefore, predefined rules for dealing with errors and with true missing and extreme values are part of good practice. One can monitor for suspect features in survey questionnaires, databases, or analysis data.

In small studies, with the examiner intimately involved at all stages, there may be little or no difference between a database and an analysis data set. During as well as after treatment, the diagnostic and treatment phases of cleaning need insight into the sources and types of errors at all stages of the study. The data flow concept is therefore crucial in this respect. After measurement, the research data go through repeated steps of being entered into information carriers, extracted, transferred to other carriers, edited, selected, transformed, summarized, and presented. It is essential to understand that errors can occur at any stage of the data flow, including during data cleaning itself. Most of these problems are due to human error.

Inaccuracy of a single data point or measurement may be tolerable and attributable to the inherent technical error of the measurement device. Therefore the process of data cleaning must focus on those errors that are beyond small technical variations and that form a major shift within or beyond the population distribution. In turn, it must be based on an understanding of technical errors and expected ranges of normal values. Some errors deserve higher priority, but which ones are most significant is highly study-specific. For instance, in most medical epidemiological studies, errors that need to be cleaned at all costs include missing gender, gender misspecification, birth date or examination date errors, duplication or merging of records, and biologically impossible results. As another example, in nutrition studies date errors lead to age errors, which in turn lead to errors in weight-for-age scoring and, further, to misclassification of subjects as under- or overweight. Errors of sex and date are particularly important because they contaminate derived variables. Prioritization is essential if the study is under time pressure or if resources for data cleaning are limited.

2.2.2 Data Integration

Data integration is the process of taking data from one or more sources and mapping it, field by field, onto a new data structure. The idea is to combine data from multiple sources into a coherent form. Many data mining projects require data from multiple sources because:

- Data may be distributed over different databases or data warehouses (for example, an epidemiological study that needs information about hospital admissions and car accidents).
- Sometimes data may be required from different geographic locations, or there may be a need for historical data (e.g. integrating historical data into a new data warehouse).
- There may be a need to enhance the data with additional (external) data, to improve data mining precision.

2.2.2.1 Data Integration Issues

There are a number of issues in data integration. Imagine two database tables, Table 1 and Table 2 (not reproduced here). In integrating these two tables a variety of issues are involved, such as:

1. The same attribute may have different names (for example, "Name" and "Given Name" are the same attribute with different names).
2. An attribute may be derived from another (for example, the attribute "Age" is derived from the attribute "DOB").
3. Attributes might be redundant (for example, the attribute "PID" is redundant).
4. Values in attributes might be different (for example, for PID 4791 the values in the second and third fields differ between the two tables).
5. Duplicate records may exist under different keys (there is a possibility of replication of the same record with different key values).

Therefore schema integration and object matching can be tricky. The question is how equivalent entities from different sources are matched; this is known as the entity identification problem. Conflicts have to be detected and resolved. Integration becomes easier if unique entity keys are available in all the data sets (or tables) to be linked. Metadata can help in schema integration (metadata for each attribute includes, for example, the name, meaning, data type, and range of values permitted for the attribute).

2.2.2.2 Redundancy

Redundancy is another important issue in data integration. Two given attributes (such as DOB and Age in the tables above) may be redundant if one is derived from the other attribute or set of attributes. Inconsistencies in attribute or dimension naming can also lead to redundancies in the given data sets.

Handling Redundant Data

We can handle data redundancy problems in the following ways:

- Use correlation analysis.
- Consider different codings and representations (e.g. metric vs. imperial measures).
- Careful (manual) integration of the data can reduce or prevent redundancies (and inconsistencies).
- De-duplication (also called internal data linkage), used when no unique entity keys are available, by analysing the values in attributes to find duplicates.
- Process redundant and inconsistent data (easy if the values are the same): delete one of the values, average the values (only for numerical attributes), or take the majority value (if there are more than two duplicates and some values are the same).

Correlation analysis (using Pearson's product moment coefficient) is explained in detail here. Some redundancies can be detected by correlation analysis: given two attributes, such analysis can measure how strongly one attribute implies the other. For numerical attributes we can compute the correlation coefficient of two attributes A and B to evaluate the correlation between them:

    r(A,B) = ( Σ(AB) - n·mean(A)·mean(B) ) / ( n·σ_A·σ_B )

where

- n is the number of tuples,
- mean(A) and mean(B) are the respective means of A and B,
- σ_A and σ_B are the respective standard deviations of A and B, and
- Σ(AB) is the sum of the AB cross-products.

The value of r(A,B) lies between -1 and +1:

a. If r(A,B) is greater than zero, A and B are positively correlated, meaning that the values of one attribute increase as the values of the other increase.
b. If r(A,B) is equal to zero, A and B are independent of each other and there is no correlation between them.
c. If r(A,B) is less than zero, A and B are negatively correlated: if the value of one attribute increases, the value of the other decreases. This means that each attribute discourages the other.

It is important to note that correlation does not imply causality. That is, if A and B are correlated, this does not necessarily mean that A causes B or that B causes A. For example, in analyzing a demographic database we may find that the attributes representing the number of accidents and the number of car thefts in a region are correlated. This does not mean that one causes the other; both may be related to a third attribute, namely population.
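As a quick numerical check of the correlation formula above, the following sketch computes r(A,B) by hand for two made-up attributes and compares it with NumPy's built-in Pearson correlation; the values of A and B are illustrative only.

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # made-up attribute values
B = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(A)

# r(A,B) = (sum(AB) - n * mean(A) * mean(B)) / (n * std(A) * std(B))
# np.std defaults to the population standard deviation, matching the formula.
r_manual = (np.sum(A * B) - n * A.mean() * B.mean()) / (n * A.std() * B.std())

# Cross-check with NumPy's built-in Pearson correlation
r_builtin = np.corrcoef(A, B)[0, 1]

print(round(r_manual, 4), round(r_builtin, 4))   # both 0.8 for this data
```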
For discrete (categorical) data, a correlation relationship between two attributes A and B can be discovered by a χ² (chi-square) test. Suppose A has c distinct values a1, a2, ..., ac and B has r distinct values b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. The χ² statistic is computed as:

    χ² = Σ_i Σ_j ( O_ij - E_ij )² / E_ij

where

- O_ij is the observed frequency (i.e. the actual count) of the joint event (A = a_i, B = b_j), and
- E_ij is the expected frequency of that joint event, computed as E_ij = (count of tuples with A = a_i × count of tuples with B = b_j) / N, i.e. the product of the corresponding column and row totals divided by N, the total number of data tuples.

The larger the χ² value, the more likely the variables are related. The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

Chi-Square Calculation: An Example

Suppose a group of 1,500 people was surveyed. The gender of each person was noted, and each person was polled for their preferred type of reading material: fiction or non-fiction. The observed frequency of each possible joint event is summarized in the following table (numbers in parentheses are expected frequencies). Calculate chi-square.

                  Male        Female       Sum (row)
  Fiction         250 (90)    200 (360)    450
  Non-fiction     50 (210)    1000 (840)   1050
  Sum (col.)      300         1200         1500

Here E11 = count(male) × count(fiction) / N = 300 × 450 / 1500 = 90, and so on. The computed statistic is χ² = (250-90)²/90 + (200-360)²/360 + (50-210)²/210 + (1000-840)²/840 ≈ 507.9. For this table the degrees of freedom are (2-1)×(2-1) = 1, since the table is 2×2. For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ² distribution available in any statistics textbook). Since the computed value is far above this threshold, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are strongly correlated for the given group.
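The chi-square arithmetic in the example above can be verified with a few lines of NumPy:

```python
import numpy as np

observed = np.array([[250, 200],     # fiction:      male, female
                     [50, 1000]])    # non-fiction:  male, female

row_totals = observed.sum(axis=1, keepdims=True)   # 450, 1050
col_totals = observed.sum(axis=0, keepdims=True)   # 300, 1200
n = observed.sum()                                 # 1500

expected = row_totals @ col_totals / n             # [[90, 360], [210, 840]]
chi2 = ((observed - expected) ** 2 / expected).sum()

print(round(chi2, 1))   # about 507.9, far above the 10.828 threshold for 1 df
```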
Duplication must also be detected at the tuple level. The use of denormalized tables is another source of redundancy. Redundancies may further lead to data inconsistencies (due to updating some copies but not others).

2.2.2.3 Detection and Resolution of Data Value Conflicts

Another significant issue in data integration is the detection and resolution of data value conflicts: for the same entity, attribute values from different sources may differ. For example, weight can be stored in metric units in one source and in British imperial units in another. For instance, for a hotel cha
Integrity: Integrity is an aggregated value over the criteria of completeness and validity. Completeness: completeness is achieved by correcting data containing anomalies. Validity: Validity is approximated by the amount of data satisfying integrity constraints. Consistency: consistency concerns contradictions and syntactical anomalies in data. Uniformity: it is directly related to irregularities in data. Density: The density is the quotient of missing values in the data and the number of total values ought to be known. Uniqueness: uniqueness is related to the number of duplicates present in the data. 2.2.1.1 Terms Related to Data Cleaning Data cleaning: data cleaning is the process of detecting, diagnosing, and editing damaged data. Data editing: data editing means changing the value of data which are incorrect. Data flow: data flow is defined as passing of recorded information through succeeding information carriers. Inliers: Inliers are data values falling inside the projected range. Outlier: outliers are data value falling outside the projected range. Robust estimation: evaluation of statistical parameters, using methods that are less responsive to the effect of outliers than more conventional methods are called robust method. 2.2.1.2 Definition: Data Cleaning Data cleaning is a process used to identify imprecise, incomplete, or irrational data and then improving the quality through correction of detected errors and omissions. This process may include format checks Completeness checks Reasonableness checks Limit checks Review of the data to identify outliers or other errors Assessment of data by subject area experts (e.g. taxonomic specialists). By this process suspected records are flagged, documented and checked subsequently. And finally these suspected records can be corrected. Sometimes validation checks also involve checking for compliance against applicable standards, rules, and conventions. The general framework for data cleaning given as: Define and determine error types; Search and identify error instances; Correct the errors; Document error instances and error types; and Modify data entry procedures to reduce future errors. Data cleaning process is referred by different people by a number of terms. It is a matter of preference what one uses. These terms include: Error Checking, Error Detection, Data Validation, Data Cleaning, Data Cleansing, Data Scrubbing and Error Correction. We use Data Cleaning to encompass three sub-processes, viz. Data checking and error detection; Data validation; and Error correction. A fourth improvement of the error prevention processes could perhaps be added. 2.2.1.3 Problems with Data Here we just note some key problems with data Missing data : This problem occur because of two main reasons Data are absent in source where it is expected to be present. Some times data is present are not available in appropriately form Detecting missing data is usually straightforward and simpler. Erroneous data: This problem occurs when a wrong value is recorded for a real world value. Detection of erroneous data can be quite difficult. (For instance the incorrect spelling of a name) Duplicated data : This problem occur because of two reasons Repeated entry of same real world entity with some different values Some times a real world entity may have different identifications. Repeat records are regular and frequently easy to detect. The different identification of the same real world entities can be a very hard problem to identify and solve. 
Heterogeneities: When data from different sources are brought together in one analysis problem heterogeneity may occur. Heterogeneity could be Structural heterogeneity arises when the data structures reflect different business usage Semantic heterogeneity arises when the meaning of data is different n each system that is being combined Heterogeneities are usually very difficult to resolve since because they usually involve a lot of contextual data that is not well defined as metadata. Information dependencies in the relationship between the different sets of attribute are commonly present. Wrong cleaning mechanisms can further damage the information in the data. Various analysis tools handle these problems in different ways. Commercial offerings are available that assist the cleaning process, but these are often problem specific. Uncertainty in information systems is a well-recognized hard problem. In following a very simple examples of missing and erroneous data is shown Extensive support for data cleaning must be provided by data warehouses. Data warehouses have high probability of â€Å"dirty data† since they load and continuously refresh huge amounts of data from a variety of sources. Since these data warehouses are used for strategic decision making therefore the correctness of their data is important to avoid wrong decisions. The ETL (Extraction, Transformation, and Loading) process for building a data warehouse is illustrated in following . Data transformations are related with schema or data translation and integration, and with filtering and aggregating data to be stored in the data warehouse. All data cleaning is classically performed in a separate data performance area prior to loading the transformed data into the warehouse. A large number of tools of varying functionality are available to support these tasks, but often a significant portion of the cleaning and transformation work has to be done manually or by low-level programs that are difficult to write and maintain. A data cleaning method should assure following: It should identify and eliminate all major errors and inconsistencies in an individual data sources and also when integrating multiple sources. Data cleaning should be supported by tools to bound manual examination and programming effort and it should be extensible so that can cover additional sources. It should be performed in association with schema related data transformations based on metadata. Data cleaning mapping functions should be specified in a declarative way and be reusable for other data sources. 2.2.1.4 Data Cleaning: Phases 1. Analysis: To identify errors and inconsistencies in the database there is a need of detailed analysis, which involves both manual inspection and automated analysis programs. This reveals where (most of) the problems are present. 2. Defining Transformation and Mapping Rules: After discovering the problems, this phase are related with defining the manner by which we are going to automate the solutions to clean the data. We will find various problems that translate to a list of activities as a result of analysis phase. Example: Remove all entries for J. Smith because they are duplicates of John Smith Find entries with `bule in colour field and change these to `blue. Find all records where the Phone number field does not match the pattern (NNNNN NNNNNN). Further steps for cleaning this data are then applied. Etc †¦ 3. Verification: In this phase we check and assess the transformation plans made in phase- 2. 
Without this step, we may end up making the data dirtier rather than cleaner. Since data transformation is the main step that actually changes the data itself so there is a need to be sure that the applied transformations will do it correctly. Therefore test and examine the transformation plans very carefully. Example: Let we have a very thick C++ book where it says strict in all the places where it should say struct 4. Transformation: Now if it is sure that cleaning will be done correctly, then apply the transformation verified in last step. For large database, this task is supported by a variety of tools Backflow of Cleaned Data: In a data mining the main objective is to convert and move clean data into target system. This asks for a requirement to purify legacy data. Cleansing can be a complicated process depending on the technique chosen and has to be designed carefully to achieve the objective of removal of dirty data. Some methods to accomplish the task of data cleansing of legacy system include: n Automated data cleansing n Manual data cleansing n The combined cleansing process 2.2.1.5 Missing Values Data cleaning addresses a variety of data quality problems, including noise and outliers, inconsistent data, duplicate data, and missing values. Missing values is one important problem to be addressed. Missing value problem occurs because many tuples may have no record for several attributes. For Example there is a customer sales database consisting of a whole bunch of records (lets say around 100,000) where some of the records have certain fields missing. Lets say customer income in sales data may be missing. Goal here is to find a way to predict what the missing data values should be (so that these can be filled) based on the existing data. Missing data may be due to following reasons Equipment malfunction Inconsistent with other recorded data and thus deleted Data not entered due to misunderstanding Certain data may not be considered important at the time of entry Not register history or changes of the data How to Handle Missing Values? Dealing with missing values is a regular question that has to do with the actual meaning of the data. There are various methods for handling missing entries 1. Ignore the data row. One solution of missing values is to just ignore the entire data row. This is generally done when the class label is not there (here we are assuming that the data mining goal is classification), or many attributes are missing from the row (not just one). But if the percentage of such rows is high we will definitely get a poor performance. 2. Use a global constant to fill in for missing values. We can fill in a global constant for missing values such as unknown, N/A or minus infinity. This is done because at times is just doesnt make sense to try and predict the missing value. For example if in customer sales database if, say, office address is missing for some, filling it in doesnt make much sense. This method is simple but is not full proof. 3. Use attribute mean. Let say if the average income of a a family is X you can use that value to replace missing income values in the customer sales database. 4. Use attribute mean for all samples belonging to the same class. Lets say you have a cars pricing DB that, among other things, classifies cars to Luxury and Low budget and youre dealing with missing values in the cost field. 
Replacing missing cost of a luxury car with the average cost of all luxury cars is probably more accurate then the value youd get if you factor in the low budget 5. Use data mining algorithm to predict the value. The value can be determined using regression, inference based tools using Bayesian formalism, decision trees, clustering algorithms etc. 2.2.1.6 Noisy Data Noise can be defined as a random error or variance in a measured variable. Due to randomness it is very difficult to follow a strategy for noise removal from the data. Real world data is not always faultless. It can suffer from corruption which may impact the interpretations of the data, models created from the data, and decisions made based on the data. Incorrect attribute values could be present because of following reasons Faulty data collection instruments Data entry problems Duplicate records Incomplete data: Inconsistent data Incorrect processing Data transmission problems Technology limitation. Inconsistency in naming convention Outliers How to handle Noisy Data? The methods for removing noise from data are as follows. 1. Binning: this approach first sort data and partition it into (equal-frequency) bins then one can smooth it using- Bin means, smooth using bin median, smooth using bin boundaries, etc. 2. Regression: in this method smoothing is done by fitting the data into regression functions. 3. Clustering: clustering detect and remove outliers from the data. 4. Combined computer and human inspection: in this approach computer detects suspicious values which are then checked by human experts (e.g., this approach deal with possible outliers).. Following methods are explained in detail as follows: Binning: Data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For instance, age can be changed to bins such as 20 or under, 21-40, 41-65 and over 65. Binning methods smooth a sorted data set by consulting values around it. This is therefore called local smoothing. Let consider a binning example Binning Methods n Equal-width (distance) partitioning Divides the range into N intervals of equal size: uniform grid if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. The most straightforward, but outliers may dominate presentation Skewed data is not handled well n Equal-depth (frequency) partitioning 1. It divides the range (values of a given attribute) into N intervals, each containing approximately same number of samples (elements) 2. Good data scaling 3. Managing categorical attributes can be tricky. n Smooth by bin means- Each bin value is replaced by the mean of values n Smooth by bin medians- Each bin value is replaced by the median of values n Smooth by bin boundaries Each bin value is replaced by the closest boundary value Example Let Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 n Partition into equal-frequency (equi-depth) bins: o Bin 1: 4, 8, 9, 15 o Bin 2: 21, 21, 24, 25 o Bin 3: 26, 28, 29, 34 n Smoothing by bin means: o Bin 1: 9, 9, 9, 9 ( for example mean of 4, 8, 9, 15 is 9) o Bin 2: 23, 23, 23, 23 o Bin 3: 29, 29, 29, 29 n Smoothing by bin boundaries: o Bin 1: 4, 4, 4, 15 o Bin 2: 21, 21, 25, 25 o Bin 3: 26, 26, 26, 34 Regression: Regression is a DM technique used to fit an equation to a dataset. 
The simplest form of regression is linear regression which uses the formula of a straight line (y = b+ wx) and determines the suitable values for b and w to predict the value of y based upon a given value of x. Sophisticated techniques, such as multiple regression, permit the use of more than one input variable and allow for the fitting of more complex models, such as a quadratic equation. Regression is further described in subsequent chapter while discussing predictions. Clustering: clustering is a method of grouping data into different groups , so that data in each group share similar trends and patterns. Clustering constitute a major class of data mining algorithms. These algorithms automatically partitions the data space into set of regions or cluster. The goal of the process is to find all set of similar examples in data, in some optimal fashion. Following shows three clusters. Values that fall outsid e the cluster are outliers. 4. Combined computer and human inspection: These methods find the suspicious values using the computer programs and then they are verified by human experts. By this process all outliers are checked. 2.2.1.7 Data cleaning as a process Data cleaning is the process of Detecting, Diagnosing, and Editing Data. Data cleaning is a three stage method involving repeated cycle of screening, diagnosing, and editing of suspected data abnormalities. Many data errors are detected by the way during study activities. However, it is more efficient to discover inconsistencies by actively searching for them in a planned manner. It is not always right away clear whether a data point is erroneous. Many times it requires careful examination. Likewise, missing values require additional check. Therefore, predefined rules for dealing with errors and true missing and extreme values are part of good practice. One can monitor for suspect features in survey questionnaires, databases, or analysis data. In small studies, with the examiner intimately involved at all stages, there may be small or no difference between a database and an analysis dataset. During as well as after treatment, the diagnostic and treatment phases of cleaning need insight into the sources and types of errors at all stages of the study. Data flow concept is therefore crucial in this respect. After measurement the research data go through repeated steps of- entering into information carriers, extracted, and transferred to other carriers, edited, selected, transformed, summarized, and presented. It is essential to understand that errors can occur at any stage of the data flow, including during data cleaning itself. Most of these problems are due to human error. Inaccuracy of a single data point and measurement may be tolerable, and associated to the inherent technological error of the measurement device. Therefore the process of data clenaning mus focus on those errors that are beyond small technical variations and that form a major shift within or beyond the population distribution. In turn, it must be based on understanding of technical errors and expected ranges of normal values. Some errors are worthy of higher priority, but which ones are most significant is highly study-specific. For instance in most medical epidemiological studies, errors that need to be cleaned, at all costs, include missing gender, gender misspecification, birth date or examination date errors, duplications or merging of records, and biologically impossible results. 
Another example is in nutrition studies, date errors lead to age errors, which in turn lead to errors in weight-for-age scoring and, further, to misclassification of subjects as under- or overweight. Errors of sex and date are particularly important because they contaminate derived variables. Prioritization is essential if the study is under time pressures or if resources for data cleaning are limited. 2.2.2 Data Integration This is a process of taking data from one or more sources and mapping it, field by field, onto a new data structure. Idea is to combine data from multiple sources into a coherent form. Various data mining projects requires data from multiple sources because n Data may be distributed over different databases or data warehouses. (for example an epidemiological study that needs information about hospital admissions and car accidents) n Sometimes data may be required from different geographic distributions, or there may be need for historical data. (e.g. integrate historical data into a new data warehouse) n There may be a necessity of enhancement of data with additional (external) data. (for improving data mining precision) 2.2.2.1 Data Integration Issues There are number of issues in data integrations. Consider two database tables. Imagine two database tables Database Table-1 Database Table-2 In integration of there two tables there are variety of issues involved such as 1. The same attribute may have different names (for example in above tables Name and Given Name are same attributes with different names) 2. An attribute may be derived from another (for example attribute Age is derived from attribute DOB) 3. Attributes might be redundant( For example attribute PID is redundant) 4. Values in attributes might be different (for example for PID 4791 values in second and third field are different in both the tables) 5. Duplicate records under different keys( there is a possibility of replication of same record with different key values) Therefore schema integration and object matching can be trickier. Question here is how equivalent entities from different sources are matched? This problem is known as entity identification problem. Conflicts have to be detected and resolved. Integration becomes easier if unique entity keys are available in all the data sets (or tables) to be linked. Metadata can help in schema integration (example of metadata for each attribute includes the name, meaning, data type and range of values permitted for the attribute) 2.2.2.1 Redundancy Redundancy is another important issue in data integration. Two given attribute (such as DOB and age for instance in give table) may be redundant if one is derived form the other attribute or set of attributes. Inconsistencies in attribute or dimension naming can lead to redundancies in the given data sets. Handling Redundant Data We can handle data redundancy problems by following ways n Use correlation analysis n Different coding / representation has to be considered (e.g. 
metric / imperial measures) n Careful (manual) integration of the data can reduce or prevent redundancies (and inconsistencies) n De-duplication (also called internal data linkage) o If no unique entity keys are available o Analysis of values in attributes to find duplicates n Process redundant and inconsistent data (easy if values are the same) o Delete one of the values o Average values (only for numerical attributes) o Take majority values (if more than 2 duplicates and some values are the same) Correlation analysis is explained in detail here. Correlation analysis (also called Pearsons product moment coefficient): some redundancies can be detected by using correlation analysis. Given two attributes, such analysis can measure how strong one attribute implies another. For numerical attribute we can compute correlation coefficient of two attributes A and B to evaluate the correlation between them. This is given by Where n n is the number of tuples, n and are the respective means of A and B n ÏÆ'A and ÏÆ'B are the respective standard deviation of A and B n ÃŽ £(AB) is the sum of the AB cross-product. a. If -1 b. If rA, B is equal to zero it indicates A and B are independent of each other and there is no correlation between them. c. If rA, B is less than zero then A and B are negatively correlated. , where if value of one attribute increases value of another attribute decreases. This means that one attribute discourages another attribute. It is important to note that correlation does not imply causality. That is, if A and B are correlated, this does not essentially mean that A causes B or that B causes A. for example in analyzing a demographic database, we may find that attribute representing number of accidents and the number of car theft in a region are correlated. This does not mean that one is related to another. Both may be related to third attribute, namely population. For discrete data, a correlation relation between two attributes, can be discovered by a χ ²(chi-square) test. Let A has c distinct values a1,a2,†¦Ã¢â‚¬ ¦ac and B has r different values namely b1,b2,†¦Ã¢â‚¬ ¦br The data tuple described by A and B are shown as contingency table, with c values of A (making up columns) and r values of B( making up rows). Each and every (Ai, Bj) cell in table has. X^2 = sum_{i=1}^{r} sum_{j=1}^{c} {(O_{i,j} E_{i,j})^2 over E_{i,j}} . Where n Oi, j is the observed frequency (i.e. actual count) of joint event (Ai, Bj) and n Ei, j is the expected frequency which can be computed as E_{i,j}=frac{sum_{k=1}^{c} O_{i,k} sum_{k=1}^{r} O_{k,j}}{N} , , Where n N is number of data tuple n Oi,k is number of tuples having value ai for A n Ok,j is number of tuples having value bj for B The larger the χ ² value, the more likely the variables are related. The cells that contribute the most to the χ ² value are those whose actual count is very different from the expected count Chi-Square Calculation: An Example Suppose a group of 1,500 people were surveyed. The gender of each person was noted. Each person has polled their preferred type of reading material as fiction or non-fiction. The observed frequency of each possible joint event is summarized in following table.( number in parenthesis are expected frequencies) . Calculate chi square. Play chess Not play chess Sum (row) Like science fiction 250(90) 200(360) 450 Not like science fiction 50(210) 1000(840) 1050 Sum(col.) 
For discrete (categorical) data, a correlation relationship between two attributes A and B can be discovered by a χ² (chi-square) test. Suppose A has c distinct values a1, a2, ..., ac and B has r distinct values b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes the value ai and attribute B takes the value bj. Then

    χ² = Σ_{i=1..c} Σ_{j=1..r} (o_ij - e_ij)² / e_ij

where
- o_ij is the observed frequency (i.e. the actual count) of the joint event (Ai, Bj), and
- e_ij is the expected frequency, computed as e_ij = count(A = ai) × count(B = bj) / N, where N is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B.

The larger the χ² value, the more likely the variables are related. The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

Chi-Square Calculation: An Example
Suppose a group of 1,500 people was surveyed. The gender of each person was noted, and each person was polled as to whether his or her preferred type of reading material was fiction or non-fiction. The observed frequency of each joint event is summarized in the following contingency table (the numbers in parentheses are the expected frequencies). Calculate χ².

                 Male        Female       Sum (row)
Fiction          250 (90)     200 (360)      450
Non-fiction       50 (210)   1000 (840)     1050
Sum (col.)       300         1200           1500

The expected frequencies are obtained from the marginal totals, e.g. e11 = count(male) × count(fiction) / N = 300 × 450 / 1500 = 90, and so on. Substituting into the formula gives

    χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840
       = 284.44 + 121.90 + 71.11 + 30.48 ≈ 507.93

For this 2×2 table the degrees of freedom are (2 - 1)(2 - 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ² distribution available in any statistics textbook). Since the computed value is far above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are strongly correlated for the given group.

Duplication must also be detected at the tuple level. The use of denormalized tables is another source of redundancy, and redundancy may further lead to data inconsistencies (when some copies are updated but not others).

2.2.2.3 Detection and Resolution of Data Value Conflicts
Another significant issue in data integration is the detection and resolution of data value conflicts: for the same real-world entity, attribute values from different sources may differ. For example, a weight attribute may be stored in metric units in one source and in British imperial units in another. Similarly, for a hotel chain, the price of rooms in different cities may be quoted not only in different currencies but may also cover different services and taxes.
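As a quick numerical check of the χ² example above (a sketch; SciPy is assumed to be available), the expected counts and the statistic can be reproduced in Python:

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the gender / reading-preference example.
observed = np.array([[250,  200],    # fiction:     male, female
                     [ 50, 1000]])   # non-fiction: male, female

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)           # [[ 90. 360.] [210. 840.]]
print(round(chi2, 2))     # 507.93
print(dof)                # 1
print(p_value)            # far below 0.001, so independence is rejected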

Sunday, January 19, 2020

Kolb Learning Style Inventory

The Kolb Learning Style Inventory - Version 3.1: 2005 Technical Specifications
Alice Y. Kolb, Experience Based Learning Systems, Inc.
David A. Kolb, Case Western Reserve University
May 15, 2005

Abstract
The Kolb Learning Style Inventory Version 3.1 (KLSI 3.1), revised in 2005, is the latest revision of the original Learning Style Inventory developed by David A. Kolb. Like its predecessors, KLSI 3.1 is based on experiential learning theory (Kolb 1984) and is designed to help individuals identify the way they learn from experience. This revision includes new norms that are based on a larger, more diverse, and more representative sample of 6977 LSI users. The format, items, scoring and interpretative booklet remain identical with KLSI 3. The technical specifications are designed to adhere to the standards for educational and psychological testing developed by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (1999). Section 1 of the technical specifications describes the conceptual foundations of the KLSI 3.1 in the theory of experiential learning (ELT). Section 2 provides a description of the inventory that includes its purpose, history, and format. Section 3 describes the characteristics of the KLSI 3.1 normative sample. Section 4 includes internal reliability and test-retest reliability studies of the inventory. Section 5 provides information about research on the internal and external validity of the instrument. Internal validity studies of the structure of the KLSI 3.1 using correlation and factor analysis are reported. External validity includes research on demographics, educational specialization, concurrent validity with other experiential learning assessment instruments, aptitude test performance, academic performance, experiential learning in teams, and educational applications. © Copyright 2005: Experience Based Learning Systems, Inc. All rights reserved.

1. CONCEPTUAL FOUNDATION: EXPERIENTIAL LEARNING THEORY AND INDIVIDUAL LEARNING STYLES
The Kolb Learning Style Inventory differs from other tests of learning style and personality used in education by being based on a comprehensive theory of learning and development. Experiential learning theory (ELT) draws on the work of prominent twentieth-century scholars who gave experience a central role in their theories of human learning and development, notably John Dewey, Kurt Lewin, Jean Piaget, William James, Carl Jung, Paulo Freire, Carl Rogers, and others, to develop a holistic model of the experiential learning process and a multi-linear model of adult development. The theory, described in detail in Experiential Learning: Experience as the Source of Learning and Development (Kolb 1984), is built on six propositions that are shared by these scholars.
1. Learning is best conceived as a process, not in terms of outcomes. To improve learning in higher education, the primary focus should be on engaging students in a process that best enhances their learning, a process that includes feedback on the effectiveness of their learning efforts. "... education must be conceived as a continuing reconstruction of experience: ... the process and goal of education are one and the same thing." (Dewey 1897: 79)
2. All learning is relearning. Learning is best facilitated by a process that draws out the students' beliefs and ideas about a topic so that they can be examined, tested, and integrated with new, more refined ideas.
3.
Learning requires the resolution of conflicts between dialectically opposed modes of adaptation to the world. Conflict, differences, and disagreement are what drive the learning process. In the process of learning, one is called upon to move back and forth between opposing modes of reflection and action and of feeling and thinking.
4. Learning is a holistic process of adaptation to the world. It is not just the result of cognition but involves the integrated functioning of the total person: thinking, feeling, perceiving, and behaving.
5. Learning results from synergetic transactions between the person and the environment. In Piaget's terms, learning occurs through equilibration of the dialectic processes of assimilating new experiences into existing concepts and accommodating existing concepts to new experience.
6. Learning is the process of creating knowledge. ELT proposes a constructivist theory of learning whereby social knowledge is created and recreated in the personal knowledge of the learner. This stands in contrast to the "transmission" model on which much current educational practice is based, where pre-existing fixed ideas are transmitted to the learner.

ELT defines learning as "the process whereby knowledge is created through the transformation of experience. Knowledge results from the combination of grasping and transforming experience" (Kolb 1984: 41). The ELT model portrays two dialectically related modes of grasping experience, Concrete Experience (CE) and Abstract Conceptualization (AC), and two dialectically related modes of transforming experience, Reflective Observation (RO) and Active Experimentation (AE). Experiential learning is a process of constructing knowledge that involves a creative tension among the four learning modes and that is responsive to contextual demands. This process is portrayed as an idealized learning cycle or spiral where the learner "touches all the bases": experiencing, reflecting, thinking, and acting in a recursive process that is responsive to the learning situation and to what is being learned. Immediate or concrete experiences are the basis for observations and reflections. These reflections are assimilated and distilled into abstract concepts from which new implications for action can be drawn. These implications can be actively tested and serve as guides in creating new experiences (Figure 1). ELT proposes that this idealized learning cycle will vary by the individual's learning style and learning context.

Figure 1. The experiential learning cycle (Concrete Experience; Observations and Reflections; Formation of Abstract Concepts and Generalizations; Testing Implications of Concepts in New Situations)

In The Art of Changing the Brain: Enriching Teaching by Exploring the Biology of Learning, James Zull, a biologist and founding director of CWRU's University Center for Innovation in Teaching and Education (UCITE), sees a link between ELT and neuroscience research, suggesting that this process of experiential learning is related to the process of brain functioning as shown in Figure 2. "Put into words, the figure illustrates that concrete experiences come through the sensory cortex, reflective observation involves the integrative cortex at the back, creating new abstract concepts occurs in the frontal integrative cortex, and active testing involves the motor brain. In other words, the learning cycle arises from the structure of the brain." (Zull 2002: 18-19)

Figure 2.
The experiential learning cycle and regions of the cerebral cortex. Reprinted with permission of the author (Zull 2002) ELT posits that learning is the major determinant of human development and that how individuals learn shapes the course of their personal development. Previous research (Kolb 1984) has shown that learning styles are in? uenced by personality type, educational specialization, career choice, and current job role and tasks. Yamazaki (2002, 2004a) has recently identi? d cultural in? uences as well. The ELT developmental model (Kolb 1984) de? nes three stages: (1) acquisition, from birth to adolescence, where basic abilities and cognitive structures develop; (2) specialization, from formal schooling through the early work and personal experiences of adulthood, where social, educational, and organizational socialization forces shape th e development of a particular, specialized learning style; and (3) integration in midcareer and later life, where nondominant modes of learning are expressed in work and personal life.Development through these stages is characterized by increasing complexity and relativism in adapting to the world and by increased integration of the dialectic con? icts between AC and CE and AE and RO. Development is conceived as multi-linear based on an individual’s particular learning style and life path—development of CE increases affective complexity, of RO increases perceptual complexity, of AC increases symbolic complexity, and of AE increases behavioral complexity.The concept of learning style describes individual differences in learning based on the learner’s preference for employing different phases of the learning cycle. Because of our hereditary equipment, our particular life experiences, and the demands of our present environment, we develop a preferred way of choosin g among the four learning modes. We resolve the con? ict between being concrete or abstract and between being active or re? ective in patterned, characteristic ways.Much of the research on ELT has focused on the concept of learning style, using the Learning Style Inventory (LSI) to assess individual learning styles (Kolb 1971, 1985, 1999). While individuals tested on the LSI show many different patterns of scores, previous research with the instrument has identi? ed four learning styles that are associated with different approaches to learning—Diverging, Assimilating, Converging, and Accommodating. The following summary of the four basic learning styles is based on both research and clinical observation of these patterns of LSI scores (Kolb1984, 1999a). LSI Technical Manual An individual with diverging style has CE and RO as dominant learning abilities. People with this learning style are best at viewing concrete situations from many different points of view. It is labeled Di verging because a person with it performs better in situations that call for generation of ideas, such as a brainstorming session. People with a Diverging learning style have broad cultural interests and like to gather information. They are interested in people, tend to be imaginative and emotional, have broad cultural interests, and tend to specialize in the arts.In formal learning situations, people with the Diverging style prefer to work in groups, listening with an open mind to different points of view and receiving personalized feedback. An individual with an assimilating style has AC and RO as dominant learning abilities. 
People with this learning style are best at understanding a wide range of information and putting it into concise, logical form. Individuals with an Assimilating style are less focused on people and more interested in ideas and abstract concepts. Generally, people with this style ? d it more important that a theory have logical soundness than practical value. The Assimilating learning style is important for effectiveness in information and science careers. In formal learning situations, people with this style prefer readings, lectures, exploring analytical models, and having time to think things through. An individual with a converging style has AC and AE as dominant learning abilities. People with this learning style are best at ? nding practical uses for ideas and theories. They have the ability to solve problems and make decisions based on ? ding solutions to questions or problems. Individuals with a Converging learning style prefer to deal with technical tasks and problems rather than with social issues and interpersonal issues. These learning skills are important for effectiveness in specialist and technology careers. In formal learning situations, people with this style prefer to experiment with new ideas, simulations, laboratory assignments, and practical applications. An individual with an accommodating style has CE and AE as do minant learning abilities.People with this learning style have the ability to learn from primarily â€Å"hands-on† experience. They enjoy carrying out plans and involving themselves in new and challenging experiences. Their tendency may be to act on â€Å"gut† feelings rather than on logical analysis. In solving problems, individuals with an Accommodating learning style rely more heavily on people for information than on their own technical analysis. This learning style is important for effectiveness in action-oriented careers such as marketing or sales.In formal learning situations, people with the Accommodating learning style prefer to work with others to get assignments done, to set goals, to do ? eld work, and to test out different approaches to completing a project. 5 FACTORS THAT SHAPE AND INFLUENCE LEARNING STYLES The above patterns of behavior associated with the four basic learning styles are shaped by transactions between people and their environment at ? ve different levels—personality, educational specialization, professional career, current job role, and adaptive competencies.While some have interpreted learning style as a personality variable (Garner 2000; Furnam, Jackson, and Miller 1999), ELT de? nes learning style as a social psychological concept that is only partially determined by personality. Personality exerts a small but pervasive in? uence in nearly all situations; but at the other levels, learning style is in? uenced by increasingly speci? c environmental demands of educational specialization, career, job, and tasks skills. Table 1 summarizes previous research that has identi? ed how learning styles are determined at these various levels. 
Table 1. Relationship Between Learning Styles and Five Levels of Behavior

Behavior level              Diverging                            Assimilating                      Converging                          Accommodating
Personality types           Introverted Feeling                  Introverted Intuition             Extraverted Thinking                Extraverted Sensation
Educational specialization  Arts, English, History, Psychology   Mathematics, Physical Science     Engineering, Medicine               Education, Communication, Nursing
Professional career         Social Service, Arts                 Sciences, Research, Information   Engineering, Medicine, Technology   Sales, Social Service, Education
Current jobs                Personal jobs                        Information jobs                  Technical jobs                      Executive jobs
Adaptive competencies       Valuing skills                       Thinking skills                   Decision skills                     Action skills

Personality Types
Although the learning styles and learning modes proposed by ELT are derived from the works of Dewey, Lewin, and Piaget, many have noted the similarity of these concepts to Carl Jung's descriptions of individuals' preferred ways of adapting to the world. Several research studies relating the LSI to the Myers-Briggs Type Indicator (MBTI) indicate that Jung's Extraversion/Introversion dialectical dimension correlates with the Active/Reflective dialectic of ELT, and that the MBTI Feeling/Thinking dimension correlates with the LSI Concrete Experience/Abstract Conceptualization dimension. The MBTI Sensing type is associated with the LSI Accommodating learning style, and the MBTI Intuitive type with the LSI Assimilating style. MBTI Feeling types correspond to LSI Diverging learning styles, and Thinking types to Converging styles. The above discussion implies that the Accommodating learning style corresponds to the Extraverted Sensing type, and the Converging style to the Extraverted Thinking type. The Assimilating learning style corresponds to the Introverted Intuitive personality type, and the Diverging style to the Introverted Feeling type. Myers' (1962) descriptions of these MBTI types are very similar to the corresponding LSI learning styles as described by ELT (Kolb 1984: 83-85).

Educational Specialization
Early educational experiences shape people's individual learning styles by instilling positive attitudes toward specific sets of learning skills and by teaching students how to learn. Although elementary education is generalized, an increasing process of specialization begins in high school and becomes sharper during the college years. This specialization in the realms of social knowledge influences individuals' orientations toward learning, resulting in particular relations between learning styles and early training in an educational specialty or discipline. For example, people specializing in the arts, history, political science, English, and psychology tend to have Diverging learning styles, while those majoring in more abstract and applied areas such as medicine and engineering have Converging learning styles. Individuals with Accommodating styles often have educational backgrounds in education, communications, and nursing, and those with Assimilating styles in mathematics and the physical sciences.

Professional Career
A third set of factors that shape learning styles stems from professional careers. One's professional career choice not only exposes one to a specialized learning environment, but it also involves a commitment to a generic professional problem, such as social service, that requires a specialized adaptive orientation.
In addition, one becomes a member of a reference group of peers who share a professional mentality and a common set of values and beliefs about how one should behave professionally. This professional orientation shapes learning style through habits acquired in professional training and through the more immediate normative pressures involved in being a competent professional. Research over the years has shown that social service and arts careers attract people with a Diverging learning style. Professions in the sciences and information or research have people with an Assimilating learning style.The Converging learning styles tends to be dominant among professionals in technology-intensive ? elds such as medicine and engineering. Finally, the Accommodating learning style characterizes people with careers in ? elds such as sales, social service , and education. Current Job Role The fourth level of factors in? uencing learning style is the person’s current job role. The task demands and pressures of a job shape a person’s adaptive orientation. Executive jobs, such as general management, that require a strong orientation to task accomplishment and decision making in uncertain emergent circumstances require an Accommodating learning style.Personal jobs, such as counseling and personnel administration, which require the establishment of personal relationships and effective communication with other people, demand a Diverging learning style. Information jobs, such as planning and research, which require data gathering and analysis, as well as conceptual modeling, require an Assimilating learning style. Technical jobs, such as bench engineering and production, require technical and problem-solving skills, which require a convergent learning orientation. Adaptive Competencies The ? fth and most immediate level of for ces that shapes learning style is the speci? c task or problem the person is currently working on. Each task we face requires a corresponding set of skills for effective performance.The effective matching of task demands and personal skills results in an adaptive competence. The Accommodative learning style encompasses a set of competencies that can best be termed Acting skills: Leadership, Initiative, and Action. The Diverging learning style is associated with Valuing skills: Relationship, Helping Others, and Sense Making. The Assimilating learning style is related to Thinking skills: Information Gathering, Information Analysis, and Theory Building. Finally, the Converging learning style is associated with Decision skills like Quantitative Analysis, Use of Technology, and Goal Setting (Kolb1984). 7 2. THE LEARNING STYLE INVENTORY PURPOSE The Learning Style Inventory (LSI) was created to ful? l two purposes: 1. To serve as an educational tool to increase individuals’ understa nding of the process of learning from experience and their unique individual approach to learning. By increasing awareness of how they learn, the aim is to increase learners’ capacity for meta-cognitive control of their learning process, enabling them to monitor and select learning approaches that work best for them in different learning situations. By providing a language for talking about learning styles and the learning process, the inventory can foster conversation among learners and educators about how to create the most effective learning environment for those involved.For this purpose, the inventory is best presented not as a test, but as an experience in understanding how one learns. 
Scores on the inventory should not be interpreted as de? nitive, but as a starting point for exploration of how one learns best. To facilitate this purpose, a self-scoring and interpretation book that explains the experiential learning cycle and the characteristics of the different learni ng styles, along with scoring and pro? ling instructions, is included with the inventory. 2. To provide a research tool for investigating experiential learning theory (ELT) and the characteristics of individual learning styles. This research can contribute to the broad advancement of experiential learning and, speci? ally, to the validity of interpretations of individual learning style scores. A research version of the instrument, including only the inventory to be scored by the researcher, is available for this purpose. The LSI is not a criterion-referenced test and is not intended for use to predict behavior for purposes of selection, placement, job assignment, or selective treatment. This includes not using the instrument to assign learners to different educational treatments, a process sometimes referred to as tracking. Such categorizations based on a single test score amount to stereotyping that runs counter to the philosophy of experiential learning, which emphasizes individua l uniqueness. When it is used in the simple, straightforward, and open way intended, the LSI usually provides a valuable self-examination and discussion that recognizes the uniqueness, complexity, and variability in individual approaches to learning. The danger lies in the rei? cation of learning styles into ? xed traits, such that learning styles become stereotypes used to pigeonhole individuals and their behavior. † (Kolb 1981a: 290-291) The LSI is constructed as a self-assessment exercise and tool for construct validation of ELT. Tests designed for predictive validity typically begin with a criterion, such as academic achievement, and work backward to identify items or tests with high criterion correlations.Even so, even the most sophisticated of these tests rarely rises above a . 5 correlation with the criterion. For example, while Graduate Record Examination Subject Test scores are better predictors of ? rst-year graduate school grades than either the General Test score o r undergraduate GPA, the combination of these three measures only produces multiple correlations with grades ranging from . 4 to . 6 in various ? elds (Anastasi and Urbina 1997). Construct validation is not focused on an outcome criterion, but on the theory or construct the test measures. Here the emphasis is on the pattern of convergent and discriminant theoretical predictions made by the theory. Failure to con? m predictions calls into question the test and the theory. â€Å"However, even if each of the correlations proved to be quite low, their cumulative effect would be to support the validity of the test and the underlying theory. † (Selltiz, Jahoda, Deutsch, and Cook 1960: 160) Judged by the standards of construct validity, ELT has been widely accepted as a useful framework for learning-centered educational innovation, including instructional design, curriculum development, and life-long learning. Field and job classi? cation studies viewed as a whole also show a patter n of results consistent with the ELT structure of knowledge theory. 8 LSI Technical ManualHISTORY Five versions of the Learning Style Inventory have been published over the last 35 years. 
During this time, attempts have been made to openly share information about the inventory, its scoring, and its technical characteristics with other interested researchers. The results of their research have been instrumental in the continuous improvement of the inventory. Learning Style Inventory-Version 1 (Kolb 1971, Kolb 1976) The original Learning Style Inventory (LSI 1) was created in 1969 as part of an MIT curriculum development project that resulted in the ? rst management textbook based on experiential learning (Kolb, Rubin, and McIntyre 1971).It was originally developed as an experiential educational exercise designed to help learners understand the process of experiential learning and their unique individual style of learning from experience. The term â€Å"learning style† was coin ed to describe these individual differences in how people learn. Items for the inventory were selected from a longer list of words and phrases developed for each learning mode by a panel of four behavioral scientists familiar with experiential learning theory. This list was given to a group of 20 graduate students who were asked to rate each word or phrase for social desirability. Attempting to select words that were of equal social desirability, a ? nal set of 12 items including a word or phrase for each learning mode was selected for pre-testing.Analysis showed that three of these sets produced nearly random responses and were thus eliminated, resulting in a ? nal version of the LSI with 9 items. These items were further re? ned through item-whole correlation analysis to include six scored items for each learning mode. Research with the inventory was stimulated by classroom discussions with students, who found the LSI to be helpful to them in understanding the process of experient ial learning and how they learned. From 1971 until it was revised in 1985, there were more than 350 published research studies using the LSI. Validity for the LSI 1 was established in a number of ? elds, including education, management, psychology, computer science, medicine, and nursing (Hickcox 1990, Iliff 1994).The results of this research with LSI 1 provided provided empirical support for the most complete and systematic statement of ELT, Experiential Learning: Experience as the Source of Learning and Development (Kolb 1984). Several studies of the LSI 1 identi? ed psychometric weaknesses of the instrument, particularly low internal consistency reliability and test-retest reliability. Learning Style Inventory-Version 2 (Kolb 1985) Low reliability coef? cients and other concerns about the LSI 1 led to a revision of the inventory in 1985 (LSI 2). Six new items chosen to increase internal reliability (alpha) were added to each scale, making 12 scored items on each scale. These chan ges increased scale alphas to an average of . 81 ranging from . 73 to . 88.Wording of all items was simpli? ed to a seventh grade reading level, and the format was changed to include sentence stems (e. g. , â€Å"When I learn†). Correlations between the LSI 1 and LSI 2 scales averaged . 91 and ranged from . 87 to . 93. A new more diverse normative reference group of 1446 men and women was created. Research with the LSI 2 continued to establish validity for the instrument. From 1985 until the publication of the LSI 3 1999, more than 630 studies were published, most using the LSI 2. While internal reliability estimates for the LSI 2 remained high in independent studies, test-retest reliability remained low. 
Learning Style Inventory-Version 2a (Kolb 1993)In 1991 Veres, Sims, and Locklear published a reliability study of a randomized version of the LSI 2 that showed a small decrease in internal reliability but a dramatic increase in test-retest reliability with the random scoring format. To study this format, a research version of the random format inventory (LSI 2a) was published in 1993. 9 Kolb Learning Style Inventory-Version 3 (Kolb 1999) In 1999 the randomized format was adopted in a revised self-scoring and interpretation booklet (LSI 3) that included a color-coded scoring sheet to simplify scoring. The new booklet was organized to follow the learning cycle, emphasizing the LSI as an â€Å"experience in learning how you learn. † New application information on teamwork, managing con? ct, personal and professional communication, and career choice and development were added. The LSI 3 continued to use the LSI 2 normative reference group until norms for the randomized version could be created. Kolb Learning Style Inventory-Version 3. 1 (Kolb 2005) The new LSI 3. 1 described here modi? ed the LSI 3 to include new normative data described below. This revision includes new norms that are based on a larger, more diverse and representative sample of 697 7 LSI users. The format, items, scoring, and interpretative booklet remain identical to KLSI 3. The only change in KLSI 3. 1 is in the norm charts used to convert raw LSI scores. FORMATThe Learning Style Inventory is designed to measure the degree to which individuals display the different learning styles derived from experiential learning theory. The form of the inventory is determined by three design parameters. First, the test is brief and straightforward, making it useful both for research and for discussing the learning process with individuals and providing feedback. Second, the test is constructed in such a way that individuals respond to it as they would respond to a learning situation: it requires them to resolve the tensions between the abstract-concrete and active-re? ective orientations. For this reason, the LSI format requires them to rank order their preferences for the abstract, concrete, active, and re? ective orientations.Third, and most obviously, it was hoped that the measures of learning styles would predict behavior in a way consistent with the theory of experiential learning. All versions of the LSI have had the same format—a short questionnaire (9 items for LSI 1 and 12 items for subsequent versions) that asks respondents to rank four sentence endings that correspond to the four learning modes— Concrete Experience (e. g. , experiencing), Re? ective Observation (re? ecting), Abstract Conceptualization (thinking), and Active Experimentation (doing). Items in the LSI are geared to a seventh grade reading level. The inventory is intended for use by teens and adults. It is not intended for use by younger children.The LSI has been translated into many languages, including, Arabic, Chinese, French, Japanese, Italian, Portuguese, Spanish, Swedish, and Thai, and there have been many cross-cultural studies using it (Yamazaki 2002). The Forced-Choice Format of the LSI The format of the LSI is a forced-choice format that ranks an indiv idual’s relative choice preferences among the four modes of the learning cycle. This is in contrast to the more common normative, or free-choice, format, such as the widely used Likert scale, which rates absolute preferences on independent dimensions. 
The forced-choice format of the LSI was dictated by the theory of experiential learning and by the primary purpose of the instrument.ELT is a holistic, dynamic, and dialectic theory of learning. Because it is holistic, the four modes that make up the experiential learning cycle-CE, RO, AC, and AE- are conceived as interdependent. Learning involves resolving the creative tension among these learning modes in response to the speci? c learning situation. Since the two learning dimensions, AC-CE and AE-RO, are related dialectically, the choice of one pole involves not choosing the opposite pole. Therefore, because ELT postulates that learning in life situations requires the resolution of con? icts among interdependent learning modes , to be ecologically valid, the learning style assessment process should require a similar process of con? ct resolution in the choice of one’s preferred learning approach. ELT de? nes learning style not as a ? xed trait, but as a dynamic state arising from an individual’s preferential resolution of the dual dialectics of experiencing/conceptualizing and acting/re? ecting. â€Å"The stability and endurance of these states in individuals comes not solely from ? xed genetic qualities or characteristics of human beings: nor, for that matter, does it come from the stable ? xed demands of environmental circumstances. Rather, stable and enduring patterns of human individuality arise from consistent patterns of transaction between the individual and his or her 10 LSI Technical Manual environment.The way we process the possibilities of each new emerging event determines the range of choices and decisions we see. The choices and decisions we make to some extent determine the e vents we live through, and these events in? uence our future choices. Thus, people create themselves through the choice of actual occasions they live through. † (Kolb 1984: 63-64) The primary purpose of the LSI is to provide learners with information about their preferred approach to learning. The most relevant information for the learner is about intra-individual differences, his or her relative preference for the four learning modes, not inter-individual comparisons.Ranking relative preferences among the four modes in a forced-choice format is the most direct way to provide this information. While individuals who take the inventory sometimes report dif? culty in making these ranking choices, they report that the feedback they get from the LSI gives them more insight than had been the case when we used a normative Likert rating scale version. This is because the social desirability response bias in the rating scales fails to de? ne a clear learning style, that is, they say th ey prefer all learning modes. This is supported by Harland’s (2002) ? nding that feedback from a forced-choice test format was perceived as more accurate, valuable, and useful than feedback from a normative version.The adoption of the forced-choice method for the LSI has at times placed it in the center of an ongoing debate in the research literature about the merits of forced-choice instruments between what might be called â€Å"rigorous statisticians† and â€Å"pragmatic empiricists. † Statisticians have questioned the use of the forced-choice format because of statistical limitations, called ipsativity, that are the result of the ranking procedure. 
Since ipsative scores represent the relative strength of a variable compared to others in the ranked set, the resulting dependence among scores produces methodinduced negative correlations among variables and violates a fundamental assumption of classical test theory required for use of techniques such as analysis of variance and factor analysis-independence of error variance.Cornwell and Dunlap (1994) stated that ipsative scores cannot be factored and that correlation-based analysis of ipsative data produced uninterpretable and invalid results (cf. Hicks 1970, Johnson et al. 1988). Other criticisms include the point that ipsative scores are technically ordinal, not the interval scales required for parametric statistical analysis; that they produce lower internal reliability estimates and lower validity coef? cients (Barron 1996). While critics of forced-choice instruments acknowledge that these criticisms do not detract from the validity of intra-individual comparisons (LSI purpose one), they argue that ipsative scores are not appropriate for inter-individual comparisons, since inter-individual comparisons on a ranked ariable are not independent absolute preferences, but preferences that are relative to the other ranked variables in the set (Barron 1996, Karpatschof and Elkjaer 2000). Howeve r, since ELT argues that a given learning mode preference is relative to the other three modes, it is the comparison of relative not absolute preferences that the theory seeks to assess. The â€Å"pragmatic empiricists† argue that in spite of theoretical statistical arguments, normative and forced-choice variations of the same instrument can produce empirically comparable results. Karpatschof and Elkjaer (2000) advanced this case in their metaphorically titled paper â€Å"Yet the Bumblebee Flies. † With theory, simulation, and empirical data, they presented evidence for the comparability of ipsative and normative data.Saville and Wilson (1991) found a high correspondence between ipsative and normative scores when forced choice involved a large number of alternative dimensions. Normative tests also have serious limitations, which the forced-choice format was originally created to deal with (Sisson 1948). Normative scales are subject to numerous response biases—ce ntral tendency bias, in which respondents avoid extreme responses, acquiescence response, and social desirability responding-and are easy to fake. Forced- choice instruments are designed to avoid these biases by forcing choice among alternatives in a way that re? ects real live choice making (Hicks 1970, Barron 1996).Matthews and Oddy found large bias in the extremeness of positive and negative responses in normative tests and concluded that when sources of artifact are controlled, â€Å"individual differences in ipsative scores can be used to rank individuals meaningfully† (1997: 179). Pickworth and Shoeman (2000) found signi? cant response bias in two normative LSI formats developed by Marshall and Merritt (1986) and Geiger et al. (1993). Conversely, Beutell and Kressel (1984) found that social desirability contributed less than 4% of the variance in LSI scores, in spite of the fact that individual LSI items all had very high social desirability. 11 In addition, ipsative te sts can provide external validity evidence comparable to normative data (Barron 1996) or in some cases even better (Hicks 1970). 
For example, attempts to use normative rating versions of the LSI report reliability and internal validity data but little or no external validity (Pickworth and Shoeman 2000, Geiger et al. 1993, Romero et al. 1992, Marshall and Merritt 1986, Merritt and Marshall 1984).

Characteristics of the LSI Scales
The LSI assesses six variables: four primary scores that measure an individual's relative emphasis on the four learning orientations, Concrete Experience (CE), Reflective Observation (RO), Abstract Conceptualization (AC), and Active Experimentation (AE); and two combination scores that measure an individual's preference for abstractness over concreteness (AC-CE) and for action over reflection (AE-RO).

The four primary scales of the LSI are ipsative because of the forced-choice format of the instrument. This results in negative correlations among the four scales, the mean magnitude of which can be estimated (assuming no underlying correlations among them) by the formula -1/(m - 1), where m is the number of variables (Johnson et al. 1988). This gives a predicted average method-induced correlation of -.33 among the four primary LSI scales. The combination scores AC-CE and AE-RO, however, are not ipsative; forced-choice instruments can produce scales that are not ipsative (Hicks 1970; Pathi, Manning, and Kolb 1989). To demonstrate the independence of the combination scores and the interdependence of the primary scores, Pathi, Manning, and Kolb (1989) had SPSS-X randomly fill out and analyze 1,000 LSIs according to the ranking instructions. While the mean intercorrelation among the primary scales was -.33 as predicted, the correlation between AC-CE and AE-RO was +.038. In addition, if AC-CE and AE-RO were ipsative scales, the correlation between the two scales would be -1.0 according to the above formula. Observed empirical relationships are always much smaller, e.g. +.13 for a sample of 1,591 graduate students (Freedman and Stumpf 1978), -.09 for the LSI 2 normative sample of 1,446 respondents (Kolb 1999b), -.19 for a sample of 1,296 MBA students (Boyatzis and Mainemelis 2000), and -.21 for the normative sample of 6,977 LSIs for the KLSI 3.1 described below.

The independence of the two combination scores can also be seen by examining some example scoring results. When AC-CE or AE-RO on a given item takes a value of +2 (from, say, AC = 4 and CE = 2, or AC = 3 and CE = 1), the other score can take a value of +2 or -2. Similarly, when either score takes a value of +1 (from the rank pairs 4-3, 3-2, or 2-1), the other can take the values +3, +1, -1, or -3. In other words, when AC-CE takes a particular value, AE-RO can take two to four different values, and the score on one dimension does not determine the score on the other.
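The method-induced correlation and the independence of the combination scores are easy to reproduce numerically. The sketch below (Python/NumPy, not the original SPSS-X procedure; the sample size is arbitrary) randomly fills out 12-item forced-choice rankings, scores the four modes, and checks the intercorrelations:

import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 10_000, 12

# Each item is answered by ranking the four modes (CE, RO, AC, AE) from 1 to 4.
ranks = np.array([[rng.permutation([1, 2, 3, 4]) for _ in range(n_items)]
                  for _ in range(n_people)])            # shape: (people, items, 4)

scores = ranks.sum(axis=1)                              # per-mode totals, range 12 to 48
ce, ro, ac, ae = scores.T
ac_ce, ae_ro = ac - ce, ae - ro                         # combination scores

primary = np.corrcoef(scores.T)                         # 4 x 4 correlation matrix
off_diag = primary[~np.eye(4, dtype=bool)]
print(round(off_diag.mean(), 2))                        # about -0.33, as -1/(m - 1) predicts
print(round(np.corrcoef(ac_ce, ae_ro)[0, 1], 2))        # near 0: the combination scores are not ipsative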
3. NORMS FOR THE LSI VERSION 3.1
New norms for the LSI 3.1 were created from responses by several groups of users who completed the randomized LSI 3. These norms are used to convert LSI raw scale scores to percentile scores (see Appendix 1). The purpose of the percentile conversions is to achieve scale comparability among an individual's LSI scores (Barron 1996) and to define the cut points used for defining the learning style types. Table 2 shows the means and standard deviations of the KLSI 3.1 scale scores for the normative groups.

Table 2. KLSI 3.1 Scores for Normative Groups (mean / standard deviation)

Sample                           N      CE           RO           AC           AE           AC-CE          AE-RO
Total norm group                 6977   25.39/6.43   28.19/7.07   32.22/7.29   34.14/6.68    6.83/11.69     5.96/11.63
On-line users                    5023   25.22/6.34   27.98/7.03   32.43/7.32   34.36/6.65    7.21/11.64     6.38/11.61
Research univ. freshmen           288   23.81/6.06   29.82/6.71   33.49/6.91   32.89/6.36    9.68/10.91     3.07/10.99
Liberal arts college students     221   24.51/6.39   28.25/7.32   32.07/6.22   35.05/7.08    7.56/10.34     6.80/12.37
Art college undergraduates        813   28.02/6.61   29.51/7.18   29.06/6.4    33.17/6.52    1.00/11.13     3.73/11.49
Research univ. MBA                328   25.54/6.44   26.98/6.94   33.92/7.37   33.48/7.06    8.38/11.77     6.49/11.92
Distance e-learning adult UG      304   23.26/5.73   27.64/7.04   34.36/6.87   34.18/6.28   11.10/10.45     6.54/11.00

TOTAL NORMATIVE GROUP
Normative percentile scores for the LSI 3.1 are based on a total sample of 6977 valid LSI scores from users of the instrument. This user norm group is composed of 50.4% women and 49.4% men. Their ages range from 17 to 75, broken down into the following age-range groups: < 19 = 9.8%, 19-24 = 17.2%, 25-34 = 27%, 35-44 = 23%, 45-54 = 17.2%, and > 54 = 5.8%. Their educational level is as follows: primary school graduate = 1.2%, secondary school degree = 32.1%, university degree = 41.4%, and post-graduate degree = 25.3%. The sample includes college students and working adults in a wide variety of fields. It is made up primarily of U.S. residents (80%), with the remaining 20% of users residing in 64 different countries. The norm group is made up of six subgroups, the specific demographic characteristics of which are described below.

On-line Users
This sample of 5023 is composed of individuals and groups who have signed up to take the LSI on-line. Group users include undergraduate and graduate student groups, adult learners, business management groups, military management groups, and other organizational groups. Half of the sample are men and half are women. Their ages range as follows: ... > 55 = 8.1%. Their educational level is as follows: primary school graduate = 1.7%, secondary school degree = 18.2%, university degree = 45.5%, and post-graduate degree = 34.6%. Most of the on-line users (66%) reside in the U.S., with the remaining 34% living in 64 different countries, with the largest representations from Canada (317), U.K. (212), India (154), Germany (100), Brazil (75), Singapore (59), France (49), and Japan (42).

Research University Freshmen
This sample is composed of 288 entering freshmen at a top research university. 53% are men and 47% are women. All are between the ages of 17 and 22. More than 87% of these students intend to major in science or engineering.

Liberal Arts College Students
Data for this sample were provided by Kayes (2005). This sample includes 221 students (182 undergraduates and 39 part-time graduate students) enrolled in business courses at a private liberal arts college. Their average age is 22, ranging from 18 to 51. 52% are male and 48% are female.

Art College Undergraduates
This sample is composed of 813 freshmen and graduating students from three undergraduate art colleges. Half of the sample are men and half are women. Their average age is 20, distributed as follows: ... > 35 = 1%.

Research University MBA Students
This sample is composed of 328 full-time (71%) and part-time (29%) MBA students in a research university management school. 63% are men and 37% are women. Their average age is 27, distributed as follows: 19-24 = 4.1%, 25-34 = 81.3%, 35-44 = 13.8%, and 45-54 = 1%.

Distance E-learning Adult Undergraduate Students
This sample is composed of 304 adult learners enrolled in an e-learning distance education undergraduate degree program at a large state university. 56% are women and 44% are men. Their average age is 36, distributed as follows: 19-24 = 6.3%, 25-34 = 37.5%, 35-44 = 40.1%, 45-54 = 14.5%, and > 55 = 1.6%.
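Appendix 1 (the raw-score-to-percentile conversion table) is not reproduced here. Purely as an illustration of how the Table 2 statistics relate to percentile norms, the sketch below approximates percentiles with a normal curve fitted to the total-norm-group means and standard deviations; the published KLSI norms are empirical percentile tables, so this is only a rough approximation, and the example raw scores are invented:

from scipy.stats import norm

# Total norm group mean and standard deviation from Table 2.
norms = {"CE": (25.39, 6.43), "RO": (28.19, 7.07), "AC": (32.22, 7.29),
         "AE": (34.14, 6.68), "AC-CE": (6.83, 11.69), "AE-RO": (5.96, 11.63)}

def approx_percentile(scale, raw):
    mean, sd = norms[scale]
    return 100 * norm.cdf((raw - mean) / sd)

# A hypothetical respondent's combination scores.
print(round(approx_percentile("AC-CE", 15)))    # well toward the abstract pole
print(round(approx_percentile("AE-RO", -4)))    # toward the reflective pole

# The raw cut points quoted in the next subsection (+7 and +6) fall near the 50th percentile:
print(round(approx_percentile("AC-CE", 7), 1))  # about 50.6
print(round(approx_percentile("AE-RO", 6), 1))  # about 50.1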
CUT-POINTS FOR LEARNING STYLE TYPES
The four basic learning style types (Accommodating, Diverging, Assimilating, and Converging) are created by dividing the AC-CE and AE-RO scores at the fiftieth percentile of the total norm group and plotting them on the Learning Style Type Grid (Kolb 1999a: 6). The cut point for the AC-CE scale is +7, and the cut point for the AE-RO scale is +6. The Accommodating type is therefore defined by an AC-CE raw score <= 7 together with AE-RO >= 7, the Diverging type by AC-CE <= 7 and AE-RO <= 6, the Assimilating type by AC-CE >= 8 and AE-RO <= 6, and the Converging type by AC-CE >= 8 and AE-RO >= 7. ... +12), while the reflective regions are defined by percentiles less than 33.33% (