Friday, 15 September 2017



Let us start with a simple example. In economic theory, one can find “relationships” between aggregated variables. For instance, with the Phillips curve, we want to visualize the relationship between inflation and unemployment. An economic theory might tell you that if unemployment decreases, inflation increases. From that economic theory, we might try to fit an econometric model that describes that “relationship”. But usually, econometric models are based on a linear relationship, and there is no a priori reason for the relationship to be linear. Even when you look at the data, you will not necessarily find something linear.

The Cowles Commission, which initiated econometrics (by founding the Econometric Society and the prestigious journal Econometrica), had the postulate that an econometric model should be based on an economic model. That is how you get SEMs (Structural Equation Models) like the Klein model, in the 1950s.

This is a standard (linear) econometric model, with parameters to estimate and some linear relationships among the variables. Nonparametric econometric models appeared once it was accepted that models might be nonlinear.

Splines are piecewise polynomial functions used to fit a smooth curve through data. Local regression is another option: a method for fitting a smooth curve between two variables, or a smooth surface between an outcome and up to four predictor variables. This is very natural when you think about it: if the goal is to get a good estimate at a point, you look in its neighbourhood. We do not necessarily want a good global model, but a model that is good in that neighbourhood.

With nonparametric models, we start to have numerical problems, since these problems are not as simple as linear ones. So the first goal is to get an efficient algorithm to solve them. Here we start to have connections with machine learning. The main difference I see is simple. In econometrics, seen as a mathematical statistics problem, we seek asymptotic results and nice probabilistic interpretations. In econometrics, seen as a machine learning problem, we focus more on the algorithm: we care less about interpretability, and more about getting a good algorithm. See for instance gradient boosting techniques against spline regression.

The blue line is a simple (linear) spline regression model, and the red line is a boosted spline algorithm (a stepwise procedure). The blue line is simple and easy to understand. The red line, after 200 iterations, is a sum of 200 functions.
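The contrast between the two approaches can be sketched in a few lines of Python. Everything below is an illustrative assumption, not the code behind the figures in this post: a made-up nonlinear relationship, an ordinary-least-squares line as the "simple" model, and a boosted sum of 200 piecewise-constant stumps as the stepwise procedure.

```python
# Toy nonlinear relationship (an assumption for illustration).
xs = [i / 20 for i in range(1, 101)]            # x in (0, 5]
ys = [1.0 / x for x in xs]

# The "simple" model: an ordinary least squares line.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
lin_pred = [intercept + slope * x for x in xs]

def fit_stump(xs, res):
    """Fit a one-split piecewise-constant function to the residuals."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, res) if x <= split]
        right = [r for x, r in zip(xs, res) if x > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda x: lm if x <= split else rm

# The boosted model: a stepwise sum of 200 simple functions,
# each fitted to the residuals of the current ensemble.
boost_pred = [0.0] * n
for _ in range(200):
    residuals = [y - p for y, p in zip(ys, boost_pred)]
    stump = fit_stump(xs, residuals)
    boost_pred = [p + 0.5 * stump(x) for p, x in zip(boost_pred, xs)]

def mse(pred):
    return sum((y - p) ** 2 for y, p in zip(ys, pred)) / n

print(mse(lin_pred), mse(boost_pred))
```

The boosted sum fits the nonlinear curve far better in-sample, but it is a sum of 200 functions: a good algorithm, not an interpretable model.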

Author - Poonuraj Nadar


How Big is Big Data?

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.
Today, about 2.5 quintillion bytes of data are generated every day, and the amount of data created daily is predicted to increase significantly in the years to come. This makes analysing big data a noteworthy field for the future.

Current State of Big data

Data is collected from various sources such as mobiles, cameras, microphones and the Internet of Things. To keep up with this rapid growth of data, there has been a corresponding technological development: the technological capacity to store data has roughly doubled every 40 months since the 1980s.
As of 2012, 2.5 exabytes of data were generated every day through various sources, and traditional database management systems face difficulties dealing with such large amounts of data, leading to the development of new and more advanced software. The definition of big data might be different for each organisation; it largely depends on the scale at which they work.

Why is big data better than other data?

When you talk about data in RDBs (relational databases), there are two types of data: pristine data (the best data), which is accurate, clean and 100% reliable, and bad data, which has lots of inconsistencies. A huge amount of time, money and accountability is put into making sure the data is well prepared before loading it into the database.
Big data consists of both pristine and bad data, but the major difference between big data and an RDB is that the “big” in big data makes the bad data irrelevant: there is enough volume that the amount of bad or missing data becomes statistically insignificant. When the errors in your data are common enough to cancel each other out, when the missing data is proportionally small enough to be negligible, and when your data access requirements and algorithms are functional even with incomplete and inaccurate data, then you have "Big Data".
"Big Data" is not really about the volume, it is about the characteristics of the data.
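The claim that volume makes scattered bad data statistically insignificant can be illustrated with a small simulation. All the numbers below (the true value, the 2% corruption rate, the noise ranges) are invented for illustration: we estimate a mean from a sample in which a small fraction of the records are corrupted by large, zero-mean noise.

```python
import random

random.seed(42)
TRUE_VALUE = 10.0

def sample_mean(n, bad_fraction=0.02):
    """Mean of n records, a small fraction of which are badly corrupted."""
    total = 0.0
    for _ in range(n):
        if random.random() < bad_fraction:
            total += TRUE_VALUE + random.uniform(-50, 50)   # bad record
        else:
            total += TRUE_VALUE + random.gauss(0, 1)        # clean record
    return total / n

# With enough volume, the corrupted records average out.
print(abs(sample_mean(100_000) - TRUE_VALUE))
```

The error of the estimate shrinks as the sample grows, even though the corruption rate stays the same: the bad records cancel each other out.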

Characteristics of Big Data

Big Data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data.
There are 5 V’s in Big data viz. Volume, Variety, Velocity, Variability and Veracity.

Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

Variety: The type and nature of the data. This helps people who analyse it to effectively use the resulting insight.

Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Variability: Inconsistency of the data set can hamper processes to handle and manage it.

Veracity: The quality of the captured data can vary greatly, affecting the accuracy of the analysis.

Processing Big Data

Technologies like A/B testing, machine learning and natural language processing are used to process big data.
Processing methods like predictive analytics, user behaviour analytics and other advanced data analytics methods are used to extract value from data and use it for the performance optimization of a firm.
The major challenge with processing big data is that such a huge amount of data also requires a great deal of processing power, which can be met by using advanced computing machines and software.

Uses Of Big data

Big data is used in fields such as:
1.    Banking and Securities
2.    Communication, Media and entertainment
3.    Healthcare Provision
4.    Education
5.    Manufacturing
6.    Discovery of Natural Resources
7.    Insurance
8.    Retail and wholesale trade
9.    Transportation
10. Energy and utilities

Analysis of data sets can find new correlations and patterns to:
·       Spot business trends,
·       Prevent diseases,
·       Combat crime.

Scientists, business executives, practitioners of medicine, advertisers and governments alike regularly meet difficulties with large data sets in areas including:
  •        Internet search,
  •        Fintech,
  •        Urban informatics,
  •        Business informatics.

There is substantial real spending around big data. To capitalize on big data opportunities, you need to:
  •        Familiarize yourself with and understand industry-specific challenges
  •        Understand the data characteristics of each industry
  •        Understand where spending is occurring
  •        Match market needs with your own capabilities and solutions
Vertical industry expertise is key to utilizing big data effectively and efficiently.

Author: Keyur Dhavle

Thursday, 14 September 2017



KNIME (pronounced /naɪm/), the Konstanz Information Miner, is an open source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface allows assembly of nodes for data pre-processing (ETL: Extraction, Transformation, Loading), for modelling, and for data analysis and visualization. To some extent, KNIME can be considered an alternative to SAS.
KNIME allows users to visually create data flows (or pipelines), selectively execute some or all analysis steps, and later inspect the results, models, and interactive views. KNIME is written in Java and based on Eclipse and makes use of its extension mechanism to add plugins providing additional functionality. The core version already includes hundreds of modules for data integration (file I/O, database nodes supporting all common database management systems through JDBC), data transformation (filter, converter, combiner) as well as the commonly used methods for data analysis and visualization. With the free Report Designer extension, KNIME workflows can be used as data sets to create report templates that can be exported to document formats like doc, ppt, xls, pdf and others.


KNIME's core architecture allows processing of large data volumes that are limited only by the available hard disk space (most other open source data analysis tools work in main memory and are therefore limited to the available RAM). For example, KNIME allows analysis of 300 million customer addresses, 20 million cell images and 10 million molecular structures.
Additional plugins allow the integration of methods for text mining, image mining and time series analysis.
KNIME integrates various other open-source projects, e.g. machine learning algorithms from Weka, the R statistics project, LIBSVM, JFreeChart, ImageJ, etc.
KNIME is implemented in Java but also allows wrappers calling other code, in addition to providing nodes that allow running Java, Python, Perl and other code fragments.


Python: Originating as an open source scripting language, Python usage has grown over time. Today, it sports libraries (numpy, scipy and matplotlib) and functions for almost any statistical operation or model building you may want to do. Since the introduction of pandas, it has become very strong in operations on structured data.
Python is a programming language that is popularly used for data mining tasks. Programming languages require you to give the computer very detailed, step-by-step instructions of what to do. Memorizing those programming statements is a good deal of what "learning to program" consists of. You can use its add-on packages to minimize your programming effort, but you're still doing some programming.
SAS: SAS has been the undisputed market leader in the commercial analytics space. The software offers a huge array of statistical functions, has a good GUI (Enterprise Guide & Miner) for people to learn quickly, and provides excellent technical support. However, it ends up being the most expensive option and is not always enriched with the latest statistical functions.
R: R is the open source counterpart of SAS, which has traditionally been used in academics and research. Because of its open source nature, the latest techniques get released quickly. There is a lot of documentation available over the internet and it is a very cost-effective option. R is easy to get started with too, but needs around a week of initial reading before you are productive.
KNIME is primarily a workflow-based package that tries to give you most of the flexibility and power of programming without requiring you to know how to program. The workflow style is easy to use: you drag and drop icons that represent steps of the analysis onto a drawing window. What each icon does is controlled by dialog boxes rather than commands you have to remember. When finished, the workflow
1) accomplishes the tasks,
2) documents the steps for reproducibility,
3) shows you the big picture of what was done and
4) allows you to reuse the steps on new sets of data without resorting to any underlying programming code (as menu-based user interfaces such as SPSS often require).
A particularly nice feature of KNIME is that it allows you to add nodes to your workflow that contain custom programming. This allows you to combine the two approaches, making the most of each.

KNIME needs a basic understanding of the dataset and some logical thinking before getting into the analysis. It makes analysis work much easier, since you do not have to remember the algorithms.

Iris flower data set

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.


Using the simple regression tree for regression in the KNIME Analytics Platform.

Node Repository:

The node repository contains all KNIME nodes ordered in categories. A category can contain another category, for example, the Read category is a subcategory of the IO category. Nodes are added from the repository to the workflow editor by dragging them to the workflow editor. Selecting a category displays all contained nodes in the node description view; selecting a node displays the help for this node. If you know the name of a node you can enter parts of the name into the search box of the node repository. As you type, all nodes are filtered immediately to those that contain the entered text in their names:

Drag and drop the file reader from the node repository into the workflow Editor.

Workflow Editor

The workflow editor is used to assemble workflows, configure and execute nodes, inspect the results and explore your data. This section describes the interactions possible within the editor.

File Reader:

This node can be used to read data from an ASCII file or URL location. It can be configured to read various formats. When you open the node's configuration dialog and provide a filename, it tries to guess the reader's settings by analyzing the content of the file. Check the results of these settings in the preview table. If the data shown is not correct or an error is reported, you can adjust the settings manually.
The file analysis runs in the background and can be cut short by clicking "Quick scan", which appears if the analysis takes longer. In that case the file is not analyzed completely; only the first fifty lines are taken into account. It can then happen that the preview looks fine, but the execution of the File Reader fails when it reads the lines it didn't analyze. Thus it is recommended that you check the settings whenever you cut an analysis short. Load the iris data set into the File Reader via its configuration dialog.


When a node is dragged to the workflow editor or is connected, it usually shows the red status light indicating that it needs to be configured, i.e. the dialog has to be opened. This can be done by either double-clicking the node or by right-clicking the node to open the context menu. The first entry of the context menu is "Configure", which opens the dialog. If the node is selected you can also choose the related button from the toolbar above the editor. The button looks like the icon next to the context menu entry.


Then drag and drop the Partitioning node from the node repository into the workflow editor. This is done to divide the iris data set into a training set and a testing set. The input table is split into two partitions (i.e. row-wise), e.g. train and test data, and the two partitions are available at the two output ports. The partitioning options (absolute or relative size, sampling strategy, random seed) are set in the node's dialog.
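Outside KNIME, the same row-wise partition can be sketched in plain Python. The 80/20 split and the seed below are illustrative choices, not KNIME's defaults:

```python
import random

rows = list(range(150))          # stand-ins for the 150 iris rows
random.seed(0)                   # fixed seed: the same split on every run
random.shuffle(rows)

cut = int(0.8 * len(rows))       # a relative, 80% first partition
train, test = rows[:cut], rows[cut:]
print(len(train), len(test))     # 120 30
```

Every row lands in exactly one partition, which is what the two output ports of the node deliver.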

Simple Regression Tree Learner:

Drag and drop the Simple Regression Tree Learner into the workflow editor. It learns a single regression tree. The procedure follows the algorithm described in CART ("Classification and Regression Trees", Breiman et al., 1984), although the current implementation applies a couple of simplifications, e.g. no pruning and not necessarily binary trees.
The missing value handling currently used also differs from the one in CART. In each split, the algorithm tries to find the best direction for missing values by sending them in each direction and selecting the one that yields the best result (i.e. the largest gain). The procedure is adapted from the well-known XGBoost algorithm.

Simple Regression Tree Predictor:

Applies a regression tree model, predicting each value as the mean of the records in the corresponding child node. Drag and drop the Simple Regression Tree Predictor from the node repository into the workflow editor.
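A minimal version of this learner-plus-predictor pair can be sketched in Python. This is a hand-rolled CART-style tree with no pruning (not KNIME's implementation), and the petal-length/petal-width pairs below are hypothetical values, not rows from the actual iris data set:

```python
def build(xs, ys, depth=2, min_leaf=2):
    """Grow a CART-style regression tree: minimize the summed squared error."""
    if depth == 0 or len(xs) < 2 * min_leaf:
        return sum(ys) / len(ys)              # leaf: mean of the child node
    best = None
    for split in sorted(set(xs))[:-1]:
        l = [(x, y) for x, y in zip(xs, ys) if x <= split]
        r = [(x, y) for x, y in zip(xs, ys) if x > split]
        if len(l) < min_leaf or len(r) < min_leaf:
            continue
        def sse(pts):
            m = sum(y for _, y in pts) / len(pts)
            return sum((y - m) ** 2 for _, y in pts)
        score = sse(l) + sse(r)
        if best is None or score < best[0]:
            best = (score, split, l, r)
    if best is None:
        return sum(ys) / len(ys)
    _, split, l, r = best
    return (split,
            build([x for x, _ in l], [y for _, y in l], depth - 1, min_leaf),
            build([x for x, _ in r], [y for _, y in r], depth - 1, min_leaf))

def predict(node, x):
    """Walk the tree; the prediction is the leaf's mean."""
    while isinstance(node, tuple):
        split, left, right = node
        node = left if x <= split else right
    return node

# Hypothetical petal-length -> petal-width pairs (invented for illustration):
xs = [1.4, 1.5, 1.3, 4.5, 4.7, 4.4, 5.8, 6.1, 5.9]
ys = [0.2, 0.2, 0.3, 1.5, 1.4, 1.5, 2.2, 2.3, 2.1]
tree = build(xs, ys)
print(predict(tree, 1.4), predict(tree, 6.0))
```

Each prediction is simply the mean target of the training records that fell into the same leaf, exactly as the predictor node description says.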

Column Filter:

This node allows columns to be filtered from the input table, while only the remaining columns are passed to the output table. Within the dialog, columns can be moved between the Include and Exclude lists. Drag and drop the Column Filter from the node repository into the workflow editor.

Line Plot:

Plots the numeric columns of the input table as lines. All values are mapped to a single y coordinate. This may distort the visualization if the differences between the values in the columns are large.
Only columns with a valid domain are available in this view. Make sure that the predecessor node is executed, or set the domain with the Domain Calculator node. Drag and drop the Line Plot from the node repository into the workflow editor.

Numeric Scorer:

This node computes certain statistics between a numeric column's actual values (ri) and predicted values (pi). It computes R²=1-SSres/SStot=1-Σ(pi-ri)²/Σ(ri-1/n*Σri)² (which can be negative!), the mean absolute error (1/n*Σ|pi-ri|), the mean squared error (1/n*Σ(pi-ri)²), the root mean squared error (sqrt(1/n*Σ(pi-ri)²)), and the mean signed difference (1/n*Σ(pi-ri)). The computed values can be inspected in the node's view and/or further processed using the output table. Drag and drop the Numeric Scorer from the node repository into the workflow editor.
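The scorer's formulas are easy to reproduce directly. The sketch below implements the same five statistics in plain Python; the example vectors are made up for illustration, not the node's actual output:

```python
import math

def numeric_scores(actual, predicted):
    """R^2, MAE, MSE, RMSE and mean signed difference, as in the Numeric Scorer."""
    n = len(actual)
    mean_r = sum(actual) / n
    ss_res = sum((p - r) ** 2 for p, r in zip(predicted, actual))
    ss_tot = sum((r - mean_r) ** 2 for r in actual)
    return {
        "R2": 1 - ss_res / ss_tot,                                        # can be negative
        "MAE": sum(abs(p - r) for p, r in zip(predicted, actual)) / n,
        "MSE": ss_res / n,
        "RMSE": math.sqrt(ss_res / n),
        "MSD": sum(p - r for p, r in zip(predicted, actual)) / n,
    }

scores = numeric_scores([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(scores["MAE"], scores["MSE"])   # 0.5 0.375
```

Note that RMSE is just the square root of MSE, and that MSD keeps the sign of the errors, so it shows whether the model over- or under-predicts on average.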

Connect all the nodes in the workflow editor to run the simple regression tree:


You can connect two nodes by dragging the mouse from the out-port of one node to the in-port of another node. Loops are not permitted. If a node is already connected you can replace the existing connection by dragging a new connection onto it. If the node is already connected you will be asked to confirm the resulting reset of the target node. You can also drag the end of an existing connection to a new in-port (either of the same node or to a different node).


In the next step, you probably want to execute the node, i.e. you want the node to actually perform its task on the data. To achieve this right-click the node in order to open the context menu and select "Execute". You can also choose the related button from the toolbar. The button looks like the icon next to the context menu entry. It is not necessary to execute every single node: if you execute the last node of connected but not yet executed nodes, all predecessor nodes will be executed before the last node is executed.

Execute All:

In the toolbar above the editor there is also a button to execute all not yet executed nodes on the workflow.
This also works if a node in the flow is lit with the red status light due to missing information in the predecessor node. When the predecessor node is executed and the node with the red status light can apply its settings it is executed as well as its successors. The underlying workflow manager also tries to execute branches of the workflow in parallel.

Execute and Open View:

The node context menu also contains the "Execute and open view" option. This executes the node and immediately opens the view. If a node has more than one view, only the first view is opened.

In the workflow editor, when you view the Partitioning node's output, you can see the first partition (the training data, 80% of the iris data set) and the second partition (the testing data, the remaining 20%).

In the workflow editor, when you view the Simple Regression Tree Learner node, you can see the decision tree in tabulated form; it expands when you click the plus symbol (+) and collapses when you click the minus symbol (-). You can also adjust the zoom level from 60% to 120%.

In the workflow editor, when you view the Simple Regression Tree Predictor, it displays the predicted values for the testing data set.

In the workflow editor, when you view the Column Filter, you can see that the other columns have been filtered out; only the petal width and the predicted petal width are kept, and these will be used to plot a line graph.

In the workflow editor, when you view the Line Plot, you can see that it draws a line plot to visualize the performance of the simple regression tree.

By clicking "Fit to size" you can fit the plot within the screen for a better visualization. You can also change the colour by clicking "Background colour".
In the workflow editor, when you view the Numeric Scorer node, it displays the statistical scores of the prediction.


The mean absolute error for prediction (petal width) is 0.147. In statistics, the mean absolute error (MAE) is a quantity used to measure how close forecasts or predictions are to the eventual outcomes.

The mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the differences between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss. The difference occurs because of randomness, or because the estimator doesn't account for information that could produce a more accurate estimate. The MSE is a measure of the quality of an estimator: it is always non-negative, and values closer to zero are better. Here the MSE is 0.04.

The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between the values predicted by a model or an estimator and the values actually observed. The RMSD represents the sample standard deviation of the differences between predicted and observed values. These individual differences are called residuals when the calculations are performed over the data sample that was used for estimation, and prediction errors when computed out-of-sample. The RMSD aggregates the magnitudes of the prediction errors into a single measure of predictive power. RMSD is a measure of accuracy used to compare the forecasting errors of different models on a particular data set, not between data sets, as it is scale-dependent. Although RMSD is one of the most commonly reported measures of disagreement, it is the square root of the average of squared errors, so it confounds information about the average error with information about the variation in the errors. The effect of each error on the RMSD is proportional to the size of the squared error; thus larger errors have a disproportionately large effect, and RMSD is consequently sensitive to outliers.



Overall, this is a very sophisticated and professional piece of software. Because of its flexibility, it is nowadays our chief cheminformatics workhorse, and voting with one's feet is surely the best possible endorsement. The KNIME philosophy and business model of mixed commercial and free (but open) software allows its continued improvement while making it freely available to desktop users. Some minor gripes relate to the fact that it seems only to read, but not write, .xlsx files; we are confident that someone will write a node to let it do so soon. There is a substantial community of users, increasing all the time, and many training schools and the like. Because of this, I think it will continue to grow in popularity. It is well worth a look for the GP community.


Wednesday, 13 September 2017

Text mining and Getting started with NLTK in PYTHON


 Computing with Language: Texts and Words

We’re all very familiar with text, since we read and write it every day. Here we will treat text as raw data for the programs we write, programs that manipulate and analyse it in a variety of interesting ways. But before we can do this, we have to get started with the Python interpreter. The size of data is increasing at an exponential rate day by day. Almost all types of institutions, organizations and business industries store their data electronically. A huge amount of text flows over the internet in the form of digital libraries, repositories and other textual information such as blogs, social media networks and e-mails. It is a challenging task to determine appropriate patterns and trends in order to extract valuable knowledge from this large volume of data. Traditional data mining tools are incapable of handling textual data, since extracting information from it requires time and effort.


Ø To extract the text from the Anaconda cloud
Ø To analyse the extracted text
Ø To learn the commands used to analyse the text in all possible ways to get a meaningful output
Ø To know the importance of analysing text
Ø To know how and what algorithm is used to analyse the text, and how the Python interpreter works
Ø To plot a dispersion plot from the analysed text

Text mining
     It is also referred to as text data mining, roughly equivalent to text analytics: the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

   Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.

Getting Started with Python

One of the friendly things about Python is that it allows you to type directly into the interactive interpreter, the program that will be running your Python programs. You can access the Python interpreter using a simple graphical interface called the Interactive Development Environment (IDLE). On a Mac you can find this under Applications → MacPython, and on Windows under All Programs → Python. Under Unix you can run Python from the shell by typing idle (if this is not installed, try typing python). The interpreter will print a blurb about your Python version; simply check that you are running Python 2.4 or 2.5 (here I am using Python 3.6 with Anaconda's Spyder).
The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this book, don't type the ">>>" yourself.

Getting Started with NLTK

Before going further you should install NLTK, downloadable for free from http://www.nltk.org/. Follow the instructions there to download the version required for your platform.
Once you've installed NLTK, start up the Python interpreter as before, and install the data required for the book by typing the following two commands at the Python prompt, then selecting the book collection as shown:

>>> import nltk
>>> nltk.download()

Downloading the NLTK Book Collection: browse the available packages in the downloader. The Collections tab shows how the packages are grouped into sets; you should select the line labeled "book" to obtain all the data required for the examples and exercises in the book. It consists of about 30 compressed files requiring about 100 MB of disk space. The full collection of data (i.e., everything in the downloader) is about five times this size (at the time of writing) and continues to expand.

Once the data is downloaded to your machine, you can load some of it using the Python interpreter. The first step is to type a special command at the Python prompt, which tells the interpreter to load some texts for us to explore: from nltk.book import *. This says “from NLTK’s book module, load all items.” The book module contains all the data you will need as you read this chapter. After printing a welcome message, it loads the text of several books (this will take a few seconds). Here’s the command again, together with the output that you will see. Take care to get spelling and punctuation right, and remember that you don’t type the >>>.

Any time we want to find out about these texts, we just have to enter their names at
the Python prompt:

>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>

Now that we can use the Python interpreter, and have some data to work with, we are ready to get started.

Searching Text

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses:

>>> text1.concordance("monstrous")

Try searching for other words; to save re-typing, you might be able to use up-arrow, Ctrl-up-arrow, or Alt-p to access the previous command and modify the word being searched. You can also try searches on some of the other texts we have included. For example, search Sense and Sensibility for the word affection, using text2.concordance("affection"). Search the book of Genesis to find out how long some people lived, using text3.concordance("lived"). You could look at text4, the Inaugural Address Corpus, to see examples of English going back to 1789, and search for words like nation, terror, god to see how these words have been used differently over time. We’ve also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol. (Note that this corpus is uncensored!)

Once you’ve spent a little while examining these texts, we hope you have a new sense of the richness and diversity of language. Next you will learn how to access a broader range of text, including text in languages other than English.
A concordance permits us to see words in context. For example, we saw that monstrous occurred in contexts such as the ___ pictures and the ___ size. What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses:

>>> text1.similar("monstrous")

>>> text2.similar("monstrous")

Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.

The term common_contexts allows us to examine just the contexts that are shared by two or more words, such as monstrous and very. We have to enclose these words by square brackets as well as parentheses, and separate them with a comma:

>>> text2.common_contexts(["monstrous", "very"])


It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text.

Term frequency

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency. However, in the case where the lengths of documents vary greatly, adjustments are often made (see definition below).
The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:

•The weight of a term that occurs in a document is simply proportional to the term frequency.

Inverse document frequency

Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:

•The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

tf–idf is the product of two statistics, term frequency and inverse document frequency. Various ways for determining the exact values of both statistics exist.

Term frequency

In the case of the term frequency tf(t,d), the simplest choice is to use the raw count of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw count by f(t,d), then the simplest tf scheme is tf(t,d) = f(t,d). Other possibilities include:
•Boolean "frequencies": tf(t,d) = 1 if t occurs in d and 0 otherwise;
•term frequency adjusted for document length: f(t,d) ÷ (number of words in d);
•logarithmically scaled frequency: tf(t,d) = 1 + log f(t,d), or zero if f(t,d) is zero;
•augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most frequently occurring term in the document: tf(t,d) = 0.5 + 0.5 · f(t,d) ÷ max{f(t′,d) : t′ ∈ d}.
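The schemes above can be sketched in plain Python (a minimal illustration; the function name and the sample document are my own, not from the text):

```python
import math
from collections import Counter

def tf_schemes(term, document):
    """Compute several term-frequency variants for `term` in `document`
    (a list of word tokens). Returns a dict: scheme name -> value."""
    counts = Counter(document)
    f = counts[term]  # raw count f(t, d)
    return {
        "raw": f,
        "boolean": 1 if f > 0 else 0,
        "length_adjusted": f / len(document),
        "log": 1 + math.log(f) if f > 0 else 0,
        # augmented: raw count over the count of the most frequent term in d
        "augmented": 0.5 + 0.5 * f / max(counts.values()),
    }

doc = "the quick brown fox jumps over the lazy brown cow".split()
schemes = tf_schemes("brown", doc)
```

Here "brown" occurs twice in a ten-word document, so the raw count is 2 and the length-adjusted frequency is 0.2; since no term occurs more often than "brown", the augmented frequency reaches its maximum of 1.0.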

Inverse document frequency

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

idf(t, D) = log( N ÷ |{d ∈ D : t ∈ d}| )

where:

•N: total number of documents in the corpus, N = |D|;
•|{d ∈ D : t ∈ d}|: number of documents where the term t appears (i.e., tf(t,d) ≠ 0). If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|.
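That definition, including the smoothed denominator for terms absent from the corpus, can be sketched as follows (a small illustration with an invented four-document corpus; the function name is my own):

```python
import math

def idf(term, corpus, smooth=False):
    """Inverse document frequency of `term` over `corpus`
    (a list of tokenized documents). With smooth=True the
    denominator becomes 1 + document frequency, which avoids
    division by zero for terms that appear in no document."""
    n = len(corpus)                               # N = |D|
    df = sum(1 for doc in corpus if term in doc)  # |{d in D : t in d}|
    if smooth:
        return math.log(n / (1 + df))
    return math.log(n / df)  # raises ZeroDivisionError if df == 0

corpus = [
    "the brown cow".split(),
    "the quick fox".split(),
    "the lazy dog".split(),
    "the brown fox".split(),
]
```

A term like "the" that appears in every document gets idf = log(4/4) = 0, while "brown", appearing in two of four documents, gets idf = log(2).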

Term frequency–Inverse document frequency
Then tf–idf is calculated as

tfidf(t, d, D) = tf(t, d) · idf(t, D)

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0.
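Putting the two statistics together, here is a small end-to-end sketch that scores documents against the query from the opening example, "the brown cow". It uses raw-count tf and unsmoothed idf (so query terms must occur somewhere in the corpus); the corpus and function names are invented for illustration:

```python
import math
from collections import Counter

def tfidf(term, document, corpus):
    """tf-idf weight with raw-count tf and unsmoothed idf."""
    tf = Counter(document)[term]
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

def score(query, document, corpus):
    """Sum of tf-idf weights of the query terms in one document."""
    return sum(tfidf(t, document, corpus) for t in query.split())

corpus = [
    "the brown cow ate the brown grass".split(),
    "the quick brown fox".split(),
    "the cow jumped over the moon".split(),
    "the dog barked at the moon".split(),
]
scores = [score("the brown cow", d, corpus) for d in corpus]
best = scores.index(max(scores))  # document 0 mentions both 'brown' and 'cow'
```

Note how "the", which appears in every document, contributes idf = log(4/4) = 0 to every score: exactly the filtering of common terms described above. The first document wins because it is the only one containing both "brown" and "cow".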

Returning to the dispersion plot described earlier, Python produces one when you give the following command:

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])


We will see some striking patterns of word usage over the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end). You can produce this plot with the command shown above. You might like to try more words (e.g., liberty, constitution) and different texts. Can you predict the dispersion of a word before you view it? As before, take care to get the quotes, commas, brackets, and parentheses exactly right.

Important: You need to have Python’s NumPy and Matplotlib packages installed in order to produce the graphical plots used. Please see the NLTK website for installation instructions.


Huge volumes of text-based data are now available and need to be examined to extract valuable information. Text mining techniques are used to analyse interesting and relevant information effectively and efficiently from large amounts of unstructured data. Specific patterns and sequences are applied in order to extract useful information by eliminating irrelevant details for predictive analysis. Selecting the right techniques and tools helps make the text mining process easy and efficient. Knowledge integration, varying concept granularity, multilingual text refinement, and natural language processing ambiguity are major issues and challenges that arise during the text mining process. In future research work, we will focus on designing algorithms that help to resolve the issues presented here.


