Information about Data Science, Analytics, Big Data Technologies, Statistics and the latest news in the Data Science field.

Thursday, 18 January 2018

POLITICAL ANALYSIS OF PAWAN KALYAN USING DATA ANALYTICS


Political parties now use digital platforms to interact with people and party workers: they post campaign photos on Instagram and videos on YouTube, and debate on Twitter and Facebook. These were strong indicators of the impact of online social media (OSM) on the General Elections 2019. Hardly any political leader or party lacks an account on the microblogging site Twitter, and the surge in political conversations there inspired us to study and analyze this huge ocean of election data. We collected tweets related to Pawan Kalyan from October 1, 2017 to January 8, 2018 with the help of the Twitter API.

We analyzed the complete dataset to find interesting patterns in it and also to verify whether the trivial things were evident in the data collected. We found that the activity on Twitter peaked during important events related to Pawan Kalyan. It was evident from our data that the political behavior of Pawan Kalyan affected his follower count and thus his popularity on Twitter. Yet another aim of our work was to find an efficient way to classify the political orientation of the users on Twitter. To accomplish this task, we used four different techniques: two were based on the content of the tweets made by the user, one on user-based features, and another on a community detection algorithm applied to the retweet and user mention networks. We found that the community detection algorithm worked best, with an efficiency of more than 80%. The content-based methods did not fare well in the classification results. With the aim of monitoring the daily incoming data, we built a portal to show the analysis of the tweets of the last 24 hours. This portal analyzed the tweets to find the most trending topics, the hashtags, the kind of sentiments received by the users and the location of the tweets, and also monitored the popularity of various celebrities, political leaders and their parties' accounts on Twitter. To the best of our knowledge, this is the first academic pursuit to analyze the elections data and classify the users in the India General Elections 2019.

                                 DOWNLOAD THE COMPLETE REPORT HERE

Tuesday, 26 December 2017

A CASE STUDY ON P&G : Types of Analytics and how P&G implemented it.





Uncertainty and an overwhelming number of alternatives are two key factors that make decision making difficult. Business analytics approaches can assist by identifying and mitigating uncertainty and by prescribing the best course of action from a very large number of alternatives. In short, business analytics can help us make better-informed decisions.

There are three categories of analytics: descriptive, predictive and prescriptive. Descriptive analytics describes what has happened and includes tools such as reports, data visualization, data dashboards, descriptive statistics, and some data mining techniques. Predictive analytics consists of techniques that use past data to predict future events and includes regression, data mining, forecasting and simulation. Prescriptive analytics uses input data to determine the best course of action. This class of analytical techniques includes simulation, decision analysis, and optimization.

Consumer goods giant Procter & Gamble (P&G), the maker of such well-known brands as Tide, Olay, Crest, Bounty and Pampers, sells its products in over 180 countries around the world. Supply chain coordination and efficiency are critical to the company’s profitability. After many years of acquisitions and growth, P&G embarked on an effort known as Strengthening Global Effectiveness. A major piece of that effort was the North American Supply Chain Study, whose purpose was to make the supply chain in North America as efficient as possible while ensuring that customer requirements were met.

A team of P&G analysts and managers partnered with a group of analytics faculty at the University of Cincinnati to create a system to help managers redesign the supply effort in North America. The fundamental questions to be answered were:
1.      Which plants should make which product families?
2.      Where should the distribution centres be located?
3.      Which plants should ship to which distribution centres?
4.      Which customers should be served by each distribution centre?

The team’s approach utilized all three categories of business analytics: Descriptive, Predictive and Prescriptive.

At the start of the study, data had to be collected from all aspects of the supply chain. These included demand by product family, fixed and variable production costs by plant, and freight costs and handling charges at the distribution centres. Data queries and descriptive statistics were utilized to acquire and better understand the current supply chain data.

Data visualization, in the form of a geographic information system, allowed the proposed solutions to be displayed on a map for more intuitive interpretation by management. Because the supply chain had to be redesigned for the future, predictive analytics was used to forecast product family demand by three-digit zip code for ten years into the future. The future demand was then input along with projected freight and other relevant costs, into an interactive optimization model that minimized costs subject to constraints. The suite of analytical models was aggregated into a single system that could be run quickly on a laptop computer.
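To make the idea of "minimizing costs subject to constraints" concrete, here is a minimal sketch in Python. It is not P&G's model: the distribution centre names, customer zones and all cost figures are made up for illustration, and the problem is small enough to solve by simply trying every combination rather than by a real optimization solver. The constraint is a cap on how many centres may be opened.

```python
from itertools import combinations

# Hypothetical toy data (not P&G's): fixed cost of opening each candidate
# distribution centre, and cost of serving each customer zone from each centre.
fixed_cost = {"DC_A": 50, "DC_B": 60, "DC_C": 40}
serve_cost = {
    "zone1": {"DC_A": 10, "DC_B": 40, "DC_C": 60},
    "zone2": {"DC_A": 35, "DC_B": 15, "DC_C": 50},
    "zone3": {"DC_A": 70, "DC_B": 30, "DC_C": 20},
    "zone4": {"DC_A": 80, "DC_B": 45, "DC_C": 10},
}

def total_cost(open_dcs):
    """Fixed costs of the open centres plus the cheapest service cost per zone."""
    fixed = sum(fixed_cost[dc] for dc in open_dcs)
    service = sum(min(costs[dc] for dc in open_dcs) for costs in serve_cost.values())
    return fixed + service

def best_design(max_open):
    """Try every non-empty set of at most max_open centres and keep the cheapest."""
    candidates = [
        subset
        for k in range(1, max_open + 1)
        for subset in combinations(sorted(fixed_cost), k)
    ]
    return min(candidates, key=total_cost)

design = best_design(max_open=2)
# Each zone is served by its cheapest open centre.
assignment = {
    zone: min(design, key=lambda dc: costs[dc])
    for zone, costs in serve_cost.items()
}
print(design, total_cost(design), assignment)
```

With realistic numbers of plants, centres and customers, enumeration is no longer feasible, which is why a study like this one relies on a proper optimization model.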

P&G product category managers made over a thousand runs of the system before reaching consensus on a small set of alternative designs. Each proposed design in this selected set was then subjected to a risk analysis using computer simulation, ultimately leading to a single go-forward design.

The chosen redesign of the supply chain was implemented over time and led to documented savings in excess of $250 million per year in P&G’s North American supply chain. The system of models was also used to streamline the supply chains in Europe and Asia, and P&G has become a world leader in the use of analytics in supply chain management.


To conclude, descriptive and predictive analytics can help us better understand the uncertainty and risk associated with our decision alternatives. Predictive and prescriptive analytics, often referred to together as advanced analytics, can help us make the best decision when facing a myriad of alternatives.

Author - Kunal Patel

Visit me at - Kunal Patel


Friday, 15 September 2017

NONPARAMETRIC ECONOMETRICS OR THE MEANING OF ‘LINEAR’


Let us start with a simple example. In economic theory, one can find “relationships” between aggregated variables. For instance, with the Phillips curve, we want to visualize the relationship between inflation and unemployment. An economic theory might tell you that if unemployment decreases, inflation increases. From that economic theory, we might try to fit an econometric model that describes that “relationship”. But usually, econometric models are based on a linear relationship. There is no a priori reason to have a linear relationship. Even if you look at data, you may not find something linear.



The Cowles Commission, which helped initiate econometrics (through its close ties to the Econometric Society and the prestigious journal Econometrica), had the postulate that an econometric model should be based on an economic model. That’s how you get SEMs (structural equation models) like the Klein model, in the 1950s.

This is a standard (linear) econometric model, with parameters and some linear relationship among the variables. Nonparametric econometrics started from the question of what to do if models might be nonlinear.

Splines are piecewise polynomial functions joined smoothly at points called knots, and they are one way to fit such nonlinear relationships. Local regression is another: it is a method for fitting a smooth curve between two variables, or a smooth surface between an outcome and up to four predictor variables. This is very natural when you think about it: if the goal is to get a good estimate at a point, you look in its neighbourhood. We do not necessarily want a good global model, but a model that is good in that neighbourhood.
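The "look in the neighbourhood" idea can be sketched in a few lines of Python. This is only an illustration, assuming a Gaussian kernel and a hand-picked bandwidth (both are my choices, not something from the text): to estimate the curve at a point x0, we fit a straight line by weighted least squares, giving more weight to observations close to x0.

```python
import math

def local_linear(x0, xs, ys, bandwidth):
    """Locally weighted linear fit evaluated at x0 (Gaussian kernel weights)."""
    w = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    sw = sum(w)
    # Weighted means of x and y.
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    # Weighted least-squares slope around x0.
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

# A clearly nonlinear toy relationship: one period of a sine wave.
xs = [i / 20 for i in range(21)]
ys = [math.sin(2 * math.pi * x) for x in xs]
print(local_linear(0.25, xs, ys, bandwidth=0.1))
```

A small bandwidth follows the data closely, a large one approaches the global straight line, so the bandwidth plays the role of the smoothing parameter.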

With nonparametric models, we start to have numerical problems, since the problems are not as simple as linear ones. So the first goal is to get an efficient algorithm to solve them, and here we start to have connections with machine learning. The main difference I see is simple. In econometrics seen as a mathematical statistics problem, we seek asymptotic results and nice probabilistic interpretations. In econometrics seen as a machine learning problem, we focus more on the algorithm. We care less about the interpretability of the output; we want a good algorithm. See for instance gradient boosting techniques against spline regression.

The blue line is a simple (linear) spline regression model, and the red line is a boosted spline algorithm (it is a stepwise procedure). The blue line is simple, easy to understand. The red line, after 200 iterations is a sum of 200 functions.
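The stepwise nature of boosting, where the final fit is a sum of many small functions, can be sketched as follows. Note the substitution: the figure described above boosts splines, whereas this sketch uses one-split step functions ("stumps") as the small functions because they are shorter to code. The learning rate of 0.1 and the sine-shaped toy data are my assumptions.

```python
import math

def fit_stump(xs, residuals):
    """Best single-split piecewise-constant fit to the current residuals."""
    best = None
    for s in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < s]
        right = [r for x, r in zip(xs, residuals) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr

def boost(xs, ys, n_iter=200, rate=0.1):
    """Stepwise procedure: each iteration fits a stump to what is still unexplained."""
    base = sum(ys) / len(ys)
    learners = []
    pred = [base] * len(xs)
    for _ in range(n_iter):
        residuals = [y - p for y, p in zip(ys, pred)]
        h = fit_stump(xs, residuals)
        learners.append(h)
        pred = [p + rate * h(x) for p, x in zip(pred, xs)]
    # The final model is the mean plus a (shrunken) sum of n_iter functions.
    return lambda x: base + rate * sum(h(x) for h in learners)

xs = [i / 50 for i in range(51)]
ys = [math.sin(2 * math.pi * x) for x in xs]
model = boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(round(mse, 4))
```

The result fits well but, as the text says, it is a sum of 200 functions and no longer something you can read off at a glance.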

Author - Poonuraj Nadar


How Big is Big Data?





Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.
In today’s world about 2.5 quintillion bytes of data are generated every day, and the amount created each day is predicted to increase significantly in the times to come. This makes analysing big data a noteworthy field for the future.

Current State of Big data

The data is collected from various sources such as mobile phones, cameras, microphones and the Internet of Things. To keep up with the rapid growth of data, technology has developed accordingly: the technological capacity to store data has roughly doubled every 40 months since the 1980s.
            As of 2012, 2.5 exabytes of data were generated every day through various sources. Traditional database management systems have difficulty dealing with such large amounts of data, which has led to the development of new and more advanced software. The definition of big data may differ from one organisation to another; it depends largely on the scale at which the organisation works.

Why is big data better than other data?

When you talk about data in relational databases (RDBs), there are two types of data: pristine data (the best data), which is accurate, clean and 100% reliable, and bad data, which has a lot of inconsistencies. A huge amount of time, money and accountability goes into making sure the data is well prepared before it is loaded into the database.
Big data consists of both pristine and bad data, but the major difference between big data and an RDB is that the “big” in big data makes the bad data irrelevant: there is enough volume that the amount of bad or missing data becomes statistically insignificant. When the errors in your data are common enough to cancel each other out, when the missing data is proportionally small enough to be negligible, and when your data access requirements and algorithms are functional even with incomplete and inaccurate data, then you have "Big Data".
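A small simulation shows what "errors cancel each other out" can mean in practice. It is only a sketch under stated assumptions: the true value of 10.0, the noise level, the 10% share of missing records and the random seed are all invented, and the errors are assumed to be random with no systematic bias. Errors that all lean the same way would not cancel, however large the data set.

```python
import random

def mean_error(n, rng, missing_share=0.1):
    """Average n noisy readings of a true value; some readings are missing."""
    true_value = 10.0
    readings = []
    for _ in range(n):
        if rng.random() < missing_share:
            continue  # this record is missing
        readings.append(true_value + rng.gauss(0.0, 2.0))  # unbiased random error
    return abs(sum(readings) / len(readings) - true_value)

rng = random.Random(42)
for n in (100, 10_000, 200_000):
    print(n, round(mean_error(n, rng), 4))
```

The estimate from the largest sample is typically much closer to the true value, even though every single reading is noisy and a tenth of them never arrived.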
"Big Data" is not really about the volume, it is about the characteristics of the data.

Characteristics of Big Data

Big Data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data.
There are 5 V’s in big data: Volume, Variety, Velocity, Variability and Veracity.

Volume

The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

Variety

The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

Velocity

In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Variability

Inconsistency of the data set can hamper processes to handle and manage it.

Veracity

The quality of captured data can vary greatly, affecting the accurate analysis.

Processing Big Data

Technologies like A/B testing, machine learning and natural language processing are used to process big data.
            Processing methods like predictive analytics, user behaviour analytics, or certain other advanced data analytics methods are used to extract value from data and use it to optimize the performance of a particular firm.
The only major problem with processing big data is that such a huge amount of data requires a lot of processing power, which can be met by using advanced computing machines and software.

Uses Of Big data

Big data is used in fields such as:
1.    Banking and Securities
2.    Communication, Media and entertainment
3.    Healthcare Provision
4.    Education
5.    Manufacturing
6.    Discovery of Natural Resources
7.    Insurance
8.    Retail and wholesale trade
9.    Transportation
10. Energy and utilities

Analysis of data sets can find new correlations and Patterns to
·       Spot business trends,
·       Prevent diseases,
·       Combat crime.

Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including:
  •        Internet search,
  •        Fintech,
  •        Urban informatics,
  •        Business informatics.

Conclusion:
  •       There is substantial real spending around big data.
  •       To capitalize on big data opportunities, you need to:
            •        Familiarize yourself with and understand industry-specific challenges
            •        Understand or know the data characteristics of each industry
            •        Understand where spending is occurring
            •        Match market needs with your own capabilities and solutions
  •       Vertical industry expertise is key to utilizing big data effectively and efficiently.



Author: Keyur Dhavle

Thursday, 14 September 2017

SIMPLE LINEAR REGRESSION WITH KNIME IRIS DATA SET





ABOUT KNIME:
KNIME (pronounced /naɪm/), the Konstanz Information Miner, is an open source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface allows assembly of nodes for data pre-processing (ETL: Extraction, Transformation, Loading), for modelling and for data analysis and visualization. To some extent KNIME can be considered an alternative to SAS.
KNIME allows users to visually create data flows (or pipelines), selectively execute some or all analysis steps, and later inspect the results, models, and interactive views. KNIME is written in Java and based on Eclipse and makes use of its extension mechanism to add plugins providing additional functionality. The core version already includes hundreds of modules for data integration (file I/O, database nodes supporting all common database management systems through JDBC), data transformation (filter, converter, combiner) as well as the commonly used methods for data analysis and visualization. With the free Report Designer extension, KNIME workflows can be used as data sets to create report templates that can be exported to document formats like doc, ppt, xls, pdf and others.



CAPABILITIES OF KNIME:

KNIME's core architecture allows processing of large data volumes that are only limited by the available hard disk space (most other open source data analysis tools work in main memory and are therefore limited to the available RAM). For example, KNIME allows analysis of 300 million customer addresses, 20 million cell images and 10 million molecular structures.
Additional plugins allow the integration of methods for text mining and image mining, as well as time series analysis.
KNIME integrates various other open-source projects, e.g. machine learning algorithms from Weka, the statistics package R, as well as LIBSVM, JFreeChart, ImageJ etc.
KNIME is implemented in Java but also allows for wrappers calling other code, in addition to providing nodes that can run Java, Python, Perl and other code fragments.

COMPARISON:

Python: Having originated as an open source scripting language, Python has grown in usage over time. Today it sports libraries (numpy, scipy and matplotlib) and functions for almost any statistical operation or model building you may want to do. Since the introduction of pandas, it has become very strong in operations on structured data.
Python is a programming language that is popularly used for data mining tasks. Programming languages require you to give the computer very detailed, step-by-step instructions of what to do. Memorizing those programming statements is a good deal of what "learning to program" consists of. You can use its add-on packages to minimize your programming effort, but you're still doing some programming.
SAS: SAS has been the undisputed market leader in the commercial analytics space. The software offers a huge array of statistical functions, has a good GUI (Enterprise Guide & Miner) for people to learn quickly and provides awesome technical support. However, it ends up being the most expensive option and is not always enriched with the latest statistical functions.
R: R is the open source counterpart of SAS, and has traditionally been used in academia and research. Because of its open source nature, the latest techniques get released quickly. There is a lot of documentation available on the internet and it is a very cost-effective option. R is easy to get started with, but needs around a week of initial reading first.
KNIME: KNIME is a workflow-based package that tries to give you most of the flexibility and power of programming without having to know how to program. Its workflow style is easy to use: you drag and drop icons that represent steps of the analysis onto a drawing window. What each icon does is controlled by dialog boxes rather than by commands you have to remember. When finished, the workflow
1) accomplishes the tasks,
2) documents the steps for reproducibility,
3) shows you the big picture of what was done and
4) allows you to reuse the steps on new sets of data without resorting to any underlying programming code (as menu-based user interfaces such as SPSS often require).
A particularly nice feature of KNIME is that it allows you to add nodes to your workflow that contain custom programming. This allows you to combine the two approaches, making the most of each.

NOTE:
KNIME needs a basic understanding of the dataset and some logical thinking before you get into the analysis. It makes analysis work much easier, since you do not have to remember the algorithms.

Iris flower data set

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.



KNIME ANALYTICS PLATFORM

Using the Simple Regression Tree for regression in the KNIME Analytics Platform.

Node Repository:

The node repository contains all KNIME nodes ordered in categories. A category can contain another category, for example, the Read category is a subcategory of the IO category. Nodes are added from the repository to the workflow editor by dragging them to the workflow editor. Selecting a category displays all contained nodes in the node description view; selecting a node displays the help for this node. If you know the name of a node you can enter parts of the name into the search box of the node repository. As you type, all nodes are filtered immediately to those that contain the entered text in their names:


Drag and drop the File Reader from the node repository into the workflow editor.

Workflow Editor

The workflow editor is used to assemble workflows, configure and execute nodes, inspect the results and explore your data. This section describes the interactions possible within the editor.

File Reader:

This node can be used to read data from an ASCII file or URL location. It can be configured to read various formats. When you open the node's configuration dialog and provide a filename, it tries to guess the reader's settings by analyzing the content of the file. Check the results of these settings in the preview table. If the data shown is not correct or an error is reported, you can adjust the settings manually.
The file analysis runs in the background and can be cut short by clicking the "Quick scan" button, which is shown if the analysis takes longer. In this case the file is not analyzed completely; only the first fifty lines are taken into account. The preview may then look fine while the execution of the File Reader fails when it reads lines it didn't analyze. It is therefore recommended that you check the settings when you cut an analysis short. Load the Iris data set into the File Reader through its configuration dialog.

Configure:

When a node is dragged to the workflow editor or is connected, it usually shows the red status light indicating that it needs to be configured, i.e. the dialog has to be opened. This can be done by either double-clicking the node or by right-clicking the node to open the context menu. The first entry of the context menu is "Configure", which opens the dialog. If the node is selected you can also choose the related button from the toolbar above the editor. The button looks like the icon next to the context menu entry.

Partitioning:

Then drag and drop the Partitioning node from the node repository into the workflow editor. This is done to divide the Iris data set into training and testing data. The input table is split row-wise into two partitions, e.g. train and test data. The two partitions are available at the two output ports. The split is configured in the node's dialog.
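For readers who think in code, the row-wise split can be sketched in Python. This is not KNIME's implementation: it assumes the rows are drawn at random with a fixed seed, and it uses the 80%/20% split and the 150 rows of the Iris data set mentioned elsewhere in this post.

```python
import random

def partition(rows, train_fraction=0.8, seed=1):
    """Split rows into a training and a testing partition, drawn at random."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(150))  # stand-in for the 150 rows of the Iris data set
train, test = partition(rows)
print(len(train), len(test))  # 120 30
```

Every row ends up in exactly one of the two partitions, which is what the node's two output ports deliver.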


Simple Regression Tree Learner:

Drag and drop the Simple Regression Tree Learner into the workflow editor. It learns a single regression tree. The procedure follows the algorithm described in CART ("Classification and Regression Trees", Breiman et al., 1984), although the current implementation applies a couple of simplifications, e.g. no pruning, not necessarily binary trees, etc.
The missing value handling currently used also differs from the one used in CART. In each split the algorithm tries to find the best direction for missing values by sending them in each direction and selecting the one that yields the best result (i.e. the largest gain). The procedure is adapted from the well-known XGBoost algorithm.


Simple Regression Tree Predictor:

Drag and drop the Simple Regression Tree Predictor from the node repository into the workflow editor. It applies regression from a regression tree model by using the mean of the records in the corresponding child node.
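What the learner and predictor pair does can be sketched in a few lines of Python. This is a simplified illustration and not the KNIME code: it builds binary splits only, has no missing value handling, and the `min_size` and `max_depth` parameters as well as the toy petal measurements are my own assumptions. The prediction for a row is the mean of the training records in the leaf it falls into.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def learn_tree(rows, targets, min_size=2, max_depth=3, depth=0):
    """Learn a regression tree; every node stores the mean of its records."""
    node = {"prediction": mean(targets)}
    if depth >= max_depth or len(rows) < 2 * min_size:
        return node
    best = None
    for j in range(len(rows[0])):                      # every feature
        for s in sorted({r[j] for r in rows})[1:]:     # every candidate split
            left = [t for r, t in zip(rows, targets) if r[j] < s]
            right = [t for r, t in zip(rows, targets) if r[j] >= s]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, s)
    if best is None or best[0] >= sse(targets):
        return node  # no split improves the fit
    _, j, s = best
    li = [i for i, r in enumerate(rows) if r[j] < s]
    ri = [i for i, r in enumerate(rows) if r[j] >= s]
    node["feature"], node["split"] = j, s
    node["left"] = learn_tree([rows[i] for i in li], [targets[i] for i in li],
                              min_size, max_depth, depth + 1)
    node["right"] = learn_tree([rows[i] for i in ri], [targets[i] for i in ri],
                               min_size, max_depth, depth + 1)
    return node

def predict(node, row):
    """Walk down to a leaf and return the mean stored there."""
    while "feature" in node:
        node = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return node["prediction"]

# Hypothetical toy data: petal length (feature) and petal width (target).
rows = [[1.0], [1.2], [1.4], [1.5], [4.0], [4.5], [5.0], [5.5]]
targets = [0.2, 0.2, 0.2, 0.3, 1.3, 1.5, 1.8, 2.0]
tree = learn_tree(rows, targets, min_size=2, max_depth=1)
print(tree["split"], predict(tree, [1.3]), predict(tree, [5.2]))
```

With a depth of one the tree simply separates the short petals from the long ones and predicts the average width of each group.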


Column Filter:

Drag and drop the Column Filter from the node repository into the workflow editor. This node allows columns to be filtered from the input table while only the remaining columns are passed to the output table. Within the dialog, columns can be moved between the Include and Exclude lists.



Line Plot:

Plots the numeric columns of the input table as lines. All values are mapped to a single y coordinate. This may distort the visualization if the difference of the values in the columns is large.
Only columns with a valid domain are available in this view. Make sure that the predecessor node is executed, or set the domain with the Domain Calculator node. Drag and drop the Line Plot from the node repository into the workflow editor.


Numeric Scorer:

This node computes certain statistics between a numeric column's values (ri) and predicted (pi) values. It computes R²=1-SSres/SStot=1-Σ(pi-ri)²/Σ(ri-1/n*Σri)² (can be negative!), mean absolute error (1/n*Σ|pi-ri|), mean squared error (1/n*Σ(pi-ri)²), root mean squared error (sqrt(1/n*Σ(pi-ri)²)), and mean signed difference (1/n*Σ(pi-ri)). The computed values can be inspected in the node's view and/or further processed using the output table. Drag and drop the Numeric Scorer from the node repository into the workflow editor.
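The formulas above translate almost word for word into Python. The sketch below is my own transcription of them, not the node's source code, and the four actual and predicted values are made up so the arithmetic is easy to follow by hand.

```python
import math

def numeric_scores(actual, predicted):
    """Compute the statistics listed above from actual (ri) and predicted (pi) values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((p - r) ** 2 for r, p in zip(actual, predicted))
    ss_tot = sum((r - mean_actual) ** 2 for r in actual)
    mse = ss_res / n
    return {
        "r2": 1 - ss_res / ss_tot,  # can be negative for a poor model
        "mae": sum(abs(p - r) for r, p in zip(actual, predicted)) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mean_signed_difference": sum(p - r for r, p in zip(actual, predicted)) / n,
    }

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
scores = numeric_scores(actual, predicted)
print(scores)
```

Here one prediction is 0.5 too high and one is 0.5 too low, so the mean signed difference is zero even though the mean absolute error is not. That is why the signed and absolute versions are reported separately.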


Connection of all the nodes in the workflow editor to perform the regression with the Simple Regression Tree nodes:

Connections:

You can connect two nodes by dragging the mouse from the out-port of one node to the in-port of another node. Loops are not permitted. If a node is already connected you can replace the existing connection by dragging a new connection onto it. If the node is already connected you will be asked to confirm the resulting reset of the target node. You can also drag the end of an existing connection to a new in-port (either of the same node or to a different node).

Execute:

In the next step, you probably want to execute the node, i.e. you want the node to actually perform its task on the data. To achieve this right-click the node in order to open the context menu and select "Execute". You can also choose the related button from the toolbar. The button looks like the icon next to the context menu entry. It is not necessary to execute every single node: if you execute the last node of connected but not yet executed nodes, all predecessor nodes will be executed before the last node is executed.

Execute All:

In the toolbar above the editor there is also a button to execute all not yet executed nodes on the workflow.
This also works if a node in the flow is lit with the red status light due to missing information in the predecessor node. When the predecessor node is executed and the node with the red status light can apply its settings it is executed as well as its successors. The underlying workflow manager also tries to execute branches of the workflow in parallel.


Execute and Open View:

The node context menu also contains the "Execute and open view" option. This executes the node and immediately opens the view. If a node has more than one view, only the first view is opened.

In the workflow editor, when you open the view of the Partitioning node, you can see the first partition, the training data, which contains 80% of the Iris data set, and the second partition, the testing data, which contains 20%.


When you open the view of the Simple Regression Tree Learner node, you can see the decision tree in tabulated form. Clicking the plus symbol (+) expands the tree and clicking the minus symbol (-) collapses it. You can also adjust the zoom from 60% to 120%.


When you open the view of the Simple Regression Tree Predictor, it displays the predicted values for the testing data set.


When you open the view of the Column Filter, you can see that the other columns have been filtered out and removed. Only petal width and the prediction (petal width) are kept, and these are used to plot a line graph.

When you open the view of the Line Plot, you can see that it draws a line plot to visualize the performance of the simple regression tree.

Click "Fit to size" to fit the plot within the screen for a better visualization. You can also change the colour by clicking "Background colour".
When you open the view of the Numeric Scorer node, it displays the statistics that score the prediction.


INTERPRETATION:

The mean absolute error of the prediction (petal width) is 0.147. In statistics, the mean absolute error (MAE) is a quantity used to measure how close forecasts or predictions are to the eventual outcomes.

The mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations, that is, the differences between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss. The difference occurs because of randomness or because the estimator doesn't account for information that could produce a more accurate estimate. The MSE is a measure of the quality of an estimator: it is always non-negative, and values closer to zero are better. Here the MSE is 0.04.

The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed. The RMSD represents the sample standard deviation of the differences between predicted values and observed values. These individual differences are called residuals when the calculations are performed over the data sample that was used for estimation, and are called prediction errors when computed out-of-sample. The RMSD serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSD is a measure of accuracy used to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent. Although RMSE is one of the most commonly reported measures of disagreement, it is the square root of the average of squared errors, and thus confounds information concerning average error with information concerning variation in the errors.
The effect of each error on RMSD is proportional to the size of the squared error, so larger errors have a disproportionately large effect on RMSD. Consequently, RMSD is sensitive to outliers.
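A quick numeric check makes the sensitivity to outliers visible. The error values below are invented for illustration: ten predictions that are each off by 0.1, and then the same ten with one of them replaced by a single large miss of 2.0.

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

base = [0.1] * 10                   # ten small, equal errors
with_outlier = [0.1] * 9 + [2.0]    # one error replaced by a large miss

mae_ratio = mae(with_outlier) / mae(base)
rmse_ratio = rmse(with_outlier) / rmse(base)
print(round(mae_ratio, 2), round(rmse_ratio, 2))
```

The single outlier raises the RMSE by a much larger factor than the MAE, because its error is squared before averaging.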


CONCLUSION:

KNIME lets users visually create data flows (or pipelines), selectively execute some or all analysis steps, and later inspect the results, models, and interactive views. As described earlier, its core architecture handles data volumes limited only by the available hard disk space, additional plugins add text mining, image mining and time series analysis, and it integrates other open-source projects such as Weka, R, LIBSVM, JFreeChart, ImageJ and the Chemistry Development Kit.

Overall, this is a very sophisticated and professional piece of software. Because of its flexibility, it is nowadays our chief cheminformatics workhorse, and voting with one’s feet is surely the best possible endorsement. The KNIME philosophy and business model of mixed commercial and free (but open) software allows its continued improvement while making it freely available to desktop users. One minor gripe is that it seems only to read but not write .xlsx files; we are confident that someone will write a node to let it do so soon. There is a substantial community of users, increasing all the time, and many training schools and the like. Because of this, I think it will continue to grow in popularity. It is well worth a look for the GP community.
