Information about Data Science,Analytics,Big data Technologies,Statistics and latest news in DATA SCIENCE field.

Wednesday, 13 June 2018

5 things to know about Data in the real world:




5 things to know about Data in the real world: 1. It is messy and dirty. Real world data often has issues like missing values, wrongly inputted data, data type mismatch etc. 2. It is disorganized. If you're lucky you get your data from one source directly but many companies have multiple data storage points. 3. Excel sheets and PDFs contain a lot more important data than you can imagine. Parsing them and storing them in machine readable form is a task. 4. Data collection at source is a task as well. Lots of mechanisms have to be put in place to accurately gather data. 5. In research, often datasets are either in the form of small databases or CSV files. In the industry, right infrastructure has to be in place for storing the data in a way that enables easy retrieval. What do you think? #data #datascience
Share:

Tuesday, 12 June 2018

Today In The World Of Analytics




Today In The World Of Analytics:* Databricks releases open source machine learning platform MLflow aimed to standardize ML workflows https://goo.gl/6YZkMz The amazing way Zomato uses data science for success https://goo.gl/ahPofp Deepinder Dhingra of Noodle AI wants to revolutionise organisations with a 10x approach to EnterpriseAI https://goo.gl/26Lhhg Do deep learning networks scale effectively? https://goo.gl/6YZkMz How does Project Aiur, an open source AI-engine substantiate scientific knowledge https://goo.gl/nANA9Z Apple WWDC 2018: 7 things around machine learning announced this year https://goo.gl/tua7yD
Share:

Monday, 11 June 2018

For People starting a Career in Data Science



Why becoming a good Data Analyst is paramount on your path to becoming a Data Scientist I see a lot of recent graduates and people who are thinking of switching career want to be #DataScientists because of latest uptick in media mentions as well as 6 figure #salary potential They think becoming a data scientist involves grasping good command over 10 #MachineLearning algorithms using #Titanic or #Iris dataset and they are ready to tackle real world challenges. IT DOESN'T WORK LIKE THAT! I have been in #data industry for over 2 years now and I still feel there is so much more to do, you know why? Because, there are times when the business problem presented in front of me involves everything data without the scientists part. I am responsible for Data Gathering, Data Cleaning and EDA. Once the data is ready, that's when you are supposed to do the scientists part. Data Gathering, EDA, Data Cleaning, Data Visualization are all part of being a good data analyst. Your ability to take the business problem, understanding the underlying data and making "descriptive" recommendations is what separates you from an "analyst" and a "good analyst" I spend 50-70% of my time identifying my data sources. So focus on becoming a good analyst first before deciding if you are ready to be scientists.
Share:

Saturday, 19 May 2018

MACHINE LEARNING IN TALENT ACQUISITION


With the increasing demand for technical talent, it’s no surprise that the recruiting arms race continues to push companies to create new ways of attracting the very best people… One of the newest weapons in the arsenal in machine learning.


What is Machine Learning?
Machine learning is giving computers the ability to learn without explicit programming. This is primarily accomplished through various pattern recognition processes. For example, if you tell a computer to find the best candidates for a job by identifying patterns in data that produce the best results. In this way, a computer will find correlations and patterns that a human would overlook — leading to increasingly higher quality candidates.Beyond that, the possibilities are infinite! Imagine that we’re able to predict the crash of a certain market, and know that there will be a surplus of available candidates in that specific field, you can then focus your efforts as a recruiter to take advantage of that situation. The ability to predict upticks and downticks in a given market is a revolutionary new possibility thanks to machine learning. It would give recruiters a serious advantage over their competitors and would allow them to reach out and make connections much earlier in the recruiting process.

How is machine learning changing recruiting?

The biggest issue for recruiters right now is that we all have these massive networks but we haven’t really had an effective way to leverage those connections without committing a significant amount of time and resources. If you have 5,000 LinkedIn connections, what does that do except allow you to brag about it?This is where machine learning comes into play. Recruiters can start to recognize pure data points of candidates’ contact information, their profile, their work history, etc. and be able to match those with opportunities. Machine learning does not automatically select the best candidate, instead it narrows the field of search and allows us to focus on analyzing the intangibles. From this, a stronger hire is made, leading to a greater R.O.I. as well as a high L.T.V. from each candidate.Going deeper, on top of quickly sorting through all of this data, machine learning will be able to take a broader view of trends in specific industries and even specific job titles.For example, machine learning could determine that a certain developer has been at their job for a year and a half and there is a, let’s say, 98% chance that they will leave their job in the next three months. That is a huge insight that can be determined very quickly with machine learning. There is an enormous R.O.I. for every singular effort that the recruiter makes. Time equals money, and machine learning will save recruiters unimaginable amounts of time.


Today, machine learning is emerging as a strategy to help employers more efficiently conduct talent sourcing and recruitment.
The biggest issue for recruiters right now is that we all have these massive networks but we haven’t really had an effective way to leverage those connections without committing a significant amount of time and resources. If you have 5,000 LinkedIn connections, what does that do except allow you to brag about it?This is where machine learning comes into play. Recruiters can start to recognize pure data points of candidates’ contact information, their profile, their work history, etc. and be able to match those with opportunities. Machine learning does not automatically select the best candidate, instead it narrows the field of search and allows us to focus on analyzing the intangibles. From this, a stronger hire is made, leading to a greater R.O.I. as well as a high L.T.V. from each candidate.Going deeper, on top of quickly sorting through all of this data, machine learning will be able to take a broader view of trends in specific industries and even specific job titles.For example, machine learning could determine that a certain developer has been at their job for a year and a half and there is a, let’s say, 98% chance that they will leave their job in the next three months. That is a huge insight that can be determined very quickly with machine learning. There is an enormous R.O.I. for every singular effort that the recruiter makes. Time equals money, and machine learning will save recruiters unimaginable amounts of time.
To gauge the role of machine learning in recruitment and hiring, we researched this sector in depth to help answer questions business leaders are asking today, including:

·         What types of AI applications are currently in use for recruitment and hiring?
·         How has the market responded to these AI applications?
·         Are there any common trends among these innovation efforts – and how could these trends possibly affect the future of recruitment and hiring?

In this article we break down applications of artificial intelligence in the domain of recruitment and hiring to provide business leaders with an understanding of current and emerging trends that may impact their sector. We’ll begin with a synopsis of the sectors we covered:
Recruitment and Hiring AI Applications Overview

Based on our assessment of the applications in this sector, the majority of talent acquisition use-cases appear to fall into three major categories:

1.     Talent Recruitment: Companies are training machine learning algorithms to help employers automate repetitive aspects of the recruitment process such as resume and application review
2.     Talent Sourcing: Companies are using machine learning to help identify top candidates from large candidate pools.
3.     Candidate Screening and Engagement: Companies are developing AI assistants to pre-screen candidates and to respond to inquiries regarding positions using natural language processing.
we’ll explore the AI applications of each application by section and provide representative examples.

 





Share:

Monday, 9 April 2018

DEGREE OF FREEDOM IN STATISTICS




In statistics, degrees of freedom (DF) is a difficult concept to explain. However, it is an important idea that appears in many different contexts throughout statistics including hypothesis tests, probability distributions, and regression analysis.  Learn more about degrees of freedom in this blog post!

I’ll start by defining degrees of freedom. However, I’ll quickly move on to practical examples in a variety of contexts because they make this concept easier to understand.

Definition of Degrees of Freedom

Degrees of freedom are the number of independent values that a statistical analysis can estimate. You can also think of it as the number of values that are free to vary as you estimate parameters. I know, it’s starting to sound a bit murky!

Degrees of freedom encompasses the notion that the amount of independent information you have limits the number of parameters that you can estimate. Typically, the degrees of freedom equal your sample size minus the number of parameters you need to calculate during an analysis. It is usually a positive whole number.

Degrees of freedom is a combination of how much data you have and how many parameters you need to estimate. It indicates how much independent information goes into a parameter estimate. In this vein, it’s easy to see that you want a lot of information to go into parameter estimates to obtain more precise estimates and more powerful hypothesis tests. So, you want many degrees of freedom!

Independent Information and Restrictions on Values

The definitions talk about independent information. You might think this refers to the sample size, but it’s a little more complicated than that. To understand why, we need talk about the freedom to vary. The best way to illustrate this concept is with an example.

Suppose we collect the random sample of observations shown below. Now, imagine that we know the mean, but we don’t know the value of an observation—the X in the table below.

Value -6 8 5 9 6 8 4 11 7 X

The mean is 6.9, and it is based on 10 values. So, we know that the values must sum to 69 based on the equation for the mean.

Using simple algebra (64 + X = 69), we know that X must equal 5.

 Estimating Parameters Imposes Constraints on the Data


As you can see, that last number has no freedom to vary. It is not an independent piece of information because it cannot be any other value. Estimating the parameter, the mean in this case, imposes a constraint on the freedom to vary. The last value and the mean are entirely dependent on each other. Consequently, after estimating the mean, we have only 9 independent pieces of information even though our sample size is 10.

That’s the basic idea for degrees of freedom in statistics. In a general sense, DF are the number of observations in a sample that are free to vary while estimating statistical parameters. You can also think of it as the amount of independent data that you can use to estimate a parameter.
Share:

Friday, 23 February 2018

A COMPLETE LOOK AT "BLOCKCHAIN" TECHNOLOGY






                     Blockchain is a shared, distributed ledger that facilitates the process of recording transactions and tracking assets in a business network. An asset can be tangible — a house, a car, cash, land — or intangible like intellectual property, such as patents, copyrights, or branding. Virtually anything of value can be tracked and traded on a blockchain network, reducing risk and  cutting costs for all involved.

One solution that has been developed to address the complexities, vulnerabilities, inefficiencies, and costs of current transaction systems is bitcoin — a digital currency that was launched in 2009 by a mysterious person (or persons) known only by the pseudonym Satoshi Nakamoto. Unlike traditional currencies, which are issued by central banks, bitcoin has no central monetary authority. No one controls it. Bitcoins aren’t printed like dollars or euros; they’re “mined” by people and increasingly by businesses, running computers all around the world, using software that solves mathematical puzzles. Rather than rely on a central monetary authority to monitor, verify, and approve transactions and manage the money supply, bitcoin is enabled by a peer-to-peer computer network made up of its users’ machines, akin to the networks that underpin BitTorrent and Skype.

Bitcoin is actually built on the foundation of blockchain, which serves as bitcoin’s shared ledger. Think of blockchain as an operating system, such as Microsoft Windows or MacOS, and bitcoin as only one of the many applications that can be run on that operating system. Blockchain provides the means for recording bitcoin transactions — the shared ledger — but this shared ledger can be used to record any transaction and track the movement of any asset whether tangible, intangible, or digital. For example, blockchain enables securities to be settled in minutes instead of days. It can also be used to help companies manage the flow of goods and related payments, or enable manufacturers to share production logs with original equipment manufacturers (OEMs) and regulators to reduce product recalls. The takeaway lesson: Bitcoin and blockchain are not the same.
Blockchain provides the means to record and store bitcoin transactions, but blockchain has many uses beyond bitcoin. Bitcoin is only the first use case for blockchain.

For example..

Imagine, Joe is your best friend. He is traveling overseas, and on the fifth day of his vacation, he calls you and says, “Dude, I need some money. I have run out of it.”

You reply, “Sending some right away,” and hung up.
You then call your account manager at your bank and tell him, “Please transfer $1000 from my account to Joe’s account.”

Your account manager replies, “Yes, sir.”

He opens up the register, checks your account balance to see if you have enough balance to transfer $1000 to Joe. Because you’re a rich man, you have plenty; thus, he makes an entry in the register like the following:
Note: We’re not talking about computers only to keep things simple.

You call Joe and tell him, “I’ve transferred the money. Next time, you’d go to your bank, you can withdraw the $1000 that I have just transferred.”
What just happened? You and Joe both trusted the bank to manage your money. There was no real movement of physical bills to transfer the money. All that was needed was an entry in the register. Or more precisely, an entry in the register that neither you nor Joe controls or owns.

And that is the problem of the current systems.

To establish trust between ourselves, we depend on individual third-parties.
For years, we’ve depended on these middlemen to trust each other. You might ask, “what is the problem depending on them?”

The problem is that they are singular in number. If a chaos has to be injected in the society, all it requires is one person/organization to go corrupt, intentionally or unintentionally.

What if that register in which the transaction was logged gets burnt in a fire?
What if, by mistake, your account manager had written $1500 instead of $1000?
What if he did that on purpose?
For years, we have been putting all our eggs in one basket and that too in someone else’s.
Could there be a system where we can still transfer money without needing the bank?

To answer this question, we’ll need to drill down further and ask ourselves a better question (after all, only better questions lead to better answers).

Think about it for a second, what does transferring money means? Just an entry in the register. The better question would then be —

Is there a way to maintain the register among ourselves instead of someone else doing it for us?

Now, that is a question worth exploring. And the answer is what you might have already guessed. The blockchain is the answer to the profound question.

It is a method to maintain that register among ourselves instead of depending on someone else to do it for us.

Are you still with me? Good. Because now, when several questions have started popping in your mind, we will learn how this distributed register works.

Yes, but tell me, how does it work?

The requirement of this method is that there must be enough people who would like not to depend on a third-party. Only then this group can maintain the register on their own.

“It might make sense just to get some Bitcoin in case it catches on. If enough people think the same way, that becomes a self-fulfilling prophecy.” — Satoshi Nakamoto in 2009
How many are enough? At least three. For our example, we will assume ten individuals want to give up on banks or any third-party. Upon mutual agreement, they have details of each other’s accounts all the time — without knowing the other’s identity.


An Empty Folder
Everyone contains an empty folder with themselves to start with. As we’ll progress, all these ten individuals will keep adding pages to their currently empty folders. And this collection of pages will form the register that tracks the transactions.

When A Transaction Happens
Next, everyone in the network sits with a blank page and a pen in their hands. Everyone is ready to write any transaction that occurs within the system.

Now, if #2 wants to send $10 to #9.



To make the transaction, #2 shouts and tells everyone, “I want to transfer $10 to #9. So, everyone, please make a note of it on your pages.”
Everyone checks whether #2 has enough balance to transfer $10 to #9. If she has enough balance, everyone then makes a note of the transaction on their blank pages.
The transaction is then considered to be complete.

Transactions Continue Happening
As the time passes, more people in the network feel the need to transfer money to others. Whenever they want to make a transaction, they announce it to everyone else. As soon as a person listens to the announcement, (s)he writes it on his/her page.

This exercise continues until everyone runs out of space on the current page. Assuming a page has space to record ten transactions, as soon as the tenth transaction is made, everybody runs out of the space.
It’s time to put the page away in the folder and bring out a new page and repeat the process from the step 2 above.

Putting Away The Page
Before we put away the page in our folders, we need to seal it with a unique key that everyone in the network agrees upon. By sealing it, we will make sure that no one can make any changes to it once its copies have been put away in everyone’s folder — not today, not tomorrow and not even after a year. Once in the folder, it will always stay in the folder — sealed. Moreover, if everyone trusts the seal, everyone trusts the contents of the page. And this sealing of the page is the crux of this method.

[Jargon Box] It is called ‘mining’ on the page to secure it, but for the simplicity of it, we’ll keep calling it ‘sealing.’

Earlier the third-party/middleman gave us the trust that whatever they have written in the register will never be altered. In a distributed and decentralized system like ours, this seal will provide the trust instead.
Interesting! How do we seal the page then?

Before we learn how we can seal the page, we’ll know how the seal works, in general. And as a pre-requisite to it is learning about something that I like to call…

The Magic Machine

Imagine a machine surrounded by thick walls. If you send a box with something inside it from the left, it will spit out a box containing something else.

[Jargon Box] This machine is called ‘Hash Function,’ but we aren’t in a mood to be too technical. So, for today, these are ‘The Magic Machines.’



Suppose, you send the number 4 inside it from the left, we’d find that it spat out the following word on its right: ‘dcbea.’


How did it convert the number 4 to this word? No one knows. Moreover, it is an irreversible process. Given the word, ‘dcbea,’ it is impossible to tell what the machine was fed on the left. But every time you’d feed the number 4 to the machine, it will always spit out the same word, ‘dcbea.’
What if I ask you the following question now:

“Can you tell me what should I send from the left side of the machine such that I get a word that starts with three leading zeroes from the right side of it? For example, 000ab or 00098 or 000fa or anything among the others.”


Think about the question for a moment.

I’ve told you the machine has a property that we cannot calculate what we must send from the left after we’re given the expected output on the right. With such a machine given to us, how can we answer the question I asked?

I can think of one method. Why not try every number in the universe one by one until we get a word that starts with three leading zeroes?


Being optimistic, after several thousand attempts, we’ll end up with a number that will yield the required output on the right.
It was extremely difficult to calculate the input given the output. But at the same time, it will always be incredibly easy to verify if the predicted input yields the required output. Remember that the machine spits out the same word for a number every time.

How difficult do you think the answer is if I give you a number, say 72533, and ask you the question, “Does this number, when fed into the machine, yields a word that starts with three leading zeroes?”

All you need to do is, throw the number in the machine and see what did you get on the right side of it. That’s it.

The most important property of such machines is that — “Given an output, it is extremely difficult to calculate the input, but given the input and the output, it is pretty easy to verify if the input leads to the output.”

We’ll remember this one property of the Magic Machines (or Hash Functions) through the rest of the post:

Given an output, it is extremely difficult to calculate the input, but given an input and output, it is pretty easy to verify if the input leads to the output.
How to use these machines to seal a page?

We’ll use this magic machine to generate a seal for our page. Like always, we’ll start with an imaginary situation.

Imagine I give you two boxes. The first box contains the number 20893. I, then, ask you, “Can you figure out a number that when added to the number in the first box and fed to the machine will give us a word that starts with three leading zeroes?”


This is a similar situation as we saw previously and we have learned that the only way to calculate such a number is by trying every number available in the entire universe.

After several thousand attempts, we’ll stumble upon a number, say 21191, which when added to 20893 (i.e. 21191 + 20893 = 42084) and fed to the machine, will yield a word that satisfies our requirements.


In such a case, this number, 21191 becomes the seal for the number 20893. Assume there is a page that bears the number 20893 written on it. To seal that page (i.e. no one can change the contents of it), we will put a badge labeled ‘21191’ on top of it. As soon as the sealing number (i.e. 21191) is stuck on the page, the page is sealed.
[Jargon Box] The sealing number is called ‘Proof Of Work,’ meaning that this number is the proof that efforts had been made to calculate it. We are good with calling it ‘sealing number’ for our purposes.

If anyone wants to verify whether the page was altered, all he would have to do is — add the contents of the page with the sealing number and feed to the magic machine. If the machine gives out a word with three leading zeroes, the contents were untouched. If the word that comes out doesn’t meet our requirements, we can throw away the page because its contents were compromised, and are of no use.

We’ll use a similar sealing mechanism to seal all our pages and eventually arrange them in our respective folders.

Finally, sealing our page…



To seal our page that contains the transactions of the network, we’ll need to figure out a number that when appended to the list of transactions and fed to the machine, we get a word that starts with three leading zeroes on the right.
Note: I have been using the phrase ‘word starting with three leading zeroes’ only as an example. It illustrates how Hashing Functions work. The real challenges are much more complicated than this.

Once that number is calculated after spending time and electricity on the machine, the page is sealed with that number. If ever, someone tries to change the contents of the page, the sealing number will allow anyone to verify the integrity of the page.

Now that we know about sealing the page, we will go back to the time when we had finished writing the tenth transaction on the page, and we ran out of space to write more.

As soon as everyone runs out of the page to write further transactions, they indulge in calculating the sealing number for the page so that it can be tucked away in the folder. Everyone in the network does the calculation. The first one in the network to figure out the sealing number announces it to everyone else.


Immediately on hearing the sealing number, everyone verifies if it yields the required output or not. If it does, everyone labels their pages with this number and put it away in their folders.

But what if for someone, say #7, the sealing number that was announced doesn’t yield the required output? Such cases are not unusual. The possible reasons for this could be:

He might have misheard the transactions that were announced in the network
He might have miswritten the transactions that were announced in the network
He might have tried to cheat or be dishonest when writing transactions, either to favor himself or someone else in the network
No matter what the reason is, #7 has only one choice — to discard his page and copy it from someone else so that he too can put it in the folder. Unless he doesn’t put his page in the folder, he cannot continue writing further transactions, thus, forbidding him to be part of the network.

Whatever sealing number the majority agrees upon, becomes the honest sealing number.
Then why does everyone spend resources doing the calculation when they know that someone else will calculate and announce it to them? Why not sit idle and wait for the announcement?

Great question. This is where the incentives come in the picture. Everyone who is the part of the Blockchain is eligible for rewards. The first one to calculate the sealing number gets rewarded with free money for his efforts (i.e. expended CPU power and electricity).

Simply imagine, if #5 calculates the sealing number of a page, he gets rewarded with some free money, say $1, that gets minted out of thin air. In other words, the account balance of #5 gets incremented with $1 without decreasing anyone else’s account balance.

That’s how Bitcoin got into existence. It was the first currency to be transacted on a Blockchain (i.e. distributed registers). And in return, to keep the efforts going on in the network, people were awarded Bitcoins.

When enough people possess Bitcoins, they grow in value, making other people wanting Bitcoins; making Bitcoins grow in value even further; making even more people wanting Bitcoins; making them grow in value even further; and so on.

The rewards make everyone keep working in the network.
And once everyone tucks away the page in their folders, they bring out a new blank page and repeat the whole process all over again — doing it forever.

[Jargon Box] Think of a single page as a Block of transactions and the folder as the Chain of pages (Blocks), therefore, turning it into a Blockchain.

And that, my friends, is how Blockchain works.

Except that there’s one tiny thing I didn’t tell you. Yet.

Imagine there are five pages in the folder already — all sealed with a sealing number. What if I go back to the second page and modify a transaction to favor myself? The sealing number will let anyone detect the inconsistency in the transactions, right? What if I go ahead and calculate a new sealing number too for the modified transactions and label the page with that instead?

To prevent this problem of someone going back and modifying a page (Block) as well as the sealing number, there’s a little twist to how a sealing number is calculated.

Protecting modifications to the sealing numbers

Remember how I told you that I had given you two boxes — one containing the number 20893 and another empty for you to calculate? In reality, to calculate the sealing number in a Blockchain, instead of two boxes, there are three — two pre-filled and one to be calculated.

And when the contents of all those three boxes are added and fed to the machine, the answer that comes out from the right side must satisfy the required conditions.

We already know that one box contains the list of transactions and one box will contain the sealing number. The third box contains the output of the magic machine for the previous page.



With this neat little trick, we have made sure that every page depends on its previous page. Therefore, if someone has to modify a historical page, he would also have to change the contents and the sealing number of all the pages after that, to keep the chain consistent.

If one individual, out of the ten we imagined in the beginning, tries to cheat and modify the contents of the Blockchain (the folder containing the pages with the list of transactions), he would have to adjust several pages and also calculate the new sealing numbers for all those pages. We know how difficult it is to calculate the sealing numbers. Therefore, one dishonest guy in the network cannot beat the nine honest guys.

What will happen is, from the page the dishonest guy tries to cheat, he would be creating another chain in the network, but that chain would never be able to catch up with the honest chain — simply because one guy’s efforts and speed cannot beat cumulative efforts and speed of nine. Hence, guaranteeing that the longest chain in a network is the honest chain.

Longest chain is the honest chain..


When I told you that one dishonest guy cannot beat nine honest guys, did it ring any bell in your head?

What if, instead of one, six guys turn dishonest?

In that case, the protocol will fall flat on its face. And it is known as “51% Attack”. If the majority of the individuals in the network decides to turn dishonest and cheat the rest of the network, the protocol will fail its purpose.

And that’s the only vulnerable reason why Blockchains might collapse if they ever will. Know that, it is unlikely to happen but we must all know the vulnerable points of the system. It is built on the assumption that the majority of a crowd is always honest.

And that, my friends, is all there is about Blockchains. If you ever find someone feeling left behind and wondering, “WTF is the Blockchain?” you know where you can point them to. Bookmark the link.





                     


Share:

Contact Form

Name

Email *

Message *

Follow by Email