How to solve this real business problem as a data scientist?
Original translation: 龙腾网 http://www.ltaaa.cn (please credit the source when reposting)
How would a data scientist solve this business problem?
Original question details: “Suppose I have a dataset of all the boat owners that sold a boat for the last 7 years....
If I'm trying to create a predictive model of people that own boats and are likely to be selling their boats...in the near future
What sort of data points and data sets as well as software tools would be needed”
There are two issues with your dataset.
You are trying to solve a two-outcome (binomial) classification problem.
You want to predict, based on who owns a boat today, what will be the outcome in the future - sell, or don't sell.
Unfortunately, everyone in your dataset sold their boat. This means your model will always predict a sale, because that is all it knows.
What you need is a dataset of all boat owners, regardless of whether they sold in a given year or not. Then you can start to build a meaningful classifier.
The second issue to be wary of with your dataset is that, if it is based on sales data, it may contain information that only became known after the sale event, for example the sale price. You want to make sure that this sort of information is not included in the model, since it is not known at the time we are making our predictions.
A final thought on other datasets that might be useful.
Why do people sell boats? Too expensive? They don't use it any more? Upgrading to a better boat? Moving to another city and can't take it with them?
If you can get an expert to give you a breakdown of the main reasons that people sell boats, that will help point you to the data sources that will give you the most predictive value.
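The two dataset issues raised in the answer above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`boat_age`, `years_owned`, `sale_price`, `sale_date`, `buyer_id`) and the helper functions are hypothetical and do not come from the original question.

```python
from collections import Counter

# Hypothetical column names: fields that only exist after a sale has happened.
# They must never be used as features, because they are unknown at prediction time.
POST_SALE_FIELDS = {"sale_price", "sale_date", "buyer_id"}

def label_owners(owners, sold_owner_ids):
    """Attach a binary target (1 = sold, 0 = did not sell) to every owner."""
    rows = []
    for owner in owners:
        features = {
            k: v for k, v in owner.items()
            if k not in POST_SALE_FIELDS and k != "owner_id"
        }
        rows.append((features, int(owner["owner_id"] in sold_owner_ids)))
    return rows

def has_both_classes(rows):
    """A 'sell vs. don't sell' classifier needs examples of both outcomes."""
    return len(Counter(label for _, label in rows)) == 2

# Made-up illustration data, not real records.
owners = [
    {"owner_id": 1, "boat_age": 12, "years_owned": 9, "sale_price": 18000},
    {"owner_id": 2, "boat_age": 3, "years_owned": 2, "sale_price": None},
    {"owner_id": 3, "boat_age": 7, "years_owned": 7, "sale_price": None},
]

sellers_only = label_owners(owners[:1], sold_owner_ids={1})
all_owners = label_owners(owners, sold_owner_ids={1})

print(has_both_classes(sellers_only))  # a sellers-only table has one class
print(has_both_classes(all_owners))    # all owners gives both outcomes
```

A table built only from past sellers fails the `has_both_classes` check, which is the first issue; stripping `POST_SALE_FIELDS` before training addresses the second.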
I'll try to go through all of the points you raised.
Features don't really apply here. No modelling is required to solve this problem; it might help, but you don't actually need it. Picking the top words is a straightforward statistical test. I don't know your dataset, though, or whether you are able to create the one you need.
The target value depends on what you want to test. This depends on a number of variables.
In the end you mention both models and problems, which suggests to me that you don't really know what you want to do. My best advice is: do not do it yourself.
The best solution to this is to hire a statistician or data analyst, preferably one that is able to handle observational studies.
As a data scientist, what's the most complex real-life problem you have ever solved (using data science)?
By far, problems involving human behavior and biological responses, where only a very incomplete set of predictors is available. At Kaplan, we've been able to predict with a high degree of accuracy (usually >95% with current models) which students will drop out at a given point in time and which students are at risk of failing exit exams, often using predictors I wouldn't have guessed were associated with the behavior at the start of the project.
How can a data scientist negotiate with other parties to get access to their data?
Originally Answered: What is a data scientist without data? How can you negotiate with other parties to gain access to data?
A data scientist without data is like a printer without ink - full of capability but unable to function without the required raw material.
But printers need more than ink to function well. Just as a printer is tasked with printing useful content rather than spraying ink randomly on a page, data scientists also need use cases and well-formed projects to investigate.
The art of translating business problems into data science use cases - and getting the requisite data to do so - is a crucial but underrated skill. And I suspect the lack of this skill is the cause of many failures in data science investments. Having senior executives hire a team of data scientists to declare with much fanfare ‘go forth and do AI and machine learning!’ is neither necessary nor sufficient.
After dozens of use-case searches and data acquisitions with my team, I have found that the following 5 things help in getting access to data:
Don't show off tech - show off the potential to solve problems.
Align every project to business strategy. There are multiple possible projects in every business department. But the ones that ultimately get support are usually those that most directly and significantly impact business metrics.
Run open learning sessions using data generated to approximate real data sets. There is a lot of interest in data science, machine learning and AI at the moment.
And most importantly:
Communicate with empathy to your audience.
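For the tip about running open learning sessions on generated data, here is one minimal way such a stand-in data set could be produced with the Python standard library. It is a sketch under assumptions: the `synthesize` helper and the example columns (`monthly_spend`, `segment`) are hypothetical, each column is sampled independently so correlations between columns are not preserved, and categorical values are copied from the real rows, so this is not an anonymisation technique.

```python
import random
import statistics

def synthesize(real_rows, n, seed=0):
    """Draw n fake rows that mimic the per-column statistics of real_rows.

    Numeric columns are sampled from a normal distribution fitted to the real
    column. Other columns are sampled from the observed values, so frequent
    values stay frequent. Cross-column correlations are NOT preserved.
    """
    rng = random.Random(seed)  # fixed seed keeps the demo data reproducible
    fake = [{} for _ in range(n)]
    for col in real_rows[0].keys():
        values = [row[col] for row in real_rows]
        if all(isinstance(v, (int, float)) for v in values):
            mu = statistics.mean(values)
            sigma = statistics.pstdev(values)
            for row in fake:
                row[col] = rng.gauss(mu, sigma)
        else:
            for row in fake:
                row[col] = rng.choice(values)
    return fake

# Made-up "real" rows, used only to show the shape of the input.
real = [
    {"monthly_spend": 120.0, "segment": "retail"},
    {"monthly_spend": 80.0, "segment": "retail"},
    {"monthly_spend": 310.0, "segment": "corporate"},
]
demo = synthesize(real, n=1000)
```

The generated rows have the same columns and roughly the same scale as the source, which is usually enough for a teaching session without handing out the underlying records.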
What do data scientists find unexpectedly hard?
Getting the project managers (PMs) to understand that data science projects cannot be managed the way they manage typical IT or software development projects. Just because we have a great machine learning idea for a problem doesn't mean a deliverable will come out of it. For all we know, that fancy idea could be wishful thinking in practice.
Getting your project manager (PM) to understand that you are not idling about or neglecting your work just because you could not deliver on that promising idea. Accuracy of 95% doesn't always mean we didn't do our job well; the problem could be truly hard. Heck, that 95% on the test/validation set does not always translate to the same in production, and that still doesn't invalidate our effort.
Getting the PMs to understand that continuous change management is a natural part of the data science process. Data science involves a lot of discovery as well as trial and error, which almost always means initial plans will need to adapt to a new understanding of the data and the problem at hand.
Even when your solution or model is promising, it's incredibly hard to get domain experts to buy into your idea or to collaborate well enough to achieve success. They may suspect you of trying to push them out of their jobs, or sometimes be outright unwilling to put in the work to get you quality data.
What is the difference between a data scientist and a data analyst?
We typically separate the data roles into 3 distinct but overlapping positions: the Data Analyst, the Data Scientist and the Data Engineer.
The Data Analyst typically runs queries against new data to find trends important for the organization and to help prepare data for the Data Scientists. Data Analysts are typically very good at SQL as well as being knowledgeable about the core metrics an organization deems important. They can also write scripts and produce intuitive visuals.
The Data Scientist is primarily tasked with building models using machine learning. These models are expected to endow an organization’s software with product features that predict and explain, making applications adaptive. The quality of a Data Scientist’s models depends directly on how well they understand and prepare data, thus they will work with the Data Analyst when it comes to understanding and preparing data to build better models.
The Data Engineer takes what is created in the “lab” and helps put it into production. They work with Data Scientists to make sure the engineering they put in place handles machine learning models correctly (how much do models need to scale, how are the models trained, how are models kept fresh, etc). In some companies Data Engineers will also work with Data Analysts to ensure data ingestion and conversion is taking into account the right metrics, from the right sources etc.
As a data scientist, what potentials can we harness for businesses?
Before talking about Machine Learning, your first advantage as a Data Scientist is the capacity to automate data processing tasks. I have always been frustrated to see the amount of resources wasted on manual data crunching tasks (using Excel) that can easily be automated.
A second advantage is your capacity to bring visibility to the business. A majority of managers make decisions based on instinct, gut feeling or the subjective opinions of their team members. Numbers (or data) never lie; by using data you bring facts. If you share insights or advice based on data, you will be the loudest voice in the room, even as a junior data scientist.
And finally, you can build models that will extract patterns and create links in complex processes using large amounts of data: Machine Learning.
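As a small illustration of the first point, the kind of spreadsheet work that is often done by hand (summing a column per category) takes only a few lines to automate. This is a sketch, not anything from the answer itself: the `total_by_key` helper and the `region`/`amount` columns are made up for the example, and the CSV is inlined as a string so the snippet runs on its own.

```python
import csv
import io
from collections import defaultdict

def total_by_key(csv_text, key, value):
    """Sum the `value` column for each distinct `key`, like a simple pivot table."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[key]] += float(row[value])
    return dict(totals)

# Hypothetical export from a spreadsheet.
report = """region,amount
north,100.5
south,40
north,9.5
"""

totals = total_by_key(report, key="region", value="amount")
print(totals)  # {'north': 110.0, 'south': 40.0}
```

In practice the text would come from a file or an export job rather than an inline string, and the same function could run on a schedule instead of being repeated manually.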
What business skills do data scientists need?
You don’t need any.
Firstly, the data science role is dead. Read this for some real-world insight.
Now that you know that role has the shelf life of Bernie Sanders… your next question should be…
What business skills do I need as a machine learning engineer? (It’s the top job on earth)
Zero is the correct answer.
I’m in very few meetings with the business types and when I am they are usually short.
My question for any project is easy: what do you need to predict, and where is that data located?
I’d focus on SQL, Python, Stats… etc. You have enough to learn with those.
If you want to manage and push paper then get an MBA, it’s much easier.