Monday, June 30, 2008

When all models are unnecessary, ...

"All models are wrong, and increasingly you can succeed without them."

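The cover story of the latest issue of Wired, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, has a startling title and contentious content, and it has set off a wave of discussion and denunciation across the blogosphere and academia.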

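The article discusses the role data mining has played in Google's success and the role it may play in future scientific research. The ambition is grand, but the argument is thin, and it misunderstands some basic things. As soon as it was published, readers reacted fiercely; to borrow a saying, it stirred up a hornet's nest.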

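In the reader comments on the Wired site, someone quickly remarked that the author was out of his league, and others felt he had crossed the line. A University of Michigan professor (Cosma Shalizi) even wrote this cutting line on his own blog: "I recently made the mistake of trying to kill some waiting-room time with Wired."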

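The article was written by Wired editor-in-chief Chris Anderson (originator of the Long Tail theory). He lives up to his reputation as a bestselling author, and the writing is undeniably polished. He opens with statistician George Box's famous maxim, "All models are wrong, but some are useful," and then announces the arrival of the Petabyte Age in a gracefully parallel passage: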

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database.

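The Petabyte Age is no literary exaggeration. As information technology advances, our capacity to accumulate and store data keeps growing. Early this year (January 2008), the MapReduce paper that Google published revealed that Google processes 20 petabytes of data a day. Massive data, aided by data mining and statistics, gives Google's competitiveness a powerful boost, and both Google's own achievements and the bioinformatics research it sponsors amply demonstrate how much data matters. That is why Peter Norvig rewrote George Box's maxim as "All models are wrong, and increasingly you can succeed without them."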

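I have written before about the importance of data in data mining research, but here Chris Anderson goes astray in his reasoning. He inflates the importance of data without limit and concludes that, with massive data and applied math (he evidently means data mining), nothing is impossible. He even regards this as a paradigm shift (some render "paradigm" in Chinese as 範式; is there a better translation?), hence the startling "End of Science" title. In short, he thinks the old way of doing scholarship is obsolete: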

It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

..... omitted....

Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.  But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.

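He goes on to elaborate: given data, computers, and algorithms, we feed the data into the computing machinery and simply wait for the results, with no need for hypotheses, models, or domain knowledge. In the same way, Google applies statistical results to machine translation and spell checking and gets excellent results without understanding language: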

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

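The new science Chris Anderson describes is not argued fully or completely. Granted, the formula of data + computing power + data mining algorithms can produce excellent results in fields that generate massive data, such as space science, physics, pharmaceuticals, and genomics. But not every field of research enjoys those conditions (data that can be collected at scale), and it is very doubtful that the new formula works everywhere and covers everything. Moreover, methodologies differ from field to field, so jumping to the conclusion that the new formula is the new scientific paradigm is far too hasty. For example, some have asked: if this approach holds, how do we discover anything new, when we do not know where data about the "new" thing would come from?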

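This "science without model & correlation supersedes causation" theory, which oversimplifies both the process and the purpose of science, has serious problems. The new science Chris Anderson describes, discovering noteworthy information or knowledge in massive data, only gets as far as knowing what happens, and that is merely where science starts. The scientist's mission is to know why: understanding the details of things and explaining every observed phenomenon is what drives science (and technology) forward.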

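John Timmer (Ars Technica) asks an interesting question in his article: if a theory cannot offer testable hypotheses, how do we know how badly wrong we are? Once we discard the need for testable hypotheses, we cannot even tell whether a result is right or wrong.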

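Chris Anderson's claims can also easily mislead people about what data mining really is. Everyone who has done data mining work (scholar, analyst, or engineer) knows that once the rules and patterns have been found, judging whether the findings are useful and valid requires exactly the know-why ability described above; only then can the mined information be put to its best use. Data mining is definitely not just a number-crunching black box. Without theory, hypotheses, or correlations, a data mining task cannot be completed.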

So I fully agree with what John Timmer says: "At a more fundamental level, in spite of what Chris Anderson has to say, science is about explanations, coherent models and understanding." His account of correlations and models is even more concise and forceful, and it resonates deeply with me:

Correlations are a way of catching a scientist's attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications.

Fernando Pereira, who teaches at the University of Pennsylvania, also wrote a commentary on the article that is well worth reading:

I like big data as much as the next guy, but this is deeply confused. Where does Anderson think those statistical algorithms come from? Without constraints in the underlying statistical models, those "patterns" would be mere coincidences. Those computational biology methods Anderson gushes over all depend on statistical models of the genome and of evolutionary relationships.

Those large-scale statistical models are different from more familiar deterministic causal models (or from parametric statistical models) because they do not specify the exact form of observable relationships as functions of a small number of parameters, but instead they set constraints on the set of hypotheses that might account for the observed data. But without well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.
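Pereira's remark that unconstrained "patterns" would be mere coincidences can be illustrated with a small experiment. The sketch below is my own illustration, not anything from Pereira's or Anderson's text: it generates series of pure, mutually independent noise and reports the strongest pairwise correlation it can find. The function names, the series length of 20, and the seed are all assumptions chosen for the demonstration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_chance_correlation(num_series, length, seed=0):
    """Return the largest |r| over all pairs of independent Gaussian noise series.

    Hypothetical helper for this illustration: every series is unrelated to
    every other by construction, so any correlation found is coincidence.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0.0, 1.0) for _ in range(length)]
              for _ in range(num_series)]
    return max(abs(pearson(a, b)) for a, b in combinations(series, 2))


if __name__ == "__main__":
    # More unrelated variables means more pairs to search, and therefore
    # stronger-looking correlations that arise purely by chance.
    for num_series in (5, 50, 200):
        r = strongest_chance_correlation(num_series, length=20)
        print(f"{num_series:>4} unrelated series -> strongest |r| = {r:.2f}")
```

With the same seed, the smaller sets of series are prefixes of the larger ones, so the strongest correlation found can only grow as more noise variables are added. Nothing in the data distinguishes these coincidences from real relationships; that judgment has to come from a model or a hypothesis outside the numbers.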

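Personally, I think that when Chris Anderson says "Forget taxonomy, ontology and psychology," he has plainly gone too far and gotten carried away. From a writer's standpoint the sentence is rhetorically well crafted, but the words betray careless reasoning and an arrogant attitude toward scholarship, which I suspect is one reason so many bloggers and scholars could not let it pass.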

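More data is definitely good for scientists: the more data there is, the better a hypothesis can be tested. But richer and more plentiful data by no means lets us abandon the scholarly principle of "bold hypotheses, careful verification." That said, on the whole Chris Anderson's claims are not entirely without merit, and Google's formula for success does have some influence on academia. A remark by George Dyson that Kevin Kelly quotes in his article is well worth considering: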

What Chris Anderson is hinting at is that Science (and some very successful business) will increasingly be done by people who are not only reading nature directly...,They accomplish what science does, although not in the traditional manner...

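More data gives scientists a way of working that differs from the traditional one, but it does not mean the end of tradition. John Timmer's closing words put it well: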

Overall, the foundation of the argument for a replacement for science is correct: the data cloud is changing science, and leaving us in many cases with a Google-level understanding of the connections between things. Where Anderson stumbles is in his conclusions about what this means for science. The fact is that we couldn't have even reached this Google-level understanding without the models and mechanisms that he suggests are doomed to irrelevance.

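Beyond the views collected above, there are also authors who are keeping a cool head, such as Matthew Hurst. He too is unhappy with Chris Anderson's article, but he wants to work out a complete and meaningful position before publishing his own opinion. Let us wait and see...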

 

(Strongly recommended: I have created a list on Diigo, End of Theory, and added all the related references to my library. Readers can open this WebSlide to browse the library's contents.)

