Our first Hadoop User Group meeting will take place on 11.12.13 (an interesting date :) ), a Wednesday, at Marmara University's Haydarpasa Campus in Istanbul. Hortonworks will deliver the keynote, and Teradata is sponsoring the meeting.
November 23, 2013
November 21, 2013
Let’s take a quick look at how Hadoop answers enterprise-level operational needs.
First of all, we have Ambari for the provisioning, management, and monitoring needs of Hadoop clusters.
Operation – Provisioning & Configuration
Provides a step-by-step wizard for installing Hadoop services across any number of hosts.
Handles the configuration of Hadoop services for the cluster.
Operation – Management
Provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster (a REST sketch follows after this section).
Operation – Monitoring
Provides a dashboard for monitoring the health and status of the Hadoop cluster.
Leverages Ganglia for metrics collection.
Leverages Nagios for system alerting; it sends e-mails when your attention is needed.
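As an illustration of the management capability above, Ambari also exposes a REST API. The minimal sketch below (the server host, cluster name, and credentials are hypothetical) asks Ambari to start the HDFS service by PUTting the desired state:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariStartService {
    public static void main(String[] args) throws Exception {
        // Hypothetical Ambari server, cluster name, and credentials.
        URL url = new URL("http://ambari-host:8080/api/v1/clusters/prod/services/HDFS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        // Ambari requires this header on modifying (PUT/POST/DELETE) requests.
        conn.setRequestProperty("X-Requested-By", "ambari");

        // Setting the desired state to STARTED asks Ambari to start the service.
        String body = "{\"RequestInfo\":{\"context\":\"Start HDFS via REST\"},"
                    + "\"Body\":{\"ServiceInfo\":{\"state\":\"STARTED\"}}}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```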
Hadoop by default keeps three copies of each block. For backup and recovery we created the scenario pictured below and tested it in our PoC. For each 18-node production environment we planned a 6-node replica environment holding a single copy of each block. Hadoop files are copied to this replica cluster in parallel. After installing FUSE/NFS drivers for Hadoop, files in HDFS appear as regular files on Linux. EMC Legato Networker drivers are installed on the second cluster, and the files are copied to the backup infrastructure. Restore tests were also completed successfully. The advantage of this method is that the restore operation can target any Linux file system; HDFS is not mandatory.
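As a minimal sketch of the copy step (the NameNode hosts and paths are hypothetical, and the production setup copied files in parallel, for example with DistCp, rather than one directory at a time), the HDFS FileSystem API can copy a partition to the replica cluster with single-block replication:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ReplicaCopy {
    public static void main(String[] args) throws Exception {
        Configuration srcConf = new Configuration();
        Configuration dstConf = new Configuration();
        // Single block copy on the replica cluster (the production default is 3).
        dstConf.set("dfs.replication", "1");

        FileSystem srcFs = FileSystem.get(URI.create("hdfs://prod-nn:8020"), srcConf);
        FileSystem dstFs = FileSystem.get(URI.create("hdfs://replica-nn:8020"), dstConf);

        // Copy one day's partition from production to the 6-node replica cluster.
        FileUtil.copy(srcFs, new Path("/dw/cdr/2013-11-20"),
                      dstFs, new Path("/backup/cdr/2013-11-20"),
                      false /* do not delete the source */, dstConf);
    }
}
```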
Operation – Security
Authentication is secured by Kerberos v5 (see the login sketch after this list).
No impersonation of other users.
Jobs have their own ACLs that specify who can view their logs and counters or kill them.
Tasks run as the user who launched them.
Task isolation on the same TaskTracker.
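Here is a minimal sketch of how a client authenticates with a keytab through Hadoop's UserGroupInformation API (the principal, realm, and keytab path are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Authenticate as a service principal; no impersonation of other
        // users is possible without explicit proxy-user configuration.
        UserGroupInformation.loginUserFromKeytab(
                "etl@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/dw")));
    }
}
```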
Operation – File Access Auditing
All HDFS access requests can be logged.
Logged information includes the path of the requested file, the user and group name of the requester, the IP address of the requester, the command applied to the file, and more (a parsing sketch follows).
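As a small illustration, the sketch below pulls the key=value fields out of a typical HDFS audit line (the sample line is made up, and the exact format varies by Hadoop version):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuditLineParser {
    // Matches the key=value pairs of an HDFS audit line.
    private static final Pattern FIELD = Pattern.compile("(\\w+)=(\\S+)");

    public static void main(String[] args) {
        String line = "allowed=true ugi=etl ip=/10.1.2.3 cmd=open "
                    + "src=/dw/cdr/part-00000.gz dst=null perm=null";
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            System.out.println(m.group(1) + " -> " + m.group(2));
        }
    }
}
```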
November 3, 2013
Let’s start with a quick summary: Big Data can be turned into Big Value. Of course, depending on your needs, this value may come from different perspectives, such as reducing the cost per TB or generating revenue from new data sources, as I mentioned in the prior post.
The prior post also positioned Hadoop as a complementary solution to traditional DW solutions, and I received some comments about this positioning. I am aware that Hadoop can provide a full DW solution from head to toe; it is an ecosystem of technologies where you can find answers to all your data warehousing needs. But what I really wanted to underline is that Hadoop is no longer only an internet companies’ (Yahoo, Facebook, Twitter, etc.) technology, and has not been for some time now; it is already being used, or being considered, by telcos and financial companies too. This is why I called the series “Hadoop in the Enterprise”. Every journey starts with first steps; these companies have of course used traditional DW solutions for many years, and Hadoop will initially be complementary.
Before talking about the technology ecosystem of Hadoop, let’s briefly cover its storyline. It goes back to Google’s GFS and MapReduce papers of 2003-2004. In 2006 the Apache Hadoop project was started to serve Yahoo’s needs. Cloudera was founded in 2008, and in 2009 the first commercial Hadoop distribution was released; enterprise-level support has been available ever since. Hortonworks was founded in 2011, and by 2012 the Hadoop ecosystem had grown to more than 300 companies.
Hadoop Core consists of two solutions: HDFS for storage and MapReduce for computation. The Hadoop Distributed File System (HDFS) is self-healing, byte-stream-oriented, high-bandwidth, clustered storage. MapReduce is a programming model in which (key, value) pairs are first transformed into intermediate (key, value) pairs, which are then grouped by key and processed in a reduction phase. It is a schema-on-read, fault-tolerant, distributed processing framework.
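As a minimal illustration of the model, here is the canonical word-count job written against the Hadoop 2 Java MapReduce API (input and output paths are passed as arguments):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: values arrive grouped by key (word); sum the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```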
The power of Hadoop comes from the technology ecosystem built on HDFS and MapReduce as its core:
- Hive: SQL dialect and data warehousing platform for accessing HDFS files; developed at Facebook; reported to reach ~98% SQL compatibility by the end of 2014 (see the JDBC sketch after this list)
- Pig: Data transformation language developed at Yahoo; its scripts read like RDBMS execution plans; extensible via User Defined Functions (UDFs)
- Mahout: Machine learning library for the MapReduce framework
- HBase: Column-oriented store residing on HDFS, allowing online access to data on HDFS
- Zookeeper: Open-source distributed system coordinator
- Oozie: Job workflow engine
- Sqoop: Integration tool for extracting data from and loading data into RDBMSs
- Flume: Integration tool for capturing streaming data and storing it on HDFS
and the list goes on with Ambari, HCatalog, Drill... There are even related projects outside of Apache, such as Spark, Shark, and Impala.
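Referring back to the Hive item above, here is a minimal sketch of querying Hive through its HiveServer2 JDBC driver (the host, credentials, and cdr table are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the hive-jdbc jar must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "etl", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT caller_msisdn, COUNT(*) FROM cdr GROUP BY caller_msisdn")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```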
SQL is the natural language for a DW specialist, so in my opinion Hive has a special place in the ecosystem; it is the most effective tool for driving adoption. With the Stinger Initiative, Hive is currently the most invested-in project in the domain. Also, with Hadoop 2.0, HDFS gains NameNode automatic failover, HDFS snapshots (point-in-time recovery of HDFS), and HDFS Federation (which lets different domains use Hadoop while accessing their files in an isolated way) as hot new features. And with Hadoop 2.0, MapReduce gets YARN; with YARN, the Hadoop platform becomes an enabler not only for batch processing but for several additional key areas as well.
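As a small illustration of the snapshot feature (the NameNode host and directory are hypothetical, and an administrator must first make the directory snapshottable with hdfs dfsadmin -allowSnapshot), the Hadoop 2 FileSystem API can create a named, point-in-time snapshot:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DailySnapshot {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode and directory; the directory must already be
        // snapshottable (hdfs dfsadmin -allowSnapshot /dw).
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://prod-nn:8020"), new Configuration());
        // Creates a read-only, point-in-time image under /dw/.snapshot/
        Path snapshotPath = fs.createSnapshot(new Path("/dw"), "daily-2013-11-21");
        System.out.println("Created snapshot at " + snapshotPath);
    }
}
```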
October 30, 2013
I wanted to create a series of posts on our Hadoop experience, based on my observations over the past two years. We currently run a production Hadoop cluster on Cloudera; billions of records are processed daily on this environment, which complements our enterprise data warehouse. We also recently finished an assessment of three vendors’ Hadoop appliance solutions: Oracle, Teradata, and HP. During these PoCs we had the chance to compare Hive and Pig code generators such as Talend, Informatica, and ODI too. A lot of information was gathered, and I will try to share short posts on Hadoop from the perspective of a Telco DW development manager.
So here comes the first question: why should Hadoop be complementary to traditional DW solutions in a Telco environment? I will leave scalability to later posts in this series; I believe the major motivation should be reducing the cost of each TB processed and stored in your DW environment. To gain the most, you should be sure that you are using the right tool in the right place. It is important to choose the appropriate ETL job types to move to Hadoop; for example, we call one typical group LOG ETL (CDRs, VAS logs, SS7 logs, VoIP logs, etc.): low transformation and huge volume is where you will gain the most. Another use case may be positioning HDFS and Hive together as the archive of your EDW.
Another important motivation can be the flexibility of processing both structured and unstructured data on the same infrastructure. Yes, both: some DW vendors prefer to position Hadoop for unstructured data only, but with Hadoop you can benefit from both worlds. As unstructured data sources grow very rapidly, creating value from these new data sources, economically and at extreme scale, is an important competency.
What about support for Hadoop at the enterprise level? This has not been an issue for some time now. Open source was long considered off-limits for enterprise DWs, but just as the Linux world has Red Hat and SUSE, the Hadoop world has Cloudera and Hortonworks to work with for your enterprise-level support needs. The last words of this post come from the masters of the DW world, on Hadoop as a complementary solution to traditional DW solutions:
September 15, 2013
This December we will be meeting; the call for papers closes at the end of this month. You can follow the meeting details in the LinkedIn group.
June 4, 2013
Hello, our second Turkish Oracle User Group (TROUG) BI/DW special interest group meeting will take place on June 21 at İTÜ Maslak. The draft plan is as follows:
09:00 – 09:30 Registration and opening
09:30 – 10:15 Ersin İhsan Ünkar / Oracle Big Data Appliance & Oracle Big Data Connectors – Hadoop Introduction
10:30 – 11:15 Ferhat Şengönül / Exadata TBD
11:30 – 12:15 Sponsor presentation
12:30 – 13:30 Lunch break
13:30 – 14:15 Ahmet Selahattin Güngörmüş / OBIEE TBD
14:30 – 15:15 Sanem Seren Sever / ODI TBD
15:30 – 16:15 TBD / DW SQL Tuning
16:30 – 17:15 TBD / ODM TBD
I am sharing this in advance so you can mark your calendars in case you would like to attend; we will also make a general announcement once the content is finalized. Since the venue has room, feel free to pass this on to colleagues you think would be interested.
December 28, 2012
To shed light on 2013, I am curious about your feedback based on your experience with TROUG in 2012. As a reminder of what took place in 2012, you can review the “Geçmiş Etkinlikler” (Past Events) section on our site’s home page: http://troug.org/
November 28, 2012
Oracle APEX Best Practices was recently released by Packt Publishing. My first impression was that it is not a beginner’s guide to APEX; it is a book about enterprise-level APEX development features. When developers begin to learn APEX, they think, “we can do simple web projects with APEX, but not more.” This book shows that everything can be done with APEX. If you don’t have experience with APEX, I recommend reading a beginner’s APEX book before this one.
- The first chapters describe the APEX architecture and give the key points on how to design and manage single or multiple applications. I wish there were more examples in this chapter.
- The book also covers advanced Oracle reporting features, such as analytic and grouping functions, and explains how to use them in APEX.
- There is a large chapter about printing reports on different platforms, with many examples of how to implement this.
- All enterprise applications must have well-implemented security. Application-level security steps, from the database to user interface items, are well explained with examples.
- Many APEX developers complain about debugging in APEX. There is a chapter on how to debug and on alternative debugging tools, from the database level to the user interface level. It is also very helpful for understanding, end to end, what happens when a user clicks a button.
- An enterprise application should have a source version control system and a deployment procedure. The last chapter examines how to manage this, with examples.
Everything is understandable and easy to implement in our production environment. Screenshots are taken from APEX version 4.1, but everything applies to 4.2 as well. There is also a mini appendix on RESTful web services, a great new feature in APEX 4.2. I think every APEX developer should read this book to build better-designed APEX applications. Many thanks to the book’s authors, Learco Brizzi, Iloon Ellen-Wolff, and Alex Nuijten, for this great job.
ps. this review was written by my colleague A. Yavuz Barutçu
October 28, 2012
Hüsnü has mentioned his own presentation. Attendance at the Oracle Day 2012 event in Istanbul on November 15 is free; if you are attending, don’t miss the TROUG presentations, starting with Hüsnü’s.
Note: Turkcell Group CIO İlker Kuruöz also appears to be giving a presentation the same day at 10:30, titled “Dönüşümsel Bulut Yolculuğu” (Transformational Cloud Journey); you can reach the full agenda via this link.
September 26, 2012
This is not really news for those following troug.org: we will hold our big annual meeting on October 11. As with the first meeting, we chose Bahçeşehir University again, thanks to the advantage of its Beşiktaş location by the Bosphorus. The agenda is exciting, and many new faces are joining as presenters this year. If you have not registered yet, get your free registration right away for this meeting, where you will see no marketing presentations, only real-life stories.