Newbie Gets Confused

A relatively junior developer got tasked with running some performance tests on a lot of new code. His first task was to get a lot of data ready for the new code to operate on. Two weeks went by, and he was still working on the script. During that time I had a similar task. Luckily I have some experience in this department, so I generated the data, ran some small tests, then executed the big test.

On the third week, the developer came to me for some help. He said his code was slow, so he had decided to dynamically drop all the triggers on the table he was working with. He wanted to issue an ALTER TABLE command from his script, which consisted of one large PL/SQL block. Of course it was not working. I told him you can't just drop a DDL statement into the middle of that block; only DML can go there directly. However, you can run the DDL inside an EXECUTE IMMEDIATE. He seemed happy.
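A minimal sketch of what he was after, assuming the goal was to disable the triggers with ALTER TABLE before the heavy lifting (big_table is a made-up name):

   BEGIN
      -- DDL cannot appear directly in a PL/SQL block, but dynamic SQL can run it.
      -- Note that DDL issues an implicit commit.
      EXECUTE IMMEDIATE 'ALTER TABLE big_table DISABLE ALL TRIGGERS';

      -- ... the slow DML work goes here ...

      -- Put the triggers back when done.
      EXECUTE IMMEDIATE 'ALTER TABLE big_table ENABLE ALL TRIGGERS';
   END;
   /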

Later the same guy came back and said he was still having problems. The error indicated that the table he was trying to access did not exist. I told him the table was most likely there, but he was running the script as a different user. I thought there was a public synonym that should have let his script resolve the table name. The guy was in a hurry, so I told him he could qualify the table name with the schema name prepended. That got him a bit further.
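In other words, instead of relying on a synonym, he could write something like this (app_owner and big_table are placeholder names):

   -- Fails when run as another user with no synonym in place:
   SELECT COUNT(*) FROM big_table;

   -- Works: qualify the table with the owning schema.
   SELECT COUNT(*) FROM app_owner.big_table;

   -- Or, if you have the privilege, create a public synonym once:
   CREATE PUBLIC SYNONYM big_table FOR app_owner.big_table;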

The problems this guy was running into stemmed from a lack of knowledge, and experience is the best teacher here. He had a lot more pain. For some reason he chose the wrong type for one parameter he was passing. When he used the variable of the wrong type, Oracle had to implicitly convert the value, which prevented it from using the index that was needed to make everything fast. Ah, this should not be this hard.
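Here is a sketch of the trap, assuming the indexed column was a VARCHAR2 and he bound a NUMBER (the names are made up):

   DECLARE
      l_id_num  NUMBER       := 12345;    -- wrong type for a VARCHAR2 column
      l_id_str  VARCHAR2(20) := '12345';  -- matches the column type
      l_count   PLS_INTEGER;
   BEGIN
      -- Implicit conversion: Oracle treats this as TO_NUMBER(id) = l_id_num,
      -- so an ordinary index on id cannot be used and the table gets scanned.
      SELECT COUNT(*) INTO l_count FROM big_table WHERE id = l_id_num;

      -- Same predicate with the right type: the index on id stays usable.
      SELECT COUNT(*) INTO l_count FROM big_table WHERE id = l_id_str;
   END;
   /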

New World for the Data Engineer

I read an article on how to become a data engineer. It is interesting to see all the tools you would need to know. Obviously you need to know SQL. But you should also know Python. Since everything is in the cloud these days, you should know a vendor or two (AWS or Azure). Of course there is the whole NoSQL movement, along with the obligatory Big Data.

It seems there are a lot of Apache products ripe for the Big Data universe. There is Apache Kafka for messaging. Hadoop lets you run jobs over data stored on disk (HDFS). Spark lets you run jobs and keep the results in memory. There is also Druid, which I had not heard of previously, for real-time analytics.

In addition to the article I read, there is a Hacker News follow-up discussion. There the cutting-edge crowd says Hadoop and Spark are already legacy; Kafka, Redshift, and a host of other technologies replace them. Man, even in the database world, things move fast.