PlaidCTF 2014 – Website post-mortem

First, we would like to thank everyone for participating in PlaidCTF 2014. Thanks to all of you, this year was bigger than ever with over 850 teams partcipating and 3 sponsors (THANK YOU!) providing $14K in cash prizes and covering all of our operating costs. Congratulations go out to 0xffa, Dragon Sector, and MoreSmokedLeetChicken for finishing in 1st, 2nd, and 3rd. We hope everyone had a lot of fun with the challenges and maybe learned a thing or two, as well.

Now, as everyone who played pCTF this year noticed, things got off to a very rocky start. This was very disappointing for us because we spent a lot of time this year building a simpler interface that would be more usable and, we hoped, more stable. It turns out that we had a few mistakes materialize during the CTF, some nastiness with a particular WSGI server, and more teams than ever before.

The first issue that came up was with our frontend / caching server. Like previous years, we used nginx to serve the static files, but unlike previous years, we also used nginx as the caching layer instead of varnish. This turned out to be a fatal error because nginx does not support the standard Vary HTTP header, which is used by the backend to distinguish between pages that should be cached based on the cookie and those that should not. As such, all pages were cached without the cookie which meant a user might see a different team’s page instead of their own. As a quick fix, we added the cookie header to the cache key, but now all of the pages that we thought would be regenerated once per second needed to be regenerated once per second per team. And as I will discuss later, we had a lot of teams.

The second issue, and the one that ended up consuming a lot of our time and causing the most frustration, was a strange interaction between nginx and gunicorn. This year we stuck to a pretty standard setup: django + gunicorn + nginx. Unfortunately, due to high load, or other factors that we are still unaware of, gunicorn was hanging on a recvfrom call on its socket. The problem mysteriously fixed itself a couple hours into the competition, but then came back an hour or so later. At that point, we started replacing parts of our infrastructure (e.g. nginx -> Apache, gunicorn -> uwsgi). The working setup ended up being uwsgi + nginx, but it took us a couple of hours to get to this point. Now, the backend worker processes were no longer hanging and the website was mostly working except for some load issues.

As I mentioned above, we had more teams this year than ever before with over 850 teams solving the trivial sanity check key. Once the nginx caching was working properly, we found that we were still overloaded during the voting phases so we brought up another copy of the www server and used an Elastic Load Balancer as the new frontend. I would like to thank Amazon for making its load balancer very easy to setup, use, and monitor. After this, all load issues were resolved and the website continued to be stable for the rest of the competition. We lost IP address information for the majority of the competition because we were not logging the X-Forwarded-For header, but that was a small price for having a working website.

Requests per minute processed by the Elastic Load Balancer

Requests per minute processed by the Elastic Load Balancer

There were two other bugs in the game logic that we fixed over the course of the 48 hours. The first one was an issue with the database transactions in Django. The default isolation level for the postgresql database backend was “Read Committed” which resulted in transaction interleaving and allowing teams to submit the same key multiple times. We resolved the issue by changing the isolation level and enforcing uniqueness of (team_id,problem_key_id) in the database. The other issue was related to the “chance” tiles. During the first half of the competition, it was impossible for the “chance” tile to return something besides “Unlucky!” Once we resolved the problem, we changed the probabilities to compensate for the lost attempts.

All in all, we think the competition went well and we thank everyone for their patience. Every year we try to do something a bit different to keep things interesting and with that comes some untested infrastructure. I assure you that we continue to learn from these mistakes and will continue to work hard to improve the competition for next year.

- awesie on behalf of PPP

Why CTF

A recent blog post has been going around talking about the differences between competitions like CCDC (Collegiate Cyber Defense Competition) and CTFs (Capture the Flag). It’s a good read and I thought a lot of interesting points were brought up, and figured it could be beneficial to try to respond to them some place to encourage discussion (and besides, our blog needs any excuse it can get for new content). For what it’s worth, I didn’t watch Chris Eagle’s presentations, though I imagine I don’t agree with everything he said. Likewise, I don’t disagree with everything from Matt Weeks’ blog post. This post is mostly motivated by the fact that while a discussion on security competitions is going, it makes sense to talk about some related things I have been thinking about.

First off, as many of you probably know, the Plaid Parliament of Pwning has not played in purely defensive exercises. To those that may say “well aren’t you going to be biased towards capture the flag events?”, I answer “almost certainly”. I’m not going to say CCDC is worthless and CTFs are the epitome of security education, but my claim is that CTFs provide a much better platform and environment for players learning about a broad range of security topics. Further, I’m going to use the term CCDC when I suppose in general I mean “defensive security competition”. This probably means I will muddle some specific details.

1. Problem Solving

Fundamentally, CTFs are about problem solving. Although I know defensive competitions have many problems that their competitors have to deal with, I believe CTFs intrinsically allow for more “out of the box” thinking. Let’s start with rules. Most CTF competitions are pretty loose in this respect: maybe team size is limited, challengers aren’t allowed to share answers, and attacking game infrastructure is forbidden, but that’s about it. The nature of defensive competitions means they can’t be so simple. Although the rules for CCDC aren’t too bad, the rule book for the CyberPatriot competition is over one hundred pages long. There’s a reason for this: there are “easy solutions” that would break these competitions. What if you block malicious network blocks? set up a honey pot? rewrite services to be bug free? delete user accounts only the red team is using? proxy all traffic through an IPS? upgrade Windows XP boxes to Windows 7? run separate services in sandboxes?

Yes, not all of these options are feasible in the real world in every scenario: they may be expensive, or constrained by some other resources. But what is the lesson here? If you are put in a real world situation, do all these solutions become blocked just like in a competition? Of course not! You need to evaluate the merits and consequences of each one and decide for yourself. In most defense competitions, these decisions are made for you, and that limits your abilities to think out of the box. In many cases teams that succeed at CCDC do so because they manage to get as out of the box as possible. This ends up meaning they “game the game” rather than make good security practices. For example, maybe you aren’t allowed to delete users on a system, because in the real world that would restrict the abilities of legitimate users to access the system. Fine, change all their passwords to a 100 character hex string. The real world value of this isn’t any better, but it is more acceptable to do in CCDC.

In a Capture the Flag contest, just about anything goes. Over the years I’ve seen crazy hacks to get things working and working quickly: dumping memory by patching in a jump to MessageBox, reverse image searching to find plausible ECC bits for broken QR codes, forcing a core dump and using a file carver to extract files from obfuscated programs… The type of thinking and problem solving promoted by CTFs is fundamental to what it means to be a security researcher.

2. Feedback

When reading through some comments about the CCDC post, I came across someone complaining about the lack of transparency in CCDC. Although this may not be a universally held opinion, it doesn’t apply to Capture the Flag events. Let’s look closely at the procedure for earning points in both:

  1. Solve problem
  2. Submit problem to score server
  3. Observe scoreboard add points or not

Scoring system in CTF

  1. Maintain service uptime and fend off the Red Team
  2. Complete “injects” (business tasks)
  3. Write reports documenting injects and attacks
  4. Submit reports to organizers to get points,reduce deductions
  5. Wait until end of competition to see final standings

Scoring system in CCDC

 

 

 

 

 

 

 

 

Okay, so why is this important? First, scoring in a CTF is objective–there are no reports to be read and graded, and instant feedback to verify the system works correctly. What this means is CTF problem solving can be an iterative process: attempt a solution in a particular way, see the results, and modify your behavior accordingly. If you read about the importance of feedback in learning, you’ll notice a few things CTFs do right: feedback is immediate, feedback is individual (both on a problem and team basis), and feedback isn’t final (if you get a flag wrong, try, try again).

Let’s contrast this with a defense competition. Feedback is not immediate at all: sometimes a team standing is updated daily, and sometimes not until the end of the competition! Feedback is not granular: your team might get a final score at the end of the competition, but it’s not clear where your points came from. Even if points are broken down by advisories, SLA, and so forth, it’s not clear which actions are being re-enforced. Finally, feedback isn’t iterative. Although you may be able to fill out a form to request a regrade on a report, you can’t just submit a bunch of reports and use the best one.

One might argue: the real world doesn’t provide immediate feedback, isn’t objective, and isn’t always fair, so defense competitions are a better approximation of real scenarios! There is a lot of truth to this: CTFs aren’t depicting day-to-day scenarios of system administrators or work in a security operations center. But you know what? That’s kind of the whole point, and what makes CTFs fun.

3. Fun

I’ll admit outright, “fun” is clearly not universal. I know people out there thinks blocking exploits is far more fun than throwing them, but let’s be honest, this isn’t a popular opinion. Heck, even the NCCDC website talks about how much cooler the Red Team is than the contestants. Again, “well the real world isn’t always fun”. Fine. But if your goal is to encourage people to learn and try new things, is the best way to do it really to throw them into the middle of a hostile network, or would it be better to give them fun tasks that guide them through the learning process?

From the NCCDC website

Red teaming sure sounds a lot more fun than “getting OWNED”

Jeopardy style CTFs are a lot of fun. They’re relaxing, you sit down and solve problems. Does this emulate high stress-real world scenarios in hostile networks? No, but that’s not a fun learning environment. DefconCTF is not a place for beginners. I am not sure how someone could argue that a teaching method that is less fun could ever be superior.

Of course, not all people like breaking things. That’s fine. Capture the Flags aren’t always about binary exploitation. There are plenty of forensics, programming, cryptography, and other skills that involve building things. The only skills that CCDC has which I have not seen in a CTF are writing reports and running updates. Looking for backdoors and malicious software, finding intrusions in log files, changing default settings and modifying web applications are all things that have shown up in CTFs (though I admit those aren’t the kind of tasks that I enjoy).

4. The “1%”

One of the points that struck me on the blog post regarding CCDC was the idea of appealing to the “1%”. Restated simply, the claim is that CCDC makes use of skills that 99% of security workers will use while CTFs really only teach 1% of relevant skills. The numbers aren’t really important here, but the sentiment is. Skills like writing reports, reviewing log files, and running updates are the sort of skills most “security workers” will use. Blech. What is that really saying? To get a job as a security worker I don’t need to understand ASLR bypass techniques, NTFS alternate data streams, or CBC Padding Oracle attacks? Yes, that is probably true! That writing a report is more valuable than understanding modern offensive security? Perhaps! But personally, I think that if CCDC claimed the reason it is a superior platform to CTFs is because it involves writing more reports and running Windows update, people would start losing interest. Why not just participate in a technical writing contest? Do we really need a contest which stresses the most mundane aspects of security?

It is probably worth mentioning that some CTFs also require contestants write up solutions. In one instance of RuCTF (2009), players could receive points for patches or writeups of vulnerabilities. Many CTFs with prizes or final rounds also require a writeup be submitted from teams that place. Some competitions also give separate prizes for highest points/first to complete challenges, and best writeup. Still, I think it is fair to say that writing reports is stressed much more in CCDC than CTF events, for better or for worse.

The fact of the matter is that students wanting to learn security are told they should participate in competitions like CCDC. And sure, security is heavily involved in these competitions, but not at the same level as in CTFs. If someone wants to learn about security, I think it’s pretty clear a CTF is a better avenue than a defensive contest. This is important to me because I often come across people who want to get jobs in security and so they play in CCDC. Although I think it’s wonderful for them to take that initiative, I can’t help but feel that their time would be much better spent playing a CTF.

One thing I have heard oft repeated which I whole-heartedly agree with is that in defense, you are always playing catch up. Unless you think as an attacker, you can only react to known vulnerabilities, and so you are always a step behind. The CCDC approach seems to be to focus on defense, and to succeed you will probably learn something about attacks; the CTF approach is that you should focus on attacks, as the defense comes naturally afterwards. I would posit that the top CTF players would be better suited to defending a network against advanced attackers than the top CCDC players for precisely this reason.

Aside from all this, I think everyone would be much safer if security professionals knew more about the 1% that they so rarely see. Maybe if the person reading the IDS logs knew more about writing polymorphic shellcode than what is in Metasploit, they’d be better prepared to prevent threats more advanced than someone using Shikata ga nai.

5. Conclusion

Both CCDC and CTFs address different goals. As far as I can tell, CCDC’s goal is to assess how well someone can administrate and document systems, while CTFs focus on teaching security. Both of these have their place. But telling fledgling security professionals the better way to learn is through CCDC seems dishonest, and forces them into a niche they might not enjoy. Although the skills taught in CTFs are not always used day-to-day, they cover a far broader range of security topics that are relevant to anyone working in the security field. As someone who tries to promote the value of CTFs, it’s important to me that those interested in security education recognize that CTFs are an incredibly powerful and should be encouraged where possible.

-tylerni7

Codegate 2014: membership (800pt pwnable) write-up

This is a write-up for 800 point pwnable challenge called ‘membership’ from Codegate CTF 2014 Pre-qual round. PPP was the only solver for this challenge during the competition, so I have decided to do a write-up for the challenge. Enjoy.  (awesie and ricky solved it during the competition.)

=== If you have any trouble with poor formatting here, you can read the original post at this blog ===

1. Challenge overview

You can download the copy of the binary here.
During the competition, we could ssh into one of their machines to exploit and read the flag.

membership_problem

As you can see, it’s a 32-bit ELF binary. So, let’s open it up in IDA and start reversing.

The program looks really simple. It just installs a couple signal handlers (specifically for SIGSEGV and SIGFPE) and calls a main (interesting) function, where it prompts us for the userid and password. Then, the program calculates SHA256 hash of the given password and compares with the stored hash. If the password hashes do not match, a runtime error exception is thrown and the program is aborted. If we pass in the correct password, we get a shell :)

Since guessing the correct password or cracking the hash is not viable option for us, we try to locate some other bugs that can be useful.

As highlighted above, the program dereferences a null pointer (*0 = 0) if the length of the password is greater than or equal to 16 bytes. Obviously it is going to trigger SIGSEGV, but do you remember what we said earlier about installing the signal handlers? And yes, one of them was SIGSEGV handler.

So, instead of crashing it miserably, the handler will be called.
Let’s examine what this handler does.

SIGSEGV handler installer

Now, if we look at the SIGSEGV_handler, we may think it doesn’t really do anything useful.
Note that it just fills up exception information and calls __cxa_throw to throw exception.

At this point, we could go on and explain what SIGFPE_handler does as well, but we’ll skip it since it’s not that interesting and is not needed for a successful exploitation.
You may ask… so, what’s left?

 

2. Vulnerability

Notice that this is a C++ program with exception throwing. We should check how C++ exception handling works.

It uses a thing called, DWARF, which is a standardized debugging data format for unwinding the stack and handling exceptions.

There was a CTF problem in the past that involved DWARF (called Khazad from Ghost in the Shellcode 2012): Check out these write-ups if you are interested!

Anyways, you can find DWARF information can be displayed by using binutils such as objdump or readelf:

Take a close look at the entry with “pc=08048fa7..0804904d”.
This entry basically describes what should happen when the exception is thrown between that PC range. Note that the SIGSEGV_handler throws an exception at 0x0804901A , which is in that range (that range is precisely  SIGSEGV_handler function).

Ok. Now, we have to make sense of what all those operations mean :)
DW_CFA_val_expression contains CFA expressions that are defined here.

Luckily, it’s not that hard to understand the expressions. We can simply think of it as a stack machine:

So, in short, it checks if the username is “stdchpie” and the password[2:5] is equal to “\xb1\x2e\x40″.
If any of the condition fails, it transfers execution to 0x8048f18 , which does exit(0) .

What happens if we satisfy the conditions? Good question.
It basically dumps us to the following code:

This code prints out “nested” string and writes password[5:9] to *password[1:5]. Meaning, we get to write anything in 0x402eb1??  address space with any 4 byte value we choose. 4-byte write is pretty strong tool in exploitation, but when we are limited to 256 byte range, it’s difficult to make it useful. Also, it immediately jumps to 0x8048cc0 , where it does another null pointer dereference causing SIGSEGV to happen — thus, we get infinite ‘nested’ string printed out.

Alright. Let’s summarize what we know and have so far.

  1. We can trigger a null pointer dereference, causing SIGSEGV handler to get executed (and thus, DWARF CFA expressions), by sending a password that’s >= 16 bytes.
  2. With carefully constructed password, we can overwrite any 4-byte value to any address in between 0x402eb100  and 0x402eb1ff .

The natural question is, then, what is mapped on that memory address?
With ulimit -s unlimited ,

As we can see above (highlighted), the address range falls into libgcc‘s memory — specifically, it matched portion of its .bss section.

So, what is there in libgcc_s.so.1, you ask.

Precisely, this.
And that’s it.

At this point, we downloaded and opened up libgcc source code to look at where some of these data structures are used, and tried to look for ways to get an EIP control.

So the journey begins.

3. libgcc source code analysis

Note that this step took the longest since we had to actually understand part of the gcc code when it does stack unwinding and handling the exception.

You can download the source for gcc here (gcc-4.8.1, Ubuntu 13.01).

During the competition, we chose each data structure of interest and traced backwards to find out whether by controlling said structure we can influence anything (e.g. function pointer) on callers while handling exceptions to hijack the control flow.

Since we now know which one can be used to control EIP, we will start from there: frame_hdr_cache_head is our target. [It is very well be possible to solve the challenge with different method/structure, but this is the one that we ended up using during the CTF.]

If we locate the place that frame_hdr_cache_head is referenced, we land in the middle of _Unwind_IteratePhdrCallback function in libgcc/unwind-dw2-fde.dip.c.

frame_hdr_cache_head points to the first element of a singly linked list that contains frame_hdr_cache_element(s).
The code iterates through the list and finds the entry for data->pc  in cache. data->pc  is the program counter of the frame we are trying to handle the exception for.

This cache is filled in as the program discovers exception handler frames (eh_frame).

The following is the struct definition for frame_hdr_cache_element:

So, if we control where frame_hdr_cache_head points to, we can also construct/control the elements inside. Before we dive into what happens when we find an element in the cache and ‘goto found‘, let’s step back for a minute and see if we can even get to here and what that allows us to do.

The function we just looked at (_Unwind_IteratePhdrCallback) is called from _Unwind_Find_FDE in unwind-dw2-fde-dip.c.
Then, _Unwind_Find_FDE function is called from uw_frame_state_for function in unwind-dw2.c.
uw_frame_state_for function is called from _Unwind_RaiseException function in unwind.inc, which provides an interface to raise an exception given an exception object.

Where does _Unwind_RaiseException get called, then?
It gets called by __cxa_throw, and if you remember, our SIGSEGV_handler invokes this function to raise an exception.

Alright. We now have confirmed that we can get to that code by causing the binary to throw an exception and letting libgcc unwinds/handles the exception.

But is there anything interesting in this code path such that we can give us EIP control? Yes.

Let’s review _Unwind_RaiseException a little bit:

Notice the highlighted lines. What do you see?

A function pointer getting called! And we *may* be able to control fs.personality .
Let’s find out!

Remember that the struct pointer that we are interested in tracing is fs (aka 2nd argument).
Wee see here that _Unwind_Find_FDE is used to get fde (which is used to get cie), and extract_cie_info takes cie and fs as its first and third argument, respectively.

So, what happens in extract_cie_info?

Cool.
extract_cie_info parses cie and updates fs->personality . We’ll work out the details later.

Okay, now, we have to look into _Unwind_Find_FDE function to find out what it returns (fde) is:

As we discussed earlier, _Unwind_Find_FDE calls _Unwind_IteratePhdrCallback, which fills the data struct.
Then, it returns data.ret.

Whoa. After that chain of functions, we now came back to where we started – _Unwind_IteratePhdrCallback.
Warning: This is a really long function :p

To show a good idea of the call stack, here’s a diagram:

Fortunately, we do not have to look at all of its details. As we learned earlier, the cache for eh_frame_hdr is looked up and the following is performed in case the entry was found:

Note that data->ret  is set to f on line 415, where f is a FDE pointer found by performing binary search.
Comments from unwind-dw2-fde.h briefly describes FDE & CIE lookup:

Let’s review some of the primitive structs and functions that are used in above code to get a better understanding of what’s going on. We will make references to these as we explain the code later.

And, these are some functions that are used when parsing data:

That was a lot of stuff, but don’t worry about understanding/remembering all of them since we will go over the logic at somewhat high-level.

When an exception is thrown, the PC is looked up to find a correct FDE for the current function.

  1. First, they search the shared library cache linked-list (which we control the head pointer).
  2. Once the entry is found, they get unw_eh_frame_hdr (hdr variable) by adding p_vaddr and load_base. Then, they make sure the version of hdr is 1.
    • hdr also contains the flags for encoding schemes for eh_frame_ptr, fde_count, and table.
    • Encoding flag is defined in unwind-pe.h, but important ones are: DW_EH_PE_pcrel (0×10, pc-relative), DW_EH_PE_absptr (0×00, absolute),  DW_EH_PE_sdata4 (0x0b, signed 4 byte), DW_EH_PE_udata4 (0×03, unsigned 4 byte).
  3. Parse eh_frame and fde_count
  4. Perform binary search in table for the data->pc  against table[i].initial_loc + data_base , where data_base is hdr.
  5. When found an element in table, set f to table[mid].fde + data_base  (thus, calculating the FDE pointer).
  6. Final check is done by parsing the range to ensure that this FDE record covers data->pc
    ( table[mid].initial_loc + data_base <= data->pc < table[mid].initial_loc + data_base + range )
  7. data->ret  is filled with f.

It’s important to carefully construct a (fake) FDE record since it holds CIE_delta field, which is used to locate the CIE record to be parsed later (for personality function pointer).

Only piece that we haven’t visited yet is extract_cie_info, but we will visit it as we develop an exploit payload :)

4. Exploit development

Finally, we can start writing some evil awesome payload to pwn this binary.

Here’s our plan for the attack:

  1. Overwrite frame_hdr_cache_head (0x402eb118) to point to our stdin buffer (0×40025000 + 0x1c for skipping userid/password/padding).
  2. Construct fake structs:
    • cache_entry (frame_hdr_cache_element)
    • p_eh_frame_hdr (Elf32_Phdr)
    • hdr  (unw_eh_frame_hdr)
    • table (fde_table)
    • fde (dwarf_fde)
    • cie (dwarf_cide)
  3. When creating a fake cie struct, we make the personality function pointer 0x8048E97 , where it does execlp("/bin/sh", "/bin/sh", 0) , and get a shell!!

Note that the some of the fields in structs are relative offsets, so we need to plan where to put things and link them correctly.

4-1. Trigger

Let’s start with a simple payload that would pass the check and trigger the bug.

As we can see in action, this payload triggers the bug and causes infinite SIGSEGV.
We currently chose 0x402eb101 for no particular reason, but we can see that memory is successfully written.

4-2. cache_entry & p_eh_frame_hdr construction

Now, we overwrite frame_hdr_cache_head to point to our stdin buffer.

We are going to start building fake structs from our buffer + 0x1c.

So what values should we use?
To not worry about the search too much, we are going to set pc_low to 0×0 and pc_high to 0xFFFFFFFF. This basically says that this cache entry should be used for any exception thrown in this range of addresses — so we’ll catch everything. Also, to make it easy to do math, we are going to make load_base to 0. Finally, we have to set p_eh_frame_hdr pointer to the fake Elf32_Phdr struct. We will put this fake phdr struct right after our fake cache_entry struct that we are currently building. The rest of the fields are not really used (for our purpose), so we can put dummy values.

This gives us this:

For p_eh_frame_hdr struct, we only care about p_vaddr which is used to calculate hdr (unw_eh_frame_hdr).

Let’s see in action.

So, this payload basically lets us to execute goto found;  code (unwind-dw2-fde-dip.c:225) since the data->pc  will be in between pc_low and pc_high.

Then, on line 315, hdr is calculated by adding p_eh_frame_hdr->p_vaddr and load_base, thus pointing 0×40025054.
Time to build a fake hdr struct!

4-3. hdr & table construction

Starting at +0×54 from our buffer comes the hdr struct.
It’s a 4 byte struct and we fill in reasonable values here, according to the encoding scheme mentioned above.

Then, as we saw earlier, eh_frame is read. Since the value is supposedly encoded with  (DW_EH_PE_pcrel | DW_EH_PE_sdata4) , this value in our data should be an offset from where the hdr is. However, the value of eh_frame isn’t really related to what we do, so we can put any value (read_encoded_value_with_base actually does the calculation given the base to correctly compute eh_frame’s value).

Ok, next check is the following:

We have picked the values for encoding schems such that we satisfy both conditions.
Then, fde_count is read.
Since we do not want to create more than one set of fake structs (to be searched with binary search later), we will force this to be 1.

So with this data appended, we so far have this as our payload:

Then, the table comes next. fde_table struct has two fields: initial_loc and fde.

As mentioned earlier, in order for the search to succeed, we need to satisfy  table[mid].initial_loc + data_base <= data->pc < table[mid].initial_loc + data_base + range .

Note that data_base is pointing at hdr (0×40025054). So we can set initial_loc to 0xBFFDAFAC such that initial_loc + data_base == 0x40025054 + 0xBFFDAFAC  == 0x0 .

Also, the fde field is actually an (signed) offset from hdr — due to (DW_EH_PE_datarel | DW_EH_PE_sdata4) encoding. So, we set it to 0×14 to indicate that our fake dwarf_fde struct will be located at 0×40025068.

Fake hdr and table construction is done, and we now have this:

The current payload, when fed to the program, will result in a crash since it will read an invalid value for the range.
To make data->pc < initial_loc + data_base + range  true, we need to construct a fake dwarf_fde now.

4-4. fde & cie construction

As a final step, we are going to construct fde and cie records in our payload.

dwarf_fde struct has length, CIE_delta, and pc_begin fields (followed by fde_augmentation length, which should be 0).

We are going to make the length 0x1C, and CIE_delta to 0xFFFFFFE4 (such that &CIE_delta - CIE_delta == 0x40025088 – this will be explained later). We will set pc_begin to 0×0 (doesn’t really matter what we put here).

What comes after pc_begin is the range. To explain a little bit, on line 412 in unwind-dw2-fde-dip.c, range is read from f->pc_begin[f_enc_size] where f_enc_size is 4, making the 4 byte right after pc_begin be the range. Since we made the init_loc to be 0×0, we will make the range to be 0xFFFFFFFF. Then, we pad the last few bytes (so, technically we can fix the length, but that’s what we used during the competition).

This yields our payload to be:

We are almost there!!!

Above payload will result in data->ret  to contain a pointer to our FDE struct and return to _Unwind_Find_FDE.

In _Unwind_Find_FDE, nothing interesting happens, and the same (a pointer to our fake FDE struct) is returned.

We are now back to uw_frame_state_for function (line 1180 in unwind-dw2.c). Since fde is not null, extract_cie_info is called with the cie pointer that is based on our fde.

Looking at the get_cie function, we can see why we put 0xFFFFFFE4 for CIE_delta value in our FDE struct. With our setup, get_cie will return the CIE struct’s address, which will be right after our fake FDE struct (aka 0×40025088).

Now, we have 1 final function that we need to understand: extract_cie_info.

This function is mostly parsing stuff and filling in the _Unwind_Frame_State data based on the CIE record.

dwarf_cie struct has length, CIE_id, version, and augmentation — and depending on augmentation content, more data follows.

Here’s the values we set for our fake CIE struct:

Data that follows after augmentation string (code_alignment, data_alignment, return_addr_col) are read in first.
We chose these values just because we saw these in normal CIE struct, but it shouldn’t matter what the values are.

Then, the rest of the data is parsed as augmentation contents (aka ‘zPLR’).

  1. If the first byte is ‘z’, it sets fs->saw_z flag  and note that the length of the extra augmentation data (which follows the length itself) is 0×07.
  2. ‘P’ indicates a personality routine  is specified in CIE (extra) augmentation, and basically read the personality_ptr value (4-byte) based on the personality_enc encoding scheme — which we set as 0×0 to make it absptr type.
  3. ‘L’ indicates a byte showing how the LSDA pointer is encoded. No idea what that is, but it’s not relevant — we put 0×0.
  4. ‘R’ indicates a byte indicating how FDE addresses are encoded. We put some sane value that we saw earlier, but shouldn’t matter either.

Alright, now with some padding bytes to make the total length 0x1c, we are set.

Thus far, we have built the following payload:

And corresponding output when we run this payload against the binary:

YAY!!! WE HAVE EIP CONTROL!!!!111!!11!

Ok, now on to the final and easiest step: getting a shell.

4-5. Give me a shell

Remember (from a while ago…) that there was code that does execlp("/bin/sh", "/bin/sh", 0) ?
For those who don’t remember, it’s located at  0x8048E97 .

All we have to do at this point is to replace 0×41424344 (personality routine pointer) to 0x8048e97.

AND

Voila! We have our shell (and the flag, of course!)

 

5. Closing

I hope you enjoyed reading this write-up. (Although I suspect not.. due to its obscene length)

I apologize that this ended up being a LOT longer than I anticipated when I started writing, but I think it contains quite a bit of details that people can follow and reproduce the result.

Try it while their server is up!! Otherwise you will have to patch the binary such that the addresses work out.

Thank you for reading, and feel free to leave comments if you have any questions or suggestions.

 

Write-up by Cai (Brian Pak) [https://www.bpak.org]

 

GiTS 2014: gitsmsg

tl;drgitsmsg is a messaging server. A heap overflow led to arbitrary read / write and eventual code exec
after circumventing RELRO.

Read more

GiTS 2014: fuzzy

tl;dr – fuzzy is a “super secure parsing engine”, that includes a histogram function. The histogram ascii text uses a buffer on the stack, but will increment
buckets past the end of the buffer if non ascii text is provided, allowing us to
rop.

Read more

GiTS 2014: Gitzino

Gitzino was the 400-point crypto problem for Ghost in the Shellcode 2014. It looked like a standard
“predict-the-RNG” problem: there’s a PRNG, a card game, and hopefully the output it gives you
provides enough data about the internal state of the PRNG to predict the future and win the game
repeatedly.

Read more

Exploiting a Go Binary

Introduction

Earlier this year, tylerni7 showed us a proof of concept for a 32 bit Go exploit using this issue. geohot and I had a wager over who could get the first remote code execution on play.golang.org: he won, but just barely ;-). Props also to ricky for helping to find the underlying cause/writing the patch. Here is a summary of how we did it.

Note: play.golang.org is properly sandboxed, so code execution there does not actually let you do anything. Had this been a more serious bug that could actually be used for anything malicious, we would have reported it and not used it as a CTF problem.

This post is cross posted on my personal blog, original post there.

Read more

PHDay quals bin300

The recent Positive Hack Days qualifier round had a lot of fun problems. Binary 300 was the only problem that was solved in the competition but not solved by PPP. It was also a very nice crypto problem which was a lot of fun. We ended up having a brute forcer finish the challenge a couple hours too late, which got me interested in seeing how fast a brute forcer could go, if we had more time to write it.

Problem overview

So to start off, we are given a compiled python file. Decompiling the python bytecode with your favorite tool, we get the following code:

 

Anyway, we can see that this is a pretty straightforward problem, the user input is converted to hex and prepended with its length, and then repeated a number of times to give is a 512 bit exponent. We then raise g to this power modulo p, and compare it to our target value, x. If they match, we will get our key!

Solution

However, this straightforward problem looks an awful lot like the standard discrete logarithm problem. Unfortunately, discrete log is pretty hard to do quickly, leaving us out of luck. A brute force solution would require that we try on the order of  260 modular exponentiations, which would be far more than feasible in the time allotted. Luckily, this problem lends itself very well to a meet-in-the-middle attack, which is a special time-memory tradeoff technique.

Rather than performing all 260 modular exponentiations, what we can do is to break our potential exponents in half, let’s call them eli and eri for exponent left and exponent right. We can then generate two tables. In the first, we will raise g to each of eli, and in the second, we divide our target value x by g raised to each of our eri values. Then we will search through our tables to find a value in common. That is, we wish to find an m such that g raised to the eli is equal to x divided by g raised to the eri, which means that the concatenation of eli and erwill be the full exponent.

From a theoretical standpoint, that’s all there is to this problem. We’ve reduced 60 bits of security to about 30, which is great. However, we want to actually get a key out of this and fast. It turns out that modular exponentiation, especially to 512 bit, is quite slow, and on a single good computer, a naive solution to this problem will take many hours (see this writeup). Sometimes in a CTF this is fine, but if we have the chance, why not do things a little better ;)

Optimizations

To make this go fast, we’re going to code it up on GPUs. Most brute force problems are easily parallelized, I happen to have a machine with a few reasonable GPUs, and I’ve been meaning to mess around with OpenCL more, so this seems like a good opportunity. The downsides of writing GPU code is that there are no libraries (that I know of) for arbitrary, or even 512 bit arithmetic already implemented. Luckily this isn’t too hard to do.

Now, hopefully everyone is aware of the standard algorithms for modular exponentiation. The repeated squaring algorithm is the most popular, and works pretty well in most cases. The basic idea is to use the binary representation of our exponent, and to repeatedly square g to fill in each of the spots with 1s. This results in only a logarithmic number of multiplies of 512 bit numbers and modulo p steps. Unfortunately, modulo p amounts to 512 bit division, which is quite slow. We can replace this by using Montgomery Reduction (this was actually a new technique to me, and was pointed out by Reinhart, one of the great guys from Eindbazen).

Montgomery Reduction is great, because it is a bit easier to program, and it’s also a bit faster. Our OpenCL implementation probably provides very little actual benefit compared to just taking our answers modulo p, but if we were using other platforms that natively supported a few more of our operations, then it would be far more beneficial. It also still allows us to use repeated squaring, which we know is very fast.

The next speedup comes from precomputing each of the possible g raised to powers of 2. By providing a lookup table for these, we can avoid quite a few multiplications that we would otherwise end up repeating every time we calculate an exponentiation. Again, this also not only helps our code performance, but also makes things a bit easier to program.

At this point, a single GPU will get around 100 thousand modular exponentiations per second, which seems great. However, once we start using larger numbers (or more precisely, numbers with higher hamming weights), this rate starts dropping pretty fast. For full, 512 bit exponents, our rates are closer to 10 thousand modular exponentiations, which is a bit too slow for our tastes. So what can we do to fix this?

Again, we can use precomputation to fix this issue. We can break up our exponent into just four or five different pieces; one for each character in the half key we are using. For example, if we are calculating the key half for “abcde”, our exponent is

This has a hamming weight of 126, which means a naive repeated squaring solution would require 126 multiplications to calculate the exponent.

However, we can easily break this value apart. For example, when testing 9 character strings, we know that the key will always have the bits set

Similarly, for a character with hex value XX, the exponent will have a mask of bits set

Because these are disjoint masks for each character that are later added together, this means we can precalculate g raised to each of these “masks”, and then multiply them together separately.

This mask generation requires precomputation of 4 or 5 times the number of characters in our characterset, 62. Now, using this precomputation, we go from having 126 modular multiplications to only 5! This brings our exponentiation speed back to the range of 100 thousand per second per GPU.

At this point, our GPU machine will give us our two tables (our left key half and right key half) in about 7 minutes! The first thing to do is to not store all 512 bits of the result. With around 100 million entires in 2 tables, that amounts to around 10 gigabytes of data. Luckily modular exponentiation is a good hash function, which means that any subset of bits we look at will have a mostly uniform distribution. Doing the math, we see that we need about 60 bits of data to avoid accidental collisions from the birthday bound, so we’ll just use 64 bits. This brings us back down to around 1 gigabyte worth of tables.

Now we just need to find the mid-point. Obviously the simple way to find this point is to try all combinations, which is O(n2). With a hundred million table entries, this seems like a bad idea.

Sorting both of these tables can be done in O(n log n), and then we can simply do a simultaneous linear sweep to find a point in common. Sadly, the machine I happen to have with GPUs only has 2GB of RAM. What we do is simply sort a portion of the list, and do a parallel binary search with our GPUs to search that section for the midpoint. This keeps our RAM usage low, and is still quite fast.

This takes another 7 minutes or so on our machine, and gives us our discrete log key.

So, our discrete log key is kA0xSmk39, which we can confirm pretty easily. And taking the md5sum of this presumably would give us the key to score points during the CTF.

Feel free to take a look at the python driver and OpenCL code, available here, just be warned that it is still hacked together and buggy!

Pai Mei on Mac OSX 10.8

tl;dr

Pai Mei is an open source windows reverse engineering framework. At one point, it was ported to Mac OSX but the project is not very actively maintained and the current instructions are quite lacking. This post hopes to offer some guidance and reduce some of the frustration involved in installing Pai Mei on Mac OSX.

Read more

CSAW Quals 2012 – exp500

Overview

challenge1 is an Linux x86 binary that has a buffer overflow. Using information disclosed from another problem, we can use libc gadget to jump to our shellcode.

Read more